Voice Separation and Recognition Using Machine Learning and Deep Learning: A Review Paper

Authors

  • Zaineb H. Ibrahemm, Department of Computer Science, College of Science, University of Baghdad, Baghdad, Iraq
  • Ammar I. Shihab, Department of Computer Science, College of Science, University of Baghdad, Baghdad, Iraq

DOI:

https://doi.org/10.29304/jqcm.2023.15.3.1262

Keywords:

Voice Isolation, Deep Neural Networks, Speech Recognition, Speaker Identification, Frequency Domain, Time Domain

Abstract

Voice isolation, a prominent research area in speech processing, has attracted considerable attention due to its potential applications in numerous domains. Deep neural networks (DNNs) have emerged as a powerful tool for addressing the challenges of voice isolation. This paper presents a comprehensive study of the use of DNNs for voice isolation, focusing on speech recognition and speaker identification tasks. The surveyed methods use frequency-domain and time-domain techniques to improve the separation of target utterances from background noise. Experimental results demonstrate the efficacy of these methods, showing substantial improvements in voice isolation accuracy and robustness. The study's findings contribute to the growing body of research on voice isolation techniques and provide valuable insights into the application of DNNs to speech processing tasks.
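As a concrete illustration of the frequency-domain masking approach discussed above, the sketch below estimates a time-frequency mask with a small recurrent DNN and applies it to the mixture's short-time Fourier transform (STFT) to recover the target voice. This is a minimal illustrative sketch, not the authors' implementation; the MaskNet architecture, the STFT settings (N_FFT, HOP), and the 16 kHz example input are all assumptions.

    # Minimal sketch (assumptions noted above): frequency-domain voice
    # isolation via a DNN-estimated time-frequency mask.
    import torch
    import torch.nn as nn

    N_FFT, HOP = 512, 128           # assumed STFT settings
    N_FREQ = N_FFT // 2 + 1         # frequency bins per frame

    class MaskNet(nn.Module):
        """Predicts a [0, 1] mask over magnitude-spectrogram frames."""
        def __init__(self, hidden=256):
            super().__init__()
            self.rnn = nn.LSTM(N_FREQ, hidden, num_layers=2, batch_first=True)
            self.out = nn.Sequential(nn.Linear(hidden, N_FREQ), nn.Sigmoid())

        def forward(self, mag):                # mag: (batch, frames, freq)
            h, _ = self.rnn(mag)
            return self.out(h)                 # mask: (batch, frames, freq)

    def separate(mixture, model):
        """Mask the mixture STFT and reconstruct the target waveform."""
        window = torch.hann_window(N_FFT)
        spec = torch.stft(mixture, N_FFT, HOP, window=window,
                          return_complex=True)   # (batch, freq, frames)
        mag = spec.abs().transpose(1, 2)         # (batch, frames, freq)
        mask = model(mag).transpose(1, 2)        # back to (batch, freq, frames)
        return torch.istft(spec * mask, N_FFT, HOP, window=window,
                           length=mixture.shape[-1])

    model = MaskNet()               # untrained here; training would minimize a
                                    # reconstruction loss against clean targets
    mix = torch.randn(1, 16000)     # 1 s of audio at an assumed 16 kHz rate
    est = separate(mix, model)
    print(est.shape)                # torch.Size([1, 16000])

A time-domain alternative, as in TasNet-style models, would replace the fixed STFT with a learned encoder/decoder pair and apply the mask in that learned latent space instead.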

Published

2023-09-30

How to Cite

Ibrahemm, Z. H., & Shihab, A. I. (2023). Voice separation and recognition using machine learning and deep learning: A review paper. Journal of Al-Qadisiyah for Computer Science and Mathematics, 15(3), Comp Page 11–34. https://doi.org/10.29304/jqcm.2023.15.3.1262

Issue

Vol. 15 No. 3 (2023)

Section

Computer Articles