Improved Affective Computing via CNN and Bat Algorithm Optimization: A Case Study on IEMOCAP and TESS
DOI: https://doi.org/10.29304/jqcsm.2025.17.22199

Keywords: Speech Emotion Recognition, Convolutional Neural Network, Bat Algorithm, Feature Extraction, Deep Learning, MFCC

Abstract
This study presents an enhanced Speech Emotion Recognition (SER) model that integrates Convolutional Neural Networks (CNNs) with the Bat Algorithm (BA), a nature-inspired metaheuristic optimization technique. The objective is to improve the accuracy and generalizability of emotion classification from speech by optimizing the neural network architecture. The model takes handcrafted acoustic features as inputs: pitch, energy, zero-crossing rate (ZCR), and Mel-Frequency Cepstral Coefficients (MFCCs). These features are preprocessed and normalized before being fed into a deep neural network whose hyperparameters are tuned by the Bat Algorithm. The final model applies a Gamma Classifier (GC) with Error Correcting Output Codes (ECOC) to ensure robust classification. Experiments on benchmark datasets such as IEMOCAP and EMO-DB (the Berlin Database of Emotional Speech) show improved performance, with validation accuracy reaching as high as 98.44%. This hybrid design outperforms traditional methods and offers a reliable approach for practical emotion recognition systems.
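Two of the handcrafted features named above, zero-crossing rate and short-time energy, can be computed directly from framed audio, followed by the normalization step the abstract mentions. The NumPy sketch below illustrates this preprocessing; it is not the authors' implementation, and the frame length and hop (25 ms / 10 ms at a 16 kHz sampling rate) are illustrative assumptions. Pitch and MFCC extraction require additional machinery (autocorrelation, mel filterbanks and a DCT) and are omitted here.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def zero_crossing_rate(frames):
    """Fraction of adjacent samples whose sign changes, per frame."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

def short_time_energy(frames):
    """Mean squared amplitude per frame."""
    return np.mean(frames ** 2, axis=1)

def zscore(feature):
    """Z-score normalization, applied before feeding features to the network."""
    return (feature - feature.mean()) / (feature.std() + 1e-8)
```

In a full pipeline these per-frame values would be stacked with the pitch contour and MFCC matrix to form the input representation for the CNN.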
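The Bat Algorithm used for hyperparameter tuning follows Yang's (2010) formulation: each bat carries a position, velocity, frequency, loudness, and pulse-emission rate, and the swarm alternates frequency-driven global moves with local random walks around the current best solution. The sketch below is a minimal version minimizing a generic continuous objective; in the paper's setting the objective would instead score an encoded CNN configuration (e.g., by validation loss), and all parameter values here are illustrative assumptions.

```python
import numpy as np

def bat_algorithm(objective, dim, bounds, n_bats=20, n_iter=200,
                  f_min=0.0, f_max=2.0, loudness=0.9, pulse_rate=0.5,
                  alpha=0.97, gamma=0.1, seed=0):
    """Minimize `objective` over a box, following Yang's bat-inspired scheme."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_bats, dim))   # bat positions
    v = np.zeros((n_bats, dim))              # bat velocities
    A = np.full(n_bats, loudness)            # per-bat loudness
    r = np.full(n_bats, pulse_rate)          # per-bat pulse emission rate
    fit = np.array([objective(xi) for xi in x])
    best, best_fit = x[fit.argmin()].copy(), float(fit.min())

    for t in range(1, n_iter + 1):
        for i in range(n_bats):
            # Frequency-driven global move toward the best-known solution.
            f = f_min + (f_max - f_min) * rng.random()
            v[i] += (x[i] - best) * f
            cand = np.clip(x[i] + v[i], lo, hi)
            if rng.random() > r[i]:
                # Local random walk around the current best, scaled by loudness.
                cand = np.clip(best + 0.01 * A.mean() * rng.standard_normal(dim),
                               lo, hi)
            cand_fit = float(objective(cand))
            if cand_fit <= fit[i] and rng.random() < A[i]:
                # Accept: the bat gets quieter and emits pulses more often.
                x[i], fit[i] = cand, cand_fit
                A[i] *= alpha
                r[i] = pulse_rate * (1 - np.exp(-gamma * t))
            if cand_fit < best_fit:
                best, best_fit = cand.copy(), cand_fit
    return best, best_fit
```

For hyperparameter search, each position vector would be decoded into discrete settings (filter counts, kernel sizes, learning rate, etc.) before evaluation, so one objective call corresponds to one short training run.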
References
L. Alzubaidi, J. Zhang, A. J. Humaidi, A. Al-Dujaili, et al., "Review of deep learning: concepts, CNN architectures, challenges, applications, future directions," Journal of Big Data, vol. 8, no. 1, art. 53, 2021. DOI: 10.1186/s40537-021-00436-x
A. Q. Al-Dujaili, A. J. Humaidi, Z. G. Hadi, and A. R. Ajel, "Comparison Between Convolutional Neural Network (CNN) and SVM in Skin Cancer Images Recognition," Journal of Techniques, vol. 3, no. 4, pp. 15–22, 2021. DOI: 10.51173/jt.v3i4.390
L. Alzubaidi, Y. Duan, A. Al-Dujaili, et al., "Deepening into the Suitability of Using Pre-trained Models of ImageNet vs. a Lightweight CNN in Medical Imaging," PeerJ Computer Science, vol. 7, art. e715, 2021. DOI: 10.7717/peerj-cs.715
A. A. Abdelhamid, E. M. El-Kenawy, B. Alotaibi, G. M. Amer, M. Y. Abdelkader, A. Ibrahim, and M. M. Eid, "Robust speech emotion recognition using CNN+LSTM based on stochastic fractal search optimization algorithm," IEEE Access, vol. 10, pp. 49265–49283, 2022. DOI: 10.1109/ACCESS.2022.3172954
F. Daneshfar, S. J. Kabudian, and A. Neekabadi, "Speech emotion recognition using hybrid spectral–prosodic features of speech signal/glottal waveform, metaheuristic-based dimensionality reduction, and Gaussian elliptical basis function network classifier," Applied Acoustics, vol. 166, p. 107360, 2020. DOI: 10.1016/j.apacoust.2020.107360
P. Rajasekhar and M. Kamaraju, "Emotion speech recognition based on adaptive fractional Deep Belief Network and mean-updated PSO-WOA optimization," Data Technologies & Applications, vol. 54, no. 3, pp. 297–322, 2020. DOI: 10.1108/DTA-07-2019-0120
Y. Zhao and X. Shu, "Speech emotion analysis using convolutional neural network (CNN) and gamma classifier-based error correcting output codes (ECOC)," Scientific Reports, vol. 13, art. 20398, 2023. DOI: 10.1038/s41598-023-47118-4
A. Verma, P. Bajaj, and S. Jain, "Hybrid deep learning with optimal feature selection for speech emotion recognition," Knowledge-Based Systems, vol. 257, art. 108659, 2022. DOI: 10.1016/j.knosys.2022.108659
P. Yenigalla et al., "Speech Emotion Recognition Using Spectrogram & Phoneme Embedding," in Proc. Interspeech, 2018, pp. 3688–3692. DOI: 10.21437/Interspeech.2018-1811
R. V. Sharan, C. Mascolo, and B. W. Schuller, "Emotion Recognition from Speech Signals by Mel-Spectrogram and a CNN-RNN," in Proc. IEEE EMBC, Jul. 2024, pp. 1–4. DOI: 10.1109/EMBC53108.2024.10782952
Y. Zhang, J. Du, et al., "Attention Based Fully Convolutional Network for Speech Emotion Recognition," arXiv:1806.01506, 2018. DOI: 10.48550/arXiv.1806.01506
Mustaqeem and S. Kwon, "CLSTM: Deep feature-based speech emotion recognition using the hierarchical ConvLSTM network," Mathematics, vol. 8, no. 12, art. 2133, 2020. DOI: 10.3390/math8122133
N. Penumajji, "Deep Learning for Speech Emotion Recognition: A CNN Approach Utilizing Mel Spectrograms," arXiv:2503.19677, 2025. DOI: 10.48550/arXiv.2503.19677
S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980. DOI: 10.1109/TASSP.1980.1163420
V. Tiwari, "MFCC and its applications in speaker recognition," International Journal on Emerging Technologies, vol. 1, no. 1, pp. 19–22, 2010.
I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
X.-S. Yang, "A new metaheuristic bat-inspired algorithm," in Nature Inspired Cooperative Strategies for Optimization (NICSO 2010), Springer, 2010, pp. 65–74. DOI: 10.1007/978-3-642-12433-4_6
C. Busso et al., "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008.
F. Burkhardt et al., "A database of German emotional speech," in Proc. Interspeech, 2005, pp. 1517–1520.
K. Dupuis and M. K. Pichora-Fuller, "Toronto Emotional Speech Set (TESS)," University of Toronto, 2010.
C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. 27th International Conference on Machine Learning (ICML), 2010, pp. 807–814.
K. M. Krishna and M. S. Jadon, "A survey of evaluation metrics used for speech emotion recognition systems," IEEE Access, vol. 9, pp. 50784–50795, 2021. DOI: 10.1109/ACCESS.2021.3068591
S. Young et al., The HTK Book. Cambridge University Engineering Department, 2006.
L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Pearson Education, 2007.
D. Jurafsky and J. H. Martin, Speech and Language Processing, 3rd ed. Prentice Hall, 2023.
K. Tan and D. Wang, "A convolutional recurrent neural network for real-time speech enhancement," in Proc. Interspeech, 2018, pp. 3229–3233.
Mustaqeem and S. Kwon, "Optimal feature selection based speech emotion recognition using two-stream deep convolutional neural network," International Journal of Intelligent Systems, vol. 36, pp. 1–20, 2021. DOI: 10.1002/int.22505
License
Copyright (c) 2025 Sawsan J. Muhammed, Mohamed I. Shujaa, Ahmed B. A. Alwahhab

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.