Improved Affective Computing via CNN and Bat Algorithm Optimization: A Case Study on IEMOCAP and TESS

Authors

  • Sawsan J. Muhammed, Electrical Engineering Technical College, Middle Technical University, Baghdad, Iraq
  • Mohamed I. Shujaa, Electrical Engineering Technical College, Middle Technical University, Baghdad, Iraq
  • Ahmed B. A. Alwahhab, Electrical Engineering Technical College, Middle Technical University, Baghdad, Iraq

DOI:

https://doi.org/10.29304/jqcsm.2025.17.22199

Keywords:

Speech Emotion Recognition, Convolutional Neural Network, Bat Algorithm, Feature Extraction, Deep Learning, MFCC

Abstract

This study presents an enhanced Speech Emotion Recognition (SER) model that integrates a Convolutional Neural Network (CNN) with the Bat Algorithm (BA), a nature-inspired metaheuristic optimization technique. The objective is to improve the accuracy and generalizability of emotion classification from speech by optimizing the neural network architecture. The model takes handcrafted acoustic features as input: pitch, energy, zero-crossing rate (ZCR), and Mel-Frequency Cepstral Coefficients (MFCCs). These features are preprocessed and normalized before being fed into a deep neural network whose hyperparameters are tuned by the Bat Algorithm. The final model applies a Gamma Classifier (GC) with Error-Correcting Output Codes (ECOC) to ensure robust classification. Experiments on benchmark datasets, including IEMOCAP, TESS, and EMO-DB (the Berlin Database of Emotional Speech), show improved performance, with validation accuracy reaching 98.44%. This hybrid design outperforms conventional approaches and offers a reliable basis for practical emotion recognition systems.
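
To make the pipeline in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation: the feature extraction assumes the librosa library, the normalization is a plain z-score, and the optimization loop follows the standard Bat Algorithm update rules of Yang (2010); all function and parameter names here are illustrative.

import numpy as np
import librosa

def extract_features(path, sr=16000, n_mfcc=13):
    # Handcrafted features named in the abstract: MFCCs, ZCR, energy, pitch.
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
    zcr = librosa.feature.zero_crossing_rate(y).mean()
    energy = librosa.feature.rms(y=y).mean()
    pitch = np.nanmean(librosa.yin(y, fmin=50, fmax=400, sr=sr))
    return np.hstack([mfcc, zcr, energy, pitch])

def normalize(X):
    # Z-score normalization per feature dimension (one assumed choice).
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

def bat_search(fitness, lb, ub, n_bats=10, n_iter=30,
               f_min=0.0, f_max=2.0, alpha=0.9, gamma=0.9):
    # Bat Algorithm (Yang, 2010) minimizing `fitness` over the box [lb, ub].
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    dim = lb.size
    rng = np.random.default_rng(0)
    x = rng.uniform(lb, ub, (n_bats, dim))        # bat positions
    v = np.zeros((n_bats, dim))                   # bat velocities
    A = np.ones(n_bats)                           # loudness
    r = np.full(n_bats, 0.5)                      # pulse emission rate
    fit = np.array([fitness(b) for b in x])
    best = x[fit.argmin()].copy()
    for t in range(n_iter):
        for i in range(n_bats):
            freq = f_min + (f_max - f_min) * rng.random()
            v[i] += (x[i] - best) * freq
            cand = np.clip(x[i] + v[i], lb, ub)
            if rng.random() > r[i]:               # local walk around the best bat
                cand = np.clip(best + 0.01 * A.mean() * rng.standard_normal(dim),
                               lb, ub)
            f_new = fitness(cand)
            if f_new <= fit[i] and rng.random() < A[i]:
                x[i], fit[i] = cand, f_new
                A[i] *= alpha                     # loudness decays on acceptance
                r[i] = 0.5 * (1.0 - np.exp(-gamma * t))  # pulse rate grows
            if f_new <= fit.min():
                best = cand.copy()
    return best

In practice, fitness would decode a bat's position into CNN hyperparameters (for example, learning rate and filter count), train the network briefly, and return the validation error; the ECOC and Gamma Classifier stage described above would then replace a plain softmax output layer.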

References

L. Alzubaidi, J. Zhang, A. J. Humaidi, et al., "Review of Deep Learning: Concepts, CNN Architectures, Challenges, Applications, Future Directions," Journal of Big Data, vol. 8, no. 1, art. no. 53, 2021. DOI: 10.1186/s40537-021-00436-x

A. Q. Al-Dujaili, A. J. Humaidi, Z. G. Hadi, and A. R. Ajel, "Comparison Between Convolutional Neural Network (CNN) and SVM in Skin Cancer Images Recognition," Journal of Techniques, vol. 3, no. 4, pp. 15–22, 2021. DOI: 10.51173/jt.v3i4.390

L. Alzubaidi, Y. Duan, A. Al-Dujaili, et al., "Deepening into the Suitability of Using Pre-trained Models of ImageNet vs. a Lightweight CNN in Medical Imaging," PeerJ Computer Science, vol. 7, art. no. e715, 2021. DOI: 10.7717/peerj-cs.715

A. A. Abdelhamid, E. M. El-Kenawy, B. Alotaibi, G. M. Amer, M. Y. Abdelkader, A. Ibrahim, and M. M. Eid, "Robust speech emotion recognition using CNN+LSTM based on stochastic fractal search optimization algorithm," IEEE Access, vol. 10, pp. 49265–49283, 2022. DOI: 10.1109/ACCESS.2022.3172954

F. Daneshfar, S. J. Kabudian, and A. Neekabadi, "Speech emotion recognition using hybrid spectral–prosodic features of speech signal/glottal waveform, metaheuristic-based dimensionality reduction, and Gaussian elliptical basis function network classifier," Applied Acoustics, vol. 166, p. 107360, 2020. DOI: 10.1016/j.apacoust.2020.107360

P. Rajasekhar and M. Kamaraju, "Emotion speech recognition based on adaptive fractional Deep Belief Network and mean-updated PSO-WOA optimization," Data Technologies and Applications, vol. 54, no. 3, pp. 297–322, 2020. DOI: 10.1108/DTA-07-2019-0120

Y. Zhao and X. Shu, "Speech emotion analysis using convolutional neural network (CNN) and gamma classifier-based error correcting output codes (ECOC)," Scientific Reports, vol. 13, art. no. 20398, 2023. DOI: 10.1038/s41598-023-47118-4

A. Verma, P. Bajaj, and S. Jain, "Hybrid deep learning with optimal feature selection for speech emotion recognition," Knowledge-Based Systems, vol. 257, art. no. 108659, 2022. DOI: 10.1016/j.knosys.2022.108659

P. Yenigalla et al., "Speech Emotion Recognition Using Spectrogram & Phoneme Embedding," in Proc. Interspeech, 2018, pp. 3688–3692. DOI: 10.21437/Interspeech.2018-1811

R. V. Sharan, C. Mascolo, and B. W. Schuller, "Emotion Recognition from Speech Signals by Mel-Spectrogram and a CNN-RNN," in Proc. IEEE EMBC, Jul. 2024, pp. 1–4. DOI: 10.1109/EMBC53108.2024.10782952

Y. Zhang, J. Du, Z. Wang, J. Zhang, and Y. Tu, "Attention Based Fully Convolutional Network for Speech Emotion Recognition," arXiv:1806.01506, 2018. DOI: 10.48550/arXiv.1806.01506

Mustaqeem and S. Kwon, "CLSTM: Deep feature-based speech emotion recognition using the hierarchical ConvLSTM network," Mathematics, vol. 8, no. 12, art. no. 2133, 2020. DOI: 10.3390/math8122133

N. Penumajji, "Deep Learning for Speech Emotion Recognition: A CNN Approach Utilizing Mel Spectrograms," arXiv:2503.19677, 2025. DOI: 10.48550/arXiv.2503.19677

S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980. DOI: 10.1109/TASSP.1980.1163420

V. Tiwari, "MFCC and its applications in speaker recognition," International Journal on Emerging Technologies, vol. 1, no. 1, pp. 19–22, 2010.

I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.

X.-S. Yang, "A new metaheuristic bat-inspired algorithm," in Nature Inspired Cooperative Strategies for Optimization (NICSO 2010), Springer, 2010, pp. 65–74. DOI: 10.1007/978-3-642-12433-4_6

C. Busso et al., "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008.

F. Burkhardt et al., "A database of German emotional speech," in Proc. Interspeech, 2005, pp. 1517–1520.

K. Dupuis and M. K. Pichora-Fuller, Toronto Emotional Speech Set (TESS). University of Toronto, 2010.

C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. 27th Int. Conf. on Machine Learning (ICML), 2010, pp. 807–814.

K. M. Krishna and M. S. Jadon, "A survey of evaluation metrics used for speech emotion recognition systems," IEEE Access, vol. 9, pp. 50784–50795, 2021. DOI: 10.1109/ACCESS.2021.3068591

S. Young et al., The HTK Book. Cambridge University Engineering Department, 2006.

L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Pearson Education, 2007.

D. Jurafsky and J. H. Martin, Speech and Language Processing, 3rd ed. draft, 2023.

K. Tan and D. Wang, "A convolutional recurrent neural network for real-time speech enhancement," in Proc. Interspeech, 2018, pp. 3229–3233.

Mustaqeem and S. Kwon, "Optimal feature selection based speech emotion recognition using two-stream deep convolutional neural network," International Journal of Intelligent Systems, vol. 36, pp. 1–20, 2021. DOI: 10.1002/int.22505

Published

2025-06-30

How to Cite

Muhammed, S. J., Shujaa, M. I., & Alwahhab, A. B. A. (2025). Improved Affective Computing via CNN and Bat Algorithm Optimization: A Case Study on IEMOCAP and TESS. Journal of Al-Qadisiyah for Computer Science and Mathematics, 17(2), Comp. 241–260. https://doi.org/10.29304/jqcsm.2025.17.22199

Issue

Section

Computer Articles