Improving the Reliability and Accuracy of Image Captioning Systems Using Ensemble of FC, Softmax, and LSTM Deep Decoders

Ghadeer Abdulrasool Mohammed; Raidah S. Khudeyer; Maytham Alabbas

doi:10.29304/jqcsm.2026.18.12573

Authors

Ghadeer Abdulrasool Mohammed, University of Basra, College of Computer Science and Information Technology
Raidah S. Khudeyer, University of Basra, College of Computer Science and Information Technology
Maytham Alabbas, University of Basra, College of Computer Science and Information Technology

Keywords:

Image Captioning, Deep Learning, Ensemble Learning, CNN–LSTM Networks

Abstract

This work presents a deep learning system for automatic image description that aims to produce fluent, meaningful, and structurally coherent sentences for input images. The proposed architecture follows an encoder-decoder framework: high-level image features are first extracted by an Inception-v3 deep convolutional network and then passed, as a compressed image representation, to an LSTM-based language decoder that generates the sentence word by word. On top of this basic structure, a voting-based ensemble learning framework is built in which three deep paths, namely a fully connected (FC) network, a Softmax linear model, and a sequence-oriented LSTM decoder, are trained independently, and their output word probability vectors are combined by a maximum voting mechanism. Evaluation is performed on the standard Flickr8k dataset using the BLEU-1 to BLEU-4, METEOR, and ROUGE-L metrics. The best single model, the LSTM, achieves 0.64, 0.39, 0.23, and 0.16 for BLEU-1 to BLEU-4, 0.22 for METEOR, and 0.50 for ROUGE-L. The ensemble model improves these values to 0.74, 0.50, 0.35, and 0.22 for BLEU-1 to BLEU-4, 0.475 for METEOR, and 0.55 for ROUGE-L, with reported relative improvements of 54% and 41% in BLEU-3 and BLEU-4, respectively. A paired t-test shows that the difference in performance between the ensemble and the single models is significant at the 95% confidence level. Compared with existing methods on Flickr8k, the results are competitive and, on some measures, superior.
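The maximum voting step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact rule, so this example assumes one plausible reading: for each vocabulary word, keep the highest probability assigned by any of the three decoders, then emit the word with the largest combined score. The function name `max_vote`, the toy vocabulary, and the probability values are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def max_vote(prob_vectors: Sequence[Sequence[float]]) -> int:
    """Return the vocabulary index chosen by element-wise maximum voting.

    Assumption: "maximum voting" means taking, per word, the highest
    probability across decoders. The paper may define it differently.
    """
    if not prob_vectors:
        raise ValueError("need at least one probability vector")
    vocab_size = len(prob_vectors[0])
    if any(len(p) != vocab_size for p in prob_vectors):
        raise ValueError("all vectors must cover the same vocabulary")
    # For each word, keep the highest probability any decoder assigned to it.
    combined = [max(p[i] for p in prob_vectors) for i in range(vocab_size)]
    # The ensemble emits the word with the largest combined score
    # (ties resolve to the lowest index).
    return max(range(vocab_size), key=combined.__getitem__)


# Toy example: hypothetical next-word distributions from the three paths.
vocab = ["a", "dog", "runs", "<end>"]
fc_probs = [0.10, 0.55, 0.25, 0.10]       # fully connected path
softmax_probs = [0.20, 0.30, 0.40, 0.10]  # Softmax linear path
lstm_probs = [0.05, 0.70, 0.15, 0.10]     # LSTM decoder path

next_word = vocab[max_vote([fc_probs, softmax_probs, lstm_probs])]
print(next_word)
```

In a full captioning loop this selection would be repeated at every decoding step until an end-of-sentence token is chosen. A majority vote over each decoder's top-scoring word is another common interpretation and would replace only the combination line.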

Journal of Al-Qadisiyah for computer science and mathematics (JQCSM)

ISSN 2521-3504 (Online), ISSN 2074-0204 (Print)

It is a scientific journal issued by the College of Computer Science and IT, University of Al-Qadisiyah.