Image Captioning Generation Using Inception V3 and Attention Mechanism
DOI:
https://doi.org/10.29304/jqcm.2023.15.2.1228

Keywords:
Image Captioning Generation, Inception V3, LSTM, Attention Mechanism

Abstract
Image captioning is the process of combining a visual comprehension system with a language model to construct sentences that are meaningful and syntactically accurate descriptions of an image. The goal is to train a deep learning model to learn the correspondence between an image and its textual description. This is a challenging task due to the inherent complexity and subjectivity of language, as well as the visual variability of images, and it draws on both computer vision and natural language processing. In this paper, an end-to-end deep learning-based image captioning system using Inception V3 and Long Short-Term Memory (LSTM) with an attention mechanism is implemented. Extensive experiments were carried out on the benchmark MS COCO dataset, and the results show that the proposed system outperforms several related systems on the widely used evaluation measures, achieving scores of 0.543, 0.87, 0.66, 0.51, and 0.42 for METEOR and BLEU (B1-B4), respectively.
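To make the described pipeline concrete, the listing below is a minimal sketch of the kind of encoder-attention-decoder architecture the abstract outlines: InceptionV3 as the image feature extractor, an additive (Bahdanau-style) attention module, and an LSTM decoder. It is written in TensorFlow/Keras as an assumed framework; the class names, layer sizes (embedding_dim, units, vocab_size), and training details are illustrative assumptions, not code or hyperparameters reported in the paper.

    # Minimal sketch of an InceptionV3 + attention + LSTM captioner (assumed
    # TensorFlow/Keras implementation, not the authors' code).
    import tensorflow as tf

    # Image encoder: InceptionV3 without its classification head. Its final
    # 8x8x2048 feature map is reshaped to 64 spatial locations the decoder
    # can attend over.
    image_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
    feature_extractor = tf.keras.Model(image_model.input, image_model.layers[-1].output)

    class Encoder(tf.keras.Model):
        # Projects the 2048-d InceptionV3 features to embedding_dim before attention.
        def __init__(self, embedding_dim):
            super().__init__()
            self.fc = tf.keras.layers.Dense(embedding_dim, activation='relu')

        def call(self, features):          # features: (batch, 64, 2048)
            return self.fc(features)       # -> (batch, 64, embedding_dim)

    class BahdanauAttention(tf.keras.Model):
        # Additive attention: scores each of the 64 image regions against the
        # decoder's previous hidden state and returns a weighted context vector.
        def __init__(self, units):
            super().__init__()
            self.W1 = tf.keras.layers.Dense(units)
            self.W2 = tf.keras.layers.Dense(units)
            self.V = tf.keras.layers.Dense(1)

        def call(self, features, hidden):
            hidden_with_time = tf.expand_dims(hidden, 1)           # (batch, 1, units)
            score = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
            attention_weights = tf.nn.softmax(score, axis=1)       # (batch, 64, 1)
            context_vector = tf.reduce_sum(attention_weights * features, axis=1)
            return context_vector, attention_weights

    class Decoder(tf.keras.Model):
        # LSTM decoder: at each step it attends over the image features,
        # concatenates the context vector with the current word embedding,
        # and predicts a distribution over the next word.
        def __init__(self, embedding_dim, units, vocab_size):
            super().__init__()
            self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
            self.lstm = tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)
            self.fc1 = tf.keras.layers.Dense(units)
            self.fc2 = tf.keras.layers.Dense(vocab_size)
            self.attention = BahdanauAttention(units)

        def call(self, word_ids, features, hidden, cell):
            context_vector, attention_weights = self.attention(features, hidden)
            x = self.embedding(word_ids)                           # (batch, 1, embedding_dim)
            x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
            output, state_h, state_c = self.lstm(x, initial_state=[hidden, cell])
            logits = self.fc2(self.fc1(tf.reshape(output, (-1, output.shape[2]))))
            return logits, state_h, state_c, attention_weights

In such a setup, captions are generated at inference time word by word, feeding each predicted token back into the decoder until an end-of-sequence token is produced, after which BLEU and METEOR can be computed against the reference captions.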