An Improved Image Generation Conditioned on Text Using Stable Diffusion Model
DOI: https://doi.org/10.29304/jqcsm.2024.16.41772

Keywords: Artificial Intelligence, Text-to-image generation, Stable Diffusion model, Realistic Images, Generative model

Abstract
Text-to-image generation is the task of creating images that correspond to textual descriptions. It underpins a wide range of applications and research fields (e.g., photo editing, photo search, art-making, computer-aided design, image reconstruction, captioning, and portrait drawing). With the development of text-to-image generation models, artificial intelligence (AI) has reached a turning point where machines can convert human language into aesthetically pleasing and coherent images, creating new opportunities for creativity and innovation. The Stable Diffusion model is one of this field's most noteworthy developments: it provides a strong framework for producing realistic images that are semantically aligned with the given textual descriptions. Despite their remarkable capabilities, however, conventional text-to-image models have serious shortcomings, particularly their long training times and high computational cost; they are expensive and time-consuming to train because they typically require large amounts of processing power over extended periods. The main goal of this work is to develop an improved Stable Diffusion model that overcomes these shortcomings and produces high-quality images from text. The proposed model substantially reduces training time and processing requirements without sacrificing the quality of the output images. The results show that fine-tuning the Stable Diffusion model yields a considerable improvement in producing images that are closer to the originals: the improved model achieves a lower FID score (212.52) than the base model (251.22).
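The page does not include the authors' code, but the kind of text-to-image pipeline being fine-tuned can be illustrated with a minimal sketch using the Hugging Face diffusers library. The checkpoint name, prompt, and sampling parameters below are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal text-to-image sketch with the Hugging Face diffusers library.
# The checkpoint, prompt, and sampling settings are assumptions for
# illustration only; they are not the paper's configuration.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed base checkpoint
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a purple flower with rounded petals on a thin green stem"
image = pipe(
    prompt,
    num_inference_steps=50,  # denoising steps; fewer steps trade quality for speed
    guidance_scale=7.5,      # classifier-free guidance strength
).images[0]
image.save("generated_flower.png")
```

A fine-tuned variant would be loaded the same way, by pointing `from_pretrained` at the directory holding the fine-tuned weights instead of the base checkpoint.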
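Likewise, an FID comparison such as the one reported above (212.52 vs. 251.22) can in principle be computed with an off-the-shelf implementation such as torchmetrics. The random tensors, sample count, and reduced feature dimension below are placeholders for a quick demonstration, not the paper's evaluation protocol.

```python
# Hedged sketch of an FID comparison using torchmetrics (requires the
# "torchmetrics[image]" extra). The random tensors, sample count, and
# 64-dimensional Inception features are placeholders; standard evaluations
# typically use the 2048-dimensional features and far more images.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=64)  # small feature dim for a quick demo

# uint8 images in (N, 3, H, W); replace with real dataset images and
# images sampled from the base or fine-tuned model.
real_images = torch.randint(0, 256, (128, 3, 299, 299), dtype=torch.uint8)
generated_images = torch.randint(0, 256, (128, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(generated_images, real=False)
print(f"FID: {fid.compute().item():.2f}")  # lower is better, as in the abstract
```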
License
Copyright (c) 2025 Sara Faez Abdylgahni, Aahwan Anwer Abdulmunem
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.