Hybrid Vision-and-Language Fusion: A Threefold Learning Approach for elevating Image Captioning through Adaptive Strategies

Publication: Article, 01 Dec 2025, English. Publisher: Kyushu University. Journal: Evergreen, volume 12, pages 1,840-1,866 (ISSN: 2189-0420, eISSN: 2432-5953)

Authors: Bhandari, Sravya; Kumar, Abhishek; Batta, Priya; Shambhu, Shankar

doi: 10.5109/7402620, 10.5281/zenodo.18373531, 10.5281/zenodo.18373532

Abstract

Image captioning is a significant application area for artificial intelligence techniques. A machine that can interpret an image as humans do demonstrates a higher level of intelligence and image comprehension. This research presents advancements in real-time image collection and labeling systems using a triad of computer vision, natural language processing, and classification. The approach employs three deep learning models to generate human-level natural language descriptors, resulting in a user-friendly system. The model comprises a multimodal pipeline of deep learning architectures that enables the extraction of probabilistic features for each object category. Our model surpasses other image captioning models, achieving a CIDEr score of 37.93% on the common MS-COCO Captioning task test baseline, thereby exhibiting superior syntactical saliency when integrated with advanced object features. Additionally, we observed that incorporating an intermediate step of clustering objects before classification enhances the final model's performance. By implementing these methodologies, we have developed a more capable and accurate model that is proficient in object classification and in generating informative image descriptions. Such capabilities can significantly augment human comprehension and decision-making across various applications, particularly in advancing sustainable cities and communities, fostering quality education through improved accessibility of visual content, and promoting industry, innovation, and infrastructure with cutting-edge AI technologies.
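
The abstract reports that clustering objects before classification improves performance, and the keywords name K-Means and KNN, but the record does not say how the two steps are connected. The following is only a minimal sketch of one plausible arrangement, assuming K-Means partitions the feature space and KNN then votes among the training samples that fall in the query's cluster. The 2-D feature vectors, the labels, and all function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def nearest_cluster(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def knn_predict(query, train_x, train_y, centroids, n_neighbors=3):
    """Majority vote among the nearest training samples in the query's cluster."""
    cluster = nearest_cluster(query, centroids)
    pool = [
        (x, y)
        for x, y in zip(train_x, train_y)
        if nearest_cluster(x, centroids) == cluster
    ]
    if not pool:  # fall back to the full training set if the cluster is empty
        pool = list(zip(train_x, train_y))
    pool.sort(key=lambda xy: math.dist(query, xy[0]))
    votes = Counter(y for _, y in pool[:n_neighbors])
    return votes.most_common(1)[0][0]


# Hypothetical object features (assumed to come from a detector) and labels.
train_x = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.2, 5.1), (0.1, 0.3), (4.9, 5.3)]
train_y = ["cat", "car", "cat", "car", "dog", "bus"]

centroids = kmeans(train_x, k=2)
pred_small = knn_predict((0.1, 0.0), train_x, train_y, centroids)
pred_large = knn_predict((5.1, 5.0), train_x, train_y, centroids)
```

Restricting the vote to one cluster keeps distant, unrelated object categories out of the neighbor pool, which is one way a clustering step could help a later classifier. Whether the authors use this arrangement is not stated in the abstract.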

Published in Evergreen, Volume 12, Issue 04. Citation formats available via DOI link.

Related Organizations

Chandigarh University, India
Amity University, India
Chitkara University, India
Liverpool John Moores University, United Kingdom

Keywords

Deep Learning, POS tagging, KNN classification, K-Means Clustering, YOLO, MS COCO, Multimodality

Impact by BIP!

selected citations: 0
popularity: Average
influence: Average
impulse: Average
