A Review of Transformer-Based Approaches for Image Captioning
In: Applied Sciences, Vol 13, Iss 11103 (2023), p 11103
Online
academicJournal
Visual understanding is a research area that bridges the gap between computer vision and natural language processing. Image captioning is a visual understanding task in which natural language descriptions of images are automatically generated using vision-language models. The transformer architecture was initially developed in the context of natural language processing and quickly found application in the domain of computer vision. Its recent application to the task of image captioning has resulted in markedly improved performance. In this paper, we briefly look at the transformer architecture and its genesis in attention mechanisms. We more extensively review a number of transformer-based image captioning models, including those employing vision-language pre-training, which has resulted in several state-of-the-art models. We give a brief presentation of the commonly used datasets for image captioning and also carry out an analysis and comparison of the transformer-based captioning models. We conclude by giving some insights into challenges as well as future directions for research in this area.
Title: | A Review of Transformer-Based Approaches for Image Captioning |
---|---|
Author / Contributor: | Ondeng, Oscar ; Ouma, Heywood ; Akuon, Peter |
Journal: | Applied Sciences, Vol 13, Iss 11103 (2023), p 11103 |
Publication: | MDPI AG, 2023 |
Media type: | academicJournal |
ISSN: | 2076-3417 (print) |
DOI: | 10.3390/app131911103 |