A Survey on Image-text Multimodal Models. (arXiv:2309.15857v1 [cs.CL])

Amidst the evolving landscape of artificial intelligence, the convergence of
visual and textual information has surfaced as a crucial frontier, leading to
the advent of image-text multimodal models. This paper provides a comprehensive
review of the evolution and current state of image-text multimodal models,
exploring their application value, challenges, and potential research
trajectories. First, we revisit the basic concepts and developmental
milestones of these models and introduce a novel classification that divides
their evolution into three distinct phases, based on their time of introduction
and subsequent impact on the field. Second, based on the tasks’
significance and prevalence in the academic literature, we group
the tasks associated with image-text multimodal models into
five major types and describe the recent progress and key technologies within
each category. Despite the remarkable accomplishments of these models, numerous
challenges and issues persist. This paper examines the inherent challenges
and limitations of image-text multimodal models, with the aim of encouraging
exploration of prospective research directions. Our objective is to offer an exhaustive
overview of the present research landscape of image-text multimodal models and
to serve as a valuable reference for future scholarly endeavors. We extend an
invitation to the broader community to collaborate in enhancing the image-text
multimodal model community, accessible at:

Source: https://arxiv.org/abs/2309.15857


