Universal Captioner: Long-Tail Vision-and-Language Model Training through Content-Style Separation. (arXiv:2111.12727v1 [cs.CV])

While captioning models have obtained compelling results in describing
natural images, they still do not cover the entire long-tail distribution of
real-world concepts. In this paper, we address the task of generating
human-like descriptions with in-the-wild concepts by training on web-scale
automatically collected datasets. To this end, we propose a model which can
exploit noisy image-caption pairs while maintaining the descriptive style of
traditional human-annotated datasets like COCO. Our model separates content
from style through the use of keywords and stylistic tokens. It is trained
with a single prompt language modeling objective and is simpler than other
recent proposals. Experimentally, our model consistently outperforms existing
methods in terms of caption quality and the ability to describe long-tail
concepts, including in zero-shot settings. According to the CIDEr metric, we
obtain a new state of the art on both COCO and nocaps when using external data.
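
To make the idea of separating content from style more concrete, the following is a minimal sketch of how a text prompt combining a stylistic token with content keywords might be assembled before caption generation. It is an illustration only, not the paper's implementation: the token strings (`<style:human>`, `<style:web>`, `<sep>`, `<eos>`), the function names, and the ordering of the prompt are all assumptions, and the abstract does not specify how keywords are extracted or how image features enter the model.

```python
from typing import List

# Hypothetical stylistic tokens: one for clean human-annotated captions
# (e.g. COCO-like) and one for noisy web-collected captions.
STYLE_TOKENS = {
    "human": "<style:human>",
    "web": "<style:web>",
}
SEP_TOKEN = "<sep>"  # hypothetical separator between prompt and caption
EOS_TOKEN = "<eos>"  # hypothetical end-of-sequence marker


def build_prompt(keywords: List[str], style: str) -> str:
    """Join a style token and content keywords into a single text prompt."""
    if style not in STYLE_TOKENS:
        raise ValueError(f"unknown style: {style!r}")
    return " ".join([STYLE_TOKENS[style], *keywords, SEP_TOKEN])


def build_training_sequence(keywords: List[str], style: str, caption: str) -> str:
    """Append the target caption to the prompt.

    Under a prompt language modeling objective, the model would be trained
    to continue the prompt with the caption tokens.
    """
    return f"{build_prompt(keywords, style)} {caption} {EOS_TOKEN}"


if __name__ == "__main__":
    # A noisy web pair keeps its own style token during training ...
    print(build_training_sequence(["corgi", "frisbee"], "web", "my corgi loves frisbee!!"))
    # ... while at inference the human style token can be requested instead.
    print(build_prompt(["corgi", "frisbee"], "human"))
```

Under this sketch, the keywords carry the content (which may include long-tail concepts seen only in web data), while the style token selects the descriptive register, so a model trained on both sources could be asked for a human-annotated style description of a concept it only saw in noisy captions.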

Source: https://arxiv.org/abs/2111.12727

