Latent Variable Models for Visual Question Answering. (arXiv:2101.06399v1 [cs.CV])

Conventional models for Visual Question Answering (VQA) explore deterministic
approaches with various types of image features, question features, and
attention mechanisms. However, there exist other modalities that can be
explored in addition to image and question pairs to bring extra information to
the models. In this work, we propose latent variable models for VQA where extra
information (e.g. captions and answer categories) are incorporated as latent
variables to improve inference, which in turn benefits question-answering
performance. Experiments on the VQA v2.0 benchmarking dataset demonstrate the
effectiveness of our proposed models in that they improve over strong
baselines, especially those that do not rely on extensive language-vision
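The abstract's core idea, treating side information as a latent variable inferred alongside the answer, can be illustrated with a minimal variational sketch. The architecture below is purely hypothetical (the paper's actual model is not specified in this listing): a toy inference network maps fused image-question features to a Gaussian latent, a sample of which conditions the answer decoder, and the ELBO combines the answer log-likelihood with a KL penalty against a standard-normal prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions -- illustrative only, not the paper's setup.
D_FEAT, D_LATENT, N_ANSWERS = 8, 4, 5

# Fused image+question feature for one example (stand-in for real encoders).
fused = rng.normal(size=D_FEAT)

# Inference network q(z | image, question): linear maps to Gaussian params.
W_mu = rng.normal(scale=0.1, size=(D_LATENT, D_FEAT))
W_logvar = rng.normal(scale=0.1, size=(D_LATENT, D_FEAT))
mu, logvar = W_mu @ fused, W_logvar @ fused

# Reparameterization trick: z = mu + sigma * eps, so gradients flow through.
eps = rng.normal(size=D_LATENT)
z = mu + np.exp(0.5 * logvar) * eps

# Decoder p(answer | z, image, question): linear scores + softmax.
W_ans = rng.normal(scale=0.1, size=(N_ANSWERS, D_FEAT + D_LATENT))
logits = W_ans @ np.concatenate([fused, z])
probs = np.exp(logits - logits.max())
probs /= probs.sum()

answer_idx = 2  # ground-truth answer index (arbitrary for the sketch)
recon = np.log(probs[answer_idx])                          # log-likelihood term
kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)   # KL(q || N(0, I))
elbo = recon - kl
print(float(elbo))
```

Training would maximize this ELBO over the dataset; at test time the latent summarizes the extra modality (e.g. a caption) even when that modality is unavailable, since it is inferred from the image-question pair.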


