Does a Technique for Building Multimodal Representation Matter? — Comparative Analysis. (arXiv:2206.06367v1 [cs.LG])

Creating a meaningful representation by fusing single modalities (e.g., text,
images, or audio) is the core concept of multimodal learning. Although several
techniques for building multimodal representations have proven successful,
they have not yet been compared. It has therefore been unclear which
technique can be expected to yield the best results in a given scenario and
what factors should be considered when choosing such a technique. This paper
explores the most common techniques for building multimodal data
representations (late fusion, early fusion, and the sketch) and
compares them on classification tasks. Experiments are conducted on three
datasets: Amazon Reviews, MovieLens25M, and MovieLens1M. In general,
our results confirm that multimodal representations can improve on the
performance of unimodal models, from 0.919 to 0.969 accuracy on Amazon
Reviews and from 0.907 to 0.918 AUC on MovieLens25M. However, experiments on both
MovieLens datasets indicate the importance of input data that is meaningful for the
given task. In this article, we show that the choice of the technique for
building a multimodal representation is crucial for obtaining the highest possible
model performance, which comes with a proper combination of modalities. This
choice depends on the influence that each modality has on the analyzed machine
learning (ML) problem, the type of ML task, and the memory constraints during
the training and prediction phases.
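
The abstract does not give implementation details, so the following is only a minimal sketch under the commonly used definitions: early fusion joins per-modality feature vectors before a single model sees them, while late fusion combines the outputs of separately trained unimodal models. The function names, array shapes, and toy values are hypothetical, and the sketch technique is left out because the abstract does not describe how it works.

```python
import numpy as np


def early_fusion(modalities):
    """Concatenate per-modality feature matrices along the feature axis.

    Assumption: every matrix has shape (n_samples, n_features_i).
    """
    return np.concatenate(modalities, axis=1)


def late_fusion(per_modality_probs, weights=None):
    """Combine class probabilities from unimodal models by (weighted) averaging.

    Assumption: every array has shape (n_samples, n_classes).
    """
    stacked = np.stack(per_modality_probs, axis=0)
    return np.average(stacked, axis=0, weights=weights)


# Toy inputs (hypothetical): 4 samples, 8 text features, 16 image features.
rng = np.random.default_rng(0)
text_features = rng.normal(size=(4, 8))
image_features = rng.normal(size=(4, 16))

# Early fusion: one joint representation of shape (4, 24) for a single classifier.
fused_features = early_fusion([text_features, image_features])

# Late fusion: each modality already has its own classifier output.
text_probs = np.array([[0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.1, 0.9]])
image_probs = np.array([[0.6, 0.4], [0.5, 0.5], [0.2, 0.8], [0.3, 0.7]])
fused_probs = late_fusion([text_probs, image_probs])
predicted_classes = fused_probs.argmax(axis=1)
```

One practical difference visible even in this toy form is that early fusion makes the joint representation grow with the total number of features across modalities, whereas late fusion keeps each unimodal model separate and only merges their outputs, which is relevant to the memory constraints mentioned above.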


