Latent Alignment of Procedural Concepts in Multimodal Recipes. (arXiv:2101.04727v1 [cs.CL])

We propose a novel alignment mechanism for procedural reasoning on RecipeQA, a
recently released multimodal QA dataset. Our model solves the textual cloze
task, a reading comprehension task over a recipe containing images and
instructions. We exploit the power of attention networks,
cross-modal representations, and a latent alignment space between instructions
and candidate answers to solve the problem. We introduce constrained
max-pooling, which refines the max-pooling operation on the alignment matrix to
impose disjoint constraints among the outputs of the model. Our evaluation
results indicate a 19% improvement over the baselines.
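
To illustrate the idea of max-pooling under a disjointness constraint, the
following is a minimal sketch and not the paper's implementation. It assumes
the alignment matrix has one row per blank slot and one column per candidate
answer, and it uses a simple greedy rule (pick the highest remaining score,
then mask its row and column) so that no candidate is selected twice. The
function name `constrained_max_pool`, the greedy rule, and the example scores
are all assumptions made for illustration.

```python
import numpy as np


def constrained_max_pool(alignment):
    """Greedy one-to-one selection over an alignment matrix.

    Hypothetical helper: rows are assumed to be slots, columns candidates.
    Returns a dict mapping each assigned row index to a distinct column index.
    """
    scores = np.array(alignment, dtype=float)  # copy, so the input is untouched
    n_rows, n_cols = scores.shape
    assignment = {}
    for _ in range(min(n_rows, n_cols)):
        # Highest remaining score across the whole matrix.
        r, c = np.unravel_index(np.argmax(scores), scores.shape)
        assignment[int(r)] = int(c)
        # Mask the chosen row and column so neither can be picked again.
        scores[r, :] = -np.inf
        scores[:, c] = -np.inf
    return assignment


# Made-up scores: plain row-wise max-pooling sends rows 0 and 1 to column 0.
alignment = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.30],
    [0.40, 0.70, 0.60],
]
plain = [int(np.argmax(row)) for row in alignment]
constrained = constrained_max_pool(alignment)
print(plain)
print(constrained)
```

In this toy example, unconstrained max-pooling assigns the same candidate to
two slots, while the constrained variant returns a distinct candidate per
slot. A greedy pass is only one way to enforce the constraint; an optimal
assignment solver could be substituted.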



