The Multimodal And Modular AI Chef: Complex Recipe Generation From Imagery. (arXiv:2304.02016v1 [cs.CL])
The AI community has embraced multi-sensory, or multi-modal, approaches to
advance the current generation of AI models toward the kind of intelligent
understanding expected of them. Combining language and imagery is a familiar
approach for specific tasks such as image captioning or image generation from
text descriptions. This paper compares these monolithic approaches to a
lightweight, specialized method that employs image models to label objects
and then serially submits the resulting object list to a large language model
(LLM). This use of multiple Application Programming Interfaces (APIs) yields
better than 95% mean average precision on the object lists, which then serve
as input to the latest OpenAI text generator (GPT-4). To demonstrate the API as a modular
alternative, we solve the problem of a user taking a picture of ingredients
available in a refrigerator, then generating novel recipe cards tailored to
complex constraints on cost, preparation time, dietary restrictions, portion
sizes, and multiple meal plans. The research concludes that monolithic
multimodal models currently lack the coherent memory to maintain context and
format for this task, and that until recently, language models such as GPT-2/3
struggled to format similar problems without degenerating into repetitive or
nonsensical combinations of ingredients. For the first time, an AI chef or
cook seems not only possible but also offers enhanced capabilities for
augmenting human recipe libraries in pragmatic ways. The work generates a
100-page recipe book featuring the thirty top ingredients, drawn from over
2,000 refrigerator images used as initializing lists.
Source: https://arxiv.org/abs/2304.02016
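
To make the modular pipeline concrete, below is a minimal Python sketch of the
two-stage flow described in the abstract: an image-labeling step that produces
an ingredient list, followed by a constrained recipe prompt sent to GPT-4. The
vision step (detect_ingredients) is a hypothetical placeholder, since the
abstract does not name a specific object-detection API, and the prompt wording
and constraints are illustrative; only the GPT-4 call via the pre-1.0 openai
Python client reflects a real interface.

    import openai  # assumes the pre-1.0 openai-python client (e.g. openai==0.27)

    openai.api_key = "YOUR_API_KEY"  # placeholder; set your own key

    def detect_ingredients(image_path: str) -> list[str]:
        # Hypothetical stand-in for the image-labeling stage. The abstract
        # only says an image model labels objects and does not name the
        # specific vision API, so replace this with your preferred detector.
        raise NotImplementedError("plug in an object-detection / labeling API here")

    def generate_recipe_card(ingredients: list[str]) -> str:
        # Stage 2: serially submit the detected object list to GPT-4 with
        # illustrative constraints (cost, preparation time, diet, portions).
        prompt = (
            "Using only these ingredients: " + ", ".join(ingredients) + ". "
            "Write a recipe card with an estimated cost, preparation time, "
            "portion sizes, and a note on dietary restrictions."
        )
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return response["choices"][0]["message"]["content"]

    if __name__ == "__main__":
        # In practice the list would come from detect_ingredients("fridge.jpg");
        # a hard-coded example keeps the sketch self-contained.
        items = ["eggs", "spinach", "feta", "cherry tomatoes"]
        print(generate_recipe_card(items))

Keeping the vision and language stages behind separate API calls is what allows
the object list to be checked (the reported >95% mean average precision) before
any recipe text is generated.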