Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods

Abstract

We focus on ten prominent tasks that integrate language and vision by discussing their problem formulations, methods, existing datasets, and evaluation measures, and by comparing the results obtained with state-of-the-art methods.
Introduction

Multimodal learning models can: generate comprehensible yet concise and grammatically well-formed descriptions of visual content, or, vice versa, generate visual content for a given textual description in a natural language of choice; identify objects in the visual content and infer their relationships in order to reason about, or answer arbitrary questions about, them; navigate through an environment by leveraging input from both vision and natural language instructions; translate textual content from one language to another while leveraging the visual content for sense disambiguation; generate stories about the visual content; and so on.