ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models

If you have a question about this talk, please contact Panagiotis Fytas.

Vision-and-Language (VL) models should be able to address shortcomings of Large Language Models (LLMs) (e.g., lack of symbol grounding, reporting bias affecting commonsense knowledge) by successfully modeling a cross-modal space and transferring information between the visual and language domains. Fueled by the recent successes of VL models integrating text with images, the community has started researching VL models that integrate text with video sequences. Integrating language with temporal video sequences should give models (i) better grounding capabilities and (ii) the ability to capitalize on a larger body of tacit knowledge, such as presuppositions, consequences, and temporal reasoning. Despite promising results on multimodal tasks (such as Image Captioning, Visual Question Answering, and Image-Text Retrieval), recent literature has shown that models integrating image and text are highly susceptible to statistical biases in their large-scale training data, which allows them to solve multimodal tasks without actually leveraging multimodal signals. Analogously, we focus our analysis on Video-and-Language models (VidLMs) and construct ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of these models on a firm footing. Task-based evaluations, while valuable, fail to capture the complexities and specific temporal aspects of moving images that VidLMs need to process. Through carefully curated counterfactuals, ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding. ViLMA also includes proficiency tests, which assess basic capabilities deemed essential for solving the main counterfactual tests. We show that current VidLMs' grounding abilities are no better than those of vision-and-language models that use static images. This is especially striking once the performance on proficiency tests is factored in. Our benchmark serves as a catalyst for future research on VidLMs, helping to highlight areas that still need to be explored.
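
For intuition, a counterfactual (foil) test of this kind can be read as a pairwise comparison: a model passes an item if, in a zero-shot setting, it scores the correct caption above its minimally edited counterfactual for the same video. The following is a minimal illustrative sketch of such an evaluation loop, not the benchmark's actual code; the score(video, text) interface, the Example fields, and the way the proficiency test is combined with the main test are assumptions made for illustration.

    # Minimal sketch of a zero-shot counterfactual (foil) evaluation loop.
    # Assumes a hypothetical VidLM wrapper exposing score(video, text) -> float
    # (e.g. a video-text similarity or matching score); all names are illustrative.

    from dataclasses import dataclass
    from typing import Callable, Iterable

    @dataclass
    class Example:
        video_path: str        # path to the video clip
        caption: str           # correct description of the clip
        foil: str              # counterfactual caption (minimally edited)
        proficiency_caption: str   # proficiency-test caption probing a prerequisite skill
        proficiency_foil: str      # its counterfactual counterpart

    def pairwise_accuracy(score: Callable[[str, str], float],
                          examples: Iterable[Example],
                          require_proficiency: bool = False) -> float:
        """Fraction of examples where the model prefers the caption over its foil.
        If require_proficiency is True, an example only counts as correct when the
        proficiency caption is also scored above its foil."""
        correct, total = 0, 0
        for ex in examples:
            total += 1
            main_ok = score(ex.video_path, ex.caption) > score(ex.video_path, ex.foil)
            if require_proficiency:
                prof_ok = (score(ex.video_path, ex.proficiency_caption)
                           > score(ex.video_path, ex.proficiency_foil))
                correct += int(main_ok and prof_ok)
            else:
                correct += int(main_ok)
        return correct / max(total, 1)

Under this reading, factoring proficiency tests into the main score (require_proficiency=True) can only lower accuracy, which is why the reported gap to human-level understanding becomes more pronounced once proficiency is taken into account.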

This talk is part of the Language Technology Lab Seminars series.
