University of Cambridge · Language Technology Lab Seminars

End-to-End Fine-grained Multi-modal Understanding


If you have a question about this talk, please contact Panagiotis Fytas.

Previously, multi-modal reasoning systems relied on a pre-trained object detector to extract regions of interest from the image. However, this crucial module was typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This made it challenging for such systems to capture the long tail of visual concepts expressed in free-form text. In this talk, I will first discuss MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, such as a caption or a question. The model is trained on 1.3M text-image pairs, mined from pre-existing multi-modal datasets with explicit alignment between phrases in the text and objects in the image. Next, we will explore further developments in architecture design that fuse the visual and textual modalities deeper in the model, achieving state-of-the-art results when coupled with a coarse-to-fine pre-training strategy. Finally, I will discuss a novel fine-grained visual understanding task and evaluation benchmark, which shows that existing benchmarks overestimate VL models' ability to understand and reason over complex visual scenes, leaving substantial room for improvement.
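To give a flavour of the text-conditioned detection idea described above (a toy illustration only, not MDETR's actual architecture; all feature vectors, names, and values below are invented for the sketch), each candidate image region can be scored against the embeddings of the query's tokens, so that detections are grounded in spans of free-form text rather than drawn from a fixed label vocabulary:

```python
# Toy sketch: ground image regions in a free-form text query by
# cosine similarity between region features and token embeddings.
# All embeddings here are hypothetical 2-D vectors for illustration.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def ground_regions(region_feats, token_feats, tokens):
    """For each detected region, return the query token it aligns to best."""
    grounded = []
    for feat in region_feats:
        scores = [cosine(feat, t) for t in token_feats]
        best = max(range(len(scores)), key=scores.__getitem__)
        grounded.append(tokens[best])
    return grounded

# Hypothetical query "black cat on a sofa" with two candidate regions.
tokens = ["black", "cat", "sofa"]
token_feats = [[1.0, 0.0], [0.9, 0.4], [0.0, 1.0]]
region_feats = [[0.8, 0.3], [0.1, 0.9]]
print(ground_regions(region_feats, token_feats, tokens))  # ['cat', 'sofa']
```

Because the alignment targets are embeddings of arbitrary text rather than entries in a closed class list, the same mechanism can in principle cover long-tail concepts that never appear in a detector's fixed vocabulary.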

This talk is part of the Language Technology Lab Seminars series.

