Learning generalizable models on large-scale multi-modal data

  • Speaker: Yutian Chen (DeepMind)
  • Date: Wednesday 18 October 2023, 14:00-15:00
  • Venue: Maxwell Centre

If you have a question about this talk, please contact James Fergusson.

The abundant spectrum of multi-modal data offers a significant opportunity to augment the training of foundation models beyond text alone. In this talk, I will introduce two lines of work that leverage large-scale models, trained on Internet-scale multi-modal datasets, to achieve strong generalization performance. The first trains an audio-visual model on YouTube video datasets to enable automatic video translation and dubbing: the model learns the correspondence between audio and visual features and uses this knowledge to translate videos from one language to another. The second trains a multi-modal, multi-task, multi-embodiment generalist policy on a massive collection of data spanning simulated control tasks, vision, language, and robotics; the model learns to perform a variety of tasks, including controlling a robot arm, playing games, and translating text. Both lines of work illustrate a potential future trajectory for foundation models, highlighting the transformative power of integrating multi-modal inputs and outputs.

This talk is part of the Data Intensive Science Seminar Series.
