
Interpreting and Controlling Intermediate Representations in Large Language Models


  • Speaker: Nicola Cancedda, Meta's Fundamental AI Research (FAIR) team
  • Date and time: Tuesday 26 November 2024, 14:00-15:00
  • Venue: Computer Lab, LT1

If you have a question about this talk, please contact Sally Matthews.

Large Language Models (LLMs) have reshaped the AI landscape and become a dinner-table topic, yet we still do not really understand how they work. Propelled by this consideration, the field of AI interpretability is enjoying a revival. In this talk I will introduce some fundamental interpretability concepts and discuss how insights from studying the internal activations of models led us to develop a training framework that significantly increases the robustness of LLMs to ‘jailbreaking’ attacks. I will also illustrate some explorations of the internal workings of transformer-based autoregressive LLMs that unexpectedly led to an explanation of ‘attention sinking’, a mechanism necessary for their proper functioning. I will conclude by offering my perspective on interesting future directions.

Nicola Cancedda is a researcher with Meta’s Fundamental AI Research (FAIR) team. His current focus is on better understanding how Large Language Models realize complex behaviors, in order to make them more capable, safer, and more efficient. He is an alumnus of the University of Rome “La Sapienza”, and has held applied and fundamental research and management positions at Meta, Xerox, and Microsoft, pushing the state of the art in Machine Learning, Machine Translation, and Natural Language Processing, and leading the transfer of research results to large-scale production environments.

This talk is part of the Cambridge ML Systems Seminar Series.

