Model Interpretability: from Illusions to Opportunities
- Speaker: Dr. Asma Ghandeharioun (Google DeepMind)
- Date & Time: Thursday 12 June 2025, 14:00-15:00
- Venue: https://cam-ac-uk.zoom.us/j/97599459216?pwd=QTRsOWZCOXRTREVnbTJBdXVpOXFvdz09
Abstract
While the capabilities of today's large language models (LLMs) are reaching, and even surpassing, what was once thought impossible, concerns remain about their misalignment, such as generating misinformation or harmful text, and mitigating these failures remains an open area of research. Understanding LLMs' internal representations can help explain their behavior, verify their alignment with human values, and mitigate their errors. In this talk, I begin by challenging common misconceptions about the connections between LLMs' hidden representations and their downstream behavior, highlighting several "interpretability illusions."
Next, I introduce Patchscopes, a framework we developed that leverages the model itself to explain its internal representations in natural language. I'll show how it can be used to answer a wide range of questions about an LLM's computation. Beyond unifying prior inspection techniques, Patchscopes opens up new possibilities, such as using a more capable model to explain the representations of a smaller one. I show how Patchscopes can serve as a tool for inspection, discovery, and even error correction, with examples including fixing multi-hop reasoning errors, probing the interaction between user personas and latent misalignment, and understanding why different classes of contextualization errors occur.
I hope that by the end of this talk, the audience shares my excitement about the beauty of the internal mechanisms of AI systems, understands the nuances of model interpretability and why some observations can lead to illusions, and takes away Patchscopes as a powerful tool for qualitative analysis of how and why LLMs work and fail in different scenarios.
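The core mechanism behind Patchscopes-style inspection is activation patching: a hidden representation captured from one forward pass is injected into another pass, and the change in output reveals what that representation encodes. The following is a minimal toy sketch of that idea only; it uses a made-up two-layer model with hand-picked weights, not the actual Patchscopes implementation or a real transformer.

```python
import math

# Toy two-layer "model": y = W2 @ tanh(W1 @ x). A real Patchscope would hook
# a transformer layer at a chosen token position; this only illustrates the
# patching mechanic with hypothetical weights.
W1 = [[0.5, -0.3], [0.8, 0.1]]
W2 = [[1.0, 0.2], [-0.4, 0.9]]

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def forward(x, patch=None):
    """Run the toy model; optionally overwrite the hidden state with `patch`."""
    h = [math.tanh(z) for z in matvec(W1, x)]  # hidden representation
    if patch is not None:
        h = patch                              # inject a foreign hidden state
    return matvec(W2, h), h

x_source = [1.0, -2.0]   # prompt whose representation we want to inspect
x_target = [0.3, 0.7]    # inspection prompt

_, h_source = forward(x_source)                   # capture source hidden state
y_plain, _ = forward(x_target)                    # ordinary target run
y_patched, _ = forward(x_target, patch=h_source)  # target run with patched state
```

Because the patch fully replaces the hidden state in this toy setup, the patched output is driven entirely by the source representation rather than the target input, which is the signal a Patchscope reads out in natural language.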
Bio
Asma Ghandeharioun, Ph.D., is a senior research scientist with the People + AI Research team at Google DeepMind. She works on aligning AI with human values through better understanding and controlling (language) models, uniquely by demystifying their inner workings and correcting collective misconceptions along the way. While her current research is mostly focused on machine learning interpretability, her previous work spans conversational AI, affective computing, and, more broadly, human-centered AI. She holds a doctorate and master's degree from MIT and a bachelor's degree from the Sharif University of Technology. She has been trained as a computer scientist/engineer and has research experience at MIT, Google Research, Microsoft Research, and École Polytechnique Fédérale de Lausanne (EPFL), to name a few.
Her work has been published in premier peer-reviewed machine learning venues such as NeurIPS, ICLR, ICML, NAACL, EMNLP, AAAI, ACII, and AISTATS. She has received awards at NeurIPS, and her work has been featured in Quanta Magazine, Wired, the Wall Street Journal, and New Scientist.
Series
This talk is part of the Language Technology Lab Seminars series.
Included in Lists
- bld31
- Cambridge Centre for Data-Driven Discovery (C2D3)
- Cambridge Forum of Science and Humanities
- Cambridge Language Sciences
- Cambridge talks
- Chris Davis' list
- Guy Emerson's list
- https://cam-ac-uk.zoom.us/j/97599459216?pwd=QTRsOWZCOXRTREVnbTJBdXVpOXFvdz09
- Interested Talks
- Language Sciences for Graduate Students
- Language Technology Lab Seminars
- ndk22's list
- ob366-ai4er
- rp587
- Simon Baker's List
- Trust & Technology Initiative - interesting events
- yk449
Note: Ex-directory lists are not shown.