
Actionable Interpretability for AI Safety


If you have a question about this talk, please contact Lucas Resck.

Abstract: Interpretability research for large language models (LLMs) has advanced rapidly in recent years. Yet a central open question remains: how can these insights be transformed into practical tools for improving AI safety? In this talk, I present ongoing efforts to leverage interpretability for both immediate and long-term safety goals. First, I show how disentangling model parameters enables precise knowledge erasure, achieving finer-grained and more robust control than common fine-tuning and editing methods. Next, I introduce a scalable approach for decomposing residual stream activations through their local geometry, demonstrating its advantages for localizing and steering model behavior. Lastly, I turn to the increasingly debated question of AI consciousness, using interpretability to test a neuroscientifically inspired indicator of agency and meta-cognitive monitoring in LLMs.

Bio: Mor Geva is an Assistant Professor at the School of Computer Science and AI at Tel Aviv University. Her research focuses on understanding the inner workings of large language models to increase their transparency and efficiency, control their operation, and improve their reasoning abilities. Mor completed a Ph.D. in Computer Science at Tel Aviv University, was a postdoctoral researcher at Google DeepMind and the Allen Institute for AI, and worked as a Research Scientist at Google Research. She is a recipient of Intel’s Rising Star Faculty Award (2024), the Alon Scholarship for Outstanding Faculty (2024), the EMNLP Best Paper Award (2024), the EACL Outstanding Paper Award (2023), an MIT Rising Star in EECS nomination (2021), and the Dan David Prize for Graduate Students in the field of AI (2020).

This talk is part of the Language Technology Lab Seminars series.

