
Actionable Interpretability for AI Safety


If you have a question about this talk, please contact Lucas Resck.

Abstract: Interpretability research for large language models (LLMs) has advanced rapidly in recent years. Yet a central open question remains: how can these insights be transformed into practical tools for improving AI safety? In this talk, I present ongoing efforts to leverage interpretability for both immediate and long-term safety goals. First, I show how disentangling model parameters enables precise knowledge erasure, achieving finer-grained and more robust control than common fine-tuning and editing methods. Next, I introduce a scalable approach for decomposing residual stream activations through their local geometry, demonstrating its advantages for localizing and steering model behavior. Lastly, I turn to the increasingly debated question of AI consciousness, using interpretability to test a neuroscientifically inspired indicator of agency and meta-cognitive monitoring in LLMs.

Bio: Mor Geva is an Assistant Professor at the School of Computer Science and AI at Tel Aviv University. Her research focuses on understanding the inner workings of large language models to increase their transparency and efficiency, control their operation, and improve their reasoning abilities. Mor completed a Ph.D. in Computer Science at Tel Aviv University, was a postdoctoral researcher at Google DeepMind and the Allen Institute for AI, and worked as a Research Scientist at Google Research. She is a recipient of Intel’s Rising Star Faculty Award (2024), the Alon Scholarship for Outstanding Faculty (2024), the EMNLP Best Paper Award (2024), the EACL Outstanding Paper Award (2023), an MIT Rising Star in EECS nomination (2021), and the Dan David Prize for Graduate Students in the field of AI (2020).

This talk is part of the Language Technology Lab Seminars series.

