University of Cambridge > Talks.cam > Machine Learning Reading Group @ CUED > AI Control

AI Control

Add to your list(s) Download to your calendar using vCal

If you have a question about this talk, please contact .

Zoom link available upon request (it is sent out on our mailing list, eng-mlg-rcc [at] lists.cam.ac.uk). Sign up to our mailing list for easier reminders via lists.cam.ac.uk.

How do we avoid harms from future AI systems that are deployed autonomously and are potentially more capable than humans? Most efforts so far have focused on alignment: getting the AI to want what we want. In this presentation, we focus on an orthogonal approach: developing protocols to deploy AIs safely even if they are misaligned. This approach is called AI control. The two main ingredients in AI control are

Control protocols – AI deployment protocols designed to be robust to models acting against us (e.g. monitoring the AI’s outputs with a less capable AI system)

Methods to evaluate control protocols

In our presentation we will 1) present a paper [1] on using AI control in a setting where a smart but misaligned LLM sometimes tries to insert backdoors into code, and 2) discuss the broader control agenda and the associated challenges. [1] Greenblatt, R., Shlegeris, B., Sachan, K., & Roger, F. (2023). AI Control: Improving safety despite intentional subversion. https://arxiv.org/abs/2312.06942

This talk is part of the Machine Learning Reading Group @ CUED series.

Tell a friend about this talk:

This talk is included in these lists:

Note that ex-directory lists are not shown.

 

© 2006-2025 Talks.cam, University of Cambridge. Contact Us | Help and Documentation | Privacy and Publicity