AI Control

Dmitrii Krasheninnikov, Lauro Langosco, University of Cambridge
Wednesday 17 April 2024, 11:00-12:30
Cambridge University Engineering Department, CBL Seminar room BE4-38.

A Zoom link is available upon request (it is sent out on our mailing list, eng-mlg-rcc [at] lists.cam.ac.uk). To receive reminders, sign up to the mailing list via lists.cam.ac.uk.

How do we avoid harms from future AI systems that are deployed autonomously and are potentially more capable than humans? Most efforts so far have focused on alignment: getting the AI to want what we want. In this presentation, we focus on an orthogonal approach: developing protocols to deploy AIs safely even if they are misaligned. This approach is called AI control. The two main ingredients in AI control are:

Control protocols – AI deployment protocols designed to be robust to models acting against us (e.g. monitoring the AI’s outputs with a less capable AI system)

Methods to evaluate control protocols

In our presentation we will 1) present a paper [1] on using AI control in a setting where a smart but misaligned LLM sometimes tries to insert backdoors into code, and 2) discuss the broader control agenda and the associated challenges.

[1] Greenblatt, R., Shlegeris, B., Sachan, K., & Roger, F. (2023). AI Control: Improving safety despite intentional subversion. https://arxiv.org/abs/2312.06942

This talk is part of the Machine Learning Reading Group @ CUED series.
