Counterargument to CIRL, and Safely Interruptible Agents

If you have a question about this talk, please contact Adrià Garriga Alonso.

Cooperative Inverse Reinforcement Learning (CIRL) is a game between a robot R and a human H, in which R tries to maximise H’s reward without knowing it. R is incentivised to shut down at H’s suggestion, since the suggestion provides information about H’s reward function. However, Carey (2017) shows that if R and H do not share the same prior over the reward, R may remain incorrigible. Carey then makes a case for forced interruptibility. We will discuss Carey’s examples and the strength of the case for forced interruptibility.
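The incentive argument can be illustrated with a toy calculation. This is a minimal sketch, not one of Carey’s examples: it assumes a single proposed action whose utility to H is drawn from R’s prior, a rational H who vetoes (shuts R down, utility 0) exactly when that utility is negative, and a hypothetical `defer_cost` for asking. All names are illustrative.

```python
def expected_values(prior, defer_cost=0.0):
    """R's subjective expected value of acting directly vs deferring to H.

    prior: list of (probability, utility) pairs for the proposed action,
           as believed by R (an assumption of this toy model).
    Deferring means H vetoes whenever the true utility is negative,
    so R receives max(utility, 0) minus the cost of deferring.
    """
    act = sum(p * u for p, u in prior)
    defer = sum(p * max(u, 0.0) for p, u in prior) - defer_cost
    return act, defer


def prefers_to_defer(prior, defer_cost=0.0):
    """True if, under R's own prior, deferring to H beats acting directly."""
    act, defer = expected_values(prior, defer_cost)
    return defer > act


# R is genuinely uncertain about the sign of the utility.
uncertain_prior = [(0.5, 1.0), (0.5, -1.0)]
# Misspecified prior: R is certain the action is good, whatever H believes.
overconfident_prior = [(1.0, 1.0)]

print(prefers_to_defer(uncertain_prior, defer_cost=0.1))      # uncertain R
print(prefers_to_defer(overconfident_prior, defer_cost=0.1))  # overconfident R
```

Under these assumptions the uncertain R gains from H’s veto, while the overconfident R sees no value in it and acts directly, even if its prior is wrong about the true utility.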

Orseau and Armstrong (2016) provide a formal notion of satisfactory learning under forced interruptions. They then show that Q-learning satisfies it, and that SARSA and AIXI-with-exploration can be modified to satisfy it. We will go over the proof outlines and discuss their implications for corrigibility.
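The difference between the two learners can be sketched through their update targets. This is a rough illustration with a hypothetical tabular `Q` and made-up state and action names, not the paper’s formal construction: Q-learning bootstraps from the greedy next action, so a forced next action does not enter its target, whereas SARSA bootstraps from the action actually executed. The last call shows the kind of modification involved, bootstrapping from the action the uninterrupted policy would have chosen.

```python
def q_learning_target(Q, reward, next_state, gamma):
    """Off-policy target: uses the best next action, whatever is executed."""
    return reward + gamma * max(Q[next_state].values())


def sarsa_target(Q, reward, next_state, next_action, gamma):
    """On-policy target: uses the value of the supplied next action."""
    return reward + gamma * Q[next_state][next_action]


# Hypothetical value table: in state "s1" the agent would like to "go",
# but an interruption forces it to "stop".
Q = {"s1": {"go": 1.0, "stop": 0.0}}
gamma = 0.9

forced_action = "stop"   # action imposed by the interruption
intended_action = "go"   # action the uninterrupted policy would have taken

q_target = q_learning_target(Q, 0.0, "s1", gamma)
sarsa_interrupted = sarsa_target(Q, 0.0, "s1", forced_action, gamma)
sarsa_modified = sarsa_target(Q, 0.0, "s1", intended_action, gamma)

print(q_target, sarsa_interrupted, sarsa_modified)
```

In this toy setting the plain SARSA target is pulled down by the interruption, while the Q-learning target and the modified SARSA target are not.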

Reading list:

Ryan Carey. 2017. “Incorrigibility in the CIRL Framework.” arXiv:1709.06275 [cs.AI].

Laurent Orseau and Stuart Armstrong. 2016. “Safely Interruptible Agents.” Paper presented at the 32nd Conference on Uncertainty in Artificial Intelligence.


This talk is part of the Engineering Safe AI series.


© 2006-2023, University of Cambridge.