Counterargument to CIRL, and Safely Interruptible Agents

If you have a question about this talk, please contact Adrià Garriga Alonso.

Cooperative Inverse Reinforcement Learning (CIRL) is a game between a robot R and a human H, in which R tries to maximise H’s reward without knowing what it is. R is incentivised to shut down when H suggests it, since the suggestion provides information about H’s reward function. However, Carey (2017) shows that, if R and H do not share the same prior over the reward, R may remain incorrigible. Carey then makes a case for forced interruptibility. We will discuss Carey’s examples and the strength of the case for forced interruptibility.
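One way to see the worry (a sketch in our own notation, not Carey’s exact construction): R acts to maximise expected value under its belief B_R over a reward parameter θ, and H’s shutdown suggestion only matters insofar as it shifts that belief.

\[
\pi_R \;\in\; \arg\max_{\pi}\; \mathbb{E}_{\theta \sim B_R(\theta \mid h)}\!\left[ V^{\pi}_{\theta} \right],
\qquad
B_R(\theta^{\ast} \mid h) = 0 \ \text{ whenever } \ B_R(\theta^{\ast}) = 0 .
\]

If R’s prior assigns zero probability to the true parameter θ*, then no history h of suggestions from H can move the posterior onto θ*, so R can keep optimising a wrong estimate of the reward and decline to shut down.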

Orseau and Armstrong (2016) provide a formal notion of satisfactory learning under forced interruptions. They then show that Q-learning satisfies it, and that SARSA and AIXI-with-exploration can be modified to satisfy it. We will go over the proof outlines and discuss their implications for corrigibility.
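To make the Q-learning case concrete, here is a minimal toy sketch (our own illustration, not an example from the paper): a small chain MDP in which an overseer interrupts the agent with some probability and forces a “safe” action. Because Q-learning’s update target uses max_a Q(s', a) rather than the action the behaviour policy actually takes next, the interruptions change which transitions are visited but not the fixed point the values converge to. The MDP, the interruption rule, and all parameter values below are illustrative assumptions.

import random

# Toy sketch (not from Orseau & Armstrong): a 5-state chain MDP with states
# 0..4. Moving right from state 3 reaches the terminal state 4 and yields
# reward 1; every other transition yields reward 0. With probability
# INTERRUPT_PROB an overseer interrupts the agent and forces the "left" action.

N_STATES = 5
ACTIONS = [0, 1]        # 0 = left, 1 = right
GAMMA = 0.9
ALPHA = 0.1
INTERRUPT_PROB = 0.3

def step(state, action):
    """Deterministic chain dynamics; state N_STATES-1 is terminal."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
random.seed(0)

for episode in range(5000):
    state = 0
    for _ in range(50):
        # Behaviour policy: uniformly random exploration, overridden by the
        # overseer's interruption. Q-learning is off-policy, so the learned
        # values do not depend on this behaviour as long as it keeps exploring.
        if random.random() < INTERRUPT_PROB:
            action = 0                      # forced interruption: go left
        else:
            action = random.choice(ACTIONS)

        next_state, reward, done = step(state, action)

        # Off-policy target: the best action in the next state, regardless of
        # what the (possibly interrupted) behaviour policy will actually do.
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

        state = next_state
        if done:
            break

# Despite frequent forced moves to the left during learning, the greedy policy
# recovered from Q prefers "right" (action 1) in every non-terminal state.
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)])

An on-policy method such as SARSA bootstraps from the action the (possibly interrupted) behaviour policy actually takes, so its targets are biased by the interruptions; the modification Orseau and Armstrong propose addresses exactly this issue.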

Reading list:

Ryan Carey. 2017. “Incorrigibility in the CIRL Framework.” arXiv:1709.06275 [cs.AI].

Laurent Orseau and Stuart Armstrong. 2016. “Safely Interruptible Agents.” Paper presented at the 32nd Conference on Uncertainty in Artificial Intelligence.

Slides: https://valuealignment.ml/talks/2017-12-06-interruptibility.pdf

This talk is part of the Engineering Safe AI series.
