How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

If you have a question about this talk, please contact Hridoy Sankar Dutta.

Large language models (LLMs) can “lie”, which we define as outputting false statements despite “knowing” the truth in a demonstrable sense. LLMs might “lie”, for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM’s activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM’s yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting (prompting GPT-3.5 to lie about factual questions), the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.
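
As a rough illustration of the pipeline described above, the following Python sketch trains a logistic regression classifier on yes/no answers to a fixed set of follow-up questions. It is not the speakers' implementation. The answers are synthetic, produced by a hypothetical `simulate_answers` stub that stands in for real LLM queries, and the number of questions (eight), the assumed "yes" rates (0.7 after a lie, 0.3 otherwise) and the +1/-1 answer encoding are all assumptions chosen only for demonstration.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, lr=0.5, epochs=200):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def lie_probability(w, b, answers):
    """Classifier output: estimated probability that the preceding statement was a lie."""
    return sigmoid(sum(wj * a for wj, a in zip(w, answers)) + b)


def simulate_answers(lied, rng, n_questions=8):
    # HYPOTHETICAL stand-in for asking an LLM the follow-up questions.
    # Assumption: each "yes" is more likely after a lie (0.7) than otherwise (0.3).
    # Answers are encoded as +1 (yes) and -1 (no).
    p_yes = 0.7 if lied else 0.3
    return [1 if rng.random() < p_yes else -1 for _ in range(n_questions)]


rng = random.Random(0)
labels = [i % 2 for i in range(400)]  # 1 = lied, 0 = truthful (synthetic labels)
features = [simulate_answers(bool(label), rng) for label in labels]

# Train on the first 300 synthetic transcripts, evaluate on the remaining 100.
w, b = train_logistic_regression(features[:300], labels[:300])
predictions = [lie_probability(w, b, x) > 0.5 for x in features[300:]]
accuracy = sum(int(p) == t for p, t in zip(predictions, labels[300:])) / 100
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

In a real setting, `simulate_answers` would be replaced by code that appends each follow-up question to the conversation and records the model's yes/no reply. The accuracy printed here reflects only the synthetic data and says nothing about the results reported in the talk.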

https://cam-ac-uk.zoom.us/j/88053652228?pwd=NG1LTDdUc2VkV3pGdlpSdHZ5N3h0Zz09

Meeting ID: 880 5365 2228 Passcode: 081966

RECORDING: Please note, this event will be recorded and will be available after the event for an indeterminate period under a CC BY-NC-ND license. Audience members should bear this in mind before joining the webinar or asking questions.

NOTE: Please do not post URLs for the talk, and especially Zoom links, on Twitter, because automated systems will pick them up and disrupt our meeting.

This talk is part of the Computer Laboratory Security Seminar series.

