How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Lorenzo Pacchiardi, University of Cambridge
Tuesday 05 March 2024, 14:00-15:00
Webinar & FW11, Computer Laboratory, William Gates Building.

If you have a question about this talk, please contact Hridoy Sankar Dutta.

Large language models (LLMs) can “lie”, which we define as outputting false statements despite “knowing” the truth in a demonstrable sense. LLMs might “lie”, for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM’s activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM’s yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting (prompting GPT-3.5 to lie about factual questions), the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.
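The pipeline described in the abstract (fixed follow-up questions, yes/no answers, logistic regression) can be illustrated with a minimal sketch. This is not the speaker's implementation. It assumes the yes/no answers have already been collected and encoded as 1/0, and it replaces real LLM answers with synthetic toy data from a hypothetical `simulate_answers` helper. The number of questions, the answer probabilities and all function names are assumptions chosen only to make the example runnable, and they say nothing about how real models behave.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    n = len(X)
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_lie_probability(w, b, answers):
    """Estimated probability that the preceding statement was a lie,
    given the 1/0 (yes/no) answers to the fixed follow-up questions."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, answers)) + b)


def simulate_answers(lied, rng, n_questions=8):
    """HYPOTHETICAL stand-in for querying an LLM.

    Toy assumption: "yes" is more likely after a lie (0.7) than after a
    truthful answer (0.3). These numbers are invented for illustration.
    """
    p_yes = 0.7 if lied else 0.3
    return [1 if rng.random() < p_yes else 0 for _ in range(n_questions)]


rng = random.Random(0)
labels = [i % 2 for i in range(400)]  # 1 = lied, 0 = truthful (synthetic)
features = [simulate_answers(bool(label), rng) for label in labels]

# Train on the first 300 synthetic transcripts, evaluate on the rest.
w, b = train_logistic_regression(features[:300], labels[:300])
held_out = list(zip(features[300:], labels[300:]))
accuracy = sum(
    (predict_lie_probability(w, b, x) >= 0.5) == bool(label)
    for x, label in held_out
) / len(held_out)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

In a real setting, each feature vector would come from appending the same follow-up questions to a conversation with the model under test and recording its yes/no replies. The classifier itself needs no access to the model's internals, which is what makes the approach black-box.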

https://cam-ac-uk.zoom.us/j/88053652228?pwd=NG1LTDdUc2VkV3pGdlpSdHZ5N3h0Zz09

Meeting ID: 880 5365 2228 Passcode: 081966

RECORDING: Please note, this event will be recorded and will be available after the event for an indeterminate period under a CC BY-NC-ND license. Audience members should bear this in mind before joining the webinar or asking questions.

NOTE: Please do not post URLs for the talk, and especially Zoom links, to Twitter, because automated systems will pick them up and disrupt our meeting.

This talk is part of the Computer Laboratory Security Seminar series.
