
AIReg-Bench: Benchmarking Language Models That Assess AI Regulation Compliance


If you have a question about this talk, please contact Mateja Jamnik.

NOTE: unusual venue (LT1)

As governments move to regulate AI, there is growing interest in using Large Language Models (LLMs) to assess whether an AI system complies with a given AI Regulation (AIR). However, there is presently no way to benchmark the performance of LLMs at this task. To fill this void, we introduce AIReg-Bench: the first benchmark dataset designed to test how well LLMs can assess compliance with the EU AI Act (AIA). We created this dataset through a two-step process: (1) by prompting an LLM with carefully structured instructions, we generated 120 technical documentation excerpts (samples), each depicting a fictional, albeit plausible, AI system of the kind an AI provider might produce to demonstrate compliance with an AIR; (2) legal experts then reviewed and annotated each sample to indicate whether, and in what way, the AI system described therein violates specific Articles of the AIA. The resulting dataset, together with our evaluation of whether frontier LLMs can reproduce the experts’ compliance labels, provides a starting point for understanding the opportunities and limitations of LLM-based AIR compliance assessment tools, and establishes a benchmark against which subsequent LLMs can be compared. The dataset and evaluation code are available at https://github.com/camlsys/aireg-bench.
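The evaluation described above compares an LLM's compliance labels against the experts' annotations. A minimal sketch of that comparison is below; the record format and label names are illustrative assumptions, not the benchmark's actual schema or scoring code:

```python
# Hypothetical sketch: score an LLM's compliance labels against expert labels.
# The "compliant"/"violation" labels and the per-sample list format are
# assumptions for illustration, not AIReg-Bench's actual data schema.

def agreement(expert_labels, model_labels):
    """Fraction of samples where the model's label matches the expert's."""
    if len(expert_labels) != len(model_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(e == m for e, m in zip(expert_labels, model_labels))
    return matches / len(expert_labels)

# Toy example: labels for four documentation excerpts.
experts = ["compliant", "violation", "violation", "compliant"]
model = ["compliant", "violation", "compliant", "compliant"]
print(agreement(experts, model))  # 0.75
```

In practice one would also want per-Article breakdowns and agreement measures that correct for chance (e.g. Cohen's kappa), since a model that always predicts the majority label can score well on raw accuracy.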

You can also join us on Zoom

This talk is part of the Artificial Intelligence Research Group Talks (Computer Laboratory) series.

