BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:AIReg-Bench: Benchmarking Language Models That Assess AI Regulatio
 n Compliance - William Marino (University of Cambridge)
DTSTART:20251021T120000Z
DTEND:20251021T130000Z
UID:TALK239710@talks.cam.ac.uk
CONTACT:Mateja Jamnik
DESCRIPTION:As governments move to regulate AI\, there is growing interest
  in using Large Language Models (LLMs) to assess whether or not an AI syst
 em complies with a given AI Regulation (AIR). However\, there is presently
  no way to benchmark the performance of LLMs at this task. To fill this vo
 id\, we introduce AIReg-Bench: the first benchmark dataset designed to tes
 t how well LLMs can assess compliance with the EU AI Act (AIA). We created
  this dataset through a two-step process: (1) by prompting an LLM with car
 efully structured instructions\, we generated 120 technical documentation 
 excerpts (samples)\, each depicting a fictional\, albeit plausible\, AI sy
 stem — of the kind an AI provider might produce to demonstrate their com
 pliance with AIR\; (2) legal experts then reviewed and annotated each samp
 le to indicate whether\, and in what way\, the AI system described therein
  violates specific Articles of the AIA. The resulting dataset\, together w
 ith our evaluation of whether frontier LLMs can reproduce the experts’ c
 ompliance labels\, provides a starting point to understand the opportuniti
 es and limitations of LLM-based AIR compliance assessment tools and establ
 ishes a benchmark against which subsequent LLMs can be compared. The datas
 et and evaluation code are available at https://github.com/camlsys/aireg-b
 ench.\n\nYou can also join us on Zoom: https://cam-ac-uk.zoom.us/j/834003
 35522?pwd=LkjYvMOvVpMbabOV1MVTm8QU6DrGN7.1\n
LOCATION:Lecture Theatre 1\, Computer Laboratory\, William Gates Building
END:VEVENT
END:VCALENDAR
