Capability-oriented Evaluation in AI: From IRT to Measurement Layouts

Prof Jose Hernandez-Orallo
Tuesday 21 November 2023, 16:00-17:30
S3.04, Simon Sainsbury Centre, Cambridge Judge Business School.

If you have a question about this talk, please contact Luning Sun.

The talk is available online. Please email the organiser and ask for the Teams invite.

With the advent of general-purpose AI systems such as large language models, evaluation is finally moving from reporting aggregate performance on benchmarks to extracting capabilities from carefully designed measurement experiments, in a way that should resemble the theory and practice of psychological measurement. I will illustrate some examples where Factor Analysis and Item Response Theory (IRT) have been applied to AI evaluation in the past. In these psychometric approaches, estimating capabilities has an advantage over measuring performance, in that capabilities aim to be independent of the task distribution. However, the parameters and factors in these models still depend heavily on the underlying population of AI systems, which is more arbitrary and more changeable than human or animal populations. To address this issue, we need a more cognitive, intrinsic approach that identifies task demands and maps the capabilities that can meet them. From this perspective, I will present a new approach referred to as ‘measurement layouts’: generalised (non-linear) Hierarchical Bayesian Networks that can infer the latent capabilities of a single AI system from observed performance and task demands, and then predict performance on new tasks. Measurement layouts help explain what makes an individual AI system fail and anticipate its performance on future tasks. At the end of the talk, I’ll invite attendees to an open discussion on how measurement layouts compare with other novel approaches such as Assessors (performance models trained on test data) and with more traditional approaches such as Structural Equation Modelling (if used for individuals).

This talk is part of the Cambridge Psychometrics Centre Seminars series.
