Capability-oriented Evaluation in AI: From IRT to Measurement Layouts

If you have a question about this talk, please contact Luning Sun.

The talk is available online. Please email the organiser and ask for the Teams invite.

With the advent of general-purpose systems in AI, such as large language models, their evaluation is finally transitioning from the reporting of aggregate performance on some benchmarks to the extraction of capabilities in more carefully designed measurement experiments, in a way that should resemble the theory and practice of psychological measurement. I will illustrate some examples where Factor Analysis and Item Response Theory have been applied to AI evaluation in the past. In these psychometric approaches, estimating capabilities has an advantage over measuring performance, in that capabilities aim to be independent of the task distribution. However, the parameters and factors in these models are still highly dependent on the underlying population of AI systems, which is more arbitrary and changeable than human or animal populations. To address this issue, we need a more cognitive, intrinsic approach, identifying task demands and mapping the capabilities that can meet these demands. Under this perspective, I will present a new approach referred to as ‘measurement layouts’: generalised (non-linear) Hierarchical Bayesian Networks that can infer the latent capabilities of a single AI system from observed performance and task demands, and then predict performance for new tasks. Measurement layouts help explain what makes an individual AI system fail and anticipate its performance on future tasks. At the end of the talk, I’ll invite attendees to an open discussion on how measurement layouts compare to other novel approaches such as Assessors (performance models trained on test data) and more traditional approaches such as Structural Equation Modelling (if used for individuals).
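
As a rough illustration of the idea of inferring a latent capability from task demands and then predicting performance on new tasks, here is a minimal sketch. It is not the speaker's measurement layout model: it assumes a single capability, a single demand per task, a logistic link between the capability-demand margin and success, a uniform prior on a grid, and made-up demand and outcome values. All function names and the `slope` parameter are hypothetical.

```python
import math

def success_prob(capability, demand, slope=1.0):
    # Assumed logistic link: success is likely when capability exceeds demand.
    return 1.0 / (1.0 + math.exp(-slope * (capability - demand)))

def posterior_over_capability(demands, outcomes, grid, slope=1.0):
    # Grid approximation of the posterior under a uniform prior over `grid`.
    log_post = []
    for c in grid:
        ll = 0.0
        for d, y in zip(demands, outcomes):
            p = success_prob(c, d, slope)
            ll += math.log(p) if y else math.log(1.0 - p)
        log_post.append(ll)
    m = max(log_post)  # subtract the maximum for numerical stability
    weights = [math.exp(lp - m) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

def predict_success(new_demand, grid, posterior, slope=1.0):
    # Posterior predictive probability of success on a task with a new demand.
    return sum(w * success_prob(c, new_demand, slope)
               for c, w in zip(grid, posterior))

# Illustrative (invented) data: one AI system, eight tasks of rising demand.
grid = [i / 10 for i in range(0, 101)]   # candidate capability values 0..10
demands = [1, 2, 3, 4, 5, 6, 7, 8]
outcomes = [1, 1, 1, 1, 0, 1, 0, 0]      # 1 = success, 0 = failure

posterior = posterior_over_capability(demands, outcomes, grid, slope=2.0)
cap_mean = sum(c * w for c, w in zip(grid, posterior))
easy = predict_success(2.0, grid, posterior, slope=2.0)
hard = predict_success(9.0, grid, posterior, slope=2.0)
print(f"posterior mean capability: {cap_mean:.2f}")
print(f"predicted success, demand 2: {easy:.3f}; demand 9: {hard:.3f}")
```

Unlike the population-based psychometric models mentioned in the abstract, this toy example uses only one system's results and the demands of the tasks it attempted, which is the property the abstract highlights.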

This talk is part of the Cambridge Psychometrics Centre Seminars series.
