Capability-oriented Evaluation in AI: From IRT to Measurement Layouts
If you have a question about this talk, please contact Luning Sun.

The talk is available online. Please email the organiser and ask for the Teams invite.

With the advent of general-purpose systems in AI, such as large language models, their evaluation is finally transitioning from reporting aggregate performance on some benchmarks to extracting capabilities from better-designed measurement experiments, in a way that should resemble the theory and practice of psychological measurement. I will illustrate some examples where Factor Analysis and Item Response Theory have been applied to AI evaluation in the past. In these psychometric approaches, estimating capabilities has an advantage over measuring performance: capabilities aim to be independent of the task distribution. However, the parameters and factors in these models are still highly dependent on the underlying population of AI systems, which is more arbitrary and changeable than human or animal populations.

To address this issue, we need a more cognitive, intrinsic approach that identifies task demands and maps the capabilities that can meet these demands. From this perspective, I will present a new approach referred to as ‘measurement layouts’: generalised (non-linear) Hierarchical Bayesian Networks that can infer the latent capabilities of a single AI system from observed performance and task demands, and then predict performance on new tasks. Measurement layouts help explain what makes an individual AI system fail and help anticipate its performance on future tasks.

At the end of the talk, I’ll invite attendees to an open discussion on how measurement layouts compare to other novel approaches such as Assessors (performance models trained on test data) and more traditional approaches such as Structural Equation Modelling (if used for individuals).

This talk is part of the Cambridge Psychometrics Centre Seminars series.
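As a rough illustration of the idea in the abstract (inferring a latent capability of one system from observed successes and task demands, then predicting performance on a new task), here is a minimal sketch in Python. It is not the speaker's method: real measurement layouts are hierarchical Bayesian networks with several capabilities and demands, whereas this sketch assumes a single capability, a single demand per task, a logistic link, a uniform prior on a grid, and made-up data. The names `success_prob`, `capability_posterior`, `predict`, and the `slope` parameter are all hypothetical.

```python
import math

def success_prob(capability, demand, slope=1.0):
    """Assumed logistic link: success becomes likely when capability exceeds demand."""
    return 1.0 / (1.0 + math.exp(-slope * (capability - demand)))

def capability_posterior(demands, outcomes, grid, slope=1.0):
    """Posterior over one latent capability on a grid, under a uniform prior."""
    weights = []
    for c in grid:
        log_lik = 0.0
        for d, y in zip(demands, outcomes):
            p = success_prob(c, d, slope)
            log_lik += math.log(p if y else 1.0 - p)
        weights.append(math.exp(log_lik))
    total = sum(weights)
    return [w / total for w in weights]

# Made-up observations for a single AI system: the demand level of each task
# and whether the system succeeded (1) or failed (0) on it.
demands = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
outcomes = [1, 1, 1, 0, 1, 0]

SLOPE = 2.0                           # assumed steepness of the link
grid = [i * 0.1 for i in range(81)]   # candidate capability values, 0.0 to 8.0

posterior = capability_posterior(demands, outcomes, grid, slope=SLOPE)
mean_capability = sum(c * w for c, w in zip(grid, posterior))

def predict(new_demand):
    """Posterior-predictive success probability for a task with a new demand level."""
    return sum(w * success_prob(c, new_demand, SLOPE)
               for c, w in zip(grid, posterior))

print(f"posterior mean capability: {mean_capability:.2f}")
print(f"predicted success at demand 2.5: {predict(2.5):.2f}")
print(f"predicted success at demand 7.0: {predict(7.0):.2f}")
```

Note that nothing here depends on a population of other systems: the estimate is anchored to the demand scale of the tasks, which is the contrast with IRT that the abstract draws.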