Measuring Causal Effects of Data Statistics on Language Model Predictions

  • Yanai Elazar (Bar-Ilan University)
  • Wednesday 01 June 2022, 17:00-18:00
  • Computer Lab, FW26.

If you have a question about this talk, please contact Michael Schlichtkrull.


Training data is one of the major reasons for the success of state-of-the-art NLP models. But what exactly in the training data causes a model to make a certain prediction? We seek to answer this research question by formalizing it in a causal framework that provides a useful language for investigating how training data influences predictions. Importantly, our causal framework bypasses the need to retrain expensive models and allows us to estimate causal effects based on observational data alone. Addressing the problem of extracting factual knowledge from pretrained language models (PLMs), we focus on simple data statistics, namely co-occurrence counts, and show that these statistics influence the predictions of PLMs. This establishes a causal link between simple statistics from the training data (co-occurrence counts) and PLMs’ behavior, and shows that their language understanding is limited. Our causal framework and our results demonstrate the importance of categorizing and studying datasets used for model training and the benefits of causality in our field for understanding NLP models.


Yanai Elazar is a fourth-year PhD student at Bar-Ilan University, working with Prof. Yoav Goldberg on NLP. His main interests involve model interpretation, analysis, biases in datasets and models, and commonsense reasoning. Yanai was awarded multiple scholarships, including the PBC fellowship for outstanding PhD candidates in Data Science, and the Google PhD Fellowship.

This talk is part of the NLIP Seminar Series.

