Subtleties about Pre-Training Data: Imbalance and Staleness

If you have a question about this talk, please contact Tiancheng Hu.

Abstract: The success of pre-trained large language models (LLMs) is largely attributed to the extensive and diverse data used during their pre-training phase. Leveraging this pre-training data effectively can lead to notable improvements in model quality, robustness, and cost-efficiency. First, I will address the challenges of pre-training on imbalanced datasets, such as those found in multilingual settings where data availability varies greatly between high- and low-resource languages. Common approaches to mitigating this issue include upsampling low-resource languages or upweighting their loss. Although these methods are often treated as equivalent, I will demonstrate through theoretical and empirical evidence that they are distinct. Based on these insights, we propose a strategy for efficient and balanced training on imbalanced datasets. Second, I will investigate the issue of temporal degradation in LLMs, which arises after the cutoff dates for training-data collection. Our empirical evidence indicates that this degradation often begins well before the stated cutoff, a point we call the “effective cutoff” date. I will discuss our analysis of open pre-training datasets, which uncovers the main causes of these observations. These findings imply that knowledge cutoffs are more intricate than previously thought, necessitating careful consideration from both LLM dataset curators and users.
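To make the upsampling-versus-upweighting distinction concrete, here is a minimal numpy sketch (illustrative only, not taken from the talk or the papers; the corpus sizes, per-example losses, and the target_mix parameter are made up). Both strategies target the same expected loss, but the per-batch estimates they produce behave very differently, which is one way the two can diverge under stochastic training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: two "languages" with very different amounts of data.
# Per-example losses are just random draws here; the point is the estimator.
n_high, n_low = 10_000, 100                 # heavily imbalanced corpus sizes
loss_high = rng.normal(1.0, 0.5, n_high)    # per-example losses, high-resource
loss_low = rng.normal(2.0, 0.5, n_low)      # per-example losses, low-resource

target_mix = 0.5      # desired contribution of each language to the objective
batch_size = 32
n_batches = 20_000

def batch_loss_upsample():
    """Upsample: draw low-resource examples more often, weight every example equally."""
    pick_low = rng.random(batch_size) < target_mix
    losses = np.where(pick_low,
                      rng.choice(loss_low, batch_size),
                      rng.choice(loss_high, batch_size))
    return losses.mean()

def batch_loss_upweight():
    """Upweight: sample proportionally to raw counts, scale the loss instead."""
    p_low_natural = n_low / (n_high + n_low)
    pick_low = rng.random(batch_size) < p_low_natural
    # Weights chosen so each language contributes target_mix to the loss in expectation.
    w_low = target_mix / p_low_natural
    w_high = (1 - target_mix) / (1 - p_low_natural)
    losses = np.where(pick_low,
                      w_low * rng.choice(loss_low, batch_size),
                      w_high * rng.choice(loss_high, batch_size))
    return losses.mean()

up_s = np.array([batch_loss_upsample() for _ in range(n_batches)])
up_w = np.array([batch_loss_upweight() for _ in range(n_batches)])

# Same objective in expectation, very different per-batch noise.
print(f"upsample  mean={up_s.mean():.3f}  std={up_s.std():.3f}")
print(f"upweight  mean={up_w.mean():.3f}  std={up_w.std():.3f}")
```

Running the sketch, the two estimators report roughly the same mean loss, but the upweighted one has a much larger per-batch standard deviation: the rare, heavily weighted low-resource examples make its estimates far noisier.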

Based on the following works:

1. Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets: https://arxiv.org/abs/2410.04579

2. Dated Data: Tracing Knowledge Cutoffs in Large Language Models: https://arxiv.org/abs/2403.12958

Bio: Daniel Khashabi is an assistant professor of computer science at Johns Hopkins University and is affiliated with the Center for Language and Speech Processing (CLSP) and the Data Science and AI Institute. He is interested in building reasoning-driven modular NLP systems that are robust, transparent, and communicative, particularly those that use natural language as the communication medium. Khashabi has published over 50 papers on natural language processing and AI in top-tier venues. His research has won best paper awards at COLM (2024), ACL (2023), and NAACL (2022), an Amazon Research Award (2022), and AI2's Last Impact Award (2024). Before joining Hopkins, he was a postdoctoral fellow at the Allen Institute for AI (2019-2022) and obtained his Ph.D. from the University of Pennsylvania in 2019.

This talk is part of the Language Technology Lab Seminars series.
