Subtleties about Pre-Training Data: Imbalance and Staleness
If you have a question about this talk, please contact Tiancheng Hu.

Abstract: The success of pre-trained large language models (LLMs) is largely attributed to the extensive and diverse data used during their pre-training phase. Leveraging this pre-training data effectively can lead to notable improvements in model quality, robustness, and cost-efficiency. Firstly, I will address the challenges of (pre-)training on imbalanced datasets, such as those found in multilingual settings where data availability varies greatly between high- and low-resource languages. Common approaches to mitigating this imbalance include upsampling low-resource languages or increasing their loss weight. Although these methods are often seen as equivalent, I will demonstrate through theoretical and empirical evidence that they are distinct. Based on these insights, we propose a strategy for efficient and balanced training on imbalanced datasets. Secondly, I will investigate the issue of temporal degradation in LLMs, which arises after the cutoff dates for training data collection. Our empirical evidence indicates that this degradation often begins well before the stated cutoff, at a point we call the “effective cutoff” date. I will discuss our analysis of open pre-training datasets, which uncovers the main causes of these observations. These findings imply that knowledge cutoffs are more intricate than previously thought, necessitating careful consideration from both LLM dataset curators and users.

Based on the following works:

1. Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets: https://arxiv.org/abs/2410.04579
2. Dated Data: Tracing Knowledge Cutoffs in Large Language Models: https://arxiv.org/abs/2403.12958

Bio: Daniel Khashabi is an assistant professor of computer science at Johns Hopkins University, affiliated with the Center for Language and Speech Processing (CLSP) and the Data Science and AI Institute. He is interested in building reasoning-driven modular NLP systems that are robust, transparent, and communicative, particularly those that use natural language as the communication medium. Khashabi has published over 50 papers on natural language processing and AI in top-tier venues. His research has won best paper awards at COLM (2024), ACL (2023), and NAACL (2022), an Amazon Research Award (2022), and AI2's Last Impact Award (2024). Before joining Hopkins, he was a postdoctoral fellow at the Allen Institute for AI (2019-2022) and obtained his Ph.D. from the University of Pennsylvania in 2019.

This talk is part of the Language Technology Lab Seminars series.
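For context, the sketch below contrasts the two mitigation strategies mentioned in the abstract: upsampling changes how often low-resource examples are drawn, while upweighting changes how much each drawn example contributes to the loss. It is an illustrative PyTorch-style sketch only, not the implementation from the cited paper; the names upsample_loader, upweighted_loss, lang_ids, target_share, and low_resource_weight are hypothetical, and the code assumes a dataset of (example, language_id) pairs where language 1 is the low-resource one.

```python
# Illustrative sketch (hypothetical; not from the cited paper): two common ways
# to handle a corpus in which language 1 is heavily under-represented.
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler


def upsample_loader(dataset, lang_ids, target_share=0.5, batch_size=32):
    """Upsampling: draw low-resource examples more often; the loss stays unweighted."""
    lang_ids = torch.as_tensor(lang_ids)
    n_high = int((lang_ids == 0).sum())
    n_low = int((lang_ids == 1).sum())
    # Per-example sampling weights chosen so that, in expectation, the
    # low-resource language fills `target_share` of each batch.
    w = torch.empty(len(lang_ids), dtype=torch.float)
    w[lang_ids == 0] = (1.0 - target_share) / n_high
    w[lang_ids == 1] = target_share / n_low
    sampler = WeightedRandomSampler(w, num_samples=len(lang_ids), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)


def upweighted_loss(per_example_loss, lang_ids, low_resource_weight=10.0):
    """Upweighting: keep the natural data mix, but scale the low-resource loss."""
    lang_ids = torch.as_tensor(lang_ids)
    weights = torch.ones_like(per_example_loss)
    weights[lang_ids == 1] = low_resource_weight
    # Weighted average keeps the overall loss scale comparable to the unweighted case.
    return (weights * per_example_loss).sum() / weights.sum()
```

With suitable settings the two can be tuned to agree in expectation, but the mini-batch gradients they produce behave differently during stochastic training, which gives some intuition for the distinction the abstract highlights.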