Data Selection for Pre-training and Instruction-tuning of LLMs

If you have a question about this talk, please contact Panagiotis Fytas.

There is increasing evidence that choosing the right training data is essential for producing state-of-the-art large language models (LLMs). How can we decide what counts as high-quality training data? Can we select fewer examples and still improve performance and efficiency? In this talk, I will present two recent works on selecting high-quality data for pre-training and instruction tuning. I will first present QuRating, a simple framework for selecting pre-training data that captures the abstract qualities of texts that humans intuitively perceive. We demonstrate that state-of-the-art LLMs (e.g., GPT-3.5-turbo) can discern these qualities in pairwise judgments, and we emphasize the importance of balancing quality and diversity. We have created QuRatedPajama, a dataset comprising 260 billion tokens with fine-grained quality ratings, and show that sampling according to these ratings improves perplexity and in-context learning. Second, I will present LESS, a method that effectively estimates data influences in order to identify relevant instruction-tuning data for specific applications (a setting we call “targeted instruction tuning”). LESS is efficient, transferable (a smaller model can be used for data selection), optimizer-aware (it works with Adam), and easy to interpret. We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
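
As a rough illustration of what sampling according to quality ratings could look like, the sketch below draws documents without replacement with probability proportional to exp(rating / temperature). A low temperature concentrates on the top-rated texts, while a high temperature approaches uniform sampling and so keeps more diversity. The function name `sample_by_quality`, the example ratings, and the temperature values are hypothetical and are not taken from the talk; this is a sketch under those assumptions, not the speaker's implementation.

```python
import math
import random


def sample_by_quality(ratings, k, temperature=1.0, seed=0):
    """Draw k distinct document indices, favouring higher-rated documents.

    Assumption (not from the talk): each draw picks a remaining document
    with probability proportional to exp(rating / temperature).
    """
    if k > len(ratings):
        raise ValueError("k cannot exceed the number of documents")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    rng = random.Random(seed)
    remaining = list(range(len(ratings)))
    chosen = []
    for _ in range(k):
        # Subtract the current maximum before exponentiating so the
        # largest weight is exactly 1 and the weights never all underflow.
        top = max(ratings[i] for i in remaining)
        weights = [math.exp((ratings[i] - top) / temperature) for i in remaining]
        position = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(position))
    return chosen


# Hypothetical quality ratings for four documents.
ratings = [0.1, 2.0, -1.0, 0.5]
sharp = sample_by_quality(ratings, k=2, temperature=0.01)   # near-greedy
broad = sample_by_quality(ratings, k=2, temperature=100.0)  # near-uniform
```

The temperature is the single knob that trades quality against diversity in this sketch: as it tends to zero the selection becomes a top-k cut, and as it grows the ratings stop mattering.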

This talk is part of the Language Technology Lab Seminars series.


© 2006-2024, University of Cambridge.