Data Selection for Pre-training and Instruction-tuning of LLMs
If you have a question about this talk, please contact Panagiotis Fytas.

There is increasing evidence that choosing the right training data is essential for producing state-of-the-art large language models (LLMs). How can we decide on high-quality training data? Can we select fewer data examples to improve both performance and efficiency? In this talk, I will present two recent works on selecting high-quality data for pre-training and instruction tuning.

I will first present QuRating, a simple framework for selecting pre-training data that captures the abstract attributes of texts that humans intuitively perceive. We demonstrate that state-of-the-art LLMs (e.g., GPT-3.5-turbo) can discern these qualities in pairwise judgments, and we emphasize the importance of balancing quality and diversity. We have created QuRatedPajama, a dataset comprising 260 billion tokens with fine-grained quality ratings, and show that sampling according to these ratings improves perplexity and in-context learning.

Second, I present LESS, a method that effectively estimates data influences for identifying relevant instruction-tuning data for specific applications (a setting we call "targeted instruction tuning"). LESS is efficient, transferable (a smaller model can be used for data selection), optimizer-aware (it works with Adam), and easy to interpret. We show that training on a LESS-selected 5% of the data can often outperform training on full datasets across diverse downstream tasks.

This talk is part of the Language Technology Lab Seminars series.
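As a rough illustration of the first approach described in the abstract, the following minimal Python sketch shows one way pairwise quality judgments can be turned into selection probabilities: pairwise comparisons are aggregated into scalar ratings with an Elo/Bradley-Terry-style update, and documents are then sampled in proportion to exp(rating / τ). The toy corpus, the placeholder judge heuristic (standing in for an LLM judge such as GPT-3.5-turbo), and the temperature value are all illustrative assumptions, not the talk's actual implementation.

```python
"""Sketch: pairwise quality ratings for pre-training data selection (illustrative only)."""
import math
import random

random.seed(0)

# Toy corpus standing in for pre-training documents.
corpus = [
    "A careful explanation of how transformers use attention.",
    "buy cheap pills now click here best price!!!",
    "Photosynthesis converts light energy into chemical energy in plants.",
    "asdf asdf asdf lorem lorem lorem",
]

def judge_pair(doc_a: str, doc_b: str) -> int:
    """Placeholder for an LLM pairwise judgment on an attribute such as
    'educational value'. Returns 0 if doc_a wins, 1 if doc_b wins.
    Here: a crude heuristic (longer average word length wins)."""
    score = lambda d: sum(len(w) for w in d.split()) / max(len(d.split()), 1)
    return 0 if score(doc_a) >= score(doc_b) else 1

# Aggregate pairwise wins into scalar ratings with an Elo-like update.
ratings = [0.0] * len(corpus)
K, n_comparisons = 0.1, 200
for _ in range(n_comparisons):
    i, j = random.sample(range(len(corpus)), 2)
    winner = (i, j)[judge_pair(corpus[i], corpus[j])]
    loser = j if winner == i else i
    expected = 1.0 / (1.0 + math.exp(ratings[loser] - ratings[winner]))
    ratings[winner] += K * (1.0 - expected)
    ratings[loser] -= K * (1.0 - expected)

# Sample documents in proportion to exp(rating / tau); a lower temperature
# is greedier toward top-rated documents, a higher one preserves diversity.
tau = 0.5
weights = [math.exp(r / tau) for r in ratings]
probs = [w / sum(weights) for w in weights]

for doc, p in sorted(zip(corpus, probs), key=lambda x: -x[1]):
    print(f"{p:.2f}  {doc[:55]}")
```

The temperature parameter is where the quality-versus-diversity trade-off mentioned in the abstract shows up: sampling purely by top rating would discard diverse but lower-rated text.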
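In the same spirit, here is a minimal sketch of gradient-similarity data selection for targeted instruction tuning, under simplifying assumptions: a toy linear model, random projection of per-example gradients to a low dimension, and cosine similarity against the gradient of a small target set, keeping the top 5%. The Adam-aware preconditioning and other details of the actual LESS method are omitted; all names, shapes, and hyperparameters here are illustrative.

```python
"""Sketch: gradient-influence scoring for targeted data selection (illustrative only)."""
import torch

torch.manual_seed(0)

# Tiny stand-in model and synthetic data (regression with an MSE loss).
d_in, n_train, n_target = 16, 200, 8
model = torch.nn.Linear(d_in, 1)
loss_fn = torch.nn.MSELoss()

X_train = torch.randn(n_train, d_in)
y_train = torch.randn(n_train, 1)
X_target = torch.randn(n_target, d_in)
y_target = torch.randn(n_target, 1)

def flat_grad(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Flattened gradient of the loss on (x, y) w.r.t. all model parameters."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.reshape(-1) for p in model.parameters()])

# Gradient of the target task, averaged over the small target set.
target_grad = flat_grad(X_target, y_target)

# Random projection keeps per-example gradient features low-dimensional and cheap.
n_params = target_grad.numel()
proj = torch.randn(n_params, 8) / 8 ** 0.5
target_feat = target_grad @ proj

# Score each training example by the cosine similarity between its projected
# gradient and the projected target gradient.
scores = []
for i in range(n_train):
    feat = flat_grad(X_train[i : i + 1], y_train[i : i + 1]) @ proj
    scores.append(torch.cosine_similarity(feat, target_feat, dim=0).item())

# Keep the top 5% of training examples by influence score.
k = max(1, int(0.05 * n_train))
selected = sorted(range(n_train), key=lambda i: -scores[i])[:k]
print("selected training indices:", selected)
```

The 5% budget mirrors the fraction quoted in the abstract; because the gradient features can come from a smaller proxy model, this style of selection is what makes the approach transferable.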