
Data Repurposing: Improving LLM Capabilities with Synthetic Data Generation


If you have a question about this talk, please contact Tiancheng Hu.

Abstract: Creating high-quality, diverse, large-scale datasets remains a critical and time-consuming challenge in improving LLM capabilities. Motivated by prior work that manually identifies implicit signals in raw corpora, we address this challenge by investigating data repurposing: a methodology for automatically transforming existing data resources into new formats and purposes. First, we propose reverse instructions, which build an English instruction-following dataset by synthetically generating an instruction for each human-written corpus document. The model trained on our synthetic dataset performs significantly better than other instruction-following models, especially in long-form generation. Next, in MURI, we extend this approach to 200 languages to create a culturally inclusive, native dataset and multilingual instruction-following models for very low-resource languages. Finally, in CRAFT, we customize this approach to generate datasets for arbitrary downstream tasks, targeting unannotated corpora and synthesizing custom task examples by retrieving and rewriting corpus documents with few-shot examples. Our experiments demonstrate that this approach can generate large-scale datasets for any given task, showing up to a 25% improvement on tasks such as biology QA and summarization compared to few-shot settings.
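
For readers unfamiliar with the reverse-instructions setup, the Python sketch below illustrates the core idea: the human-written corpus document is kept as the target output, and an LLM is prompted to produce the instruction that could have elicited it. The prompt wording, the generate() placeholder, and all names here are illustrative assumptions, not the speaker's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class InstructionPair:
    instruction: str  # synthetically generated by the LLM
    output: str       # the original human-written corpus document

# Hypothetical prompt wording; the actual prompt used in the work may differ.
REVERSE_PROMPT = (
    "Below is a text. Write the instruction a user could have given "
    "so that this text is the answer.\n\n"
    "Text:\n{document}\n\nInstruction:"
)


def generate(prompt: str) -> str:
    """Placeholder for any LLM completion call (local model or API)."""
    raise NotImplementedError


def reverse_instruction(document: str) -> InstructionPair:
    # The corpus document stays untouched; only the instruction is synthetic,
    # so the resulting (instruction, output) pair inherits the quality of
    # the human-written text, which helps long-form generation in particular.
    instruction = generate(REVERSE_PROMPT.format(document=document))
    return InstructionPair(instruction=instruction, output=document)
```

Mapped over a raw corpus, this yields an instruction-tuning dataset without any manual annotation; CRAFT follows the same repurposing principle but first retrieves task-relevant documents using few-shot examples before rewriting them into task format.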

Bio: Abdullatif Köksal is a final-year ELLIS PhD student at CIS, LMU Munich, and LTL, University of Cambridge, supervised by Prof. Hinrich Schütze and Prof. Anna Korhonen. His research focuses on improving LLM capabilities through effective data utilization and synthetic data generation. He has proposed several methods for data repurposing by restructuring and augmenting existing data resources, including reverse instructions for long-form instruction tuning and a culturally respectful multilingual instruction-following dataset covering 200 languages. He extended these approaches to dataset generation for downstream tasks through better corpus mining with LLMs in CRAFT. He has also worked in areas such as counterfactuality, robustness, and multilinguality, and has published multiple papers in top-tier NLP venues. He interned at Google and Amazon, where he worked on counterfactuality and faithfulness.

This talk is part of the Language Technology Lab Seminars series.
