Out-of-context reasoning/learning in LLMs and its safety implications
If you have a question about this talk, please contact . A Teams link is available upon request (it is sent out on our mailing list, eng-mlg-rcc [at] lists.cam.ac.uk). Sign up to our mailing list via lists.cam.ac.uk for easier reminders.

Beyond learning patterns within individual training datapoints, Large Language Models (LLMs) can infer latent structures and relationships by aggregating information scattered across different training samples, a capability known as out-of-context reasoning (OOCR) [1, 2]. We'll review key empirical findings, including Implicit Meta-Learning (models implicitly learning source reliability and subsequently internalizing reliable-seeming data more strongly [1]) and Inductive OOCR (models inferring other latent structures from scattered data [3]). We'll explore potential mechanisms behind these phenomena [1, 4]. Finally, we'll discuss the significant AI safety implications, arguing that OOCR coupled with Situational Awareness [5] underpins threats like Alignment Faking [6], potentially leading to persistent misalignment that resists standard alignment techniques.

References:

1. Krasheninnikov et al., "Implicit meta-learning may lead language models to trust more reliable sources". https://arxiv.org/abs/2310.15047
2. Berglund et al., "Taken Out of Context: On Measuring Out-of-Context Reasoning in LLMs". https://arxiv.org/abs/2309.00667
3. Treutlein et al., "Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data". https://arxiv.org/abs/2406.14546
4. Feng et al., "Extractive Structures Learned in Pretraining Enable Generalization on Finetuned Facts". https://arxiv.org/abs/2412.04614
5. Laine et al., "Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs". https://arxiv.org/abs/2407.04694
6. Greenblatt et al., "Alignment faking in large language models". https://arxiv.org/abs/2412.14093

This talk is part of the Machine Learning Reading Group @ CUED series.