Asymmetry in Supposedly Equivalent Facts: Pre-training Bias in Large Language Models
If you have a question about this talk, please contact Suchir Salhan.

Understanding and mitigating hallucinations in Large Language Models (LLMs) is crucial for ensuring reliable content generation. While previous research has primarily focused on “when” LLMs hallucinate, our work explains “why” and directly links model behaviour to the pre-training data that forms their prior knowledge. Specifically, we demonstrate that an asymmetry exists in the recognition of logically equivalent facts, which can be attributed to frequency discrepancies of entities appearing as subjects versus objects. Given that most pre-training datasets are inaccessible, we leverage the fully open-source OLMo series by indexing its Dolma dataset to estimate entity frequencies. Using relational facts (represented as triples) from Wikidata5M, we construct probing datasets to isolate this effect.

Our experiments reveal that facts with a high-frequency subject and a low-frequency object are better recognised than their inverse, despite their logical equivalence. The pattern reverses in low-to-high frequency settings, and no statistically significant asymmetry emerges when both entities are high-frequency. These findings underscore the influential role of pre-training data in shaping model predictions and provide insights for inferring the characteristics of pre-training data in closed or partially closed LLMs.

This talk is part of the NLIP Seminar Series.