

Making and breaking tokenizers


If you have a question about this talk, please contact Suchir Salhan.

Despite massive investments in training large language models, tokenizers remain a critical but often neglected component with weaknesses that can cause wild hallucinations, bypass safety guardrails, and break downstream applications. This talk will cover:

Our recent research in automatically detecting problematic ‘glitch’ tokens in any model

Fundamental issues with pretokenizers and their design

Novel approaches to encodings and pretokenization that address some of these problems.

Speaker Bio: Sander Land is a researcher at Writer and previously worked at Cohere. He completed his PhD at the Department of Computer Science, University of Oxford, before undertaking a postdoc in Biomedical Engineering at King’s College London, University of London.

This talk is part of the NLIP Seminar Series.


© 2006-2025 Talks.cam, University of Cambridge.