University of Cambridge > Talks.cam > NLIP Seminar Series > The Past, Present and Future of Tokenization
The Past, Present and Future of Tokenization
If you have a question about this talk, please contact Suchir Salhan.

Abstract: Current large language models (LLMs) predominantly use subword tokenization: they see text as chunks (called “tokens”) made up of individual words or parts of words. This has a number of consequences. For example, LLMs often struggle with seemingly simple tasks involving character-level knowledge, like counting the number of letters in a word or comparing two numbers. Subword tokenization can also lead to discrepancies across languages: processing English text with an LLM is often cheaper than processing text in other languages. We will talk about how these issues came to be, as well as how to potentially improve tokenization by moving away from subwords (e.g., to models directly ingesting bytes) and/or towards more adaptive, modular tokenization. Finally, we will conclude by discussing the far reach of tokenization into seemingly unrelated fields (model merging and multimodality).

Speaker Biography: Benjamin Minixhofer is a PhD student in the Language Technology Lab, interested in multilinguality, tokenization and language emergence.

This talk is part of the NLIP Seminar Series.