Rethinking the role of tokenization in the NLP pipeline

If you have a question about this talk, please contact Michael Schlichtkrull.

Abstract:

Tokenization is an integral part of the modern NLP pipeline, yet it is often treated as a black box, with little regard for the design choices involved in selecting a tokenizer. I will give an overview of how the two currently dominant tokenization algorithms work, and discuss their limitations from both a computational and a typological perspective. I will then talk about my recent EMNLP paper, which proposes using multiple tokenizations produced by the tokenizer to overcome the limitations of committing to a single one. Finally, I will discuss some ongoing work that uses character-based tokenization for masked language modelling and examines which modelling architectures work well in this setting.

Bio:

Kris is a senior research scientist in the Language team at DeepMind. His research interests are at the intersection of linguistics, NLP and machine learning, and he is primarily focused on problems of unsupervised structure induction from language. He received his PhD from the University of Cambridge, where he worked on deep generative models for text generation.

This talk is part of the NLIP Seminar Series.

© 2006-2024 Talks.cam, University of Cambridge.