Rethinking the role of tokenization in the NLP pipeline

Kris Cao (DeepMind)
Friday 02 December 2022, 12:00-13:00
Computer Lab, FW26.

If you have a question about this talk, please contact Michael Schlichtkrull.

Abstract:

Tokenization is an integral part of the modern NLP pipeline, yet it is often treated as a black box without regard for the design choices that must be made when choosing a tokenizer. I will give an overview of how the two currently dominant tokenization algorithms work, and discuss their limitations from both a computational and a typological perspective. I will then talk about my recent EMNLP paper, which suggests using multiple tokenizations from the tokenizer to overcome the limitations of taking a single tokenization. Finally, I will discuss some ongoing work which uses character-based tokenization for masked language modelling, and examines which modelling architectures work well in this setting.

Bio:

Kris is a senior research scientist in the Language team at DeepMind. His research interests are at the intersection of linguistics, NLP and machine learning, and he is primarily focused on problems of unsupervised structure induction from language. He received his PhD from the University of Cambridge, where he worked on deep generative models for text generation.

This talk is part of the NLIP Seminar Series.
