
Balanced and Efficient tokenization across languages


If you have a question about this talk, please contact Shun Shao.

Abstract: In this talk I will present our work showing disparities in the way different languages are processed by today's language models and point to the challenges of current tokenization schemes. I will then propose two different ways to overcome those challenges: (1) implicitly tokenizing the text during training; and (2) removing tokenization altogether and working with a new byte-level mapping. Together, these methods pave the way toward more controlled and balanced preprocessing of multiple languages, resulting in more efficient language modeling.

Bio: Hila is an incoming Assistant Professor at UBC and currently a postdoctoral researcher at the University of Washington. In her research, Hila works toward two main goals: (1) developing algorithms and methods for controlling a model's behavior; and (2) making cutting-edge language technology available and fair across speakers of different languages and users from different socio-demographic groups.

Before joining UW, Hila was a postdoctoral researcher at Amazon and Meta AI. Prior to that, she completed her Ph.D. in Computer Science at the NLP lab at Bar-Ilan University and obtained her M.Sc. in Computer Science from the Hebrew University. Hila is the recipient of several prestigious postdoctoral awards and an EECS Rising Stars award. Her work received Best Paper Awards at CoNLL 2019 and at the RepL4NLP workshop in 2022.

This talk is part of the Language Technology Lab Seminars series.
