Balanced and Efficient tokenization across languages

Prof. Hila Gonen (University of British Columbia)
Thursday 15 May 2025, 16:00-17:00
https://cam-ac-uk.zoom.us/j/97599459216?pwd=QTRsOWZCOXRTREVnbTJBdXVpOXFvdz09.

If you have a question about this talk, please contact Shun Shao.

Abstract: In this talk I will present our work showing disparities in the way different languages are processed in today’s language models and point to the challenges of current tokenization schemes. I will then propose two different ways to overcome those challenges: (1) by implicitly tokenizing the text during training; and (2) by removing tokenization and working with a new byte-level mapping. Together, these methods pave the way to more controlled and balanced preprocessing of multiple languages, resulting in more efficient language modeling.

Bio: Hila is an incoming Assistant Professor at UBC and is currently a postdoctoral researcher at the University of Washington. In her research, Hila works towards two main goals: (1) developing algorithms and methods for controlling the model’s behavior; and (2) making cutting-edge language technology available and fair across speakers of different languages and users of different socio-demographic groups.

Before joining UW, Hila was a postdoctoral researcher at Amazon and Meta AI. Prior to that, she did her Ph.D. in Computer Science at the NLP lab at Bar-Ilan University. She obtained her M.Sc. in Computer Science from the Hebrew University. Hila is the recipient of several prestigious postdoc awards and an EECS Rising Stars award. Her work received best paper awards at CoNLL 2019 and at the RepL4NLP 2022 workshop.

This talk is part of the Language Technology Lab Seminars series.
