Balanced and Efficient tokenization across languages
- Speaker: Prof. Hila Gonen (University of British Columbia)
- Date & Time: Thursday 15 May 2025, 16:00 - 17:00
- Venue: https://cam-ac-uk.zoom.us/j/97599459216?pwd=QTRsOWZCOXRTREVnbTJBdXVpOXFvdz09
Abstract
Abstract: In this talk I will present our work showing disparities in the way different languages are processed in today’s language models and point to the challenges of current tokenization schemes. I will then propose two different ways to overcome those challenges: (1) by implicitly tokenizing the text during training; and (2) by removing tokenization and working with a new byte-level mapping. Together, these methods pave the way to a more controlled and balanced preprocessing of multiple languages, resulting in more efficient language modeling.
Bio: Hila is an incoming Assistant Professor at UBC, currently a postdoctoral researcher at the University of Washington. In her research, Hila works towards two main goals: (1) developing algorithms and methods for controlling the model's behavior; (2) making cutting-edge language technology available and fair across speakers of different languages and users of different socio-demographic groups.
Before joining UW, Hila was a postdoctoral researcher at Amazon and Meta AI. Prior to that, she did her Ph.D. in Computer Science at the NLP lab at Bar-Ilan University. She obtained her M.Sc. in Computer Science from the Hebrew University. Hila is the recipient of several prestigious postdoc awards and an EECS Rising Stars award. Her work received best paper awards at CoNLL 2019 and at the RepL4NLP workshop 2022.
Series This talk is part of the Language Technology Lab Seminars series.