Balanced and Efficient tokenization across languages
If you have a question about this talk, please contact Shun Shao.

Abstract: In this talk I will present our work showing disparities in the way different languages are processed in today’s language models and point to the challenges of current tokenization schemes. I will then propose two different ways to overcome those challenges: (1) by implicitly tokenizing the text during training; and (2) by removing tokenization and working with a new byte-level mapping. Together, these methods pave the way to a more controlled and balanced preprocessing of multiple languages, resulting in more efficient language modeling.

Bio: Hila is an incoming Assistant Professor at UBC, currently a postdoctoral researcher at the University of Washington. In her research, Hila works towards two main goals: (1) developing algorithms and methods for controlling the model’s behavior; (2) making cutting-edge language technology available and fair across speakers of different languages and users from different socio-demographic groups. Before joining UW, Hila was a postdoctoral researcher at Amazon and Meta AI. Prior to that, she completed her Ph.D. in Computer Science at the NLP lab at Bar Ilan University. She obtained her M.Sc. in Computer Science from the Hebrew University. Hila is the recipient of several prestigious postdoc awards and an EECS Rising Stars award. Her work received best paper awards at CoNLL 2019 and at the RepL4NLP workshop 2022.

This talk is part of the Language Technology Lab Seminars series.