
The tradeoff governing efficient language model architectures


If you have a question about this talk, please contact Richard Diehl Martinez.

Recent work has proposed alternative language model architectures (e.g. RWKV, Mamba, Hyena) that are dramatically faster than attention (e.g. 25x higher throughput). However, it’s unclear how switching to these new architectures might affect the behavior of language models when scaled up. In this talk, we’ll discuss our recent work studying the fundamental tradeoffs that govern autoregressive language models. In particular, we’ll focus on language model recall, the ability to ground generations in information seen in-context, which is critical for in-context learning and copying. We show with theory and experiments that all autoregressive architectures obey a fundamental tradeoff: the less memory the model consumes during inference, the worse it is at recall. This tradeoff matters because memory consumption dictates language model throughput in practice. We propose a simple architecture called Based that combines linear attention and sliding window attention. By varying Based’s window size and linear attention feature dimension, we can dial the model’s memory consumption and traverse the Pareto frontier of the recall-memory tradeoff, recovering the full quality of attention on one end and the efficiency of the fastest attention alternatives on the other.
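To make the combination concrete, below is a minimal, illustrative sketch of a Based-style block: a global causal linear attention (whose recurrent state size is fixed, independent of sequence length) added to a small sliding-window softmax attention. This is a toy written under stated assumptions, not the speaker’s implementation: the elu(x)+1 feature map stands in for Based’s softmax-approximating feature map, and the tensor layout, additive combination, and function names are choices made here for brevity.

import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    # Feature map phi; elu(x) + 1 is a stand-in here (an assumption --
    # Based's actual feature map approximates exp(q.k)).
    phi = lambda x: F.elu(x) + 1.0
    q, k = phi(q), phi(k)
    # Causal linear attention via prefix sums. In a streaming
    # implementation only the latest k (x) v state is kept: its size,
    # d_feature x d_value, is the inference "memory" being dialed.
    kv = torch.einsum('tnd,tne->tnde', k, v).cumsum(dim=0)
    z = k.cumsum(dim=0)
    num = torch.einsum('tnd,tnde->tne', q, kv)
    den = torch.einsum('tnd,tnd->tn', q, z).unsqueeze(-1) + 1e-6
    return num / den

def sliding_window_attention(q, k, v, window):
    # Exact softmax attention restricted to the last `window` positions;
    # its KV-cache is O(window), not O(sequence length).
    t = q.shape[0]
    idx = torch.arange(t)
    offset = idx[:, None] - idx[None, :]
    mask = (offset >= 0) & (offset < window)        # causal, windowed
    scores = torch.einsum('tnd,snd->nts', q, k) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float('-inf'))
    return torch.einsum('nts,snd->tnd', scores.softmax(dim=-1), v)

def based_block(q, k, v, window=64):
    # Sketch of the combination: window size and the linear-attention
    # feature dimension together set inference-time memory, and hence
    # where the model sits on the recall-memory tradeoff curve.
    return sliding_window_attention(q, k, v, window) + linear_attention(q, k, v)

# Toy usage: tensors shaped (sequence, heads, head_dim).
t, n, d = 128, 4, 16
q, k, v = (torch.randn(t, n, d) for _ in range(3))
out = based_block(q, k, v, window=16)               # shape (128, 4, 16)

Shrinking the window and the feature dimension reduces inference-time state (raising throughput) at the cost of recall; growing them recovers full attention quality. That traversal is the Pareto frontier described above.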

Bio:

I’m a fourth-year CS PhD student in the Stanford Machine Learning Group, advised by Chris Ré and James Zou, and supported by the National Science Foundation GRFP. I like to develop a detailed understanding of how machine learning models work, and when they fail, by exploring the unstructured data on which they are trained and by formalizing sub-tasks with synthetic benchmarks. Most recently, I’ve been working on understanding how neural network building blocks affect the quality and efficiency of foundation models. I also like to build tools that leverage large, pre-trained models to facilitate the analysis and management of unstructured training and validation datasets. I’m motivated by the challenges that arise when applying machine learning in safety-critical settings like medicine and the sciences. Previously, I was a machine learning research intern at Flatiron Health. I completed my undergrad and master’s at Stanford, where I worked with Jure Leskovec’s SNAP Group and the AIMI Center.

This talk is part of the NLIP Seminar Series.


