Long-Range Transformers
If you have a question about this talk, please contact Elre Oldewage.
Transformer architectures now deliver state-of-the-art performance across many tasks, including natural language processing, computer vision, protein modelling and beyond. Unfortunately, the self-attention mechanism scales quadratically, O(L^2) in time and memory, as the sequence length L grows. In this talk, we will discuss a zoo of recently proposed methods that reduce the time or memory complexity of Transformers down to O(L) and even O(1).
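To make the O(L^2) vs O(L) contrast concrete, below is a minimal numpy sketch comparing standard softmax attention, which materialises the full L x L attention matrix, with a kernelised "linear attention" that never forms it. The function names and the simple positive feature map `phi` are illustrative assumptions: the Performer of Choromanski et al. uses random features (FAVOR+) rather than this elementwise map, and here the feature dimension m simply equals the head dimension d.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: materialises an L x L score matrix, so O(L^2) time and memory."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                  # (L, L)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                        # (L, d)

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelised attention: out ~ phi(Q) [phi(K)^T V] / (phi(Q) phi(K)^T 1).
    The bracketed terms have shapes (m, d) and (m,), so the cost is O(L * m * d),
    i.e. linear in L. `phi` is a toy positive feature map, not the Performer's FAVOR+ features."""
    Qp, Kp = phi(Q), phi(K)                                   # (L, m) with m = d for this elementwise map
    kv = Kp.T @ V                                             # (m, d), accumulated in one pass over the sequence
    z = Kp.sum(axis=0)                                        # (m,), normaliser statistics
    return (Qp @ kv) / (Qp @ z)[:, None]                      # (L, d); no L x L matrix is ever formed

# Toy usage: same input/output shapes, very different asymptotics in L.
L, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, L, d))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)  # (512, 64) (512, 64)
```

The further step to O(1)-style memory (as in the SLiM paper listed below) comes from processing the sequence in chunks and recomputing intermediate statistics during the backward pass, rather than from changing the attention approximation itself.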
Literature:
Efficient Transformers: A Survey. Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler. arXiv:2009.06732.
Rethinking Attention with Performers. Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller. ICLR 2021.
Sub-Linear Memory: How to Make Performers SLiM. Valerii Likhosherstov, Krzysztof Choromanski, Jared Davis, Xingyou Song, Adrian Weller. arXiv:2012.11346.
This talk is part of the Machine Learning Reading Group @ CUED series.