A Bayesian Perspective on Generalization and SGD

Dr. Samuel L. Smith, Google Brain
Tuesday 17 April 2018, 11:00-12:00
Cambridge University Engineering Department, Lecture Room 11.

If you have a question about this talk, please contact Alexander Matthews.

ABSTRACT:

This talk presents simple Bayesian insights on two fundamental questions:

1. How can we predict whether a model optimized on the training set will perform well on new test data?

2. Why is Stochastic Gradient Descent unreasonably effective at finding local minima that perform well?

I will begin with a brief refresher on Bayesian model comparison, demonstrating that we ought to seek “flat” local minima which minimize a weighted combination of the value of the cost function at the minimum and an “Occam factor” which penalizes curvature.
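As a rough illustration of this weighted combination, the evidence for a model can be sketched with a Laplace approximation. The notation here is an assumption and is not taken from the abstract: $\omega$ denotes the parameters, $C(\omega)$ the L2-regularized cost with regularization coefficient $\lambda$ (equivalently, a Gaussian prior of precision $\lambda$), $\omega_0$ the minimum, $P$ the number of parameters, and $\lambda_i$ the eigenvalues of the Hessian of $C$ at $\omega_0$.

```latex
\begin{align}
P(y \mid x; M)
  &= \int P(y \mid \omega, x; M)\, P(\omega; M)\, d\omega \\
  &\approx \exp\bigl(-C(\omega_0)\bigr)\,
     \frac{\lambda^{P/2}}{\bigl|\nabla\nabla C(\omega)\bigr|_{\omega_0}^{1/2}} \\
  &= \exp\Bigl\{-\Bigl(C(\omega_0)
     + \tfrac{1}{2}\sum_{i=1}^{P} \ln\frac{\lambda_i}{\lambda}\Bigr)\Bigr\}
\end{align}
```

Under these assumptions, the first term in the exponent is the cost at the minimum and the second is the log of the Occam factor. Sharp minima have large $\lambda_i$ and are therefore penalized, while flat minima with small curvature receive higher evidence.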

Zhang et al. [1] received the best paper award at ICLR 2017 for demonstrating that deep convolutional networks can easily memorize random relabelings of their training sets. We show that the same phenomenon occurs in linear models. Bayesian model comparison successfully rejects models trained on random labels but accepts models trained on informative labels.

Keskar et al. [2] found that the performance of deep learning models often improves if one reduces the SGD batch size used to estimate the gradient. We argue that this can be understood directly from the principles above. Reducing the batch size introduces noise to the parameter updates, and this noise drives SGD towards flat minima which are likely to generalize well. Treating SGD as a stochastic differential equation, we predict scaling rules which describe how the optimum batch size is controlled by the learning rate, training set size and momentum coefficient. Finally, we demonstrate that decaying the learning rate and increasing the batch size during training are equivalent, in that both obtain the same test accuracy after the same number of training epochs, and we use this insight to train ResNet-50 on TPU in under 30 minutes.
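The abstract does not write the scaling rules out. The sketch below assumes the form reported in the speaker's related papers, where the SGD "noise scale" is approximately g = εN / (B(1 − m)) for learning rate ε, training set size N, batch size B and momentum coefficient m, valid when B is much smaller than N. The function names and the numbers used are hypothetical and chosen only for illustration.

```python
def sgd_noise_scale(learning_rate, train_size, batch_size, momentum=0.0):
    """Assumed approximation g = lr * N / (B * (1 - m)), for B << N."""
    if not 0.0 <= momentum < 1.0:
        raise ValueError("momentum must lie in [0, 1)")
    return learning_rate * train_size / (batch_size * (1.0 - momentum))


def batch_size_for_noise(target_noise, learning_rate, train_size, momentum=0.0):
    """Invert the assumed rule: the batch size that yields a given noise scale."""
    if not 0.0 <= momentum < 1.0:
        raise ValueError("momentum must lie in [0, 1)")
    return learning_rate * train_size / (target_noise * (1.0 - momentum))


# Hypothetical settings, not taken from the talk.
N = 50_000
base = sgd_noise_scale(0.1, N, 128)

# Dividing the learning rate by 5 and multiplying the batch size by 5
# reduce the noise scale by the same factor under the assumed rule.
decayed_lr = sgd_noise_scale(0.1 / 5, N, 128)
bigger_batch = sgd_noise_scale(0.1, N, 128 * 5)

print(base, decayed_lr, bigger_batch)
```

Under this assumed rule, a learning-rate decay schedule can be traded for a batch-size increase schedule that follows the same noise-scale trajectory, which is the equivalence the abstract describes.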

[1] Understanding deep learning requires rethinking generalization, Zhang et al., ICLR 2017

[2] On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, Keskar et al., ICLR 2017

BIO:

Following a PhD in theoretical physics at the University of Cambridge, Sam joined the machine learning team at Babylon Health, developing a medical chatbot for primary care. In July 2017, he moved to California for the Google Brain Residency. His research is focused on optimization and natural language processing.

This talk is part of the Machine Learning @ CUED series.
