Heavy Tail Phenomenon in Stochastic Gradient Descent
TMLW02 - SGD: stability, momentum acceleration and heavy tails

Stochastic gradient descent (SGD) methods are the workhorse for training machine learning models, particularly in deep learning. In the first part of the talk, after presenting numerical evidence that SGD iterates with a constant step size can exhibit heavy-tailed behavior even when the data is light-tailed, we delve into the theoretical origins of heavy tails in SGD iterations and their connection to various capacity and complexity notions proposed for characterizing SGD's generalization properties in deep learning. Key notions that correlate with performance on unseen data include the 'flatness' of the local minimum found by SGD (related to the Hessian eigenvalues), the ratio of step size η to batch size b (which controls the magnitude of the stochastic gradient noise), and the 'tail-index' (which measures the heaviness of the tails of the eigenspectra of the network weights). We argue that these seemingly disparate perspectives on generalization are deeply intertwined. Depending on the Hessian structure at the minimum and the choice of algorithm parameters, SGD iterates converge to a heavy-tailed stationary distribution. We rigorously prove this claim in linear regression, demonstrating heavy tails and infinite variance in the iterates even for simple quadratic optimization with Gaussian data. We further analyze the tail behavior with respect to the algorithm parameters, the dimension, and the curvature, providing insights into the behavior of SGD in deep learning. Experiments on synthetic data and neural networks support our theory. We also discuss generalizations to decentralized stochastic gradient algorithms and to other popular step-size schedules, including cyclic step sizes.

In the second part of the talk, we introduce a new class of initialization schemes for fully-connected neural networks that improve SGD training performance by inducing a specific heavy-tailed behavior in the stochastic gradients. Based on joint work with Yuanhan Hu, Umut Simsekli, and Lingjiong Zhu.

This talk is part of the Isaac Newton Institute Seminar Series.
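As a concrete illustration of the linear-regression claim, the sketch below runs SGD with a constant step size on a one-dimensional least-squares problem with Gaussian data and then estimates the tail index of the stationary iterates with a Hill estimator. This is a minimal sketch, not material from the talk: the parameter values (step size 0.8, batch size 1, iteration counts) and the use of the Hill estimator are illustrative assumptions, and the talk's analysis covers the general multi-dimensional setting.

```python
# Minimal simulation sketch (illustrative assumptions, not from the talk):
# constant-step-size SGD on 1-d Gaussian least squares, then a Hill estimate
# of the tail index of the stationary iterates.
import numpy as np

rng = np.random.default_rng(0)

eta = 0.8            # constant step size; larger eta (or smaller batch) -> heavier tails
sigma = 1.0          # observation noise level
x_star = 0.0         # true regression coefficient
x = 0.0              # SGD iterate

n_iters, burn_in = 500_000, 50_000
samples = np.empty(n_iters - burn_in)

for k in range(n_iters):
    a = rng.standard_normal()                  # light-tailed (Gaussian) feature
    y = a * x_star + sigma * rng.standard_normal()
    grad = a * (a * x - y)                     # gradient of 0.5 * (a*x - y)^2
    x -= eta * grad                            # x_{k+1} = (1 - eta*a^2) x_k + eta*a*noise
    if k >= burn_in:
        samples[k - burn_in] = abs(x)

# Hill estimator of the tail index from the top-k order statistics.
# (Iterates are correlated, so this is only a rough diagnostic.)
order = np.sort(samples)[::-1]
k_top = 2000
alpha_hat = 1.0 / np.mean(np.log(order[:k_top] / order[k_top]))
print(f"estimated tail index alpha ~ {alpha_hat:.2f}")   # alpha < 2 => infinite variance
```

In this scalar recursion x_{k+1} = (1 - η a_k²) x_k + η a_k ε_k, a Kesten-type argument gives polynomial tails with index α solving E[|1 - η a²|^α] = 1, so increasing η (or decreasing the batch size) pushes α below 2 and the iterate variance becomes infinite even though the data are Gaussian.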