BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Heavy Tail Phenomenon in Stochastic Gradient Descent - Mert Gurbu
 zbalaban (Rutgers\, The State University of New Jersey)
DTSTART:20240423T104500Z
DTEND:20240423T113000Z
UID:TALK214132@talks.cam.ac.uk
DESCRIPTION:Stochastic gradient descent (SGD) methods are workhorse metho
 ds for training machine learning models\, particularly in deep learning.
  In the first part of the talk\, after presenting numerical evidence tha
 t SGD iterates with constant step size can exhibit heavy-tailed behavior
  even when the data is light-tailed\, we delve into the theoretical orig
 ins of heavy tails in SGD iterations and their connection to various cap
 acity and complexity notions proposed for characterizing SGD's generaliz
 ation properties in deep learning. Key notions correlating with performa
 nce on unseen data include the 'flatness' of the local minimum found by
  SGD (related to the Hessian eigenvalues)\, the ratio of step size η to
  batch size b (which controls the magnitude of the stochastic gradient
  noise)\, and the 'tail-index' (which measures the heaviness of the tail
 s of the eigenspectra of the network weights). We argue that these seemi
 ngly disparate perspectives on generalization are deeply intertwined. De
 pending on the Hessian structure at the minimum and the choice of algori
 thm parameters\, SGD iterates converge to a heavy-tailed stationary dist
 ribution. We rigorously prove this claim for linear regression\, demonst
 rating that the iterates exhibit heavy tails and infinite variance even
  in simple quadratic optimization with Gaussian data. We further analyze
  the tail behavior with respect to the algorithm parameters\, the dimens
 ion\, and the curvature\, providing insights into the behavior of SGD in
  deep learning. Experiments on synthetic data and neural networks suppor
 t our theory. Additionally\, we discuss generalizations to decentralized
  stochastic gradient algorithms and to other popular step size schedules
 \, including cyclic step sizes. In the second part of the talk\, we intr
 oduce a new class of initialization schemes for fully-connected neural n
 etworks that enhance SGD training performance by inducing a specific hea
 vy-tailed behavior in stochastic gradients.\nBased on joint work with Yu
 anhan Hu\, Umut Simsekli\, and Lingjiong Zhu.
LOCATION:External
END:VEVENT
END:VCALENDAR
