BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//talks.cam.ac.uk//v3//EN
BEGIN:VTIMEZONE
TZID:Europe/London
BEGIN:DAYLIGHT
TZOFFSETFROM:+0000
TZOFFSETTO:+0100
TZNAME:BST
DTSTART:19700329T010000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0100
TZOFFSETTO:+0000
TZNAME:GMT
DTSTART:19701025T020000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
CATEGORIES:Machine Learning @ CUED
SUMMARY:Distributed stochastic optimization for deep learn
 ing - Sixin Zhang (NYU)
DTSTART;TZID=Europe/London:20160614T100000
DTEND;TZID=Europe/London:20160614T110000
UID:TALK66515AThttp://talks.cam.ac.uk
URL:http://talks.cam.ac.uk/talk/index/66515
DESCRIPTION:We study the problem of how to distribute the trai
 ning of large-scale deep learning models in a parallel comp
 uting environment. We propose a new distributed stochastic o
 ptimization method called Elastic Averaging SGD (EASGD). We a
 nalyze the convergence rate of the EASGD method in the sync
 hronous scenario and compare its stability condition with t
 hat of the existing ADMM method in the round-robin scheme. A
 n asynchronous and momentum variant of the EASGD method is a
 pplied to train deep convolutional neural networks for imag
 e classification on the CIFAR and ImageNet datasets. Our ap
 proach accelerates training and furthermore achieves better t
 est accuracy. It also requires much less communication than o
 ther common baseline approaches such as the DOWNPOUR method
 .\n\nWe then investigate the limits on the speedup of the i
 nitial and the asymptotic phases of the mini-batch SGD\, t
 he momentum SGD\, and the EASGD methods. We find that the s
 pread of the input data distribution has a large impact on t
 heir initial convergence rate and stability region. We also f
 ind a surprising connection between the momentum SGD method a
 nd the EASGD method with a negative moving average rate. A n
 on-convex case is also studied to understand when EASGD can g
 et trapped by a saddle point.\n\nFinally\, we scale up the E
 ASGD method by using a tree-structured network topology. We s
 how its advantages and challenges empirically. We also esta
 blish a connection between the EASGD and DOWNPOUR methods a
 nd the classical Jacobi and Gauss-Seidel methods\, thus uni
 fying a class of distributed stochastic optimization method
 s.\n\n(See https://arxiv.org/abs/1605.02216)\n\n
LOCATION:Engineering Department\, CBL Room BE-438
CONTACT:Louise Segar
END:VEVENT
END:VCALENDAR
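
Editor's note: the abstract above summarises the Elastic Averaging SGD
(EASGD) update analysed in the linked report (https://arxiv.org/abs/1605.02216).
As a rough illustration only, the following Python/NumPy sketch simulates the
synchronous variant: each worker takes a local SGD step plus an elastic pull
toward a shared center variable, and the center drifts toward the average of
the workers. The quadratic objective, step size eta, elastic penalty rho, and
worker count are assumptions made for this example, not details from the talk.

# Minimal sketch of synchronous Elastic Averaging SGD (EASGD).
# Illustrative only: the quadratic objective, step size eta, elastic
# penalty rho, and worker count are assumptions, not values from the talk.
import numpy as np

rng = np.random.default_rng(0)
p, dim = 4, 10            # number of workers, parameter dimension
eta, rho = 0.05, 0.5      # step size and elastic penalty
alpha = eta * rho         # per-worker moving rate

# Each worker sees a noisy quadratic f_i(x) = 0.5 * ||x - b_i||^2,
# standing in for a local data shard.
targets = rng.normal(size=(p, dim))

def grad(i, x):
    # Stochastic gradient of worker i's local objective.
    return (x - targets[i]) + 0.1 * rng.normal(size=dim)

workers = [np.zeros(dim) for _ in range(p)]   # local variables x_i
center = np.zeros(dim)                        # center variable x~

for step in range(200):
    diffs = [x - center for x in workers]
    # Local SGD step plus elastic attraction toward the center.
    workers = [x - eta * grad(i, x) - alpha * d
               for i, (x, d) in enumerate(zip(workers, diffs))]
    # Center moves toward the (old) average of the workers.
    center = center + alpha * sum(diffs)

print("center after training:", np.round(center, 3))
print("average of worker targets:", np.round(targets.mean(axis=0), 3))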
