
Tutorial: Gradient Methods in RL and Their Convergence



SCL - Bridging Stochastic Control And Reinforcement Learning

Our aim is to learn about policy gradient methods for solving reinforcement learning (RL) problems modelled in the Markov decision process (MDP) framework with general (possibly continuous, possibly infinite-dimensional) state and action spaces. We will focus mainly on the theoretical convergence of mirror descent with direct parametrisation and of natural-gradient descent with log-linear parametrisation. For our purposes, solving an RL problem means finding a (nearly) optimal policy when the transition dynamics and costs are unknown but we can repeatedly interact with some system (or environment simulator).

There are two main approaches to solving RL problems. Action-value methods learn the state-action value function (the Q-function) and then select actions based on it; their convergence is well understood (Watkins and Dayan [1992]; [Sutton and Barto, 2018, Ch. 6]) and will not be discussed here. Policy gradient methods directly update the policy by stepping in the direction of the gradient of the value function; they have a long history, for which the reader is referred to [Sutton and Barto, 2018, Ch. 13], but their convergence is understood only in specific settings, as we will see. The focus here is on generic (Polish) state and action spaces. We will also touch upon the popular PPO algorithm of Schulman et al. [2017] and explain the difficulties that arise when trying to prove its convergence.

Many related and interesting questions will not be touched upon: convergence of actor-critic methods, convergence in the presence of Monte Carlo errors, regret, off-policy gradient methods, and near-continuous-time RL. Large parts of what is presented here, in particular on mirror descent and natural-gradient descent, are from Kerimkulov et al. [2025], which was itself inspired by recent results of Mei et al. [2021], Lan [2023] and Cayci et al. [2021].
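
For orientation, a standard form of the KL-based mirror descent update under direct parametrisation, for a cost-minimising MDP with state-action value function Q^{\pi} and step sizes \tau_n (the notation here is illustrative rather than taken from the talk), is

\pi_{n+1}(\mathrm{d}a \,|\, s) = \frac{\exp\big(-\tau_n Q^{\pi_n}(s,a)\big)\, \pi_n(\mathrm{d}a \,|\, s)}{\int_A \exp\big(-\tau_n Q^{\pi_n}(s,a')\big)\, \pi_n(\mathrm{d}a' \,|\, s)},

i.e. each step exponentially reweights the current policy by the current Q-function and renormalises over the action space A.

For readers who have not seen PPO, the following minimal sketch (plain NumPy; the function and variable names are ours, not from the talk or from Schulman et al. [2017]) shows the clipped surrogate objective that PPO maximises:

import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    # ratio:     pi_theta(a|s) / pi_theta_old(a|s) for each sampled (s, a)
    # advantage: advantage estimate A_t for the same sample
    # eps:       clipping parameter from Schulman et al. [2017]
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # The elementwise minimum removes any incentive to push the
    # probability ratio far outside the interval [1 - eps, 1 + eps].
    return np.mean(np.minimum(unclipped, clipped))

# Illustrative call with made-up numbers:
print(ppo_clip_objective(np.array([0.9, 1.1, 1.5]), np.array([1.0, -0.5, 2.0])))

The clipping discourages updates that move the new policy far from the old one, which is the heuristic behind PPO's practical stability.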

This talk is part of the Isaac Newton Institute Seminar Series.


 
