Policy Gradients
As I mentioned before, we want to use policy gradient methods because:
Model-free methods do not make any assumption about the dynamics $P(s' | s, a)$, other than the ones made by the MDP framework. The most popular version of model-free methods are based on the score-function or likelihood-ratio trick (see this blog entry by Shakir Mohammed for more). Taking the gradient of $V(\pi_\theta)$ (following the notation from this paper by Jie Tang) \[ \begin{align*} \nabla_\theta V(\pi_\theta) &= \nabla_\theta \mathbb{E}_{P(\tau|\theta)}\left\lbrace R(\tau) \right\rbrace \\ &= \int \nabla_\theta P(\tau|\theta) R(\tau) \mathrm{d}\tau \\ &= \int \frac{P(\tau|\theta)}{P(\tau|\theta)} \nabla_\theta P(\tau|\theta) R(\tau) \mathrm{d}\tau \\ &= \int P(\tau|\theta) R(\tau) \nabla_\theta \log P(\tau|\theta) \mathrm{d}\tau \\ &= \mathbb{E}_{P(\tau|\theta)}\left\lbrace R(\tau) \nabla_\theta \log P(\tau|\theta) \right\rbrace \end{align*}\] Model-free methods use the Markov assumption of the MDP (that the dynamics only require the previous state and action to predict the next state), to transform $\nabla_\theta \log P(\tau|\theta)$ into $\sum_{t=0}^H\nabla_\theta \log \pi_{\theta}\left(a_t\middle| s_t\right)$. Thus, evaluating the gradient doesn't require knowing the transition dynamics (only being able to sample from it).
Model-based methods use an estimator for $P(s'|s,a)$ to evaluate the policy gradients. One approach would be to use the estimator as part of a simulator for generating samples of $\tau$, and evaluating the expected gradients using the likelihood ratio trick (I'll write a derivation for the gradients in a later post). An example is to build a neural network that takes as input a state and an action, and predicts the next state; such model can be trained by regression. This is the approach used in the model-based TRPO paper. The other common approach is to use the reparametrization trick. The reparemetrization trick consists defining random variables (for example $\tau$), as deterministic transformations of random variables that are easier to sample; e.g. $\epsilon \sim P(\epsilon)$, where $P(\epsilon)$ is a standard normal or uniform distribution (see this blog entry for a more in depth explanation).
Let's go back to the gradient of our objective and follow a different path, \[ \begin{align*} \nabla_\theta V(\pi_\theta) &= \nabla_\theta \mathbb{E}_{P(\tau|\theta)}\left\lbrace R(\tau) \right\rbrace \\ &= \int \nabla_\theta P(\tau|\theta) R(\tau) \mathrm{d}\tau \\ &= \int \nabla_\theta P(\epsilon) R(g(\theta, \epsilon)) \mathrm{d}\epsilon \\ &= \int P(\epsilon) \frac{\partial R}{\partial \tau} \nabla_{\theta} g(\theta, \epsilon) \mathrm{d}\epsilon \\ &= \mathbb{E}_{P(\epsilon)}\left\lbrace \frac{\partial R}{\partial \tau} \nabla_{\theta} g(\theta, \epsilon) \right\rbrace \end{align*}\]
If the transformation $g$ is a deterministic function of $\theta$, then we can compute gradients analytically This is the approach taken by Heess and others in the Stochastic Value Gradients paper, and (implicitly) in Deep-PILCO.
The original PILCO algorithm is related to the reparametrization trick, but it replaces the idea of sampling the noise variable $\epsilon$, with analytical integrations. The trajectory distribution $P(\tau)$ is approximated as a product of Gaussian distributions $P(s_0|\mu_0, \Sigma_0)P(s_1|\mu_1, \Sigma_1)...P(s_H|\mu_H, \Sigma_H)$. The model is assumed to be of the form
\[s_t = f(s_{t-1})\], and used to calculate $\mu_t, \Sigma_t$ from $\mu_{t-1}, \Sigma_{t-1}$ via moment-matching. Once the parameters of the distributions for $s_0$ to $s_H$ have been obtained, we can try computing the policy gradients as follows
\[ \begin{align*} \nabla_\theta V(\pi_\theta) &= \nabla_\theta \mathbb{E}_{P(\tau|\theta)}\left\lbrace R(\tau) \right\rbrace\\ &= \nabla_\theta \mathbb{E}_{P(\tau|\theta)}\left\lbrace \sum_{t=0}^{H-1} r_t \middle| \tau, \theta \right\rbrace\\ &\approx \nabla_\theta \sum_{t=0}^{H-1} \mathbb{E}_{P(s_t)}\left\lbrace r_t \middle| s_t, \theta \right\rbrace\\ &= \sum_{t=0}^{H-1} \nabla_\theta \mathbb{E}_{P(s_t)}\left\lbrace r(s_t) \middle| \mu_t, \Sigma_t, \theta \right\rbrace \end{align*}\]
PILCO uses Gaussian Process regression for estimating the dynamics model $f$, and assumes the cost function is available to the agent in closed-form. The gradients of the objective as \[\nabla_\theta \mathbb{E}_{P(s_t)}\left\lbrace r(s_t) \middle| \mu_t, \Sigma_t, \theta \right\rbrace \approx \frac{\partial r_t}{\partial \mu_t}\nabla_{\theta}\mu_t + \frac{\partial r_t}{\partial \Sigma_t}\nabla_{\theta}\Sigma_t\]
where $\nabla_{\theta}\mu_t$ and $\nabla_{\theta}\Sigma_t$ require backpropagation through time, as they are deterministic functions of the previous state distribution.
Deep-PILCO is a bit in between the two: it still requires approximating $P{\tau}$ as a product of Gaussians, but uses the reparametrization trick for forward simulation. Since it uses the reparametrization trick, we can apply the PEGASUS algorithm, which requires forward simulation models for which we can fix the random numbers. In recent work, we used this observation to reduce the variance of gradients in Deep-PILCO, improving its convergence and its computational complexity. We demonstrated its efficacy when training neural network controllers, in particular, for the task of learning of swimming gaits for an underwater robot.
Now that we've introduced some of the motivation for our work, we can start with some coding examples in the next post. If light-speed introduction to policy gradients seemed a bit too fast for you, you can try taking a look at David Silver's course at UCL, the UC Berkeley course on Deep-RL, or last year's RL summer school material. Hopefully, as we write some code, things will become clearer.
- We're interested in controlling robot systems that live in a continuous world
- Model-Based policy gradient methods have the potential of data-efficiency
Model-free methods do not make any assumption about the dynamics $P(s' | s, a)$, other than the ones made by the MDP framework. The most popular version of model-free methods are based on the score-function or likelihood-ratio trick (see this blog entry by Shakir Mohammed for more). Taking the gradient of $V(\pi_\theta)$ (following the notation from this paper by Jie Tang) \[ \begin{align*} \nabla_\theta V(\pi_\theta) &= \nabla_\theta \mathbb{E}_{P(\tau|\theta)}\left\lbrace R(\tau) \right\rbrace \\ &= \int \nabla_\theta P(\tau|\theta) R(\tau) \mathrm{d}\tau \\ &= \int \frac{P(\tau|\theta)}{P(\tau|\theta)} \nabla_\theta P(\tau|\theta) R(\tau) \mathrm{d}\tau \\ &= \int P(\tau|\theta) R(\tau) \nabla_\theta \log P(\tau|\theta) \mathrm{d}\tau \\ &= \mathbb{E}_{P(\tau|\theta)}\left\lbrace R(\tau) \nabla_\theta \log P(\tau|\theta) \right\rbrace \end{align*}\] Model-free methods use the Markov assumption of the MDP (that the dynamics only require the previous state and action to predict the next state), to transform $\nabla_\theta \log P(\tau|\theta)$ into $\sum_{t=0}^H\nabla_\theta \log \pi_{\theta}\left(a_t\middle| s_t\right)$. Thus, evaluating the gradient doesn't require knowing the transition dynamics (only being able to sample from it).
Model-based methods use an estimator for $P(s'|s,a)$ to evaluate the policy gradients. One approach would be to use the estimator as part of a simulator for generating samples of $\tau$, and evaluating the expected gradients using the likelihood ratio trick (I'll write a derivation for the gradients in a later post). An example is to build a neural network that takes as input a state and an action, and predicts the next state; such model can be trained by regression. This is the approach used in the model-based TRPO paper. The other common approach is to use the reparametrization trick. The reparemetrization trick consists defining random variables (for example $\tau$), as deterministic transformations of random variables that are easier to sample; e.g. $\epsilon \sim P(\epsilon)$, where $P(\epsilon)$ is a standard normal or uniform distribution (see this blog entry for a more in depth explanation).
Let's go back to the gradient of our objective and follow a different path, \[ \begin{align*} \nabla_\theta V(\pi_\theta) &= \nabla_\theta \mathbb{E}_{P(\tau|\theta)}\left\lbrace R(\tau) \right\rbrace \\ &= \int \nabla_\theta P(\tau|\theta) R(\tau) \mathrm{d}\tau \\ &= \int \nabla_\theta P(\epsilon) R(g(\theta, \epsilon)) \mathrm{d}\epsilon \\ &= \int P(\epsilon) \frac{\partial R}{\partial \tau} \nabla_{\theta} g(\theta, \epsilon) \mathrm{d}\epsilon \\ &= \mathbb{E}_{P(\epsilon)}\left\lbrace \frac{\partial R}{\partial \tau} \nabla_{\theta} g(\theta, \epsilon) \right\rbrace \end{align*}\]
If the transformation $g$ is a deterministic function of $\theta$, then we can compute gradients analytically This is the approach taken by Heess and others in the Stochastic Value Gradients paper, and (implicitly) in Deep-PILCO.
The original PILCO algorithm is related to the reparametrization trick, but it replaces the idea of sampling the noise variable $\epsilon$, with analytical integrations. The trajectory distribution $P(\tau)$ is approximated as a product of Gaussian distributions $P(s_0|\mu_0, \Sigma_0)P(s_1|\mu_1, \Sigma_1)...P(s_H|\mu_H, \Sigma_H)$. The model is assumed to be of the form
\[s_t = f(s_{t-1})\], and used to calculate $\mu_t, \Sigma_t$ from $\mu_{t-1}, \Sigma_{t-1}$ via moment-matching. Once the parameters of the distributions for $s_0$ to $s_H$ have been obtained, we can try computing the policy gradients as follows
\[ \begin{align*} \nabla_\theta V(\pi_\theta) &= \nabla_\theta \mathbb{E}_{P(\tau|\theta)}\left\lbrace R(\tau) \right\rbrace\\ &= \nabla_\theta \mathbb{E}_{P(\tau|\theta)}\left\lbrace \sum_{t=0}^{H-1} r_t \middle| \tau, \theta \right\rbrace\\ &\approx \nabla_\theta \sum_{t=0}^{H-1} \mathbb{E}_{P(s_t)}\left\lbrace r_t \middle| s_t, \theta \right\rbrace\\ &= \sum_{t=0}^{H-1} \nabla_\theta \mathbb{E}_{P(s_t)}\left\lbrace r(s_t) \middle| \mu_t, \Sigma_t, \theta \right\rbrace \end{align*}\]
PILCO uses Gaussian Process regression for estimating the dynamics model $f$, and assumes the cost function is available to the agent in closed-form. The gradients of the objective as \[\nabla_\theta \mathbb{E}_{P(s_t)}\left\lbrace r(s_t) \middle| \mu_t, \Sigma_t, \theta \right\rbrace \approx \frac{\partial r_t}{\partial \mu_t}\nabla_{\theta}\mu_t + \frac{\partial r_t}{\partial \Sigma_t}\nabla_{\theta}\Sigma_t\]
where $\nabla_{\theta}\mu_t$ and $\nabla_{\theta}\Sigma_t$ require backpropagation through time, as they are deterministic functions of the previous state distribution.
Deep-PILCO is a bit in between the two: it still requires approximating $P{\tau}$ as a product of Gaussians, but uses the reparametrization trick for forward simulation. Since it uses the reparametrization trick, we can apply the PEGASUS algorithm, which requires forward simulation models for which we can fix the random numbers. In recent work, we used this observation to reduce the variance of gradients in Deep-PILCO, improving its convergence and its computational complexity. We demonstrated its efficacy when training neural network controllers, in particular, for the task of learning of swimming gaits for an underwater robot.
Now that we've introduced some of the motivation for our work, we can start with some coding examples in the next post. If light-speed introduction to policy gradients seemed a bit too fast for you, you can try taking a look at David Silver's course at UCL, the UC Berkeley course on Deep-RL, or last year's RL summer school material. Hopefully, as we write some code, things will become clearer.
Comments
Post a Comment