Some formalities
The usual formalization used in RL is to treat tasks as Markov Decision Processes (MDPs). The main ingredients of an MDP are the following (a minimal code sketch after the list makes them concrete):
- A set of states $s \in \mathcal{S}$ that describe the possible configurations of the environment
- A set of actions $a \in \mathcal{A}$ that can be applied at each state
- Transition dynamics $P(s' | s, a)$ that establish how actions change the state of the environment
- An instantaneous reward distribution $P(r|s)$ used to evaluate states; e.g. desired states have high rewards.
- An initial state distribution $P(s_0)$, which tells us in which states the agent is likely to start.
- A time horizon $H$ over which the behaviour of the agent is evaluated
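To make these ingredients concrete, here is a minimal sketch of a toy MDP in Python; `GridMDP` and its methods are names I made up for illustration, not part of any RL library.

```python
import numpy as np

class GridMDP:
    """Toy 1-D grid MDP with the ingredients listed above (illustrative only)."""

    def __init__(self, n_states=5, horizon=10, seed=0):
        self.states = np.arange(n_states)      # S: possible configurations of the environment
        self.actions = (-1, +1)                # A: move left or move right
        self.horizon = horizon                 # H: number of evaluated time steps
        self.rng = np.random.default_rng(seed)

    def initial_state(self):
        """Sample s_0 ~ P(s_0): start anywhere on the grid with equal probability."""
        return int(self.rng.choice(self.states))

    def transition(self, s, a):
        """Sample s' ~ P(s' | s, a): the action occasionally slips and has no effect."""
        moved = a if self.rng.random() > 0.1 else 0
        return int(np.clip(s + moved, 0, len(self.states) - 1))

    def reward(self, s):
        """Sample r ~ P(r | s): the right-most state is the desired one."""
        return float(s == len(self.states) - 1) + 0.01 * self.rng.normal()
```

An agent interacts with such an object by repeatedly picking actions and observing the sampled next states and rewards.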
The goal of the agent is to find a policy $\pi$ that maximizes its value $V(\pi)$, the expected weighted sum of rewards collected when starting from $s_0$ and following $\pi$ over the horizon, \[ \begin{align*} \label{eq:opt} \underset{\pi}{\operatorname{argmax}}\ V(\pi) = \underset{\pi}{\operatorname{argmax}}\ \mathbb{E}\left\lbrace \sum_{t=0}^{H-1} \gamma_t r_t \middle| \pi, s_0 \right\rbrace \end{align*} \] where $\gamma_t$ is either $\frac{1}{H}$ (average reward) when $H$ is finite, or $\gamma^t$ with $\gamma < 1$ (discounted reward) when $H \to \infty$. One way to solve this problem is to build an estimator of the value function, $\hat{V}_\pi(s_0)$, and to choose the actions that maximize the future values at each time step. These are called value-based methods, because the policy is implicitly defined by the value function.
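As a quick illustration of the two weighting schemes, the hypothetical helper below computes $\sum_{t=0}^{H-1} \gamma_t r_t$ for a short list of sampled rewards:

```python
def weighted_return(rewards, gamma=None):
    """Compute sum_t gamma_t * r_t for a finite rollout of length H.

    gamma=None uses the average-reward weights gamma_t = 1/H;
    otherwise the discounted weights gamma_t = gamma**t with gamma < 1.
    """
    H = len(rewards)
    weights = [1.0 / H] * H if gamma is None else [gamma**t for t in range(H)]
    return sum(w * r for w, r in zip(weights, rewards))

# A rollout that only reaches the rewarding state at the last step:
print(weighted_return([0.0, 0.0, 1.0]))             # average reward: 0.333...
print(weighted_return([0.0, 0.0, 1.0], gamma=0.9))  # discounted reward: 0.81
```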
Another set of methods uses parametrized policies $\pi_\theta$ with parameters $\theta$; e.g. a neural network with $\theta$ denoting its set of weights. These methods attempt to directly optimize the objective ($\ref{eq:opt}$) by searching over the space of parameters $\theta$; they are called policy search methods. When the search is done via gradient ascent, which requires differentiating $V(\pi_\theta)$ with respect to $\theta$, we call them policy gradient methods. If we define $\tau = s_0, a_0, s_1, a_1, \ldots, s_{H-1}, a_{H-1}$ as the trajectory obtained by applying the policy, we can rewrite the objective as,
\[ \begin{align*} \label{eq:opt2} \underset{\theta}{\operatorname{argmax}}\ V(\theta) &= \underset{\theta}{\operatorname{argmax}}\ \mathbb{E}\left\lbrace \sum_{t=0}^{H-1} \gamma_t r_t \middle| \pi_{\theta}, s_0 \right\rbrace \\ &= \underset{\theta}{\operatorname{argmax}}\ \mathbb{E}_{P(\tau|\theta)}\left\lbrace R(\tau) \right\rbrace \\ &= \underset{\theta}{\operatorname{argmax}} \int P(\tau|\theta) R(\tau) \mathrm{d}\tau \end{align*} \] where $R(\tau) = \sum_{t=0}^{H-1} \gamma_t r_t$ is the total weighted reward collected along the trajectory $\tau$.
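The integral over trajectories is intractable in general, but $V(\theta)$ can be approximated by simply running the policy and averaging the returns of the sampled trajectories. The sketch below reuses the toy `GridMDP` and `weighted_return` helpers from above together with an arbitrary `policy` callable; all names are illustrative, not from a real library.

```python
def rollout(mdp, policy, gamma=0.9):
    """Sample one trajectory tau ~ P(tau | theta) and return its total reward R(tau)."""
    s = mdp.initial_state()
    rewards = []
    for _ in range(mdp.horizon):
        rewards.append(mdp.reward(s))   # r_t ~ P(r | s_t)
        a = policy(s)                   # a_t ~ pi_theta(a | s_t)
        s = mdp.transition(s, a)        # s_{t+1} ~ P(s' | s_t, a_t)
    return weighted_return(rewards, gamma=gamma)

def estimate_value(mdp, policy, n_trajectories=1000):
    """Monte Carlo estimate of V(theta) = E_{P(tau|theta)}[R(tau)]."""
    return sum(rollout(mdp, policy) for _ in range(n_trajectories)) / n_trajectories

# e.g. the estimated value of a deterministic policy that always moves right
print(estimate_value(GridMDP(), policy=lambda s: +1))
```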
Search is done by taking small steps in the direction pointed by the gradient, defined as,
\[ \nabla_\theta V(\pi_\theta) = \nabla_\theta \mathbb{E}_{P(\tau|\theta)}\left\lbrace R(\tau) \right\rbrace \]
We are interested in policy gradient methods because they are more readily applicable to control problems in robotics. Robots live in a continuous world and the actions they can take are, in practice, continuous signals, e.g. the PWM duty cycle of the voltage signal used to drive their motors. In the next blog post, I'll describe in a bit more detail the family of policy gradient methods, which will bring us closer to the topic of this blog: model-based RL.
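As a closing illustration of what "taking small steps in the direction pointed by the gradient" looks like in code, here is a bare gradient-ascent loop. The estimator `grad_V` is deliberately left abstract, since obtaining it from sampled trajectories is exactly what the next post covers; all names here are made up for this sketch.

```python
import numpy as np

def gradient_ascent(theta0, grad_V, learning_rate=0.1, n_steps=100):
    """Generic gradient ascent on V: theta <- theta + alpha * (estimate of grad V(theta)).

    grad_V stands in for any (possibly noisy) estimator of nabla_theta V(pi_theta).
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta + learning_rate * grad_V(theta)   # small step uphill on V
    return theta

# e.g. with a made-up quadratic surrogate whose maximum sits at theta = (1, 2)
print(gradient_ascent([0.0, 0.0], grad_V=lambda th: -(th - np.array([1.0, 2.0]))))
```

How that gradient estimate is built from trajectories sampled with $\pi_\theta$ is what distinguishes the different policy gradient methods.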