Some formalities

The usual formalization used in RL is to treat tasks as Markov Decision Processes (MDPs). The main ingredients of an MDP are (a small code sketch follows the list):
  1. A set of states $s \in \mathcal{S}$ that describe the possible configurations of the environment.
  2. A set of actions $a \in \mathcal{A}$ that can be applied at each state.
  3. Transition dynamics $P(s' | s, a)$ that establish how actions change the state of the environment.
  4. An instantaneous reward distribution $P(r|s)$ for evaluating states; e.g. desired states have high rewards.
  5. An initial state distribution $P(s_0)$, which tells us in which states the agent is likely to start.
  6. A time horizon $H$ over which the behaviour of the agent is evaluated.
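To make these ingredients concrete, here is a minimal sketch of how such an MDP could be written down in Python. The 2-state, 2-action toy problem, its names and its numbers are my own illustration, not something used later in this series:

```python
import numpy as np

# A toy 2-state, 2-action MDP; all numbers are made up for illustration.
states = [0, 1]    # S: the possible configurations of the environment
actions = [0, 1]   # A: the actions available in each state

# Transition dynamics P(s'|s, a): P[s][a] is a list of (next_state, probability) pairs.
P = {
    0: {0: [(0, 0.9), (1, 0.1)], 1: [(0, 0.2), (1, 0.8)]},
    1: {0: [(0, 0.7), (1, 0.3)], 1: [(1, 1.0)]},
}

def reward(s):
    """Instantaneous reward of a state (a deterministic special case of P(r|s));
    state 1 is the 'desired' one."""
    return 1.0 if s == 1 else 0.0

def initial_state(rng):
    """Sample s_0 from the initial state distribution P(s_0)."""
    return rng.choice(states, p=[0.5, 0.5])

H = 20  # time horizon over which the agent is evaluated
```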
Time is assumed to evolve in discrete steps. The objective of the agent is to maximize the reward accumulated over the horizon. Solutions to the problem determine which actions should be applied at any given state. This mapping from states to actions is called a policy, usually written as $\pi(a|s)$; i.e. the probability of taking action $a$ at state $s$. The objective can be stated as finding the policy that maximizes the accumulated rewards. We say that a policy $\pi_1$ is better than another $\pi_2$ if the agent obtains higher accumulated rewards by following the actions dictated by $\pi_1$. The accumulated rewards, as a function of the policy and the starting state distribution, are called the value of a policy, $V(\pi)$. We can then write the RL problem as \begin{equation} \label{eq:opt} \underset{\pi}{\operatorname{argmax}} V(\pi) = \mathbb{E}\left\lbrace \sum_{t=0}^{H-1} \gamma_t r_t \middle| \pi, s_0 \right\rbrace \end{equation}
where $\gamma_t$ is either $\frac{1}{H}$ (average reward) when $H$ is finite, or $\gamma^t$ with $\gamma < 1$ (discounted reward) when $H \to \infty$. One way to solve this problem is to build an estimator of the value function, $\hat{V}_\pi(s_0)$, and to choose the actions that maximize the future values at each time step. These are called value-based methods, because the policy is implicitly defined by the value function.
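To illustrate the objective above, here is a rough Monte Carlo estimate of $V(\pi)$ for the toy MDP sketched earlier, using the average-reward weighting $\gamma_t = \frac{1}{H}$; the rollout scheme and the function names are my own and are only meant to make the expectation concrete:

```python
def step(s, a, rng):
    """Sample the next state s' ~ P(s'|s, a)."""
    next_states, probs = zip(*P[s][a])
    return rng.choice(next_states, p=probs)

def estimate_value(policy, n_rollouts=1000, seed=0):
    """Monte Carlo estimate of V(pi): average over rollouts of sum_t gamma_t * r_t."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(n_rollouts):
        s = initial_state(rng)
        total = 0.0
        for t in range(H):
            a = policy(s, rng)               # a ~ pi(a|s)
            total += (1.0 / H) * reward(s)   # average-reward weighting for finite H
            s = step(s, a, rng)
        returns.append(total)
    return np.mean(returns)

# Example: the value of the policy that always applies action 1.
print(estimate_value(lambda s, rng: 1))
```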

Another set of methods uses parametrized policies: $\pi_\theta$ with parameters $\theta$; e.g. a neural network with $\theta$ denoting its set of weights. These methods attempt to directly optimize the objective ($\ref{eq:opt}$) by searching over the space of parameters $\theta$; they are called policy search methods. When the search is done via gradient ascent, which requires differentiating $V(\pi_\theta)$ with respect to $\theta$, we call them policy gradient methods. If we define $\tau = s_0, a_0, s_1, a_1, \ldots, s_{H-1}, a_{H-1}$ as the trajectory obtained by applying the policy, we can rewrite the objective as,
\begin{align*} \underset{\theta}{\operatorname{argmax}} V(\theta) &= \mathbb{E}\left\lbrace \sum_{t=0}^{H-1} \gamma_t r_t \middle| \pi_{\theta}, s_0 \right\rbrace \\ &= \mathbb{E}_{P(\tau|\theta)}\left\lbrace R(\tau) \right\rbrace \\ &= \int P(\tau|\theta) R(\tau) \mathrm{d}\tau \end{align*} Search is done by taking small steps in the direction pointed by the gradient, defined as,
\[ \nabla_\theta V(\pi_\theta) = \nabla_\theta \mathbb{E}_{P(\tau|\theta)}\left\lbrace R(\tau) \right\rbrace \]
We are interested in policy gradient methods as they are more readily applicable to control problems in robotics. Robots live in a continuous world and the actions they can take are, in practice, continuous signals, e.g. the PWM duty cycle of the voltage signal used to drive their motors. In the next blog post, I'll describe in a bit more detail the family of policy gradient methods, which will bring us closer to the topic of this blog: model-based RL.
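To give a feel for policy search before then, here is a sketch of gradient ascent on $V(\theta)$ for the toy MDP above. It uses a softmax policy and a crude finite-difference approximation of the gradient as a stand-in for the analytic policy gradient estimators of the next post; the parametrization, step size and number of iterations are all illustrative assumptions of mine, not the method itself:

```python
def softmax_policy(theta):
    """A parametrized policy pi_theta(a|s): one softmax over the actions for each state."""
    def policy(s, rng):
        logits = theta[s]
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return rng.choice(actions, p=probs)
    return policy

def value_gradient(theta, eps=0.05):
    """Crude finite-difference estimate of grad_theta V(theta), one parameter at a time."""
    grad = np.zeros_like(theta)
    base = estimate_value(softmax_policy(theta))
    for idx in np.ndindex(theta.shape):
        perturbed = theta.copy()
        perturbed[idx] += eps
        grad[idx] = (estimate_value(softmax_policy(perturbed)) - base) / eps
    return grad

theta = np.zeros((len(states), len(actions)))  # parameters of pi_theta
for _ in range(10):
    theta += 0.5 * value_gradient(theta)       # a small step along the gradient direction
print(estimate_value(softmax_policy(theta)))
```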
