More motivation on why we want models

We want to build models for our model-based RL algorithms because they might help us with the data-efficiency problem. Let's take another look at the objective we want to optimize: \[ \begin{equation} \label{eq:objective} \underset{\theta}{\operatorname{argmax}}\; V(\theta) = \mathbb{E}_{P(\tau|\theta)}\left\lbrace R(\tau) \right\rbrace \end{equation} \] The distribution over trajectories $P(\tau|\theta)$ describes which sequences of states and actions $s_0,a_0,...,s_{H-1},a_{H-1}$ are likely to be observed when applying a policy with parameters $\theta$. Since we're working under the MDP framework, sampling trajectories $\tau$ is usually done with the following steps:

1. Sample an initial state $s_0 \sim P(s_0)$
2. Repeat for $H$ steps:
   1. Evaluate the policy $a_t \sim \pi_{\theta}(a_t|s_t)$
   2. Apply the action to obtain the next state $s_{t+1} \sim P(s_{t+1}|s_t, a_t)$
   3. Obtain a reward $r_t$

For our current purpose, it doesn't matter if the reward is a deterministic function of the
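The sampling steps above can be sketched in code. This is a minimal illustration, not real robot dynamics: `sample_initial_state`, `policy`, `step_dynamics`, and `reward` are hypothetical stand-ins for $P(s_0)$, $\pi_\theta$, $P(s_{t+1}|s_t,a_t)$, and the reward function.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_initial_state():
    # Hypothetical initial state distribution P(s_0): a standard normal
    return rng.normal(size=2)

def policy(theta, s):
    # Hypothetical stochastic policy pi_theta(a|s): Gaussian around a linear map
    return theta @ s + 0.1 * rng.normal()

def step_dynamics(s, a):
    # Hypothetical transition dynamics P(s'|s,a): slightly damped linear system
    return 0.9 * s + np.array([a, 0.0]) + 0.01 * rng.normal(size=2)

def reward(s):
    # Hypothetical reward: high (close to zero) near the origin
    return -np.linalg.norm(s)

def sample_trajectory(theta, H):
    """Roll out the policy for H steps, following the sampling steps above."""
    s = sample_initial_state()
    states, actions, rewards = [s], [], []
    for _ in range(H):
        a = policy(theta, s)          # a_t ~ pi_theta(a_t|s_t)
        s = step_dynamics(s, a)       # s_{t+1} ~ P(s_{t+1}|s_t, a_t)
        states.append(s)
        actions.append(a)
        rewards.append(reward(s))     # obtain reward r_t
    return states, actions, rewards

states, actions, rewards = sample_trajectory(np.zeros(2), H=10)
```

Averaging $R(\tau) = \sum_t r_t$ over many such rollouts gives a Monte Carlo estimate of the objective $V(\theta)$.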

Policy Gradients

As I mentioned before, we want to use policy gradient methods because:

- We're interested in controlling robot systems that live in a continuous world
- Model-based policy gradient methods have the potential for data-efficiency

Within policy gradient methods there are two main categories: model-free and model-based. The main difference between the two is the way the expectation over trajectories is evaluated. Model-free methods do not make any assumptions about the dynamics $P(s' | s, a)$, other than the ones made by the MDP framework. The most popular model-free methods are based on the score-function or likelihood-ratio trick (see this blog entry by Shakir Mohamed for more). Taking the gradient of $V(\pi_\theta)$ (following the notation from this paper by Jie Tang): \[ \begin{align*} \nabla_\theta V(\pi_\theta) &= \nabla_\theta \mathbb{E}_{P(\tau|\theta)}\left\lbrace R(\tau) \right\rbrace \\ &= \int \nabla_\theta P(\tau|\theta) R(\tau) \mathrm{d}\tau \end{align*} \]
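A score-function (likelihood-ratio) estimator can be sketched as follows. Using $\nabla_\theta \log P(\tau|\theta) = \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)$, the gradient becomes $\mathbb{E}\lbrace R(\tau)\, \nabla_\theta \log P(\tau|\theta)\rbrace$, which we can estimate from samples. The policy, dynamics, and reward below are hypothetical one-dimensional toys chosen only to keep the gradient of the log-density analytic.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA2 = 0.25  # variance of the hypothetical Gaussian policy

def sample_action(theta, s):
    # Hypothetical Gaussian policy: a ~ N(theta * s, SIGMA2), scalar state/action
    return theta * s + np.sqrt(SIGMA2) * rng.normal()

def grad_log_pi(theta, s, a):
    # d/dtheta log N(a; theta*s, SIGMA2) = (a - theta*s) * s / SIGMA2
    return (a - theta * s) * s / SIGMA2

def rollout(theta, H=20):
    # One trajectory: return R(tau) and sum_t grad log pi(a_t|s_t)
    s, ret, glp = 1.0, 0.0, 0.0
    for _ in range(H):
        a = sample_action(theta, s)
        glp += grad_log_pi(theta, s, a)
        s = 0.9 * s + 0.1 * a           # hypothetical dynamics
        ret += -s**2                    # reward: keep the state near zero
    return ret, glp

def score_function_gradient(theta, n_samples=1000):
    """Monte Carlo estimate of grad V = E[ R(tau) * grad log P(tau|theta) ]."""
    grads = [r * g for r, g in (rollout(theta) for _ in range(n_samples))]
    return np.mean(grads)

g = score_function_gradient(0.0)
```

Note that only samples of rewards and log-policy gradients are needed; the dynamics never have to be known to the learner, which is what makes the estimator model-free.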

Some formalities

The usual formalization used in RL is to treat tasks as Markov Decision Processes (MDPs). The main ingredients of an MDP are:

- A set of states $s \in \mathcal{S}$ that describe the possible configurations of the environment
- A set of actions $a \in \mathcal{A}$ that can be applied at each state
- Transition dynamics $P(s' | s, a)$ that establish how actions change the state of the environment
- An instantaneous reward $P(r|s)$ for evaluating states; e.g. desired states have high rewards
- An initial state distribution $P(s_0)$, which tells us in which states the agent is likely to start
- A time horizon $H$ during which the behaviour of the agent is going to be evaluated

Time is assumed to evolve in discrete steps. The objective of the agent is to maximize the reward accumulated over the horizon. Solutions to the problem determine which actions should be applied at any given state. This mapping from states to actions is called a policy, usually written as $\pi(a|s)$.
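These ingredients can be made concrete with a small tabular example. The 3-state, 2-action MDP below is entirely hypothetical; each numbered component mirrors one item from the list above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular MDP: |S| = 3 states, |A| = 2 actions, horizon H = 5.
n_states, n_actions, H = 3, 2, 5

# Transition dynamics P(s'|s,a): shape (S, A, S'), each row sums to one.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)

# Instantaneous reward, depending on the state only (deterministic here):
# state 2 is the "desired" state with high reward.
R = np.array([0.0, 0.0, 1.0])

# Initial state distribution P(s_0): the agent always starts in state 0.
P0 = np.array([1.0, 0.0, 0.0])

def policy(s):
    # Hypothetical uniform-random policy pi(a|s)
    return rng.integers(n_actions)

def episode_return():
    """Accumulate reward over the horizon H, the quantity the agent maximizes."""
    s = rng.choice(n_states, p=P0)
    total = 0.0
    for _ in range(H):
        a = policy(s)
        s = rng.choice(n_states, p=P[s, a])
        total += R[s]
    return total
```

A solution to this MDP would replace the uniform-random `policy` with one whose expected `episode_return` is as large as possible.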

Introduction to Model-Based RL for robotics

It's been a while since I've wanted to start a blog about the stuff I've been working on for the past few years. Today, I encountered the opportunity to do so: I want to rewrite and simplify some of my research code base to enable some of the experiments I'd like to do. The work I've been doing is on the application of Reinforcement Learning to robotics. Reinforcement Learning (RL) has been shown to produce computer programs that beat experts in video games and board games, control complex robotic systems, and produce believable physics-based simulations of articulated characters. RL can be seen as a meta-programming paradigm where computer software changes itself as it interacts with the world, via trial and error. Under this paradigm, a computer programmer writes code that encodes the way the software should change according to its experience; i.e. its learning rules and the objective it is supposed to achieve. The software agent is allowed to measure the state of