Introduction to Model-Based RL for robotics

It's been a while since I've wanted to start a blog about the stuff I've been working on for the past few years. Today I found the opportunity to do so: I want to rewrite and simplify some of my research code base to enable some of the experiments I'd like to do.

The work I've been doing is on the application of Reinforcement Learning to robotics. Reinforcement Learning (RL) has been shown to produce computer programs that beat experts in video games and board games, control complex robotic systems, and produce believable physics-based simulations of articulated characters. RL can be seen as a meta-programming paradigm where computer software changes itself as it interacts with the world, via trial and error. Under this paradigm, a computer programmer writes code that encodes the way the software should change according to its experience; i.e. its learning rules and the objective it is supposed to achieve. The software agent is allowed to measure the state of the environment, to act in order to change that state, and it receives feedback on its performance in the form of a numeric reward. Desirable states yield high reward, and the goal of learning is to maximize said reward. As the agent collects experience, RL algorithms determine how to change its behaviour in order to (provably) maximize the reward it will acquire during a given period.
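To make that interaction protocol concrete, here is a minimal sketch of the agent-environment loop. The `env` and `agent` objects and their methods (`reset`, `step`, `act`, `observe`) are hypothetical interfaces I'm using for illustration, not the API of any particular library, although they resemble common RL toolkits.

```python
def run_episode(env, agent, max_steps=200):
    """One episode of the interaction loop: observe, act, receive reward."""
    state = env.reset()                      # measure the initial state of the environment
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)            # act in order to change the state
        next_state, reward, done = env.step(action)
        agent.observe(state, action, reward, next_state)  # experience used for learning
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```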

A problem with the application of RL to robotics is the issue of data efficiency: how much experience an agent needs to learn a new task. RL requires interacting with an environment in order to gather experience, and as shown in recent successes of RL, some of these algorithms require a large amount of it; e.g. 29 million games to train a Go master program. This is feasible for environments that are straightforward to simulate (board games and video games), but with state-of-the-art robotic systems gathering experience is expensive (it requires a lot of human supervision) and risky (it can destroy the robot or cause harm to people).

In this blog, I'm going to focus on one of the many flavors of RL: model-based RL. This kind of RL aims to mitigate the costs and risks of RL by using experience to build a predictive model, and then using that model as a surrogate for the real environment. The predictive model is used to determine what effect actions have on the environment. This has an impact on data efficiency in two ways.
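As a concrete example of such a predictive model, here is a minimal sketch of a forward dynamics model f(s, a) → s' fit with ridge regression on observed transitions. The linear model and the class and method names are my own illustrative choices; in practice one would typically use a Gaussian process or a neural network.

```python
import numpy as np

class LinearDynamicsModel:
    """Toy forward model: predicts the next state from the current state and action."""

    def __init__(self, reg=1e-3):
        self.reg = reg      # ridge regularization strength
        self.W = None       # learned weights

    def fit(self, states, actions, next_states):
        # Inputs are stacked as [s, a, 1]; targets are the next states.
        X = np.hstack([states, actions, np.ones((len(states), 1))])
        A = X.T @ X + self.reg * np.eye(X.shape[1])
        self.W = np.linalg.solve(A, X.T @ next_states)   # regularized least squares

    def predict(self, state, action):
        x = np.hstack([state, action, [1.0]])
        return x @ self.W                                # predicted next state
```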

First, if the model is reasonably accurate, the agent can use the model to successfully perform a desired task. In the best case, the model is perfect and the agent does not require any data from the environment to perform the task. In the worst case, the model is biased, and performing the task based on the model alone will be impossible. If the agent can update the model with new experience data, its model should eventually match reality.
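The resulting training procedure alternates between acting in the real environment, refitting the model on all the data gathered so far, and improving the behaviour using the model as a surrogate. The sketch below only illustrates the shape of this loop; `collect_rollout` and `improve_policy` are hypothetical helpers standing in for data collection and for whatever planner or policy optimizer is run against the model.

```python
def model_based_rl(env, policy, model, collect_rollout, improve_policy, n_iterations=10):
    """Alternate real-world data collection, model fitting, and policy
    improvement on the learned model (the surrogate environment)."""
    dataset = []                                   # real transitions (s, a, r, s')
    for _ in range(n_iterations):
        dataset += collect_rollout(env, policy)    # gather real experience
        model.fit(dataset)                         # update the model with the new data
        policy = improve_policy(model, policy)     # optimize behaviour on the model only
    return policy
```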

Second, the agent now has an additional signal for learning: trying to reduce the bias of its model. Instead of only trying out behaviours that could maximize reward, it can also try out behaviours that collect the data required to build a better predictive model. Building a model of the environment is not, by itself, enough to obtain data efficiency. An agent may decide that some suboptimal behaviour is the best it can do, based on its model's predictions. Trying out that behaviour in the real world, it will obtain some numeric reward R, and trying out small variations of the behaviour may return rewards lower than R. The agent then settles on the suboptimal behaviour, even though a better one may exist in a region where its model makes bad predictions. Model bias hinders learning because the agent does not explore those situations.

There are at least two ways to deal with this problem, both based on quantifying the uncertainty of the predictive model. One is to use the uncertainty explicitly as a signal for learning: the reward is a combination of the model's uncertainty and the task-specific reward. In this way, the agent is motivated to explore behaviours that have not been tried out in the past. Another way to deal with the problem is to acknowledge that the observed data has multiple explanations. As new data is obtained, the agent updates its belief over which explanations are valid. When using the model, the agent should prefer behaviours that are good under any of the possible explanations. This is, roughly, the mechanism used by probabilistic methods for model-based RL, for example PILCO, Deep-PILCO, and Model-based TRPO.
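One simple way to obtain such an uncertainty estimate, which I use here purely for illustration, is an ensemble of dynamics models trained on the same data (the methods above use other representations; PILCO, for instance, uses Gaussian process models). Disagreement between ensemble members stands in for the model's uncertainty: it can be added to the task reward as an exploration bonus, and predictions can be checked under every member so the chosen behaviour works for all plausible explanations of the data. The `beta` weight below is a hypothetical tuning parameter.

```python
import numpy as np

def ensemble_prediction(models, state, action):
    """Predict the next state with an ensemble; disagreement between the
    members is used as a proxy for model uncertainty."""
    preds = np.stack([m.predict(state, action) for m in models])
    mean = preds.mean(axis=0)                 # consensus prediction
    uncertainty = preds.std(axis=0).sum()     # total disagreement across state dimensions
    return mean, uncertainty

def exploration_reward(task_reward, uncertainty, beta=0.1):
    """Combine the task-specific reward with the model's uncertainty, so that
    behaviours the model cannot yet predict well become attractive to try."""
    return task_reward + beta * uncertainty
```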
