More motivation on why we want models

We want to build models for our model-based RL algorithms because they might help us with the data-efficiency problem. Let's take another look at the objective we want to optimize:

\begin{equation} \label{eq:objective}
\underset{\theta}{\operatorname{argmax}}\; V(\theta) = \mathbb{E}_{P(\tau|\theta)}\left\lbrace R(\tau) \right\rbrace
\end{equation}

The distribution over trajectories $P(\tau|\theta)$ describes which sequences of states and actions $s_0, a_0, \ldots, s_{H-1}, a_{H-1}$ are likely to be observed when applying a policy with parameters $\theta$. Since we're working under the MDP framework, sampling trajectories $\tau$ is usually done with the following steps (see the sketch further below):

1. Sample an initial state $s_0 \sim P(s_0)$
2. Repeat for $H$ steps:
   1. Evaluate the policy $a_t \sim \pi_{\theta}(a_t|s_t)$
   2. Apply the action to obtain the next state $s_{t+1} \sim P(s_{t+1}|s_t, a_t)$
   3. Obtain the reward $r_t$

For our current purpose, it doesn't matter if the reward is a deterministic function of the ...
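To make the sampling procedure above concrete, here is a minimal Python sketch of trajectory sampling and a Monte Carlo estimate of $V(\theta)$. The `env.reset()` / `env.step()` interface and the `policy` callable are assumptions for illustration, not something defined in the text; in particular, `env.step` is assumed to return only the next state and the reward.

```python
import numpy as np

def sample_trajectory(env, policy, horizon):
    """Roll out `policy` in `env` for `horizon` steps, collecting (s_t, a_t, r_t) tuples."""
    trajectory = []
    state = env.reset()                          # s_0 ~ P(s_0)
    for t in range(horizon):
        action = policy(state)                   # a_t ~ pi_theta(a_t | s_t)
        next_state, reward = env.step(action)    # s_{t+1} ~ P(s_{t+1} | s_t, a_t), r_t
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory

def estimate_objective(env, policy, horizon, n_rollouts=100):
    """Monte Carlo estimate of V(theta) = E[R(tau)], with R(tau) taken as the sum of rewards."""
    returns = []
    for _ in range(n_rollouts):
        trajectory = sample_trajectory(env, policy, horizon)
        returns.append(sum(r for (_, _, r) in trajectory))
    return np.mean(returns)
```

Note that every call to `sample_trajectory` costs `horizon` interactions with the real environment; this is precisely the cost that a learned model of $P(s_{t+1}|s_t, a_t)$ would let us reduce, which is the data-efficiency argument for model-based RL.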