More motivation on why we want models
We want to build models for our model-based RL algorithms because they might help us with the data-efficiency problem. Let's take another look at the objective we want to optimize:

\[ \begin{equation} \label{eq:objective} \underset{\theta}{\operatorname{argmax}} V(\theta) = \mathbb{E}_{P(\tau|\theta)}\left\lbrace R(\tau) \right\rbrace \end{equation} \]

The distribution over trajectories $P(\tau|\theta)$ describes which sequences of states and actions $s_0,a_0,...,s_{H-1},a_{H-1}$ are likely to be observed when applying a policy with parameters $\theta$. Since we're working under the MDP framework, sampling trajectories $\tau$ is usually done with the following steps:

1. Sample an initial state $s_0 \sim P(s_0)$
2. Repeat for $H$ steps:
   - Evaluate policy $a_t \sim \pi_{\theta}(a_t|s_t)$
   - Apply action to obtain next state $s_{t+1} \sim P(s_{t+1}|s_t, a_t)$
   - Obtain reward $r_t$

For our current purpose, it doesn't matter if the reward is a deterministic function of the ...