Definitions
As a beginner, these definitions are very important. Only once we are clear about these concepts will it be much easier to understand the material that comes later.
Note: lowercase letters (e.g., $s$, $a$, $r$) denote observed values, while uppercase letters (e.g., $S$, $A$, $R$) denote random variables.
Terminology
- Agent: the learner and decision maker that selects actions.
- Environment: everything the agent interacts with; it emits states and rewards.
- State $s$: the current situation the agent observes.
- Action $a$: a choice the agent makes in a given state.
- Reward $r$: the scalar feedback signal the environment returns after an action.
- Policy $\pi(a|s)$: the probability of taking action $a$ in state $s$.
- State transition $p(s'|s,a)$: the probability that the environment moves to state $s'$ given state $s$ and action $a$.
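To see how these terms fit together, here is a minimal sketch of the agent-environment interaction loop in Python. The `env` object and its `reset`/`step`/`actions` interface are hypothetical placeholders for illustration, not something defined in this lesson:

```python
import random

def random_policy(state, actions):
    # pi(a|s): here simply a uniform distribution over the legal actions.
    return random.choice(actions)

def run_episode(env, policy, max_steps=100):
    # The agent observes state s, picks action a ~ pi(a|s), receives reward r,
    # and the environment moves to the next state s' ~ p(s'|s, a).
    state = env.reset()                      # hypothetical interface
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state, env.actions)
        state, reward, done = env.step(action)  # hypothetical interface
        total_reward += reward
        if done:
            break
    return total_reward
```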
Return and Value
Return (the discounted sum of future rewards): $$ U_t = R_t+\gamma R_{t+1}+\gamma^2 R_{t+2}+ \dots $$ where $\gamma \in [0,1]$ is the discount factor.
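As a quick sanity check, the return can be computed for a finite reward sequence. The rewards and the value of $\gamma$ below are made-up numbers:

```python
def discounted_return(rewards, gamma=0.9):
    # Accumulate from the last reward backwards: U_t = R_t + gamma * U_{t+1}.
    u = 0.0
    for r in reversed(rewards):
        u = r + gamma * u
    return u

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.81*2 = 2.62
```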
Action-value function: $$ Q_\pi(s_t,a_t) = \mathbb{E}[U_t|s_t,a_t] $$
Optimal action-value function: $$ Q^*(s_t,a_t) = \max_{\pi} Q_\pi(s_t,a_t) $$
State-value function:
$$ V_\pi(s_t) = \mathbb{E}_{A \sim \pi(\cdot|s_t)}[Q_{\pi}(s_t,A)] $$
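This last identity is just an average of $Q_\pi$ over the policy's action distribution. A tiny sketch for a single state, using made-up probabilities and values:

```python
# V_pi(s) = E_{A ~ pi(.|s)}[Q_pi(s, A)]: average Q over the policy.
# The numbers below are made up for illustration only.
pi = {"left": 0.2, "right": 0.8}   # pi(a|s) for one fixed state s
q  = {"left": 1.0, "right": 3.0}   # Q_pi(s, a) for the same state

v = sum(pi[a] * q[a] for a in pi)  # = 0.2*1.0 + 0.8*3.0 = 2.6
print(v)
```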
Core
During interaction with the environment, the agent can be controlled either by the policy $\pi(a|s)$ (sampling an action from it) or by the optimal action-value function $Q^*(s,a)$ (choosing the action with the highest value).
These two functions are therefore the targets we need to estimate, and we will learn methods for doing so in the later lessons.
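A brief sketch of the two control modes, again with made-up probabilities and values for a single state:

```python
import random

pi_probs = {"left": 0.3, "right": 0.7}   # pi(a|s), made up for illustration
q_values = {"left": 0.5, "right": 1.2}   # Q*(s, a), made up for illustration

# Policy-based control: sample an action from pi(a|s).
actions = list(pi_probs)
a_policy = random.choices(actions, weights=[pi_probs[a] for a in actions])[0]

# Value-based control: act greedily with respect to Q*(s, a).
a_greedy = max(q_values, key=q_values.get)  # "right"
```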