
Posted 2021-09-02 | Updated 2024-10-25 | theory | 7 minutes read (About 1002 words)

Reinforcement Learning (1): Markov Decision Process

Markov Decision Process (MDP)

A Markov decision process (MDP) is a central concept in reinforcement learning: a mathematical model of how an agent makes decisions in an uncertain environment.

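An MDP has five components: a state space $\mathcal{S}$, an action space $\mathcal{A}$, a state transition function $T$, a reward function $r$, and a discount factor $\gamma$.

$\textrm{MDP}: (\mathcal{S}, \mathcal{A}, T, r, \gamma).$

The transition function gives the probability $T(s, a, s') = P(s' \mid s, a)$ of reaching state $s'$ after taking action $a$ in state $s$; the formulas below use the $P(s' \mid s, a)$ notation.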
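As a minimal sketch of how these components might be stored for a finite problem, the Python snippet below keeps them in one container. The class name `TabularMDP`, its field names, and the two-state toy problem are illustrative assumptions, not something taken from the original post.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TabularMDP:
    """Finite MDP with states and actions indexed from 0 (hypothetical container)."""
    transition: List[List[List[float]]]  # transition[s][a][s2] = P(s2 | s, a)
    reward: List[List[float]]            # reward[s][a] = r(s, a)
    gamma: float                         # discount factor, assumed to lie in [0, 1)

    @property
    def num_states(self) -> int:
        return len(self.transition)

    @property
    def num_actions(self) -> int:
        return len(self.transition[0])

    def validate(self) -> None:
        """Raise ValueError unless every P(. | s, a) is a probability distribution."""
        if not 0.0 <= self.gamma < 1.0:
            raise ValueError("gamma must lie in [0, 1)")
        for s in range(self.num_states):
            for a in range(self.num_actions):
                row = self.transition[s][a]
                if len(row) != self.num_states or min(row) < 0.0:
                    raise ValueError(f"bad transition row for s={s}, a={a}")
                if abs(sum(row) - 1.0) > 1e-9:
                    raise ValueError(f"P(. | s={s}, a={a}) does not sum to 1")


# Toy problem (an assumption for illustration): action 0 stays put, action 1
# switches state, and only staying in state 1 pays a reward of 1.
toy_mdp = TabularMDP(
    transition=[[[1.0, 0.0], [0.0, 1.0]],
                [[0.0, 1.0], [1.0, 0.0]]],
    reward=[[0.0, 0.0], [1.0, 0.0]],
    gamma=0.5,
)
toy_mdp.validate()
```

Nested lists keep the sketch dependency-free; for larger problems an array library would be the more usual choice.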

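Markov property: the current state fully characterizes the process, so the next state depends only on the current state and action and not on the earlier history.

The goal of an MDP is to find an optimal policy that maximizes the cumulative reward (the return) the agent collects in the future, starting from any initial state. For every finite MDP there exists an optimal policy that is no worse than any other possible policy.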

Value Iteration

Value Function

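The expected discounted return obtained by starting in state $s_0$ and following a policy $\pi$ is called the “value function” for the policy $\pi$:

$V^\pi(s_0) = E_{a_t \sim \pi(s_t)} \Big[ R(\tau) \Big] = E_{a_t \sim \pi(s_t)} \Big[ \sum_{t=0}^\infty \gamma^t r(s_t, a_t) \Big],$

where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory, $R(\tau)$ is its discounted return, and $\gamma$ is the discount factor.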
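To make the definition concrete, here is a small sketch that computes the discounted return of one recorded reward sequence and averages it over several sequences, which gives a Monte Carlo estimate of $V^\pi(s_0)$. The function names are hypothetical, and finite reward lists truncate the infinite sum, so the result is only an approximation.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma**t * rewards[t], accumulated from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def monte_carlo_value(reward_sequences: Sequence[Sequence[float]], gamma: float) -> float:
    """Average the discounted returns of trajectories that all start in the same state."""
    returns = [discounted_return(rewards, gamma) for rewards in reward_sequences]
    return sum(returns) / len(returns)


# Rewards 0, 1, 1, 1 with gamma = 0.5 give 0 + 0.5 + 0.25 + 0.125 = 0.875.
example_return = discounted_return([0.0, 1.0, 1.0, 1.0], gamma=0.5)
```

Accumulating from the end avoids computing explicit powers of $\gamma$.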

The key idea behind all algorithms in reinforcement learning is that the value of a state can be written as the sum of the average reward obtained in the first stage and the discounted value function averaged over all possible next states.

$V^\pi(s_0) = E_{a_0 \sim \pi(s_0)} \Big[ r(s_0, a_0) + \gamma\ E_{s_1 \sim P(s_1 \mid s_0, a_0)} \Big[ V^\pi(s_1) \Big] \Big].$

Writing both expectations as sums gives the same relation for every state:

$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big[ r(s, a) + \gamma\ \sum_{s' \in \mathcal{S}} P(s' \mid s, a) V^\pi(s') \Big];\ \textrm{for all } s \in \mathcal{S}.$
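The recursion for $V^\pi$ can be turned into a procedure by applying its right-hand side repeatedly until the values stop changing (iterative policy evaluation). The sketch below assumes a finite MDP stored as nested lists; the function name `evaluate_policy`, the toy MDP, and the uniform random policy are illustrative assumptions.

```python
def evaluate_policy(P, r, pi, gamma, tol=1e-10, max_sweeps=10_000):
    """Iterate V(s) <- sum_a pi(a|s) [r(s,a) + gamma * sum_s2 P(s2|s,a) V(s2)]."""
    V = [0.0] * len(P)
    for _ in range(max_sweeps):
        V_new = []
        for s in range(len(P)):
            value = 0.0
            for a, pi_sa in enumerate(pi[s]):
                expected_next = sum(p * V[s2] for s2, p in enumerate(P[s][a]))
                value += pi_sa * (r[s][a] + gamma * expected_next)
            V_new.append(value)
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < tol:  # stop once a full sweep changes nothing noticeable
            break
    return V


# Hypothetical toy MDP: action 0 stays, action 1 switches state,
# and only staying in state 1 pays a reward of 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]   # P[s][a][s2] = P(s2 | s, a)
r = [[0.0, 0.0], [1.0, 0.0]]      # r[s][a]
pi = [[0.5, 0.5], [0.5, 0.5]]     # pi[s][a] = pi(a | s), uniform random policy
V_pi = evaluate_policy(P, r, pi, gamma=0.5)
```

For this toy problem the linear system can also be solved by hand, giving $V^\pi = (0.25, 0.75)$, which the iteration approaches.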

Action-Value Function

The action-value function $Q^\pi(s_0, a_0)$ is defined to be the average return of a trajectory that begins at $s_0$ when the action of the first stage is fixed to be $a_0$ and all later actions follow $\pi$.

$Q^\pi(s_0, a_0) = r(s_0, a_0) + E_{a_t \sim \pi(s_t)} \Big[ \sum_{t=1}^\infty \gamma^t r(s_t, a_t) \Big],$

$Q^\pi(s, a) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\ Q^\pi(s', a');\ \textrm{ for all } s \in \mathcal{S}, a \in \mathcal{A}.$
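The inner sum $\sum_{a'} \pi(a' \mid s') Q^\pi(s', a')$ is just $V^\pi(s')$, so the two functions can be converted into each other. The sketch below illustrates this on a toy MDP; the helper names `q_from_v` and `v_from_q`, the toy MDP, and the hand-solved values of $V^\pi$ for a uniform random policy are assumptions for illustration.

```python
def q_from_v(P, r, V, gamma):
    """Q(s, a) = r(s, a) + gamma * sum_s2 P(s2 | s, a) * V(s2)."""
    return [
        [r[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a]))
         for a in range(len(P[s]))]
        for s in range(len(P))
    ]


def v_from_q(Q, pi):
    """V(s) = sum_a pi(a | s) * Q(s, a)."""
    return [sum(pi_sa * q_sa for pi_sa, q_sa in zip(pi[s], Q[s]))
            for s in range(len(Q))]


# Hypothetical toy MDP: action 0 stays, action 1 switches state,
# and only staying in state 1 pays a reward of 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]   # P[s][a][s2] = P(s2 | s, a)
r = [[0.0, 0.0], [1.0, 0.0]]      # r[s][a]
pi = [[0.5, 0.5], [0.5, 0.5]]     # uniform random policy
V_pi = [0.25, 0.75]               # value of pi for gamma = 0.5, solved by hand

Q_pi = q_from_v(P, r, V_pi, gamma=0.5)
V_back = v_from_q(Q_pi, pi)       # averaging Q over the policy recovers V
```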

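Optimal Policy

Let $V^*(s) = \max_\pi V^\pi(s)$ denote the optimal value function. An optimal deterministic policy acts greedily with respect to it:

$\pi^*(s) = \underset{a \in \mathcal{A}}{\mathrm{argmax}} \Big[ r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\ V^*(s') \Big].$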
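As a sketch of this argmax, the function below reads a deterministic policy off a given value function. The name `greedy_policy`, the toy MDP, and the hand-computed optimal values are assumptions for illustration; ties are broken in favour of the lowest action index.

```python
def greedy_policy(P, r, V, gamma):
    """For each state pick argmax_a [r(s, a) + gamma * sum_s2 P(s2 | s, a) V(s2)]."""
    policy = []
    for s in range(len(P)):
        action_values = [
            r[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a]))
            for a in range(len(P[s]))
        ]
        best_action = max(range(len(action_values)), key=action_values.__getitem__)
        policy.append(best_action)
    return policy


# Hypothetical toy MDP: action 0 stays, action 1 switches state,
# and only staying in state 1 pays a reward of 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]   # P[s][a][s2] = P(s2 | s, a)
r = [[0.0, 0.0], [1.0, 0.0]]      # r[s][a]
V_star = [1.0, 2.0]               # optimal values for gamma = 0.5, solved by hand

pi_star = greedy_policy(P, r, V_star, gamma=0.5)  # move to state 1, then stay there
```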

Value Iteration

$V_{k+1}(s) = \max_{a \in \mathcal{A}} \Big\{ r(s, a) + \gamma\ \sum_{s' \in \mathcal{S}} P(s' \mid s, a) V_k(s') \Big\};\ \textrm{for all } s \in \mathcal{S}.$

The same iteration can be written for the action-value function, with the maximum over the next action $a'$ taken inside the sum over next states:

$Q_{k+1}(s, a) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \max_{a' \in \mathcal{A}} Q_k (s', a');\ \textrm{ for all } s \in \mathcal{S}, a \in \mathcal{A}.$

The main idea behind the Value Iteration algorithm is to use the principle of dynamic programming to find the optimal average return obtained from a given state. Note that implementing the Value Iteration algorithm requires that we know the Markov decision process (MDP) completely, i.e., both the transition function and the reward function.
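A minimal sketch of the state-value update, assuming the MDP is finite and fully known and is stored as nested lists. The function name `value_iteration`, the stopping tolerance, and the toy MDP are illustrative assumptions.

```python
def value_iteration(P, r, gamma, tol=1e-10, max_sweeps=10_000):
    """Iterate V(s) <- max_a [r(s, a) + gamma * sum_s2 P(s2 | s, a) V(s2)]."""
    V = [0.0] * len(P)
    for _ in range(max_sweeps):
        V_new = [
            max(r[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a]))
                for a in range(len(P[s])))
            for s in range(len(P))
        ]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < tol:  # stop once a full sweep changes nothing noticeable
            break
    return V


# Hypothetical toy MDP: action 0 stays, action 1 switches state,
# and only staying in state 1 pays a reward of 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]   # P[s][a][s2] = P(s2 | s, a)
r = [[0.0, 0.0], [1.0, 0.0]]      # r[s][a]
V_star = value_iteration(P, r, gamma=0.5)
```

On this toy problem the iterates approach $V^* = (1, 2)$: staying in state 1 forever is worth $1 / (1 - 0.5) = 2$, and state 0 is one discounted step away from it.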

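Bellman Equation

The Bellman equation is a key concept in reinforcement learning and dynamic programming. It describes the recursive relationship that the state-value function or the action-value function satisfies in a Markov decision process (MDP): the value of a state (or of a state-action pair) is expressed through the values of its successors.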
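For reference, the optimality form of the Bellman equation can be written in the notation used in this post as follows. This is the standard fixed-point statement for the optimal value functions $V^*$ and $Q^*$, given here as a worked formula under the assumption of a finite MDP with discount factor $\gamma < 1$:

$V^*(s) = \max_{a \in \mathcal{A}} \Big\{ r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) V^*(s') \Big\};\ \textrm{for all } s \in \mathcal{S},$

$Q^*(s, a) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \max_{a' \in \mathcal{A}} Q^*(s', a');\ \textrm{for all } s \in \mathcal{S}, a \in \mathcal{A}.$

The two are linked by $V^*(s) = \max_{a \in \mathcal{A}} Q^*(s, a)$.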

Q-learning

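Q-learning is an algorithm to learn the value function without necessarily knowing the MDP. It is a model-free reinforcement learning method: it does not need an exact model of the environment in order to learn. The goal of Q-learning is to learn an action-value function, and its update rule is based on the Bellman equation.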

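To learn this action-value function, we define a loss function that measures the discrepancy between a given Q-function and the ideal Q-function predicted by the Bellman equation. The loss is averaged over a dataset of $n$ trajectories of $T$ steps each, where $(s_t^i, a_t^i)$ denotes the state and action at step $t$ of trajectory $i$.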

$\hat{Q} = \underset{Q}{\mathrm{argmin}} \underbrace{\frac{1}{nT} \sum_{i=1}^n \sum_{t=0}^{T-1} (Q(s_t^i, a_t^i) - r(s_t^i, a_t^i) - \gamma \max_{a'} Q(s_{t+1}^i, a'))^2}_{\stackrel{\textrm{def}}{=} \ell(Q)}.$
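As a sketch of how $\ell(Q)$ could be evaluated for a tabular Q-function, the snippet below averages the squared Bellman residual over recorded transitions. The function name `bellman_loss` and the transition format `(s, a, reward, s_next)` are assumptions; dividing by the number of transitions equals the $1/(nT)$ factor when all $n$ trajectories have length $T$.

```python
def bellman_loss(Q, trajectories, gamma):
    """Mean of (Q(s, a) - r - gamma * max_a2 Q(s_next, a2))**2 over all transitions."""
    total, count = 0.0, 0
    for trajectory in trajectories:
        for s, a, reward, s_next in trajectory:
            residual = Q[s][a] - reward - gamma * max(Q[s_next])
            total += residual ** 2
            count += 1
    return total / count


# One hypothetical trajectory on a two-state toy problem:
# move from state 0 to state 1 (reward 0), then stay in state 1 (reward 1).
data = [[(0, 1, 0.0, 1), (1, 0, 1.0, 1)]]

Q_zero = [[0.0, 0.0], [0.0, 0.0]]
Q_star = [[0.5, 1.0], [2.0, 0.5]]   # optimal Q for that toy problem with gamma = 0.5

loss_zero = bellman_loss(Q_zero, data, gamma=0.5)  # residuals 0 and -1
loss_star = bellman_loss(Q_star, data, gamma=0.5)  # Q* satisfies the Bellman equation
```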

We can minimize this loss function by gradient descent.

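$\begin{aligned}Q(s_t^i, a_t^i) &\leftarrow Q(s_t^i, a_t^i) - \alpha \nabla_{Q(s_t^i,a_t^i)} \ell(Q) \\&=(1 - \alpha) Q(s_t^i,a_t^i) + \alpha \Big( r(s_t^i, a_t^i) + \gamma \max_{a'} Q(s_{t+1}^i, a') \Big),\end{aligned}$

where the target $r(s_t^i, a_t^i) + \gamma \max_{a'} Q(s_{t+1}^i, a')$ is treated as a constant when differentiating, and constant factors of the gradient are absorbed into the learning rate $\alpha$.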
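A minimal tabular sketch of this kind of update, in which each step moves $Q(s, a)$ a fraction $\alpha$ of the way toward the target $r(s, a) + \gamma \max_{a'} Q(s', a')$. The function name `q_learning`, the uniformly random exploration, the hyperparameters, and the toy MDP are illustrative assumptions. The transition table `P` is used only to simulate the environment; the learner itself sees just the sampled $(s, a, r, s')$ tuples.

```python
import random


def q_learning(P, r, gamma, alpha=0.5, num_steps=5000, seed=0):
    """Learn a tabular Q-function from sampled transitions of a simulated MDP."""
    rng = random.Random(seed)
    n_states, n_actions = len(P), len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0  # arbitrary start state
    for _ in range(num_steps):
        a = rng.randrange(n_actions)                               # explore uniformly
        s_next = rng.choices(range(n_states), weights=P[s][a])[0]  # simulate one step
        target = r[s][a] + gamma * max(Q[s_next])                  # Bellman target
        Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target           # move toward target
        s = s_next
    return Q


# Hypothetical toy MDP: action 0 stays, action 1 switches state,
# and only staying in state 1 pays a reward of 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]   # P[s][a][s2] = P(s2 | s, a)
r = [[0.0, 0.0], [1.0, 0.0]]      # r[s][a]
Q = q_learning(P, r, gamma=0.5)
greedy = [max(range(len(row)), key=row.__getitem__) for row in Q]
```

Because this toy environment is deterministic, a constant learning rate is enough for the table to settle; with stochastic transitions or rewards a decaying learning rate is the usual choice.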

References

https://d2l.ai/chapter_reinforcement-learning/value-iter.html

https://d2l.ai/chapter_reinforcement-learning/qlearning.html
