Preface
For an introduction to the ideas and concepts of reinforcement learning, see Part 1: Key Concepts in RL.
For a taxonomy of reinforcement learning algorithms, see Part 2: Kinds of RL Algorithms.
This article focuses on policy-gradient-based reinforcement learning algorithms.
Policy Gradient
Define the policy as $\pi(a|s) = P(A_t = a \mid S_t = s)$, i.e. the probability of taking action $a$ when in state $s$.
If the policy $\pi_\theta$ is represented by a neural network with parameters $\theta$, the probability $P(\tau|\theta)$ of generating a trajectory $\tau$ under policy $\pi_\theta$ is:
\[P(\tau|\theta) = \rho_0(s_0)\mathop{\prod}\limits_{t=0}^TP(s_{t+1}|s_t,a_t)\pi_\theta(a_t|s_t)\]where $\rho_0$ is the start-state distribution.
The expected return of policy $\pi_\theta$ is:
\[J(\pi_\theta) = \mathop{E}\limits_{\tau \sim\pi_\theta}[R(\tau)]=\int_\tau P(\tau|\theta)R(\tau)\]where $R(\tau)= \sum\limits_{t=0}^Tr(s_t, a_t)$.
Our goal is to find the optimal policy $\pi^* = \mathop{\arg\max}\limits_{\pi}J(\pi)$, i.e. the parameters $\theta^*$ that maximize $J(\pi_\theta)$.
The gradient $\nabla_\theta J(\pi_\theta)$ is called the policy gradient, and an algorithm that optimizes the expected return $J(\pi_\theta)$ by gradient ascent, $\theta_{k+1} = \theta_{k} + \alpha \nabla_\theta J(\pi_\theta)|_{\theta_k}$, is called a policy gradient algorithm.
Since the derivative of $\log x$ with respect to $x$ is $1/x$ (the log-derivative trick), we have:
\[\nabla_\theta P(\tau|\theta) = P(\tau|\theta)\frac{\nabla_\theta P(\tau|\theta)}{P(\tau|\theta)} = P(\tau|\theta)\nabla_\theta \log P(\tau|\theta)\]Taking the log of $P(\tau|\theta)$ turns the products into sums:
\[\log P(\tau|\theta) = \log\rho_0(s_0) + \sum\limits_{t=0}^T(\log P(s_{t+1}|s_t, a_t) + \log\pi_\theta(a_t|s_t))\]Here $\rho_0(s_0)$ and $P(s_{t+1}|s_t, a_t)$ depend only on the environment and not on the policy, so their derivatives with respect to $\theta$ are zero, which gives:
\[\nabla_\theta \log P(\tau|\theta) = \sum\limits_{t=0}^T \nabla _\theta \log\pi_\theta(a_t|s_t)\]The policy gradient is therefore:
\[\begin{aligned}\nabla_\theta J(\pi_\theta) &= \nabla_\theta \mathop{E}\limits_{\tau \sim\pi_\theta}[R(\tau)]\\ &=\nabla_\theta \int_\tau P(\tau|\theta)R(\tau) \\ &=\int_\tau \nabla_\theta P(\tau|\theta)R(\tau)\\ &=\int_\tau P(\tau|\theta)\nabla_\theta \log P(\tau|\theta)R(\tau)\\ &=\mathop{E}\limits_{\tau \sim\pi_\theta}[\nabla_\theta \log P(\tau|\theta)R(\tau)]\\ &=\mathop{E}\limits_{\tau \sim\pi_\theta}[\sum\limits_{t=0}^T \nabla _\theta \log\pi_\theta(a_t|s_t)R(\tau)]\end{aligned}\]The expectation can be estimated by sampling trajectories and averaging:
\[\nabla_\theta J(\pi_\theta) \approx \frac{1}{m}\sum\limits_{i=1}^m\sum\limits_{t=0}^T \nabla _\theta \log\pi_\theta(a_t|s_t)R(\tau_i)\]where $\tau_i$ is the $i$-th of $m$ sampled trajectories.
Simple Policy Gradient Algorithm
The complete code is available on github.
Defining the Loss Function
from torch.distributions.categorical import Categorical

# make core of policy network (mlp builds a feedforward network; defined in the full script)
logits_net = mlp(sizes=[obs_dim]+hidden_sizes+[n_acts])

# make function to compute action distribution
def get_policy(obs):
    logits = logits_net(obs)
    return Categorical(logits=logits)

# make loss function whose gradient, for the right data, is policy gradient
def compute_loss(obs, act, weights):
    logp = get_policy(obs).log_prob(act)
    return -(logp * weights).mean()
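The get_action helper used later when collecting experience simply samples an action from this distribution; a minimal version consistent with the code above looks roughly like this:
# make action selection function (outputs int actions, sampled from policy)
def get_action(obs):
    return get_policy(obs).sample().item()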
The loss here corresponds to $\log\pi_\theta(a_t|s_t)R(\tau)$, with weights standing in for $R(\tau)$ in this example, so taking the gradient of the loss yields $\nabla_\theta \log\pi_\theta(a_t|s_t)R(\tau)$. Because we want gradient ascent on $J(\pi_\theta)$ while the optimizer performs gradient descent, the function returns the negative of $\log\pi_\theta(a_t|s_t)R(\tau)$.
Although it is called a loss function here, its meaning is not the same as that of a loss function in supervised learning.
Defining the Reward
while True:
    # act in the environment
    obs, rew, done, _ = env.step(act)

    # save action, reward
    batch_acts.append(act)
    ep_rews.append(rew)

    if done:
        # if episode is over, record info about episode
        ep_ret, ep_len = sum(ep_rews), len(ep_rews)

        # the weight for each logprob(a|s) is R(tau)
        batch_weights += [ep_ret] * ep_len
        break
Each step executes an action in the environment via env.step(act) and receives the corresponding reward rew. When the environment returns done, marking the end of a trajectory, the rewards along that trajectory are summed to obtain the trajectory return $R(\tau)$, which is then stored in batch_weights, once for every timestep of the episode.
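As a small, hypothetical illustration: for a 3-step episode with rewards [1.0, 1.0, 1.0], every timestep of that episode receives the same weight, namely the full return 3.0:
ep_rews = [1.0, 1.0, 1.0]
ep_ret, ep_len = sum(ep_rews), len(ep_rews)   # 3.0, 3
batch_weights = []
batch_weights += [ep_ret] * ep_len            # [3.0, 3.0, 3.0]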
Training
A complete training epoch looks like this:
# for training policy
def train_one_epoch():
    # make some empty lists for logging.
    batch_obs = []          # for observations
    batch_acts = []         # for actions
    batch_weights = []      # for R(tau) weighting in policy gradient
    batch_rets = []         # for measuring episode returns
    batch_lens = []         # for measuring episode lengths

    # reset episode-specific variables
    obs = env.reset()       # first obs comes from starting distribution
    done = False            # signal from environment that episode is over
    ep_rews = []            # list for rewards accrued throughout ep

    # collect experience by acting in the environment with current policy
    while True:
        # save obs
        batch_obs.append(obs.copy())

        # act in the environment
        act = get_action(torch.as_tensor(obs, dtype=torch.float32))
        obs, rew, done, _ = env.step(act)

        # save action, reward
        batch_acts.append(act)
        ep_rews.append(rew)

        if done:
            # if episode is over, record info about episode
            ep_ret, ep_len = sum(ep_rews), len(ep_rews)
            batch_rets.append(ep_ret)
            batch_lens.append(ep_len)

            # the weight for each logprob(a|s) is R(tau)
            batch_weights += [ep_ret] * ep_len

            # reset episode-specific variables
            obs, done, ep_rews = env.reset(), False, []

            # end experience loop if we have enough of it
            if len(batch_obs) > batch_size:
                break

    # take a single policy gradient update step
    optimizer.zero_grad()
    batch_loss = compute_loss(obs=torch.as_tensor(batch_obs, dtype=torch.float32),
                              act=torch.as_tensor(batch_acts, dtype=torch.int32),
                              weights=torch.as_tensor(batch_weights, dtype=torch.float32)
                              )
    batch_loss.backward()
    optimizer.step()
    return batch_loss, batch_rets, batch_lens
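train_one_epoch is driven by a simple outer loop. A minimal sketch, assuming an Adam optimizer over logits_net and that lr, epochs, and numpy (as np) are defined elsewhere in the full script:
from torch.optim import Adam

# make optimizer over the policy network's parameters
optimizer = Adam(logits_net.parameters(), lr=lr)

# training loop: one policy gradient update per epoch
for i in range(epochs):
    batch_loss, batch_rets, batch_lens = train_one_epoch()
    print('epoch: %3d \t loss: %.3f \t return: %.3f \t ep_len: %.3f' %
          (i, batch_loss, np.mean(batch_rets), np.mean(batch_lens)))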
Reward-to-Go Policy Gradient Algorithm
The policy gradient derived above is $\nabla_\theta J(\pi_\theta) = \mathop{E}\limits_{\tau \sim \pi_\theta}[\sum\limits_{t=0}^T \nabla _\theta \log\pi_\theta(a_t|s_t)R(\tau)]$, which uses the return of the whole trajectory to judge how good every action on that trajectory is. In reality, however, an action should be evaluated by the cumulative reward obtained after it is taken, i.e. only rewards that come after.
A better approach is therefore to compute the policy gradient as:
\[\nabla_\theta J(\pi_\theta) = \mathop{E}\limits_{\tau \sim \pi_\theta}[\sum\limits_{t=0}^T \nabla _\theta \log\pi_\theta(a_t|s_t)\sum\limits_{t'=t}^TR(s_{t'}, a_{t'}, s_{t'+1})]\]This is known as the "reward-to-go policy gradient".
With the reward-to-go formulation the variance is lower (only rewards accrued after an action are accumulated, so reward terms from before the action, which only add noise, are excluded), which makes the algorithm converge more easily, meaning fewer training samples are needed. For a proof, see Extra Material.
Defining the Reward
Compared with the simple policy gradient, the only implementation difference in Reward-to-Go is how the reward weights are computed:
import numpy as np

def reward_to_go(rews):
    n = len(rews)
    rtgs = np.zeros_like(rews)
    for i in reversed(range(n)):
        rtgs[i] = rews[i] + (rtgs[i+1] if i+1 < n else 0)
    return rtgs
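For example, with a hypothetical reward sequence [1.0, 2.0, 3.0], each entry of the result is the sum of the rewards from that step onward:
rews = [1.0, 2.0, 3.0]
print(reward_to_go(rews))   # [6. 5. 3.]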
The only change is replacing batch_weights += [ep_ret] * ep_len with batch_weights += list(reward_to_go(ep_rews)). The complete code is available on github.
Baselines in Policy Gradients
First, the EGLP lemma:
EGLP Lemma: Suppose that $P_\theta$ is a parameterized probability distribution over a random variable $x$. Then $\mathop{E}\limits_{x\sim P_\theta}[\nabla_\theta \log P_\theta(x)]=0$.
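A quick sketch of why the lemma holds (assuming the gradient and the integral can be exchanged): since $P_\theta$ is normalized, $\int_x P_\theta(x) = 1$, so
\[\mathop{E}\limits_{x\sim P_\theta}[\nabla_\theta \log P_\theta(x)] = \int_x P_\theta(x)\nabla_\theta \log P_\theta(x) = \int_x \nabla_\theta P_\theta(x) = \nabla_\theta \int_x P_\theta(x) = \nabla_\theta 1 = 0\]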
From the EGLP lemma it follows that, for any function $b$ that depends only on the state and not on the policy:
\[\mathop{E}\limits_{a_t\sim \pi_\theta}[\nabla_\theta \log\pi_\theta(a_t|s_t)b(s_t)]=0\]This means we can make the following transformation:
\[\begin{aligned}\nabla_\theta J(\pi_\theta) &= \mathop{E}\limits_{\tau \sim \pi_\theta}[\sum\limits_{t=0}^T \nabla _\theta \log\pi_\theta(a_t|s_t)\sum\limits_{t'=t}^TR(s_{t'}, a_{t'}, s_{t'+1})] \\ &= \mathop{E}\limits_{\tau \sim \pi_\theta}[\sum\limits_{t=0}^T \nabla _\theta \log\pi_\theta(a_t|s_t)\sum\limits_{t'=t}^TR(s_{t'}, a_{t'}, s_{t'+1})] -\sum\limits_{t=0}^T\mathop{E}\limits_{a_t\sim \pi_\theta}[\nabla_\theta \log\pi_\theta(a_t|s_t)b(s_t)]\\&= \mathop{E}\limits_{\tau \sim \pi_\theta}[\sum\limits_{t=0}^T \nabla _\theta \log\pi_\theta(a_t|s_t)(\sum\limits_{t'=t}^TR(s_{t'}, a_{t'}, s_{t'+1})-b(s_t))]\end{aligned}\]The function $b$ is called a baseline. The most common choice of baseline is the on-policy value function $V^\pi(s_t)$, the average return obtained when starting from state $s_t$ and acting according to $\pi$ thereafter. Using $V^\pi(s_t)$ as the baseline reduces the variance of the sample estimate, making training more stable and faster to converge.
In practice, $V^\pi(s_t)$ is typically approximated with a neural network $V_\phi(s_t)$; the common policy optimization algorithms (including VPG, TRPO, PPO, and A2C) learn it with gradient descent on a mean-squared-error loss.
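A minimal sketch of that value-function fit in PyTorch, assuming a small value network with a single scalar output and batches of observations and empirical returns (the names value_net and update_value_function, and the hyperparameters below, are illustrative, not from the original code):
import torch
import torch.nn as nn
from torch.optim import Adam

# hypothetical value network: maps an observation to a scalar V_phi(s)
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
value_optimizer = Adam(value_net.parameters(), lr=1e-3)

def update_value_function(batch_obs, batch_returns):
    # mean-squared-error regression of V_phi(s_t) onto the observed returns
    obs = torch.as_tensor(batch_obs, dtype=torch.float32)
    rets = torch.as_tensor(batch_returns, dtype=torch.float32)
    v = value_net(obs).squeeze(-1)
    loss = ((v - rets) ** 2).mean()
    value_optimizer.zero_grad()
    loss.backward()
    value_optimizer.step()
    return loss.item()

The predicted values would then be subtracted from the reward-to-go weights before computing the policy loss, so that each $\log\pi_\theta(a_t|s_t)$ is weighted by $\sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1}) - V_\phi(s_t)$.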