Reinforcement Learning: Policy Gradient Algorithms (Part 1)

Posted by Masutangu on October 16, 2022

Introduction

For an introduction to the ideas and core concepts of reinforcement learning, see Part 1: Key Concepts in RL.

For a taxonomy of RL algorithms, see Part 2: Kinds of RL Algorithms.

This post focuses on policy-gradient-based reinforcement learning algorithms.

Policy Gradients

Define the policy as $\pi(a|s) = P(A_t = a | S_t =s)$, i.e., the probability of taking action $a$ when in state $s$.

If we represent the policy $\pi_\theta$ with a neural network parameterized by $\theta$, the probability $P(\tau|\theta)$ of generating a trajectory $\tau$ under $\pi_\theta$ is:

\[P(\tau|\theta) = \rho_0(s_0)\mathop{\prod}\limits_{t=0}^TP(s_{t+1}|s_t,a_t)\pi_\theta(a_t|s_t)\]

where $\rho_0$ is the start-state distribution.

The expected return of policy $\pi_\theta$ is:

\[J(\pi_\theta) = \mathop{E}\limits_{\tau \sim\pi_\theta}[R(\tau)]=\int_\tau P(\tau|\theta)R(\tau)\]

where $R(\tau)= \sum\limits_{t=0}^T r(s_t, a_t)$.

Our goal is to find the optimal policy $\pi^* = \mathop{argmax}\limits_{\pi}J(\pi)$, i.e., the optimal parameters $\theta^*$ that maximize $J(\pi_\theta)$.

The policy gradient is defined as $\nabla_\theta J(\pi_\theta)$. Algorithms that optimize the expected return $J(\pi_\theta)$ by gradient ascent, $\theta_{k+1} = \theta_{k} + \alpha \nabla_\theta J(\pi_\theta)|_{\theta_k}$, are called policy gradient algorithms.
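
As a quick illustration of the gradient-ascent mechanic (a toy example, not part of the policy gradient code below), here is a PyTorch sketch that maximizes a simple one-parameter objective:

import torch

# toy example: maximize J(theta) = -(theta - 3)^2, whose maximum is at theta = 3
theta = torch.tensor(0.0, requires_grad=True)
alpha = 0.1  # learning rate

for _ in range(100):
    J = -(theta - 3.0) ** 2          # objective to maximize
    J.backward()                     # compute dJ/dtheta
    with torch.no_grad():
        theta += alpha * theta.grad  # theta_{k+1} = theta_k + alpha * grad
        theta.grad.zero_()

print(theta.item())  # approximately 3.0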

Since the derivative of $\log x$ with respect to $x$ is $1/x$, we have the log-derivative trick:

\[\nabla_\theta P(\tau|\theta) = P(\tau|\theta)\frac{\nabla_\theta P(\tau|\theta)}{P(\tau|\theta)} = P(\tau|\theta)\nabla_\theta \log P(\tau|\theta)\]

Taking the log turns the product into a sum:

\[\log P(\tau|\theta) = \log\rho_0(s_0) + \sum\limits_{t=0}^T\big(\log P(s_{t+1}|s_t, a_t) + \log\pi_\theta(a_t|s_t)\big)\]

Since $\rho_0(s_0)$ and $P(s_{t+1}|s_t, a_t)$ depend only on the environment and not on the policy, their gradients with respect to $\theta$ are zero, which gives:

\[\nabla_\theta \log P(\tau|\theta) = \sum\limits_{t=0}^T \nabla_\theta \log\pi_\theta(a_t|s_t)\]

The policy gradient is therefore:

\[\begin{aligned}\nabla_\theta J(\pi_\theta) &= \nabla_\theta \mathop{E}\limits_{\tau \sim\pi_\theta}[R(\tau)]\\ &=\nabla_\theta \int_\tau P(\tau|\theta)R(\tau) \\ &=\int_\tau \nabla_\theta P(\tau|\theta)R(\tau)\\ &=\int_\tau P(\tau|\theta)\nabla_\theta \log P(\tau|\theta)R(\tau)\\ &=\mathop{E}\limits_{\tau \sim\pi_\theta}[\nabla_\theta \log P(\tau|\theta)R(\tau)]\\ &=\mathop{E}\limits_{\tau \sim\pi_\theta}\Big[\sum\limits_{t=0}^T \nabla_\theta \log\pi_\theta(a_t|s_t)R(\tau)\Big]\end{aligned}\]

The expectation can be estimated by the sample mean over a batch of $m$ sampled trajectories:

\[\nabla_\theta J(\pi_\theta) \approx \frac{1}{m}\sum\limits_{i=1}^m\sum\limits_{t=0}^T \nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})R(\tau_i)\]

Simple Policy Gradient Algorithm

The complete code is available on github.

Defining the Loss Function

import torch
from torch.distributions.categorical import Categorical

# make core of policy network
# (mlp, obs_dim, hidden_sizes and n_acts are defined earlier in the full source;
#  mlp builds a simple feedforward network)
logits_net = mlp(sizes=[obs_dim]+hidden_sizes+[n_acts])

# make function to compute action distribution
def get_policy(obs):
    logits = logits_net(obs)
    return Categorical(logits=logits)

# make loss function whose gradient, for the right data, is policy gradient
def compute_loss(obs, act, weights):
    logp = get_policy(obs).log_prob(act)
    return -(logp * weights).mean()

The loss here corresponds to $\log\pi_\theta(a_t|s_t)R(\tau)$; in this example, weights represents $R(\tau)$. Taking the gradient of this loss is therefore equivalent to computing $\nabla_\theta \log\pi_\theta(a_t|s_t)R(\tau)$. Because we use gradient ascent rather than gradient descent, the function returns the negative of $\log\pi_\theta(a_t|s_t)R(\tau)$.

Although it is called a loss function here, it does not have the same meaning as a loss function in supervised learning.
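
As a quick sanity check (a hypothetical snippet, not from the linked repo), assuming obs_dim=4 and n_acts=2 as in CartPole, the loss evaluates to a single scalar for a random batch:

# hypothetical smoke test: the loss should be a scalar for a random batch
obs = torch.randn(8, 4)           # batch of 8 observations (obs_dim=4 assumed)
acts = torch.randint(0, 2, (8,))  # batch of 8 action indices (n_acts=2 assumed)
weights = torch.ones(8)           # stand-in for R(tau) at each timestep
loss = compute_loss(obs, acts, weights)
print(loss.shape)                 # torch.Size([]) -- a scalar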

Defining the Reward

while True:
    # act in the environment
    obs, rew, done, _ = env.step(act)

    # save action, reward
    batch_acts.append(act)
    ep_rews.append(rew)

    if done:
        # if episode is over, record info about episode
        ep_ret, ep_len = sum(ep_rews), len(ep_rews)
        # the weight for each logprob(a|s) is R(tau)
        batch_weights += [ep_ret] * ep_len
        break

We act in the environment with env.step(act) and collect the resulting reward rew. When the environment returns done, signaling the end of a trajectory, we sum the rewards along that trajectory to obtain its return $R(\tau)$ and append it to batch_weights, once per timestep of the episode.
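
The act fed to env.step comes from a get_action helper that samples from the distribution returned by get_policy; a minimal sketch (the exact version is in the linked repo):

# sample an integer action index from the current policy for one observation
def get_action(obs):
    return get_policy(obs).sample().item()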

Training

A complete training epoch looks like this:

# for training policy
def train_one_epoch():
    # make some empty lists for logging.
    batch_obs = []          # for observations
    batch_acts = []         # for actions
    batch_weights = []      # for R(tau) weighting in policy gradient
    batch_rets = []         # for measuring episode returns
    batch_lens = []         # for measuring episode lengths

    # reset episode-specific variables
    obs = env.reset()       # first obs comes from starting distribution
    done = False            # signal from environment that episode is over
    ep_rews = []            # list for rewards accrued throughout ep

    # collect experience by acting in the environment with current policy
    while True:
        # save obs
        batch_obs.append(obs.copy())

        # act in the environment
        act = get_action(torch.as_tensor(obs, dtype=torch.float32))
        obs, rew, done, _ = env.step(act)

        # save action, reward
        batch_acts.append(act)
        ep_rews.append(rew)

        if done:
            # if episode is over, record info about episode
            ep_ret, ep_len = sum(ep_rews), len(ep_rews)
            batch_rets.append(ep_ret)
            batch_lens.append(ep_len)

            # the weight for each logprob(a|s) is R(tau)
            batch_weights += [ep_ret] * ep_len

            # reset episode-specific variables
            obs, done, ep_rews = env.reset(), False, []

            # end experience loop if we have enough of it
            if len(batch_obs) > batch_size:
                break

    # take a single policy gradient update step
    optimizer.zero_grad()
    batch_loss = compute_loss(obs=torch.as_tensor(batch_obs, dtype=torch.float32),
                              act=torch.as_tensor(batch_acts, dtype=torch.int32),
                              weights=torch.as_tensor(batch_weights, dtype=torch.float32)
                              )
    batch_loss.backward()
    optimizer.step()
    return batch_loss, batch_rets, batch_lens
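
For completeness, here is a rough sketch of the surrounding setup that train_one_epoch relies on (environment, hyperparameters, optimizer, outer loop). The names follow the Spinning Up example; in the full script the environment and dimension variables are created before logits_net, and the exact hyperparameters may differ:

import gym
import numpy as np
from torch.optim import Adam

env = gym.make('CartPole-v0')              # example environment (assumption)
obs_dim = env.observation_space.shape[0]   # used when building logits_net
n_acts = env.action_space.n
hidden_sizes = [32]
batch_size = 5000                          # minimum timesteps per epoch
optimizer = Adam(logits_net.parameters(), lr=1e-2)

for epoch in range(50):
    batch_loss, batch_rets, batch_lens = train_one_epoch()
    print('epoch: %3d \t loss: %.3f \t return: %.3f \t ep_len: %.3f' %
          (epoch, batch_loss, np.mean(batch_rets), np.mean(batch_lens)))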

Reward-to-Go Policy Gradient Algorithm

The policy gradient derived above, $\nabla_\theta J(\pi_\theta) = \mathop{E}\limits_{\tau \sim \pi_\theta}[\sum\limits_{t=0}^T \nabla_\theta \log\pi_\theta(a_t|s_t)R(\tau)]$, uses the return of the entire trajectory to judge how good each action on that trajectory is. In reality, an action should be judged only by the cumulative reward obtained after it is taken, i.e., only rewards that come after.

A better approach is therefore to compute the policy gradient as:

\[\nabla_\theta J(\pi_\theta) = \mathop{E}\limits_{\tau \sim \pi_\theta}\Big[\sum\limits_{t=0}^T \nabla_\theta \log\pi_\theta(a_t|s_t)\sum\limits_{t'=t}^TR(s_{t'}, a_{t'}, s_{t'+1})\Big]\]

This form is known as the "reward-to-go" policy gradient.

The reward-to-go estimator has lower variance: by accumulating only the rewards earned after an action and dropping the reward terms that precede it, it removes a source of noise. This makes the algorithm easier to converge, which in turn means it needs fewer training samples. For a proof, see Extra Material.

Defining the Reward

Compared with the simple policy gradient, the only implementation difference in Reward-to-Go is how the weights are computed:

import numpy as np

def reward_to_go(rews):
    # rews: the per-timestep rewards of one episode
    n = len(rews)
    rtgs = np.zeros_like(rews)
    # walk backwards so that rtgs[i] is the sum of rewards from timestep i onward
    for i in reversed(range(n)):
        rtgs[i] = rews[i] + (rtgs[i+1] if i+1 < n else 0)
    return rtgs

The only change is replacing batch_weights += [ep_ret] * ep_len with batch_weights += list(reward_to_go(ep_rews)). The complete code is available on github.
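
As a quick sanity check (a hypothetical example), each entry of the result is the sum of the rewards from that timestep onward:

print(reward_to_go(np.array([1.0, 2.0, 3.0])))  # [6. 5. 3.]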

Baselines in Policy Gradients

First, the EGLP (Expected Grad-Log-Prob) lemma:

EGLP Lemma: Suppose that $P_\theta$ is a parameterized probability distribution over a random variable $x$. Then $\mathop{E}\limits_{x\sim P_\theta}[\nabla_\theta \log P_\theta(x)]=0$.
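
The lemma follows directly from the log-derivative trick used above and from the fact that a probability distribution integrates to one:

\[\mathop{E}\limits_{x\sim P_\theta}[\nabla_\theta \log P_\theta(x)] = \int_x P_\theta(x)\nabla_\theta \log P_\theta(x) = \int_x \nabla_\theta P_\theta(x) = \nabla_\theta\int_x P_\theta(x) = \nabla_\theta 1 = 0\]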

From the EGLP lemma it follows that, for any function $b$ that depends only on the state $s_t$ (and not on the action):

\[\mathop{E}\limits_{a_t\sim \pi_\theta}[\nabla_\theta \log\pi_\theta(a_t|s_t)b(s_t)]=0\]

This means the policy gradient can be rewritten as:

\[\begin{aligned}\nabla_\theta J(\pi_\theta) &= \mathop{E}\limits_{\tau \sim \pi_\theta}\Big[\sum\limits_{t=0}^T \nabla_\theta \log\pi_\theta(a_t|s_t)\sum\limits_{t'=t}^TR(s_{t'}, a_{t'}, s_{t'+1})\Big] \\ &= \mathop{E}\limits_{\tau \sim \pi_\theta}\Big[\sum\limits_{t=0}^T \nabla_\theta \log\pi_\theta(a_t|s_t)\sum\limits_{t'=t}^TR(s_{t'}, a_{t'}, s_{t'+1})\Big] - \mathop{E}\limits_{\tau \sim \pi_\theta}\Big[\sum\limits_{t=0}^T\nabla_\theta \log\pi_\theta(a_t|s_t)b(s_t)\Big]\\ &= \mathop{E}\limits_{\tau \sim \pi_\theta}\Big[\sum\limits_{t=0}^T \nabla_\theta \log\pi_\theta(a_t|s_t)\Big(\sum\limits_{t'=t}^TR(s_{t'}, a_{t'}, s_{t'+1})-b(s_t)\Big)\Big]\end{aligned}\]

The function $b$ is called a baseline. The most common choice of baseline is the value function $V^\pi(s_t)$, the average return obtained starting from state $s_t$ and following $\pi$ thereafter. Using $V^\pi(s_t)$ as the baseline reduces the variance of the sample estimate, which makes training more stable and speeds up convergence.

In practice, $V^\pi(s_t)$ is approximated by a neural network $V_\phi(s_t)$. Most common policy optimization algorithms (including VPG, TRPO, PPO, and A2C) learn $V_\phi(s_t)$ by minimizing a mean-squared-error loss with gradient descent.
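
A minimal sketch of that idea (illustrative only, assuming obs_dim=4; the value-network architectures and update rules in VPG/TRPO/PPO/A2C differ in detail): regress $V_\phi(s_t)$ onto the observed reward-to-go targets with a mean-squared-error loss.

import torch
import torch.nn as nn
from torch.optim import Adam

# illustrative value network V_phi(s); obs_dim=4 is an assumption
value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
value_optimizer = Adam(value_net.parameters(), lr=1e-3)

def update_value_function(obs, rets):
    # obs: (N, obs_dim) observations, rets: (N,) reward-to-go targets
    value_loss = ((value_net(obs).squeeze(-1) - rets) ** 2).mean()  # MSE
    value_optimizer.zero_grad()
    value_loss.backward()
    value_optimizer.step()
    return value_loss.item()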

References