Trust Region Policy Optimization

Background

(Previously: Background for VPG)

TRPO updates policies by taking the largest step that improves performance, while satisfying a special constraint on how close the new and old policies are allowed to be. The constraint is expressed in terms of KL-divergence, a measure of (something like, but not exactly) distance between probability distributions.

This is different from normal policy gradient, which keeps new and old policies close in parameter space. But even seemingly small differences in parameter space can produce very large differences in performance, so a single bad step can collapse the policy's performance. This makes it dangerous to use large step sizes with vanilla policy gradients, which hurts their sample efficiency. TRPO nicely avoids this kind of collapse, and tends to improve performance quickly and monotonically.

Quick Facts

  • TRPO is an on-policy algorithm.
  • TRPO can be used for environments with either discrete or continuous action spaces.
  • The Spinning Up implementation of TRPO supports parallelization with MPI.

Key Equations

Let \pi_{\theta} denote a policy with parameters \theta. The theoretical TRPO update is:

\theta_{k+1} = \arg \max_{\theta} \; & {\mathcal L}(\theta_k, \theta) \\
\text{s.t.} \; & \bar{D}_{KL}(\theta || \theta_k) \leq \delta

where {\mathcal L}(\theta_k, \theta) is the surrogate advantage, a measure of how policy \pi_{\theta} performs relative to the old policy \pi_{\theta_k}, using data from the old policy:

{\mathcal L}(\theta_k, \theta) = \underE{s,a \sim \pi_{\theta_k}}{
    \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s,a)
    },

and \bar{D}_{KL}(\theta || \theta_k) is an average KL-divergence between policies across the states visited by the old policy:

\bar{D}_{KL}(\theta || \theta_k) = \underE{s \sim \pi_{\theta_k}}{
    D_{KL}\left(\pi_{\theta}(\cdot|s) || \pi_{\theta_k} (\cdot|s) \right)
}.
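
As a concrete illustration, here is a minimal NumPy sketch of how these two quantities can be estimated from a batch of samples collected under \pi_{\theta_k}. The function and array names (logp_new, logp_old, adv, probs_new, probs_old) are illustrative, not part of any particular implementation:

    import numpy as np

    def surrogate_advantage(logp_new, logp_old, adv):
        """Monte Carlo estimate of L(theta_k, theta) from (s, a) samples
        drawn under the old policy pi_theta_k."""
        ratio = np.exp(logp_new - logp_old)  # pi_theta(a|s) / pi_theta_k(a|s)
        return np.mean(ratio * adv)

    def mean_kl_categorical(probs_new, probs_old, eps=1e-8):
        """Average KL(pi_theta || pi_theta_k) over sampled states, for
        categorical (discrete-action) policies."""
        kl_per_state = np.sum(
            probs_new * (np.log(probs_new + eps) - np.log(probs_old + eps)),
            axis=-1)
        return np.mean(kl_per_state)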

You Should Know

The objective and constraint are both zero when \theta = \theta_k. Furthermore, the gradient of the constraint with respect to \theta is zero when \theta = \theta_k. Proving these facts requires some subtle command of the relevant math; it's an exercise worth doing whenever you feel ready!

The theoretical TRPO update isn't the easiest to work with, so TRPO makes some approximations to get an answer quickly. We Taylor expand the objective and constraint to leading order around \theta_k:

{\mathcal L}(\theta_k, \theta) &\approx g^T (\theta - \theta_k) \\
\bar{D}_{KL}(\theta || \theta_k) & \approx \frac{1}{2} (\theta - \theta_k)^T H (\theta - \theta_k)
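
Here g and H are, respectively, the gradient of the surrogate advantage and the Hessian of the average KL-divergence, both evaluated at the old parameters:

g = \nabla_{\theta} {\mathcal L}(\theta_k, \theta) \Big|_{\theta = \theta_k}, \qquad
H = \nabla^2_{\theta} \bar{D}_{KL}(\theta || \theta_k) \Big|_{\theta = \theta_k}.

(The zeroth- and first-order terms of the KL expansion vanish, as noted above.)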

yielding an approximate optimization problem,

\theta_{k+1} = \arg \max_{\theta} \; & g^T (\theta - \theta_k) \\
\text{s.t.} \; & \frac{1}{2} (\theta - \theta_k)^T H (\theta - \theta_k) \leq \delta.

You Should Know

The gradient g of the surrogate advantage function with respect to \theta, evaluated at \theta = \theta_k, is exactly equal to the policy gradient, \nabla_{\theta} J(\pi_{\theta})! Try proving this, if you feel comfortable diving into the math.

This approximate problem can be analytically solved by the methods of Lagrangian duality [1], yielding the solution:

\theta_{k+1} = \theta_k + \sqrt{\frac{2 \delta}{g^T H^{-1} g}} H^{-1} g.
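
To sketch where this comes from: write x = \theta - \theta_k. The Lagrangian stationarity condition g = \lambda H x forces the solution to point along H^{-1} g, and the constraint is tight at the optimum, which pins down the multiplier \lambda:

x^* = \frac{1}{\lambda} H^{-1} g, \qquad
\frac{1}{2} (x^*)^T H x^* = \delta
\;\Longrightarrow\;
\lambda = \sqrt{\frac{g^T H^{-1} g}{2 \delta}},

so that \theta_{k+1} = \theta_k + x^* recovers the update above.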

If we were to stop here, and just use this final result, the algorithm would be exactly calculating the Natural Policy Gradient. A problem is that, due to the approximation errors introduced by the Taylor expansion, this step may not satisfy the KL constraint, or may not actually improve the surrogate advantage. TRPO adds a modification to this update rule: a backtracking line search,

\theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2 \delta}{g^T H^{-1} g}} H^{-1} g,

where \alpha \in (0,1) is the backtracking coefficient, and j is the smallest nonnegative integer such that \pi_{\theta_{k+1}} satisfies the KL constraint and produces a positive surrogate advantage.
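
A minimal sketch of this backtracking line search in NumPy. The helpers surrogate_advantage and mean_kl are hypothetical names standing in for whatever an implementation uses to evaluate {\mathcal L}(\theta_k, \theta) and \bar{D}_{KL}(\theta || \theta_k), and step_dir stands for the natural-gradient step \sqrt{2\delta / (g^T H^{-1} g)} H^{-1} g:

    import numpy as np

    def backtracking_line_search(theta_k, step_dir, delta,
                                 surrogate_advantage, mean_kl,
                                 alpha=0.8, max_iters=10):
        """Shrink the proposed step by factors of alpha until the new
        parameters satisfy the KL constraint and give a positive
        surrogate advantage."""
        for j in range(max_iters):
            theta_new = theta_k + alpha**j * step_dir
            if (mean_kl(theta_new, theta_k) <= delta
                    and surrogate_advantage(theta_new, theta_k) > 0):
                return theta_new  # accept the (possibly shrunken) step
        return theta_k  # line search failed: fall back to the old parameters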

Lastly: computing and storing the matrix inverse, H^{-1}, is painfully expensive when dealing with neural network policies with thousands or millions of parameters. TRPO sidesteps the issue by using the conjugate gradient algorithm to solve Hx = g for x = H^{-1} g, requiring only a function which can compute the matrix-vector product Hx instead of computing and storing the whole matrix H directly. This is not too hard to do: we set up a symbolic operation to calculate

Hx = \nabla_{\theta} \left( \left(\nabla_{\theta} \bar{D}_{KL}(\theta || \theta_k)\right)^T x \right),

which gives us the correct output without computing the whole matrix.
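
A minimal sketch of the conjugate gradient solver, assuming hvp(x) is a function (hypothetical name here) that returns the Hessian-vector product Hx, for example by evaluating the symbolic operation above:

    import numpy as np

    def conjugate_gradient(hvp, g, cg_iters=10, eps=1e-8):
        """Approximately solve Hx = g using only Hessian-vector products."""
        x = np.zeros_like(g)
        r = g.copy()   # residual g - Hx (x starts at zero)
        p = r.copy()   # search direction
        r_dot_r = r.dot(r)
        for _ in range(cg_iters):
            Hp = hvp(p)
            alpha = r_dot_r / (p.dot(Hp) + eps)
            x += alpha * p
            r -= alpha * Hp
            new_r_dot_r = r.dot(r)
            p = r + (new_r_dot_r / (r_dot_r + eps)) * p
            r_dot_r = new_r_dot_r
        return x  # approximates H^{-1} g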

[1] See Convex Optimization by Boyd and Vandenberghe, especially chapters 2 through 5.

Exploration vs. Exploitation

TRPO trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy. The amount of randomness in action selection depends on both initial conditions and the training procedure. Over the course of training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards that it has already found. This may cause the policy to get trapped in local optima.

Documentation

spinup.trpo(env_fn, actor_critic=<function mlp_actor_critic>, ac_kwargs={}, seed=0, steps_per_epoch=4000, epochs=50, gamma=0.99, delta=0.01, vf_lr=0.001, train_v_iters=80, damping_coeff=0.1, cg_iters=10, backtrack_iters=10, backtrack_coeff=0.8, lam=0.97, max_ep_len=1000, logger_kwargs={}, save_freq=10, algo='trpo')[source]
Parameters:
  • env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
  • actor_critic

    A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent’s Tensorflow computation graph:

    Symbol    Shape             Description
    pi        (batch, act_dim)  Samples actions from policy given states.
    logp      (batch,)          Gives log probability, according to the policy, of taking
                                actions a_ph in states x_ph.
    logp_pi   (batch,)          Gives log probability, according to the policy, of the
                                action sampled by pi.
    info      N/A               A dict of any intermediate quantities (from calculating
                                the policy or log probabilities) which are needed for
                                analytically computing KL divergence. (eg sufficient
                                statistics of the distributions)
    info_phs  N/A               A dict of placeholders for old values of the entries
                                in info.
    d_kl      ()                A symbol for computing the mean KL divergence between
                                the current policy (pi) and the old policy (as specified
                                by the inputs to info_phs) over the batch of states
                                given in x_ph.
    v         (batch,)          Gives the value estimate for states in x_ph. (Critical:
                                make sure to flatten this!)
  • ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to TRPO.
  • seed (int) – Seed for random number generators.
  • steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
  • epochs (int) – Number of epochs of interaction (equivalent to number of policy updates) to perform.
  • gamma (float) – Discount factor. (Always between 0 and 1.)
  • delta (float) – KL-divergence limit for TRPO / NPG update. (Should be small for stability. Values like 0.01, 0.05.)
  • vf_lr (float) – Learning rate for value function optimizer.
  • train_v_iters (int) – Number of gradient descent steps to take on value function per epoch.
  • damping_coeff (float) –

    Artifact for numerical stability, should be smallish. Adjusts Hessian-vector product calculation:

    Hv \rightarrow (\alpha I + H)v

    where \alpha is the damping coefficient. Probably don’t play with this hyperparameter.

  • cg_iters (int) –

    Number of iterations of conjugate gradient to perform. Increasing this will lead to a more accurate approximation to H^{-1} g, and possibly slightly-improved performance, but at the cost of slowing things down.

    Also probably don’t play with this hyperparameter.

  • backtrack_iters (int) – Maximum number of steps allowed in the backtracking line search. Since the line search usually doesn’t backtrack, and usually only steps back once when it does, this hyperparameter doesn’t often matter.
  • backtrack_coeff (float) – How far back to step during backtracking line search. (Always between 0 and 1, usually above 0.5.)
  • lam (float) – Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
  • max_ep_len (int) – Maximum length of trajectory / episode / rollout.
  • logger_kwargs (dict) – Keyword args for EpochLogger.
  • save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
  • algo – Either ‘trpo’ or ‘npg’: this code supports both, since they are almost the same.
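
As a usage sketch, assuming the spinup package and OpenAI Gym are installed (hyperparameter values here are illustrative, and hidden_sizes is assumed to be a kwarg accepted by the default mlp_actor_critic):

    import gym
    from spinup import trpo

    # TRPO expects a function that builds a fresh copy of the environment.
    env_fn = lambda: gym.make('CartPole-v1')

    trpo(env_fn=env_fn,
         ac_kwargs=dict(hidden_sizes=(64, 64)),  # passed through to actor_critic
         steps_per_epoch=4000,
         epochs=50,
         delta=0.01,           # KL-divergence limit per update
         backtrack_coeff=0.8,  # shrink factor for the line search
         algo='trpo')          # or 'npg' for the natural policy gradient variant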

Saved Model Contents

The computation graph saved by the logger includes:

x    Tensorflow placeholder for state input.
pi   Samples an action from the agent, conditioned on states in x.
v    Gives value estimate for states in x.

This saved model can be accessed either by running the trained policy with the test_policy.py tool, or by loading the whole saved graph into a program with restore_tf_graph.

References

Why These Papers?

Schulman 2015 is included because it is the original paper describing TRPO. Schulman 2016 is included because our implementation of TRPO makes use of Generalized Advantage Estimation for computing the policy gradient. Kakade and Langford 2002 is included because it contains theoretical results which motivate, and are deeply connected to, the theoretical foundations of TRPO.