Trust Region Policy Optimization

Background

(Previously: Background for VPG)

TRPO updates policies by taking the largest step that improves performance, while satisfying a special constraint on how close the new and old policies are allowed to be. The constraint is expressed in terms of KL-divergence, a measure of (something like, but not exactly) distance between probability distributions.

This is different from normal policy gradient, which keeps new and old policies close in parameter space. But even seemingly small differences in parameter space can produce very large differences in performance, so a single bad step can collapse the policy's performance. This makes it dangerous to use large step sizes with vanilla policy gradients, which hurts their sample efficiency. TRPO nicely avoids this kind of collapse, and tends to improve performance quickly and monotonically.

Quick Facts

  • TRPO is an on-policy algorithm.
  • TRPO can be used for environments with either discrete or continuous action spaces.
  • The Spinning Up implementation of TRPO supports parallelization with MPI.

Key Equations

Let \pi_{\theta} denote a policy with parameters \theta. The theoretical TRPO update is:

\theta_{k+1} = \arg \max_{\theta} \; & {\mathcal L}(\theta_k, \theta) \\
\text{s.t.} \; & \bar{D}_{KL}(\theta || \theta_k) \leq \delta

where {\mathcal L}(\theta_k, \theta) is the surrogate advantage, a measure of how policy \pi_{\theta} performs relative to the old policy \pi_{\theta_k}, using data from the old policy:

{\mathcal L}(\theta_k, \theta) = \underE{s,a \sim \pi_{\theta_k}}{
    \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)} A^{\pi_{\theta_k}}(s,a)
    },

and \bar{D}_{KL}(\theta || \theta_k) is an average KL-divergence between policies across the states visited by the old policy:

\bar{D}_{KL}(\theta || \theta_k) = \underE{s \sim \pi_{\theta_k}}{
    D_{KL}\left(\pi_{\theta}(\cdot|s) || \pi_{\theta_k} (\cdot|s) \right)
}.
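
As a concrete illustration, here is a minimal NumPy sketch of how these two quantities can be estimated from a batch of samples collected under \pi_{\theta_k}. The function and array names (logp_new, logp_old, adv, probs_new, probs_old) are illustrative, not part of any particular implementation:

    import numpy as np

    def surrogate_advantage(logp_new, logp_old, adv):
        """Monte Carlo estimate of L(theta_k, theta) from (s, a) samples
        drawn under the old policy pi_theta_k."""
        ratio = np.exp(logp_new - logp_old)  # pi_theta(a|s) / pi_theta_k(a|s)
        return np.mean(ratio * adv)

    def mean_kl_categorical(probs_new, probs_old, eps=1e-8):
        """Average KL(pi_theta || pi_theta_k) over sampled states, for
        categorical (discrete-action) policies."""
        kl_per_state = np.sum(
            probs_new * (np.log(probs_new + eps) - np.log(probs_old + eps)),
            axis=-1)
        return np.mean(kl_per_state)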

You Should Know

The objective and constraint are both zero when \theta = \theta_k. Furthermore, the gradient of the constraint with respect to \theta is zero when \theta = \theta_k. Proving these facts requires some subtle command of the relevant math; it's an exercise worth doing whenever you feel ready!

The theoretical TRPO update isn't the easiest to work with, so TRPO makes some approximations to get an answer quickly. We Taylor expand the objective and constraint to leading order around \theta_k:

{\mathcal L}(\theta_k, \theta) &\approx g^T (\theta - \theta_k) \\
\bar{D}_{KL}(\theta || \theta_k) & \approx \frac{1}{2} (\theta - \theta_k)^T H (\theta - \theta_k)
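
Here g and H are, respectively, the gradient of the surrogate advantage and the Hessian of the average KL-divergence, both evaluated at the old parameters:

g = \nabla_{\theta} {\mathcal L}(\theta_k, \theta) \Big|_{\theta = \theta_k}, \qquad
H = \nabla^2_{\theta} \bar{D}_{KL}(\theta || \theta_k) \Big|_{\theta = \theta_k}.

(The zeroth- and first-order terms of the KL expansion vanish, as noted above.)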

yielding an approximate optimization problem,

\theta_{k+1} = \arg \max_{\theta} \; & g^T (\theta - \theta_k) \\
\text{s.t.} \; & \frac{1}{2} (\theta - \theta_k)^T H (\theta - \theta_k) \leq \delta.

You Should Know

The gradient g of the surrogate advantage function with respect to \theta, evaluated at \theta = \theta_k, is exactly equal to the policy gradient, \nabla_{\theta} J(\pi_{\theta})! Try proving this, if you feel comfortable diving into the math.

This approximate problem can be analytically solved by the methods of Lagrangian duality [1], yielding the solution:

\theta_{k+1} = \theta_k + \sqrt{\frac{2 \delta}{g^T H^{-1} g}} H^{-1} g.
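
To sketch where this comes from: write x = \theta - \theta_k. The Lagrangian stationarity condition g = \lambda H x forces the solution to point along H^{-1} g, and the constraint is tight at the optimum, which pins down the multiplier \lambda:

x^* = \frac{1}{\lambda} H^{-1} g, \qquad
\frac{1}{2} (x^*)^T H x^* = \delta
\;\Longrightarrow\;
\lambda = \sqrt{\frac{g^T H^{-1} g}{2 \delta}},

so that \theta_{k+1} = \theta_k + x^* recovers the update above.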

If we were to stop here, and just use this final result, the algorithm would be exactly calculating the Natural Policy Gradient. A problem is that, due to the approximation errors introduced by the Taylor expansion, this step may not satisfy the KL constraint, or may not actually improve the surrogate advantage. TRPO adds a modification to this update rule: a backtracking line search,

\theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2 \delta}{g^T H^{-1} g}} H^{-1} g,

where \alpha \in (0,1) is the backtracking coefficient, and j is the smallest nonnegative integer such that \pi_{\theta_{k+1}} satisfies the KL constraint and produces a positive surrogate advantage.
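
A minimal sketch of this backtracking line search in NumPy. The helpers surrogate_advantage and mean_kl are hypothetical names standing in for whatever an implementation uses to evaluate {\mathcal L}(\theta_k, \theta) and \bar{D}_{KL}(\theta || \theta_k), and step_dir stands for the natural-gradient step \sqrt{2\delta / (g^T H^{-1} g)} H^{-1} g:

    import numpy as np

    def backtracking_line_search(theta_k, step_dir, delta,
                                 surrogate_advantage, mean_kl,
                                 alpha=0.8, max_iters=10):
        """Shrink the proposed step by factors of alpha until the new
        parameters satisfy the KL constraint and give a positive
        surrogate advantage."""
        for j in range(max_iters):
            theta_new = theta_k + alpha**j * step_dir
            if (mean_kl(theta_new, theta_k) <= delta
                    and surrogate_advantage(theta_new, theta_k) > 0):
                return theta_new  # accept the (possibly shrunken) step
        return theta_k  # line search failed: fall back to the old parameters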

Lastly: computing and storing the matrix inverse, H^{-1}, is painfully expensive when dealing with neural network policies with thousands or millions of parameters. TRPO sidesteps the issue by using the conjugate gradient algorithm to solve Hx = g for x = H^{-1} g, requiring only a function which can compute the matrix-vector product Hx instead of computing and storing the whole matrix H directly. This is not too hard to do: we set up a symbolic operation to calculate

Hx = \nabla_{\theta} \left( \left(\nabla_{\theta} \bar{D}_{KL}(\theta || \theta_k)\right)^T x \right),

which gives us the correct output without computing the whole matrix.
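
A minimal sketch of the conjugate gradient solver, assuming hvp(x) is a function (hypothetical name here) that returns the Hessian-vector product Hx, for example by evaluating the symbolic operation above:

    import numpy as np

    def conjugate_gradient(hvp, g, cg_iters=10, eps=1e-8):
        """Approximately solve Hx = g using only Hessian-vector products."""
        x = np.zeros_like(g)
        r = g.copy()   # residual g - Hx (x starts at zero)
        p = r.copy()   # search direction
        r_dot_r = r.dot(r)
        for _ in range(cg_iters):
            Hp = hvp(p)
            alpha = r_dot_r / (p.dot(Hp) + eps)
            x += alpha * p
            r -= alpha * Hp
            new_r_dot_r = r.dot(r)
            p = r + (new_r_dot_r / (r_dot_r + eps)) * p
            r_dot_r = new_r_dot_r
        return x  # approximates H^{-1} g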

[1] See Convex Optimization by Boyd and Vandenberghe, especially chapters 2 through 5.

Exploration vs. Exploitation

TRPO trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy. The amount of randomness in action selection depends on both initial conditions and the training procedure. Over the course of training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards that it has already found. This may cause the policy to get trapped in local optima.

Documentation

spinup.trpo(env_fn, actor_critic=<function mlp_actor_critic>, ac_kwargs={}, seed=0, steps_per_epoch=4000, epochs=50, gamma=0.99, delta=0.01, vf_lr=0.001, train_v_iters=80, damping_coeff=0.1, cg_iters=10, backtrack_iters=10, backtrack_coeff=0.8, lam=0.97, max_ep_len=1000, logger_kwargs={}, save_freq=10, algo='trpo')[source]
Parameters:
  • env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
  • actor_critic

    A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent’s Tensorflow computation graph:

    Symbol    Shape             Description
    pi        (batch, act_dim)  Samples actions from policy given states.
    logp      (batch,)          Gives log probability, according to the policy, of taking
                                actions a_ph in states x_ph.
    logp_pi   (batch,)          Gives log probability, according to the policy, of the
                                action sampled by pi.
    info      N/A               A dict of any intermediate quantities (from calculating
                                the policy or log probabilities) which are needed for
                                analytically computing KL divergence. (eg sufficient
                                statistics of the distributions)
    info_phs  N/A               A dict of placeholders for old values of the entries
                                in info.
    d_kl      ()                A symbol for computing the mean KL divergence between
                                the current policy (pi) and the old policy (as specified
                                by the inputs to info_phs) over the batch of states
                                given in x_ph.
    v         (batch,)          Gives the value estimate for states in x_ph. (Critical:
                                make sure to flatten this!)
  • ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to TRPO.
  • seed (int) – Seed for random number generators.
  • steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
  • epochs (int) – Number of epochs of interaction (equivalent to number of policy updates) to perform.
  • gamma (float) – Discount factor. (Always between 0 and 1.)
  • delta (float) – KL-divergence limit for TRPO / NPG update. (Should be small for stability. Values like 0.01, 0.05.)
  • vf_lr (float) – Learning rate for value function optimizer.
  • train_v_iters (int) – Number of gradient descent steps to take on value function per epoch.
  • damping_coeff (float) –

    Artifact for numerical stability, should be smallish. Adjusts Hessian-vector product calculation:

    Hv \rightarrow (\alpha I + H)v

    where \alpha is the damping coefficient. Probably don’t play with this hyperparameter.

  • cg_iters (int) –

    Number of iterations of conjugate gradient to perform. Increasing this will lead to a more accurate approximation to H^{-1} g, and possibly slightly-improved performance, but at the cost of slowing things down.

    Also probably don’t play with this hyperparameter.

  • backtrack_iters (int) – Maximum number of steps allowed in the backtracking line search. Since the line search usually doesn’t backtrack, and usually only steps back once when it does, this hyperparameter doesn’t often matter.
  • backtrack_coeff (float) – How far back to step during backtracking line search. (Always between 0 and 1, usually above 0.5.)
  • lam (float) – Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
  • max_ep_len (int) – Maximum length of trajectory / episode / rollout.
  • logger_kwargs (dict) – Keyword args for EpochLogger.
  • save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
  • algo – Either ‘trpo’ or ‘npg’: this code supports both, since they are almost the same.
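
As a usage sketch, assuming the spinup package and OpenAI Gym are installed (hyperparameter values here are illustrative, and hidden_sizes is assumed to be a kwarg accepted by the default mlp_actor_critic):

    import gym
    from spinup import trpo

    # TRPO expects a function that builds a fresh copy of the environment.
    env_fn = lambda: gym.make('CartPole-v1')

    trpo(env_fn=env_fn,
         ac_kwargs=dict(hidden_sizes=(64, 64)),  # passed through to actor_critic
         steps_per_epoch=4000,
         epochs=50,
         delta=0.01,           # KL-divergence limit per update
         backtrack_coeff=0.8,  # shrink factor for the line search
         algo='trpo')          # or 'npg' for the natural policy gradient variant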

Saved Model Contents

The computation graph saved by the logger includes:

x    Tensorflow placeholder for state input.
pi   Samples an action from the agent, conditioned on states in x.
v    Gives value estimate for states in x.

This saved model can be accessed either by running the trained policy with the test_policy.py tool, or by loading the whole saved graph into a program with restore_tf_graph.

References

Why These Papers?

Schulman 2015 is included because it is the original paper describing TRPO. Schulman 2016 is included because our implementation of TRPO makes use of Generalized Advantage Estimation for computing the policy gradient. Kakade and Langford 2002 is included because it contains theoretical results which motivate, and are deeply connected to, the theoretical foundations of TRPO.