# Trust Region Policy Optimization

## Background

(Previously: Background for VPG)

TRPO updates policies by taking the largest step possible to improve performance, while satisfying a special constraint on how close the new and old policies are allowed to be. The constraint is expressed in terms of KL-divergence, a measure of (something like, but not exactly) distance between probability distributions.

### Quick Facts

• TRPO is an on-policy algorithm.
• TRPO can be used for environments with either discrete or continuous action spaces.
• The Spinning Up implementation of TRPO supports parallelization with MPI.

### Key Equations

Lastly: computing and storing the matrix inverse, $H^{-1}$, is painfully expensive when dealing with neural network policies with thousands or millions of parameters. TRPO sidesteps the issue by using the conjugate gradient algorithm to solve $Hx = g$ for $x = H^{-1} g$, requiring only a function which can compute the matrix-vector product $Hx$ instead of computing and storing the whole matrix $H$ directly. This is not too hard to do: we set up a symbolic operation to calculate

$$Hx = \nabla_\theta \left( \left(\nabla_\theta \bar{D}_{KL}(\theta \| \theta_k)\right)^T x \right),$$

which gives us the correct output without computing the whole matrix.
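The conjugate gradient procedure described above can be sketched in plain NumPy. This is an illustrative standalone version (the function and argument names are my own, not the Spinning Up implementation's): it solves $Hx = g$ using only a callable that returns the matrix-vector product $Hv$.

```python
import numpy as np

def conjugate_gradient(Avp, b, iters=10, tol=1e-10):
    """Approximately solve A x = b, given only Avp(v) = A v.

    Avp: a function computing the matrix-vector product (here, the
         Hessian-vector product of the mean KL divergence).
    b:   the right-hand side (here, the policy gradient g).
    """
    x = np.zeros_like(b)
    r = b.copy()          # residual b - A x (x starts at zero)
    p = r.copy()          # current search direction
    rdotr = r.dot(r)
    for _ in range(iters):
        Ap = Avp(p)
        alpha = rdotr / p.dot(Ap)
        x += alpha * p
        r -= alpha * Ap
        new_rdotr = r.dot(r)
        if new_rdotr < tol:
            break
        # Update the search direction to stay A-conjugate to previous ones.
        p = r + (new_rdotr / rdotr) * p
        rdotr = new_rdotr
    return x
```

Because each iteration touches $H$ only through `Avp`, the full (parameters × parameters) matrix is never formed, which is what makes the approach feasible for large networks.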

 [1] See Boyd and Vandenberghe's Convex Optimization, especially chapters 2 through 5.

### Exploration vs. Exploitation

TRPO trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy. The amount of randomness in action selection depends on both initial conditions and the training procedure. Over the course of training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards that it has already found. This may cause the policy to get trapped in local optima.

## Documentation

spinup.trpo(env_fn, actor_critic=<function mlp_actor_critic>, ac_kwargs={}, seed=0, steps_per_epoch=4000, epochs=50, gamma=0.99, delta=0.01, vf_lr=0.001, train_v_iters=80, damping_coeff=0.1, cg_iters=10, backtrack_iters=10, backtrack_coeff=0.8, lam=0.97, max_ep_len=1000, logger_kwargs={}, save_freq=10, algo='trpo')[source]

• env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
• actor_critic –

A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent’s Tensorflow computation graph:

| Symbol | Shape | Description |
| --- | --- | --- |
| pi | (batch, act_dim) | Samples actions from policy given states. |
| logp | (batch,) | Gives log probability, according to the policy, of taking actions a_ph in states x_ph. |
| logp_pi | (batch,) | Gives log probability, according to the policy, of the action sampled by pi. |
| info | N/A | A dict of any intermediate quantities (from calculating the policy or log probabilities) which are needed for analytically computing KL divergence. (eg sufficient statistics of the distributions) |
| info_phs | N/A | A dict of placeholders for old values of the entries in info. |
| d_kl | () | A symbol for computing the mean KL divergence between the current policy (pi) and the old policy (as specified by the inputs to info_phs) over the batch of states given in x_ph. |
| v | (batch,) | Gives the value estimate for states in x_ph. (Critical: make sure to flatten this!) |
• ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to TRPO.
• seed (int) – Seed for random number generators.
• steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
• epochs (int) – Number of epochs of interaction (equivalent to number of policy updates) to perform.
• gamma (float) – Discount factor. (Always between 0 and 1.)
• delta (float) – KL-divergence limit for TRPO / NPG update. (Should be small for stability. Values like 0.01, 0.05.)
• vf_lr (float) – Learning rate for value function optimizer.
• train_v_iters (int) – Number of gradient descent steps to take on value function per epoch.
• damping_coeff (float) –

Artifact for numerical stability, should be smallish. Adjusts Hessian-vector product calculation:

$$Hv \rightarrow (\alpha I + H)v,$$

where $\alpha$ is the damping coefficient. Probably don’t play with this hyperparameter.

• cg_iters (int) –

Number of iterations of conjugate gradient to perform. Increasing this will lead to a more accurate approximation to $H^{-1} g$, and possibly slightly-improved performance, but at the cost of slowing things down.

Also probably don’t play with this hyperparameter.

• backtrack_iters (int) – Maximum number of steps allowed in the backtracking line search. Since the line search usually doesn’t backtrack, and usually only steps back once when it does, this hyperparameter doesn’t often matter.
• backtrack_coeff (float) – How far back to step during backtracking line search. (Always between 0 and 1, usually above 0.5.)
• lam (float) – Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
• max_ep_len (int) – Maximum length of trajectory / episode / rollout.
• logger_kwargs (dict) – Keyword args for EpochLogger.
• save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
• algo – Either ‘trpo’ or ‘npg’: this code supports both, since they are almost the same.
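To illustrate how backtrack_iters and backtrack_coeff interact, here is a NumPy sketch of a TRPO-style backtracking line search. The function name and the loss/KL callables are hypothetical stand-ins for illustration, not Spinning Up's actual code: the proposed step is repeatedly shrunk by backtrack_coeff until it both improves the surrogate loss and satisfies the KL constraint delta.

```python
import numpy as np

def backtracking_line_search(f_loss, f_kl, theta, full_step, delta,
                             backtrack_iters=10, backtrack_coeff=0.8):
    """Shrink the proposed step until it is acceptable.

    f_loss: surrogate loss as a function of parameters (lower is better).
    f_kl:   mean KL divergence from the old policy, as a function of the
            candidate parameters.
    """
    loss_old = f_loss(theta)
    for j in range(backtrack_iters):
        # Try the full step first (j = 0), then geometrically smaller ones.
        theta_new = theta + (backtrack_coeff ** j) * full_step
        if f_kl(theta_new) <= delta and f_loss(theta_new) < loss_old:
            return theta_new  # accepted: constraint satisfied, loss improved
    return theta  # line search failed; keep the old parameters
```

This matches the documented behavior that the search usually accepts the full step immediately, so backtrack_iters rarely matters in practice.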

### Saved Model Contents

| Symbol | Description |
| --- | --- |
| x | Tensorflow placeholder for state input. |
| pi | Samples an action from the agent, conditioned on states in x. |
| v | Gives value estimate for states in x. |