## Background

(Previous section: Introduction to RL, Part 3)

### Quick Facts

• VPG is an on-policy algorithm.
• VPG can be used in environments with either discrete or continuous action spaces.
• The Spinning Up implementation of VPG supports parallelization with MPI.

### Exploration vs. Exploitation

VPG trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy. The amount of randomness in action selection depends on both the initial conditions and the training procedure. Over the course of training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards it has already found. This may cause the policy to get trapped in local optima.
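As a concrete illustration of this sampling-based exploration, here is a minimal pure-Python sketch of drawing an action from a categorical stochastic policy over discrete actions. The names `softmax` and `sample_action` are illustrative, not part of the Spinning Up API:

```python
import math
import random

def softmax(logits):
    """Convert unnormalized logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(logits, rng=random):
    """Sample an action index from the categorical policy defined by logits."""
    probs = softmax(logits)
    r = rng.random()
    cum = 0.0
    for a, p in enumerate(probs):
        cum += p
        if r < cum:
            return a
    return len(probs) - 1  # guard against floating-point round-off
```

When the logits are nearly uniform, many different actions get sampled (exploration); as training sharpens the logits toward one action, sampling becomes nearly deterministic (exploitation).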

## Documentation

spinup.vpg(env_fn, actor_critic=<function mlp_actor_critic>, ac_kwargs={}, seed=0, steps_per_epoch=4000, epochs=50, gamma=0.99, pi_lr=0.0003, vf_lr=0.001, train_v_iters=80, lam=0.97, max_ep_len=1000, logger_kwargs={}, save_freq=10)[source]

• env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
• actor_critic

A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent’s Tensorflow computation graph:

| Symbol | Shape | Description |
| --- | --- | --- |
| pi | (batch, act_dim) | Samples actions from policy given states. |
| logp | (batch,) | Gives log probability, according to the policy, of taking actions a_ph in states x_ph. |
| logp_pi | (batch,) | Gives log probability, according to the policy, of the action sampled by pi. |
| v | (batch,) | Gives the value estimate for states in x_ph. (Critical: make sure to flatten this!) |
• ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to VPG.
• seed (int) – Seed for random number generators.
• steps_per_epoch (int) – Number of steps of interaction (state-action pairs) between the agent and the environment in each epoch.
• epochs (int) – Number of epochs of interaction (equivalent to number of policy updates) to perform.
• gamma (float) – Discount factor. (Always between 0 and 1.)
• pi_lr (float) – Learning rate for policy optimizer.
• vf_lr (float) – Learning rate for value function optimizer.
• train_v_iters (int) – Number of gradient descent steps to take on value function per epoch.
• lam (float) – Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
• max_ep_len (int) – Maximum length of trajectory / episode / rollout.
• logger_kwargs (dict) – Keyword args for EpochLogger.
• save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
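The gamma and lam parameters above control the GAE-Lambda advantage estimates used in the policy update. A minimal pure-Python sketch of that computation follows; the function names are illustrative and do not mirror the Spinning Up internals:

```python
def discount_cumsum(xs, discount):
    """Discounted cumulative sums: out[t] = sum over k >= t of discount**(k-t) * xs[k]."""
    out = [0.0] * len(xs)
    running = 0.0
    for t in reversed(range(len(xs))):
        running = xs[t] + discount * running
        out[t] = running
    return out

def gae_advantages(rewards, values, gamma=0.99, lam=0.97):
    """GAE-Lambda advantage estimates for one trajectory.

    `values` holds one more entry than `rewards`: the final entry is the
    bootstrap value for the state after the last step (0 if the episode ended).
    """
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t]
              for t in range(len(rewards))]
    # Advantages are discounted sums of the residuals with factor gamma * lam.
    return discount_cumsum(deltas, gamma * lam)
```

With lam close to 1 (the default 0.97), the estimates trade a little bias for substantially lower variance relative to raw discounted returns.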

### Saved Model Contents

| Symbol | Description |
| --- | --- |
| x | Tensorflow placeholder for state input. |
| pi | Samples an action from the agent, conditioned on states in x. |
| v | Gives value estimate for states in x. |