Vanilla Policy Gradient¶
Background¶
(Previously: Introduction to RL, Part 3)
The key idea underlying policy gradients is to push up the probabilities of actions that lead to higher return, and push down the probabilities of actions that lead to lower return, until you arrive at the optimal policy.
Key Equations¶
Let $\pi_{\theta}$ denote a policy with parameters $\theta$, and $J(\pi_{\theta})$ denote the expected finite-horizon undiscounted return of the policy. The gradient of $J(\pi_{\theta})$ is

$$\nabla_{\theta} J(\pi_{\theta}) = \mathop{\mathrm{E}}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) \, A^{\pi_{\theta}}(s_t, a_t) \right],$$

where $\tau$ is a trajectory and $A^{\pi_{\theta}}$ is the advantage function for the current policy.

The policy gradient algorithm works by updating policy parameters via stochastic gradient ascent on policy performance:

$$\theta_{k+1} = \theta_k + \alpha \nabla_{\theta} J(\pi_{\theta_k}).$$

Policy gradient implementations typically compute advantage function estimates based on the infinite-horizon discounted return, despite otherwise using the finite-horizon undiscounted policy gradient formula.
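To make the update rule above concrete, here is a minimal sketch (not Spinning Up's implementation) of one stochastic-gradient-ascent step for a linear-softmax policy over a discrete action space. The states, actions, and advantage estimates are assumed to have already been collected; all names here are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a|s) for a policy with logits = s @ theta."""
    probs = softmax(s @ theta)      # action probabilities, shape (act_dim,)
    g = -np.outer(s, probs)         # -E[d logits / d theta] term
    g[:, a] += s                    # indicator term for the chosen action
    return g

def vpg_update(theta, states, actions, advantages, lr=0.01):
    """One stochastic-gradient-ascent step on the policy gradient estimate."""
    grad = np.zeros_like(theta)
    for s, a, adv in zip(states, actions, advantages):
        grad += grad_log_pi(theta, s, a) * adv
    grad /= len(states)
    return theta + lr * grad        # ascent: raise probability of high-advantage actions

# Toy usage with random data: 3-dim states, 2 actions.
rng = np.random.default_rng(0)
theta = np.zeros((3, 2))
states = rng.normal(size=(16, 3))
actions = rng.integers(0, 2, size=16)
advantages = rng.normal(size=16)
theta = vpg_update(theta, states, actions, advantages)
```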
Documentation¶
- spinup.vpg(env_fn, actor_critic=<function mlp_actor_critic>, ac_kwargs={}, seed=0, steps_per_epoch=4000, epochs=50, gamma=0.99, pi_lr=0.0003, vf_lr=0.001, train_v_iters=80, lam=0.97, max_ep_len=1000, logger_kwargs={}, save_freq=10) [source]¶

(A usage sketch follows the parameter list below.)

Parameters:
- env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
- actor_critic –
  A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent's Tensorflow computation graph:

  Symbol | Shape | Description
  ---|---|---
  pi | (batch, act_dim) | Samples actions from policy given states.
  logp | (batch,) | Gives log probability, according to the policy, of taking actions a_ph in states x_ph.
  logp_pi | (batch,) | Gives log probability, according to the policy, of the action sampled by pi.
  v | (batch,) | Gives the value estimate for states in x_ph. (Critical: make sure to flatten this!)

- ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to VPG.
- seed (int) – Seed for random number generators.
- steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
- epochs (int) – Number of epochs of interaction (equivalent to number of policy updates) to perform.
- gamma (float) – Discount factor. (Always between 0 and 1.)
- pi_lr (float) – Learning rate for policy optimizer.
- vf_lr (float) – Learning rate for value function optimizer.
- train_v_iters (int) – Number of gradient descent steps to take on value function per epoch.
- lam (float) – Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
- max_ep_len (int) – Maximum length of trajectory / episode / rollout.
- logger_kwargs (dict) – Keyword args for EpochLogger.
- save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
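A minimal usage sketch of the call documented above. The environment name, network sizes, and output directory are illustrative choices, not part of the API.

```python
from spinup import vpg
import tensorflow as tf
import gym

# Environment factory: vpg builds its own copies of the env from this.
env_fn = lambda: gym.make('CartPole-v1')

# Kwargs forwarded to the default mlp_actor_critic (sizes/activation are illustrative).
ac_kwargs = dict(hidden_sizes=(64, 64), activation=tf.tanh)

logger_kwargs = dict(output_dir='/tmp/vpg_cartpole', exp_name='vpg_cartpole')

vpg(env_fn=env_fn, ac_kwargs=ac_kwargs, steps_per_epoch=4000, epochs=50,
    logger_kwargs=logger_kwargs)
```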
Saved Model Contents¶
The computation graph saved by the logger includes:

Key | Value
---|---
x | Tensorflow placeholder for state input.
pi | Samples an action from the agent, conditioned on states in x.
v | Gives value estimate for states in x.
This saved model can be accessed either by
- running the trained policy with the test_policy.py tool,
- or loading the whole saved graph into a program with restore_tf_graph (see the sketch below).
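A small sketch of loading the saved graph and querying the pi and v outputs from the table above. The save path is hypothetical and should point at the simple_save directory written by the logger.

```python
import numpy as np
import tensorflow as tf
from spinup.utils.logx import restore_tf_graph

fpath = '/tmp/vpg_cartpole/simple_save'   # hypothetical path to a saved model

sess = tf.Session()
model = restore_tf_graph(sess, fpath)     # dict mapping keys to tensors in the saved graph

x_ph, pi_op, v_op = model['x'], model['pi'], model['v']

# Query an action and a value estimate for a single (dummy) observation.
obs_dim = x_ph.shape.as_list()[1]
obs = np.zeros(obs_dim, dtype=np.float32)
action, value = sess.run([pi_op, v_op], feed_dict={x_ph: obs.reshape(1, -1)})
```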
References¶
Relevant Papers¶
- Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton et al. 2000
- Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs, Schulman 2016(a)
- Benchmarking Deep Reinforcement Learning for Continuous Control, Duan et al. 2016
- High Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al. 2016(b)
Why These Papers?¶
Sutton 2000 is included because it is a timeless classic of reinforcement learning theory, and contains references to the earlier work which led to modern policy gradients. Schulman 2016(a) is included because Chapter 2 gives a lucid introduction to the theory of policy gradient algorithms, including pseudocode. Duan 2016 is a clear, recent benchmark paper that shows how vanilla policy gradient in the deep RL setting (e.g., with neural network policies and Adam as the optimizer) compares with other deep RL algorithms. Schulman 2016(b) is included because our implementation of VPG makes use of Generalized Advantage Estimation for computing the policy gradient.
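Since the VPG implementation computes its advantage estimates with GAE-Lambda (controlled by the gamma and lam arguments documented above), a brief illustrative sketch of the recursion may help; this is not Spinning Up's code, and the reward and value inputs are assumed to come from a collected trajectory.

```python
import numpy as np

def gae_lambda_advantages(rewards, values, last_val=0.0, gamma=0.99, lam=0.97):
    """GAE-Lambda advantages: A_t = sum_l (gamma * lam)^l * delta_{t+l}."""
    values = np.append(values, last_val)                 # bootstrap value for the final state
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD residuals
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):               # backward recursion over the trajectory
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

# Toy usage: a 5-step trajectory with constant reward and zero value estimates.
adv = gae_lambda_advantages(np.ones(5), np.zeros(5))
```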