Twin Delayed DDPG
Background
(Previously: Background for DDPG)
While DDPG can achieve great performance sometimes, it is frequently brittle with respect to hyperparameters and other kinds of tuning. A common failure mode for DDPG is that the learned Q-function begins to dramatically overestimate Q-values, which then leads to the policy breaking, because it exploits the errors in the Q-function. Twin Delayed DDPG (TD3) is an algorithm which addresses this issue by introducing three critical tricks:
Trick One: Clipped Double-Q Learning. TD3 learns two Q-functions instead of one (hence “twin”), and uses the smaller of the two Q-values to form the targets in the Bellman error loss functions.
Trick Two: “Delayed” Policy Updates. TD3 updates the policy (and target networks) less frequently than the Q-function. The paper recommends one policy update for every two Q-function updates.
Trick Three: Target Policy Smoothing. TD3 adds noise to the target action, to make it harder for the policy to exploit Q-function errors by smoothing out Q along changes in action.
Together, these three tricks result in substantially improved performance over baseline DDPG.
Quick Facts
 TD3 is an off-policy algorithm.
 TD3 can only be used for environments with continuous action spaces.
 The Spinning Up implementation of TD3 does not support parallelization.
Key Equations
TD3 concurrently learns two Q-functions, $Q_{\phi_1}$ and $Q_{\phi_2}$, by mean square Bellman error minimization, in almost the same way that DDPG learns its single Q-function. To show exactly how TD3 does this and how it differs from normal DDPG, we’ll work from the innermost part of the loss function outwards.
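First: target policy smoothing. Actions used to form the Q-learning target are based on the target policy, $\mu_{\theta_{\text{targ}}}$, but with clipped noise added on each dimension of the action. After adding the clipped noise, the target action is then clipped to lie in the valid action range (all valid actions, $a$, satisfy $a_{Low} \leq a \leq a_{High}$). The target actions are thus:
$$a'(s') = \text{clip}\left(\mu_{\theta_{\text{targ}}}(s') + \text{clip}(\epsilon, -c, c),\; a_{Low},\; a_{High}\right), \qquad \epsilon \sim \mathcal{N}(0, \sigma)$$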
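As a rough illustration of the smoothing step, the NumPy sketch below adds clipped Gaussian noise to a batch of target-policy actions and then clips the result to the action bounds. The function name `smoothed_target_action`, the scalar bounds `act_low` and `act_high`, and the example batch are assumptions made for this sketch. `target_noise` and `noise_clip` mirror the keyword arguments of the same name, but this is not the Spinning Up TensorFlow code.

```python
import numpy as np

def smoothed_target_action(mu_targ, target_noise=0.2, noise_clip=0.5,
                           act_low=-1.0, act_high=1.0, rng=None):
    """Hypothetical helper: clip(mu_targ + clip(eps, -c, c), act_low, act_high)."""
    rng = np.random.default_rng() if rng is None else rng
    mu_targ = np.asarray(mu_targ, dtype=np.float64)
    # Mean-zero Gaussian noise, drawn independently for each action dimension.
    eps = rng.normal(0.0, target_noise, size=mu_targ.shape)
    # Inner clip: bound the magnitude of the smoothing noise.
    eps = np.clip(eps, -noise_clip, noise_clip)
    # Outer clip: keep the perturbed action inside the valid action range.
    return np.clip(mu_targ + eps, act_low, act_high)

# Example batch of two target-policy actions with two dimensions each (made up).
rng = np.random.default_rng(0)
mu = np.array([[0.9, -0.2], [-1.0, 0.4]])
a_next = smoothed_target_action(mu, rng=rng)
```

Because the noise is bounded by `noise_clip`, each smoothed action stays within `noise_clip` of the target policy's output, and never leaves the valid range.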
Target policy smoothing essentially serves as a regularizer for the algorithm. It addresses a particular failure mode that can happen in DDPG: if the Q-function approximator develops an incorrect sharp peak for some actions, the policy will quickly exploit that peak and then have brittle or incorrect behavior. This can be averted by smoothing out the Q-function over similar actions, which target policy smoothing is designed to do.
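Next: clipped double-Q learning. Both Q-functions use a single target, calculated using whichever of the two Q-functions gives a smaller target value:
$$y(r, s', d) = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_{i,\text{targ}}}(s', a'(s')),$$
and then both are learned by regressing to this target:
$$L(\phi_1, \mathcal{D}) = \underset{(s,a,r,s',d) \sim \mathcal{D}}{\mathrm{E}}\left[ \left( Q_{\phi_1}(s,a) - y(r,s',d) \right)^2 \right],$$
$$L(\phi_2, \mathcal{D}) = \underset{(s,a,r,s',d) \sim \mathcal{D}}{\mathrm{E}}\left[ \left( Q_{\phi_2}(s,a) - y(r,s',d) \right)^2 \right].$$
Using the smaller Q-value for the target, and regressing towards that, helps fend off overestimation in the Q-function.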
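The target and loss computation can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: `clipped_double_q_target` and `msbe_loss` are hypothetical helper names, and the arrays stand in for a minibatch of rewards, done flags, and target-network Q-values that would already have been evaluated at the smoothed target actions.

```python
import numpy as np

def clipped_double_q_target(r, d, q1_targ, q2_targ, gamma=0.99):
    """Shared Bellman target: r + gamma * (1 - d) * min(Q1_targ, Q2_targ)."""
    r = np.asarray(r, dtype=np.float64)
    d = np.asarray(d, dtype=np.float64)
    # Element-wise minimum of the two target Q-estimates.
    min_q = np.minimum(q1_targ, q2_targ)
    # (1 - d) removes the bootstrap term for terminal transitions.
    return r + gamma * (1.0 - d) * min_q

def msbe_loss(q, y):
    """Mean squared Bellman error of predictions q against the fixed target y."""
    return float(np.mean((np.asarray(q, dtype=np.float64) - y) ** 2))

# Made-up minibatch of three transitions.
r = np.array([1.0, 0.0, 2.0])
d = np.array([0.0, 0.0, 1.0])
q1_targ = np.array([10.0, 5.0, 7.0])
q2_targ = np.array([8.0, 6.0, 9.0])

y = clipped_double_q_target(r, d, q1_targ, q2_targ)

# Both Q-functions regress to the same target y.
q1_pred = np.array([9.0, 5.0, 2.0])
q2_pred = np.array([8.5, 4.5, 2.5])
total_q_loss = msbe_loss(q1_pred, y) + msbe_loss(q2_pred, y)
```

Note that the minimum is taken per transition, so different entries of the batch may use different Q-functions for their target.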
Lastly: the policy is learned just by maximizing $Q_{\phi_1}$:
$$\max_{\theta} \underset{s \sim \mathcal{D}}{\mathrm{E}}\left[ Q_{\phi_1}(s, \mu_{\theta}(s)) \right],$$
which is pretty much unchanged from DDPG. However, in TD3, the policy is updated less frequently than the Q-functions are. This helps damp the volatility that normally arises in DDPG because of how a policy update changes the target.
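One way to picture the delayed schedule is a counter over gradient steps: the Q-functions are updated on every step, while the policy and target networks are updated only on every `policy_delay`-th step. The sketch below only counts updates rather than performing them; the function name `td3_update_schedule` and the choice to trigger the policy update when `j % policy_delay == 0` are assumptions for illustration.

```python
def td3_update_schedule(num_updates, policy_delay=2):
    """Count how often each component would be updated over num_updates steps."""
    counts = {"q": 0, "pi": 0, "targ": 0}
    policy_steps = []
    for j in range(num_updates):
        # The two Q-functions take a gradient step on every iteration.
        counts["q"] += 1
        if j % policy_delay == 0:
            # Delayed step: update the policy, then the target networks.
            counts["pi"] += 1
            counts["targ"] += 1
            policy_steps.append(j)
    return counts, policy_steps

counts, policy_steps = td3_update_schedule(6, policy_delay=2)
```

With `policy_delay=2`, six gradient steps give six Q-function updates and three policy updates, matching the one-to-two ratio recommended in the paper.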
Exploration vs. Exploitation
TD3 trains a deterministic policy in an off-policy way. Because the policy is deterministic, if the agent were to explore on-policy, in the beginning it would probably not try a wide enough variety of actions to find useful learning signals. To make TD3 policies explore better, we add noise to their actions at training time, typically uncorrelated mean-zero Gaussian noise. To facilitate getting higher-quality training data, you may reduce the scale of the noise over the course of training. (We do not do this in our implementation, and keep noise scale fixed throughout.)
At test time, to see how well the policy exploits what it has learned, we do not add noise to the actions.
You Should Know
Our TD3 implementation uses a trick to improve exploration at the start of training. For a fixed number of steps at the beginning (set with the start_steps keyword argument), the agent takes actions which are sampled from a uniform random distribution over valid actions. After that, it returns to normal TD3 exploration.
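The exploration scheme described above can be sketched as a single action-selection function. This is an illustrative NumPy version, not the Spinning Up code: `select_action`, the toy `policy`, the scalar action bounds, and the use of `t < start_steps` as the switch-over condition are all assumptions for the sketch.

```python
import numpy as np

def select_action(policy, obs, t, act_dim, rng, start_steps=10000,
                  act_noise=0.1, act_low=-1.0, act_high=1.0):
    """Uniform-random actions for the first start_steps steps, then policy + noise."""
    if t < start_steps:
        # Warm-up phase: sample uniformly over the valid action range.
        return rng.uniform(act_low, act_high, size=act_dim)
    # Normal TD3 exploration: deterministic action plus mean-zero Gaussian noise.
    a = policy(obs) + act_noise * rng.standard_normal(act_dim)
    return np.clip(a, act_low, act_high)

# Toy deterministic policy, used only for this example.
policy = lambda obs: np.tanh(obs)
obs = np.array([0.3, -0.7])
rng = np.random.default_rng(0)

a_random = select_action(policy, obs, t=0, act_dim=2, rng=rng)
a_noisy = select_action(policy, obs, t=20000, act_dim=2, rng=rng)
# At test time, setting act_noise to zero recovers the deterministic policy.
a_test = select_action(policy, obs, t=20000, act_dim=2, rng=rng, act_noise=0.0)
```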
Documentation

spinup.td3(env_fn, actor_critic=<function mlp_actor_critic>, ac_kwargs={}, seed=0, steps_per_epoch=5000, epochs=100, replay_size=1000000, gamma=0.99, polyak=0.995, pi_lr=0.001, q_lr=0.001, batch_size=100, start_steps=10000, act_noise=0.1, target_noise=0.2, noise_clip=0.5, policy_delay=2, max_ep_len=1000, logger_kwargs={}, save_freq=1)

Parameters:
 env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
 actor_critic – A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent’s TensorFlow computation graph:

 Symbol  Shape             Description
 ------  ----------------  -----------
 pi      (batch, act_dim)  Deterministically computes actions from policy given states.
 q1      (batch,)          Gives one estimate of Q* for states in x_ph and actions in a_ph.
 q2      (batch,)          Gives another estimate of Q* for states in x_ph and actions in a_ph.
 q1_pi   (batch,)          Gives the composition of q1 and pi for states in x_ph: q1(x, pi(x)).

 ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to TD3.
 seed (int) – Seed for random number generators.
 steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
 epochs (int) – Number of epochs to run and train agent.
 replay_size (int) – Maximum length of replay buffer.
 gamma (float) – Discount factor. (Always between 0 and 1.)
 polyak (float) – Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to:
$$\theta_{\text{targ}} \leftarrow \rho \theta_{\text{targ}} + (1 - \rho) \theta$$
where $\rho$ is polyak. (Always between 0 and 1, usually close to 1.)
 pi_lr (float) – Learning rate for policy.
 q_lr (float) – Learning rate for Q-networks.
 batch_size (int) – Minibatch size for SGD.
 start_steps (int) – Number of steps for uniform-random action selection, before running real policy. Helps exploration.
 act_noise (float) – Stddev for Gaussian exploration noise added to policy at training time. (At test time, no noise is added.)
 target_noise (float) – Stddev for smoothing noise added to target policy.
 noise_clip (float) – Limit for absolute value of target policy smoothing noise.
 policy_delay (int) – The policy is updated only once for every policy_delay updates of the Q-networks.
 max_ep_len (int) – Maximum length of trajectory / episode / rollout.
 logger_kwargs (dict) – Keyword args for EpochLogger.
 save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
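To make the polyak parameter concrete: each target parameter is moved a small fraction of the way towards the corresponding main-network parameter, i.e. the new target value is polyak times the old target value plus (1 - polyak) times the main value. The sketch below applies that rule to lists of NumPy arrays; the helper name `polyak_update` and the example parameter values are assumptions for illustration, not part of the Spinning Up API.

```python
import numpy as np

def polyak_update(targ_params, main_params, polyak=0.995):
    """Return new target parameters: polyak * targ + (1 - polyak) * main."""
    return [polyak * t + (1.0 - polyak) * m
            for t, m in zip(targ_params, main_params)]

# Made-up parameter lists with matching shapes.
main = [np.ones(3), np.full((2, 2), 2.0)]
targ = [np.zeros(3), np.zeros((2, 2))]

targ = polyak_update(targ, main, polyak=0.995)
```

With polyak close to 1, the target networks change slowly, which keeps the regression target for the Q-functions relatively stable between updates.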
Saved Model Contents
The computation graph saved by the logger includes:
 Key  Value
 ---  -----
 x    TensorFlow placeholder for state input.
 a    TensorFlow placeholder for action input.
 pi   Deterministically computes an action from the agent, conditioned on states in x.
 q1   Gives one action-value estimate for states in x and actions in a.
 q2   Gives the other action-value estimate for states in x and actions in a.
This saved model can be accessed either by
 running the trained policy with the test_policy.py tool,
 or loading the whole saved graph into a program with restore_tf_graph.
References
Relevant Papers
 Addressing Function Approximation Error in Actor-Critic Methods, Fujimoto et al, 2018