Soft Actor-Critic


(前一节 背景 for TD3)

Soft Actor Critic (SAC) is an algorithm which optimizes a stochastic policy in an off-policy way, forming a bridge between stochastic policy optimization and DDPG-style approaches. It isn’t a direct successor to TD3 (having been published roughly concurrently), but it incorporates the clipped double-Q trick, and due to the inherent stochasticity of the policy in SAC, it also winds up benefiting from something like target policy smoothing.

A central feature of SAC is entropy regularization. The policy is trained to maximize a trade-off between expected return and entropy, a measure of randomness in the policy. This has a close connection to the exploration-exploitation trade-off: increasing entropy results in more exploration, which can accelerate learning later on. It can also prevent the policy from prematurely converging to a bad local optimum.


  • SAC is an off-policy algorithm.
  • The version of SAC implemented here can only be used for environments with continuous action spaces.
  • An alternate version of SAC, which slightly changes the policy update rule, can be implemented to handle discrete action spaces.
  • The Spinning Up implementation of SAC does not support parallelization.


To explain Soft Actor Critic, we first have to introduce the entropy-regularized reinforcement learning setting. In entropy-regularized RL, there are slightly-different equations for value functions.

Entropy-Regularized Reinforcement Learning

Entropy is a quantity which, roughly speaking, says how random a random variable is. If a coin is weighted so that it almost always comes up heads, it has low entropy; if it’s evenly weighted and has a half chance of either outcome, it has high entropy.

Let x be a random variable with probability mass or density function P. The entropy H of x is computed from its distribution P according to

H(P) = \underE{x \sim P}{-\log P(x)}.

In entropy-regularized reinforcement learning, the agent gets a bonus reward at each time step proportional to the entropy of the policy at that timestep. This changes the RL problem to:

\pi^* = \arg \max_{\pi} \underE{\tau \sim \pi}{ \sum_{t=0}^{\infty} \gamma^t \bigg( R(s_t, a_t, s_{t+1}) + \alpha H\left(\pi(\cdot|s_t)\right) \bigg)},

where \alpha > 0 is the trade-off coefficient. (Note: we’re assuming an infinite-horizon discounted setting here, and we’ll do the same for the rest of this page.) We can now define the slightly-different value functions in this setting. V^{\pi} is changed to include the entropy bonuses from every timestep:

V^{\pi}(s) = \underE{\tau \sim \pi}{ \left. \sum_{t=0}^{\infty} \gamma^t \bigg( R(s_t, a_t, s_{t+1}) + \alpha H\left(\pi(\cdot|s_t)\right) \bigg) \right| s_0 = s}

Q^{\pi} is changed to include the entropy bonuses from every timestep except the first:

Q^{\pi}(s,a) = \underE{\tau \sim \pi}{ \left. \sum_{t=0}^{\infty} \gamma^t  R(s_t, a_t, s_{t+1}) + \alpha \sum_{t=1}^{\infty} \gamma^t H\left(\pi(\cdot|s_t)\right)\right| s_0 = s, a_0 = a}

With these definitions, V^{\pi} and Q^{\pi} are connected by:

V^{\pi}(s) = \underE{a \sim \pi}{Q^{\pi}(s,a)} + \alpha H\left(\pi(\cdot|s)\right)

and the Bellman equation for Q^{\pi} is

Q^{\pi}(s,a) &= \underE{s' \sim P \\ a' \sim \pi}{R(s,a,s') + \gamma\left(Q^{\pi}(s',a') + \alpha H\left(\pi(\cdot|s')\right) \right)} \\
&= \underE{s' \sim P}{R(s,a,s') + \gamma V^{\pi}(s')}.


The way we’ve set up the value functions in the entropy-regularized setting is a little bit arbitrary, and actually we could have done it differently (eg make Q^{\pi} include the entropy bonus at the first timestep). The choice of definition may vary slightly across papers on the subject.

Soft Actor-Critic

SAC concurrently learns a policy \pi_{\theta}, two Q-functions Q_{\phi_1}, Q_{\phi_2}, and a value function V_{\psi}.

Learning Q. The Q-functions are learned by MSBE minimization, using a target value network to form the Bellman backups. They both use the same target, like in TD3, and have loss functions:

L(\phi_i, {\mathcal D}) = \underset{(s,a,r,s',d) \sim {\mathcal D}}{{\mathrm E}}\left[
    \Bigg( Q_{\phi_i}(s,a) - \left(r + \gamma (1 - d) V_{\psi_{\text{targ}}}(s') \right) \Bigg)^2

The target value network, like the target networks in DDPG and TD3, is obtained by polyak averaging the value network parameters over the course of training.

Learning V. The value function is learned by exploiting (a sample-based approximation of) the connection between Q^{\pi} and V^{\pi}. Before we go into the learning rule, let’s first rewrite the connection equation by using the definition of entropy to obtain:

V^{\pi}(s) &= \underE{a \sim \pi}{Q^{\pi}(s,a)} + \alpha H\left(\pi(\cdot|s)\right) \\
&= \underE{a \sim \pi}{Q^{\pi}(s,a) - \alpha \log \pi(a|s)}.

The RHS is an expectation over actions, so we can approximate it by sampling from the policy:

V^{\pi}(s) \approx Q^{\pi}(s,\tilde{a}) - \alpha \log \pi(\tilde{a}|s), \;\;\;\;\; \tilde{a} \sim \pi(\cdot|s).

SAC sets up a mean-squared-error loss for V_{\psi} based on this approximation. But what Q-value do we use? SAC uses clipped double-Q like TD3 for learning the value function, and takes the minimum Q-value between the two approximators. So the SAC loss for value function parameters is:

L(\psi, {\mathcal D}) = \underE{s \sim \mathcal{D} \\ \tilde{a} \sim \pi_{\theta}}{\Bigg(V_{\psi}(s) - \left(\min_{i=1,2} Q_{\phi_i}(s,\tilde{a}) - \alpha \log \pi_{\theta}(\tilde{a}|s) \right)\Bigg)^2}.

Importantly, we do not use actions from the replay buffer here: these actions are sampled fresh from the current version of the policy.

Learning the Policy. The policy should, in each state, act to maximize the expected future return plus expected future entropy. That is, it should maximize V^{\pi}(s), which we expand out (as before) into

\underE{a \sim \pi}{Q^{\pi}(s,a) - \alpha \log \pi(a|s)}.

The way we optimize the policy makes use of the reparameterization trick, in which a sample from \pi_{\theta}(\cdot|s) is drawn by computing a deterministic function of state, policy parameters, and independent noise. To illustrate: following the authors of the SAC paper, we use a squashed Gaussian policy, which means that samples are obtained according to

\tilde{a}_{\theta}(s, \xi) = \tanh\left( \mu_{\theta}(s) + \sigma_{\theta}(s) \odot \xi \right), \;\;\;\;\; \xi \sim \mathcal{N}(0, I).


This policy has two key differences from the policies we use in the other policy optimization algorithms:

1. The squashing function. The \tanh in the SAC policy ensures that actions are bounded to a finite range. This is absent in the VPG, TRPO, and PPO policies. It also changes the distribution: before the \tanh the SAC policy is a factored Gaussian like the other algorithms’ policies, but after the \tanh it is not. (You can still compute the log-probabilities of actions in closed form, though: see the paper appendix for details.)

2. The way standard deviations are parameterized. In VPG, TRPO, and PPO, we represent the log std devs with state-independent parameter vectors. In SAC, we represent the log std devs as outputs from the neural network, meaning that they depend on state in a complex way. SAC with state-independent log std devs, in our experience, did not work. (Can you think of why? Or better yet: run an experiment to verify?)

The reparameterization trick allows us to rewrite the expectation over actions (which contains a pain point: the distribution depends on the policy parameters) into an expectation over noise (which removes the pain point: the distribution now has no dependence on parameters):

\underE{a \sim \pi_{\theta}}{Q^{\pi_{\theta}}(s,a) - \alpha \log \pi_{\theta}(a|s)} = \underE{\xi \sim \mathcal{N}}{Q^{\pi_{\theta}}(s,\tilde{a}_{\theta}(s,\xi)) - \alpha \log \pi_{\theta}(\tilde{a}_{\theta}(s,\xi)|s)}

To get the policy loss, the final step is that we need to substitute Q^{\pi_{\theta}} with one of our function approximators. The same as in TD3, we use Q_{\phi_1}. The policy is thus optimized according to

\max_{\theta} \underE{s \sim \mathcal{D} \\ \xi \sim \mathcal{N}}{Q_{\phi_1}(s,\tilde{a}_{\theta}(s,\xi)) - \alpha \log \pi_{\theta}(\tilde{a}_{\theta}(s,\xi)|s)},

which is almost the same as the DDPG and TD3 policy optimization, except for the stochasticity and entropy term.


SAC trains a stochastic policy with entropy regularization, and explores in an on-policy way. The entropy regularization coefficient \alpha explicitly controls the explore-exploit tradeoff, with higher \alpha corresponding to more exploration, and lower \alpha corresponding to more exploitation. The right coefficient (the one which leads to the stablest / highest-reward learning) may vary from environment to environment, and could require careful tuning.

At test time, to see how well the policy exploits what it has learned, we remove stochasticity and use the mean action instead of a sample from the distribution. This tends to improve performance over the original stochastic policy.


Our SAC implementation uses a trick to improve exploration at the start of training. For a fixed number of steps at the beginning (set with the start_steps keyword argument), the agent takes actions which are sampled from a uniform random distribution over valid actions. After that, it returns to normal SAC exploration.


spinup.sac(env_fn, actor_critic=<function mlp_actor_critic>, ac_kwargs={}, seed=0, steps_per_epoch=5000, epochs=100, replay_size=1000000, gamma=0.99, polyak=0.995, lr=0.001, alpha=0.2, batch_size=100, start_steps=10000, max_ep_len=1000, logger_kwargs={}, save_freq=1)[源代码]
  • env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
  • actor_critic

    A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent’s Tensorflow computation graph:

    Symbol Shape Description
    mu (batch, act_dim)
    Computes mean actions from policy
    given states.
    pi (batch, act_dim)
    Samples actions from policy given
    logp_pi (batch,)
    Gives log probability, according to
    the policy, of the action sampled by
    pi. Critical: must be differentiable
    with respect to policy parameters all
    the way through action sampling.
    q1 (batch,)
    Gives one estimate of Q* for
    states in x_ph and actions in
    q2 (batch,)
    Gives another estimate of Q* for
    states in x_ph and actions in
    q1_pi (batch,)
    Gives the composition of q1 and
    pi for states in x_ph:
    q1(x, pi(x)).
    q2_pi (batch,)
    Gives the composition of q2 and
    pi for states in x_ph:
    q2(x, pi(x)).
    v (batch,)
    Gives the value estimate for states
    in x_ph.
  • ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to SAC.
  • seed (int) – Seed for random number generators.
  • steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
  • epochs (int) – Number of epochs to run and train agent.
  • replay_size (int) – Maximum length of replay buffer.
  • gamma (float) – Discount factor. (Always between 0 and 1.)
  • polyak (float) –

    Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to:

    \theta_{\text{targ}} \leftarrow
\rho \theta_{\text{targ}} + (1-\rho) \theta

    where \rho is polyak. (Always between 0 and 1, usually close to 1.)

  • lr (float) – Learning rate (used for both policy and value learning).
  • alpha (float) – Entropy regularization coefficient. (Equivalent to inverse of reward scale in the original SAC paper.)
  • batch_size (int) – Minibatch size for SGD.
  • start_steps (int) – Number of steps for uniform-random action selection, before running real policy. Helps exploration.
  • max_ep_len (int) – Maximum length of trajectory / episode / rollout.
  • logger_kwargs (dict) – Keyword args for EpochLogger.
  • save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.



x Tensorflow placeholder for state input.
a Tensorflow placeholder for action input.
mu Deterministically computes mean action from the agent, given states in x.
pi Samples an action from the agent, conditioned on states in x.
q1 Gives one action-value estimate for states in x and actions in a.
q2 Gives the other action-value estimate for states in x and actions in a.
v Gives the value estimate for states in x.


注意:对于SAC,正确的评估策略由 mu 而不是 pi 给出。可以将策略 pi 视为探索策略,而将 mu 视为开发策略。