SAC Implementation

After implementing DQN-based algorithms and PPO with much help from existing implementations, I decided to try implementing SAC (Soft Actor-Critic) from scratch. Here is the code.

After going through the paper and writing out detailed pseudocode for the algorithm, I tried a few things to get the bare minimum working. The basic structure looks something like this:

  • Create the replay buffer with methods to update and sample
  • For the SAC algorithm:
    • Create the environment
    • Create the placeholders
    • Initialize the replay buffer
    • Define all policies
    • Define all networks (main value, target value, double Qs, policy)
    • Define all objectives (soft-Q, state value, policy)
    • Define all optimizers (policy, value)
    • Create session and initialize all variables
    • Define methods to update target and get actions
    • The main loop for training
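The replay buffer from the first step can be sketched in numpy. This is an illustrative sketch, not the post's actual code; the class and field names are my own:

```python
import numpy as np

class ReplayBuffer:
    """Minimal ring-buffer sketch with store and sample methods."""

    def __init__(self, obs_dim, act_dim, size):
        size = int(size)  # e.g. int(1e6); 1e6 itself is a float
        self.obs = np.zeros((size, obs_dim), dtype=np.float32)
        self.next_obs = np.zeros((size, obs_dim), dtype=np.float32)
        self.acts = np.zeros((size, act_dim), dtype=np.float32)
        self.rews = np.zeros(size, dtype=np.float32)   # note: 1-D
        self.dones = np.zeros(size, dtype=np.float32)  # note: 1-D
        self.ptr, self.count, self.max_size = 0, 0, size

    def store(self, obs, act, rew, next_obs, done):
        self.obs[self.ptr] = obs
        self.acts[self.ptr] = act
        self.rews[self.ptr] = rew
        self.next_obs[self.ptr] = next_obs
        self.dones[self.ptr] = float(done)  # bool -> float
        self.ptr = (self.ptr + 1) % self.max_size
        self.count = min(self.count + 1, self.max_size)

    def sample(self, batch_size=32):
        idxs = np.random.randint(0, self.count, size=batch_size)
        return dict(obs=self.obs[idxs], acts=self.acts[idxs],
                    rews=self.rews[idxs], next_obs=self.next_obs[idxs],
                    dones=self.dones[idxs])
```

Note that the rewards and dones arrays are deliberately 1-D while the observation and action arrays are 2-D; that asymmetry is exactly what causes the shape issues discussed below.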

Unsurprisingly, the code didn’t work, and I ran into a few issues early on. One key mistake was that I hadn’t made sure all tensor dimensions were aligned the way I wanted; dimension mismatches could occur in several places. Below are some fixes I tried.

  • I added a lot of tf.squeeze, which takes a tensor (and an optional axis) and returns a tensor of the same type with the dimensions of size 1 removed. This was useful because I defined the rewards and dones buffers with different shapes than the current-observations, next-observations, and actions buffers.
  • I reshaped current observations to match their placeholder dimensions
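The kind of mismatch squeezing fixes is easy to reproduce in numpy (illustrative shapes, not the exact code from the post):

```python
import numpy as np

rews = np.zeros(32)          # rewards stored with shape (32,)
q_vals = np.zeros((32, 1))   # a network output with shape (32, 1)

# Broadcasting silently produces a (32, 32) matrix instead of an
# elementwise difference -- a subtle source of wrong losses.
bad = rews - q_vals
assert bad.shape == (32, 32)

# Removing the size-1 dimension (tf.squeeze in the TF code)
# restores the intended elementwise (32,) result.
good = rews - np.squeeze(q_vals, axis=1)
assert good.shape == (32,)
```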

In addition to shape compatibility, I also needed to make sure the types were consistent. For example:

  • The buffer size cannot be 1e6, which is a Python float; it needs to be cast to an int
  • The done flag needs to be converted to a float to match its placeholder type
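Both casts are one-liners; here they are in plain Python/numpy (a sketch of the idea, not the post's exact code):

```python
import numpy as np

# 1e6 is a float literal; shape arguments must be integers.
buffer_size = int(1e6)
assert isinstance(buffer_size, int)

# Gym's step() returns done as a bool; cast it to a float so it
# matches a float placeholder before feeding it in.
done = True
done_f = np.float32(done)
assert done_f == 1.0
```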

Another thing I learned is the need to specify the order of variable computation through control dependencies.
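For example, the Polyak target-network update should only run after the value-network update has completed. A TF1-style sketch of how tf.control_dependencies enforces this (pseudocode; names like polyak, train_value_op, main_vars, and target_vars are placeholders, not the post's actual variables):

```python
# Sketch: force the value-network training op to finish before the
# Polyak-averaged target update executes within the same session run.
with tf.control_dependencies([train_value_op]):
    target_update = tf.group([
        tf.assign(v_targ, polyak * v_targ + (1 - polyak) * v_main)
        for v_main, v_targ in zip(main_vars, target_vars)
    ])
```

Without the dependency, TF1 is free to run both ops in any order within a single session call, so the target network could be updated from stale weights.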

Now that the code could run, I was eager to see how the agent performed in the HalfCheetah environment. Here is the plot: [bad-plot]

The strange behavior made me realize that there were issues with how I defined the policy. In the mlp_gaussian_policy method, I made the following changes (thanks to Josh):

# Squash the actions with tanh and correct the log-likelihood for the
# change of variables, using the numerically stable softplus form:
logp_pi -= tf.reduce_sum(2*(np.log(2) - pi - tf.nn.softplus(-2*pi)), axis=1)
pi = tf.tanh(pi)
mu = tf.tanh(mu)
return pi, mu, logp_pi
  • As noted in Appendix C of the paper, logp_pi cannot simply be the log-likelihood of actions sampled from a Gaussian, which is unbounded. To bound the actions to a finite interval, we apply tanh to the Gaussian samples and do the change of variables accordingly. However, computing log(1 - tanh(pi)^2) directly is numerically unstable when tanh saturates, so an equivalent expression written with softplus is used instead.
  • Use mu instead of logp for policy network
  • Return tanh(mu) for use as the deterministic policy
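The softplus form above is just a stable rewrite of the tanh change-of-variables term, log(1 - tanh(x)^2) = 2(log 2 - x - softplus(-2x)); the identity is easy to check numerically:

```python
import numpy as np

def softplus(z):
    # log(1 + exp(z)); fine for the moderate range used here
    return np.log1p(np.exp(z))

x = np.linspace(-5.0, 5.0, 101)

# Naive change-of-variables correction: log(1 - tanh(x)^2)
naive = np.log(1.0 - np.tanh(x) ** 2)

# Stable rewrite used in the policy code: 2*(log 2 - x - softplus(-2x))
stable = 2.0 * (np.log(2.0) - x - softplus(-2.0 * x))

assert np.allclose(naive, stable)
```

The naive form underflows to log(0) once tanh saturates for large |x|, which is why the rewritten version is preferred in practice.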

After fixing the policy, the performance became something like this: [bad-plot-2]

It did well for about 10 episodes and went downhill from there. To make debugging simpler, I also added tools to log various information and created a test environment for testing. After some minor fixes, I finally got the following performance: [good-plot]

Although I referenced the baselines implementation while debugging and consulted Josh about various bugs, I definitely learned more by attempting the implementation myself.


