SAC Implementation

After implementing DQN-based algorithms and PPO with much help from existing implementations, I decided to try implementing SAC (Soft Actor-Critic) from scratch. Here is the code.

After going through the paper and writing down a detailed pseudocode for the algorithm, I tried a few things to make the bare minimum work. The basic structure looks something like this:

• Create the replay buffer with methods to update and sample
• For the SAC algorithm:
  • Create the environment
  • Create the placeholders
  • Initialize the replay buffer
  • Define all policies
  • Define all networks (main value, target value, double Qs, policy)
  • Define all objectives (soft Q, state value, policy)
  • Define all optimizers (policy, value)
  • Create the session and initialize all variables
  • Define methods to update the target network and to get actions
  • Run the main training loop
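The replay buffer step above can be sketched as follows. This is my own minimal NumPy version, so the class and method names are illustrative rather than the post's actual code; note the rewards and dones buffers are deliberately 1-D while the observation and action buffers are 2-D, which matters later for shape bugs.

```python
import numpy as np

class ReplayBuffer:
    """Fixed-size FIFO buffer for (obs, act, rew, next_obs, done) transitions."""

    def __init__(self, obs_dim, act_dim, size):
        self.obs_buf = np.zeros((size, obs_dim), dtype=np.float32)
        self.next_obs_buf = np.zeros((size, obs_dim), dtype=np.float32)
        self.act_buf = np.zeros((size, act_dim), dtype=np.float32)
        self.rew_buf = np.zeros(size, dtype=np.float32)   # 1-D, unlike obs/act
        self.done_buf = np.zeros(size, dtype=np.float32)  # 1-D, stored as float
        self.ptr, self.count, self.max_size = 0, 0, size

    def update(self, obs, act, rew, next_obs, done):
        # Overwrite the oldest slot once the buffer is full.
        self.obs_buf[self.ptr] = obs
        self.act_buf[self.ptr] = act
        self.rew_buf[self.ptr] = rew
        self.next_obs_buf[self.ptr] = next_obs
        self.done_buf[self.ptr] = float(done)
        self.ptr = (self.ptr + 1) % self.max_size
        self.count = min(self.count + 1, self.max_size)

    def sample(self, batch_size):
        idx = np.random.randint(0, self.count, size=batch_size)
        return dict(obs=self.obs_buf[idx], act=self.act_buf[idx],
                    rew=self.rew_buf[idx], next_obs=self.next_obs_buf[idx],
                    done=self.done_buf[idx])
```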

Not to my surprise, the code didn’t work, and I ran into a few issues at first. One key mistake was that I hadn’t made sure all tensor dimensions were aligned the way I wanted, so dimension mismatches could occur in several places. Below are some fixes I tried.

• I added a lot of tf.squeeze calls, which take a tensor (and optionally an axis) and return a tensor of the same type with the size-1 dimensions removed. This is useful because I defined the rewards and dones buffers with a different shape than the current-observations, next-observations, and actions buffers.
• I reshaped current observations to match their placeholder dimensions
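To see why the squeezing matters: with a 1-D rewards batch and a (batch, 1) network output, broadcasting silently produces a matrix instead of a vector, which corrupts the Bellman target without raising an error. A small NumPy illustration (the shapes here are assumptions for the example):

```python
import numpy as np

q_values = np.zeros((32, 1), dtype=np.float32)   # network output: (batch, 1)
rewards = np.zeros((32,), dtype=np.float32)      # buffer sample:  (batch,)

# Without squeezing, broadcasting silently yields a (32, 32) matrix:
bad = rewards + q_values
assert bad.shape == (32, 32)

# Removing the size-1 axis (tf.squeeze(q, axis=1) in TensorFlow) fixes it:
good = rewards + np.squeeze(q_values, axis=1)
assert good.shape == (32,)
```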

In addition to shape compatibility, I also needed to make sure the types were consistent. For example:

• The buffer size cannot be 1e6, which is a float; it needs to be cast to int
• done needs to be converted to float to match its placeholder type
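Both fixes amount to explicit casts; a quick sketch of the idea:

```python
import numpy as np

buffer_size = int(1e6)        # 1e6 is a Python float; array sizes must be ints
assert isinstance(buffer_size, int)

done = True                   # the environment returns a Python bool
done_f = np.float32(done)     # cast to float to match a float32 placeholder
assert done_f == 1.0
```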

Another thing I learned is the need to specify the order of variable computation through control dependencies.
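As an illustration of why ordering matters (my assumption of the typical SAC case: the Polyak target-network update should run only after the value-network gradient step), here is the intended semantics in plain NumPy; in a TF1 graph, tf.control_dependencies is what pins this order:

```python
import numpy as np

rho = 0.995                    # Polyak averaging coefficient
main = np.array([1.0, 2.0])    # stand-in for the main network's weights
target = main.copy()           # target network starts as a copy

# 1) The value-network training step must run first
#    (stand-in for an optimizer step: a small shift of the weights).
main = main - 0.1

# 2) Only then the Polyak update of the target. In a TF1 graph nothing
#    guarantees this order unless you write something like:
#        with tf.control_dependencies([train_value_op]):
#            target_update = tf.group(...polyak assign ops...)
target = rho * target + (1 - rho) * main
```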

Now that the code could run, I was eager to see how the agent performed in the HalfCheetah environment. Here is the plot:

The strange behavior made me realize that there were issues with how I defined the policy. In the mlp_gaussian_policy method, I made the following changes (thanks to Josh):

logp_pi -= tf.reduce_sum(2*(np.log(2) - pi - tf.nn.softplus(-2*pi)), axis=1)
pi = tf.tanh(pi)
mu = tf.tanh(mu)
return pi, mu, logp_pi

• As noted in Appendix C of the paper, logp_pi cannot simply be the log-likelihood of the actions in pi under a Gaussian distribution, which is unbounded. To keep actions in a finite interval, we apply tanh to the Gaussian samples and apply the corresponding change of variables to the log-likelihood. The direct correction term log(1 − tanh(pi)²) is numerically unstable for large |pi|, so we use the algebraically equivalent form 2·(log 2 − pi − softplus(−2·pi)) instead.
• Use mu instead of logp for policy network
• Return tanh(mu) for use as the deterministic policy
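The softplus form above is mathematically identical to the naive change-of-variables correction log(1 − tanh(x)²), just rearranged to avoid computing 1 − tanh(x)², which underflows to 0 for large x. A quick NumPy check:

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus: log(1 + e^x)
    return np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))

x = np.linspace(-5, 5, 11)
naive = np.log(1 - np.tanh(x) ** 2)              # fine for moderate x
stable = 2 * (np.log(2) - x - softplus(-2 * x))  # the form used in the code
assert np.allclose(naive, stable)

# For large x, 1 - tanh(x)^2 underflows and the naive form blows up to -inf,
# while the rearranged form stays finite:
big = 50.0
assert np.isinf(np.log(1 - np.tanh(big) ** 2))
assert np.isfinite(2 * (np.log(2) - big - softplus(-2 * big)))
```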

After fixing the policy, the performance became something like this:

It did well for about 10 episodes and went downhill from there. To make debugging simpler, I also added tools to log various information and created a test environment for testing. After some minor fixes, I finally got the following performance:

Although I referenced the Baselines implementation while debugging and consulted Josh about various bugs, I definitely learned more by attempting the implementation myself.
