After implementing DQN-based algorithms and PPO with much help from existing implementations, I decided to try implementing SAC from scratch. Here is the code.
After going through the paper and writing down a detailed pseudocode for the algorithm, I tried a few things to make the bare minimum work. The basic structure looks something like this:
- Create the replay buffer with methods to update and sample
- For the SAC algorithm:
  - Create the environment
  - Create the placeholders
  - Initialize the replay buffer
  - Define all policies
  - Define all networks (main value, target value, double Qs, policy)
  - Define all objectives (soft-Q, state value, policy)
  - Define all optimizers (policy, value)
  - Create the session and initialize all variables
  - Define methods to update the target network and get actions
- The main loop for training
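A minimal version of the replay buffer described above might look like the following. This is a NumPy sketch, not my exact code; the field names, dtypes, and default batch size are assumptions:

```python
import numpy as np

class ReplayBuffer:
    """FIFO buffer storing (obs, act, rew, next_obs, done) transitions."""

    def __init__(self, obs_dim, act_dim, size):
        # Note: size must be an int -- int(1e6), not 1e6.
        self.obs_buf = np.zeros((size, obs_dim), dtype=np.float32)
        self.next_obs_buf = np.zeros((size, obs_dim), dtype=np.float32)
        self.act_buf = np.zeros((size, act_dim), dtype=np.float32)
        # Rewards and dones are stored as flat 1-D arrays, unlike the
        # 2-D observation/action buffers -- the source of squeeze bugs later.
        self.rew_buf = np.zeros(size, dtype=np.float32)
        self.done_buf = np.zeros(size, dtype=np.float32)
        self.ptr, self.size, self.max_size = 0, 0, size

    def update(self, obs, act, rew, next_obs, done):
        self.obs_buf[self.ptr] = obs
        self.act_buf[self.ptr] = act
        self.rew_buf[self.ptr] = rew
        self.next_obs_buf[self.ptr] = next_obs
        self.done_buf[self.ptr] = float(done)  # cast bool -> float
        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample(self, batch_size=100):
        idxs = np.random.randint(0, self.size, size=batch_size)
        return dict(obs=self.obs_buf[idxs],
                    act=self.act_buf[idxs],
                    rew=self.rew_buf[idxs],
                    next_obs=self.next_obs_buf[idxs],
                    done=self.done_buf[idxs])
```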
Not to my surprise, the code didn’t work, and I encountered a few issues initially. One key mistake was that I hadn’t made sure all tensor dimensions were aligned the way I wanted them to be; dimension mismatches could occur in various places. Below are some fixes I tried.
- I added a lot of tf.squeeze, which takes a tensor (and an optional axis) and returns a tensor of the same type with all dimensions of size 1 removed. This is useful since I defined the rewards and dones buffers differently from the current observations, next observations, and actions buffers.
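The kind of bug this fixes can be seen with NumPy’s equivalent of tf.squeeze: if a quantity comes out of the network with a trailing size-1 axis while the buffer stores it flat, broadcasting silently produces the wrong shape instead of raising an error (shapes below are illustrative):

```python
import numpy as np

rew = np.ones((4, 1), dtype=np.float32)   # e.g. a network output of shape (batch, 1)
target = np.arange(4, dtype=np.float32)   # a buffer slice of shape (batch,)

bad = rew - target                        # broadcasts to (4, 4): a silent shape bug
good = np.squeeze(rew, axis=1) - target   # shape (4,), as intended

print(bad.shape, good.shape)  # → (4, 4) (4,)
```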
- I reshaped current observations to match their placeholder dimensions
In addition to shape compatibility, I also needed to make sure the types were consistent:
- The buffer size cannot be 1e6, which is a float; it needs to be cast to int
- done needs to be converted to float to match its placeholder dtype
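Both type fixes are one-liners; a sketch of the casts (assuming a Gym-style step that returns a Python bool for done):

```python
import numpy as np

buffer_size = int(1e6)    # 1e6 is a float literal; array sizes must be ints

done = True               # Gym returns a Python bool
done = np.float32(done)   # cast to float to match the placeholder dtype

print(buffer_size, done)  # → 1000000 1.0
```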
Another thing I learned is the need to specify the order of variable computation through control dependencies.
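In SAC this typically matters for the target-network update: in a TF1 graph, the polyak-averaging op has to be wrapped in tf.control_dependencies so it runs only after the value-network gradient step. Here is a NumPy sketch of the polyak step itself; the tau value and names are assumptions:

```python
import numpy as np

def polyak_update(target_params, main_params, tau=0.005):
    # In-place soft update: target <- tau * main + (1 - tau) * target.
    # In TF1 this op would be created inside
    # tf.control_dependencies([value_train_op]) so it is guaranteed to
    # run only after the value-network optimizer step.
    for t, m in zip(target_params, main_params):
        t *= (1.0 - tau)
        t += tau * m

target = [np.zeros(3)]
main = [np.ones(3)]
polyak_update(target, main, tau=0.5)
print(target[0])  # → [0.5 0.5 0.5]
```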
Now that the code could run, I was eager to see how the agent performed in the HalfCheetah environment. Here is the plot:
The strange behavior made me realize that there were issues with how I defined the policy. In the
mlp_gaussian_policy method, I made the following changes (thanks to Josh):
```python
logp_pi -= tf.reduce_sum(2*(np.log(2) - pi - tf.nn.softplus(-2*pi)), axis=1)
pi = tf.tanh(pi)
mu = tf.tanh(mu)
return pi, mu, logp_pi
```
- As noted in Appendix C of the paper, logp_pi cannot simply be the log-likelihood of the actions in pi under a Gaussian distribution, since that distribution is unbounded. To bound the actions to a finite interval, we apply tanh to the Gaussian samples and apply the corresponding change of variables to the log-likelihood. However, computing the tanh correction log(1 - tanh(pi)^2) directly is numerically unstable in TensorFlow, so we use an equivalent expression based on softplus instead.
- Use mu instead of logp for policy network
- Return tanh(mu) for use of deterministic policy
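The softplus expression is an exact, numerically stable rewrite of the tanh change-of-variables correction: log(1 - tanh(x)^2) = 2(log 2 - x - softplus(-2x)). The identity is easy to check numerically (a NumPy sketch):

```python
import numpy as np

def softplus(x):
    # Stable softplus: log(1 + exp(x)) computed without overflow.
    return np.logaddexp(0.0, x)

x = np.linspace(-5, 5, 11)
direct = np.log(1 - np.tanh(x)**2)              # naive form; underflows for large |x|
stable = 2 * (np.log(2) - x - softplus(-2*x))   # form used in the policy code

print(np.allclose(direct, stable))  # → True
```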
After fixing the policy, the performance became something like this:
It did well for about 10 episodes and went downhill ever since. To make debugging simpler, I also added tools to log various information and created a test environment for testing. After some minor fixes, I finally got the following performance:
Although I referenced the baselines implementation for debugging and consulted Josh for various bugs, I definitely learned more by attempting to implement it myself.