This week I tried implementing the Soft Actor-Critic (SAC) algorithm. It is a model-free, off-policy algorithm that combines ideas from Q-learning and policy optimization methods.

Paper: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Authors: Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine

Date: Aug 2018

Topic: Model-free RL

## Motivation:

Despite their success, model-free deep reinforcement learning algorithms often suffer from poor sample efficiency and brittle convergence. As a result, they need careful hyperparameter tuning, which makes it hard to carry their good performance over to more complex tasks.

## Main problem:

The main problem is to overcome these limitations (stability, sample efficiency, data reusability) so that deep reinforcement learning can be applied to challenging real-world systems, such as robot manipulation and locomotion.

## Main contribution:

This paper introduces the soft actor-critic (SAC) algorithm, an off-policy actor-critic method based on the maximum entropy reinforcement learning framework. The high-level idea is that the actor maximizes not only the expected reward but also the policy’s entropy (its randomness). This means that **among all policies that give good rewards, the actor prefers the most random one, so that it explores a wider range of actions**. To do this, the conventional objective function is augmented with an entropy term.

The standard reinforcement learning objective is the expected sum of rewards, and the goal is to learn a policy that maximizes it. The augmented objective additionally rewards the policy for keeping its entropy high at each state it visits.

The additional entropy term in the objective, weighted by the temperature $\alpha$:

$$\alpha H(\pi (\cdot | s_t))$$
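For reference, the paper writes the full maximum entropy objective as the standard expected return plus this term at every time step, where $\rho_\pi$ denotes the state-action distribution induced by the policy:

$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}[r(s_t, a_t) + \alpha H(\pi(\cdot | s_t))]$$

Setting $\alpha = 0$ removes the entropy term and gives back the standard objective.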

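The algorithm essentially learns four parameter vectors using stochastic gradient descent: one for the soft value function ($\psi$), one for each of the two Q-functions ($\theta_1$ and $\theta_2$, written simply as $\theta$ in the equations below), and one for the policy ($\phi$).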

## Soft value function

The soft value function is trained to minimize the squared residual error:

$$J_V(\psi)=\mathbb{E}_{s_t \sim D} [\frac{1}{2} (V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}[Q_\theta(s_t, a_t) - \log \pi_\phi(a_t | s_t)])^2]$$

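where $D$ is our replay buffer. The temperature $\alpha$ does not appear explicitly because the paper absorbs it into the reward scale.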
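As a rough illustration of this loss for a single state, here is a minimal pure-Python sketch. It assumes a one-dimensional Gaussian policy and estimates the inner expectation by sampling; the names `gaussian_log_prob`, `soft_value_target`, `value_loss`, and the callable `q_fn` are hypothetical and not taken from the paper or any library.

```python
import math
import random


def gaussian_log_prob(a, mu, sigma):
    """Log-density of a 1-D Gaussian N(mu, sigma^2) evaluated at a."""
    return (-0.5 * ((a - mu) / sigma) ** 2
            - math.log(sigma)
            - 0.5 * math.log(2 * math.pi))


def soft_value_target(q_fn, s, mu, sigma, n_samples=1, rng=random):
    """Monte Carlo estimate of E_{a ~ pi}[Q(s, a) - log pi(a | s)].

    Assumption: pi(. | s) is Gaussian with mean mu and std sigma.
    """
    total = 0.0
    for _ in range(n_samples):
        a = rng.gauss(mu, sigma)  # sample an action from the current policy
        total += q_fn(s, a) - gaussian_log_prob(a, mu, sigma)
    return total / n_samples


def value_loss(v_pred, target):
    """Squared residual 0.5 * (V(s) - target)^2 for one state."""
    return 0.5 * (v_pred - target) ** 2
```

Note that the actions are sampled from the current policy rather than read from the replay buffer; only the states come from $D$.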

## Soft Q-function

The soft Q-function parameters can be trained to minimize the soft Bellman residual:

$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t)\sim D}[\frac{1}{2}(Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t))^2]$$

where

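$$\hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p}[V_{\bar{\psi}}(s_{t+1})]$$

Here $\gamma$ is the discount factor and $V_{\bar{\psi}}$ is a target value network whose weights $\bar{\psi}$ are an exponential moving average of $\psi$.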
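A minimal sketch of the target and the loss on a batch of replayed transitions might look like the following. The expectation over the next state is replaced by the single next state stored with each transition. The function names, the `(s, a, r, s_next, done)` tuple layout, and the `done` flag for terminal states are assumptions made for illustration; they do not appear in the equations above.

```python
def soft_q_target(reward, next_value_target, gamma=0.99, done=False):
    """Q_hat = r + gamma * V_target(s_next), using one sampled next state.

    Assumption: bootstrapping is skipped on terminal transitions (done=True).
    """
    return reward + (0.0 if done else gamma * next_value_target)


def q_loss(q_pred, q_hat):
    """Soft Bellman residual 0.5 * (Q(s, a) - Q_hat)^2 for one transition."""
    return 0.5 * (q_pred - q_hat) ** 2


def batch_q_loss(batch, q_fn, v_target_fn, gamma=0.99):
    """Mean loss over (s, a, r, s_next, done) tuples sampled from the buffer."""
    losses = [
        q_loss(q_fn(s, a), soft_q_target(r, v_target_fn(s_next), gamma, done))
        for (s, a, r, s_next, done) in batch
    ]
    return sum(losses) / len(losses)
```

With two Q-functions, the same loss would simply be computed once per Q-function against the shared target.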

## Policy

The policy is trained with the reparameterization trick: an action is computed as a deterministic function of the state, the policy parameters, and independent noise, $a_t = f_\phi(\epsilon_t; s_t)$. The policy objective is then:

$$J_\pi(\phi) = \mathbb{E}_{s_t\sim D, \epsilon_t\sim N}[\log \pi_\phi(f_\phi(\epsilon_t;s_t)|s_t)-Q_\theta(s_t, f_\phi(\epsilon_t; s_t))]$$
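To make the reparameterization concrete, here is a small sketch for a one-dimensional Gaussian policy, where $f_\phi(\epsilon; s) = \mu + \sigma \epsilon$. It assumes the policy network has already produced `mu` and `log_std` for the state, and it leaves out the tanh squashing the paper uses to bound actions. All names (`sample_action`, `log_prob`, `policy_loss`, `q_fn`) are hypothetical.

```python
import math


def sample_action(mu, log_std, eps):
    """a = f_phi(eps; s): deterministic given the noise eps ~ N(0, 1)."""
    return mu + math.exp(log_std) * eps


def log_prob(a, mu, log_std):
    """Log-density of N(mu, exp(log_std)^2) evaluated at a."""
    std = math.exp(log_std)
    return (-0.5 * ((a - mu) / std) ** 2
            - log_std
            - 0.5 * math.log(2 * math.pi))


def policy_loss(q_fn, s, mu, log_std, eps_batch):
    """Sample estimate of E_eps[log pi(a | s) - Q(s, a)] with a = f_phi(eps; s)."""
    total = 0.0
    for eps in eps_batch:
        a = sample_action(mu, log_std, eps)
        total += log_prob(a, mu, log_std) - q_fn(s, a)
    return total / len(eps_batch)
```

Because the action is a differentiable function of `mu` and `log_std`, an autodiff framework could backpropagate through `q_fn` into the policy parameters, which is the point of the trick.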

## Algorithm
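
In outline, each gradient step in the paper's algorithm takes one stochastic gradient step on each of $\psi$, $\theta_1$, $\theta_2$, and $\phi$, and then moves the target weights $\bar{\psi}$ toward $\psi$ with an exponential moving average. The sketch below shows only that update order on plain lists of floats. The gradients are assumed to be computed elsewhere, and the function names, the dict layout, and the default `tau` are illustrative assumptions.

```python
def sgd_step(params, grads, lr):
    """Plain gradient descent step on a list of floats."""
    return [p - lr * g for p, g in zip(params, grads)]


def soft_update(target_params, params, tau=0.005):
    """psi_bar <- tau * psi + (1 - tau) * psi_bar, element-wise."""
    return [tau * p + (1.0 - tau) * t for t, p in zip(target_params, params)]


def sac_gradient_step(params, grads, lrs, tau=0.005):
    """One update of all four parameter vectors plus the target weights.

    params: dict with keys 'psi', 'theta1', 'theta2', 'phi', 'psi_bar'.
    grads, lrs: dicts with keys 'psi', 'theta1', 'theta2', 'phi'
    (gradients of J_V, J_Q, J_Q, J_pi, supplied by the caller).
    """
    new = {
        key: sgd_step(params[key], grads[key], lrs[key])
        for key in ("psi", "theta1", "theta2", "phi")
    }
    # The target value network tracks the freshly updated value network.
    new["psi_bar"] = soft_update(params["psi_bar"], new["psi"], tau)
    return new
```

In a full training loop this step would alternate with environment steps that add new transitions to the replay buffer.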

## Questions:

- Is there a way to automatically adjust the temperature parameter (or entropy regularization coefficient) $\alpha$, since we may require less exploration after the initial period?
- At test time it is sometimes useful to take the mean action instead of sampling from the policy distribution. Would it be better to gradually decay the policy's stochasticity over time rather than switching to the mean (a deterministic policy)?