This post assumes a basic understanding of reinforcement learning. I start my deep reinforcement learning journey with model-free algorithms. Model-free methods assume that the agent does not have access to a model of the environment. The model here specifies the transition dynamics, which give the reward and the next state for a given state and action. Among model-free algorithms, I investigate Q-learning methods first. These methods learn a function approximator for the action-value function.
This is the paper that introduces Double Q-learning. Before diving into the Double Q-learning algorithm, let’s recap the general setup of DQN.
- Environment: Atari game emulator (stochastic and partially observable)
- Agent: only observes the current screen (vector of raw pixel values)
- Action: a set of legal game actions
- State: sequences of actions and observations (finite)
- Reward: change in game score
- Transition: mapping from current state and action to the next state
- Goal: find the optimal policy, a mapping from state to action that maximizes the expected sum of discounted future rewards
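To make the objective in the last bullet concrete, here is a minimal sketch of the discounted return for a single episode. The helper name `discounted_return` and the default `gamma=0.99` are illustrative assumptions, not values taken from the paper or the repo.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over one episode's rewards (hypothetical helper)."""
    total = 0.0
    # Iterate backwards so each earlier step applies one more factor of gamma
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three steps with reward 1 each and gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

A smaller `gamma` makes the agent care less about rewards that arrive late in the episode.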
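Main difference from DQN:
When calculating the target values, Double DQN uses the Main network to select the best next action and the Target network only to evaluate that action. DQN instead uses the Target network for both selection and evaluation, which tends to overestimate action values.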
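Written as equations, the two targets differ only in which network picks the action. The notation is my own assumption rather than something from the repo: $\theta$ for the Main network parameters, $\theta^{-}$ for the Target network parameters, $\gamma$ for the discount factor, and $d$ for the done flag of the transition $(s, a, r, s')$.

```latex
\begin{align*}
y^{\text{DQN}} &= r + \gamma\,(1 - d)\,\max_{a'} Q(s', a'; \theta^{-}) \\
y^{\text{Double DQN}} &= r + \gamma\,(1 - d)\,Q\bigl(s', \arg\max_{a'} Q(s', a'; \theta);\ \theta^{-}\bigr)
\end{align*}
```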
Algorithm Pseudocode (Detailed):
Initialize an empty replay memory buffer D, which stores the transitions (state, action, reward, next state) collected from the environment.
We also need the following to store and update D:
- size of D
- maximum size of D
- training batch size
- sample batch function
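The buffer bookkeeping above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the class name `ReplayBuffer`, the method names, and the choice to also store a `done` flag are hypothetical and may differ from the linked repo.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO buffer of transitions (illustrative sketch)."""

    def __init__(self, max_size):
        # A deque with maxlen drops the oldest item automatically when full
        self.buffer = deque(maxlen=max_size)

    def __len__(self):
        return len(self.buffer)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample_batch(self, batch_size):
        # Uniform sampling without replacement within one batch
        batch = random.sample(list(self.buffer), batch_size)
        # Transpose the list of transitions into per-field tuples
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones


buf = ReplayBuffer(max_size=3)
for t in range(5):
    buf.store(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
print(len(buf))  # 3, the two oldest transitions were evicted
```

Using `deque(maxlen=...)` folds the "delete the oldest, append the newest" step into a single `append` call.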
Initialize two copies of a network, the Main network and the Target network, along with their parameters.
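    # assumes TensorFlow 1.x (`import tensorflow as tf`); the tensors below are defined elsewhere
    # q_vals: Main network Q-values, q_vals_targ: Target network Q-values
    # (for a Double DQN target, both should be computed on the next-state batch)

    # use the Main network to select the best actions
    best_acts = tf.argmax(q_vals, axis=1)
    best_acts_one_hots = tf.one_hot(best_acts, n_actions)

    # use the Target network to evaluate the best actions
    optimal_future_q = gamma * tf.reduce_sum(best_acts_one_hots * q_vals_targ, axis=1)

    # Bellman target: no bootstrapping on terminal transitions, no gradient through the target
    target = tf.stop_gradient(rewards_ph + (1 - tf.cast(dones_ph, dtype=tf.float32)) * optimal_future_q)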
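To check the arithmetic of the snippet above by hand, here is a small framework-free sketch of the same target computation on Python lists. The function name `double_dqn_targets` and the toy numbers are assumptions made for illustration only.

```python
def double_dqn_targets(rewards, dones, q_next_main, q_next_targ, gamma=0.99):
    """Double DQN targets for a batch, using plain lists (hypothetical helper).

    q_next_main / q_next_targ: per-sample lists of next-state Q-values
    from the Main and Target networks respectively.
    """
    targets = []
    for r, done, q_main, q_targ in zip(rewards, dones, q_next_main, q_next_targ):
        # The Main network selects the action...
        best_act = max(range(len(q_main)), key=lambda a: q_main[a])
        # ...and the Target network evaluates it
        future_q = gamma * q_targ[best_act]
        # No bootstrapping past a terminal transition
        targets.append(r + (0.0 if done else future_q))
    return targets


targets = double_dqn_targets(
    rewards=[1.0, 0.0],
    dones=[False, True],
    q_next_main=[[1.0, 3.0], [2.0, 0.5]],
    q_next_targ=[[4.0, 2.0], [9.0, 9.0]],
    gamma=0.5,
)
print(targets)  # [2.0, 0.0]
```

For the first sample, plain DQN would instead take the Target network's own maximum (4.0) and produce 1.0 + 0.5 * 4.0 = 3.0, which shows how the two targets can diverge.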
Main part of training:
    Initialize various (hyper)parameters for our networks
    Loop through the number of episodes:
        Initialize a sequence of observations
        Loop through the time steps of a single episode:
            Update our state with the sequence of observations
            Sample an action from current policy (from uniform to epsilon-greedy)
            Sample the next observation from the environment given our current observations and chosen action
            Update our state with the new observation
            Update reward resulting from chosen action
            if replay buffer is full:
                Delete the oldest observation
            Append the new observation
            Sample a batch from replay buffer uniformly
            For each (state, action, reward, next state) in the sample batch, construct target values by:
                Select the best actions according to the Main network
                Evaluate the best actions using the Target network
                Calculate target values using Bellman equation
            Calculate quadratic loss function using target values
            Perform gradient descent
            Update parameters for the Target network for a given frequency
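The "from uniform to epsilon-greedy" step in the loop can be sketched as follows. The linear annealing schedule, the helper names `epsilon_at` and `select_action`, and all default values are my own assumptions; the loop above does not specify how epsilon is decayed.

```python
import random


def epsilon_at(step, eps_start=1.0, eps_end=0.1, decay_steps=10000):
    """Linearly anneal epsilon from eps_start to eps_end (assumed schedule)."""
    frac = min(step / decay_steps, 1.0)
    return (1.0 - frac) * eps_start + frac * eps_end


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


print(epsilon_at(0))      # 1.0, fully uniform exploration at the start
print(epsilon_at(20000))  # 0.1, mostly greedy after the decay period
print(select_action([0.2, 0.9, 0.1], epsilon=0.0))  # 1
```

With `epsilon=1.0` every action is drawn uniformly, which matches the purely random policy at the start of training.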
Please see this Github repo.
Another algorithm in the Q-learning family is Dueling DQN. This is the paper that introduces the Dueling Network for Deep RL. The algorithm for Dueling DQN is the same as Double DQN with experience replay and a target network; only the network architecture changes. For simplicity, I didn’t incorporate additional tricks such as prioritized experience replay.
The main improvement in Dueling DQN is a new architecture that separates the representation of state values from the state-dependent action advantages. If we aggregate the two streams well, we still get a good estimate of the state-action value function. In essence, the new architecture can learn which states are valuable without having to learn the effect of each action in each state.
Build the dueling architecture for our two networks:
- one stream for the state value function
- one stream for the state-dependent action advantage function
- output our Q-values by combining the two estimators, subtracting the mean advantage: v_net + a_net - mean(a_net)
To see this more concretely, the following code illustrates the changes to build the new architecture:
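    # assumes TensorFlow 1.x (`import tensorflow as tf`); final_hidden and n_actions are defined elsewhere
    # advantage stream: one output per action
    a_net = tf.layers.dense(final_hidden, units=n_actions, activation=None)
    # value stream: a single scalar per state
    v_net = tf.layers.dense(final_hidden, units=1, activation=None)
    # combine the two by subtracting the mean advantage
    # (newer TensorFlow releases spell the keep_dims argument `keepdims`)
    q_vals = v_net + (a_net - tf.reduce_mean(a_net, axis=1, keep_dims=True))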
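The aggregation step is easy to verify on toy numbers without TensorFlow. The sketch below is a plain-Python stand-in for the last line of the snippet, applied to a single state; the function name `dueling_q_values` and the inputs are hypothetical.

```python
def dueling_q_values(value, advantages):
    """Combine a scalar state value and per-action advantages into Q-values."""
    mean_adv = sum(advantages) / len(advantages)
    # Centering the advantages means the average of the returned
    # Q-values equals `value`
    return [value + (a - mean_adv) for a in advantages]


q = dueling_q_values(value=2.0, advantages=[1.0, 3.0, 5.0])
print(q)  # [0.0, 2.0, 4.0]
```

Because the mean is subtracted, shifting every advantage by the same constant leaves the Q-values unchanged, so the value stream alone carries the overall level of the state.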
Please see this Github repo.