Q Learning Methods

This post assumes a basic understanding of reinforcement learning. I start my deep reinforcement learning journey with model-free algorithms, which assume that the agent does not have access to a model of the environment. The model here specifies the transition dynamics: given a state and an action, it gives the reward and the next state. Among model-free algorithms, I investigate Q-learning methods first. These methods learn a function approximator for the action-value function.

This is the paper that introduces Double Q-learning. Before diving into the Double Q-learning algorithm, let’s recap the general setup of DQN.

Basic setup:

  • Environment: Atari game emulator (stochastic and partially observable)
  • Agent: only observes the current screen (a vector of raw pixel values)
  • Action: one of a set of legal game actions
  • State: a finite sequence of actions and observations
  • Reward: change in game score
  • Transition: mapping from the current state and action to the next state
  • Goal: find the optimal policy, a mapping from states to actions that maximizes the expected sum of discounted future rewards
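
To make the objective concrete, here is a small sketch of the discounted return for one finite episode. The helper name `discounted_return`, the reward values, and the discount factor are illustrative assumptions, not taken from the paper or the repo.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite episode (hypothetical helper)."""
    total = 0.0
    # Work backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Made-up changes in game score over three time steps.
rewards = [1.0, 0.0, 2.0]
episode_return = discounted_return(rewards, gamma=0.9)
print(episode_return)  # 1.0 + 0.9 * 0.0 + 0.81 * 2.0, up to float rounding
```

The optimal policy is the one that maximizes the expected value of this quantity from every state.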

Main difference from DQN:

The major difference is in how the target values are calculated: we use the Main network to select the best actions and the Target network to evaluate them, instead of using the Target network for both as in DQN.

Algorithm Pseudocode (Detailed):

Initialize an empty replay memory buffer D, which includes:

observations, next observations, actions, rewards, and dones (episode-termination flags)

We also need the following to store and update our D:

pointer, size of D, maximum size of D, training batch size, and sample batch function.
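
As a rough illustration of how these pieces fit together, here is a minimal ring-buffer sketch in plain Python. The class name `ReplayBuffer`, its method names, and the dictionary keys are assumptions made for this example and may differ from the code in the repo.

```python
import random


class ReplayBuffer:
    """Fixed-size ring buffer of transitions (illustrative sketch)."""

    def __init__(self, max_size, batch_size):
        self.max_size = max_size        # maximum size of D
        self.batch_size = batch_size    # training batch size
        self.data = [None] * max_size   # pre-allocated slots
        self.ptr = 0                    # pointer: next slot to write
        self.size = 0                   # current number of stored transitions

    def store(self, obs, act, rew, next_obs, done):
        # Once the buffer is full, this overwrites the oldest transition.
        self.data[self.ptr] = (obs, act, rew, next_obs, done)
        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample_batch(self):
        # Uniform sampling (with replacement) over the stored transitions.
        idxs = [random.randrange(self.size) for _ in range(self.batch_size)]
        obs, acts, rews, next_obs, dones = zip(*(self.data[i] for i in idxs))
        return {"obs": obs, "acts": acts, "rews": rews,
                "next_obs": next_obs, "dones": dones}


# Toy usage: integers stand in for observations.
buf = ReplayBuffer(max_size=3, batch_size=2)
for t in range(5):
    buf.store(obs=t, act=0, rew=1.0, next_obs=t + 1, done=False)
batch = buf.sample_batch()
```

With a maximum size of 3 and five stored transitions, only the three most recent ones remain in the buffer.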

Initialize two copies of a network, Main network and Target network, along with their parameters.

#use main network to select best actions; q_vals should hold the Main network's Q-values for the next observations
best_acts = tf.argmax(q_vals, axis=1)
best_acts_one_hots = tf.one_hot(best_acts, n_actions)
#use target network to evaluate best actions
optimal_future_q = gamma * tf.reduce_sum(best_acts_one_hots * q_vals_targ, axis=1)
target = tf.stop_gradient(rewards_ph + (1 - tf.cast(dones_ph, dtype=tf.float32)) * optimal_future_q)
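
The same calculation can be sketched for a single transition in plain Python, side by side with the standard DQN target. The function names and the Q-values below are made up for illustration; `q_main_next` and `q_targ_next` stand for the Main and Target networks' outputs on the next observation.

```python
def argmax(values):
    """Index of the largest entry (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def double_dqn_target(reward, done, q_main_next, q_targ_next, gamma):
    # Main network selects the action, Target network evaluates it.
    best_act = argmax(q_main_next)
    future_q = q_targ_next[best_act]
    return reward + (1.0 - float(done)) * gamma * future_q


def dqn_target(reward, done, q_targ_next, gamma):
    # Standard DQN: Target network both selects and evaluates.
    return reward + (1.0 - float(done)) * gamma * max(q_targ_next)


# Assumed Q-values for one next observation with three actions.
q_main_next = [1.0, 3.0, 2.0]
q_targ_next = [4.0, 0.5, 1.0]

ddqn = double_dqn_target(1.0, False, q_main_next, q_targ_next, gamma=0.5)
dqn = dqn_target(1.0, False, q_targ_next, gamma=0.5)
print(ddqn)  # 1.25
print(dqn)   # 3.0
```

In this toy case the Target network alone would pick action 0, its own largest estimate, while Double DQN evaluates action 1, the one chosen by the Main network. Decoupling selection from evaluation in this way is what reduces the overestimation of Q-values.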

Main part of training:

Initialize various (hyper)parameters for our networks
Loop through the number of episodes:
    Initialize a sequence of observations
    Loop through the time steps of a single episode:
        Update our state with the sequence of observations
        Sample an action from the current policy (uniformly random at first, then epsilon-greedy)
        Sample the next observation from the environment given our current observations and chosen action
        Update our state with the new observation
        Update reward resulting from chosen action
        if replay buffer is full:
            Delete the oldest transition
        Append the new transition (observation, action, reward, next observation, done)
        Sample a batch from replay buffer uniformly
        For each (state, action, reward, next state) in the sample batch, construct target values by:
            Select the best actions according to the Main network
            Evaluate the best actions using the Target network
            Calculate target values using the Bellman equation
        Calculate the quadratic loss between the Main network's Q-values and the target values
        Perform a gradient descent step on the Main network's parameters
        Update the parameters of the Target network at a given frequency
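
The action-sampling step in the loop above can be sketched as follows. This is one possible reading of "from uniform to epsilon-greedy": epsilon starts at 1.0 (purely uniform actions) and decays linearly to a small value. The function names, the linear schedule, and its default values are assumptions for illustration, not the repo's settings.

```python
import random


def epsilon_at(step, eps_start=1.0, eps_end=0.1, decay_steps=1000):
    """Linearly anneal epsilon from eps_start to eps_end (assumed schedule)."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)


def select_action(q_values, epsilon):
    """Epsilon-greedy: explore with probability epsilon, else act greedily."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # uniform random action
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Made-up Q-values for one state with three actions.
q_values = [0.1, 0.9, 0.3]
greedy_action = select_action(q_values, epsilon=0.0)   # always greedy
random_action = select_action(q_values, epsilon=1.0)   # always uniform
```

With `epsilon=0.0` the greedy action (index 1 here) is always returned, and with `epsilon=1.0` every action is equally likely.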

Actual code:

Please see this GitHub repo.

Another algorithm in the Q-learning family is Dueling DQN. This is the paper that introduces the Dueling Network for Deep RL. The training algorithm for Dueling DQN is the same as Double DQN with experience replay and a target network; only the network architecture changes. For simplicity, I didn’t incorporate additional tricks such as prioritized experience replay.

The main improvement in Dueling DQN is a new architecture that separates the representation of state values from state-dependent action advantages. If we aggregate the two well, we still get a good estimate of the state-action value function. In essence, this architecture can learn which states are valuable without having to learn the effect of each action in each state.

Build the dueling architecture for our two networks:

  • one stream for the state value function v_net
  • one stream for the state-dependent action advantage function a_net
  • output our Q-values by combining the two estimators by subtracting the mean advantage:
    • v_net + a_net - mean(a_net)

To see this more concretely, the following code illustrates the changes to build the new architecture:

#advantage stream
a_net = tf.layers.dense(final_hidden, units=n_actions, activation=None)
#value stream
v_net = tf.layers.dense(final_hidden, units=1, activation=None)
#combine the two by subtracting the mean advantage
q_vals = v_net + (a_net - tf.reduce_mean(a_net, axis=1, keep_dims=True))
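
To see what the mean subtraction does, here is the same aggregation for a single state in plain Python. The function name `dueling_q_values` and the numbers are illustrative assumptions.

```python
def dueling_q_values(value, advantages):
    """Q(s, a) = V(s) + (A(s, a) - mean over actions of A(s, .))."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + (a - mean_adv) for a in advantages]


# Made-up outputs of the value and advantage streams for one state.
q = dueling_q_values(value=2.0, advantages=[1.0, 0.0, -1.0])
# Shifting every advantage by the same constant leaves Q unchanged.
q_shifted = dueling_q_values(value=2.0, advantages=[11.0, 10.0, 9.0])

print(q)          # [3.0, 2.0, 1.0]
print(q_shifted)  # [3.0, 2.0, 1.0]
```

Because the centered advantages always average to zero, the mean of the Q-values equals the value stream's output. This pins down how Q is split between the two streams, which a plain sum `v_net + a_net` would leave ambiguous.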

Actual code:

Please see this GitHub repo.


