I assume some basic understanding of reinforcement learning. I start my deep reinforcement learning journey with model-free algorithms. Model-free methods assume that the agent does not have access to a model of the environment. The model here specifies the transition dynamics and rewards: given a state and an action, it gives you the reward and the next state. Among model-free algorithms, I investigate Q-learning methods first. These methods learn a function approximator for the action-value function.

This is the paper that introduces Double Q-learning. Before diving into the Double Q-learning algorithm, let’s recap the general setup of DQN.

### Basic setup:

- **Environment**: Atari game emulator (stochastic and partially observable)
- **Agent**: only observes the current screen (vector of raw pixel values)

### MDP:

- **Action**: a set of legal game actions
- **State**: sequences of actions and observations (finite)
- **Reward**: change in game score
- **Transition**: mapping from current state and action to the next state

### Goal:

- Find the optimal policy, i.e. the mapping from states to actions that maximizes the expected sum of discounted future rewards
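To make the objective concrete, here is a minimal sketch of the quantity the policy tries to maximize: the discounted return of one episode. The helper name `discounted_return` and the example rewards are made up for illustration; `gamma` is the discount factor.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode (hypothetical helper)."""
    total = 0.0
    # Iterate backwards so each step computes G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total

rewards = [1.0, 0.0, 2.0]  # made-up per-step rewards
ret = discounted_return(rewards, gamma=0.5)
print(ret)  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
```

The optimal policy is the one that maximizes the expectation of this quantity over the randomness of the environment.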

### Main difference from DQN:

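The major difference lies in how the target values are calculated. In DQN, the Target network both selects and evaluates the best next action. Here, we use the Main network to select the best actions and use the Target network only to evaluate them, which reduces the overestimation of Q-values.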
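The following sketch contrasts the two targets on a single transition. It is plain Python rather than the TensorFlow graph used in this post, and the function names and Q-values are invented for illustration only.

```python
def dqn_target(reward, done, gamma, q_targ_next):
    # DQN: the Target network both selects and evaluates the next action
    return reward + (1.0 - done) * gamma * max(q_targ_next)

def double_dqn_target(reward, done, gamma, q_main_next, q_targ_next):
    # Double DQN: the Main network selects the action ...
    best_act = max(range(len(q_main_next)), key=lambda a: q_main_next[a])
    # ... and the Target network evaluates it
    return reward + (1.0 - done) * gamma * q_targ_next[best_act]

q_main_next = [1.0, 3.0, 2.0]  # made-up Main network Q-values for the next state
q_targ_next = [4.0, 1.5, 2.5]  # made-up Target network Q-values for the next state

y_dqn = dqn_target(1.0, 0.0, 0.5, q_targ_next)
y_ddqn = double_dqn_target(1.0, 0.0, 0.5, q_main_next, q_targ_next)
print(y_dqn, y_ddqn)  # 3.0 1.75
```

In this toy case the Target network alone would pick action 0, while the Main network picks action 1, so the Double DQN target is lower. When `done` is 1, both targets reduce to the reward.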

### Algorithm Pseudocode (Detailed):

#### Initialize an empty replay memory buffer `D`

`D` includes `observations`, `next observations`, `actions`, `rewards` and `dones`.

#### We also need the following to `store` and update `D`

A `pointer`, the current `size` of `D`, the `maximum size` of `D`, the training `batch size`, and a `sample batch` function.
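As a rough illustration of how these pieces fit together, here is a minimal circular replay buffer in plain Python. It is a sketch under my own assumptions (class name, tuple layout, sampling with replacement), not the code from the repo.

```python
import random

class ReplayBuffer:
    """Fixed-size circular buffer (illustrative sketch, not the repo's code)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.data = [None] * max_size
        self.pointer = 0  # index where the next transition is written
        self.size = 0     # number of transitions currently stored

    def store(self, obs, act, rew, next_obs, done):
        # Once the buffer is full, this overwrites the oldest transition
        self.data[self.pointer] = (obs, act, rew, next_obs, done)
        self.pointer = (self.pointer + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample_batch(self, batch_size):
        # Uniform sampling (with replacement) over the filled part of the buffer
        idxs = [random.randrange(self.size) for _ in range(batch_size)]
        obs, acts, rews, next_obs, dones = zip(*(self.data[i] for i in idxs))
        return {"observations": obs, "actions": acts, "rewards": rews,
                "next_observations": next_obs, "dones": dones}

buf = ReplayBuffer(max_size=3)
for t in range(5):
    buf.store(obs=t, act=0, rew=1.0, next_obs=t + 1, done=False)
batch = buf.sample_batch(batch_size=4)
```

After five writes into a buffer of size three, only the three most recent transitions remain, and `pointer` has wrapped around to index 2.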

#### Initialize two copies of a network

A `Main` network and a `Target` network, along with their parameters.

```
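# Fragment in TensorFlow 1.x style: q_vals, q_vals_targ, rewards_ph, dones_ph,
# gamma and n_actions are defined elsewhere in the graph.
# Use the Main network to select the best actions (for Double DQN, q_vals
# should be the Main network's output on the next observations).
best_acts = tf.argmax(q_vals, axis=1)
best_acts_one_hots = tf.one_hot(best_acts, n_actions)
# Use the Target network to evaluate the selected actions
optimal_future_q = gamma * tf.reduce_sum(best_acts_one_hots * q_vals_targ, axis=1)
# Bellman target; terminal transitions (done = 1) keep only the reward
target = tf.stop_gradient(rewards_ph + (1 - tf.cast(dones_ph, dtype=tf.float32)) * optimal_future_q)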
```

#### Main part of training:

```
Initialize various (hyper)parameters for our networks
Loop through the number of episodes:
    Initialize a sequence of observations
    Loop through the time steps of a single episode:
        Update our state with the sequence of observations
        Sample an action from current policy (from uniform to epsilon-greedy)
        Sample the next observation from the environment given our current observations and chosen action
        Update our state with the new observation
        Update reward resulting from chosen action
        if replay buffer is full:
            Delete the oldest observation
        Append the new observation
        Sample a batch from replay buffer uniformly
        For each (state, action, reward, next state) in the sample batch, construct target values by:
            Select the best actions according to the Main network
            Evaluate the best actions using the Target network
            Calculate target values using Bellman equation
        Calculate quadratic loss function using target values
        Perform gradient descent
        Update parameters for the Target network for a given frequency
```
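Two steps of the loop are easy to get subtly wrong: moving the policy from uniform to epsilon-greedy, and updating the Target network at a fixed frequency. Below is a minimal sketch of both in plain Python. The linear annealing schedule, its default values, and the helper names are my own assumptions for illustration; the actual schedule and update frequency are hyperparameters.

```python
import random

def epsilon_at(step, eps_start=1.0, eps_end=0.1, anneal_steps=1000):
    # Linearly anneal epsilon from eps_start (uniform random) down to eps_end.
    # The default values are hypothetical, not taken from the repo.
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, epsilon, rng=random):
    # With probability epsilon explore uniformly, otherwise act greedily
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def should_update_target(step, update_freq):
    # Hard update: copy Main parameters into Target every update_freq steps
    return step % update_freq == 0

eps0 = epsilon_at(0)         # fully random at the start
eps_mid = epsilon_at(500)    # halfway through annealing
eps_late = epsilon_at(5000)  # clipped at eps_end
greedy_act = epsilon_greedy([0.2, 0.9, 0.1], epsilon=0.0)
```

With `epsilon=0.0` the policy is purely greedy, and with `epsilon=1.0` it is uniform over the actions.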

### Actual code:

Please see this GitHub repo.

Another algorithm in the Q-learning family is Dueling DQN. This is the paper that introduces Dueling Network for Deep RL. The algorithm for dueling is actually the same as Double DQN with experience replay and target network. For simplicity, I didn’t incorporate additional tricks such as prioritized experience replay.

The main improvement in Dueling DQN is a new architecture that separates the representation of state values from that of state-dependent action advantages. If we aggregate the two well, we still get a good estimate of the state-action value function. In essence, this architecture can learn which states are valuable without having to learn the effect of each action in each state.

#### Build the dueling architecture for our two networks:

- one stream for the state value function, `v_net`
- one stream for the state-dependent action advantage function, `a_net`
- output our Q-values by `combining` the two estimators, subtracting the mean advantage: `v_net + a_net - mean(a_net)`

To see this more concretely, the following code illustrates the changes to build the new architecture:

```
#advantage stream
a_net = tf.layers.dense(final_hidden, units=n_actions, activation=None)
#value stream
v_net = tf.layers.dense(final_hidden, units=1, activation=None)
#combine the two by subtracting the mean advantage
#(keepdims replaces the deprecated keep_dims argument in TF >= 1.5)
q_vals = v_net + (a_net - tf.reduce_mean(a_net, axis=1, keepdims=True))
```
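To see what subtracting the mean advantage does, here is a small worked example for a single state in plain Python. The helper `dueling_q` and the numbers are made up for illustration.

```python
def dueling_q(v, advantages):
    # Q(s, a) = V(s) + (A(s, a) - mean over actions of A(s, a)), for one state
    mean_adv = sum(advantages) / len(advantages)
    return [v + (a - mean_adv) for a in advantages]

v = 2.0                 # made-up output of the value stream
adv = [1.0, -1.0, 3.0]  # made-up outputs of the advantage stream (mean is 1.0)

q = dueling_q(v, adv)
# Adding a constant to every advantage leaves the Q-values unchanged
q_shifted = dueling_q(v, [a + 10.0 for a in adv])
print(q)          # [2.0, 0.0, 4.0]
print(q_shifted)  # [2.0, 0.0, 4.0]
```

The mean of the resulting Q-values equals the value stream output, so a constant offset cannot drift freely between the two streams. The ranking of the actions is still determined entirely by the advantage stream.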

### Actual code:

Please see this GitHub repo.