GitHub repository here (ongoing)

## Motivation

While implementing deep reinforcement learning algorithms, I asked around about what value to use for the discount factor $\gamma$ in the parameter list. I got essentially three kinds of answers:

- 0.99
- Something close to 1, but not 1 for boundedness
- Just try a set of gammas, and pick the best!

As you can tell, none of these answers is entirely satisfactory. So I want to take a deeper dive into the role the discount factor plays in deep reinforcement learning, and try to understand how it might interfere with learning.

## What is discount factor?

To begin, let’s look at what the discount factor is. Economists first introduced $\gamma$ to specify intertemporal preferences. Consider a thought experiment:

Suppose I could either hand you a whole apple tomorrow or a $\gamma$ fraction of an apple today. I’ll tweak the value of $\gamma$ until you become indifferent between the $\gamma$ fraction of the apple today and the whole apple tomorrow. The $\gamma$ at which you are indifferent reveals your intertemporal preference.

So how does this notion of preference come into play in reinforcement learning? Consider this simple environment, where the player in red can go either left or right to collect coins (worth 1 reward each) or a diamond (worth 5000 reward). I ran the DQN algorithm twice, once with a high discount factor ($\gamma=0.99$):

and once with a low discount factor ($\gamma=0.2$):

As you can see, the agent with the low discount factor prefers immediate reward (the coins), whereas the agent with the high discount factor forgoes the immediate coins and goes for the diamond far on the right. Different intertemporal preferences thus lead to distinct behaviors.
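The preference flip can be reproduced with a little arithmetic. The sketch below is only illustrative: the distances (a coin 1 step away, the diamond 10 steps away) are made up and are not the actual layout of the environment.

```python
def discounted_value(reward, steps_away, gamma):
    """Value today of a reward collected `steps_away` steps in the future."""
    return (gamma ** steps_away) * reward


# Hypothetical layout: a coin 1 step away, the diamond 10 steps away.
COIN, DIAMOND = 1.0, 5000.0

for gamma in (0.2, 0.99):
    coin = discounted_value(COIN, 1, gamma)
    diamond = discounted_value(DIAMOND, 10, gamma)
    choice = "diamond" if diamond > coin else "coin"
    print(f"gamma={gamma}: coin={coin:.4g}, diamond={diamond:.4g} -> {choice}")
```

With these assumed distances, the low-discount agent values the nearby coin more than the far-away diamond, while the high-discount agent still values the diamond far more.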

In environments like this, where we always accumulate rewards, the natural discount factor is 1. However, we cannot use exactly $\gamma=1$ in practice, for one thing because the return is then no longer guaranteed to be bounded. Moreover, we don’t have to use exactly 1: in our Coin vs. Diamond example, an agent with a slightly lower $\gamma$, say $0.99$, would still prefer the diamond over the coins. In other words, high enough discount factors result in intertemporal preferences, and thus behaviors, very similar to those under a discount factor of one.
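The boundedness point can be made concrete. Assuming every per-step reward is bounded in magnitude by some constant $R_{\max}$ (a symbol introduced here for illustration), the discounted return is a geometric series:

$$
\left|\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right| \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}, \qquad 0 \le \gamma < 1 .
$$

For $\gamma=0.99$ the bound is $100\,R_{\max}$; as $\gamma \to 1$ it diverges, which is why $\gamma=1$ gives no such guarantee over an infinite horizon.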

## Blackwell Optimality Principle

There is a theoretical result that confirms this intuition: the Blackwell Optimality Principle states that for all environments there exists a policy that is simultaneously optimal for every $\gamma$ above a certain threshold (Blackwell, 1964).

Is this the complete picture? Let’s go back to our coin vs. diamond example again. This time, in addition to having two agents with high and low discount, I decreased the exploration factor for both agents. The lowered exploration for the low discount agent didn’t alter its actions – it went for all the coins.

## A counterexample

However, with lowered exploration, the high discount agent couldn’t make up its mind about which action to take. It got stuck in the middle and obtained zero reward, doing even worse than the low discount agent!

## Key question

So the key question becomes: do DRL algorithms always find the Blackwell optimal policy for $\gamma$’s above the threshold?

Well, from this example at least, we can see that the answer is: not necessarily. In light of this counterexample, we will demonstrate the issue with more thorough experiments and propose a method to repair it.

## Experiment setup

For the experiments, I customized the following two Gridworld environments using PyColab, one with sparse rewards:

and one with dense rewards:

The algorithm used in these two environments is modified from OpenAI Baselines DQN. I vary only the discount factor and hold everything else fixed.

The result for the sparse reward env is quite consistent with the Blackwell Optimality Principle (BOP): high $\gamma$s reach optimal performance, and there seems to be a threshold between $\gamma=0.5$ and $\gamma=0.8$.

I ran the same experiment on the dense reward env, but this time the DQN algorithm with the highest discount factor ($\gamma=0.99$) did not find the optimal policy.

In fact, the policy with $\gamma=0.99$ performed worse than $\gamma=0.5$.

After thinking through the role the discount factor plays in the DQN algorithm, I came up with the following hypothesis:

## Key hypothesis

$\gamma$ plays a dual role in deep Q-Networks:

- it explicitly specifies some intertemporal preferences (discounting the future)
- it implicitly encodes confidence in bootstrapping from the function approximator (weighing the past)
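To see where both roles enter, recall the standard one-step Q-learning target that DQN regresses toward, $r + \gamma \max_{a'} Q(s', a')$. The sketch below uses hypothetical names and made-up numbers, not code from the repository. The same $\gamma$ that discounts the future also scales how much the target relies on the network's own estimate of the next state.

```python
def dqn_target(reward, next_q_values, gamma, done):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    # The bootstrapped term comes from the function approximator itself,
    # so gamma also controls how much we trust that estimate.
    bootstrap = 0.0 if done else max(next_q_values)
    return reward + gamma * bootstrap


# Made-up Q-values: early in training the estimate for s' may be badly off.
noisy_next_q = [40.0, -15.0, 3.0]

for gamma in (0.1, 0.8, 0.99):
    print(gamma, dqn_target(1.0, noisy_next_q, gamma, done=False))
```

With a small $\gamma$ the target stays close to the observed reward, whereas with $\gamma$ near 1 an inaccurate next-state estimate passes almost undamped into the target.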

## Initial myopia: a time-varying discount scheme

To test this hypothesis, I came up with a simple scheme: a time-varying discount factor $\gamma(t)$.

Given a myopic fraction, which is some fraction of the total timesteps, we linearly increase $\gamma$ from 0.1 to a specified final value, say $\gamma=0.99$, over that initial portion of training. Once the initial myopia period has passed, $\gamma$ stays fixed at its final value.
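A minimal sketch of such a schedule, assuming linear interpolation in the step count. The function and argument names are hypothetical and not taken from the repository.

```python
def myopic_gamma(t, total_timesteps, myopic_fraction, final_gamma, initial_gamma=0.1):
    """Discount factor to use at training step `t` under initial myopia."""
    myopic_steps = myopic_fraction * total_timesteps
    # No myopia period, or the period has already passed: use the final gamma.
    if myopic_steps <= 0 or t >= myopic_steps:
        return final_gamma
    # Inside the myopia period: interpolate linearly from initial to final gamma.
    progress = t / myopic_steps
    return initial_gamma + progress * (final_gamma - initial_gamma)


for step in (0, 50, 100, 200, 1000):
    print(step, myopic_gamma(step, total_timesteps=1000, myopic_fraction=0.2, final_gamma=0.99))
```

With `myopic_fraction=0.2` and 1000 total steps, $\gamma$ starts at 0.1, reaches 0.99 at step 200, and stays there for the rest of training. Setting `myopic_fraction=0` recovers the fixed-$\gamma$ baseline.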

The goal of this myopia schedule is to weigh earlier experience less during the myopic fraction. Let’s look at the experiment results with initial myopia.

## Initial myopia in dense reward env

In the dense reward environment, I compared fixed $\gamma=0.99$ with four myopia schemes using different myopic fractions. We can see that the myopia scheme reached optimal performance with every myopic fraction, whereas the fixed $\gamma=0.99$ could not.

To see the results more clearly, I grouped all versions of initial myopia into a single group, and we can see that it clearly dominates the fixed $\gamma$ setup. This is really good news!

I then ran the same set of experiments for $\gamma=0.8$, which was the best performing $\gamma$ in this environment. We see that every myopic fraction eventually achieves the same level of performance as the optimal fixed $\gamma$, though not as quickly.

## Initial myopia in sparse reward env

In the sparse reward environment, where the DQN algorithm could find the Blackwell optimal policy, I ran the same setup, comparing the fixed $\gamma$ against varying $\gamma$’s with myopia. The results were pretty similar for both $\gamma=0.99$ and $\gamma=0.8$: the fixed $\gamma$ setup yields the best performance, but all myopic schemes converge to optimal.

This suggests that having a myopia scheme doesn’t really hurt in the long run, since all the schemes eventually reach optimal performance.

## Initial myopia results summary

Just to summarize what we’ve seen here: a high enough $\gamma$ becomes optimal with initial myopia. So the benefit of the initial myopia scheme is that we don’t need to fine-tune $\gamma$, since it improves learning in the dense reward environment and doesn’t harm learning even in the sparse reward environment.

## Competing hypothesis: bias reduction or exploration?

After seeing some positive results, one might wonder: what if the benefit of the myopia scheme is due to increased initial exploration rather than bias reduction? In other words, our hypothesis essentially says: myopia → mitigated bias → better performance. A competing hypothesis, however, could be: myopia → exploration → better performance.

Thus, we need to test against the competing hypothesis. The setup is as follows:

| Setup | Myopic Fraction | Explore Fraction | Discount |
|---|---|---|---|
| baseline | 0 | 0.2 | 0.8, 0.99 |
| myo02 | 0.2 | 0.2 | 0.8, 0.99 |
| exp05 | 0 | 0.5 | 0.8, 0.99 |

Essentially, we fix $\gamma$ and compare across `baseline`, where we have no myopia and low exploration; `myo02`, where we have myopia and low exploration; and `exp05`, where we have no myopia and high exploration.
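For concreteness, the comparison can be written as a small experiment grid. The key names below are hypothetical and would need to be mapped onto whatever arguments the training script actually takes. Only the setup names and values come from the table above.

```python
# Hypothetical config keys; the values mirror the setup table.
SETUPS = {
    "baseline": {"myopic_fraction": 0.0, "explore_fraction": 0.2},
    "myo02": {"myopic_fraction": 0.2, "explore_fraction": 0.2},
    "exp05": {"myopic_fraction": 0.0, "explore_fraction": 0.5},
}
FINAL_GAMMAS = (0.8, 0.99)

# One run per (setup, final gamma) pair; everything else is held fixed.
runs = [
    {"name": name, "gamma": gamma, **cfg}
    for name, cfg in SETUPS.items()
    for gamma in FINAL_GAMMAS
]

for run in runs:
    print(run)
```

Each setup differs from `baseline` in exactly one knob, so any difference in performance can be attributed to either myopia or exploration, not both.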

## Bias reduction vs. Exploration results

I tried two different $\gamma$s in the dense reward environment. As the results show, more exploration helps performance a bit (especially when $\gamma=0.99$), but not significantly, and the downside of a longer exploration period is that training takes longer. Myopia, on the other hand, improves performance significantly and converges faster for both $\gamma$s. So the evidence favors the bias reduction hypothesis over the exploration hypothesis.

## Future directions: Generalized Advantage Estimate (GAE)

In PPO (or other PG methods), GAE actually separates $\lambda$ and $\gamma$:
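For reference, the standard GAE estimator takes the discount and the bootstrapping weight as two separate parameters: $\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \delta_{t+l}$ with $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. Below is a minimal sketch for a single trajectory without early termination. The function name and argument layout are my own, not the Baselines implementation.

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized Advantage Estimates for one trajectory.

    rewards: list of length T
    values:  list of length T + 1 (value estimates, including the state after the last step)
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each step reuses the accumulated tail sum.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0]
values = [0.5, 0.5, 0.0]
print(gae_advantages(rewards, values, gamma=0.5, lam=0.0))  # pure one-step bootstrapping
print(gae_advantages(rewards, values, gamma=0.5, lam=1.0))  # no reliance on intermediate values
```

Setting `lam=0` leans entirely on the value function's one-step estimate, while `lam=1` uses the full discounted return minus the baseline, so the trust placed in bootstrapping can be varied without touching `gamma`.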

So it may express the dual role of the discount factor in DQN. This suggests that we could use a myopia schedule on $\lambda$. I ran some initial experiments using Baselines PPO on the MuJoCo HalfCheetah environment, which has a continuous action space, and found the preliminary results pretty promising:

Looking ahead, I’m trying to

- formalize the dual role intuition using uncertainty estimates
- run more experiments on more standard benchmarks for both DQN and PPO

## Final takeaways

Discount factor matters in DRL!

$\gamma$ has a dual role in that it

- specifies intertemporal preferences
- encodes confidence in bootstrapping

A simple myopic schedule is a robust and effective way to improve performance, and the same logic may work beyond DQN and the discrete action/state setting.