# Proximal Policy Optimization Algorithms (PPO)

Continuing with the class of policy optimization methods, I look into one of the most practical algorithms: PPO.

Paper: Proximal Policy Optimization Algorithms
Authors: John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov
Year: 2017
Topic: Policy Gradient Methods

## Policy Gradient Methods and TRPO:

Looking at the types of RL algorithms, we can group them into model-based and model-free methods (some have a bit of both). Among model-free methods, the two popular families are Q-learning methods and policy gradient (PG) methods. We look at PG methods, which directly optimize policy parameters using gradient ascent on expected return. One of the key algorithms in this family is TRPO.

TRPO is a practical algorithm similar to natural policy gradient methods and is effective for optimizing large nonlinear policies (neural nets). The TRPO paper proved that a proper surrogate objective function can improve the policy with non-trivial step sizes (guaranteed monotonic improvement). By making several approximations to this theory (most notably, replacing the penalty on KL divergence with a hard constraint), TRPO is able to scale well in practice. However, despite its good performance on a variety of tasks, TRPO can be tricky to implement.
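Concretely, the constrained problem TRPO solves is to maximize the surrogate objective subject to a trust-region constraint on the average KL divergence between the old and new policies:

$$\max_\theta \; \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\hat{A}_t\right] \quad \text{subject to} \quad \hat{\mathbb{E}}_t\left[\text{KL}\left[\pi_{\theta_\text{old}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right]\right] \le \delta$$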

## Motivation for PPO

In light of TRPO's limitations, this paper introduces a modified algorithm in the class of policy gradient methods. The algorithm is similar to TRPO, but it uses a clipped probability ratio in the surrogate objective of the policy gradient loss. Clipping the ratio prevents excessively large policy updates. As a result, PPO is more general (it applies to various RL architectures), performs better empirically (as shown by the paper's benchmarks), and is easier to implement (the original motivation).

## Results

PPO is now a benchmark algorithm for comparing performance across domains such as continuous control (Humanoid Running and Steering) and Atari. A quote from OpenAI:

> PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance.

## Why is PPO ‘easier’?

TRPO maximizes a surrogate objective function subject to a constraint on the size of the policy update. Solving this constrained problem requires second-order optimization (the KL constraint is handled via a second-order approximation), which is computationally expensive. The reason TRPO chooses a constraint over the theory-justified penalty is that a single fixed penalty coefficient does not generalize well across problems. PPO instead replaces the constraint with a clipped objective function, which requires only first-order optimization and is much simpler.

## What is the clipped objective function?

The TRPO surrogate objective we are maximizing is:

$$L^{CPI} (\theta) = \hat{\mathbb{E}}_t\left[r_t(\theta)\hat{A}_t\right]$$

We define the probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}} (a_t \mid s_t)$, so $r_t(\theta_\text{old}) = 1$.
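In practice this ratio is usually formed from log-probabilities rather than raw probabilities, since exponentiating a difference of logs is more numerically stable. A small sketch (function name is mine, not from the paper):

```python
import numpy as np

def prob_ratio(logp_new, logp_old):
    """Compute r_t(theta) = pi_new(a|s) / pi_old(a|s) from log-probs.

    Subtracting in log space and exponentiating avoids dividing two
    possibly tiny probabilities directly.
    """
    return np.exp(logp_new - logp_old)
```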

The PPO objective is then: $$L^{CLIP} (\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$$

Comparing the two, it’s not hard to see that the second term in the min clips the probability ratio, restricting it to the interval $[1-\epsilon, 1+\epsilon]$. Taking the minimum makes the final objective a pessimistic lower bound on the unclipped objective: changes in the ratio are ignored when they would make the objective better, but included when they make it worse. In effect, updates that move the new policy too far from the old one are cut off.
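As a minimal sketch of the clipped surrogate for a batch of samples (NumPy, my own code rather than the paper’s), assuming the probability ratios and advantage estimates have already been computed:

```python
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """L^CLIP: mean of the per-sample min of unclipped and clipped terms.

    ratio:      r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    advantage:  advantage estimates A_hat_t
    epsilon:    clip range (the paper uses 0.2 as a default)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Elementwise min gives a pessimistic bound on the unclipped objective.
    return np.minimum(unclipped, clipped).mean()
```

Note that with a positive advantage and a ratio above $1+\epsilon$ the objective saturates, so there is no incentive to push the ratio further; symmetrically for a negative advantage and a ratio below $1-\epsilon$.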

## Questions about PPO:

Despite the great results PPO shows, I had several questions while reading the paper.

• How should epsilon, the clip hyperparameter, be chosen? The paper picks 0.2 (a heuristic), which seems somewhat arbitrary. Is there a more principled way of choosing it?
• In which cases would PPO underperform TRPO? Would an adaptive epsilon for the ratio interval help? Would we want larger updates early in training to speed up learning?
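As a hypothetical experiment on the second question (the schedule and all names here are my own, not from the paper), an adaptive clip range could be as simple as a linear decay, allowing larger updates early and more conservative ones late:

```python
def linear_epsilon(step, total_steps, eps_start=0.2, eps_end=0.05):
    """Hypothetical linearly decaying clip range.

    Starts at eps_start (permissive updates) and decays toward eps_end
    (conservative updates) over the course of training.
    """
    frac = min(step / total_steps, 1.0)  # training progress in [0, 1]
    return eps_start + frac * (eps_end - eps_start)
```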

Overall, I hope for a more satisfactory theoretical justification of the clipped probability ratio approach. Looking back at TRPO, it does justify its three main approximations to the KL-divergence theory; still, the fact that the combination of all three works so well seems a bit too good to be true.

I will soon upload my implementation of PPO, which will be more detailed than this summary.