After learning about some Q-learning based algorithms, I now look into policy optimization methods. These methods still exist in the model-free world where the full transition dynamics is unknown. I begin with one of the classic papers on policy gradient methods, that first formalized the policy gradient approach to reinforcement learning.

**Paper**: Policy Gradient Methods for Reinforcement Learning with Function Approximation**Authors**: Richard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour**Year**: 1999**Topic**: Policy Gradient Methods

## Why do we care

The need to apply reinforcement learning on large state space is increasingly relevant. While the traditional tabular methods such as value iteration and policy iteration work on small state space, they are computationally intractable if we scale up. Thus, function approximation for generalization is a good approach to this issue. Prior to this work, we have seen successes using value function approximation, in the form of policy evaluation and policy improvement.

However, value function approximation has its limits. The main one is we don’t have a convergence guarantee. Intuitively, a small change in estimated value will lead to discontinuous changes in selection. Moreover, it’s hard to compute the maximum of Q value for actions that lie in high dimensional or continuous action space.

## Main goals of the paper

In light of the context, this paper aims to:

- Formalize policy gradient approach
- Reformulate the gradient for estimation from experience
- Prove some approximate policy iteration converges to local optimal policy

## Basic setup

Using the standard MDP formulation of $(s_t, a_t, r_t, P^a_{ss’}, R^a_s)$, we also have a parametrized policy $\pi(s, a, \theta) = Pr\{ a_t = a | s_t = s, \theta\}$.

One example of a parametrized policy is the softmax policy. In essence, a softmax policy allows us to weight our (discrete) actions by linear combinations of features, $\theta^T\phi_{sa}$, and it select actions based on an exponential softmax distribution. So $\pi(s, a, \theta) = \frac{e^{\theta^T\phi_{sa}}}{\sum_b e^{\theta^T\phi_{sb}}}$.

## What are we solving

**We aim to find the best $\theta$ that maximizes some performance measure $\rho(\pi)$.**

There are a few measures we can use. The two common ones are average-reward formulation and start-state formulation.

Average-reward: $$ \rho(\pi) = \lim_{n\rightarrow\infty} \frac{1}{n} E\{r_1 + r_2 + \cdots + r_n | \pi\} = \sum_s d^\pi(s)\sum_a\pi(s,a)R_s^a$$ This measure defines the best policy by having the maximum long term expected reward per step. The assumption here is that the stationary distribution under $\pi$ exists and is independent of start state for all policy. Equivalently this means if we fix a policy, the underlying chain is ergodic so it has a unique stationary distribution.

Start-state: $$ \rho(\pi) = E\{\sum_{t=1}^\infty \gamma^tr_t\} = \sum_sd^\pi(s)\sum_a\pi(s,a)Q^\pi(s,a) $$ This measure defines the best policy by having the maximum long term total discounted reward. Another difference from the above measure is that $d$ is the state visitation frequency with discount rate $\gamma$. Note that for $\gamma = 1$, we have to apply to episodic tasks, otherwise the sum is unbounded and not properly defined.

Now what’s the approach? As you can tell from the title of the paper, we use gradient update. You may wonder if we can use the simple update rule: $$ \theta_{t+1} \leftarrow \theta_t +\alpha \frac{\partial\rho}{\partial\theta}$$

Well it may be hard because $\rho(\pi)$ depends on $\theta$ in two ways:

- directly through action selection $\pi(s,a,\theta)$
- indirectly through state distribution $d^\pi (s)$

In other words, our performance measure depends on both state distribution and action selection given a state. Thus, when we change parameters, it will affect both state distribution and action selection. It is easy to recover the change on the policy for a given state. However, the effect on state distribution is hard to capture. Is there a solution to this?

## Main Theorem

The answer is YES! The paper’s main theorem tackles precisely this issue. First assume that our policy $\pi(s,a)$ is differentiable w.r.t. parameter $\theta$. The theorem shows that

$$ \frac{\partial\rho}{\partial\theta} = \sum_s d^\pi(s)\sum_a\frac{\partial\pi(s,a)}{\partial\theta} Q^\pi(s,a)$$

where $Q^\pi(s,a)$ is the value of a state-action pair given policy. Note that this term does **not** involve $\frac{\partial d^\pi (s)}{\partial\theta}$ !

The proof to the Theorem is really interesting to go through, but I’m a bit tired to typing them up. Will insert them in the future. The insight here is the gradient of the performance measure only depends on policy.

Prior to this paper, there were algorithms using similar ideas. The first policy-gradient learning algorithm was Monte Carlo Policy Gradient. The classic REINFORCE algorithm (Williams, 1992) directly applies our main Theorem, by replacing oracle to (unbiased) MC estimates. There were also independent theoretical work by Jordan (1995), Baxter and Bartlett (1999), Marbach and Tsitsiklis (1998).

## Can we do better?

Instead of using an unbiased but high-variance estimate $G_t$ for $Q^\pi(s,a)$, is it possible to use a function approximator (biased but low variance) for $Q^\pi(s,a)$?

In other words, if we let $f_w:S \times A \rightarrow \mathbb{R}$ be an approximation for $Q^\pi$, when does “approximate” policy gradient

$E_\pi[\nabla\log\pi(s,a)f_w(s,a)]\approx \frac{\partial\rho}{\partial\theta}$?

It turns out that we could. This leads to the second theorem in the paper.

### Theorem 2: PG with Approximation

This Theorem has two key assumptions:

- Approximation needs to be sufficiently good
- Approximation parameterization must be compatible with policy parameterization

The Theorem says that $\frac{\partial\rho}{\partial\theta}\propto E_\pi[\nabla\log\pi(s,a)f_w(s,a)]$

An important note is that under these conditions, we actually get the **exact** gradient.

### Theorem 2 assumptions

What does it mean for approximation to be sufficiently good? It means that loss function converges to local optimum. Since $\Delta w_t\propto \nabla_wL(f_w, \hat{Q}^\pi)$, we have $\Delta w_t = 0$.

[TBC]