
Introduction

In this blog, I will explore the Policy Gradient algorithm, a fundamental approach in reinforcement learning. Below are my previous learning notes in this reinforcement learning series:
 
Before we jump into Policy Gradient, let's quickly consider a question: why choose Policy-Based methods?
In RL, there are broadly two categories of methods:
  • Value-Based Methods (like Q-learning, SARSA): These methods learn to estimate value functions (Q-values or V-values), which tell us how good it is to be in a certain state or to take a certain action in a state. The policy is then derived from these value functions (e.g. by choosing actions with the highest Q-value).
  • Policy-Based Methods (like Policy Gradient, PPO): These methods directly learn the policy itself. Instead of learning value functions and then deriving a policy, we directly parameterize and optimize the policy.
So, Policy-Based Methods have the following advantages:
  • Handling Continuous Action Spaces: Value-based methods often struggle with continuous action spaces because they need to find the action that maximizes the value function, and searching over an infinite set of actions is generally intractable. Policy-based methods avoid this by outputting the parameters of an action distribution (e.g. the mean and variance of a Gaussian) from which actions can be sampled directly.

Policy Gradient

 
The policy gradient algorithm is a foundational policy-based method. The core idea is to directly adjust the policy parameters in the direction that improves performance, as measured by the expected return.
Let's first define the concept of a trajectory, denoted as $\tau$. A trajectory is a sequence of states, actions, and rewards that an agent experiences while interacting with an environment. It represents a single episode of interaction from a starting state to either a terminal state or a time limit.
A trajectory is formally defined as:
$$\tau = (s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_T, a_T, r_T)$$
How do we calculate the probability that a specific trajectory occurs?
$$p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
where $\theta$ represents the parameters of the policy network.
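To make this concrete, here is a small Python sketch. The two-state MDP, its transition table, and the tabular policy below are made up purely for illustration; the code simply multiplies the initial-state probability with the policy and transition probabilities along one trajectory:

```python
import numpy as np

# A made-up two-state, two-action MDP, used only to illustrate p_theta(tau).
p_init = np.array([0.8, 0.2])                  # p(s_1)
policy = np.array([[0.6, 0.4],                 # pi_theta(a | s), rows indexed by state
                   [0.3, 0.7]])
transition = np.array([[[0.9, 0.1],            # p(s' | s, a), indexed as [s, a, s']
                        [0.2, 0.8]],
                       [[0.5, 0.5],
                        [0.1, 0.9]]])

# One trajectory: s_1=0, a_1=0, s_2=1, a_2=1, s_3=1
states = [0, 1, 1]
actions = [0, 1]

prob = p_init[states[0]]
for t, a in enumerate(actions):
    s, s_next = states[t], states[t + 1]
    prob *= policy[s, a] * transition[s, a, s_next]

print(f"p_theta(tau) = {prob:.4f}")            # product of policy and dynamics terms
```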
The logic of policy gradient is straightforward: the parameter $\theta$ determines the agent's actions, which in turn determine the received rewards. Our goal is to maximize the expected reward. We can calculate the expected reward for a given set of parameters as:
$$\bar{R}_\theta = \sum_\tau R(\tau)\, p_\theta(\tau)$$
This formula is equivalent to the expectation in statistics. We calculate the expected value of the reward across all possible trajectories, with each trajectory having a probability of $p_\theta(\tau)$. This expectation represents our policy's average performance. We can therefore rewrite the equation as:
$$\bar{R}_\theta = \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)]$$
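As a quick sanity check of this equivalence, the sketch below uses a one-step toy problem with made-up probabilities and returns (so each "trajectory" is a single action) and compares the exact sum $\sum_\tau R(\tau)\, p_\theta(\tau)$ with a Monte Carlo average of $R(\tau)$ over sampled trajectories:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-step toy problem: three possible "trajectories" (actions) with made-up
# probabilities under the current policy and made-up returns R(tau).
p_theta = np.array([0.2, 0.5, 0.3])
R = np.array([1.0, 0.0, 2.0])

exact = np.sum(R * p_theta)                    # sum_tau R(tau) * p_theta(tau)

samples = rng.choice(len(R), size=100_000, p=p_theta)
monte_carlo = R[samples].mean()                # E_{tau ~ p_theta}[R(tau)] by sampling

print(f"exact: {exact:.3f}   Monte Carlo: {monte_carlo:.3f}")
```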
Since our goal is to maximize this reward, gradient ascent provides an effective optimization method.
Let's explore further to see how the gradient of $\bar{R}_\theta$ is derived:
$$\nabla \bar{R}_\theta = \sum_\tau R(\tau)\, \nabla p_\theta(\tau) = \sum_\tau R(\tau)\, p_\theta(\tau)\, \frac{\nabla p_\theta(\tau)}{p_\theta(\tau)} = \sum_\tau R(\tau)\, p_\theta(\tau)\, \nabla \log p_\theta(\tau) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\, \nabla \log p_\theta(\tau)\big]$$
Let's examine how we derived steps 1 to 3 in this chain of equalities. Here is the explanation:
Using the chain rule of differentiation, we know that $\nabla \log f(x) = \frac{\nabla f(x)}{f(x)}$. We can apply this principle to obtain $\frac{\nabla p_\theta(\tau)}{p_\theta(\tau)} = \nabla \log p_\theta(\tau)$.
The second question we need to examine is how to calculate $\nabla \log p_\theta(\tau)$.
Returning to the definition of $p_\theta(\tau)$, we can apply the logarithm to both sides of the equation to get:
$$\log p_\theta(\tau) = \log p(s_1) + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t)$$
Only the middle term depends on $\theta$, so the other two terms vanish when we take the gradient, leaving $\nabla \log p_\theta(\tau) = \sum_{t=1}^{T} \nabla \log \pi_\theta(a_t \mid s_t)$.
Through these mathematical derivations, we arrive at the following formula:
$$\nabla \bar{R}_\theta = \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[R(\tau) \sum_{t=1}^{T} \nabla \log \pi_\theta(a_t \mid s_t)\Big] \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log \pi_\theta(a_t^n \mid s_t^n)$$
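To see that this sampled estimator really matches the true gradient, here is a small numerical check on a one-step softmax policy (the logits and rewards are made up for illustration). For this toy case the exact gradient is $\partial \bar{R}_\theta / \partial \theta_j = p_j (R_j - \bar{R}_\theta)$, and the sampled estimate $\frac{1}{N}\sum_n R(a^n)\, \nabla \log \pi_\theta(a^n)$ should agree with it:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-step softmax policy over 3 actions; logits and rewards are made up.
theta = np.array([0.1, -0.2, 0.3])
R = np.array([1.0, 0.0, 2.0])

p = np.exp(theta - theta.max())
p /= p.sum()                                   # pi_theta(a) = softmax(theta)

# Exact gradient of E[R] for a softmax policy: p_j * (R_j - E[R]).
exact_grad = p * (R - np.sum(p * R))

# Policy gradient estimate: average of R(a) * grad log pi(a),
# where grad_j log pi(a) = 1{a = j} - p_j for a softmax policy.
actions = rng.choice(3, size=200_000, p=p)
score = np.eye(3)[actions] - p
estimate = (R[actions][:, None] * score).mean(axis=0)

print("exact   :", np.round(exact_grad, 3))
print("estimate:", np.round(estimate, 3))
```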
We can now use gradient ascent to update the parameters:
$$\theta \leftarrow \theta + \eta\, \nabla \bar{R}_\theta$$
where $\eta$ is the learning rate.
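In practice, deep learning frameworks only provide gradient descent on a loss, so the ascent step is usually implemented by minimizing the negative surrogate objective $-\frac{1}{N}\sum_{n,t} R(\tau^n)\, \log \pi_\theta(a_t^n \mid s_t^n)$. A minimal PyTorch sketch, where the network size and the batch of states, actions, and returns are placeholders for illustration:

```python
import torch
import torch.nn as nn

# Tiny policy network; sizes and the batch below are placeholders for illustration.
policy_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(policy_net.parameters(), lr=1e-2)   # lr plays the role of eta

states = torch.randn(32, 4)              # states visited under the current policy
actions = torch.randint(0, 2, (32,))     # actions that were taken
returns = torch.randn(32)                # R(tau) associated with each sample

dist = torch.distributions.Categorical(logits=policy_net(states))
log_probs = dist.log_prob(actions)       # log pi_theta(a_t | s_t)

loss = -(returns * log_probs).mean()     # negative surrogate objective
optimizer.zero_grad()
loss.backward()
optimizer.step()                         # with plain SGD: theta <- theta + eta * grad R_bar
```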

Some techniques to improve policy gradient

Baseline
The first technique is called the baseline. While we could use $R(\tau)$ directly for training, this approach can lead to high variance in the results. To mitigate this, we can subtract a baseline value, denoted as $b$. So, we can get:
$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \big(R(\tau^n) - b\big)\, \nabla \log \pi_\theta(a_t^n \mid s_t^n)$$
Typically, the baseline value is approximated by the average return over the sampled trajectories, $b \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)$.
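The sketch below illustrates why this helps, reusing the one-step softmax toy problem from before (all numbers made up): subtracting the average return leaves the mean of the gradient estimate essentially unchanged but shrinks its variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same one-step softmax toy problem as before (all numbers made up).
theta = np.array([0.1, -0.2, 0.3])
R = np.array([1.0, 0.0, 2.0])
p = np.exp(theta - theta.max())
p /= p.sum()

actions = rng.choice(3, size=200_000, p=p)
score = np.eye(3)[actions] - p                   # grad log pi(a)
b = R[actions].mean()                            # baseline: average sampled return

g_plain = R[actions][:, None] * score            # R(tau) * grad log pi
g_base = (R[actions] - b)[:, None] * score       # (R(tau) - b) * grad log pi

print("mean without / with baseline:", np.round(g_plain.mean(0), 3), np.round(g_base.mean(0), 3))
print("var  without / with baseline:", np.round(g_plain.var(0), 3), np.round(g_base.var(0), 3))
```

The two means agree closely (the baseline term has roughly zero mean, so the estimator stays essentially unbiased), while the per-component variance with the baseline is noticeably smaller.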
Reward-to-Go
Reward-to-Go is the second important technique in policy gradient methods.
Given a trajectory over time steps $t = 1, \dots, T$, the Reward-to-Go method calculates returns by summing rewards from the current time step onward (from $t' = t$ to $T$) rather than using the total episode return. This means each action is evaluated based only on the rewards that follow it, not on rewards that came before it in the trajectory.
Therefore, we can express the formula as:
$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \Big(\sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^n - b\Big)\, \nabla \log \pi_\theta(a_t^n \mid s_t^n)$$
where $\gamma$ is the discount factor. This factor prioritizes recent rewards while diminishing the impact of future rewards. It reflects a practical principle: immediate outcomes generally carry more weight in decision-making than distant future consequences. The expression $\sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^n - b$ can be written as $A^\theta(s_t, a_t)$, which represents the advantage function.
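A small helper for computing these discounted reward-to-go returns might look like the following (the example rewards are made up):

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """G_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'}, computed with a single backward pass."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# Example with made-up rewards from one episode:
print(rewards_to_go([1.0, 0.0, 2.0, 1.0], gamma=0.9))
# -> approximately [3.349, 2.61, 2.9, 1.0]; each entry only accumulates rewards from that step on
```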

REINFORCE

A basic policy gradient algorithm is called REINFORCE (or Monte Carlo Policy Gradient). It uses Monte Carlo estimates of the return to compute the gradients and update the policy parameters directly. So, here is how we calculate the gradient of the reward:
$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} G_t^n\, \nabla \log \pi_\theta(a_t^n \mid s_t^n), \quad \text{where } G_t^n = \sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^n$$
A high-level outline of the REINFORCE algorithm looks like this (a minimal code sketch follows the list):
  • Randomly initialize the policy parameters $\theta$
  • For each iteration, two steps need to be done:
    • Run the current policy $\pi_\theta$ in the environment to sample one or more episodes
    • Compute the return $G_t$ for each time step of each sampled episode
  • Policy Update
    • Compute the gradient estimate $\nabla \bar{R}_\theta$ from the sampled episodes using the formula above
    • Perform the gradient ascent step on $\theta$: $\theta \leftarrow \theta + \eta\, \nabla \bar{R}_\theta$
  • Repeat
    • Iterate until convergence or for a fixed number of iterations/episodes
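Putting the pieces together, here is a minimal REINFORCE sketch for CartPole. It assumes `gymnasium` and PyTorch are installed; the network size, learning rate, number of iterations, and the mean-return baseline are arbitrary illustrative choices rather than tuned values:

```python
import gymnasium as gym
import torch
import torch.nn as nn

def rewards_to_go(rewards, gamma=0.99):
    G, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        G.append(running)
    return list(reversed(G))

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for iteration in range(200):
    # Rollout: sample one episode with the current policy.
    obs, _ = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated

    # Reward-to-go returns with a simple mean baseline.
    returns = torch.tensor(rewards_to_go(rewards))
    returns = returns - returns.mean()

    # Policy update: gradient ascent via minimizing the negative surrogate loss.
    loss = -(returns * torch.stack(log_probs)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (iteration + 1) % 20 == 0:
        print(f"iteration {iteration + 1:3d}  episode return = {sum(rewards):.0f}")
```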

Conclusion

In this note, I have explored key concepts and techniques of policy gradient algorithms. The key ideas we covered include the mathematical foundations of policy gradients, the importance of baseline subtraction for variance reduction, and the Reward-to-Go technique for more effective credit assignment. We also examined the REINFORCE algorithm as a practical implementation of these concepts. These methods form the foundation for more advanced policy-based approaches in reinforcement learning.
 
 
 