
Introduction

In this blog, I will explore the Policy Gradient algorithm, a fundamental approach in reinforcement learning. Below are my previous learning notes in this reinforcement learning series:
 
Before we jump into Policy Gradient, let's quickly consider a question: why choose Policy-Based methods?
In RL, there are broadly two categories of methods:
  • Value-Based Methods (like Q-learning, SARSA): These methods learn to estimate value functions (Q-values or V-values), which tell us how good it is to be in a certain state or to take a certain action in a state. The policy is then derived from these value functions (e.g. by choosing actions with the highest Q-value).
  • Policy-Based Methods (like Policy Gradient, PPO): These methods directly learn the policy itself. Instead of learning value functions and then deriving a policy, we directly parameterize and optimize the policy.
So, Policy-Based Methods have the following advantages:
  • Handling Continuous Action Spaces: Value-based methods often struggle with continuous action spaces because they need to find the action that maximizes the value function, and searching over an infinite set of actions is generally intractable. Policy-based methods avoid this by outputting the parameters of an action distribution (e.g. the mean and variance of a Gaussian) from which actions can be sampled directly.

Policy Gradient

 
The policy gradient algorithm is a foundational policy-based method. The core idea is to directly adjust the policy parameters in the direction that improves performance, as measured by the expected return.
Let's first define the concept of a trajectory, denoted as $\tau$. A trajectory is a sequence of states, actions, and rewards that an agent experiences while interacting with an environment. It represents a single episode of interaction from a starting state to either a terminal state or a time limit.
A trajectory is formally defined as:
$$\tau = (s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_T, a_T, r_T)$$
How do we calculate the probability that a specific trajectory occurs?
$$p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
where $\theta$ represents the parameters of the policy network.
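To make this concrete, here is a small Python sketch. The two-state MDP, its transition table, and the tabular policy below are made up purely for illustration; the code simply multiplies the initial-state probability with the policy and transition probabilities along one trajectory:

```python
import numpy as np

# A made-up two-state, two-action MDP, used only to illustrate p_theta(tau).
p_init = np.array([0.8, 0.2])                  # p(s_1)
policy = np.array([[0.6, 0.4],                 # pi_theta(a | s), rows indexed by state
                   [0.3, 0.7]])
transition = np.array([[[0.9, 0.1],            # p(s' | s, a), indexed as [s, a, s']
                        [0.2, 0.8]],
                       [[0.5, 0.5],
                        [0.1, 0.9]]])

# One trajectory: s_1=0, a_1=0, s_2=1, a_2=1, s_3=1
states = [0, 1, 1]
actions = [0, 1]

prob = p_init[states[0]]
for t, a in enumerate(actions):
    s, s_next = states[t], states[t + 1]
    prob *= policy[s, a] * transition[s, a, s_next]

print(f"p_theta(tau) = {prob:.4f}")            # product of policy and dynamics terms
```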
The logic of policy gradient is straightforward: the parameter $\theta$ determines the agent's actions, which in turn determine the received rewards. Our goal is to maximize the expected reward. We can calculate the expected reward for a given set of parameters as:
$$\bar{R}_\theta = \sum_\tau R(\tau)\, p_\theta(\tau)$$
This formula is equivalent to the expectation in statistics. We calculate the expected value of the reward across all possible trajectories, with each trajectory having a probability of $p_\theta(\tau)$. This expectation represents our policy's average performance. We can therefore rewrite the equation as:
$$\bar{R}_\theta = \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)]$$
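As a quick sanity check of this equivalence, the sketch below uses a one-step toy problem with made-up probabilities and returns (so each "trajectory" is a single action) and compares the exact sum $\sum_\tau R(\tau)\, p_\theta(\tau)$ with a Monte Carlo average of $R(\tau)$ over sampled trajectories:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-step toy problem: three possible "trajectories" (actions) with made-up
# probabilities under the current policy and made-up returns R(tau).
p_theta = np.array([0.2, 0.5, 0.3])
R = np.array([1.0, 0.0, 2.0])

exact = np.sum(R * p_theta)                    # sum_tau R(tau) * p_theta(tau)

samples = rng.choice(len(R), size=100_000, p=p_theta)
monte_carlo = R[samples].mean()                # E_{tau ~ p_theta}[R(tau)] by sampling

print(f"exact: {exact:.3f}   Monte Carlo: {monte_carlo:.3f}")
```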
Since our goal is to maximize this reward, gradient ascent provides an effective optimization method.
Let's explore further to see how the gradient of $\bar{R}_\theta$ is derived:
$$\nabla \bar{R}_\theta = \sum_\tau R(\tau)\, \nabla p_\theta(\tau) = \sum_\tau R(\tau)\, p_\theta(\tau)\, \frac{\nabla p_\theta(\tau)}{p_\theta(\tau)} = \sum_\tau R(\tau)\, p_\theta(\tau)\, \nabla \log p_\theta(\tau) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\, \nabla \log p_\theta(\tau)\big]$$
Let's examine how we derived steps 1 to 3 in this chain of equalities. Here is the explanation:
Using the chain rule of differentiation, we know that $\nabla \log f(x) = \frac{\nabla f(x)}{f(x)}$. We can apply this principle to obtain $\frac{\nabla p_\theta(\tau)}{p_\theta(\tau)} = \nabla \log p_\theta(\tau)$.
The second question we need to examine is how to calculate $\nabla \log p_\theta(\tau)$.
Returning to the definition of $p_\theta(\tau)$, we can apply the logarithm to both sides of the equation to get:
$$\log p_\theta(\tau) = \log p(s_1) + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) + \sum_{t=1}^{T} \log p(s_{t+1} \mid s_t, a_t)$$
Only the middle term depends on $\theta$, so the other two terms vanish when we take the gradient, leaving $\nabla \log p_\theta(\tau) = \sum_{t=1}^{T} \nabla \log \pi_\theta(a_t \mid s_t)$.
Through these mathematical derivations, we arrive at the following formula:
$$\nabla \bar{R}_\theta = \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[R(\tau) \sum_{t=1}^{T} \nabla \log \pi_\theta(a_t \mid s_t)\Big] \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log \pi_\theta(a_t^n \mid s_t^n)$$
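To see that this sampled estimator really matches the true gradient, here is a small numerical check on a one-step softmax policy (the logits and rewards are made up for illustration). For this toy case the exact gradient is $\partial \bar{R}_\theta / \partial \theta_j = p_j (R_j - \bar{R}_\theta)$, and the sampled estimate $\frac{1}{N}\sum_n R(a^n)\, \nabla \log \pi_\theta(a^n)$ should agree with it:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-step softmax policy over 3 actions; logits and rewards are made up.
theta = np.array([0.1, -0.2, 0.3])
R = np.array([1.0, 0.0, 2.0])

p = np.exp(theta - theta.max())
p /= p.sum()                                   # pi_theta(a) = softmax(theta)

# Exact gradient of E[R] for a softmax policy: p_j * (R_j - E[R]).
exact_grad = p * (R - np.sum(p * R))

# Policy gradient estimate: average of R(a) * grad log pi(a),
# where grad_j log pi(a) = 1{a = j} - p_j for a softmax policy.
actions = rng.choice(3, size=200_000, p=p)
score = np.eye(3)[actions] - p
estimate = (R[actions][:, None] * score).mean(axis=0)

print("exact   :", np.round(exact_grad, 3))
print("estimate:", np.round(estimate, 3))
```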
We can now use gradient ascent to update the parameters:
$$\theta \leftarrow \theta + \eta\, \nabla \bar{R}_\theta$$
where $\eta$ is the learning rate.
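In practice, deep learning frameworks only provide gradient descent on a loss, so the ascent step is usually implemented by minimizing the negative surrogate objective $-\frac{1}{N}\sum_{n,t} R(\tau^n)\, \log \pi_\theta(a_t^n \mid s_t^n)$. A minimal PyTorch sketch, where the network size and the batch of states, actions, and returns are placeholders for illustration:

```python
import torch
import torch.nn as nn

# Tiny policy network; sizes and the batch below are placeholders for illustration.
policy_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(policy_net.parameters(), lr=1e-2)   # lr plays the role of eta

states = torch.randn(32, 4)              # states visited under the current policy
actions = torch.randint(0, 2, (32,))     # actions that were taken
returns = torch.randn(32)                # R(tau) associated with each sample

dist = torch.distributions.Categorical(logits=policy_net(states))
log_probs = dist.log_prob(actions)       # log pi_theta(a_t | s_t)

loss = -(returns * log_probs).mean()     # negative surrogate objective
optimizer.zero_grad()
loss.backward()
optimizer.step()                         # with plain SGD: theta <- theta + eta * grad R_bar
```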

Some techniques to improve policy gradient

Baseline
The first technique is called the baseline. While we could use $R(\tau)$ directly for training, this approach can lead to high variance in the results. To mitigate this, we can subtract a baseline value, denoted as $b$. So, we can get:
$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \big(R(\tau^n) - b\big)\, \nabla \log \pi_\theta(a_t^n \mid s_t^n)$$
Typically, the baseline value is approximated by the average return over the sampled trajectories, $b \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)$.
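The sketch below illustrates why this helps, reusing the one-step softmax toy problem from before (all numbers made up): subtracting the average return leaves the mean of the gradient estimate essentially unchanged but shrinks its variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same one-step softmax toy problem as before (all numbers made up).
theta = np.array([0.1, -0.2, 0.3])
R = np.array([1.0, 0.0, 2.0])
p = np.exp(theta - theta.max())
p /= p.sum()

actions = rng.choice(3, size=200_000, p=p)
score = np.eye(3)[actions] - p                   # grad log pi(a)
b = R[actions].mean()                            # baseline: average sampled return

g_plain = R[actions][:, None] * score            # R(tau) * grad log pi
g_base = (R[actions] - b)[:, None] * score       # (R(tau) - b) * grad log pi

print("mean without / with baseline:", np.round(g_plain.mean(0), 3), np.round(g_base.mean(0), 3))
print("var  without / with baseline:", np.round(g_plain.var(0), 3), np.round(g_base.var(0), 3))
```

The two means agree closely (the baseline term has roughly zero mean, so the estimator stays essentially unbiased), while the per-component variance with the baseline is noticeably smaller.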
Reward-to-Go
Reward-to-Go is the second important technique in policy gradient methods.
Given a trajectory over time steps $t = 1, \dots, T$, the Reward-to-Go method calculates returns by summing rewards from the current time step onward (from $t' = t$ to $T$) rather than using the total episode return. This means each action is evaluated based only on the rewards that follow it, not on rewards that came before it in the trajectory.
Therefore, we can express the formula as:
$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \Big(\sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^n - b\Big)\, \nabla \log \pi_\theta(a_t^n \mid s_t^n)$$
where $\gamma$ is the discount factor. This factor prioritizes recent rewards while diminishing the impact of future rewards. It reflects a practical principle: immediate outcomes generally carry more weight in decision-making than distant future consequences. The expression $\sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^n - b$ can be written as $A^\theta(s_t, a_t)$, which represents the advantage function.
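A small helper for computing these discounted reward-to-go returns might look like the following (the example rewards are made up):

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """G_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'}, computed with a single backward pass."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# Example with made-up rewards from one episode:
print(rewards_to_go([1.0, 0.0, 2.0, 1.0], gamma=0.9))
# -> approximately [3.349, 2.61, 2.9, 1.0]; each entry only accumulates rewards from that step on
```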

REINFORCE

A basic policy gradient algorithm is called REINFORCE (or Monte Carlo Policy Gradient). It uses Monte Carlo estimates of the return to compute the gradients and update the policy parameters directly. So, here is how we calculate the gradient of the reward:
$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} G_t^n\, \nabla \log \pi_\theta(a_t^n \mid s_t^n), \quad \text{where } G_t^n = \sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^n$$
A high-level outline of the REINFORCE algorithm looks like this (a minimal code sketch follows the list):
  • Randomly initialize the policy parameters $\theta$
  • For each iteration, two steps need to be done:
    • Run the current policy $\pi_\theta$ in the environment to sample one or more episodes
    • Compute the return $G_t$ for each time step of each sampled episode
  • Policy Update
    • Compute the gradient estimate $\nabla \bar{R}_\theta$ from the sampled episodes using the formula above
    • Perform the gradient ascent step on $\theta$: $\theta \leftarrow \theta + \eta\, \nabla \bar{R}_\theta$
  • Repeat
    • Iterate until convergence or for a fixed number of iterations/episodes
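Putting the pieces together, here is a minimal REINFORCE sketch for CartPole. It assumes `gymnasium` and PyTorch are installed; the network size, learning rate, number of iterations, and the mean-return baseline are arbitrary illustrative choices rather than tuned values:

```python
import gymnasium as gym
import torch
import torch.nn as nn

def rewards_to_go(rewards, gamma=0.99):
    G, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        G.append(running)
    return list(reversed(G))

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for iteration in range(200):
    # Rollout: sample one episode with the current policy.
    obs, _ = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated

    # Reward-to-go returns with a simple mean baseline.
    returns = torch.tensor(rewards_to_go(rewards))
    returns = returns - returns.mean()

    # Policy update: gradient ascent via minimizing the negative surrogate loss.
    loss = -(returns * torch.stack(log_probs)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (iteration + 1) % 20 == 0:
        print(f"iteration {iteration + 1:3d}  episode return = {sum(rewards):.0f}")
```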

Conclusion

In this note, I have explored key concepts and techniques of policy gradient algorithms. The key ideas we covered include the mathematical foundations of policy gradients, the importance of baseline subtraction for variance reduction, and the Reward-to-Go technique for more effective credit assignment. We also examined the REINFORCE algorithm as a practical implementation of these concepts. These methods form the foundation for more advanced policy-based approaches in reinforcement learning.
 
 
 