- 1. Introduction
- 2. Looking Back at MDPs
- 3. Dynamic Programming (DP)
  - 3-1 Policy Evaluation (Prediction)
  - 3-2 Policy Improvement (Control)
  - 3-3 Policy Iteration with DP
- 4. Monte Carlo (MC)
  - 4-1 What are Monte Carlo Methods?
  - 4-2 Key Characteristics of Monte Carlo Methods
  - 4-3 How Monte Carlo Methods Work
  - 4-4 Monte Carlo Policy Control
- 5. Temporal Difference (TD)
  - 5-1 What is Temporal Difference Learning?
  - 5-2 How Temporal Difference Learning Works
  - 5-3 TD(0) vs Monte Carlo
  - 5-4 TD Learning for Policy Control
- 6. Conclusion
1. Introduction
Welcome to the fourth post in my series on reinforcement learning, where I document my learning journey. Here are the previous posts in the series:
If you notice any errors in this series, please leave a comment!
In this post, I'll explore three fundamental reinforcement learning algorithms: Dynamic Programming (DP), Monte Carlo methods, and Temporal Difference (TD) learning.
2. Looking Back at MDPs
Before diving into Monte Carlo methods, let's review the Markov Decision Process (MDP).
An MDP consists of 5 key components:
- State Space: The set of all possible states
- Action Space: The set of all possible actions an agent can take at any time. In a maze game, for example, the action space would include up, down, left, and right.
- Transition Probability: The probability that an agent taking action $a$ in state $s$ will transition to a new state $s'$. This is expressed as $P(s' \mid s, a)$.
- Reward Function: The reward an agent receives after taking action $a$ in state $s$. This can be expressed as $R(s, a)$ or $R(s, a, s')$.
- Discount Factor ($\gamma$): A value between 0 and 1 that balances immediate and future rewards.
We also know that an MDP has the Markov Property (MP). This property states that the future state depends only on the current state, not on any past state information. This can be described as follows:
$$P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_1, S_2, \dots, S_t)$$
The goal of an MDP is to find the optimal policy that maximizes the agent's rewards when interacting with the environment. A policy is a function that determines which action an agent should take in a given state. It can be expressed as $\pi(a \mid s)$, representing the probability that the agent will take action $a$ when in state $s$.
To find the optimal policy $\pi_*$, we need to know how valuable each state or action is. This is captured by the state-value function $V_\pi(s)$ and the action-value function $Q_\pi(s, a)$.
- $V_\pi(s)$ is the expected return when the agent starts at state $s$ and follows policy $\pi$; it measures the long-term value of that state.
- $Q_\pi(s, a)$ represents the expected return when the agent takes action $a$ in state $s$ and then follows policy $\pi$.
So, here are the Bellman Expectation Equation and the Bellman Optimality Equation to help us evaluate a policy and find the optimal policy.
Let's look at the Bellman Expectation Equation first:
$$V_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V_\pi(s') \right]$$
or
$$Q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \sum_{a'} \pi(a' \mid s') Q_\pi(s', a') \right]$$
For $V_\pi(s)$:
- $V_\pi(s)$: The value of state $s$ under policy $\pi$
- $\sum_{a} \pi(a \mid s)$: Sum over all possible actions, weighted by their probabilities under policy $\pi$
- $\sum_{s', r} p(s', r \mid s, a)$: Sum over all possible next states and rewards, weighted by their transition probabilities
- $r + \gamma V_\pi(s')$: The immediate reward plus the discounted value of the next state
In simpler terms, this equation tells us that the value of a state is the expected sum of immediate reward and discounted future value, considering all possible actions we might take and their outcomes under our current policy.
For $Q_\pi(s, a)$:
- $Q_\pi(s, a)$: The value of taking action $a$ in state $s$ under policy $\pi$
- $\sum_{s', r} p(s', r \mid s, a)$: Sum over all possible next states and rewards, weighted by their transition probabilities
- $r + \gamma \sum_{a'} \pi(a' \mid s') Q_\pi(s', a')$: The immediate reward plus the discounted sum of future action values, weighted by the policy's probabilities
This equation represents the action-value function, showing that the value of taking an action in a state is equal to the expected immediate reward plus the discounted future value of all possible next actions, weighted by our policy's probabilities.
Now, let's look at the Bellman Optimality Equation.
For $V_*(s)$, this is known as the Bellman optimality equation for the state-value function:
$$V_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V_*(s') \right]$$
It states that the optimal value of a state is equal to the maximum expected return we can get by taking any action in that state.
- $V_*(s)$: The optimal value of state $s$
- $\max_{a}$: Taking the maximum value over all possible actions
- $\sum_{s', r} p(s', r \mid s, a)$: Sum over all possible next states and rewards, weighted by their probabilities
- $r + \gamma V_*(s')$: The immediate reward plus the discounted optimal value of the next state
For $\pi_*(s)$, this equation defines the optimal policy:
$$\pi_*(s) = \arg\max_{a} Q_*(s, a)$$
It states that the optimal action in any state is the one that maximizes the optimal action-value function.
- $\pi_*(s)$: The optimal action to take in state $s$
- $\arg\max_{a}$: The action that gives the maximum value
- $Q_*(s, a)$: The optimal action-value function
These equations are fundamental to understanding how we can find optimal policies in reinforcement learning. They form the theoretical basis for many practical algorithms, including Dynamic Programming, which we'll explore next.
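To make the Bellman Expectation Equation more concrete, here is a quick toy example in Python. It solves the matrix form $V_\pi = R_\pi + \gamma P_\pi V_\pi$ exactly for a hypothetical two-state MDP under a fixed policy; the transition probabilities and rewards are made-up numbers for illustration only.

```python
import numpy as np

# Hypothetical two-state MDP under a fixed policy pi (numbers are made up):
# P_pi[i, j] = probability of moving from state i to state j when following pi
# R_pi[i]    = expected immediate reward received in state i under pi
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])
R_pi = np.array([1.0, -0.5])
gamma = 0.9

# Bellman expectation equation in matrix form: V = R + gamma * P V
# Rearranging gives (I - gamma * P) V = R, which we can solve exactly here.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V)  # the value of each state under the fixed policy
```

For larger state spaces this direct solve becomes impractical, which is exactly why the iterative methods in the next sections matter.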
3. Dynamic Programming (DP)
As we discussed, Dynamic Programming (DP) is a powerful technique for solving complex problems by breaking them down into simpler, overlapping subproblems, solving each once and storing the results to avoid redundant computation. In the context of reinforcement learning, DP is used to find optimal policies in Markov Decision Processes (MDPs). There are two main processes used in DP for RL problems: policy evaluation and policy improvement.
3-1 Policy Evaluation (Prediction)
Policy evaluation is about determining the value function $V_\pi(s)$ (or $Q_\pi(s, a)$) for a given policy $\pi$. In other words, given a policy, we try to determine how good it is to be in a certain state. This value represents the expected long-term reward if you start in state $s$ and follow the policy from that point on. It can be calculated using the Bellman Expectation Equation we saw earlier. In practice, with DP, we use an iterative method, updating the value of each state based on its neighbors, until the state values converge.
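As a rough sketch of this iterative scheme, the loop below repeatedly applies the Bellman expectation backup to every state until the largest change falls below a threshold. It assumes a small tabular MDP whose model is stored as `P[s][a] = [(prob, next_state, reward), ...]` — a data format I am choosing purely for illustration.

```python
def policy_evaluation(P, policy, gamma=0.9, theta=1e-6):
    """Iteratively estimate V_pi for a fixed policy on a tabular MDP.

    P[s][a] is a list of (prob, next_state, reward) tuples (assumed format);
    every reachable state is assumed to appear as a key in P.
    policy[s][a] is the probability of taking action a in state s.
    """
    V = {s: 0.0 for s in P}                      # initialize V(s) = 0 for all states
    while True:
        delta = 0.0
        for s in P:
            v_old = V[s]
            # Bellman expectation backup for state s
            V[s] = sum(policy[s][a] * prob * (r + gamma * V[s_next])
                       for a in P[s]
                       for prob, s_next, r in P[s][a])
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:                        # values have (approximately) converged
            return V
```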
3-2 Policy Improvement (Control)
Once we have an estimated value function for a given policy, we can then consider whether that policy is optimal. We do this using policy improvement, which attempts to find an improved policy that yields a higher value. To do this, we can use the Bellman Optimality Equation we saw earlier. The key here is that after policy evaluation, we look for actions that would improve the value of a given state, in turn creating a better policy. This process can be repeated until the policy and value function no longer change, at which point we have found an optimal policy.
3-3 Policy Iteration with DP
- Initialization
- We start by initializing the value function $V(s)$ for all states $s$ in the state space $\mathcal{S}$. Often these values are initialized to zero or to small random numbers. These values represent our initial estimate of the value of each state.
- We also initialize a policy $\pi(s)$ for all states, which represents the initial choice (or probability) of taking certain actions in state $s$. This is often initialized as a random policy. It can be a deterministic policy or a probabilistic (stochastic) policy.
- Policy Evaluation (Prediction)
- A variable $\Delta$ is initialized to $0$. This will track changes in the value function.
- We loop through every state $s$ in the state space $\mathcal{S}$.
- We update the value of the state using the Bellman expectation equation:
$$V(s) \leftarrow \sum_{s', r} p(s', r \mid s, \pi(s)) \left[ r + \gamma V(s') \right]$$
- We calculate the change in value with $\Delta \leftarrow \max(\Delta, |v - V(s)|)$, where $v$ is the value of $s$ before the update. This tracks the maximum change across all states.
- We repeat the previous steps until $\Delta$ becomes less than a small value $\theta$ that represents a threshold for convergence (our decision of when to stop updating the value function). At this point, we have the true value function for our current policy.
We repeatedly estimate the value function under our current policy.
This means that the new value of $V(s)$ is computed using the expected reward and the value of the next state, following the current policy $\pi$. Note that $p(s', r \mid s, \pi(s))$ denotes the probability of transitioning from $s$ to the next state $s'$ with reward $r$ when taking the action $\pi(s)$ given by the policy.
- Policy Improvement (Control)
- A variable named `policy-stable` is initialized to `true`. This helps us keep track of whether our policy has changed in this iteration.
- We loop through each state $s$ in the state space $\mathcal{S}$.
- We store our current action choice for state $s$: `old-action` $\leftarrow \pi(s)$.
- We improve the policy by setting $\pi(s)$ to the action that maximizes the expected value:
$$\pi(s) \leftarrow \arg\max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V(s') \right]$$
- If the `old-action` differs from the new $\pi(s)$, we set `policy-stable` to `false`.
- If `policy-stable` remains `true` after looping through all the states (meaning the policy has not changed), we can stop the algorithm and output the optimal value function $V \approx V_*$ and policy $\pi \approx \pi_*$. Otherwise, if any state changed its action, we start again from step 2 (Policy Evaluation).
Here, we look for an improved policy from the evaluated value function.
Note that we are using our newly computed value function here.
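Putting evaluation and improvement together, here is a rough sketch of the full policy-iteration loop just described. It reuses the `policy_evaluation` sketch from section 3-1 and the same assumed `P[s][a] = [(prob, next_state, reward), ...]` model format, and it assumes every action in `actions` is available in every state.

```python
def policy_iteration(P, actions, gamma=0.9):
    """Alternate policy evaluation and greedy policy improvement until stable."""
    # Start from a uniformly random policy over the available actions.
    policy = {s: {a: 1.0 / len(actions) for a in actions} for s in P}
    while True:
        V = policy_evaluation(P, policy, gamma)            # prediction step
        policy_stable = True
        for s in P:
            old_action = max(policy[s], key=policy[s].get)
            # Greedy improvement: one-step lookahead using the new value function.
            q = {a: sum(prob * (r + gamma * V[s_next])
                        for prob, s_next, r in P[s][a])
                 for a in actions}
            best_action = max(q, key=q.get)
            policy[s] = {a: 1.0 if a == best_action else 0.0 for a in actions}
            if best_action != old_action:                   # the policy changed here
                policy_stable = False
        if policy_stable:                                   # no state changed its action
            return V, policy
```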
4. Monte Carlo (MC)
Before we delve into Monte Carlo, I think it is necessary to review the Law of Large Numbers (LLN).
The LLN is a fundamental concept in probability and statistics which states that as the number of trials in a random experiment increases, the average of the results converges to the expected value. In simpler terms, the more data you collect, the more reliable your estimate of the true value becomes.
The Connection Between LLN and Monte Carlo
The Law of Large Numbers provides the theoretical justification for using Monte Carlo methods. Essentially, Monte Carlo methods apply the principle of the LLN by treating each episode of an agent's interaction with the environment as a random sample. We then use the observed rewards and returns from each episode to estimate the value of a state (or state-action pair).
According to the LLN, as we simulate more and more episodes and average their returns, these averages become a better and better estimate of the true expected value for a given state or state-action pair. In essence, Monte Carlo methods leverage the LLN to transform noisy, random experiences into reliable estimates of value functions.
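To see the LLN in action, here is a tiny, self-contained illustration (unrelated to any particular MDP): we estimate the expected value of a fair six-sided die, which is 3.5, by averaging more and more random rolls. The running averages get closer to the true mean as the sample size grows.

```python
import random

random.seed(0)                     # fixed seed so the run is reproducible
true_mean = 3.5                    # expected value of a fair six-sided die

for n in (10, 100, 10_000, 100_000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    estimate = sum(rolls) / n      # sample average of n rolls
    print(f"n={n:>7}: estimate={estimate:.4f} (true mean {true_mean})")
```

Monte Carlo value estimation does exactly this, except the "rolls" are the returns observed from visits to a state.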
4-1 What are Monte Carlo Methods?
Monte Carlo methods are a class of model-free algorithms that learn from experience. Unlike DP, MC methods do not require a complete model of the environment (i.e., the transition probabilities and rewards). Instead, they learn by directly interacting with the environment, accumulating data from episodes of experience, and using this data to evaluate and improve a policy.
4-2 Key Characteristics of Monte Carlo Methods
- Episodic Learning: MC methods learn from complete episodes of interaction. An episode is a sequence of states, actions, and rewards that terminates in a terminal state.
- No Bootstrapping: MC methods do not update estimates based on other estimates; rather, updates are based on the actual rewards experienced in an episode. This is why we need to wait until the end of the episode to do any value updates.
- Model-Free: Monte Carlo does not require the underlying model (state transition probabilities, reward functions) of the environment; instead, it relies on observing the actions and outcomes of the agent's behavior in the environment.
- Averaging Returns: MC methods estimate the value of a state (or state-action pair) by averaging the returns (total discounted rewards) observed in many episodes starting from that state (or state-action pair).
4-3 How Monte Carlo Methods Work
At the core of MC methods is the idea of estimating value functions by averaging observed returns. Let's review some key definitions:
- Episode: An episode is a complete sequence of states, actions, and rewards, starting from an initial state and ending in a terminal state. For example, in a game of chess, a single game would be an episode. In a robot navigation task, a single trajectory from start to goal would be an episode.
- Return ($G_t$): The return at time step $t$, denoted $G_t$, is the total discounted reward received from time step $t$ until the end of the episode. It's defined as:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T$$
where $R_{t+1}$ is the reward received at time step $t+1$, $\gamma$ is the discount factor, and $T$ is the time step at which the episode terminates.
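For example (with made-up numbers), if $\gamma = 0.9$ and the remaining rewards in an episode are $R_{t+1} = 1$, $R_{t+2} = 0$, and $R_{t+3} = 2$, then the return is $G_t = 1 + 0.9 \cdot 0 + 0.9^2 \cdot 2 = 2.62$.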
- There are two types of MC methods:
- First-Visit MC: In this method, only the first occurrence of a state within an episode is used to update the value function.
- Every-Visit MC: In this method, all occurrences of a state within an episode are used to update the value function.
- Value Function Update: After an episode terminates, the value function for a state $s$ (or state-action pair $(s, a)$) can be updated based on the observed return $G_t$. The update is done by averaging the returns observed from the relevant visits to state $s$ or state-action pair $(s, a)$ across multiple episodes.
For the state-value function, $V(s)$:
$$V(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i$$
where $N(s)$ is the number of times state $s$ has been visited, and $G_i$ is the return observed on the $i$-th visit to state $s$.
For the state-action value function, $Q(s, a)$:
$$Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{N(s, a)} G_i$$
where $N(s, a)$ is the number of times action $a$ has been taken in state $s$, and $G_i$ is the return observed on the $i$-th visit to the state-action pair $(s, a)$.
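Here is a minimal first-visit MC prediction sketch for the state-value update above. It assumes episodes have already been collected as lists of (state, reward) pairs, where the reward is the one received after leaving that state; the data format and function name are my own choices for illustration.

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=0.9):
    """Estimate V(s) by averaging first-visit returns.

    episodes: a list of episodes, each a list of (state, reward) pairs in the
              order experienced; reward is received after leaving that state.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for episode in episodes:
        # Compute the return G_t for every time step, working backwards.
        G = 0.0
        step_returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            step_returns.append((state, G))
        step_returns.reverse()
        # First-visit: only the first occurrence of each state counts.
        seen = set()
        for state, G in step_returns:
            if state in seen:
                continue
            seen.add(state)
            returns_sum[state] += G
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
    return V
```

Switching to every-visit MC would simply mean dropping the `seen` check.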
4-4 Monte Carlo Policy Control
While the discussion above covers policy evaluation, for policy control we are interested in finding the optimal policy $\pi_*$ by improving the policy iteratively. To accomplish this, MC control methods typically follow a generalized policy iteration (GPI) scheme using the following steps:
- Policy Evaluation
Use MC methods to estimate the state-action value function $Q_\pi(s, a)$ from observed returns.
- Policy Improvement
Update the policy to make it greedy or epsilon-greedy with respect to $Q(s, a)$. That is, with probability $1 - \epsilon$ we choose the action with the highest estimated value, and with a small probability $\epsilon$ we explore a random action (a short sketch follows this list).
- Repeat steps 1 and 2 until the policy converges to a stable optimal policy, meaning the action chosen for every state does not change anymore.
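As promised above, here is a small sketch of the epsilon-greedy choice used in the improvement step, assuming a tabular estimate stored as a nested dict `Q[state][action]`; the function name and the default `epsilon` are illustrative choices.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)                    # explore
    return max(actions, key=lambda a: Q[state][a])       # exploit the current estimate
```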
5. Temporal Difference (TD)
Having explored Dynamic Programming and Monte Carlo methods, we now turn to Temporal Difference (TD) learning, another fundamental approach in reinforcement learning. TD learning combines ideas from both DP and MC methods, offering a powerful and often more efficient way to learn value functions and policies.
5-1 What is Temporal Difference Learning?
TD learning is a model-free learning approach, like Monte Carlo methods, which means it doesn't require knowledge of the environment's transition probabilities and rewards. However, unlike MC, TD learning doesn't need to wait until the end of an episode to learn. Instead, it can learn from incomplete episodes by bootstrapping, meaning it updates its value estimates based on other value estimates. This allows TD learning to be faster and more efficient than MC in many situations.
The term "Temporal Difference" refers to the fact that TD learning learns by comparing the predicted value of a state with the temporally next estimated value of the state that was experienced. In other words, it learns by looking at the "difference" between value predictions at different times within the same episode.
Key Characteristics of TD Learning:
- Model-Free: Like MC methods, TD does not require a model of the environment.
- Bootstrapping: TD methods update value estimates based on other estimates. This means that the value function is updated using an estimate of the value, rather than waiting for the actual return.
- Learning from Incomplete Episodes: Unlike MC, TD learning can learn from incomplete episodes, meaning it can update after each step instead of waiting until the end of the episode. This allows TD methods to be more sample efficient compared to MC.
- Updates at Each Step: TD methods learn by updating value function estimates after each transition in the environment, taking into account the reward received and the estimated value of the next state.
5-2 How Temporal Difference Learning Works
The core concept of TD learning is to update a value estimate based on the difference between the current estimate and a more informed estimate obtained after taking a step in the environment. This can be explained with the help of a specific TD update called TD(0), or one-step TD.
Update for the State-Value Function, $V(s)$:
The update rule for the state-value function in TD(0) learning is as follows:
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
where $S_t$ is the current state; $S_{t+1}$ is the next state; $R_{t+1}$ is the reward received after transitioning from $S_t$ to $S_{t+1}$; $\gamma$ is the discount factor; $\alpha$ is the learning rate; $V(S_t)$ is the current value estimate of state $S_t$; $V(S_{t+1})$ is the current value estimate of the next state; and the term $R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the TD error.
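In code, one TD(0) backup for a single observed transition is very short. The sketch below assumes a tabular value function stored in a dict; the function and argument names are my own.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) backup for the observed transition (s, r, s_next)."""
    td_error = r + gamma * V[s_next] - V[s]   # TD target minus current estimate
    V[s] += alpha * td_error                  # nudge V(s) toward the target
    return V
```

Note that the update happens immediately after the transition, without waiting for the episode to finish.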
5-3 TD(0) vs Monte Carlo
TD(0) and MC methods both try to estimate a value function; however, they do it in different ways.
- TD methods use bootstrapping to update the value function based on estimates of the next state or the next action-value, while MC methods use complete returns to update the value function.
- TD methods can update the value function after every step (or even multiple steps), while MC methods need to wait until the end of the episode.
- As such, TD methods can learn faster than MC methods, since the value function is updated more frequently.
5-4 TD Learning for Policy Control
Just like in Monte Carlo methods, TD learning can also be used to solve control problems, which means finding an optimal policy that maximizes the agent's expected rewards. Some common control algorithms that use TD learning include:
- SARSA: SARSA (State-Action-Reward-State-Action) is an on-policy TD learning algorithm where the current policy is used to estimate the action value in the next state, $Q(S_{t+1}, A_{t+1})$.
- Q-learning: Q-learning is an off-policy TD learning algorithm where the maximum action value in the next state, $\max_{a} Q(S_{t+1}, a)$, is used.
These algorithms learn by iteratively updating their action-value function using TD updates and improve the policy in a similar manner to MC methods, by selecting the action with the best expected value.
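To make the difference concrete, here are the two one-step updates as rough tabular sketches, assuming `Q` is a nested dict `Q[state][action]`; in SARSA, `a_next` is the action actually selected in the next state by the current (e.g., epsilon-greedy) policy.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action the current policy actually takes next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best available action in the next state."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```

I'll look at these two algorithms in more detail in the upcoming posts.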
6. Conclusion
In this post, we explored three fundamental approaches to reinforcement learning: Dynamic Programming (DP), Monte Carlo (MC) methods, and Temporal Difference (TD) learning. Each method offers distinct characteristics and trade-offs—DP requires a complete model of the environment, MC learns from complete episodes without needing a model, and TD combines aspects of both by learning from incomplete episodes through bootstrapping. Understanding these methods builds a solid foundation for tackling more advanced reinforcement learning concepts and applications.
In the following posts, I will explore SARSA and Q-learning in depth, examining their advantages and disadvantages while comparing them to MC methods and Dynamic Programming.
- Author: Chengsheng Deng
- URL: https://chengshengddeng.com/article/notes-on-dp-mc-td
- Copyright: All articles in this blog, except for special statements, adopt the BY-NC-SA agreement. Please indicate the source!