
Introduction

In this blog, I will explore two famous reinforcement learning algorithms: SARSA and Q-Learning. Below are my previous learning notes in this reinforcement learning series:
 

SARSA

SARSA updates its Q-value $Q(s,a)$ (the estimated value of taking action $a$ in state $s$) based on the actual action it takes and the reward it receives, following its current policy.
The Update Rule:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma Q(s',a') - Q(s,a) \right] \tag{1}$$
Let’s break down each component:
  • $Q(s,a)$ is the current estimate of the Q-value for taking action $a$ in state $s$.
  • $\alpha$ is the learning rate, determining how much the new information updates the old estimates (typically between 0 and 1).
  • $r$ is the immediate reward received after taking action $a$ in state $s$ and transitioning to state $s'$.
  • $\gamma$ (gamma) is the discount factor, determining the importance of future rewards (typically between 0 and 1). A value closer to 1 emphasizes long-term rewards.
  • $Q(s',a')$ is the estimated Q-value for taking action $a'$ in the next state $s'$. Crucially, $a'$ is the actual action the agent takes in state $s'$ according to its current policy.
The SARSA Process:
  1. Initialize Q-values: start with initial estimates $Q(s,a)$ for all state-action pairs (often zeros or small random values).
  2. Observe the current state $s$.
  3. Choose an action $a$ based on the current policy (e.g., $\epsilon$-greedy). The $\epsilon$-greedy policy selects the action with the highest estimated Q-value with probability $1-\epsilon$ and a random action with probability $\epsilon$.
  4. Take the action $a$ and observe the reward $r$ and the next state $s'$.
  5. Choose the next action $a'$ based on the current policy in the new state $s'$. This is a key point: SARSA looks at the actual next action.
  6. Update the Q-value for the current state-action pair $(s,a)$ using formula (1).
  7. Set $s \leftarrow s'$ and $a \leftarrow a'$.
  8. Repeat steps 2-7 until the agent reaches a terminal state or the learning process converges. A minimal code sketch of this loop follows below.
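To make the loop above concrete, here is a minimal sketch of tabular SARSA with an $\epsilon$-greedy policy. It assumes a hypothetical environment object with a `reset()` method returning a state index and a `step(action)` method returning `(next_state, reward, done)`; the names `env`, `n_states`, and `n_actions` are placeholders, not from the original post.

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def sarsa(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA: move Q(s,a) toward r + gamma * Q(s', a')."""
    Q = np.zeros((n_states, n_actions))                  # step 1: initialize Q-values
    for _ in range(episodes):
        s = env.reset()                                  # step 2: observe current state
        a = epsilon_greedy(Q, s, n_actions, epsilon)     # step 3: choose action a
        done = False
        while not done:
            s_next, r, done = env.step(a)                # step 4: take a, observe r and s'
            a_next = epsilon_greedy(Q, s_next, n_actions, epsilon)  # step 5: choose a'
            # step 6: update rule (1), bootstrapping from the action actually taken next
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] * (not done) - Q[s, a])
            s, a = s_next, a_next                        # step 7
    return Q
```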
 
Expanding on SARSA: Introducing SARSA($\lambda$)
Basic SARSA can be slow in propagating rewards backward through a sequence of actions. To address this limitation, we can use eligibility traces, which leads to SARSA($\lambda$), a powerful extension of the original algorithm.
Referring back to (1), we have the one-step return at time $t$:

$$G_t^{(1)} = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})$$

This represents the Q-value target for the next step when the current time is $t$. However, this single-step approach is inefficient. Let's consider using n-step returns instead (where $n = 1, 2, 3, \dots$), for example:

$$G_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 Q(s_{t+2}, a_{t+2})$$

Generally, for n-step SARSA, the n-step Q-value is:

$$G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n Q(s_{t+n}, a_{t+n})$$

By introducing $\lambda$, the decay-rate parameter for eligibility traces, and performing a weighted summation over the n-step returns, we can derive the $\lambda$-return used by SARSA($\lambda$):

$$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

Therefore, the update rule for SARSA($\lambda$) becomes:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ G_t^{\lambda} - Q(s_t, a_t) \right]$$
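In practice, SARSA($\lambda$) is usually implemented with the backward view: an eligibility trace $e(s,a)$ that is bumped for the visited pair and decays by $\gamma\lambda$ every step, so the TD error updates all recently visited pairs at once. Below is a minimal sketch under the same hypothetical environment interface as before, reusing the `epsilon_greedy` helper from the earlier sketch; the choice of replacing traces is my own.

```python
import numpy as np

def sarsa_lambda(env, n_states, n_actions, episodes=500,
                 alpha=0.1, gamma=0.99, epsilon=0.1, lam=0.9):
    """Backward-view SARSA(lambda) with replacing eligibility traces."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        E = np.zeros_like(Q)                     # eligibility traces, reset each episode
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, n_actions, epsilon)
            td_error = r + gamma * Q[s_next, a_next] * (not done) - Q[s, a]
            E[s, a] = 1.0                        # replacing trace for the visited pair
            Q += alpha * td_error * E            # credit every recently visited pair
            E *= gamma * lam                     # decay all traces
            s, a = s_next, a_next
    return Q
```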

Q-Learning

Q-Learning updates its Q-value based on the maximum possible Q-value in the next state, regardless of the action the agent actually takes. Think of it as considering the best possible outcome, even if the agent chooses a different action in practice. Based on this, we can define the target policy as the greedy policy:

$$\pi(s') = \arg\max_{a'} Q(s', a')$$

So now, we can define the Temporal Difference Target (TD target):

$$y = r + \gamma \max_{a'} Q(s', a')$$

The update rule for Q-Learning now becomes:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$
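For contrast with the SARSA sketch above, here is a minimal tabular Q-Learning loop under the same hypothetical environment interface, again reusing `epsilon_greedy`. The only substantive difference is that the TD target uses the maximum Q-value in the next state rather than the value of the action actually chosen next.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-Learning: move Q(s,a) toward r + gamma * max_a' Q(s', a')."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, n_actions, epsilon)   # behavior policy
            s_next, r, done = env.step(a)
            # TD target follows the greedy (target) policy, not the next action taken
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```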

On-Policy & Off-Policy

In reinforcement learning, the distinction between on-policy and off-policy methods lies in how the agent learns and updates its policy. The core difference revolves around whether the policy used to generate behavior (the behavior policy) is the same as the policy being evaluated and improved (the target policy).
 
SARSA is an on-policy algorithm. On-policy methods learn the value of the same policy that is being used to make decisions and interact with the environment. The agent is essentially learning about the policy it’s currently following.
 
Back to the update rule (1): SARSA uses the Q-value of the action $a'$ that is actually taken in the next state $s'$ according to the current policy (e.g., $\epsilon$-greedy).
Example: Imagine a student learning to drive by strictly following their instructor’s directions and learning from the consequences of those specific actions. They are learning the value of the policy they are executing.
 
Q-Learning is a classical off-policy algorithm. The key difference from SARSA is that Q-Learning uses the maximum Q-value over all possible actions in the next state $s'$, regardless of the action actually taken by the behavior policy. It’s learning about the optimal policy.
Example: Imagine a student learning to drive by observing expert drivers and learning from their optimal maneuvers, even if the student is currently practicing a different, more cautious driving style. They are learning the optimal policy regardless of their current behavior policy.
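The on-policy/off-policy distinction shows up as a one-line difference in how the TD target is built. Here is a self-contained toy comparison; the Q-values and indices are made-up numbers purely for illustration.

```python
import numpy as np

# Toy Q-table just to illustrate the two targets (not real training data).
Q = np.array([[0.0, 1.0, 0.5],      # Q-values for state 0
              [0.2, 0.8, 0.3]])     # Q-values for state 1
r, gamma = 1.0, 0.9
s_next, a_next = 1, 2               # a_next: action the epsilon-greedy behavior policy chose

# SARSA (on-policy): bootstrap from the action actually chosen in s_next.
sarsa_target = r + gamma * Q[s_next, a_next]    # 1 + 0.9 * 0.3 = 1.27

# Q-Learning (off-policy): bootstrap from the best action in s_next,
# regardless of what the behavior policy will actually do there.
q_target = r + gamma * Q[s_next].max()          # 1 + 0.9 * 0.8 = 1.72
```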
 

Example: Maze Navigation

Imagine a simple maze where an agent needs to navigate from a starting point (S) to a goal (G), which provides a positive reward. The maze also contains a trap (H) that gives a negative reward. The maze can be visualized as a grid of cells containing S, G, H, and empty spaces.
The agent can perform four actions: moving Up, Down, Left, and Right. If a move is blocked (e.g., by a wall), the agent stays in the same position.
Let’s consider training our robot using both the Q-Learning and SARSA algorithms; an end-to-end code sketch follows the two walkthroughs below.
Q-Learning (Off-policy)
  1. Exploration: The robot starts at the initial state $s_0$. It explores the environment using an $\epsilon$-greedy policy. For example, with $\epsilon = 0.1$, the robot will choose a random action 10% of the time, and 90% of the time it will choose the action with the highest Q-value for the current state. Let’s assume, in this case, it randomly chooses to move Down.
  2. Update Q-value: The robot moves to the next state $s'$. Q-Learning then looks at all possible actions in $s'$ and identifies the highest Q-value, $\max_{a'} Q(s', a')$. It uses this maximum Q-value to update the Q-value of taking the action “Down” in state $s_0$, $Q(s_0, \text{Down})$. Crucially, Q-Learning doesn’t care what action is actually taken in $s'$.
  3. Repeat: Steps 1 and 2 are repeated until the robot reaches either the goal G or the trap H.
SARSA (On-policy)
  1. Exploration: The robot starts at the initial state $s_0$ and explores using the same $\epsilon$-greedy policy as Q-Learning. Let’s assume it randomly chooses to move Down.
  2. Choose Next Action: The robot arrives at the next state $s'$. Now, SARSA, based on its current policy (e.g., $\epsilon$-greedy), selects the next action $a'$ that it will actually execute in state $s'$.
  3. Update Q-value: SARSA uses the Q-value of the actually executed action in state $s'$, $Q(s', a')$, to update the Q-value of taking the action “Down” in state $s_0$.
  4. Repeat: Steps 1 to 3 are repeated until the robot reaches either the goal G or the trap H.
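To tie the walkthroughs together, here is a small end-to-end sketch: a hypothetical 4x4 grid (the layout, rewards, and hyperparameters are my own choices for illustration, not from the original post) trained with the `sarsa` and `q_learning` functions defined in the earlier sketches.

```python
import numpy as np

class GridMaze:
    """A tiny 4x4 maze: S = start, G = goal (+1), H = trap (-1), . = empty."""
    LAYOUT = ["S...",
              ".H..",
              "...H",
              "...G"]
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # Up, Down, Left, Right

    def __init__(self):
        self.n_rows, self.n_cols = 4, 4

    def reset(self):
        self.pos = (0, 0)                        # start at S
        return self._state(self.pos)

    def _state(self, pos):
        return pos[0] * self.n_cols + pos[1]     # flatten (row, col) to an index

    def step(self, action):
        dr, dc = self.MOVES[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        if 0 <= r < self.n_rows and 0 <= c < self.n_cols:
            self.pos = (r, c)                    # blocked moves leave the agent in place
        cell = self.LAYOUT[self.pos[0]][self.pos[1]]
        reward = 1.0 if cell == "G" else (-1.0 if cell == "H" else 0.0)
        done = cell in "GH"
        return self._state(self.pos), reward, done

env = GridMaze()
Q_sarsa = sarsa(env, n_states=16, n_actions=4, episodes=2000)
Q_qlearn = q_learning(env, n_states=16, n_actions=4, episodes=2000)
print("Greedy actions (SARSA):\n", np.argmax(Q_sarsa, axis=1).reshape(4, 4))
print("Greedy actions (Q-Learning):\n", np.argmax(Q_qlearn, axis=1).reshape(4, 4))
```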

Conclusion

In this blog, we explored two fundamental reinforcement learning algorithms: SARSA (on-policy) and Q-Learning (off-policy). While both algorithms aim to learn optimal policies, they differ in how they update their Q-values and handle the exploration-exploitation trade-off. SARSA learns from actual experiences and tends to be more conservative, while Q-Learning learns about the optimal policy regardless of the actions taken, potentially leading to more aggressive optimization.