Introduction
In a previous post, Bellman Equation (Sep 19), I covered the Markov property, Markov chains, and the Markov Reward Process (MRP), but I did not discuss the Markov Decision Process (MDP). In this post, I will note some key ideas about the MDP.
What is MDP?
A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It is widely used in artificial intelligence, reinforcement learning, operations research, and other fields. The key difference between an MDP and an MRP is that in an MDP the future state depends not only on the current state but also on the action taken by the agent. This condition can be expressed as:

$$P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_1, A_1, \ldots, S_t, A_t)$$
As a result, the reward function in an MDP is defined as:

$$R(s, a) = \mathbb{E}\left[R_{t+1} \mid S_t = s, A_t = a\right]$$

This equation represents the expected immediate reward an agent receives when taking action $a$ in state $s$.
Policy in MDP
The policy in an MDP defines an agent's behavior as it interacts with its environment. Understanding policies is crucial for designing agents that can make optimal decisions to achieve specific goals. There are two types of policies: Deterministic Policy and Stochastic Policy.
- Deterministic Policy
A deterministic policy specifies a single action to take in each state. Formally:
$$\pi(s) = a$$
Example:
$$\pi(s_1) = a_2$$
This means that in state $s_1$, the agent will always take action $a_2$.
- Stochastic Policy
A stochastic policy assigns a probability distribution over actions to each state. Formally:
$$\pi(a \mid s) = P(A_t = a \mid S_t = s)$$
Example:
$$\pi(a_1 \mid s_1) = 0.7, \qquad \pi(a_2 \mid s_1) = 0.3$$
This means that in state $s_1$, the agent takes action $a_1$ with probability 0.7 and action $a_2$ with probability 0.3.
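To make the distinction concrete, here is a minimal Python sketch of both policy types; the state names, action names, and probabilities are illustrative assumptions rather than values from the post.

```python
import random

# A deterministic policy: each state maps to exactly one action.
deterministic_policy = {
    "s1": "a2",  # in state s1, always take action a2
    "s2": "a1",
}

# A stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "s1": {"a1": 0.7, "a2": 0.3},
    "s2": {"a1": 0.5, "a2": 0.5},
}

def sample_action(policy, state):
    """Sample an action from a stochastic policy in the given state."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy["s1"])              # always "a2"
print(sample_action(stochastic_policy, "s1"))  # "a1" roughly 70% of the time
```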
Now that we have introduced the definition of a policy, we can write the probability of transitioning to state $s'$ from the current state $s$ under policy $\pi$ as:

$$P^{\pi}(s' \mid s) = \sum_{a \in A} \pi(a \mid s)\, P(s' \mid s, a)$$
This formula succinctly captures how a given policy influences the transition probabilities between states.
Let's decompose the components of this equation.
- $P^{\pi}(s' \mid s)$ is a policy-induced probability. It quantifies how likely the system is to move to state $s'$ from $s$ when the agent follows policy $\pi$.
- $\pi(a \mid s)$ determines which actions are more probable in each state. For example, a policy might favor actions that lead to higher rewards, assigning higher probabilities to those actions.
- $P(s' \mid s, a)$ is inherent to the environment and independent of the agent's policy. It captures the stochastic nature of the environment, where the same action can lead to different outcomes.
- Aggregating over actions: by summing over all possible actions, the formula accounts for every pathway the agent might take from state $s$ to $s'$ under policy $\pi$. This ensures a comprehensive calculation of the transition probability, considering every action's influence.
- Each term $\pi(a \mid s)\, P(s' \mid s, a)$ represents the contribution of action $a$ to the transition probability to state $s'$. The policy's action probabilities serve as weights, determining how much each action influences the overall transition.
Here is an example to illustrate how the equation works. Suppose the agent is in state $s$, and there are two possible actions, $a_1$ and $a_2$. The policy dictates that $\pi(a_1 \mid s) = 0.6$ and $\pi(a_2 \mid s) = 0.4$. The environment dynamics are such that $P(s' \mid s, a_1) = 0.8$ and $P(s' \mid s, a_2) = 0.5$. Then we can conclude that

$$P^{\pi}(s' \mid s) = 0.6 \times 0.8 + 0.4 \times 0.5 = 0.68$$

This means that there is a 68% chance of transitioning to state $s'$ from state $s$ under policy $\pi$.
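This calculation is just the policy-weighted sum $\sum_a \pi(a \mid s)\, P(s' \mid s, a)$, and a short sketch reproduces it; the numeric values below are the same illustrative ones used above.

```python
# P^pi(s' | s) = sum_a pi(a | s) * P(s' | s, a)
pi = {"a1": 0.6, "a2": 0.4}      # pi(a | s), illustrative values
p_env = {"a1": 0.8, "a2": 0.5}   # P(s' | s, a), illustrative values

p_transition = sum(pi[a] * p_env[a] for a in pi)
print(round(p_transition, 2))    # 0.68, i.e. a 68% chance of reaching s' from s under pi
```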
Now that we've discussed the transition probability under a policy in an MDP, let's examine the reward function.
Under a policy $\pi$, it becomes:

$$R^{\pi}(s) = \sum_{a \in A} \pi(a \mid s)\, R(s, a)$$
This formula encapsulates the expected immediate reward an agent receives when it is in a particular state $s$ and follows a specific policy $\pi$.
Here's an example to illustrate the formula. Consider a grid world where an agent can move Up, Down, Left, or Right. Certain actions yield positive rewards (e.g., reaching a goal) and others yield negative rewards (e.g., hitting a wall). Suppose the agent occupies a cell on the grid corresponding to state $s$ and follows a stochastic policy with $\pi(\text{Up} \mid s) = 0.4$, $\pi(\text{Down} \mid s) = 0.1$, $\pi(\text{Left} \mid s) = 0.2$, and $\pi(\text{Right} \mid s) = 0.3$. The reward function is such that moving Up from $s$ yields an expected reward of $10$, moving Down yields $-2$, moving Left yields $0$, and moving Right yields $5$.
Now, let's calculate $R^{\pi}(s)$:

$$R^{\pi}(s) = 0.4 \times 10 + 0.1 \times (-2) + 0.2 \times 0 + 0.3 \times 5 = 5.3$$

This result means that, on average, the agent can expect an immediate reward of 5.3 when following policy $\pi$ in state $s$.
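The same policy-weighted sum can be checked in a few lines of Python; the action probabilities and rewards below mirror the illustrative numbers in the example.

```python
# R^pi(s) = sum_a pi(a | s) * R(s, a)
pi = {"Up": 0.4, "Down": 0.1, "Left": 0.2, "Right": 0.3}        # pi(a | s)
reward = {"Up": 10.0, "Down": -2.0, "Left": 0.0, "Right": 5.0}  # R(s, a)

expected_reward = sum(pi[a] * reward[a] for a in pi)
print(round(expected_reward, 2))  # 5.3
```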
Differences in MDP and MP / MRP
In my previous post, I introduced Markov Processes (MP) and Markov Reward Processes (MRP). Let's revisit these concepts to highlight the key differences between them and Markov Decision Processes (MDP).
The defining characteristic of an MP is its memoryless property, which states that

$$P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_1, S_2, \ldots, S_t)$$

This means that the future state $S_{t+1}$ depends only on the current state $S_t$, not on the sequence of events that preceded it.
An MRP extends the Markov Process by incorporating rewards associated with state transitions. It models not only the stochastic transitions but also the immediate rewards received when moving between states.
The reward function defines the expected immediate reward received upon leaving a state. Formally, $R(s) = \mathbb{E}\left[R_{t+1} \mid S_t = s\right]$, where $R_{t+1}$ is the reward received when transitioning from state $s$ to the next state $s'$.
An MRP enables the evaluation of the long-term performance of a state sequence based on the accumulated rewards, and the value function is introduced to represent the expected cumulative reward from a given state onward. The value function for an MRP is defined as

$$V(s) = \mathbb{E}\left[G_t \mid S_t = s\right] = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\right]$$

where $\gamma \in [0, 1]$ is the discount factor that prioritizes immediate rewards over distant future rewards, and $R_{t+k}$ is the reward received at time step $t + k$.
After several derivations (the process can be checked in the previous post), we can obtain a more detailed value function for an MRP:

$$V(s) = R(s) + \gamma \sum_{s' \in S} P(s' \mid s)\, V(s')$$
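Because this equation is linear in $V$, the MRP value function can be solved directly as $V = (I - \gamma P)^{-1} R$. The NumPy sketch below does exactly that for a small made-up three-state MRP; the transition matrix, rewards, and discount factor are assumptions for illustration only.

```python
import numpy as np

# Illustrative 3-state MRP (all numbers are assumptions, not from the post).
P = np.array([
    [0.5, 0.3, 0.2],   # P[i, j] = probability of moving from state i to state j
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])
R = np.array([1.0, 0.0, 5.0])  # R[i] = expected immediate reward in state i
gamma = 0.9

# V = R + gamma * P V   =>   (I - gamma * P) V = R
V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V)
```

For large state spaces, the same equation is usually solved iteratively rather than by a direct matrix solve.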
An MDP is formally defined by the tuple $(S, A, P, R, \gamma)$, where the core difference between an MDP and an MRP is the action space, represented by $A$. An MDP also involves a policy $\pi$, a strategy that specifies the action to take in each state. The agent's goal is to find an optimal policy that maximizes the expected cumulative discounted reward from each state. This involves balancing immediate rewards with future rewards while accounting for the stochastic nature of state transitions. The following table (generated by o1-mini) clearly illustrates the differences:

| Aspect | Markov Process (MP) | Markov Reward Process (MRP) | Markov Decision Process (MDP) |
| --- | --- | --- | --- |
| Components | States $S$, transitions $P$ | States $S$, transitions $P$, rewards $R$, discount $\gamma$ | States $S$, actions $A$, transitions $P$, rewards $R$, discount $\gamma$ |
| Control | None | None | Agent chooses actions according to a policy $\pi$ |
| Transition dynamics | $P(s' \mid s)$ | $P(s' \mid s)$ | $P(s' \mid s, a)$, dependent on the chosen action |
| Typical question | How does the state evolve? | What is the expected cumulative reward from a state? | Which policy maximizes the expected cumulative reward? |

Value Function in MDP
As with MRPs, we need to discuss the value function in MDPs. MDPs have two value functions: the state value function and the action value function. Let's first examine the state value function:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s\right]$$

where $G_t$ denotes the return at time $t$. This formula gives the expected return if the agent starts in state $s$ at time $t$ and follows policy $\pi$.
For the action value function, we first define a Q-function:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s, A_t = a\right]$$

The Q-function quantifies the expected return when an agent takes a specific action $a$ in a given state $s$ and then follows policy $\pi$ thereafter. To calculate the overall state value, we sum the Q-values of all possible actions, weighted by their probabilities under the current policy. This approach elegantly accounts for the policy's influence on the expected return:

$$V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s)\, Q^{\pi}(s, a)$$
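As a quick numeric illustration of this weighting, here is a tiny sketch; the Q-values and action probabilities are made-up assumptions.

```python
import numpy as np

# Illustrative Q-values and action probabilities for a single state s.
q_values = np.array([4.0, 1.5, -2.0])     # Q^pi(s, a) for three actions
action_probs = np.array([0.5, 0.3, 0.2])  # pi(a | s)

v_state = float(action_probs @ q_values)  # V^pi(s) = sum_a pi(a | s) * Q^pi(s, a)
print(round(v_state, 2))                  # 2.05
```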
We can also derive the Bellman equation from the definition of the state value function:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[R_{t+1} + \gamma V^{\pi}(S_{t+1}) \mid S_t = s\right]$$

This is also called the Bellman Expectation Equation for the state value function, and we can likewise obtain the Bellman Expectation Equation for the Q-function:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[R_{t+1} + \gamma Q^{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\right]$$

We can also substitute $V^{\pi}(s') = \sum_{a' \in A} \pi(a' \mid s')\, Q^{\pi}(s', a')$ into the Q-function to derive another form of the Bellman Expectation Equation:

$$Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \sum_{a' \in A} \pi(a' \mid s')\, Q^{\pi}(s', a')$$

This equation connects the current value of the Q-function with the values of future state-action pairs. It illustrates how the value of a state-action pair is related to potential future outcomes.
Let's now visualize the decomposition of the state value function using a diagram.
The formula for diagram (b) is $V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s)\, Q^{\pi}(s, a)$, and the formula for diagram (c) can be written as $Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{\pi}(s')$. Substituting the latter into the former, we get:

$$V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s) \left( R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{\pi}(s') \right)$$
We can also use a diagram to illustrate the decomposition of the Q-function, which is

$$Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \sum_{a' \in A} \pi(a' \mid s')\, Q^{\pi}(s', a')$$
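To tie these equations together, here is a hedged sketch of iterative policy evaluation in NumPy: it repeatedly applies the Bellman Expectation Equation for $V^{\pi}$ and then recovers $Q^{\pi}$ from the converged values. The two-state, two-action MDP, the policy, and the discount factor are invented for illustration and are not from the original post.

```python
import numpy as np

# Illustrative 2-state, 2-action MDP (all numbers are assumptions).
P = np.array([                    # P[s, a, s'] = transition probability
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([[1.0, 0.0],         # R[s, a] = expected immediate reward
              [2.0, -1.0]])
pi = np.array([[0.6, 0.4],        # pi[s, a] = probability of taking action a in state s
               [0.3, 0.7]])
gamma = 0.9

# Iterative policy evaluation:
# V^pi(s) = sum_a pi(a|s) * ( R(s, a) + gamma * sum_s' P(s'|s,a) * V^pi(s') )
V = np.zeros(2)
for _ in range(10_000):
    Q = R + gamma * P @ V          # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s,a) V(s')
    V_new = (pi * Q).sum(axis=1)   # V[s]    = sum_a pi(a|s) Q(s, a)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

print("V^pi:", V)
print("Q^pi:", R + gamma * P @ V)
```

The loop converges because, for $\gamma < 1$, the Bellman expectation operator is a contraction.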
Conclusion
In this blog, we explored the Markov Decision Process (MDP), focusing on its key equations and discussing its differences from Markov Reward Processes (MRP) and Markov Processes (MP). We also delved into the state value function and Q function, which are crucial components of MDPs.
- Author: Chengsheng Deng
- URL: https://chengshengddeng.com/article/markov-decision-process
- Copyright: Unless otherwise stated, all articles in this blog are released under the BY-NC-SA license. Please credit the source when sharing.