Coders Packet

Implementing SARSA Algorithm in Machine Learning using Python

By R. Gayathri

Implementing state-action-reward-state-action Algorithm by Reinforcement learning technique in Python.

A Machine can be trained to make a sequence of decisions to achieve a definite goal. In artificial intelligence, this is usually done by a giving a reward or penalty for every action taken by the machine. This method is called reinforcement learning and it belongs to the category of unsupervised learning.

To understand SARSA algorithm which is based on Markov Decision process, we need to understand the concept of Temporal Difference. SARSA is related to Q-learning which are both TD based, but SARSA is on-policy (agent learns value function from action of current policy) and Q-learning is off-policy (agent learns value function from action of another policy)

The main advantage of SARSA is that, the reward/penalty of an action can be updated after each state, and need not wait till the end of the episode.



A Markov Decision Process (MDP) model contains:
• S = set of possible states.
• A = set of actions.
• r (S, A) = reward value
• A policy the solution of Markov Decision Process.

The aim is to achieve at the best possible solution for a problem

TD(0) Update Rule:

This involves taking an action at the starting state, move to next state and update the value of the previous state.

S1 -> a -> S2 to -> r

V(S) => V(S1) +α [ r + γ * V(S2) – V(S1) ]

Where V(S) is the new value of state, V(S1) is the current value of state and V(S2) is the estimated value of the new state.

α is the learning rate, γ is the discount factor and r is the reward

[ r + γ * V(S2) – V(S1) ] is the TD error

The aim is to move the current state value towards the sum of reward value and γ times the value of new state ( r + γ * V(S2) ) and this done at the rate of α.


This aims to achieve an optimal Q value and not just the optimal value of the state.

α, which is the learning rate needs to be initialized, which determines how quickly we make changes to the Q function.

  • Q (Si, Ai) is initialized where, Si be the state, Ai be the action. The Q-table can be initialized with ‘0’ for every state-action pair. This is the agent’s estimate of its discounted feature reward, starting from state Si and doing an action Ai.
  • Choose an action Ai from state Si based on epsilon greedy strategy for which we get a reward ri. Epsilon greedy is a strategy to maximize the Q(Si, Ai) value with a probability of 1- ↋ or apply a random action Ai at state Si with a probability of ↋, where ↋ is a small number.
  • The Q (Si, Ai) is calculated using the formula:

        Q (Si, Ai)) = Q (Si, Ai)) + α(ri+1 + γ * Q (Si+1, Ai+1)Q (Si, Ai))

  • Jump to the next state and perform the next action
  • Repeat from step 2 iteratively as i=i+1 for all the states in that episode.

This procedure is used for all the episodes taking one episode at a time.



Chosen Environment:

The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.


Importing all the needed libraries.

import numpy as np 
import gym 

numpy is used for dealing with arrays gym library is used for developing reinforcement algorithms

env = gym.make('FrozenLake-v0')
epsilon = 0.9
total_episodes = 100
max_steps = 10
alpha = 0.85
gamma = 0.95
Q = np.zeros((env.observation_space.n, env.action_space.n)) 

The FrozenLake-v0 environment is set and loaded into a variable. The necessary variables  are initialized. The number of episodes is 100 and the maximum number of steps per episode is 10

def choose_action(state): 
    if np.random.uniform(0, 1) < epsilon: 
        action = env.action_space.sample() 
        action = np.argmax(Q[state, :]) 
    return action 
def update(state, state2, reward, action, action2): 
    predict = Q[state, action] 
    target = reward + gamma * Q[state2, action2] 
    Q[state, action] = Q[state, action] + alpha * (target - predict) 

The function to choose the net action and the function to learn the Q value is defined.


for episode in range(total_episodes): 
    t = 0
    state1 = env.reset() 
    action1 = choose_action(state1) 
    while t < max_steps: 
        state2, reward, done, info = env.step(action1) 
        action2 = choose_action(state2)  
        update(state1, state2, reward, action1, action2) 
        state1 = state2 
        action1 = action2 
        t += 1
        reward += 1
        if done: 

The machine is trained using SARSA algorithm by iterating through all the episodes.


The red mark is the current posion of the agent and the direction in the brackets is the direction of movement of the agent.

print ("Performace : ", reward/total_episodes) 

The performance of the algorithm is calculated by dividing the final reward by the total number of episodes.

The updated Q-table and the performance is printed



Reinforcement learning is a very important concept that has great potential and plays a critical role in the technological industry. This can be used  to arrive at solutions for complex problems with proper training and testing. Thus, SARSA Algorithm can be used to serve this purpose effeciently and successfully

Download Complete Code


No comments yet