Implementing Epsilon-Greedy Algorithm in Python

In this project, We will implement the Multi-Armed Bandit Analysis of Epsilon-Greedy in Python

Introduction:

Epsilon-Greedy Algorithm is one of the key algorithms in taking decisions. The most common application is the Multi-Armed Bandit Problem. The two main key features in this algorithm are ‘Explore’ and ‘Exploit’. For example, If You were given an opportunity to invest in 3 different mutual funds and You have a limited amount of money. First, with very little, you’d try and find out which produces more profit. This phase is called Explore. Next, you’d want to gain the most, so You’d invest in the particular mutual fund which will give You the most profit. This is the Exploit phase. We will see about the Algorithm in detail.

Algorithm:

Action -Value Function:

For an object to take a decision on performing what action gives the maximum profit, every action that is taken is given a value. We use the action-value function in Probability.

The value of selecting an action is defined as the resultant profit gained from performing that action from the set of all possible actions. Then we use the sample average to find the value of an action.

Exploration vs Exploitation:

Exploration allows an object to know more about every action performed. Improving the accuracy of the estimated action-values, helps an object to take better decisions in the future.

Exploitation chooses the greedy action to obtain the highest profit by exploiting the object’s current action-value estimates. But by being greedy with respect to action-value estimates, may not actually get the most profit and not produce optimal results.

When an object explores, it gets more accurate estimates of action-values. And when it exploits, it might get more profit. An object cannot do both the Exploit ad Explore phase simultaneously, and this is known as the exploration-exploitation dilemma.

Epsilon-Greedy Action Selection:

Epsilon-Greedy is a simple method to balance exploration and exploitation by choosing between exploration and exploitation randomly. The epsilon-greedy, where epsilon refers to the probability of choosing to explore, exploits most of the time with a small chance of exploring.

Implementation of Epsilon-Greedy in Python

Import the libraries:

```import numpy as np
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.pyplot as plt
```

Define Actions and Action-Value Estimates:

```# Define Action class
class Actions:
def __init__(self, m):
self.m = m
self.mean = 0
self.N = 0

# Choose a random action
def choose(self):
return np.random.randn() + self.m

# Update the action-value estimate
def update(self, x):
self.N += 1
self.mean = (1 - 1.0 / self.N)*self.mean + 1.0 / self.N * x
```

Creating Epsilon-Greedy Function:

```for i in range(N):
# epsilon greedy
p = np.random.random()
if p < eps:
j = np.random.choice(3)
else:
j = np.argmax([a.mean for a in actions])
x = actions[j].choose()
actions[j].update(x)
```

Visualizing the Output:

```    # for the plot
data[i] = x
cumulative_average = np.cumsum(data) / (np.arange(N) + 1)

# plot moving average ctr
plt.plot(cumulative_average)
plt.plot(np.ones(N)*m1)
plt.plot(np.ones(N)*m2)
plt.plot(np.ones(N)*m3)
plt.xscale('log')
plt.show()
```

OUTPUT:

Conclusion:

The greater the number of actions, the algorithm has more options to explore which may affect the rate of finding the best action. Through this implementation, I hope You have gained a sufficient amount of understanding about the Epsilon-Greedy Algorithm.