
In [1]:

%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as optim
import scipy.stats as scs
import pandas as pd

In [2]:

%load_ext autoreload
%autoreload 2
%cd src
%reload_ext autoreload

/Users/elliottsaslow/gaalvanize/dsi-multi-armed-bandit/src

Multi-Armed Bandit:

A/B testing with Bayesian updating

Exploration: Testing out the different options to determine how good each one is. This includes acquiring
more knowledge about the reward for each option.

Exploitation: Leveraging your current knowledge about the options to get the highest expected
reward at that time.

A/B Testing in terms of Exploration & Exploitation

Initially, you start with exploration, where the same number of users is assigned to see each option.
Then, once the test is done, you move to exploitation, where all of the users see the option that you
chose.

Multi-Armed Bandit Approach

Show each user the site that you currently think is best most of the time.
As the experiment runs and you send users to the different sites, update your beliefs about each
site.
Run until there is a clear and distinct winner.
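The loop above can be sketched in a few lines of NumPy. This is only an illustration: the payout rates are made up, and `choose_arm` is a stand-in for whichever selection rule you plug in (the algorithms below each fill it in differently).

```python
import numpy as np

# hypothetical payout rates; in practice these are unknown to the player
true_probs = [0.13, 0.03, 0.06]
rng = np.random.default_rng(0)
wins = np.zeros(3)
trials = np.zeros(3)

def choose_arm(wins, trials):
    """Placeholder selection rule: any bandit strategy slots in here."""
    if trials.min() == 0:
        return int(np.argmin(trials))       # show every site at least once
    return int(np.argmax(wins / trials))    # then show the best so far

for _ in range(1000):
    arm = choose_arm(wins, trials)
    reward = float(rng.random() < true_probs[arm])  # did the user convert?
    wins[arm] += reward                              # update beliefs
    trials[arm] += 1
```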

Motivation and origins


The name originally comes from slot machines ("one-armed bandits"), where a gambler faces the
problem of playing a smart strategy: which machines to play, and in what order to play them. All that is
known is that each machine will provide a reward, with some unknown probability, when its lever is pulled.

Use cases
Dynamic A/B
Budget allocation amongst competing projects
Clinical trials
Adaptive routing in attempts to minimize network delays
Reinforcement learning

Applying different methods of the Multi-Armed Bandit:

We can easily imagine 2 fairly simple ways of trying to maximize one's winnings with a slot machine.

Set up: you have 3 slot machines to play, and one of them pays out more often
than the others. You do not know which machine pays better, but you can keep track of
how much on average you win from each. How do you maximize your winnings?

1. The most basic way is to just keep choosing randomly until you run out of money or are happy
with the amount of money that you've made.
2. Keep track of your winnings on each machine and just choose the one that
has the best average payout over and over again. This is called the 'Max Mean Method'.
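Strategy 2 might look like the sketch below (payout rates are made up for illustration; the `BanditStrategy` class used later wraps the same idea, though its internals may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
probs = [0.13, 0.03, 0.06]          # illustrative payout rates
wins, trials = np.zeros(3), np.zeros(3)

for t in range(1000):
    if t < 3:
        arm = t                                  # try each machine once
    else:
        arm = int(np.argmax(wins / trials))      # then stick with the best mean
    wins[arm] += float(rng.random() < probs[arm])
    trials[arm] += 1
```

The weakness of this rule is that an early unlucky streak on the best machine can lock it out forever, which is why the algorithms later in this notebook keep some exploration going.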

Let's take a look at how we perform with just the randomized strategy:

In [3]:
from bandits import Bandits
from banditstrategy import BanditStrategy

machines = [0.13, 0.03, 0.06]
bandits = Bandits(machines)
strat = BanditStrategy(bandits, 'random_choice')
strat.sample_bandits(10000)

print('We have three slot machines that pay out with the probabilities '
      '1: {x}, 2: {y}, 3: {z}'.format(x=machines[0], y=machines[1], z=machines[2]))
print("Number of trials for each machine: {x}".format(x=list(strat.trials)))
print("Number of wins for each machine: {x}".format(x=list(strat.wins)))
print("Conversion rates for each machine: {x}".format(x=list(strat.wins / strat.trials)))
print("A total of %d wins of %d trials." % (strat.wins.sum(), strat.trials.sum()))
We have three slot machines that pay out with the probabilities 1: 0.13, 2: 0.03, 3: 0.06
Number of trials for each machine: [3331.0, 3358.0, 3314.0]
Number of wins for each machine: [428.0, 91.0, 206.0]
Conversion rates for each machine: [0.12848994296007205, 0.027099463966646812, 0.062160531080265542]
A total of 725 wins of 10003 trials.

Looks like we split between all the machines pretty evenly and got the expected result.

Let's try the Max Mean Method, where we choose the machine based on which one has been
performing the best:

In [4]:
machines = [0.13, 0.03, 0.06]
bandits = Bandits(machines)
strat = BanditStrategy(bandits, 'max_mean')
strat.sample_bandits(10000)

print('We have three slot machines that pay out with the probabilities '
      '1: {x}, 2: {y}, 3: {z}'.format(x=machines[0], y=machines[1], z=machines[2]))
print("Number of trials for each machine: {x}".format(x=list(strat.trials)))
print("Number of wins for each machine: {x}".format(x=list(strat.wins)))
print("Conversion rates for each machine: {x}".format(x=list(strat.wins / strat.trials)))
print("A total of %d wins of %d trials." % (strat.wins.sum(), strat.trials.sum()))

We have three slot machines that pay out with the probabilities 1: 0.13, 2: 0.03, 3: 0.06
Number of trials for each machine: [9803.0, 100.0, 100.0]
Number of wins for each machine: [1230.0, 4.0, 6.0]
Conversion rates for each machine: [0.12547179434866879, 0.040000000000000001, 0.059999999999999998]
A total of 1240 wins of 10003 trials.

Here we can see that we are starting to do better than just random! Out of 10000 trials, we were able
to increase our wins from 725 to 1240. This was done by running roughly 100 exploratory pulls per
machine and then choosing the machine that had the highest payout in those initial tests!

Next: let's take a look at some of the possible algorithms we can run to make this even more
efficient!

Epsilon-Greedy

This algorithm uses a method similar to the Max Mean Method. About 10% of the time we explore by
testing a machine at random, and the rest of the time we use the knowledge gained from that
exploration to choose the machine with the highest observed payout! The 10% exploration parameter
is tunable, and we will call it epsilon for this algorithm. I have implemented it, so let's take a look at the results:
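The selection rule just described can be sketched as follows (a minimal version with made-up payout rates, not the `banditstrategy.py` implementation itself):

```python
import numpy as np

rng = np.random.default_rng(2)
probs = [0.13, 0.03, 0.06]          # illustrative payout rates
epsilon = 0.1                        # fraction of pulls spent exploring
wins, trials = np.zeros(3), np.zeros(3)

for _ in range(10000):
    if trials.min() == 0 or rng.random() < epsilon:
        arm = int(rng.integers(3))               # explore: pick at random
    else:
        arm = int(np.argmax(wins / trials))      # exploit: best mean so far
    wins[arm] += float(rng.random() < probs[arm])
    trials[arm] += 1
```

Because every machine keeps a small stream of exploratory pulls, a machine that was unlucky early still gets a chance to reveal its true payout rate.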

In [5]:
machines = [0.13, 0.03, 0.06]
bandits = Bandits(machines)
strat = BanditStrategy(bandits, 'epsilon_greedy')
strat.sample_bandits(10000)

print('We have three slot machines that pay out with the probabilities '
      '1: {x}, 2: {y}, 3: {z}'.format(x=machines[0], y=machines[1], z=machines[2]))
print("Number of trials for each machine: {x}".format(x=list(strat.trials)))
print("Number of wins for each machine: {x}".format(x=list(strat.wins)))
print("Conversion rates for each machine: {x}".format(x=list(strat.wins / strat.trials)))
print("A total of %d wins of %d trials." % (strat.wins.sum(), strat.trials.sum()))
We have three slot machines that pay out with the probabilities 1: 0.13, 2: 0.03, 3: 0.06
Number of trials for each machine: [7975.0, 1677.0, 351.0]
Number of wins for each machine: [1062.0, 57.0, 15.0]
Conversion rates for each machine: [0.13316614420062695, 0.033989266547406083, 0.042735042735042736]
A total of 1134 wins of 10003 trials.

OK, so let's start to compare how each of these performs against the others:

In [30]:
eps_greedy = []
max_mean = []
rando = []
softmax = []
ucb1 = []
Baysian = []
machines = [0.13, 0.05, 0.26]
bandits = Bandits(machines)
for i in range(1000):
    strat = BanditStrategy(bandits, 'epsilon_greedy')
    strat.sample_bandits(1000)
    eps_greedy.append(strat.wins.sum())
    strat2 = BanditStrategy(bandits, 'max_mean')
    strat2.sample_bandits(1000)
    max_mean.append(strat2.wins.sum())
    strat3 = BanditStrategy(bandits, 'random_choice')
    strat3.sample_bandits(1000)
    rando.append(strat3.wins.sum())
    strat4 = BanditStrategy(bandits, 'softmax')
    strat4.sample_bandits(1000)
    softmax.append(strat4.wins.sum())
    strat5 = BanditStrategy(bandits, 'ucb1')
    strat5.sample_bandits(1000)
    ucb1.append(strat5.wins.sum())
    # 'bayesian_bandit' assumed to be the Bayesian strategy's name here;
    # the original cell reused 'ucb1' for this slot
    strat6 = BanditStrategy(bandits, 'bayesian_bandit')
    strat6.sample_bandits(1000)
    Baysian.append(strat6.wins.sum())
banditstrategy.py:173: RuntimeWarning: divide by zero encountered in log
confidence_bounds = np.sqrt((2. * np.log(self.N)) / self.trials)
banditstrategy.py:173: RuntimeWarning: invalid value encountered in sqrt
confidence_bounds = np.sqrt((2. * np.log(self.N)) / self.trials)

In [31]:
fig, ax = plt.subplots(1, 1, figsize=(8, 4))
plt.hist(max_mean, alpha=.45, color='b', bins=30, density=True);
plt.hist(eps_greedy, alpha=.45, color='g', bins=30, density=True);
plt.hist(rando, alpha=.45, color='r', bins=30, density=True);

plt.grid()
legend = ['Max Mean Alg', 'Epsilon Greedy Alg', 'Random Choice Alg']
plt.legend(legend)
plt.title('Histogram of winnings over 1000 runs for each algorithm')
plt.show()
Looking at the performance of each of these algorithms, it is readily apparent that random
choice performs the worst by far. In second place we have max mean, but it is a close
second to the epsilon-greedy algorithm, which performed the best.

Let's look at a couple more algorithms that may be able to perform even better!

Soft max:

We are going to try another algorithm called softmax. It is interesting because it chooses the
machine probabilistically: it takes all of the observed means and chooses a machine with a probability
computed from them. The better a machine has performed, the higher the probability that it is chosen!
The equation for the softmax algorithm is:

$$P_t(\text{choosing bandit } j) = \frac{e^{\hat{\mu}_j(t)/\tau}}{\sum_{k=1}^{K} e^{\hat{\mu}_k(t)/\tau}}$$

where $\hat{\mu}_j(t)$ is the observed mean payout of machine $j$ at round $t$, $K$ is the number of machines, and $\tau$ is a temperature parameter that controls how greedy the choice is.
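The equation above translates directly into code. This is a minimal sketch with made-up payout rates and an illustrative temperature, not the `banditstrategy.py` implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
probs = [0.13, 0.03, 0.06]          # illustrative payout rates
tau = 0.05                           # temperature: lower = greedier
wins, trials = np.zeros(3), np.zeros(3)

for _ in range(10000):
    if trials.min() == 0:
        arm = int(np.argmin(trials))             # seed each machine once
    else:
        mu = wins / trials                       # observed means
        weights = np.exp(mu / tau)               # numerator of the formula
        p = weights / weights.sum()              # softmax probabilities
        arm = int(rng.choice(3, p=p))
    wins[arm] += float(rng.random() < probs[arm])
    trials[arm] += 1
```

As tau shrinks toward 0 the rule approaches pure max-mean exploitation; as tau grows large the choice approaches uniform random exploration.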

In [32]:
machines = [0.13, 0.03, 0.06]
bandits = Bandits(machines)
strat = BanditStrategy(bandits, 'softmax')
strat.sample_bandits(10000)

print('We have three slot machines that pay out with the probabilities '
      '1: {x}, 2: {y}, 3: {z}'.format(x=machines[0], y=machines[1], z=machines[2]))
print("Number of trials for each machine: {x}".format(x=list(strat.trials)))
print("Number of wins for each machine: {x}".format(x=list(strat.wins)))
print("Conversion rates for each machine: {x}".format(x=list(strat.wins / strat.trials)))
print("A total of %d wins of %d trials." % (strat.wins.sum(), strat.trials.sum()))
We have three slot machines that pay out with the probabilities 1: 0.13, 2: 0.03, 3: 0.06
Number of trials for each machine: [9944.0, 21.0, 38.0]
Number of wins for each machine: [1309.0, 1.0, 2.0]
Conversion rates for each machine: [0.13163716814159293, 0.047619047619047616, 0.052631578947368418]
A total of 1312 wins of 10003 trials.

As can be seen below, the softmax algorithm performs better than epsilon greedy.
In [33]:

fig, ax = plt.subplots(1, 1, figsize=(8, 4))

plt.hist(softmax, alpha=.45, color='r', bins=30, density=True);
plt.hist(eps_greedy, alpha=.45, color='g', bins=30, density=True);
plt.grid()
legend = ['SoftMax', 'Epsilon Greedy Alg']
plt.legend(legend)
plt.title('Histogram of winnings over 1000 runs: Epsilon Greedy & SoftMax')
plt.show()

Upper Confidence Bound Algorithm

Upper confidence bound chooses the machine that has the highest observed payout plus a term that
automatically balances exploration and exploitation.

Pick the machine that maximizes:

$$\hat{\mu}_j(t) + \sqrt{\frac{2\ln t}{n_j}}$$

where $\hat{\mu}_j(t)$ is the observed mean payout of machine $j$,

$n_j$ is the number of times that machine has been pulled,

$t$ is the total number of rounds so far.
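The rule above can be sketched as follows (made-up payout rates; the exploration bonus shrinks for machines that have already been pulled many times, which is what balances exploration against exploitation):

```python
import numpy as np

rng = np.random.default_rng(4)
probs = [0.13, 0.03, 0.06]          # illustrative payout rates
wins, trials = np.zeros(3), np.zeros(3)

for t in range(1, 10001):
    if trials.min() == 0:
        arm = int(np.argmin(trials))   # play each machine once first
    else:
        # observed mean plus the exploration bonus from the formula above
        ucb = wins / trials + np.sqrt(2 * np.log(t) / trials)
        arm = int(np.argmax(ucb))
    wins[arm] += float(rng.random() < probs[arm])
    trials[arm] += 1
```

Seeding each machine once before applying the formula avoids the divide-by-zero warnings seen in the comparison cell above, where the bound is computed with zero trials on some machines.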

In [34]:
machines = [0.13, 0.03, 0.06]
bandits = Bandits(machines)
strat = BanditStrategy(bandits, 'ucb1')
strat.sample_bandits(10000)

print('We have three slot machines that pay out with the probabilities '
      '1: {x}, 2: {y}, 3: {z}'.format(x=machines[0], y=machines[1], z=machines[2]))
print("Number of trials for each machine: {x}".format(x=list(strat.trials)))
print("Number of wins for each machine: {x}".format(x=list(strat.wins)))
print("Conversion rates for each machine: {x}".format(x=list(strat.wins / strat.trials)))
print("A total of %d wins of %d trials." % (strat.wins.sum(), strat.trials.sum()))
We have three slot machines that pay out with the probabilities 1: 0.13, 2: 0.03, 3: 0.06
Number of trials for each machine: [7665.0, 1008.0, 1330.0]
Number of wins for each machine: [953.0, 37.0, 72.0]
Conversion rates for each machine: [0.12433137638617091, 0.036706349206349208, 0.054135338345864661]
A total of 1062 wins of 10003 trials.

In [35]:
fig, ax = plt.subplots(1, 1, figsize=(8, 4))
plt.hist(softmax, alpha=.45, color='r', bins=30, density=True);
plt.hist(eps_greedy, alpha=.45, color='g', bins=30, density=True);
plt.hist(ucb1, alpha=.45, color='b', bins=30, density=True);
plt.grid()
legend = ['SoftMax', 'Epsilon Greedy Alg', 'Ucb1']
plt.legend(legend)
plt.title('Histogram of winnings over 1000 runs: Epsilon Greedy, SoftMax & UCB1')
plt.show()

Bayesian approach.

Finally, it is possible to use a Bayesian approach: because the Beta distribution is the conjugate prior
of the Bernoulli reward, we can update our belief about each machine's payout rate in closed form
after every pull. Below is a plot of its performance compared with the SoftMax algorithm.
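One standard form of this idea is Thompson sampling with a Beta-Bernoulli conjugate pair. The sketch below uses the same made-up payout rates as the earlier examples and is not necessarily what `banditstrategy.py` implements:

```python
import numpy as np

rng = np.random.default_rng(5)
probs = [0.13, 0.03, 0.06]          # illustrative payout rates
# Beta(1, 1) prior on each machine's payout rate; Beta is conjugate to
# the Bernoulli reward, so the posterior update is just two counters
alpha, beta = np.ones(3), np.ones(3)

for _ in range(10000):
    samples = rng.beta(alpha, beta)      # one draw from each posterior
    arm = int(np.argmax(samples))        # play the most promising draw
    reward = float(rng.random() < probs[arm])
    alpha[arm] += reward                 # count successes
    beta[arm] += 1 - reward              # count failures
```

Machines with uncertain posteriors occasionally produce high draws and get explored, while machines with confidently low posteriors are quickly abandoned.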

In [41]:
fig, ax = plt.subplots(1, 1, figsize=(8, 4))
plt.hist(softmax, alpha=.45, color='r', bins=30, density=True);
#plt.hist(eps_greedy, alpha=.45, color='g', bins=30, density=True);
#plt.hist(ucb1, alpha=.45, color='b', bins=30, density=True);
plt.hist(Baysian, alpha=.45, color='k', bins=30, density=True);
plt.grid()
legend = ['SoftMax', 'Bayes']
plt.legend(legend)
plt.title('Histogram of winnings over 1000 runs: SoftMax & Bayesian')
plt.show()

Conclusion:
It looks like the SoftMax algorithm performs the best, and it does so consistently across runs.
I would love to dig deeper into this.
