
In [1]:

%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as optim
import scipy.stats as scs
import pandas as pd

In [2]:

%load_ext autoreload
%autoreload 2
%cd src
%reload_ext autoreload

/Users/elliottsaslow/gaalvanize/dsi-multi-armed-bandit/src

Multi-Armed Bandit:

A/B testing with Bayesian updating

Exploration: Testing out the different options to determine how good each one is. This includes acquiring
more knowledge about the reward for each option.

Exploitation: Leveraging your current knowledge about the options to get the highest expected
reward at that time.

A/B Testing in terms of Exploration & Exploitation

Initially, you start with exploration, where the same number of users is assigned to see each option.
Then, once the test is done, you move to exploitation, where all of the users see the option that you
chose.

Multi-Armed Bandit Approach

Show each user the site that you currently think is best most of the time.
As the experiment runs and you send users to the different sites, update your beliefs about each
site.
Run until there is a clear and distinct winner.
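The loop above can be sketched in a few lines of NumPy. This is only an illustration: the payout rates are made up, and `choose_arm` is a stand-in for whichever selection rule you plug in (the algorithms below each fill it in differently).

```python
import numpy as np

# hypothetical payout rates; in practice these are unknown to the player
true_probs = [0.13, 0.03, 0.06]
rng = np.random.default_rng(0)
wins = np.zeros(3)
trials = np.zeros(3)

def choose_arm(wins, trials):
    """Placeholder selection rule: any bandit strategy slots in here."""
    if trials.min() == 0:
        return int(np.argmin(trials))       # show every site at least once
    return int(np.argmax(wins / trials))    # then show the best so far

for _ in range(1000):
    arm = choose_arm(wins, trials)
    reward = float(rng.random() < true_probs[arm])  # did the user convert?
    wins[arm] += reward                              # update beliefs
    trials[arm] += 1
```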

Motivation and origins


The name originally comes from slot machines ("one-armed bandits"), where a gambler faces the
problem of playing a smart strategy: which machines to play, and in what order to play them. All that is
known is that each machine will provide a reward, with some unknown probability, when its lever is pulled.

Use cases
Dynamic A/B
Budget allocation amongst competing projects
Clinical trials
Adaptive routing in attempts to minimize network delays
Reinforcement learning

Applying different methods of the Multi-Armed Bandit:

We can easily imagine 2 fairly simple ways of trying to maximize one's winnings with a slot machine.

Set up: you have 3 slot machines to play, and one of them pays out more often
than the others. You do not know which machine pays better, but you can keep track of
how much on average you win from each. How do you maximize your winnings?

1. The most basic way is to just keep choosing randomly until you run out of money or are happy
with the amount of money that you've made.
2. Keep track of your winnings on each machine and just choose the one that
has the best average payout over and over again. This is called the 'Max Mean Method'.
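Strategy 2 might look like the sketch below (payout rates are made up for illustration; the `BanditStrategy` class used later wraps the same idea, though its internals may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
probs = [0.13, 0.03, 0.06]          # illustrative payout rates
wins, trials = np.zeros(3), np.zeros(3)

for t in range(1000):
    if t < 3:
        arm = t                                  # try each machine once
    else:
        arm = int(np.argmax(wins / trials))      # then stick with the best mean
    wins[arm] += float(rng.random() < probs[arm])
    trials[arm] += 1
```

The weakness of this rule is that an early unlucky streak on the best machine can lock it out forever, which is why the algorithms later in this notebook keep some exploration going.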

Let's take a look at how we perform with just the randomized strategy:

In [3]:
from bandits import Bandits
from banditstrategy import BanditStrategy

machines = [0.13, 0.03, 0.06]
bandits = Bandits(machines)
strat = BanditStrategy(bandits, 'random_choice')
strat.sample_bandits(10000)

print('We have three slot machines that pay out with the probabilities '
      '1: {x}, 2: {y}, 3: {z}'.format(x=machines[0], y=machines[1], z=machines[2]))
print("Number of trials for each machine: {x}".format(x=list(strat.trials)))
print("Number of wins for each machine: {x}".format(x=list(strat.wins)))
print("Conversion rates for each machine: {x}".format(x=list(strat.wins / strat.trials)))
print("A total of %d wins of %d trials." % (strat.wins.sum(), strat.trials.sum()))
We have three slot machines that pay out with the probabilities 1: 0.13, 2: 0.03, 3: 0.06
Number of trials for each machine: [3331.0, 3358.0, 3314.0]
Number of wins for each machine: [428.0, 91.0, 206.0]
Conversion rates for each machine: [0.12848994296007205, 0.027099463966646812, 0.062160531080265542]
A total of 725 wins of 10003 trials.

Looks like we split between all the machines pretty evenly and got the expected result.

Let's try the Max Mean Method, where we choose the machine based on which one has been
performing the best:

In [4]:
machines = [0.13, 0.03, 0.06]
bandits = Bandits(machines)
strat = BanditStrategy(bandits, 'max_mean')
strat.sample_bandits(10000)

print('We have three slot machines that pay out with the probabilities '
      '1: {x}, 2: {y}, 3: {z}'.format(x=machines[0], y=machines[1], z=machines[2]))
print("Number of trials for each machine: {x}".format(x=list(strat.trials)))
print("Number of wins for each machine: {x}".format(x=list(strat.wins)))
print("Conversion rates for each machine: {x}".format(x=list(strat.wins / strat.trials)))
print("A total of %d wins of %d trials." % (strat.wins.sum(), strat.trials.sum()))

We have three slot machines that pay out with the probabilities 1: 0.13, 2: 0.03, 3: 0.06
Number of trials for each machine: [9803.0, 100.0, 100.0]
Number of wins for each machine: [1230.0, 4.0, 6.0]
Conversion rates for each machine: [0.12547179434866879, 0.040000000000000001, 0.059999999999999998]
A total of 1240 wins of 10003 trials.

Here we can see that we are starting to do better than just random! Out of 10000 trials, we were able
to increase our wins from 725 to 1240. This was done by running roughly 100 exploratory pulls per
machine and then choosing the machine that had the highest payout in those initial tests!

Next: let's take a look at some of the possible algorithms we can run to make this even more
efficient!

Epsilon-Greedy

This algorithm uses a method similar to the Max Mean Method. About 10% of the time we explore by
testing a machine at random, and the rest of the time we use the knowledge gained from that
exploration to choose the machine with the highest observed payout! The 10% exploration parameter
is tunable, and we will call it epsilon for this algorithm. I have implemented it, so let's take a look at the results:
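The selection rule just described can be sketched as follows (a minimal version with made-up payout rates, not the `banditstrategy.py` implementation itself):

```python
import numpy as np

rng = np.random.default_rng(2)
probs = [0.13, 0.03, 0.06]          # illustrative payout rates
epsilon = 0.1                        # fraction of pulls spent exploring
wins, trials = np.zeros(3), np.zeros(3)

for _ in range(10000):
    if trials.min() == 0 or rng.random() < epsilon:
        arm = int(rng.integers(3))               # explore: pick at random
    else:
        arm = int(np.argmax(wins / trials))      # exploit: best mean so far
    wins[arm] += float(rng.random() < probs[arm])
    trials[arm] += 1
```

Because every machine keeps a small stream of exploratory pulls, a machine that was unlucky early still gets a chance to reveal its true payout rate.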

In [5]:
machines = [0.13, 0.03, 0.06]
bandits = Bandits(machines)
strat = BanditStrategy(bandits, 'epsilon_greedy')
strat.sample_bandits(10000)

print('We have three slot machines that pay out with the probabilities '
      '1: {x}, 2: {y}, 3: {z}'.format(x=machines[0], y=machines[1], z=machines[2]))
print("Number of trials for each machine: {x}".format(x=list(strat.trials)))
print("Number of wins for each machine: {x}".format(x=list(strat.wins)))
print("Conversion rates for each machine: {x}".format(x=list(strat.wins / strat.trials)))
print("A total of %d wins of %d trials." % (strat.wins.sum(), strat.trials.sum()))
We have three slot machines that pay out with the probabilities 1: 0.13, 2: 0.03, 3: 0.06
Number of trials for each machine: [7975.0, 1677.0, 351.0]
Number of wins for each machine: [1062.0, 57.0, 15.0]
Conversion rates for each machine: [0.13316614420062695, 0.033989266547406083, 0.042735042735042736]
A total of 1134 wins of 10003 trials.

OK, so let's start to compare how each of these performs against the others:

In [30]:
eps_greedy = []
max_mean = []
rando = []
softmax = []
ucb1 = []
Baysian = []
machines = [0.13, 0.05, 0.26]
bandits = Bandits(machines)
for i in range(1000):
    strat = BanditStrategy(bandits, 'epsilon_greedy')
    strat.sample_bandits(1000)
    eps_greedy.append(strat.wins.sum())
    strat2 = BanditStrategy(bandits, 'max_mean')
    strat2.sample_bandits(1000)
    max_mean.append(strat2.wins.sum())
    strat3 = BanditStrategy(bandits, 'random_choice')
    strat3.sample_bandits(1000)
    rando.append(strat3.wins.sum())
    strat4 = BanditStrategy(bandits, 'softmax')
    strat4.sample_bandits(1000)
    softmax.append(strat4.wins.sum())
    strat5 = BanditStrategy(bandits, 'ucb1')
    strat5.sample_bandits(1000)
    ucb1.append(strat5.wins.sum())
    # 'bayesian_bandit' assumed to be the Bayesian strategy's name here;
    # the original cell reused 'ucb1' for this slot
    strat6 = BanditStrategy(bandits, 'bayesian_bandit')
    strat6.sample_bandits(1000)
    Baysian.append(strat6.wins.sum())
banditstrategy.py:173: RuntimeWarning: divide by zero encountered in log
confidence_bounds = np.sqrt((2. * np.log(self.N)) / self.trials)
banditstrategy.py:173: RuntimeWarning: invalid value encountered in sqrt
confidence_bounds = np.sqrt((2. * np.log(self.N)) / self.trials)

In [31]:
fig, ax = plt.subplots(1, 1, figsize=(8, 4))
plt.hist(max_mean, alpha=.45, color='b', bins=30, density=True);
plt.hist(eps_greedy, alpha=.45, color='g', bins=30, density=True);
plt.hist(rando, alpha=.45, color='r', bins=30, density=True);

plt.grid()
legend = ['Max Mean Alg', 'Epsilon Greedy Alg', 'Random Choice Alg']
plt.legend(legend)
plt.title('Histogram of winnings over 1000 runs for each algorithm')
plt.show()
Looking at the performance of each of these algorithms, it is readily apparent that random
choice performs the worst by far. In second place we have max mean, but it is a close
second to the epsilon-greedy algorithm, which performed the best.

Let's look at a couple more algorithms that may be able to perform even better!

Soft max:

We are going to try another algorithm called softmax. It is interesting because it chooses the
machine probabilistically: it takes all of the observed means and chooses a machine with a probability
computed from them. The better a machine has performed, the higher the probability that it is chosen!
The equation for the softmax algorithm is:

$$P_t(\text{choosing bandit } j) = \frac{e^{\hat{\mu}_j(t)/\tau}}{\sum_{k=1}^{K} e^{\hat{\mu}_k(t)/\tau}}$$

where $\hat{\mu}_j(t)$ is the observed mean payout of machine $j$ at round $t$, $K$ is the number of machines, and $\tau$ is a temperature parameter that controls how greedy the choice is.
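The equation above translates directly into code. This is a minimal sketch with made-up payout rates and an illustrative temperature, not the `banditstrategy.py` implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
probs = [0.13, 0.03, 0.06]          # illustrative payout rates
tau = 0.05                           # temperature: lower = greedier
wins, trials = np.zeros(3), np.zeros(3)

for _ in range(10000):
    if trials.min() == 0:
        arm = int(np.argmin(trials))             # seed each machine once
    else:
        mu = wins / trials                       # observed means
        weights = np.exp(mu / tau)               # numerator of the formula
        p = weights / weights.sum()              # softmax probabilities
        arm = int(rng.choice(3, p=p))
    wins[arm] += float(rng.random() < probs[arm])
    trials[arm] += 1
```

As tau shrinks toward 0 the rule approaches pure max-mean exploitation; as tau grows large the choice approaches uniform random exploration.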

In [32]:
machines = [0.13, 0.03, 0.06]
bandits = Bandits(machines)
strat = BanditStrategy(bandits, 'softmax')
strat.sample_bandits(10000)

print('We have three slot machines that pay out with the probabilities '
      '1: {x}, 2: {y}, 3: {z}'.format(x=machines[0], y=machines[1], z=machines[2]))
print("Number of trials for each machine: {x}".format(x=list(strat.trials)))
print("Number of wins for each machine: {x}".format(x=list(strat.wins)))
print("Conversion rates for each machine: {x}".format(x=list(strat.wins / strat.trials)))
print("A total of %d wins of %d trials." % (strat.wins.sum(), strat.trials.sum()))
We have three slot machines that pay out with the probabilities 1: 0.13, 2: 0.03, 3: 0.06
Number of trials for each machine: [9944.0, 21.0, 38.0]
Number of wins for each machine: [1309.0, 1.0, 2.0]
Conversion rates for each machine: [0.13163716814159293, 0.047619047619047616, 0.052631578947368418]
A total of 1312 wins of 10003 trials.

As can be seen below, the softmax algorithm performs better than epsilon greedy.
In [33]:

fig, ax = plt.subplots(1, 1, figsize=(8, 4))

plt.hist(softmax, alpha=.45, color='r', bins=30, density=True);
plt.hist(eps_greedy, alpha=.45, color='g', bins=30, density=True);
plt.grid()
legend = ['SoftMax', 'Epsilon Greedy Alg']
plt.legend(legend)
plt.title('Histogram of winnings over 1000 runs: Epsilon Greedy & SoftMax')
plt.show()

Upper Confidence Bound Algorithm

Upper confidence bound chooses the machine that has the highest observed payout plus a term that
automatically balances exploration and exploitation.

Pick the machine that maximizes:

$$\hat{\mu}_j(t) + \sqrt{\frac{2\ln t}{n_j}}$$

where $\hat{\mu}_j(t)$ is the observed mean payout of machine $j$,

$n_j$ is the number of times that machine has been pulled,

$t$ is the total number of rounds so far.
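The rule above can be sketched as follows (made-up payout rates; the exploration bonus shrinks for machines that have already been pulled many times, which is what balances exploration against exploitation):

```python
import numpy as np

rng = np.random.default_rng(4)
probs = [0.13, 0.03, 0.06]          # illustrative payout rates
wins, trials = np.zeros(3), np.zeros(3)

for t in range(1, 10001):
    if trials.min() == 0:
        arm = int(np.argmin(trials))   # play each machine once first
    else:
        # observed mean plus the exploration bonus from the formula above
        ucb = wins / trials + np.sqrt(2 * np.log(t) / trials)
        arm = int(np.argmax(ucb))
    wins[arm] += float(rng.random() < probs[arm])
    trials[arm] += 1
```

Seeding each machine once before applying the formula avoids the divide-by-zero warnings seen in the comparison cell above, where the bound is computed with zero trials on some machines.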

In [34]:
machines = [0.13, 0.03, 0.06]
bandits = Bandits(machines)
strat = BanditStrategy(bandits, 'ucb1')
strat.sample_bandits(10000)

print('We have three slot machines that pay out with the probabilities '
      '1: {x}, 2: {y}, 3: {z}'.format(x=machines[0], y=machines[1], z=machines[2]))
print("Number of trials for each machine: {x}".format(x=list(strat.trials)))
print("Number of wins for each machine: {x}".format(x=list(strat.wins)))
print("Conversion rates for each machine: {x}".format(x=list(strat.wins / strat.trials)))
print("A total of %d wins of %d trials." % (strat.wins.sum(), strat.trials.sum()))
We have three slot machines that pay out with the probabilities 1: 0.13, 2: 0.03, 3: 0.06
Number of trials for each machine: [7665.0, 1008.0, 1330.0]
Number of wins for each machine: [953.0, 37.0, 72.0]
Conversion rates for each machine: [0.12433137638617091, 0.036706349206349208, 0.054135338345864661]
A total of 1062 wins of 10003 trials.

In [35]:
fig, ax = plt.subplots(1, 1, figsize=(8, 4))
plt.hist(softmax, alpha=.45, color='r', bins=30, density=True);
plt.hist(eps_greedy, alpha=.45, color='g', bins=30, density=True);
plt.hist(ucb1, alpha=.45, color='b', bins=30, density=True);
plt.grid()
legend = ['SoftMax', 'Epsilon Greedy Alg', 'Ucb1']
plt.legend(legend)
plt.title('Histogram of winnings over 1000 runs: Epsilon Greedy, SoftMax & UCB1')
plt.show()

Bayesian approach.

Finally, it is possible to use a Bayesian approach: because the Beta distribution is the conjugate prior
of the Bernoulli reward, we can update our belief about each machine's payout rate in closed form
after every pull. Below is a plot of its performance compared with the SoftMax algorithm.
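One standard form of this idea is Thompson sampling with a Beta-Bernoulli conjugate pair. The sketch below uses the same made-up payout rates as the earlier examples and is not necessarily what `banditstrategy.py` implements:

```python
import numpy as np

rng = np.random.default_rng(5)
probs = [0.13, 0.03, 0.06]          # illustrative payout rates
# Beta(1, 1) prior on each machine's payout rate; Beta is conjugate to
# the Bernoulli reward, so the posterior update is just two counters
alpha, beta = np.ones(3), np.ones(3)

for _ in range(10000):
    samples = rng.beta(alpha, beta)      # one draw from each posterior
    arm = int(np.argmax(samples))        # play the most promising draw
    reward = float(rng.random() < probs[arm])
    alpha[arm] += reward                 # count successes
    beta[arm] += 1 - reward              # count failures
```

Machines with uncertain posteriors occasionally produce high draws and get explored, while machines with confidently low posteriors are quickly abandoned.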

In [41]:
fig, ax = plt.subplots(1, 1, figsize=(8, 4))
plt.hist(softmax, alpha=.45, color='r', bins=30, density=True);
#plt.hist(eps_greedy, alpha=.45, color='g', bins=30, density=True);
#plt.hist(ucb1, alpha=.45, color='b', bins=30, density=True);
plt.hist(Baysian, alpha=.45, color='k', bins=30, density=True);
plt.grid()
legend = ['SoftMax', 'Bayes']
plt.legend(legend)
plt.title('Histogram of winnings over 1000 runs: SoftMax & Bayesian')
plt.show()

Conclusion:
It looks like the SoftMax algorithm performs the best, and it does so consistently across runs.
I would love to dig deeper into this.
