-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathAB-vs-MAB--making-a-choice.html
238 lines (225 loc) · 12.4 KB
/
AB-vs-MAB--making-a-choice.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
<!DOCTYPE html>
<html lang="en">
<head>
<!-- ## for client-side less
<link rel="stylesheet/less" type="text/css" href="/theme/css/style.less">
<script src="http://cdnjs.cloudflare.com/ajax/libs/less.js/1.7.3/less.min.js" type="text/javascript"></script>
-->
<link rel="stylesheet" type="text/css" href="/theme/css/style.css">
<link rel="stylesheet" type="text/css" href="/theme/css/pygments.css">
<link rel="stylesheet" type="text/css" href="//fonts.googleapis.com/css?family=PT+Sans|PT+Serif|PT+Mono">
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="author" content="Gaël">
<meta name="description" content="Posts and writings by Gaël">
<meta name="keywords" content="">
<title>
And yet it moves
– Choosing the best variant, Multi-armed bandit vs. A/B testing for web service optimization </title>
</head>
<body>
<aside>
<div id="user_meta">
<div id="metalogo">
<a href="/">
<img src="/images/logo.svg" alt="logo" id="logo">
</a>
<a href="/" id="title">And yet it moves</a>
</div>
<p></p>
<ul>
<li><a href="/category/data-processing.html" class="active">Data processing</a></li>
</ul>
</div>
</aside>
<main>
<header>
<p>
<a href="/">Index</a> ¦ <a href="/archives.html">Archives</a>
</p>
</header>
<article>
<div class="series_toc">
<p>This post is part 3 of the "AB vs. MAB" series:</p>
<ol class="parts">
<li >
<a href='/AB-vs-MAB.html'>Multi-armed bandit vs. A/B testing for web service optimization</a>
</li>
<li >
<a href='/AB-vs-MAB--basics.html'>Basics of testing for web optimisation, Multi-armed bandit vs. A/B testing for web service optimization</a>
</li>
<li class="active">
<a href='/AB-vs-MAB--making-a-choice.html'>Choosing the best variant, Multi-armed bandit vs. A/B testing for web service optimization</a>
</li>
<li >
<a href='/AB-vs-MAB--pros-of-AB-testing.html'>The pros of the A/B testing method, Multi-armed bandit vs. A/B testing for web service optimization</a>
</li>
<li >
<a href='/AB-vs-MAB--reward-maximisation.html'>Reward maximisation, Multi-armed bandit vs. A/B testing for web service optimization</a>
</li>
<li >
<a href='/AB-vs-MAB--other-drawbacks-conclusion.html'><span class="caps">MAB</span> other drawbacks and conclusion, Multi-armed bandit vs. A/B testing for web service optimization</a>
</li>
</ol>
</div>
<div class="article_meta">
<p style="text-align: right; margin: 0;">Posted on: Sat, 14 May 2016</p>
</div>
<div class="article_title">
<h3><a href="/AB-vs-MAB--making-a-choice.html">Choosing the best variant, Multi-armed bandit vs. A/B testing for web service optimization</a></h3>
</div>
<div class="article_text">
<p>Have you ever been given a choice that you found very hard to make? In
some situations, that choice is hard because you don’t like any of the
outcomes. In other situations, it is hard because you just <strong>don’t
know</strong> what these outcomes would be. In that latter case, you would
generally try to predict what these outcomes would be either using pure
intuition, common sense, your knowledge and experience, or even fancy
mathematical stuff.</p>
<p>In the web-service realm, there is one such a question that haunts a lot
of people, day after day, and even at night in their worst nightmares:</p>
<blockquote>
Is there anything we can do to make more money out of my website?</blockquote>
<div class="section" id="website-optimization">
<h2 id="website-optimization">Website optimization</h2>
<p><em>What is he talking about?</em> could you ingenuously say… People, like a
lot of other things, tend to react the same way <em>on average</em> given the
same stimulus: people tend to take an umbrella or a rain coat with them
when it is raining. Nothing really surprising there.</p>
<p>But did you know that a simple <a class="reference external" href="http://dx.doi.org/10.1111/j.1745-459X.2012.00397.x">cup color could change the way you taste
chocolate</a>? In a
very similar way, changing a simple button color on a website for
instance could increase the total number of clicks, yielding more
incomes. The process of finding such an optimal design is called
<em>website optimization</em>.</p>
</div>
<div class="section" id="of-time-machines-and-statistics">
<h2 id="of-time-machines-and-statistics">Of time machines and statistics</h2>
<p>But how do we know that one variant yields a higher reward? Strictly
speaking, knowing it <em>exactly</em> would require a time machine to test all
the variants in the <em>exact</em> same conditions. In one timeline, you could
provide the first variant. In the other timeline you could provide the
alternative variant and then use your time machine a last time to settle
on the best-performing variant.</p>
<p>However, if we can afford to be wrong <em>sometimes</em>, there is another way
to answer that question without breaking any physical law:
<em>statistics</em>.</p>
<p>In our button color example we want to get the maximum number of clicks.
This is our objective as well as our reward. That number of click is not
easy to handle: obviously if 5 people out of 10 clicked, it is not the
same as 5 people out of 10000. That is why we are really looking at is
the click-through rate: the number of people who clicked over the number
of people who visited the website.</p>
<p>However, if you spend your time (and your visitors) assessing which
variant is performing the best, your reward (i.e. total number of
clicks) might be rather disappointing. This is why the testing campaign
is usually performed on a subset of the total visitor <strong>population</strong>.
That subset is called a <strong>sample</strong> of the population. That sample will
be subjected to each variant of the website, in our example:</p>
<ul class="simple">
<li>one variant of the website, would contain an orange <object data="/images/orange_btn.svg" type="image/svg+xml">*Buy Now!*</object>
button</li>
<li>while the other variant would contain a green <object data="/images/green_btn.svg" type="image/svg+xml">image1</object> one.</li>
</ul>
<p>Once the testing period is over, the seemingly best-performing variant
is served to all the remaining visitors (this is sometimes called the
exploitation period).</p>
<p><span class="dquo">“</span>Problem solved!” you said? Unfortunately, nope. At this stage, we have
only reached a precision which is akin to “people die from hunger, so
let’s give them food”: if this was enough, people wouldn’t die from
hunger anymore. The devil<strong>s</strong> are in the details: how large should
that sample be? how often are we expecting to be wrong? how should be
provide the variants exactly? <em>etc.</em></p>
<p>This is exactly why some <em>specific</em> methods were devised, such as A/B
testing (<a class="reference external" href="https://en.wikipedia.org/wiki/A/B_testing">wiki</a>) and the
multi-armed bandit strategy
(<a class="reference external" href="https://en.wikipedia.org/wiki/Multi-armed_bandit">wiki</a>). And these
are the one I am going to study.</p>
<p>If that part is not clear enough, here is a longer description.</p>
</div>
<div class="section" id="a-b-testing">
<h2 id="ab-testing">A/B testing</h2>
<p>The purpose for which A/B testing was designed is to answer a single question:</p>
<blockquote>
Is any of my variants (i.e. <object data="/images/orange_btn.svg" type="image/svg+xml">the orange</object> and <object data="/images/green_btn.svg" type="image/svg+xml">the green button in our example</object>) inducing a significantly different outcome (i.e.
click-through rate here)?</blockquote>
<p>As explained above, A/B testing is a two-staged method:</p>
<ol class="arabic simple">
<li><em>Testing stage</em>: this is where the data-gathering and analysis take place.</li>
<li><em>Exploitation stage</em>: this is where only the best-performing variant
is served.</li>
</ol>
<p>Proper A/B testing requires all the variants to be served almost as
often at any point of time. Even though this is in no way a hard
requirement, we will see that is is actually a very efficient way to get
the highest confidence in our data.</p>
<p>So where is the “testing” of <em>A/B testing</em> coming from? The actual
testing is performed at the end of the <em>testing stage</em> to provide an
actual answer to the question above. That test in called a <em>contingency
test</em> and for this kind of <em>Bernoulli processes</em> (i.e. binary data like
“has clicked or not?”) we are going to use <em>Fisher’s exact test</em>
(<a class="reference external" href="https://en.wikipedia.org/wiki/Fisher's_exact_test">wiki</a>).</p>
<p>Why is such a test important? Because it answers the question above
while providing an estimate of what we could call <em>confidence</em>. With A/B
testing we can say:</p>
<blockquote>
We proved that the green button yields higher click-through rates
with only a <span class="math">\(1\%\)</span> chance of being wrong here.</blockquote>
<p>This is a very precious piece of information if the existence of a
difference matters (like in drug trials to name one).</p>
</div>
<div class="section" id="the-multi-armed-bandit-method">
<h2 id="the-multi-armed-bandit-method">The multi-armed bandit method</h2>
<p>The multi-armed bandit method was allegedly designed for only one purpose:</p>
<blockquote>
Maximizing the reward (i.e. the total number of clicks in our case
study).</blockquote>
<p>The particular multi-armed bandit strategy presented in the original
blog post was not specifically designed in a two-staged fashion (even
though it <em>could</em> be used that way). It was instead presented in a
<em>set-and-forget</em> mode, in which there is no testing stage: the method is
applied on the total population.</p>
<p>It works by first determining which variant worked the best so far. That
variant is then favoured and is served more often than the alternative.
The instant another variant is found better (as more data are gathered),
it is served more often.</p>
<p>In contrast with A/B testing, there seem to be no fool-proof way to
estimate a suitable sample size without making a few assumptions. In
that sense, the set-and-forget mode is more a workaround than a feature
of the method.</p>
<p>There is also no test associated with that strategy. A contingency test
could be applied but we will see that there are drawbacks to that.</p>
<p>To sum it up, the multi-armed bandit method is more accurately described
as a strategy: it does not answer any question. This is generally not a
problem because the sample sizes people use are usually large enough to
provide <strong>a</strong> correct answer.</p>
</div>
<div class="section" id="main-differences-between-the-methods">
<h2 id="main-differences-between-the-methods">Main differences between the methods</h2>
<p>In practice, both methods could be used in the <em>two-stepped</em> mode and a
contingency test can be performed in both.</p>
<p>However, in contrast with a few claims of the original blog post:</p>
<ul class="simple">
<li>A/B testing is only a specific name given to a 2-variant <strong>bucket
tests</strong> (<span class="math">\(A\)</span> and <span class="math">\(B\)</span>, you guessed it). However nothing
prevents us from using it on more variants (Fisher’s exact test can
be used on any number of variants or all the variants can be compared 2-by-2).</li>
<li>Variants can be added or removed any time. In particular, A/B testing
does not <em>require</em> the variants to be served as often to work.
However, serving them as often at any given time is actually more
efficient and prevents a whole class of biases (I will come to that later).</li>
</ul>
<p>So what is the real core difference?</p>
<blockquote>
A/B testing serves each variant as often whereas the <span class="caps">MAB</span> strategy
favours one.</blockquote>
<p>That’s it.</p>
</div>
</div>
</article>
<div id="ending_message">
<p>© Gaël. Built using <a href="http://getpelican.com" target="_blank">Pelican</a>. Theme by Giulio Fidente on <a href="https://github.com/gfidente/pelican-svbhack" target="_blank">github</a>. </p>
</div>
</main>
</body>
</html>