<!DOCTYPE html>
<html lang="en">
<head>
<!-- ## for client-side less
<link rel="stylesheet/less" type="text/css" href="/theme/css/style.less">
<script src="http://cdnjs.cloudflare.com/ajax/libs/less.js/1.7.3/less.min.js" type="text/javascript"></script>
-->
<link rel="stylesheet" type="text/css" href="/theme/css/style.css">
<link rel="stylesheet" type="text/css" href="/theme/css/pygments.css">
<link rel="stylesheet" type="text/css" href="//fonts.googleapis.com/css?family=PT+Sans|PT+Serif|PT+Mono">
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="author" content="Gaël">
<meta name="description" content="Posts and writings by Gaël">
<meta name="keywords" content="">
<title>
And yet it moves
– Reward maximisation, Multi-armed bandit vs. A/B testing for web service optimization </title>
</head>
<body>
<aside>
<div id="user_meta">
<div id="metalogo">
<a href="/">
<img src="/images/logo.svg" alt="logo" id="logo">
</a>
<a href="/" id="title">And yet it moves</a>
</div>
<p></p>
<ul>
<li><a href="/category/data-processing.html" class="active">Data processing</a></li>
</ul>
</div>
</aside>
<main>
<header>
<p>
<a href="/">Index</a> ¦ <a href="/archives.html">Archives</a>
</p>
</header>
<article>
<div class="series_toc">
<p>This post is part 5 of the "AB vs. MAB" series:</p>
<ol class="parts">
<li >
<a href='/AB-vs-MAB.html'>Multi-armed bandit vs. A/B testing for web service optimization</a>
</li>
<li >
<a href='/AB-vs-MAB--basics.html'>Basics of testing for web optimisation, Multi-armed bandit vs. A/B testing for web service optimization</a>
</li>
<li >
<a href='/AB-vs-MAB--making-a-choice.html'>Choosing the best variant, Multi-armed bandit vs. A/B testing for web service optimization</a>
</li>
<li >
<a href='/AB-vs-MAB--pros-of-AB-testing.html'>The pros of the A/B testing method, Multi-armed bandit vs. A/B testing for web service optimization</a>
</li>
<li class="active">
<a href='/AB-vs-MAB--reward-maximisation.html'>Reward maximisation, Multi-armed bandit vs. A/B testing for web service optimization</a>
</li>
<li >
<a href='/AB-vs-MAB--other-drawbacks-conclusion.html'><span class="caps">MAB</span> other drawbacks and conclusion, Multi-armed bandit vs. A/B testing for web service optimization</a>
</li>
</ol>
</div>
<div class="article_meta">
<p style="text-align: right; margin: 0;">Posted on: Wed, 18 May 2016</p>
</div>
<div class="article_title">
<h3><a href="/AB-vs-MAB--reward-maximisation.html">Reward maximisation, Multi-armed bandit vs. A/B testing for web service optimization</a></h3>
</div>
<div class="article_text">
<p>I showed in the previous section that the A/B testing data-gathering
method by itself has some advantages over the <span class="caps">MAB</span> strategy.</p>
<p>But how does this translate to <em>what really matters</em>™, clicks?</p>
<p>In contrast with the previous section, we are not just interested in the
<em>testing stage</em> anymore: we want to know which method would provide the
highest click-through rate overall (so during testing and after, once
the alleged best-performing variant is put in production). I will first
give an example in a specific setup and then explore the difference in
behavior in a much broader scope to avoid making “general” conclusions
out of a single specific example.</p>
<div class="section" id="effect-of-the-sample-size">
<h2 id="effect-of-the-sample-size">Effect of the sample size</h2>
<p>So, I first simulated a large number of complete campaigns (i.e. a testing
stage followed by an exploitation stage that uses the testing results). I
assumed here that the population is composed of <span class="math">\(20000\)</span> visitors
in total. In this particular example, I used the two-stepped variant of
the <span class="caps">MAB</span> strategy; note that it degenerates into the set-and-forget mode when
the sample size is equal to the population size. I obtained the
following figure (a minimal sketch of one such simulated campaign is given just after it):</p>
<div class="figure">
<object data="/images/clicks_vs_sample_size.svg" type="image/svg+xml">
</object>
<p class="caption">If you can’t see the figure, please try another web browser which
supports <span class="caps">SVG</span> images</p>
</div>
<p>As we can see, there are different areas where each method shines but we
are going to get back to that in detail a bit later.</p>
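<p>To make the setup more concrete, here is a minimal sketch of how a single such
campaign could be simulated. This is a simplified, illustrative version rather than
the exact code behind the figures; it assumes two variants with click-through rates of
<span class="math">\(10\%\)</span> and <span class="math">\(15\%\)</span> and a simple
“serve the front-runner with probability <span class="math">\(\varepsilon\)</span>” rule
during the <span class="caps">MAB</span> testing stage:</p>
<div class="highlight"><pre># Minimal sketch (not the original simulation code): one complete campaign
# over a population of visitors, with a testing stage of `sample_size`
# visitors followed by an exploitation stage serving the apparent winner.
import numpy as np

rng = np.random.default_rng(0)

def run_campaign(method, sample_size, population=20_000,
                 rates=(0.10, 0.15), epsilon=0.9):
    """Return the overall click-through rate of one simulated campaign."""
    clicks = np.zeros(2)   # clicks observed for each variant
    trials = np.zeros(2)   # times each variant was served

    # Testing stage.
    for _ in range(sample_size):
        if method == "ab":                 # A/B: even split
            arm = rng.integers(2)
        else:                              # MAB: favour the front-runner
            best = int(np.argmax(clicks / np.maximum(trials, 1)))
            arm = best if rng.random() &lt; epsilon else rng.integers(2)
        trials[arm] += 1
        clicks[arm] += rng.random() &lt; rates[arm]

    # Exploitation stage: serve the seemingly best variant to everyone left.
    best = int(np.argmax(clicks / np.maximum(trials, 1)))
    extra = rng.binomial(population - sample_size, rates[best])
    return (clicks.sum() + extra) / population

# Averaging over many campaigns smooths out the randomness.
print(np.mean([run_campaign("ab", 2_000) for _ in range(200)]))
print(np.mean([run_campaign("mab", 2_000) for _ in range(200)]))
</pre></div>
<p>Sweeping the sample size (and, in the following sections, the population size,
<span class="math">\(\varepsilon\)</span> and the click-through rates themselves) while
averaging over many such runs yields the kind of curves and maps shown in this post.</p>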
<p>The asymptotic behavior of the two two-stepped methods is, however, similar:</p>
<ul class="simple">
<li>We can see that the total click-through rate rises quickly at smaller
sample sizes (left of the figure), yet A/B testing systematically
reaches higher click rates at smaller sample sizes than the
two-stepped <span class="caps">MAB</span> strategy does.</li>
<li>A maximum is then reached in both methods: that maximum is reached,
again, at smaller sample sizes in the A/B testing framework.</li>
<li>Finally, the total click rate decreases linearly toward the final
value in each case (which corresponds to the set-and-forget
scenario). These limit values are known and could be calculated in
each case: for A/B testing this is generally the average of the
individual click-through rates whereas in the <span class="caps">MAB</span> strategy, this
average is weighted so that higher values are obtained.</li>
</ul>
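<p>As a rough illustration with the click-through rates used throughout this series
(<span class="math">\(10\%\)</span> and <span class="math">\(15\%\)</span>): the A/B testing limit
is simply <span class="math">\((10\% + 15\%)/2 = 12.5\%\)</span>, whereas an <span class="caps">MAB</span>
campaign that ends up serving the better variant, say, <span class="math">\(95\%\)</span> of the
time approaches <span class="math">\(0.95 \times 15\% + 0.05 \times 10\% = 14.75\%\)</span>. The
exact weighting depends on the implementation and on how often the wrong variant ends up
being favoured.</p>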
<p>This kind of curve seems to be generally misunderstood: because the <span class="caps">MAB</span>
strategy provides higher values over the largest range of sample sizes, people tend to
conclude that the <span class="caps">MAB</span> strategy is therefore generally better. <strong>This is
incorrect</strong>.</p>
<p>What this curve really means is that A/B testing does actually compete
fairly well with the <span class="caps">MAB</span> strategy in terms of click-through rates. But
let me explain using the following close-up:</p>
<div class="figure">
<object data="/images/closer_clicks_vs_sample_size.svg" type="image/svg+xml">
</object>
<p class="caption">If you can’t see the figure, please try another web browser which
supports <span class="caps">SVG</span> images</p>
</div>
<p>Roughly the same total number of clicks can be obtained in A/B testing
using smaller sample sizes (and therefore shorter testing periods). So
there are conditions in which A/B testing performs as well as the <span class="caps">MAB</span>
strategy click-wise! All we have to do is to find these conditions.
Easier said than done, but it should be possible using the same tools
that have allowed us to determine the sample size so far.</p>
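<p>For reference, one standard tool of this kind is the textbook sample-size calculation
for comparing two proportions; the sketch below is only an illustration and may differ in
detail from the exact calculation used earlier in this series:</p>
<div class="highlight"><pre># Standard normal-approximation sample size for detecting a difference
# between two proportions p1 and p2 (two-sided test).  Shown only as an
# illustration of the kind of tool involved.
from scipy.stats import norm

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    return ((z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
             + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
            / (p2 - p1) ** 2)

print(sample_size_per_variant(0.10, 0.15))   # roughly 690 trials per variant
</pre></div>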
<p>That said, the <span class="caps">MAB</span> strategy behaves better than A/B testing
asymptotically: this is actually what this algorithm is good at and what
it was designed for. That property can come in handy: it can serve as a
safety net in many situations and it makes the set-and-forget mode
something that can actually be used.</p>
<p>However, telling when to stop in the two-stepped <span class="caps">MAB</span> case is even harder
than with A/B testing (because we don’t know in advance how many trials the
algorithm is going to spend on each variant, especially for smaller
sample sizes).</p>
<p>The alleged click-through rate superiority of the <span class="caps">MAB</span> strategy is not
that clear anymore; in fact, A/B testing seems to provide equivalent
results at even smaller sample sizes. However, this does not mean that
the <span class="caps">MAB</span> strategy performs badly click-wise. Moreover, these results were
obtained in a single setup. It is now time to explore a bit more.</p>
</div>
<div class="section" id="population-size">
<h2 id="population-size">Population size</h2>
<p>In the above example, the population size was chosen equal to <span class="math">\(20000\)</span>.
But what would happen to the total click-through rate if that population
size were different? This is what I assessed in the following figure:</p>
<div class="figure">
<object data="/images/diff_vs_sample_size_vs_tot.svg" type="image/svg+xml">
</object>
<p class="caption">If you can’t see the figure, please try another web browser which
supports <span class="caps">SVG</span> images</p>
</div>
<p>The blue patches represent conditions in which A/B testing provides a
significantly better click-through rate on average than the <span class="caps">MAB</span>
strategy. The opposite situation is represented by the green patches.
The white patches represent conditions in which both methods give
roughly the same click-through rates.</p>
<p>It appears clearly that A/B testing is systematically better at lower
sample sizes. As the population increases, the range of sample sizes in
which A/B testing performs better becomes wider. Though not represented
on the figure, the difference in the maximum click-through rate between
the methods is negligible. This means that A/B testing can
systematically reach the same number of clicks as the <span class="caps">MAB</span> strategy.
Again, the <span class="caps">MAB</span> strategy obviously performs at least as well at higher sample sizes.</p>
<p>This can be explained by the asymptotic behaviour of the two methods as
illustrated previously: as the population size increases, the worst-case
scenario (i.e. the set-and-forget mode, where all the time is spent in
the testing stage) is pushed further and further to the right, resulting
in a much gentler slope for both methods. However, as the slope of the
A/B testing curve was much steeper to begin with, the improvement is
more spectacular there, resulting in a wider range of superiority.</p>
</div>
<div class="section" id="mab-asymmetry">
<h2 id="mab-asymmetry"><span class="caps">MAB</span> asymmetry</h2>
<p>As I have explained earlier, the core of the <span class="caps">MAB</span> strategy consists in
favouring the best-performing variant. The strength of this preference is
quantified by a variable called <span class="math">\(\varepsilon\)</span> (the Greek letter
“epsilon”). When <span class="math">\(\varepsilon = 0\%\)</span>, the <span class="caps">MAB</span> strategy favours the
seemingly best-performing variant by <span class="math">\(0\%\)</span>: this is equivalent to
proper A/B testing data-gathering (no variant is ever favoured). When
<span class="math">\(\varepsilon = 100\%\)</span>, the algorithm is trapped: whichever variant
was found best initially will be the only one served.</p>
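<p>The serving rule assumed in the simulation sketch above (one plausible reading of
<span class="math">\(\varepsilon\)</span>, not necessarily the exact implementation discussed
so far in the series) is simply:</p>
<div class="highlight"><pre># Hypothetical serving rule consistent with the description above: with
# probability epsilon serve the current front-runner, otherwise pick a
# variant uniformly at random.  epsilon = 0 degenerates into an A/B split;
# epsilon = 1 locks onto whichever variant looked best first.
import random

def choose_variant(estimated_rates, epsilon):
    best = max(range(len(estimated_rates)), key=estimated_rates.__getitem__)
    if random.random() &lt; epsilon:
        return best
    return random.randrange(len(estimated_rates))
</pre></div>
<p>With two variants, such a rule serves the front-runner a fraction
<span class="math">\((1 + \varepsilon)/2\)</span> of the time during the testing stage, which
matches the two limiting cases described above.</p>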
<div class="figure">
<object data="/images/epsilon_vs_sample_size_vs_tot.svg" type="image/svg+xml">
</object>
<p class="caption">If you can’t see the figure, please try another web browser which
supports <span class="caps">SVG</span> images</p>
</div>
<p>Again, A/B testing seems to perform better at lower sample sizes. Though
not shown in the figure, the maximal value reached with each method for
each <span class="math">\(\varepsilon\)</span> is again nearly equal (i.e. click-wise, the two
methods once again perform equally well).</p>
<p>It also appears clearly that, in this particular setup, the two methods
are roughly equivalent below <span class="math">\(\varepsilon = 50\%\)</span> at smaller
sample sizes. Yet the <span class="caps">MAB</span> strategy remains better at higher sample
sizes. This means that, click-wise, the <span class="caps">MAB</span> strategy in these setups is
always as good as or better than A/B testing on average. The price to pay
is obviously that the <span class="caps">MAB</span> strategy gives worse and worse results at even
higher sample sizes (in particular, you don’t want a small
<span class="math">\(\varepsilon\)</span> in the set-and-forget mode).</p>
<p>This can be explained by considering yet again the first figure of this
post: <span class="math">\(\varepsilon\)</span> basically determines the right-hand asymptotic
value: the higher <span class="math">\(\varepsilon\)</span>, the higher the value, provided the
population size is large enough. “Large enough” depends on
<span class="math">\(\varepsilon\)</span> too: the higher <span class="math">\(\varepsilon\)</span>, the larger the
population should be. This is a trade-off, and an optimum
<span class="math">\(\varepsilon\)</span> can be computed for a given population size using simulations.</p>
</div>
<div class="section" id="click-through-rate-reference-value">
<h2 id="click-through-rate-reference-value">Click-through rate reference value</h2>
<p>Until now, I have basically always set one click-through rate at
<span class="math">\(10\%\)</span> and the other at <span class="math">\(15\%\)</span> (corresponding to a
difference of <span class="math">\(15 - 10 = 5\%\)</span>). Now I want to see what would happen
if I introduced an <strong>offset</strong> to shift these values upward. This is what
I obtained:</p>
<div class="figure">
<object data="/images/POXorig_vs_sample_size_vs_tot.svg" type="image/svg+xml">
</object>
<p class="caption">If you can’t see the figure, please try another web browser which
supports <span class="caps">SVG</span> images</p>
</div>
<p>It appears very clearly that the largest range of A/B testing
superiority is obtained for an offset of <span class="math">\(\sim 35\%\)</span>: this
corresponds to actual click-through rates of <span class="math">\(45\%\)</span> and
<span class="math">\(50\%\)</span>, i.e. almost centered on <span class="math">\(50\%\)</span>.</p>
<p>This result is actually caused by a change in the relative power of the
two data-gathering strategies: when the click-through rates are around
<span class="math">\(50\%\)</span>, the A/B testing method finds the correct variant even
faster than in other cases, while the <span class="caps">MAB</span> strategy is not really better
because it keeps favouring one variant at the expense of the other.</p>
</div>
<div class="section" id="difference-in-click-through-rates">
<h2 id="difference-in-click-through-rates">Difference in click-through rates</h2>
<p>The last parameter I wanted to study is the difference itself between
the click-through rates. This is what I represented in the following
figure (the <span class="math">\({\Delta}P(O|X)\)</span> notation represents this difference):</p>
<div class="figure">
<object data="/images/POXdiff_vs_sample_size_vs_tot.svg" type="image/svg+xml">
</object>
<p class="caption">If you can’t see the figure, please try another web browser which
supports <span class="caps">SVG</span> images</p>
</div>
<p>It appears clearly that the <span class="caps">MAB</span> strategy provides a far broader range of
superiority than A/B testing. Should we conclude that the <span class="caps">MAB</span> strategy
is simply better for large differences? Not this time: there is still a
range of values in which A/B testing provides better results, but it is
smaller and we would need a far finer grid to see it. For instance, a
difference of <span class="math">\(50\%\)</span> can be correctly identified <span class="math">\(95\%\)</span> of
the time with only a few dozen trials, while the minimum sample size
represented here is actually <span class="math">\(100\)</span>.</p>
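<p>As an illustration with the textbook two-proportion formula sketched earlier: for
click-through rates of, say, <span class="math">\(10\%\)</span> and <span class="math">\(60\%\)</span>
(a <span class="math">\(50\%\)</span> difference), a significance level of
<span class="math">\(5\%\)</span> and a power of <span class="math">\(95\%\)</span>, the formula gives
roughly <span class="math">\(21\)</span> trials per variant, i.e. around
<span class="math">\(40\)</span> in total, indeed only a few dozen.</p>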
<p>The wider range of click-wise superiority of the <span class="caps">MAB</span> strategy here can
again be explained by the power of the strategies: larger differences
are much easier to identify (far fewer trials are required to do so).
Therefore the <span class="caps">MAB</span> strategy starts favouring the right variant at smaller
sample sizes and gains the advantage. Yet, there is no inversion in the relative power of
the methods: A/B testing remains better, but the <span class="caps">MAB</span> strategy catches up faster.</p>
</div>
<div class="section" id="summary-what-is-the-mab-strategy-good-at">
<h2 id="summary-what-is-the-mab-strategy-good-at">Summary: what is the <span class="caps">MAB</span> strategy good at?</h2>
<p>In practice, saying that this particular implementation of the
multi-armed bandit strategy has been developed to maximise the number of
clicks appears misleading. This is <strong>not</strong> what it does. What it was
actually designed for (and does very well) is keeping that total number
of clicks very high for larger sample sizes. See the difference?</p>
<p>It is even possible to use the <span class="caps">MAB</span> strategy in a set-and-forget mode and
yet get acceptable results: no initial calculations, basically nothing
to do, a real no-brainer. However, this comes at a price: that mode
yields worse results than both two-stepped strategies. Again, this is a trade-off.</p>
<p><em>Is that all? Then why would the original blogger present the <span class="caps">MAB</span>
strategy as something that always beats A/B testing?</em> Simple: a fair
amount of zealotry coupled with overconfidence and shallow knowledge.
<em>Really?!</em> Well, most likely yes, but that’s not all: there is not just
one single way to design an A/B campaign.</p>
<p>Depending on the context, the sample size calculated for
an A/B test can vary greatly and can fall pretty far from the
optimum from the maximum-reward point of view. The reason is
that gathering the maximal number of clicks on average (i.e. over
multiple campaigns) is not the same as providing a correct answer with a
particular “level of confidence”.</p>
<p>Just one example: assume that you defined the sample size so that you
have a <span class="math">\(95\%\)</span> chance to correctly identify a <span class="math">\(1\%\)</span>
difference. As it turns out, the actual difference is
<span class="math">\(5\%\)</span>.</p>
<p>From the test perspective, this is good: you are even more confident
that the variant you found to be the best is indeed the best. Click-wise,
this is another story: you spent more time testing at the expense of the
total number of clicks. This is the huge drawback of A/B testing: it was
not designed to yield a high click-through rate over a wide range of
sample sizes, so even a few hundred extra trials can lead to a dramatic
decrease in the total number of clicks.</p>
<p>In that specific way, the <span class="caps">MAB</span> strategy does behave better in general,
provided that the population size is large enough. This is just a
different trade-off.</p>
<p><em>So, in practice, does that mean that using the <span class="caps">MAB</span> strategy is all
right?</em> Most likely, yes, though there are other drawbacks to consider:
the next (and last) post of this series is dedicated to describing
a few of them.</p>
</div>
</div>
</article>
<div id="ending_message">
<p>© Gaël. Built using <a href="http://getpelican.com" target="_blank">Pelican</a>. Theme by Giulio Fidente on <a href="https://github.com/gfidente/pelican-svbhack" target="_blank">github</a>. </p>
</div>
</main>
</body>
</html>