So many options!

You're at a casino, standing in front of a row of 10 slot machines. You have $100 to spend; each machine costs $1 to play and pays out the same amount when it hits, but each one hits with its own unknown probability. How do you decide which machines to play to get the most money back? There's a line of angry seniors behind you waiting impatiently for the answer, so choose wisely.

Do you play, say, a random 5 machines 4 times each, and stick with whichever pays off the most until you run out of money? Or maybe you play all 10 machines 5 times each? Remember, every play costs you $1. Who knows - maybe you'll get lucky on a crappy machine a couple of times and erroneously crown it the best, only for it to burn you in the long run. Then you have to start the search all over again.

Balancing the search for the best machine against cashing in on the one that looks best so far is known as exploration vs. exploitation - a fundamental problem in a subfield of machine learning called reinforcement learning. The slot machine setup, known as the Multi-Armed Bandit (MAB) problem, comes up all the time: testing multiple user experiences, clinical trials, game theory, and plenty of other decision-making problems.

Recently I worked on a project with some fellow students in which we outlined a number of approaches to solving this problem. In this post, I'd like to talk about one of them, known as Upper Confidence Bounds (the equally rewarding UCB). This approach, preferred by computer science types, thinks in terms of the worst case: it puts an upper bound on our losses. Worst-case scenarios are nice to know, but we can all agree they aren't very accurate estimates of reality. So while other approaches to solving MAB may be more accurate in practice, knowing the worst case (with high confidence) is valuable too.

Essentially, we want to figure out how many times we should play each machine such that we are confident we know its true payoff probability. Once we know that, we can just stick to the best machine.

Clearly there's a cost to exploring all the machines. In the literature, this cost is known as regret - every time I play a machine that isn't the best one, I'm effectively incurring an opportunity cost. At every step my regret either stays flat (because I picked the best machine) or increases. The idea is to construct an algorithm that minimizes the cumulative regret after a certain number of plays.
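In symbols, one common way to write it: if $\mu^*$ is the payoff probability of the best machine and $\mu_{a_t}$ is the payoff probability of the machine actually played at step $t$, then the cumulative regret after $T$ plays is

$$R_T = \sum_{t=1}^{T} \left( \mu^* - \mu_{a_t} \right)$$

which stays flat whenever $a_t$ happens to be the best machine and ticks up otherwise.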

In reality the person playing the game doesn't know which machine is best (or else they'd pick it every time), so regret is a theoretical cost, but it's a convenient metric for comparing algorithms.

The idea behind the UCB family of algorithms is:

  • Every time you play a machine, you observe an outcome and get closer to knowing its true payoff probability (like flipping a coin over and over to figure out whether it's fair or biased)
  • As you play, your uncertainty in the estimate goes down - hopefully fast enough that you still have money left over to play the best machine and recoup some losses

Pretty intuitive, right? And it turns out we can use something called a concentration inequality to figure out how quickly our uncertainty in the estimate goes down. As the uncertainty decreases (and confidence increases), we're more likely to play the best machine. And playing the best machine means our cumulative regret grows much more slowly, maybe even flattens out. In fact, the growth of regret is logarithmic in the number of plays of the game. This is a great quality for an algorithm to have.
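For the curious, the standard tool here is Hoeffding's inequality. If $\bar{x}_i$ is the average payoff I've observed from machine $i$ after $n_i$ plays and $\mu_i$ is its true payoff probability, then for payoffs bounded between 0 and 1:

$$P\left( \bar{x}_i \ge \mu_i + \varepsilon \right) \le e^{-2 n_i \varepsilon^2}$$

Setting $\varepsilon = \sqrt{2 \ln t / n_i}$, where $t$ is the total number of plays so far, drives that probability down to at most $t^{-4}$ - and that $\varepsilon$ is exactly the uncertainty term that shows up in the formula below.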

The formula which expresses this relationship is:

$$\underset{i}{\operatorname{argmax}} \left( \bar{x}_i + \sqrt{\frac{2 \ln t}{n_i}} \right)$$

Let's quickly break it down. The first term, $\bar{x}_i$, is the average winnings for machine $i$. For example, if I play machine 2, 30 times and win 10 times, $\bar{x}_2 = 10/30 = 1/3$.

The second term is the uncertainty.
  • $t$ is the total number of plays so far (remember, $1 per play)
  • $n_i$ is the number of plays for machine $i$
  • If you have $K$ machines, then $t = n_1 + n_2 + \dots + n_K$
The derivation and proof can get hairy, but what's important to note is that as I play certain machines more and more, the denominator gets larger than the numerator because at some point:

$$n_i > 2 \ln t$$

That makes my uncertainty smaller and smaller the more I play. That's a good thing! Finally, argmax means "play the machine which maximizes this expression", where the specific machine is the argument.
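As a sanity check, here's the formula translated directly into a few lines of Python (a minimal sketch of my own - the machine statistics are made-up numbers, not data from the post):

```python
import math

def ucb_score(avg_payoff, n_i, t):
    """Average winnings plus the uncertainty bonus from the formula above."""
    return avg_payoff + math.sqrt(2 * math.log(t) / n_i)

# Made-up example: (average payoff, number of plays) for three machines
stats = [(0.10, 40), (0.33, 30), (0.20, 30)]
t = sum(n for _, n in stats)  # total plays so far

# argmax: play the machine whose upper confidence bound is highest
best = max(range(len(stats)), key=lambda i: ucb_score(*stats[i], t))
print("play machine", best + 1)
```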

To summarize so far:

  • Assuming there is a best machine, we want to find it
  • The more we play, the closer we are to finding the best
  • Once we find it, we play it forever

The next step is implementation, which happens to be easy since the hard part is proving that the bounds are theoretically guaranteed - something we skipped because, well, math can be hard and scary.

We just have to keep track of the average payout for each machine, plug it into the little formula, and it will tell us which machine to play next (that's the argmax part). Let's see an example.

Imagine I have 5 slot machines, each paying $1 with probabilities: 5%, 20%, 25%, 10%, 30%. Of course, I don't know this in real life but the casino does.
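Before answering, here's a small simulation along those lines - a sketch of how you could reproduce the kinds of numbers below, not the exact code behind the original charts. The payoff probabilities are the ones above; everything else (the random seed, playing each machine once to warm up) is my own choice:

```python
import math
import random

random.seed(42)

probs = [0.05, 0.20, 0.25, 0.10, 0.30]  # true payoff probabilities (unknown to the player)
T = 1000                                # total plays, $1 each
K = len(probs)

plays = [0] * K       # n_i: how many times each machine has been played
wins = [0.0] * K      # total winnings from each machine
regret = 0.0
earnings = 0.0

for t in range(1, T + 1):
    if t <= K:
        # Warm-up: play each machine once so every n_i > 0
        i = t - 1
    else:
        # UCB: average payoff plus uncertainty bonus, pick the argmax
        i = max(range(K), key=lambda j: wins[j] / plays[j] + math.sqrt(2 * math.log(t) / plays[j]))

    payout = 1.0 if random.random() < probs[i] else 0.0
    plays[i] += 1
    wins[i] += payout
    earnings += payout
    regret += max(probs) - probs[i]  # expected opportunity cost of this play

print("plays per machine:", plays)
print("total regret:", round(regret, 1))
print("total earnings: $", round(earnings))
```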

1) If I play 1000 times, how often should I expect to play each machine?

That makes sense - I play the best machine the most often. And you start to see separation around 400 plays. But even earlier, around 200, you see the worst machines get played even less often.

2) What's my total regret after 1000 plays?

As before, this chart says that by playing machine 5 more often, my regret grows much more slowly. 

3) How much money have we made after 1000 plays?

In a world with perfect information, I'd play machine 5 every time, and make 1000*$1*0.3 = $300. If I follow the UCB algorithm, after 1000 plays, it has me earning, on average, $231. Not bad for an algorithm that protects against the worst case!*

And that's the UCB algorithm. To wrap up:

  • When you need to decide among several competing options, one model for making decisions is called Multi-Armed Bandit
  • There are many applications of this type of model - A/B testing websites, clinical trials, etc...
  • Upper Confidence Bounds (UCB) is one approach to figuring out how to choose which bandit is best
  • There are also many variants of the MAB problem - contextual bandit, adversarial bandit, etc... and each one has a slightly different way of approaching it, but in essence they all try to solve the same exploration vs. exploitation dilemma
  • If you take nothing else from this post, know that, as far as making long term profits go, casinos are for suckers

Cheers!

 

*Note, however, that when I do a similar analysis using something called Thompson Sampling, I get $270. As I said, UCB provides an upper bound.
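For anyone who wants to reproduce that comparison, here's a minimal Thompson Sampling sketch for the same Bernoulli machines. The details (Beta(1, 1) priors, the same 1000-play budget) are assumptions on my part, not the exact analysis behind the $270 figure:

```python
import random

random.seed(42)

probs = [0.05, 0.20, 0.25, 0.10, 0.30]  # same machines as above
T = 1000
wins = [0] * len(probs)     # observed payouts per machine
losses = [0] * len(probs)   # observed misses per machine
earnings = 0

for _ in range(T):
    # Sample a plausible payoff probability for each machine from its Beta posterior,
    # then play the machine whose sample is highest.
    samples = [random.betavariate(1 + wins[i], 1 + losses[i]) for i in range(len(probs))]
    i = samples.index(max(samples))
    if random.random() < probs[i]:
        wins[i] += 1
        earnings += 1
    else:
        losses[i] += 1

print("total earnings: $", earnings)
```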
