Reinforcement Learning: Multi-Armed Bandits
Explore the exploration-exploitation dilemma and understand ε-greedy and UCB algorithms in Multi-Armed Bandits.
Imagine you walk into a dazzling casino with 1,000 shiny gold coins jingling in your pocket. Four slot machines stand before you, each looking identical.
The rules are simple:
- Each pull of a lever costs 1 coin.
- When you pull, you might either win 3 coins back (a net gain of +2 coins) or get nothing (a net loss of −1 coin).
- Every machine has its own hidden probability of giving you that reward.
Your goal is straightforward but tricky:
Maximize the total coins you have after 1,000 plays.
For a moment, let’s break the suspense.
Imagine that somehow, you secretly know the success probabilities of all four machines:
| Machine | Success Probability |
|---|---|
| Bandit 1 | 25% |
| Bandit 2 | 40% |
| Bandit 3 | 70% |
| Bandit 4 | 90% |
Each pull costs 1 coin, and when you win, you get 3 coins back (so your net gain is +2 coins; otherwise, it’s −1 coin).
If the success rate is 90%, then out of 1,000 plays you’d win about 900 times and lose about 100 times.
Each win gives you 3 coins, so from 900 wins, you earn:

900 × 3 = 2,700 coins

But remember, every play costs you 1 coin, so you’ve spent:

1,000 × 1 = 1,000 coins

That means if you played the 90% machine a thousand times, you’d end up with 2,700 − 1,000 = 1,700 coins of profit on average.
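This expected-profit arithmetic is easy to sketch in code (a minimal helper; the function name `expected_profit` is just for this example):

```python
# Expected profit for a machine with known success probability p,
# under this casino's rules: each pull costs 1 coin, a win pays 3.
def expected_profit(p, plays):
    """Average net coins after `plays` pulls on a machine that wins with probability p."""
    coins_earned = plays * p * 3   # expected wins times 3-coin payout
    coins_spent = plays * 1        # every pull costs 1 coin
    return coins_earned - coins_spent

# The 90% machine over 1,000 plays: 900 wins x 3 = 2700 earned, 1000 spent.
print(expected_profit(0.90, 1000))  # -> 1700.0
```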
But in Real Life, You Don’t Know the Probabilities
Everything we did so far was easy because we knew the success rates of each machine. But walk back into that casino for real, and the numbers are gone. No labels. No hints. Just four identical bandits waiting to eat your coins.
You still have 1,000 coins in your hand.
You pull one lever, you win! You pull again, you lose.
Try another machine, win twice, then lose five times.
Slowly, you start to form a feeling:
Hmm… this one seems luckier than that one.
That’s exactly what happens in real life when facing uncertainty.
You don’t know the true odds, so you must learn by trying. Each pull gives you one small clue.
But here is the dilemma every gambler (and every learning algorithm) faces:
- If you stick to the machine that’s been good so far (exploit), you might miss out on discovering an even better one.
Suppose you start cautiously and you try each of the four machines twice just to “see how they feel.”
That’s only 8 total plays, so your results are tiny and full of luck.
| Machine | True Win Rate | Your 2 Test Pulls | Observed Wins |
|---|---|---|---|
| Bandit 1 | 25% | Win, Win | 2 |
| Bandit 2 | 40% | Lose, Lose | 0 |
| Bandit 3 | 70% | Win, Lose | 1 |
| Bandit 4 | 90% | Win, Lose | 1 |

From your limited experience, Bandit 1 looks amazing: 2 wins out of 2 (100%)!
So you decide, “This must be my lucky machine!” and pour all your remaining coins into it, and you end up losing coins (at a 25% win rate, each 1-coin pull returns only 0.75 coins on average, so there is no profit to be made). Choosing the action that currently looks best is called the greedy action. When you select it, we say that you are exploiting your current knowledge.
- If you keep switching machines to test all of them (explore), you’ll learn which one is best, but you might waste coins on bad ones.
You decide to split your 1,000 coins equally among all four machines so you play 250 times on each.
| Machine | Success Probability | Expected Wins (out of 250) | Coins Earned (3 per win) | Net Profit |
|---|---|---|---|---|
| Bandit 1 | 25% | 62.5 | 62.5 × 3 = 187.5 | −62.5 |
| Bandit 2 | 40% | 100 | 100 × 3 = 300 | +50 |
| Bandit 3 | 70% | 175 | 175 × 3 = 525 | +275 |
| Bandit 4 | 90% | 225 | 225 × 3 = 675 | +425 |

Now add them all up: −62.5 + 50 + 275 + 425 = +687.5
So after 1,000 plays, your total profit ≈ +688 coins. If you select one of the nongreedy actions, we say you are exploring, because this lets you improve your estimate of that bandit’s probability.
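The uniform-split outcome can be checked with a few lines of Python (a small sketch; the machine numbering and helper name are just for this example):

```python
# Net profit from splitting 1,000 plays equally: 250 pulls per machine,
# each pull costs 1 coin and a win pays 3 coins.
def expected_profit(p, plays):
    return plays * p * 3 - plays

machines = {1: 0.25, 2: 0.40, 3: 0.70, 4: 0.90}
profits = {m: expected_profit(p, 250) for m, p in machines.items()}
total = sum(profits.values())

print(round(total, 1))  # total profit across all four machines, about 687.5
```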
We’ll dive deeper into exploration vs. exploitation soon,
but before that, let’s translate our casino game into simple mathematical language.
This will help us reason clearly about what’s happening behind the flashing lights and spinning reels.
Mathematical Framing: Actions and Expected Rewards
Let’s look at our casino through the eyes of a learner making decisions over time.
At each time step $t$:

- You choose an action $A_t$: that is, which bandit to play.
- You receive a reward $R_t$: the number of coins you get back from that pull.

The value of an arbitrary action $a$, denoted $q_*(a)$, is the expected reward given that $a$ is selected:

$$q_*(a) = \mathbb{E}[R_t \mid A_t = a]$$
If you knew the value of each action, it would be trivial to solve the $k$-armed bandit problem: you would always select the action with the highest value (as we saw in our earlier example with the 90% bandit). But here you do not know the action values with certainty, although you may have estimates. We denote the estimated value of action $a$ at time step $t$ as $Q_t(a)$. We would like $Q_t(a)$ to be close to $q_*(a)$.
If you maintain estimates of each action’s value, you can update them incrementally after every play, using the observed rewards as feedback.
Suppose you have selected action $a$ several times. Let $N_t(a)$ be the number of times action $a$ has been chosen up to time $t$. Then the simplest way to estimate its value is by taking the sample average of all rewards you’ve received from that action:

$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}$$

where $\mathbb{1}_{A_i = a}$ is an indicator function that equals 1 if you chose action $a$ at time $i$, and 0 otherwise.
Every time you pull a lever, you refine your belief about its value. The more you play one bandit, the more accurate your estimate becomes but the less time you spend testing other machines.
This creates a tension:
- Play the one that seems best so far (exploit), or
- Try others to make sure you’re not missing something better (explore).
This is known as the exploration–exploitation dilemma, the heart of the multi-armed bandit problem.
If we always choose the action with the highest $Q_t(a)$, we might get stuck with a suboptimal machine just because it got lucky early on. If we always explore randomly, we’ll learn everything but waste too many coins.
We need a balance and that’s where the ε-greedy approach comes in.
The Idea
At every time step $t$:

- With probability $1 - \varepsilon$ → choose the best action so far (exploit), i.e. the one with the highest estimated value:

$$A_t = \arg\max_a Q_t(a)$$

- With probability $\varepsilon$ → choose a random action (explore).
Updating Without Storing Everything
We know that the value of an action can be estimated as the average of all rewards received from that action. Writing $Q_{n+1}$ for the estimate after $n$ rewards $R_1, \dots, R_n$:

$$Q_{n+1} = \frac{1}{n} \sum_{i=1}^{n} R_i$$

This is simple, but not efficient: you’d have to store every past reward and recompute the entire average after each pull.
That’s both memory-heavy and slow.
Instead, we can update the average incrementally using only the last estimate and the new reward.
Let’s separate the last term:

$$Q_{n+1} = \frac{1}{n}\left(R_n + \sum_{i=1}^{n-1} R_i\right)$$

The sum inside is just $(n-1)\,Q_n$, because $Q_n$ is the average of the first $n-1$ rewards.

So:

$$Q_{n+1} = \frac{1}{n}\Big(R_n + (n-1)\,Q_n\Big)$$

Now rearrange this equation:

$$Q_{n+1} = Q_n + \frac{1}{n}\big(R_n - Q_n\big)$$

And that’s the incremental update rule we were looking for!
- $Q_n$: the old estimate
- $(R_n - Q_n)$: the error (difference between actual reward and expected reward)
- $\frac{1}{n}$: the step size, determining how much we adjust the estimate
So each new estimate is the old one plus a small correction based on how “surprising” the new result was.
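The incremental rule is easy to sanity-check in code: updating one reward at a time gives the same answer as averaging the whole list (a small sketch; the reward sequence is made up):

```python
# Incremental mean: Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
rewards = [3, 0, 3, 3, 0]   # hypothetical rewards from one bandit

q = 0.0
for n, r in enumerate(rewards, start=1):
    q += (1.0 / n) * (r - q)   # old estimate + step size * error

# The running estimate matches the full sample average (9 / 5 = 1.8),
# without ever storing the whole reward history.
print(q)  # -> 1.8
```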
General Form (Learning Template)
This rule fits a general pattern used throughout reinforcement learning:

$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize} \times \big(\text{Target} - \text{OldEstimate}\big)$$

- StepSize controls how fast we adapt. (Here it’s $\frac{1}{n}$, but it can also be a fixed constant $\alpha$.)
- Target − OldEstimate is the error signal that drives learning.
This pattern shows up again later in temporal-difference learning, Q-learning, and policy gradients, all variations of this same core idea.
A Simple Bandit Algorithm (ε-Greedy with Incremental Update)
Below is the complete structure of a simple ε-greedy bandit algorithm. It balances exploration and exploitation while keeping computation efficient.
Algorithm
Initialize, for each action $a = 1, \dots, k$:

$$Q(a) \leftarrow 0, \qquad N(a) \leftarrow 0$$

Loop forever (or for $T$ steps):

- Action selection: with probability $1 - \varepsilon$, pick the greedy action $A_t = \arg\max_a Q(a)$; with probability $\varepsilon$, pick a random action.
- Take action $A_t$ and observe reward $R_t$.
- Update counts: $N(A_t) \leftarrow N(A_t) + 1$
- Incremental update of estimated value: $Q(A_t) \leftarrow Q(A_t) + \frac{1}{N(A_t)}\big(R_t - Q(A_t)\big)$
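The algorithm translates almost line for line into Python. Below is a minimal simulation sketch; the arm probabilities, 3-coin payout, seed, and the name `epsilon_greedy` are illustrative choices for this example, not a library API:

```python
import random

def epsilon_greedy(probs, steps=1000, eps=0.1, seed=42):
    """Run epsilon-greedy on Bernoulli bandits that pay 3 coins per win."""
    rng = random.Random(seed)          # seeded for reproducibility
    k = len(probs)
    Q = [0.0] * k                      # estimated value per arm
    N = [0] * k                        # pull counts per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(k)                    # explore: random arm
        else:
            a = max(range(k), key=lambda i: Q[i])   # exploit: greedy arm
        r = 3.0 if rng.random() < probs[a] else 0.0  # 3 coins on a win
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                    # incremental update
        total += r
    return Q, N, total

Q, N, total = epsilon_greedy([0.25, 0.40, 0.70, 0.90])
```

With enough steps, `Q` should drift toward the true expected rewards (0.75, 1.20, 2.10, 2.70), and most pulls should land on the last arm.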
Let’s take the same four bandits again, each with its true success probability $p_a$:
| Bandit | True Win Probability $p_a$ | True Expected Reward $q_*(a) = 3p_a$ |
|---|---|---|
| 1 | 0.25 | 0.75 |
| 2 | 0.40 | 1.20 |
| 3 | 0.70 | 2.10 |
| 4 | 0.90 | 2.70 |
We’ll run the ε-greedy algorithm for a few rounds with:

- $\varepsilon = 0.1$ (10% exploration)
- Step size $\frac{1}{n}$ (sample-average update)
- All $Q(a) = 0$ initially
Iterations:
| t | Mode | Chosen $A_t$ | Reward $R_t$ | Updated $Q(A_t)$ | Argmax after update (for next exploit) |
|---|---|---|---|---|---|
| 1 | Explore | 2 | 3 | Q₂ = 3.00 | Q₂ |
| 2 | Exploit | 2 | 0 | Q₂ = (3 + 0)/2 = 1.5 | Q₂ (1.5) |
| 3 | Explore | 4 | 3 | Q₄ = 3.00 | Q₄ (3.00) |
| 4 | Exploit | 4 | 3 | Q₄ = (3 + 3)/2 = 3.00 | Q₄ (3.00) |
| 5 | Exploit | 4 | 0 | Q₄ = (3×2 + 0)/3 = 2 | Q₄ (2) vs Q₂ (1.5) → Q₄ |
| 6 | Exploit | 4 | 3 | Q₄ = (2×3 + 3)/4 = 2.25 | Q₄ (2.25) |
| 7 | Explore | 3 | 3 | Q₃ = 3.00 | Q₃ (3.00) |
| 8 | Exploit | 3 | 0 | Q₃ = (3 + 0)/2 = 1.5 | Q₄ (2.25) |
| 9 | Exploit | 4 | 3 | Q₄ = (2.25×4 + 3)/5 = 2.4 | Q₄ (2.4) |
| 10 | Exploit | 4 | 3 | Q₄ = (2.4×5 + 3)/6 ≈ 2.5 | Q₄ (≈2.5) |
- Q₄ moves steadily toward $q_*(4) = 2.7$. By ~10 steps it’s already ≈2.5, and will keep inching closer with more pulls.
- Occasional exploration (t=1,3,7) checks other arms so we don’t miss a better option.
The Problem with ε-Greedy
Our ε-greedy strategy works well: it keeps exploring occasionally and mostly exploits what seems best. But there’s still a subtle problem hiding in plain sight. When ε-greedy explores, it chooses completely random actions. That means it might keep wasting plays on bandits we already know are bad, just because exploration demands randomness.
It’s like walking into the casino knowing Machine 1 almost never pays out and thinking: “I know this slot machine is broken… but let’s waste a coin anyway, because rules are rules.” 😅
We can do better. What if we explore smartly, focusing on machines we’re less sure about rather than just picking randomly?
That’s where the Upper Confidence Bound (UCB) approach comes in.
Think of every bandit as having two qualities:
- Its average reward so far: how well it’s performed based on what you’ve seen.
- Your uncertainty about it: how sure you are about that average.
Some machines you’ve played hundreds of times; you’re confident in their averages.
Others you’ve barely touched: maybe they’re bad, but maybe they’re secretly amazing. You just don’t know yet.
At each step $t$, we assign each machine (action) a score that combines:
- What we know (its average reward so far), and
- What we don’t know (our uncertainty about that estimate).
Formally, we compute for each action $a$:

$$UCB_t(a) = Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}}$$

and pick the action with the highest UCB value:

$$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$
where:
- $Q_t(a)$: current estimated value of action $a$. This is our current best guess of how good the action is.
- $N_t(a)$: number of times action $a$ has been selected so far
- $t$: total number of steps so far
- $c$: exploration parameter (controls how much confidence bonus we add)
The second term, $c \sqrt{\frac{\ln t}{N_t(a)}}$, acts as an uncertainty bonus: a mathematical cushion that rewards exploration when you don’t have enough information.
The uncertainty bonus is large when:

- $N_t(a)$ is small → the bandit is rarely tried, so we’re uncertain.
- $t$ is large → more total rounds increase the need for confidence control.

$\ln t$ grows slowly as total time increases, which ensures that even late in the game, unexplored machines still get some attention.
As $N_t(a)$ grows, the denominator increases, making the bonus shrink. Eventually, well-explored actions stop getting extra attention. So UCB balances exploration and exploitation dynamically: the algorithm keeps revisiting uncertain actions until their estimates stabilize.
Algorithm
Initialize, for each action $a = 1, \dots, k$:

$$Q(a) \leftarrow 0, \qquad N(a) \leftarrow 0$$

Set the exploration constant $c > 0$ (controls how strongly we favor uncertain actions).

Loop forever (or for $T$ steps):

- Action selection: if any action has never been tried yet ($N(a) = 0$), play it once to initialize estimates. Otherwise, compute each action’s upper confidence bound and select the action with the highest bound:

$$A_t = \arg\max_a \left[ Q(a) + c \sqrt{\frac{\ln t}{N(a)}} \right]$$

- Take action $A_t$ and observe reward $R_t$.
- Update counts: $N(A_t) \leftarrow N(A_t) + 1$
- Incremental update of estimated value: $Q(A_t) \leftarrow Q(A_t) + \frac{1}{N(A_t)}\big(R_t - Q(A_t)\big)$
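As with ε-greedy, this maps directly to a short Python simulation (a minimal sketch on the same four example bandits; the payout, $c$, seed, and the name `ucb` are illustrative choices):

```python
import math
import random

def ucb(probs, steps=1000, c=2.0, seed=42):
    """Run UCB action selection on Bernoulli bandits that pay 3 coins per win."""
    rng = random.Random(seed)
    k = len(probs)
    Q = [0.0] * k
    N = [0] * k
    total = 0.0
    for t in range(1, steps + 1):
        untried = [a for a in range(k) if N[a] == 0]
        if untried:
            a = untried[0]             # play every arm once to initialize
        else:
            a = max(range(k),          # arm with the highest confidence bound
                    key=lambda i: Q[i] + c * math.sqrt(math.log(t) / N[i]))
        r = 3.0 if rng.random() < probs[a] else 0.0
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]      # same incremental update as before
        total += r
    return Q, N, total

Q, N, total = ucb([0.25, 0.40, 0.70, 0.90])
```

Notice that only the action-selection line differs from ε-greedy; the reward model and the incremental value update are identical.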
How Do We Measure “Good” Learning?
In bandit problems, we often measure how much reward we could have earned if we always played the best machine from the start. This difference is called regret.
Formally:

$$\text{Regret}(T) = T \cdot q_*(a^*) \;-\; \mathbb{E}\left[\sum_{t=1}^{T} q_*(A_t)\right]$$

where:

- $T$: total number of plays
- $a^*$: the optimal machine
- $q_*(a^*)$: its true expected reward
- $A_t$: the action chosen at time $t$
So regret quantifies how much reward you lost because you had to learn.
- If regret grows linearly, you’re learning poorly: you keep making bad choices.
- If regret grows logarithmically, you’re learning efficiently: the more you play, the smaller your per-step loss.
Mathematically, it’s proven that UCB achieves logarithmic regret. That means as the number of plays increases:
- Your total missed reward grows slowly.
- Your strategy becomes nearly optimal over time.
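The regret definition is concrete enough to compute by hand for any sequence of choices. A tiny sketch on the four example bandits (the `q_star` values are the $3p_a$ expected rewards from the earlier table; the helper name is hypothetical):

```python
# True expected rewards q_*(a) = 3 * p_a for the four example bandits.
q_star = [0.75, 1.20, 2.10, 2.70]
best = max(q_star)   # expected reward of the optimal arm (Bandit 4)

def regret(actions):
    """Total expected shortfall of a sequence of arm indices vs. the optimal arm."""
    return sum(best - q_star[a] for a in actions)

# Always playing the optimal arm incurs zero regret:
print(regret([3] * 10))   # -> 0.0
# Ten pulls on the worst arm lose (2.70 - 0.75) expected coins each, about 19.5 total:
print(regret([0] * 10))
```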
ε-greedy, on the other hand, typically suffers higher regret because of blind exploration: it wastes some plays even when it already knows enough.
Why Multi-Armed Bandits Matter
Though we began in a casino, the same principle drives decisions everywhere:
| Domain | Bandit Analogy |
|---|---|
| Online advertising | Choosing which ad to show (each ad = one bandit). |
| News or product recommendation | Picking which article or product to display to maximize clicks or sales. |
| A/B testing | Trying different versions of a webpage to see which converts better. |
| Clinical trials | Allocating patients to promising treatments while still testing others. |
Bandits let learning systems adapt in real time: not just gather data, but act optimally while learning. The beauty of the Multi-Armed Bandit problem lies in its simplicity. It teaches us a universal lesson:
To succeed in an uncertain world, we must balance curiosity and confidence. Explore enough to learn, but not so much that you forget to win.
Conclusion
The multi-armed bandit problem may have started as a simple casino story, but it captures one of the deepest truths in learning and decision-making: the trade-off between exploring the unknown and exploiting what works. From slot machines to search engines, every intelligent system faces this tension. Algorithms like ε-greedy and UCB show us different philosophies of balancing that trade-off, from randomness to optimism.
In the end, mastering bandits isn’t just about maximizing coins; it’s about learning how to make better choices when the world hides its probabilities.