Reinforcement Learning: Multi-Armed Bandits
Explore the exploration-exploitation dilemma and understand ε-greedy and UCB algorithms in Multi-Armed Bandits.
Imagine you walk into a dazzling casino with 1,000 shiny gold coins jingling in your pocket. Four slot machines stand before you, each looking identical.
The rules are simple:
- Each pull of a lever costs 1 coin.
- When you pull, you might either win 3 coins back (a net gain of +2 coins) or get nothing (a net loss of −1 coin).
- Every machine has its own hidden probability of giving you that reward.
Your goal is straightforward but tricky:
Maximize the total coins you have after 1,000 plays.
For a moment, let’s break the suspense.
Imagine that somehow, you secretly know the success probabilities of all four machines:
| Machine | Success Probability |
|---|---|
| Bandit 1 | 25% |
| Bandit 2 | 40% |
| Bandit 3 | 70% |
| Bandit 4 | 90% |
Each pull costs 1 coin, and when you win, you get 3 coins back (so your net gain is +2 coins; otherwise, it’s −1 coin).
If the success rate is 90%, then out of 1,000 plays you’d win about 900 times and lose about 100 times.
Each win gives you 3 coins, so from 900 wins, you earn:

900 × 3 = 2,700 coins

But remember, every play costs you 1 coin, so you’ve spent:

1,000 × 1 = 1,000 coins

That means if you played the 90% machine a thousand times, you’d end up with 2,700 − 1,000 = 1,700 coins of profit on average.
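This expected-profit arithmetic is easy to sketch in code (a minimal helper; the function name `expected_profit` is just for this example):

```python
# Expected profit for a machine with known success probability p,
# under this casino's rules: each pull costs 1 coin, a win pays 3.
def expected_profit(p, plays):
    """Average net coins after `plays` pulls on a machine that wins with probability p."""
    coins_earned = plays * p * 3   # expected wins times 3-coin payout
    coins_spent = plays * 1        # every pull costs 1 coin
    return coins_earned - coins_spent

# The 90% machine over 1,000 plays: 900 wins x 3 = 2700 earned, 1000 spent.
print(expected_profit(0.90, 1000))  # -> 1700.0
```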
But in Real Life, You Don’t Know the Probabilities
Everything we did so far was easy because we knew the success rates of each machine. But walk back into that casino for real, and the numbers are gone. No labels. No hints. Just four identical bandits waiting to eat your coins.
You still have 1,000 coins in your hand.
You pull one lever, you win! You pull again, you lose.
Try another machine, win twice, then lose five times.
Slowly, you start to form a feeling:
Hmm… this one seems luckier than that one.
That’s exactly what happens in real life when facing uncertainty.
You don’t know the true odds, so you must learn by trying. Each pull gives you one small clue.
But here is the dilemma every gambler (and every learning algorithm) faces:
- If you stick to the machine that’s been good so far (exploit), you might miss out on discovering an even better one.
Suppose you start cautiously and you try each of the four machines twice just to “see how they feel.”
That’s only 8 total plays, so your results are tiny and full of luck.
| Machine | True Win Rate | Your 2 Test Pulls | Observed Wins |
|---|---|---|---|
| Bandit 1 | 25% | Win, Win | 2 |
| Bandit 2 | 40% | Lose, Lose | 0 |
| Bandit 3 | 70% | Win, Lose | 1 |
| Bandit 4 | 90% | Win, Lose | 1 |

From your limited experience, Bandit 1 looks amazing: 2 wins out of 2 (100%)!
So you decide, “This must be my lucky machine!” and pour all your remaining coins into it, and you end up losing coins (at a 25% win rate, each 1-coin pull returns only 0.75 coins on average, so there is no profit to be made). Choosing the action that currently looks best is called the greedy action. When you select it, we say that you are exploiting your current knowledge.
- If you keep switching machines to test all of them (explore), you’ll learn which one is best, but you might waste coins on bad ones.
You decide to split your 1,000 coins equally among all four machines so you play 250 times on each.
| Machine | Success Probability | Expected Wins (out of 250) | Coins Earned (3 per win) | Net Profit |
|---|---|---|---|---|
| Bandit 1 | 25% | 62.5 | 62.5 × 3 = 187.5 | −62.5 |
| Bandit 2 | 40% | 100 | 100 × 3 = 300 | +50 |
| Bandit 3 | 70% | 175 | 175 × 3 = 525 | +275 |
| Bandit 4 | 90% | 225 | 225 × 3 = 675 | +425 |

Now add them all up: −62.5 + 50 + 275 + 425 = +687.5
So after 1,000 plays, your total profit ≈ +688 coins. If you select one of the nongreedy actions, we say you are exploring, because this lets you improve your estimate of that bandit’s probability.
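The uniform-split outcome can be checked with a few lines of Python (a small sketch; the machine numbering and helper name are just for this example):

```python
# Net profit from splitting 1,000 plays equally: 250 pulls per machine,
# each pull costs 1 coin and a win pays 3 coins.
def expected_profit(p, plays):
    return plays * p * 3 - plays

machines = {1: 0.25, 2: 0.40, 3: 0.70, 4: 0.90}
profits = {m: expected_profit(p, 250) for m, p in machines.items()}
total = sum(profits.values())

print(round(total, 1))  # total profit across all four machines, about 687.5
```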
We’ll dive deeper into exploration vs. exploitation soon,
but before that, let’s translate our casino game into simple mathematical language.
This will help us reason clearly about what’s happening behind the flashing lights and spinning reels.
Mathematical Framing: Actions and Expected Rewards
Let’s look at our casino through the eyes of a learner making decisions over time.
At each time step $t$:

- You choose an action $A_t$: that is, which bandit to play.
- You receive a reward $R_t$: the number of coins you get back from that pull.

The value of an arbitrary action $a$, denoted $q_*(a)$, is the expected reward given that $a$ is selected:

$$q_*(a) = \mathbb{E}[R_t \mid A_t = a]$$
If you knew the value of each action, it would be trivial to solve the $k$-armed bandit problem: you would always select the action with the highest value (as we saw in our earlier example with the 90% bandit). But here you do not know the action values with certainty, although you may have estimates. We denote the estimated value of action $a$ at time step $t$ as $Q_t(a)$. We would like $Q_t(a)$ to be close to $q_*(a)$.
If you maintain estimates of each action’s value, you can update them incrementally after every play, using the observed rewards as feedback.
Suppose you have selected action $a$ several times. Let $N_t(a)$ be the number of times action $a$ has been chosen up to time $t$. Then the simplest way to estimate its value is by taking the sample average of all rewards you’ve received from that action:

$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}$$

where $\mathbb{1}_{A_i = a}$ is an indicator function that equals 1 if you chose action $a$ at time $i$, and 0 otherwise.
Every time you pull a lever, you refine your belief about its value. The more you play one bandit, the more accurate your estimate becomes but the less time you spend testing other machines.
This creates a tension:
- Play the one that seems best so far (exploit), or
- Try others to make sure you’re not missing something better (explore).
This is known as the exploration–exploitation dilemma, the heart of the multi-armed bandit problem.
If we always choose the action with the highest $Q_t(a)$, we might get stuck with a suboptimal machine just because it got lucky early on. If we always explore randomly, we’ll learn everything but waste too many coins.
We need a balance and that’s where the ε-greedy approach comes in.
The Idea
At every time step $t$:

- With probability $1 - \varepsilon$ → choose the best action so far (exploit), i.e. the one with the highest estimated value:

$$A_t = \arg\max_a Q_t(a)$$

- With probability $\varepsilon$ → choose a random action (explore).
Updating Without Storing Everything
We know that the value of an action can be estimated as the average of all rewards received from that action. Writing $Q_{n+1}$ for the estimate after $n$ rewards $R_1, \dots, R_n$:

$$Q_{n+1} = \frac{1}{n} \sum_{i=1}^{n} R_i$$

This is simple, but not efficient: you’d have to store every past reward and recompute the entire average after each pull.
That’s both memory-heavy and slow.
Instead, we can update the average incrementally using only the last estimate and the new reward.
Let’s separate the last term:

$$Q_{n+1} = \frac{1}{n}\left(R_n + \sum_{i=1}^{n-1} R_i\right)$$

The sum inside is just $(n-1)\,Q_n$, because $Q_n$ is the average of the first $n-1$ rewards.

So:

$$Q_{n+1} = \frac{1}{n}\Big(R_n + (n-1)\,Q_n\Big)$$

Now rearrange this equation:

$$Q_{n+1} = Q_n + \frac{1}{n}\big(R_n - Q_n\big)$$

And that’s the incremental update rule we were looking for!
- $Q_n$: the old estimate
- $(R_n - Q_n)$: the error (difference between actual reward and expected reward)
- $\frac{1}{n}$: the step size, determining how much we adjust the estimate
So each new estimate is the old one plus a small correction based on how “surprising” the new result was.
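The incremental rule is easy to sanity-check in code: updating one reward at a time gives the same answer as averaging the whole list (a small sketch; the reward sequence is made up):

```python
# Incremental mean: Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
rewards = [3, 0, 3, 3, 0]   # hypothetical rewards from one bandit

q = 0.0
for n, r in enumerate(rewards, start=1):
    q += (1.0 / n) * (r - q)   # old estimate + step size * error

# The running estimate matches the full sample average (9 / 5 = 1.8),
# without ever storing the whole reward history.
print(q)  # -> 1.8
```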
General Form (Learning Template)
This rule fits a general pattern used throughout reinforcement learning:

$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize} \times \big(\text{Target} - \text{OldEstimate}\big)$$

- StepSize controls how fast we adapt. (Here it’s $\frac{1}{n}$, but it can also be a fixed constant $\alpha$.)
- Target − OldEstimate is the error signal that drives learning.
This pattern shows up again later in temporal-difference learning, Q-learning, and policy gradients, all variations of this same core idea.
A Simple Bandit Algorithm (ε-Greedy with Incremental Update)
Below is the complete structure of a simple ε-greedy bandit algorithm. It balances exploration and exploitation while keeping computation efficient.
Algorithm
Initialize, for each action $a = 1, \dots, k$:

$$Q(a) \leftarrow 0, \qquad N(a) \leftarrow 0$$

Loop forever (or for $T$ steps):

- Action selection: with probability $1 - \varepsilon$, pick the greedy action $A_t = \arg\max_a Q(a)$; with probability $\varepsilon$, pick a random action.
- Take action $A_t$ and observe reward $R_t$.
- Update counts: $N(A_t) \leftarrow N(A_t) + 1$
- Incremental update of estimated value: $Q(A_t) \leftarrow Q(A_t) + \frac{1}{N(A_t)}\big(R_t - Q(A_t)\big)$
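The algorithm translates almost line for line into Python. Below is a minimal simulation sketch; the arm probabilities, 3-coin payout, seed, and the name `epsilon_greedy` are illustrative choices for this example, not a library API:

```python
import random

def epsilon_greedy(probs, steps=1000, eps=0.1, seed=42):
    """Run epsilon-greedy on Bernoulli bandits that pay 3 coins per win."""
    rng = random.Random(seed)          # seeded for reproducibility
    k = len(probs)
    Q = [0.0] * k                      # estimated value per arm
    N = [0] * k                        # pull counts per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(k)                    # explore: random arm
        else:
            a = max(range(k), key=lambda i: Q[i])   # exploit: greedy arm
        r = 3.0 if rng.random() < probs[a] else 0.0  # 3 coins on a win
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                    # incremental update
        total += r
    return Q, N, total

Q, N, total = epsilon_greedy([0.25, 0.40, 0.70, 0.90])
```

With enough steps, `Q` should drift toward the true expected rewards (0.75, 1.20, 2.10, 2.70), and most pulls should land on the last arm.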
Let’s take the same four bandits again, each with its true success probability $p_a$:
| Bandit | True Win Probability $p_a$ | True Expected Reward $q_*(a) = 3p_a$ |
|---|---|---|
| 1 | 0.25 | 0.75 |
| 2 | 0.40 | 1.20 |
| 3 | 0.70 | 2.10 |
| 4 | 0.90 | 2.70 |
We’ll run the ε-greedy algorithm for a few rounds with:

- $\varepsilon = 0.1$ (10% exploration)
- Step size $\frac{1}{n}$ (sample-average update)
- All $Q(a) = 0$ initially
Iterations:
| t | Mode | Chosen $A_t$ | Reward $R_t$ | Updated $Q(A_t)$ | Argmax after update (for next exploit) |
|---|---|---|---|---|---|
| 1 | Explore | 2 | 3 | Q₂ = 3.00 | Q₂ |
| 2 | Exploit | 2 | 0 | Q₂ = (3 + 0)/2 = 1.5 | Q₂ (1.5) |
| 3 | Explore | 4 | 3 | Q₄ = 3.00 | Q₄ (3.00) |
| 4 | Exploit | 4 | 3 | Q₄ = (3 + 3)/2 = 3.00 | Q₄ (3.00) |
| 5 | Exploit | 4 | 0 | Q₄ = (3×2 + 0)/3 = 2 | Q₄ (2) vs Q₂ (1.5) → Q₄ |
| 6 | Exploit | 4 | 3 | Q₄ = (2×3 + 3)/4 = 2.25 | Q₄ (2.25) |
| 7 | Explore | 3 | 3 | Q₃ = 3.00 | Q₃ (3.00) |
| 8 | Exploit | 3 | 0 | Q₃ = (3 + 0)/2 = 1.5 | Q₄ (2.25) |
| 9 | Exploit | 4 | 3 | Q₄ = (2.25×4 + 3)/5 = 2.4 | Q₄ (2.4) |
| 10 | Exploit | 4 | 3 | Q₄ = (2.4×5 + 3)/6 ≈ 2.5 | Q₄ (≈2.5) |
- Q₄ moves steadily toward $q_*(4) = 2.7$. By ~10 steps it’s already ≈2.5, and will keep inching closer with more pulls.
- Occasional exploration (t=1,3,7) checks other arms so we don’t miss a better option.
The Problem with ε-Greedy
Our ε-greedy strategy works well: it keeps exploring occasionally and mostly exploits what seems best. But there’s still a subtle problem hiding in plain sight. When ε-greedy explores, it chooses completely random actions. That means it might keep wasting plays on bandits we already know are bad, just because exploration demands randomness.
It’s like walking into the casino knowing Machine 1 almost never pays out and thinking: “I know this slot machine is broken… but let’s waste a coin anyway, because rules are rules.” 😅
We can do better. What if we explore smartly, focusing on machines we’re less sure about rather than just picking randomly?
That’s where the Upper Confidence Bound (UCB) approach comes in.
Think of every bandit as having two qualities:
- Its average reward so far: how well it’s performed based on what you’ve seen.
- Your uncertainty about it: how sure you are about that average.
Some machines you’ve played hundreds of times; you’re confident in their averages.
Others you’ve barely touched: maybe they’re bad, but maybe they’re secretly amazing. You just don’t know yet.
At each step $t$, we assign each machine (action) a score that combines:
- What we know (its average reward so far), and
- What we don’t know (our uncertainty about that estimate).
Formally, we compute for each action $a$:

$$UCB_t(a) = Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}}$$

and pick the action with the highest UCB value:

$$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$
where:
- $Q_t(a)$: current estimated value of action $a$. This is our current best guess of how good the action is.
- $N_t(a)$: number of times action $a$ has been selected so far
- $t$: total number of steps so far
- $c$: exploration parameter (controls how much confidence bonus we add)
The second term, $c \sqrt{\frac{\ln t}{N_t(a)}}$, acts as an uncertainty bonus: a mathematical cushion that rewards exploration when you don’t have enough information.
The uncertainty bonus is large when:

- $N_t(a)$ is small → the bandit is rarely tried, so we’re uncertain.
- $t$ is large → more total rounds increase the need for confidence control.

$\ln t$ grows slowly as total time increases, which ensures that even late in the game, unexplored machines still get some attention.
As $N_t(a)$ grows, the denominator increases, making the bonus shrink. Eventually, well-explored actions stop getting extra attention. So UCB balances exploration and exploitation dynamically: the algorithm keeps revisiting uncertain actions until their estimates stabilize.
Algorithm
Initialize, for each action $a = 1, \dots, k$:

$$Q(a) \leftarrow 0, \qquad N(a) \leftarrow 0$$

Set the exploration constant $c > 0$ (controls how strongly we favor uncertain actions).

Loop forever (or for $T$ steps):

- Action selection: if any action has never been tried yet ($N(a) = 0$), play it once to initialize estimates. Otherwise, compute each action’s upper confidence bound and select the action with the highest bound:

$$A_t = \arg\max_a \left[ Q(a) + c \sqrt{\frac{\ln t}{N(a)}} \right]$$

- Take action $A_t$ and observe reward $R_t$.
- Update counts: $N(A_t) \leftarrow N(A_t) + 1$
- Incremental update of estimated value: $Q(A_t) \leftarrow Q(A_t) + \frac{1}{N(A_t)}\big(R_t - Q(A_t)\big)$
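As with ε-greedy, this maps directly to a short Python simulation (a minimal sketch on the same four example bandits; the payout, $c$, seed, and the name `ucb` are illustrative choices):

```python
import math
import random

def ucb(probs, steps=1000, c=2.0, seed=42):
    """Run UCB action selection on Bernoulli bandits that pay 3 coins per win."""
    rng = random.Random(seed)
    k = len(probs)
    Q = [0.0] * k
    N = [0] * k
    total = 0.0
    for t in range(1, steps + 1):
        untried = [a for a in range(k) if N[a] == 0]
        if untried:
            a = untried[0]             # play every arm once to initialize
        else:
            a = max(range(k),          # arm with the highest confidence bound
                    key=lambda i: Q[i] + c * math.sqrt(math.log(t) / N[i]))
        r = 3.0 if rng.random() < probs[a] else 0.0
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]      # same incremental update as before
        total += r
    return Q, N, total

Q, N, total = ucb([0.25, 0.40, 0.70, 0.90])
```

Notice that only the action-selection line differs from ε-greedy; the reward model and the incremental value update are identical.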
How Do We Measure “Good” Learning?
In bandit problems, we often measure how much reward we could have earned if we always played the best machine from the start. This difference is called regret.
Formally:

$$\text{Regret}(T) = T \cdot q_*(a^*) \;-\; \mathbb{E}\left[\sum_{t=1}^{T} q_*(A_t)\right]$$

where:

- $T$: total number of plays
- $a^*$: the optimal machine
- $q_*(a^*)$: its true expected reward
- $A_t$: the action chosen at time $t$
So regret quantifies how much reward you lost because you had to learn.
- If regret grows linearly, you’re learning poorly: you keep making bad choices.
- If regret grows logarithmically, you’re learning efficiently: the more you play, the smaller your per-step loss.
Mathematically, it’s proven that UCB achieves logarithmic regret. That means as the number of plays increases:
- Your total missed reward grows slowly.
- Your strategy becomes nearly optimal over time.
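The regret definition is concrete enough to compute by hand for any sequence of choices. A tiny sketch on the four example bandits (the `q_star` values are the $3p_a$ expected rewards from the earlier table; the helper name is hypothetical):

```python
# True expected rewards q_*(a) = 3 * p_a for the four example bandits.
q_star = [0.75, 1.20, 2.10, 2.70]
best = max(q_star)   # expected reward of the optimal arm (Bandit 4)

def regret(actions):
    """Total expected shortfall of a sequence of arm indices vs. the optimal arm."""
    return sum(best - q_star[a] for a in actions)

# Always playing the optimal arm incurs zero regret:
print(regret([3] * 10))   # -> 0.0
# Ten pulls on the worst arm lose (2.70 - 0.75) expected coins each, about 19.5 total:
print(regret([0] * 10))
```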
ε-greedy, on the other hand, typically suffers higher regret because of blind exploration: it wastes some plays even when it already knows enough.
Why Multi-Armed Bandits Matter
Though we began in a casino, the same principle drives decisions everywhere:
| Domain | Bandit Analogy |
|---|---|
| Online advertising | Choosing which ad to show (each ad = one bandit). |
| News or product recommendation | Picking which article or product to display to maximize clicks or sales. |
| A/B testing | Trying different versions of a webpage to see which converts better. |
| Clinical trials | Allocating patients to promising treatments while still testing others. |
Bandits let learning systems adapt in real time: not just gather data, but act optimally while learning. The beauty of the Multi-Armed Bandit problem lies in its simplicity. It teaches us a universal lesson:
To succeed in an uncertain world, we must balance curiosity and confidence. Explore enough to learn, but not so much that you forget to win.
Conclusion
The multi-armed bandit problem may have started as a simple casino story, but it captures one of the deepest truths in learning and decision-making: the trade-off between exploring the unknown and exploiting what works. From slot machines to search engines, every intelligent system faces this tension. Algorithms like ε-greedy and UCB show us different philosophies of balancing that trade-off, from randomness to optimism.
In the end, mastering bandits isn’t just about maximizing coins; it’s about learning how to make better choices when the world hides its probabilities.