Reinforcement Learning: Multi-Armed Bandits

Explore the exploration-exploitation dilemma and understand ε-greedy and UCB algorithms in Multi-Armed Bandits.


Imagine you walk into a dazzling casino with 1,000 shiny gold coins jingling in your pocket. In front of you stand four slot machines, and each machine looks identical.

The rules are simple:

- Each pull costs 1 coin.
- If the machine pays out, you get 3 coins back; otherwise, you get nothing.

Your goal is straightforward but tricky:

Maximize the total coins you have after 1,000 plays.

For a moment, let’s break the suspense.

Imagine that somehow, you secretly know the success probabilities of all four machines:

| Machine | Success Probability |
| --- | --- |
| Bandit 1 | 25% |
| Bandit 2 | 40% |
| Bandit 3 | 70% |
| Bandit 4 | 90% |

Each pull costs 1 coin, and when you win, you get 3 coins back (so your net gain is +2 coins; otherwise, it's –1 coin).

If the success rate is 90%, then out of 1,000 plays you'd win about 900 times and lose about 100 times.

Each win gives you 3 coins, so from 900 wins, you earn:

$$900 \times 3 = 2700 \text{ coins back.}$$

But remember, every play costs you 1 coin, so you’ve spent:

$$1000 \times 1 = 1000 \text{ coins.}$$

$$\text{Total coins after 1000 plays} = 2700 - 1000 = 1700.$$

That means if you played the 90% machine a thousand times, you'd end up with 1,700 coins of profit on average.
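As a sanity check, we can simulate this arithmetic. The snippet below is a quick sketch (the 1-coin cost and 3-coin payout come from the setup above) that plays the 90% machine 1,000 times:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

COST, PAYOUT = 1, 3        # each pull costs 1 coin; a win pays 3 coins back
PLAYS, P_WIN = 1000, 0.90  # the 90% machine, played 1,000 times

profit = 0
for _ in range(PLAYS):
    profit -= COST                  # every play costs a coin
    if random.random() < P_WIN:     # win with probability 0.9
        profit += PAYOUT
print(profit)  # close to the expected 1700
```

Any single run fluctuates around 1,700, which is exactly the average profit we computed by hand.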

But in Real Life, You Don’t Know the Probabilities

Everything we did so far was easy because we knew the success rates of each machine. But walk back into that casino for real, and the numbers are gone. No labels. No hints. Just four identical bandits waiting to eat your coins.

You still have 1,000 coins in your hand.

You pull one lever, you win! You pull again, you lose.

Try another machine, win twice, then lose five times.

Slowly, you start to form a feeling:

Hmm… this one seems luckier than that one.

That’s exactly what happens in real life when facing uncertainty.

You don't know the true odds, so you must learn by trying. Each pull gives you one small clue.

But here is the dilemma every gambler (and every learning algorithm) faces: should you keep playing the machine that has paid off best so far, or spend coins trying the others in case one of them is even better?

We'll dive deeper into exploration vs. exploitation soon, but before that, let's translate our casino game into simple mathematical language.

This will help us reason clearly about what’s happening behind the flashing lights and spinning reels.

Mathematical Framing: Actions and Expected Rewards

Let’s look at our casino through the eyes of a learner making decisions over time.

At each time step $t = 1, 2, 3, \dots, T$:

- You choose an action $A_t$ (which machine to play).
- You receive a reward $R_t$ (the coins you get back).

The true value of an action is its expected reward:

$$q_*(a) = \mathbb{E}[R_t \mid A_t = a]$$

If you knew the value of each action, it would be trivial to solve the $k$-armed bandit problem: you would always select the action with the highest value (as we saw in our earlier example with the 90% bandit). But here you do not know the action values with certainty, although you may have estimates. We denote the estimated value of action $a$ at time step $t$ as $Q_t(a)$, and we would like $Q_t(a)$ to be close to $q_*(a)$.

If you maintain estimates of each action’s value, you can update them incrementally after every play, using the observed rewards as feedback.

Suppose you have selected action $a$ several times. Let $N_t(a)$ be the number of times action $a$ has been chosen up to time $t$. Then the simplest way to estimate its value is by taking the sample average of all rewards you've received from that action:

$$Q_t(a) = \frac{\sum_{i=1}^{t} R_i \cdot \mathbb{1}_{\{A_i = a\}}}{N_t(a)}$$

where $\mathbb{1}_{\{A_i = a\}}$ is an indicator function that equals 1 if you chose action $a$ at time $i$, and 0 otherwise.

Every time you pull a lever, you refine your belief about its value. The more you play one bandit, the more accurate your estimate $Q_t(a)$ becomes, but the less time you spend testing other machines.
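To make the sample average concrete, here is a small sketch (the pull history is made up for illustration) that computes $Q_t(a)$ from a log of (action, reward) pairs:

```python
from collections import defaultdict

# Hypothetical log of (action, reward) pairs from a handful of pulls
history = [(2, 3), (2, 0), (4, 3), (4, 3), (4, 0)]

totals = defaultdict(float)  # running sum of rewards per action
counts = defaultdict(int)    # N_t(a): how often each action was chosen

for a, r in history:
    totals[a] += r
    counts[a] += 1

# Sample-average estimate Q_t(a) for every action seen so far
Q = {a: totals[a] / counts[a] for a in counts}
print(Q)  # {2: 1.5, 4: 2.0}
```

Action 2 paid 3 then 0, giving an estimate of 1.5; action 4 paid 3, 3, 0, giving 2.0.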

This creates a tension:

- Exploit: keep playing the machine with the highest estimated value so far.
- Explore: try the other machines to improve your estimates of their values.

This is known as the exploration–exploitation dilemma, the heart of the multi-armed bandit problem.

If we always choose the action with the highest $Q_t(a)$, we might get stuck with a suboptimal machine just because it got lucky early on. If we always explore randomly, we'll learn everything but waste too many coins.

We need a balance, and that's where the ε-greedy approach comes in.

The Idea

At every time step $t$:

- With probability $1 - \varepsilon$, exploit: choose the action with the highest estimated value $Q_t(a)$.
- With probability $\varepsilon$, explore: choose an action uniformly at random.

Updating Without Storing Everything

We know that the value of an action can be estimated as the average of all rewards received from it. Writing $Q_{n-1}$ for the estimate after $n-1$ rewards:

$$Q_{n-1} = \frac{R_1 + R_2 + \dots + R_{n-1}}{n-1}$$

This is simple, but not efficient: you'd have to store every past reward and recompute the entire average after each pull.

That’s both memory-heavy and slow.

Instead, we can update the average incrementally using only the last estimate and the new reward.

Let’s separate the last term:

$$Q_{n} = \frac{(R_1 + R_2 + \dots + R_{n-1}) + R_n}{n}$$

The first part, $(R_1 + R_2 + \dots + R_{n-1})$, is just $(n-1)Q_{n-1}$, because $Q_{n-1}$ is the average of the first $n-1$ rewards.

So:

$$Q_{n} = \frac{(n-1)Q_{n-1} + R_n}{n}$$

Now rearrange this equation:

$$Q_n = Q_{n-1} + \frac{1}{n}(R_n - Q_{n-1})$$

And that’s the incremental update rule we were looking for!

So each new estimate is the old one plus a small correction based on how “surprising” the new result was.
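We can check that the incremental rule reproduces the full average without storing the reward history. A minimal sketch with a made-up reward sequence:

```python
rewards = [3, 0, 3, 0]  # made-up reward sequence for one action

Q, n = 0.0, 0
for r in rewards:
    n += 1
    Q += (r - Q) / n  # Q_n = Q_{n-1} + (1/n)(R_n - Q_{n-1})

print(Q)                            # 1.5
print(sum(rewards) / len(rewards))  # 1.5 — same value, computed the slow way
```

Only two numbers are kept in memory ($Q$ and $n$), yet the result matches the batch average exactly.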

General Form (Learning Template)

This rule fits a general pattern used throughout reinforcement learning:

$$\text{NewEstimate} = \text{OldEstimate} + \text{StepSize} \times (\text{Target} - \text{OldEstimate})$$

This pattern shows up again later in temporal-difference learning, Q-learning, and policy gradients, all of which are variations of this same core idea.

A Simple Bandit Algorithm (ε-Greedy with Incremental Update)

Below is the complete structure of a simple ε-greedy bandit algorithm. It balances exploration and exploitation while keeping computation efficient.

Algorithm

Initialize, for each action $a = 1, 2, \dots, k$:

$$Q(a) \leftarrow 0 \quad \text{(estimated value of action } a\text{)}$$
$$N(a) \leftarrow 0 \quad \text{(number of times action } a \text{ has been chosen)}$$

Loop forever (or for $T$ steps):

1. **Action selection**

   With probability $1 - \varepsilon$:
   $$A_t \leftarrow \arg\max_a Q(a) \quad \text{(greedy action: exploit)}$$
   With probability $\varepsilon$:
   $$A_t \leftarrow \text{a random action} \quad \text{(explore)}$$

2. **Take action and observe reward**
   $$R_t \leftarrow \text{bandit}(A_t)$$

3. **Update counts**
   $$N(A_t) \leftarrow N(A_t) + 1$$

4. **Incremental update of estimated value**
   $$Q(A_t) \leftarrow Q(A_t) + \frac{1}{N(A_t)} \big( R_t - Q(A_t) \big)$$
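The steps above can be sketched in Python. The `bandit` simulator and the parameter choices (ε = 0.1, T = 1000, seed 0) are illustrative assumptions, reusing the four win probabilities from our casino:

```python
import random

def bandit(a, p=(0.25, 0.40, 0.70, 0.90)):
    """Simulated slot machine: pays 3 coins with probability p[a], else 0."""
    return 3 if random.random() < p[a] else 0

def epsilon_greedy(k=4, T=1000, eps=0.1, seed=0):
    random.seed(seed)
    Q = [0.0] * k  # Q(a): estimated value of each action
    N = [0] * k    # N(a): times each action has been chosen
    for _ in range(T):
        if random.random() < eps:
            a = random.randrange(k)                # explore: random action
        else:
            a = max(range(k), key=lambda i: Q[i])  # exploit: greedy action
        r = bandit(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                  # incremental update
    return Q, N

Q, N = epsilon_greedy()
print(Q)  # estimates should drift toward [0.75, 1.20, 2.10, 2.70]
print(N)  # the 90% machine should get the bulk of the pulls
```

After 1,000 steps the estimates approach the true expected rewards, and most pulls concentrate on the best machine.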

Let's take the same four bandits again, each with its true win probability $p_a$ and true expected reward $q_*(a) = 3p_a$:

| Bandit | True Win Probability $p_a$ | True Expected Reward $q_*(a) = 3p_a$ |
| --- | --- | --- |
| 1 | 0.25 | 0.75 |
| 2 | 0.40 | 1.20 |
| 3 | 0.70 | 2.10 |
| 4 | 0.90 | 2.70 |

We'll run an ε-greedy algorithm for a few rounds, starting with $Q(a) = 0$ and $N(a) = 0$ for every action.

Iterations:

| t | Mode | Chosen $A_t$ | Reward $R_t$ | Updated $Q(A_t)$ | Argmax after update (for next exploit) |
| --- | --- | --- | --- | --- | --- |
| 1 | Explore | 2 | 3 | Q₂ = 3.00 | Q₂ |
| 2 | Exploit | 2 | 0 | Q₂ = (3 + 0)/2 = 1.5 | Q₂ (1.5) |
| 3 | Explore | 4 | 3 | Q₄ = 3.00 | Q₄ (3.00) |
| 4 | Exploit | 4 | 3 | Q₄ = (3 + 3)/2 = 3.00 | Q₄ (3.00) |
| 5 | Exploit | 4 | 0 | Q₄ = (3×2 + 0)/3 = 2 | Q₄ (2) vs Q₂ (1.5) → Q₄ |
| 6 | Exploit | 4 | 3 | Q₄ = (2×3 + 3)/4 = 2.25 | Q₄ (2.25) |
| 7 | Explore | 3 | 3 | Q₃ = 3.00 | Q₃ (3.00) |
| 8 | Exploit | 3 | 0 | Q₃ = (3 + 0)/2 = 1.5 | Q₄ (2.25) |
| 9 | Exploit | 4 | 3 | Q₄ = (2.25×4 + 3)/5 = 2.4 | Q₄ (2.4) |
| 10 | Exploit | 4 | 3 | Q₄ = (2.4×5 + 3)/6 ≈ 2.5 | Q₄ (≈2.5) |

The Problem with ε-Greedy

Our ε-greedy strategy works well: it keeps exploring occasionally and mostly exploits what seems best. But there's still a subtle problem hiding in plain sight. When ε-greedy explores, it chooses completely random actions. That means it might keep wasting plays on bandits we already know are bad, just because exploration demands randomness.

It's like walking into the casino knowing Machine 1 almost never pays out and thinking: "I know this slot machine is broken… but let's waste a coin anyway, because rules are rules." 😅

We can do better. What if we explore smartly, focusing on machines we’re less sure about rather than just picking randomly?

That’s where the Upper Confidence Bound (UCB) approach comes in.

Think of every bandit as having two qualities:

  1. Its average reward so far: how well it’s performed based on what you’ve seen.
  2. Your uncertainty about it: how sure you are about that average.

Some machines you've played hundreds of times; you're confident in their averages.

Others you've barely touched: maybe they're bad, but maybe they're secretly amazing. You just don't know yet.

At each step $t$, we assign each machine (action) a score that combines:

- its estimated value $Q_t(a)$ (how good it has looked so far), and
- an uncertainty bonus (how little we know about it).

Formally, we compute for each action $a$:

$$UCB_t(a) = Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}}$$

and pick the action with the highest UCB value:

$$A_t = \arg\max_a \, UCB_t(a)$$

where:

- $Q_t(a)$ is the current estimated value of action $a$,
- $N_t(a)$ is the number of times action $a$ has been chosen so far,
- $t$ is the current time step, and
- $c > 0$ is a constant controlling how strongly we favor uncertain actions.

The second term,

$$c \sqrt{\frac{\ln t}{N_t(a)}},$$

acts as an uncertainty bonus, a mathematical cushion that rewards exploration when you don't have enough information.

The uncertainty bonus is large when an action has rarely been tried, that is, when $N_t(a)$ is small. Meanwhile, $\ln t$ grows slowly as total time $t$ increases, which ensures that even late in the game, unexplored machines still get some attention.

As $N_t(a)$ grows, the denominator increases, making the bonus shrink. Eventually, well-explored actions stop getting extra attention. So UCB balances exploration and exploitation dynamically: the algorithm keeps revisiting uncertain actions until their estimates stabilize.
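A quick numeric sketch (with illustrative values $c = 2$ and $t = 100$) shows how the bonus shrinks as an action gets pulled more often:

```python
import math

c, t = 2.0, 100  # illustrative exploration constant and current time step

for n in (1, 5, 25, 100):  # N_t(a): how often the action has been tried
    bonus = c * math.sqrt(math.log(t) / n)
    print(f"N_t(a) = {n:3d}  ->  bonus = {bonus:.2f}")
```

The bonus drops from about 4.29 at $N_t(a) = 1$ to about 0.43 at $N_t(a) = 100$, so a rarely tried machine can temporarily outscore a well-explored one even if its average looks mediocre.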

Algorithm

Initialize, for each action $a = 1, 2, \dots, k$:

$$Q(a) \leftarrow 0 \quad \text{(estimated value of action } a\text{)}$$
$$N(a) \leftarrow 0 \quad \text{(number of times action } a \text{ has been chosen)}$$

Set the exploration constant $c > 0$ (controls how strongly we favor uncertain actions).

Loop forever (or for $T$ steps):

1. **Action selection**

   If any action has never been tried yet ($N(a) = 0$), play it once to initialize estimates.

   Otherwise, for each action $a$, compute its upper confidence bound:
   $$UCB_t(a) = Q(a) + c \sqrt{\frac{\ln t}{N(a)}}$$
   Then select the action with the highest bound:
   $$A_t = \arg\max_a UCB_t(a)$$

2. **Take action and observe reward**
   $$R_t \leftarrow \text{bandit}(A_t)$$

3. **Update counts**
   $$N(A_t) \leftarrow N(A_t) + 1$$

4. **Incremental update of estimated value**
   $$Q(A_t) \leftarrow Q(A_t) + \frac{1}{N(A_t)} \big(R_t - Q(A_t)\big)$$
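The UCB pseudocode can be sketched in Python as follows. The `bandit` simulator and the choices $c = 2$, $T = 1000$, seed 0 are illustrative assumptions, reusing the four win probabilities from earlier:

```python
import math
import random

def bandit(a, p=(0.25, 0.40, 0.70, 0.90)):
    """Simulated slot machine: pays 3 coins with probability p[a], else 0."""
    return 3 if random.random() < p[a] else 0

def ucb(k=4, T=1000, c=2.0, seed=0):
    random.seed(seed)
    Q = [0.0] * k  # Q(a): estimated value of each action
    N = [0] * k    # N(a): times each action has been chosen
    for t in range(1, T + 1):
        untried = [a for a in range(k) if N[a] == 0]
        if untried:
            a = untried[0]  # play each arm once to initialize estimates
        else:
            a = max(range(k),
                    key=lambda i: Q[i] + c * math.sqrt(math.log(t) / N[i]))
        r = bandit(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]  # incremental update
    return Q, N

Q, N = ucb()
print(N)  # pull counts concentrate on the best arm as estimates stabilize
```

Unlike ε-greedy, there is no random exploration here: the uncertainty bonus alone decides when a less-played arm deserves another look.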

How Do We Measure “Good” Learning?

In bandit problems, we often measure performance by comparing the reward we actually earned with what we could have earned by always playing the best machine from the start. This difference is called regret.

Formally:

$$\text{Regret}(T) = T \, q_*(a^*) - \sum_{t=1}^{T} q_*(A_t)$$

where:

- $a^*$ is the optimal action (the one with the highest true value),
- $q_*(a^*)$ is the expected reward of always playing the best machine, and
- $A_t$ is the action actually chosen at step $t$.

So regret quantifies how much reward you lost because you had to learn.
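As a concrete (if tiny) example, we can compute the regret of the ten-play ε-greedy walk-through above against the true values $q_*(a)$, which the algorithm itself never sees:

```python
# True expected rewards q_*(a) for the four bandits (q_*(a) = 3 * p_a)
q_star = {1: 0.75, 2: 1.20, 3: 2.10, 4: 2.70}

# The ten actions chosen in the eps-greedy walk-through table above
actions = [2, 2, 4, 4, 4, 4, 3, 3, 4, 4]

T = len(actions)
best = max(q_star.values())  # q_*(a*): value of the optimal arm, 2.70
regret = T * best - sum(q_star[a] for a in actions)
print(round(regret, 2))  # 4.2
```

Ten plays of the best arm would be worth 27 coins in expectation; the chosen sequence is worth 22.8, so the four off-optimal pulls cost 4.2 coins of expected reward.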

Mathematically, it's proven that UCB achieves logarithmic regret. That means as the number of plays $T$ increases, total regret grows only on the order of $\ln T$, so the average regret per play shrinks toward zero.

ε-greedy, on the other hand, typically suffers higher regret: with a fixed ε it keeps exploring blindly forever, wasting some plays even when it already knows enough.

Why Multi-Armed Bandits Matter

Though we began in a casino, the same principle drives decisions everywhere:

| Domain | Bandit Analogy |
| --- | --- |
| Online advertising | Choosing which ad to show (each ad = one bandit). |
| News or product recommendation | Picking which article or product to display to maximize clicks or sales. |
| A/B testing | Trying different versions of a webpage to see which converts better. |
| Clinical trials | Allocating patients to promising treatments while still testing others. |

Bandits allow learning systems to adapt in real time: not just gather data, but act optimally while learning. The beauty of the Multi-Armed Bandit problem lies in its simplicity. It teaches us a universal lesson:

To succeed in an uncertain world, we must balance curiosity and confidence. Explore enough to learn, but not so much that you forget to win.

Conclusion

The multi-armed bandit problem may have started as a simple casino story, but it captures one of the deepest truths in learning and decision-making: the trade-off between exploring the unknown and exploiting what works. From slot machines to search engines, every intelligent system faces this tension. Algorithms like ε-greedy and UCB show us different philosophies for balancing that trade-off, from randomness to optimism.

In the end, mastering bandits isn't just about maximizing coins; it's about learning how to make better choices when the world hides its probabilities.