Reinforcement Learning: Q Learning

Explore how Q-learning emerges from SARSA by shifting from real actions to optimal ones.

Imagine you are back in Grid City, watching your little drone learn how to navigate the windy streets. By now the drone has learned from full episodes with Monte Carlo methods, and then from immediate corrections with SARSA. If you watch the drone closely during SARSA training, you notice a certain rhythm: in each step, it behaves according to its ε-greedy policy, takes an action, receives a reward, picks the next action from the same ε-greedy policy, and then updates using that next action. That next action is always present in the SARSA update. It is the “A” in the name SARSA.

But something strange begins to happen as the drone gets better. Even when it learns that, say, “RIGHT” is usually an excellent move from state 13, the ε-greedy policy still forces it to occasionally pick “UP,” a clumsy, wind-blown, battery-draining mistake. And because SARSA uses the action the drone actually chooses in its update, the value it learns for state 13 becomes slightly pessimistic. The drone wants to learn how good the best action is, but SARSA keeps whispering, “Don’t forget, sometimes you explore and take the bad action too.”

To understand exactly what this means, it helps to zoom in on a single, simple moment in the drone’s life and slow time almost to a stop.

Imagine state 13 in Grid City again. The drone has two actions that matter: UP and RIGHT. Over many flights, it has learned something like

$$Q(13,\text{RIGHT}) \approx 0.8, \qquad Q(13,\text{UP}) \approx -0.2.$$

So if you just look at the Q-table, the story is very clear: RIGHT is good, UP is bad. If the drone were perfectly greedy, it would choose RIGHT from 13 every single time and sail happily toward the goal.

But the drone is not perfectly greedy. It is ε-greedy. That means that even when it completely “knows” that RIGHT is better, it will still choose UP with some small probability ε. Maybe ε is 0.1. Maybe it is shrinking over time but not yet zero. Either way, every now and then, the drone sighs, shrugs, and takes the bad action just to keep exploring.
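To make the exploration mechanics concrete, here is a minimal Python sketch of ε-greedy selection. The dictionary-backed Q-table, the function name `epsilon_greedy`, and the example values are illustrative assumptions, not code from this chapter.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon pick a uniformly random action (explore);
    otherwise pick the highest-valued action (exploit).
    Q is assumed to be a dict mapping (state, action) -> value."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

# Q-values from state 13: RIGHT is clearly better, but with
# epsilon = 0.1 the drone still sometimes wanders UP anyway.
Q = {(13, "RIGHT"): 0.8, (13, "UP"): -0.2}
action = epsilon_greedy(Q, 13, ["RIGHT", "UP"], 0.1)
```

With `epsilon = 0`, the same function becomes perfectly greedy and always returns RIGHT from state 13.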

Now picture one specific step during SARSA training. The drone is at state 13; it uses ε-greedy and chooses RIGHT this time. It receives some reward $r_t$, lands in a new state $s_{t+1}$, and then uses ε-greedy again to pick the next action $a_{t+1}$. Only after choosing this next action does SARSA update $Q(13,\text{RIGHT})$:

$$Q(13,\text{RIGHT}) \leftarrow Q(13,\text{RIGHT}) + \alpha \Big( r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(13,\text{RIGHT}) \Big).$$

That $Q(s_{t+1}, a_{t+1})$ term is the key. The next action $a_{t+1}$ is sampled from the same ε-greedy policy, which means sometimes it is a great action and sometimes it is a silly exploratory move. Over time, the update for $Q(13,\text{RIGHT})$ averages over all of that behavior. It learns not just “what happens if I take RIGHT and then play optimally,” but rather “what happens if I take RIGHT and then continue following this slightly noisy ε-greedy policy forever.” Mathematically, SARSA is learning the value of the ε-greedy policy itself.
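As a sketch of how that pessimism creeps in, the update above can be written as a small Python function. The dict-based Q-table and the specific reward and value numbers are assumptions chosen for illustration.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One on-policy SARSA step: the target uses whatever action
    the epsilon-greedy policy actually chose in the next state."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q

Q = {(13, "RIGHT"): 0.8, (13, "UP"): -0.2,
     (14, "RIGHT"): 0.9, (14, "UP"): -0.1}
# Suppose exploration happened to pick UP in state 14: the low value of
# that exploratory action drags Q(13, RIGHT) down from 0.8 to 0.711.
sarsa_update(Q, 13, "RIGHT", 0.0, 14, "UP")
```

Every exploratory next action leaks into the estimate this way, which is exactly why SARSA's values reflect the ε-greedy policy rather than the greedy one.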

This is exactly what “on-policy” means in the TD world. The behavior policy (ε-greedy with respect to Q) and the policy being evaluated in the update rule are one and the same. At first, this is comforting. The drone is learning about how it truly behaves, randomness and all. But in some environments, especially ones where exploration is dangerous, it starts to feel limiting.

One afternoon, you imagine a different kind of learner, a more determined drone. This drone still behaves ε-greedily, because exploration is important, but in its heart, in its internal calculations, it no longer cares about the sometimes-stupid action it just chose. Instead, it imagines what it should have done next, not what it did. When it lands in the next state $s_{t+1}$, it peeks at all possible actions and selects the best one, not to execute, but to use in its value update. It dreams of the optimal future even while behaving imperfectly.

To understand exactly what this means, it helps to compare what’s happening inside this new learner’s head with what SARSA was doing.

With SARSA, the update target was always tied to the action actually taken in the next state. After stepping from $s_t$ to $s_{t+1}$ with reward $r_t$, SARSA looked at the next action $a_{t+1}$ sampled from the ε-greedy policy and used

$$r_t + \gamma\, Q(s_{t+1}, a_{t+1})$$

as the “snapshot of the future” for that update.

Our new, more stubborn learner wants to break that dependency. It still uses ε-greedy to act, but when it updates, it quietly asks a different question:

“If I stood in this next state $s_{t+1}$ and behaved perfectly from now on, what is the best I could hope to get?”

In symbols, that means that instead of using $Q(s_{t+1}, a_{t+1})$, it looks at all possible actions in $s_{t+1}$,

$$Q(s_{t+1}, a'_1),\quad Q(s_{t+1}, a'_2),\quad Q(s_{t+1}, a'_3),\ \ldots$$

and takes the largest of them:

$$\max_{a'} Q(s_{t+1}, a')$$

The moment the drone swaps

$$Q(s_{t+1}, a_{t+1}) \quad\text{for}\quad \max_{a'} Q(s_{t+1}, a'),$$

the entire personality of the learning algorithm changes.
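The swap is easiest to see by computing both targets on the very same transition. This side-by-side Python sketch uses an assumed dict-backed Q-table and illustrative values.

```python
def sarsa_target(Q, r, s_next, a_next, gamma=0.9):
    # On-policy: the future is whatever action was actually chosen.
    return r + gamma * Q[(s_next, a_next)]

def q_learning_target(Q, r, s_next, actions, gamma=0.9):
    # Off-policy: the future assumes the best action, chosen or not.
    return r + gamma * max(Q[(s_next, a)] for a in actions)

Q = {(14, "RIGHT"): 0.9, (14, "UP"): -0.1}
# Same transition, but exploration happened to pick UP in state 14:
t_sarsa = sarsa_target(Q, 0.0, 14, "UP")              # 0.9 * -0.1 = -0.09
t_q = q_learning_target(Q, 0.0, 14, ["RIGHT", "UP"])  # 0.9 *  0.9 =  0.81
```

One transition, two interpretations: SARSA's target inherits the exploratory mistake, while the max-based target ignores it entirely.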

It is as if the drone has finally separated its behavior from its beliefs. Its behavior is still messy, still exploratory, still ε-greedy. But its beliefs, its Q-values, now evolve according to a world in which it expects itself to act optimally from the next step onward.

This is a startling transformation. The same real experience is flowing in, the same transitions, the same rewards, the same occasional crashes into walls… but the internal interpretation of that experience is now entirely different.

Once this mental shift happens, the updated formula that emerges is almost inevitable. The drone still wants to correct its old estimate $Q(s_t, a_t)$ using a one-step TD target. But the target is no longer “what happened next according to the ε-greedy policy.” It is “what would happen next if I were perfect from here onward.”

If you plug this idea into the familiar TD update pattern, you get a new target:

$$\text{Target} = r_t + \gamma \max_{a'} Q(s_{t+1}, a')$$

And the update becomes

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Big)$$

This is the defining rule of Q-learning.
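In code, this rule is a one-line change from SARSA's update. This minimal Python sketch assumes a dict-backed Q-table; the function name and numbers are illustrative.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One off-policy Q-learning step: the target maxes over the next
    state's actions, regardless of what the drone will actually do."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q

Q = {(13, "RIGHT"): 0.8, (13, "UP"): -0.2,
     (14, "RIGHT"): 0.9, (14, "UP"): -0.1}
# Even if the drone is about to explore UP from state 14, the update
# dreams of RIGHT: the target is 0 + 0.9 * 0.9 = 0.81, so
# Q(13, RIGHT) rises from 0.8 to 0.801 instead of sinking.
q_learning_update(Q, 13, "RIGHT", 0.0, 14, ["RIGHT", "UP"])
```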

Notice how similar it looks to SARSA’s update. The structure is identical:

$$\text{new} = \text{old} + \alpha\,(\text{target} - \text{old}),$$

but the meaning of the target has changed. Under the hood, this tiny change flips the algorithm from on-policy to off-policy. The behavior policy, the way the drone behaves in Grid City, is still ε-greedy. It still occasionally flails UP from state 13, still sometimes crashes into buildings, still wastes battery in windy loops. But the target that shapes the Q-values is tied to a different policy: the greedy one that always picks $\arg\max_a Q(s,a)$. The drone is living one life and dreaming of another.

$$
\begin{aligned}
&\textbf{Algorithm: TD learner percept (off-policy, Q-learning, local policy update)}\\[4pt]
&\textbf{input: } \text{previous state } s_t,\ \text{previous action } a_t,\ \text{reward } r_t,\ \text{current state } s_{t+1}\\[4pt]
&\textbf{persistent: } Q(s,a)\ \text{current state–action value estimates};\ \pi(s)\ \text{deterministic greedy policy};\\
&\phantom{\textbf{persistent: }} \alpha\ \text{(step-size parameter)};\ \gamma\ \text{(discount factor)};\ A(s)\ \text{set of available actions}\\[4pt]
&1.\quad \delta_t \leftarrow r_t + \gamma \max_{\hat{a} \in A(s_{t+1})} Q(s_{t+1}, \hat{a}) - Q(s_t, a_t)\\
&2.\quad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\, \delta_t\\
&3.\quad \pi(s_t) \leftarrow \arg\max_{\hat{a} \in A(s_t)} Q(s_t, \hat{a})\\
&4.\quad \textbf{return } Q,\ \pi
\end{aligned}
$$

$$
\begin{aligned}
&\textbf{Algorithm: TD learner actuate (exploration–exploitation)}\\[4pt]
&\textbf{input: } \text{state } s\\[4pt]
&\textbf{persistent: } \pi(s)\ \text{(current policy, initially random)};\ \varepsilon\ \text{(exploration probability, initially } 1.0\text{)}\\[4pt]
&1.\quad \beta \leftarrow \text{random number in } [0,1]\\
&2.\quad \textbf{if } \beta < \varepsilon \ \textbf{then}\\
&3.\qquad a \leftarrow \text{random action from } A\\
&4.\quad \textbf{else}\\
&5.\qquad a \leftarrow \pi(s)\\
&6.\quad \textbf{return } a
\end{aligned}
$$

$$
\begin{aligned}
&\textbf{Algorithm: training the TD learner (online Q-learning)}\\[4pt]
&\textbf{input: } \mathcal{M},\ \text{TD learner (percept, actuate)},\ N,\ s_0,\ S_{\text{end}}\\
&\textbf{output: } R,\ \text{ifwin}\\[4pt]
&1.\quad \text{initialize } R(i) \leftarrow 0\ \ \forall i = 1,\dots,N\\
&2.\quad \text{initialize } \text{ifwin}(i) \leftarrow \text{false}\ \ \forall i = 1,\dots,N\\
&3.\quad \textbf{for } i \leftarrow 1 \textbf{ to } N \textbf{ do}\\
&4.\qquad s \leftarrow s_0\\
&5.\qquad \textbf{while } s \notin S_{\text{end}} \textbf{ do}\\
&6.\qquad\quad a \leftarrow \text{actuate}(s)\\
&7.\qquad\quad r,\ s' \leftarrow \text{MDP black box}(s, a)\\
&8.\qquad\quad R(i) \leftarrow R(i) + r\\
&9.\qquad\quad \text{percept}(s, a, r, s')\\
&10.\qquad\hspace{-2pt} s \leftarrow s'\\
&11.\qquad \textbf{if } s \in S_{\text{winning\_end}} \textbf{ then}\\
&12.\qquad\quad \text{ifwin}(i) \leftarrow \text{true}\\
&13.\quad \textbf{return } R,\ \text{ifwin}
\end{aligned}
$$
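The percept, actuate, and training pseudocode can be stitched together into a runnable Python sketch. The `QLearner` class, the toy corridor environment, and all parameter values are my own illustrative assumptions, following the structure above rather than reproducing any canonical implementation.

```python
import random

class QLearner:
    """Off-policy TD learner: actuate explores epsilon-greedily,
    percept applies the Q-learning update and greedifies the policy."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.2):
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.Q = {}   # (state, action) -> value, default 0.0
        self.pi = {}  # state -> current greedy action

    def q(self, s, a):
        return self.Q.get((s, a), 0.0)

    def actuate(self, s):
        # Exploration-exploitation: random action with probability epsilon.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return self.pi.get(s, random.choice(self.actions))

    def percept(self, s, a, r, s_next):
        # Q-learning target: max over the next state's actions.
        target = r + self.gamma * max(self.q(s_next, b) for b in self.actions)
        self.Q[(s, a)] = self.q(s, a) + self.alpha * (target - self.q(s, a))
        # Local policy update: greedify at the visited state.
        self.pi[s] = max(self.actions, key=lambda b: self.q(s, b))

def train(learner, step, s0, terminal, episodes=500):
    """Online training loop: act, observe, learn, move on."""
    returns = []
    for _ in range(episodes):
        s, total = s0, 0.0
        while s not in terminal:
            a = learner.actuate(s)
            r, s_next = step(s, a)
            total += r
            learner.percept(s, a, r, s_next)
            s = s_next
        returns.append(total)
    return returns

# Toy corridor: states 0..4, goal at 4. RIGHT moves forward, LEFT back.
def step(s, a):
    s_next = min(s + 1, 4) if a == "RIGHT" else max(s - 1, 0)
    return (1.0 if s_next == 4 else -0.1), s_next

learner = QLearner(["LEFT", "RIGHT"])
returns = train(learner, step, s0=0, terminal={4})
# After training, the greedy policy points RIGHT all along the corridor.
```

Because the update is off-policy, the exploration noise in the behavior never contaminates the learned values: the drone acts ε-greedily, yet its greedy policy converges toward the optimal route.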

What’s Next

And yet, even here, you can sense the next question emerging. Q-learning bootstraps from a one-step lookahead:

$$r_t + \gamma \max_{a'} Q(s_{t+1}, a').$$

Monte Carlo does not bootstrap at all: it learns from the full return of the entire episode.

SARSA bootstraps from the next exploratory action.

Q-learning bootstraps from the next optimal action.

But what about everything in between?

What if the drone doesn’t want to look just one step ahead, nor wait until the end of the episode?

What if it wants to blend these approaches, peek two steps ahead, or five, or a flexible number of steps so it can shape its updates with richer, more nuanced information?

This is the doorway that leads to n-step TD methods, where the drone learns not from just one future, but from a horizon of futures, carefully balanced between immediate predictions and long-range intuition.

And that, naturally, leads to one of the most elegant and powerful ideas in reinforcement learning: a method that unifies Monte Carlo and TD by using all n-step returns at once, weighted geometrically by a single parameter.

A method called TD(λ).