Reinforcement Learning: SARSA

A simple guide to TD learning, step-size dynamics, and SARSA’s on-policy updates.


Before we introduce SARSA, let’s step back and look carefully at how our Monte Carlo (MC) learner updates its knowledge. In the MC setting, whenever a particular state–action pair $(s,a)$ appears in an episode, we eventually obtain a sample return:

$$G = r_t + r_{t+1} + \cdots + r_T,$$

and after observing this return, we refine our estimate $Q(s,a)$. The more often $(s,a)$ appears, the more returns we accumulate, and the more accurate our estimate becomes. Suppose the pair $(s,a)$ has appeared $k$ times across all episodes. Let the observed returns be:

$$G_1,\; G_2,\; \ldots,\; G_k.$$

The Monte Carlo principle says: *the best estimate of $Q(s,a)$ is simply the average of all observed returns.* So after $k$ visits:

$$Q_k(s,a) = \frac{1}{k}\sum_{i=1}^{k} G_i.$$

This looks simple, but computing the full sum every time is inefficient. We can rewrite it in an incremental form:

$$Q_k = Q_{k-1} + \frac{1}{k}\big(G_k - Q_{k-1}\big)$$
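To see why this holds, split off the last return from the running average:

$$Q_k = \frac{1}{k}\sum_{i=1}^{k} G_i = \frac{1}{k}\Big(G_k + (k-1)\,Q_{k-1}\Big) = Q_{k-1} + \frac{1}{k}\big(G_k - Q_{k-1}\big).$$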

Now the update depends only on the previous estimate, the new sample return, and a single step size. The expression above can be viewed as a special case of a much more general learning rule used throughout reinforcement learning:

$$\text{NewEstimate} = \text{OldEstimate} + \text{StepSize}\,\big(\text{Target} - \text{OldEstimate}\big)$$

where the target is whatever new evidence we just observed. This “error correction” structure is central to almost every RL algorithm.
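As a quick sanity check, here is a minimal Python sketch of this rule (the variable names are illustrative, and the returns are made-up numbers). With StepSize $= 1/k$ the error-correction update reproduces the plain average:

```python
# Incremental update: NewEstimate = OldEstimate + StepSize * (Target - OldEstimate).
# With StepSize = 1/k, this reproduces the plain average of the targets.
returns = [10, -4, 8, 2, 0]  # hypothetical sample returns

q = 0.0
for k, g in enumerate(returns, start=1):
    q += (1.0 / k) * (g - q)  # error-correction step

print(q, sum(returns) / len(returns))  # both are 3.2 (up to floating point)
```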

Before the $k$-th visit, our estimate $Q_{k-1}(s,a)$ summarizes everything we’ve learned so far; it is our prior knowledge. From the new episode, we receive a fresh return $G_k$, a single snapshot of the future; that is our new information.

Our new estimate should blend these two sources of information.

If instead we took:

$$Q_k = G_k \qquad (\text{StepSize} = 1)$$

we would be throwing away all past experience and trusting a single sample. That’s extremely unstable. If we set:

$$Q_k = Q_{k-1} \qquad (\text{StepSize} = 0)$$

we would never learn anything new. To strike a balance, we use a step-size parameter $\alpha$ between 0 and 1:

$$Q_k = Q_{k-1} + \alpha\big(G_k - Q_{k-1}\big).$$

Here:

- $\alpha$ close to 1 trusts the new return $G_k$ and forgets the past quickly;
- $\alpha$ close to 0 trusts the old estimate $Q_{k-1}$ and barely moves.

This single formula is the core of many RL algorithms, and it is exactly the door through which we now enter SARSA and Q-learning (we will say more about Q-learning in the next blog).

Sample-average vs constant-step-size

In Monte Carlo methods, the step-size is strictly tied to how many times we have visited a state–action pair:

$$\alpha_k = \frac{1}{k}$$

This makes the estimate converge to the true average of all observed returns, which is ideal when the environment is stationary and we want long-term averaging. The first few samples influence the estimate a lot, and later samples influence it less and less, so MC becomes very stable after many samples but slow to adapt if the environment changes. In many RL problems (especially control), we want an agent that:

- keeps adapting to recent experience, and
- tracks a changing (non-stationary) environment instead of averaging over all of history.

That’s why SARSA switches to a constant step-size:

$$\alpha = 0.1,\; 0.2,\; 0.5,\; \text{etc.}$$

Suppose we’re estimating the value of some state–action pair $(s,a)$ and the sequence of observed returns is:

| Visit $k$ | Return $G_k$ |
|---|---|
| 1 | 10 |
| 2 | −4 |
| 3 | 8 |
| 4 | 2 |
| 5 | 0 |
| 6 | 6 |
| 7 | −2 |
| 8 | 4 |
| 9 | 1 |
| 10 | 3 |

Let’s start with an initial estimate $Q_0 = 0$ and compare both update rules, the MC rule $\alpha_k = 1/k$ and a constant $\alpha = 0.5$. The updates look like this:

| $k$ | Return $G_k$ | Monte Carlo ($\alpha = 1/k$) | Constant $\alpha = 0.5$ |
|---|---|---|---|
| 1 | 10 | $Q_1 = 0 + 1\cdot(10 - 0) = 10.00$ | $Q_1 = 0 + 0.5(10 - 0) = 5.00$ |
| 2 | −4 | $Q_2 = 10 + 0.5(-4 - 10) = 3.00$ | $Q_2 = 5 + 0.5(-4 - 5) = 0.50$ |
| 3 | 8 | $Q_3 = 3 + (1/3)(8 - 3) = 4.67$ | $Q_3 = 0.5 + 0.5(8 - 0.5) = 4.25$ |
| 4 | 2 | $Q_4 = 4.67 + 0.25(2 - 4.67) = 4.00$ | $Q_4 = 4.25 + 0.5(2 - 4.25) = 3.13$ |
| 5 | 0 | $Q_5 = 4.00 + 0.2(0 - 4.00) = 3.20$ | $Q_5 = 3.13 + 0.5(0 - 3.13) = 1.56$ |
| 6 | 6 | $Q_6 = 3.20 + (1/6)(6 - 3.20) = 3.67$ | $Q_6 = 1.56 + 0.5(6 - 1.56) = 3.78$ |
| 7 | −2 | $Q_7 = 3.67 + (1/7)(-2 - 3.67) = 2.86$ | $Q_7 = 3.78 + 0.5(-2 - 3.78) = 0.89$ |
| 8 | 4 | $Q_8 = 2.86 + (1/8)(4 - 2.86) = 3.00$ | $Q_8 = 0.89 + 0.5(4 - 0.89) = 2.45$ |
| 9 | 1 | $Q_9 = 3.00 + (1/9)(1 - 3.00) = 2.78$ | $Q_9 = 2.45 + 0.5(1 - 2.45) = 1.72$ |
| 10 | 3 | $Q_{10} = 2.78 + 0.1(3 - 2.78) = 2.80$ | $Q_{10} = 1.72 + 0.5(3 - 1.72) = 2.36$ |
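The two columns can be recomputed exactly in a few lines (a sketch; any tiny differences from a hand-rounded table come from rounding at each intermediate step):

```python
# Sample-average (alpha = 1/k) vs constant step-size (alpha = 0.5),
# both starting from Q0 = 0, on the ten returns above.
returns = [10, -4, 8, 2, 0, 6, -2, 4, 1, 3]

q_mc, q_const = 0.0, 0.0
for k, g in enumerate(returns, start=1):
    q_mc += (1.0 / k) * (g - q_mc)   # converges to the running mean
    q_const += 0.5 * (g - q_const)   # exponentially recency-weighted

print(round(q_mc, 2), round(q_const, 2))  # 2.8 2.36
```

Note that the $1/k$ column ends at exactly the mean of all ten returns ($28/10 = 2.80$), as the incremental-average derivation promises.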

Let’s put this into a diagram to visualize it better:

*(Figure: Monte Carlo $1/k$ vs constant $\alpha = 0.5$ estimates over the ten visits)*

As we can see, the Monte Carlo ($1/k$) estimate gradually stabilizes as more data is collected, converging toward the true average of the returns. After many samples it is very stable, but as a result it reacts slowly to new information.

With a constant step-size of $\alpha = 0.5$, the estimate adapts quickly and becomes more responsive to the latest returns. Instead of computing the true mean, it produces a running, exponentially weighted estimate, which makes it particularly effective in non-stationary environments.
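The “exponentially weighted” description can be made precise: unrolling the constant-$\alpha$ update shows that each past return is weighted by how recent it is:

$$Q_k = (1-\alpha)^k\, Q_0 + \sum_{i=1}^{k} \alpha\,(1-\alpha)^{k-i}\, G_i.$$

The most recent return gets weight $\alpha$, and each older return is discounted by another factor of $(1-\alpha)$.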

Now that we’ve seen how $\alpha = 0.5$ behaves, let’s look at two more extreme cases: a very small learning rate and a very large one. This will help us understand why SARSA and Q-learning rely heavily on choosing an appropriate step-size. When we set:

$$\alpha = 0.1$$

we’re telling the algorithm:

Trust your old estimate much more than the new sample.

This makes the Q-value update extremely conservative. The agent changes its belief slowly, almost reluctantly. This is good when:

- returns are noisy and we want a smooth, stable estimate, or
- the environment is stationary, so old experience remains relevant.

But it also means:

- learning is slow, and
- the agent reacts sluggishly when the environment changes.

*(Figure: Q-value estimates with a small step-size $\alpha = 0.1$)*

On the opposite end of the spectrum:

$$\alpha = 0.9$$

means:

Trust the new return almost completely. Forget most of the past.

This makes the Q-value jump dramatically with each new return, chasing the latest sample almost completely.

With $\alpha = 0.9$:

*(Figure: Q-value estimates with a large step-size $\alpha = 0.9$)*

From one episode at a time to one step at a time

Up to this point, everything we’ve done has followed the Monte Carlo philosophy: learn only after an entire episode finishes. In an MC learner:

- no Q-value changes while the episode is being played, and
- every update needs the full return $G$, which is known only at the end.

This creates a very particular learning rhythm:

Play whole game → compute all G-values → update all Q-values → start next game.

Imagine an MDP with four states:

$$s_1,\; s_2,\; s_3,\; s_4$$

and only one action:

$$a$$

Assume the agent plays a single episode and experiences the following transitions:

  1. From $s_1$, take action $a$, get reward 3, move to $s_2$
  2. From $s_2$, take action $a$, get reward −1, move to $s_3$
  3. From $s_3$, take action $a$, get reward 5, move to terminal $s_4$

The episode ends when the agent reaches $s_4$. In a Monte Carlo learner, nothing is updated yet. We simply record the trajectory:

| Time $t$ | State | Action | Reward | Next State |
|---|---|---|---|---|
| 1 | $s_1$ | $a$ | 3 | $s_2$ |
| 2 | $s_2$ | $a$ | −1 | $s_3$ |
| 3 | $s_3$ | $a$ | 5 | $s_4$ |

Now that the game is over, we compute the returns (taking $\gamma = 1$ here for simplicity):

- $G(s_3,a) = 5$
- $G(s_2,a) = -1 + 5 = 4$
- $G(s_1,a) = 3 - 1 + 5 = 7$

Then, and only then, do we update every visited state–action pair:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\big(G(s,a) - Q(s,a)\big)$$
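Here is that whole-episode update as a short sketch (assuming, purely for illustration, $\gamma = 1$, $\alpha = 0.2$, and the starting Q-values 3, 7, 2 used in the TD walkthrough below):

```python
# Monte Carlo: record the whole trajectory, compute returns at the end,
# then update every visited state-action pair at once.
gamma, alpha = 1.0, 0.2                 # illustrative choices
Q = {"s1": 3.0, "s2": 7.0, "s3": 2.0}   # one action per state, so key by state
episode = [("s1", 3.0), ("s2", -1.0), ("s3", 5.0)]  # (state, reward)

G, returns = 0.0, {}
for s, r in reversed(episode):          # G_t = r_t + gamma * G_{t+1}
    G = r + gamma * G
    returns[s] = G                      # G(s1)=7, G(s2)=4, G(s3)=5

for s, g in returns.items():            # only now do the updates happen
    Q[s] += alpha * (g - Q[s])

print(Q)  # approximately {'s1': 3.8, 's2': 6.4, 's3': 2.6}
```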

This deep-backup approach requires the full trajectory to be known: every return $G$ depends on all future rewards. Look again at the first step of the episode:

$$s_1 \xrightarrow{r=3} s_2$$

At that moment, we already know three important pieces of information:

  1. The immediate reward: $r = 3$
  2. The next state: $s' = s_2$
  3. And crucially, our current estimate of the future from that next state: $Q(s_2, a)$

But Monte Carlo learning ignores all of this until the game ends. What if we took advantage of something we do know right away? When the agent moves from $s_1$ to $s_2$, we have a perfectly valid estimate of the remaining return:

$$r + \gamma Q(s_2, a)$$

This is not the actual return; MC would call it incomplete. But it is a reasonable guess about the future based on our current knowledge, and unlike the Monte Carlo return, we don’t have to wait for the game to end to compute it. That raises a powerful idea: maybe we don’t need the full return $G(s_1,a)$. Maybe we can update immediately using

$$r + \gamma Q(s', a)$$

This is the heart of the transition from Monte Carlo (wait for the full return) to Temporal Difference (use a predicted return). For the walkthrough below, assume $\alpha = 0.2$, $\gamma = 0.9$, and initial estimates $Q(s_1,a) = 3$, $Q(s_2,a) = 7$, $Q(s_3,a) = 2$.

Step 1: Transition $s_1 \rightarrow s_2$

Reward = 3

Old value:

$$Q(s_1,a) = 3$$

TD (Temporal Difference) target:

$$r + \gamma Q(s_2,a) = 3 + 0.9 \cdot 7 = 3 + 6.3 = 9.3$$

Error:

$$9.3 - 3 = 6.3$$

Update:

$$Q(s_1,a) \leftarrow 3 + 0.2 \cdot 6.3 = 4.26$$

Table: After Step 1

| | $(s_1, a)$ | $(s_2, a)$ | $(s_3, a)$ |
|---|---|---|---|
| Old Q | 3 | 7 | 2 |
| Reward $r$ | 3 | – | – |
| TD target $r + \gamma Q(s')$ | 9.3 | – | – |
| Error | 9.3 − 3 = 6.3 | – | – |
| Update $\alpha \cdot$ error | 0.2 × 6.3 = 1.26 | – | – |
| New Q | 4.26 | – | – |

Step 2: Transition $s_2 \rightarrow s_3$

Reward = –1

Old value:

$$Q(s_2,a) = 7$$

TD (Temporal Difference) target:

$$-1 + 0.9 \cdot Q(s_3,a) = -1 + 0.9 \cdot 2 = -1 + 1.8 = 0.8$$

Error:

$$0.8 - 7 = -6.2$$

Update:

$$Q(s_2,a) \leftarrow 7 + 0.2 \cdot (-6.2) = 7 - 1.24 = 5.76$$

Table: After Step 2

| | $(s_1, a)$ | $(s_2, a)$ | $(s_3, a)$ |
|---|---|---|---|
| Old Q | 3 | 7 | 2 |
| Reward $r$ | 3 | −1 | – |
| TD target $r + \gamma Q(s')$ | 9.3 | 0.8 | – |
| Error | 9.3 − 3 = 6.3 | 0.8 − 7 = −6.2 | – |
| Update $\alpha \cdot$ error | 0.2 × 6.3 = 1.26 | 0.2 × (−6.2) = −1.24 | – |
| New Q | 4.26 | 5.76 | – |

Step 3: Transition $s_3 \rightarrow s_4$

Similarly, here is the table after Step 3:

| | $(s_1, a)$ | $(s_2, a)$ | $(s_3, a)$ |
|---|---|---|---|
| Old Q | 3 | 7 | 2 |
| Reward $r$ | 3 | −1 | 5 |
| TD target $r + \gamma Q(s')$ | 9.3 | 0.8 | 5 |
| Error | 9.3 − 3 = 6.3 | 0.8 − 7 = −6.2 | 5 − 2 = 3 |
| Update $\alpha \cdot$ error | 0.2 × 6.3 = 1.26 | 0.2 × (−6.2) = −1.24 | 0.2 × 3 = 0.6 |
| New Q | 4.26 | 5.76 | 2.6 |
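The three steps above can be replayed in code (a sketch using the walkthrough’s numbers: $\alpha = 0.2$, $\gamma = 0.9$, and terminal value $Q(s_4) = 0$):

```python
# One-step TD: update immediately after each transition, mid-episode,
# using the bootstrapped target r + gamma * Q(s').
alpha, gamma = 0.2, 0.9
Q = {"s1": 3.0, "s2": 7.0, "s3": 2.0, "s4": 0.0}  # s4 is terminal
transitions = [("s1", 3.0, "s2"), ("s2", -1.0, "s3"), ("s3", 5.0, "s4")]

for s, r, s_next in transitions:
    target = r + gamma * Q[s_next]      # TD target
    Q[s] += alpha * (target - Q[s])     # incremental error correction

print({s: round(v, 2) for s, v in Q.items() if s != "s4"})
# {'s1': 4.26, 's2': 5.76, 's3': 2.6}
```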

From One Action to Many Actions

So far, our TD examples involved a very simplified world: every state had only a single action. That made updates easy, because the agent never had to choose; there was only one possible $Q(s,a)$ to consider. But real reinforcement learning problems involve multiple actions, so each state carries several Q-values and the agent must actually pick one.

This is exactly where SARSA comes in. In the single-action case, the TD target was simply

$$r + \gamma Q(s',a)$$

because the agent can take only action $a$ in the next state. But with multiple actions, we must ask a new question:

After reaching a new state $s'$, which action will the agent actually take?

The answer depends on the policy the agent is following, usually something like an ε-greedy exploration policy. So now the TD target depends not only on the next state but also on the next chosen action, which we’ll denote as:

$$a'$$

This gives the SARSA update its name:

State — Action — Reward — State — Action

(the five elements involved in each update)

Update Rule

In the Monte-Carlo world, learning is “all-at-once”:

Finish the entire episode → compute all returns $G$ → update all $Q$’s.

Temporal-Difference learning, with SARSA as its 1-step control variant, fundamentally changes this learning rhythm. Instead of waiting for the full return $G_t$, we approximate it on the fly using what we already know about the future.

After our “From One Action to Many Actions” discussion, we now think in terms of full SARSA tuples:

$$(s_t, a_t, r_t, s_{t+1}, a_{t+1})$$

At time step $t$, the agent:

- is in state $s_t$ and takes action $a_t$,
- receives reward $r_t$ and lands in state $s_{t+1}$,
- then picks its next action $a_{t+1}$ using its current policy.

Right after this transition we can build a TD target:

$$\text{TD target} = r_t + \gamma\, Q(s_{t+1}, a_{t+1})$$

and then do an incremental update:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ \underbrace{r_t + \gamma Q(s_{t+1}, a_{t+1})}_{\text{TD target}} - Q(s_t, a_t) \right]$$

The bracketed term is the TD error:

$$\delta_t = r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$

So the update is simply:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\, \delta_t$$

Compare this with Monte Carlo:

- Monte Carlo target: the full return $G_t$, available only after the episode ends.
- SARSA target: $r_t + \gamma Q(s_{t+1}, a_{t+1})$, available immediately after one step.

And the core learning rule is the same pattern:

$$\text{NewEstimate} = \text{OldEstimate} + \alpha\big(\text{Target} - \text{OldEstimate}\big)$$

The only difference is what we use as the target.

That’s why TD (Temporal-Difference) learning is called bootstrapping: it learns from its own predictions.
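In code, the whole rule is a couple of lines (a sketch; the function name `sarsa_update` and the dictionary Q-table layout are illustrative choices, not from the text):

```python
# SARSA update: Q(s,a) <- Q(s,a) + alpha * delta_t, where
# delta_t = r + gamma * Q(s', a') - Q(s, a) is the TD error.
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]  # TD error
    Q[(s, a)] += alpha * delta
    return delta

Q = defaultdict(float)                       # unseen pairs default to 0
Q[("s1", "a")], Q[("s2", "a")] = 3.0, 7.0    # values from the walkthrough
delta = sarsa_update(Q, "s1", "a", 3.0, "s2", "a", alpha=0.2)
print(round(delta, 2), round(Q[("s1", "a")], 2))  # 6.3 4.26
```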

At this point it’s natural to worry that a TD learner is “less accurate” than a Monte Carlo learner, because its target

$$r_t + \gamma Q(s_{t+1}, a_{t+1})$$

is only a rough guess of the full return

$$G_t = r_t + r_{t+1} + \cdots + r_T.$$

For example, suppose the Monte Carlo return from the current episode is $G_t = 10 + 1 + \cdots = 41$, while our TD learner (like SARSA) uses a target of about 40. It’s tempting to say “41 is the true value and 40 is just an approximation.” But from a statistical point of view, the Q-value can actually be more reliable.

The quantity $Q(s_{t+1}, a_{t+1})$ we plug into the TD target summarizes what happened in the previous $k-1$ episodes after visiting $(s_{t+1}, a_{t+1})$. In contrast, the Monte Carlo return $G_t$ from this episode is just one noisy sample. So MC uses a single fresh sample, while TD blends many past samples into its estimate.

In practice this tradeoff (slightly biased but lower-variance targets) often makes TD methods like SARSA more data-efficient and more effective than pure Monte-Carlo learning.

With this SARSA update rule, our agent can adjust its Q-values step by step as it interacts with the environment. In our on-policy Monte Carlo setup, we already saw how to turn Q-values into behavior using an ε-greedy policy to balance exploration and exploitation.

Now we’re ready to see how SARSA fits into that same on-policy control picture.

SARSA as an On-Policy Control Algorithm

In the on-policy Monte Carlo blog, we already built the full control loop: act ε-greedily with respect to the current Q-values, collect experience, update Q from observed returns, and keep improving the policy from the updated Q-values.

SARSA keeps exactly the same high-level idea and the same ε-greedy exploration strategy. The difference is not how we choose actions, but how and when we update Q.

Instead of waiting until the end of the episode and using the full return $G_t$, SARSA updates on every step using the 1-step TD target:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left(r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right).$$

The tuple

$$(s_t,\; a_t,\; r_t,\; s_{t+1},\; a_{t+1})$$

is generated by the same ε-greedy policy. That’s why SARSA is called on-policy. It learns the value of the very policy it is using to act (ε-greedy w.r.t. Q), just like our on-policy MC learner but with bootstrapped, step-by-step TD updates instead of full-episode Monte Carlo returns.

$$
\textbf{Algorithm: TD learner percept (on-policy, local policy update)} \\[6pt]
\textbf{input: } \begin{cases} \text{previous state } s_t \\ \text{previous action } a_t \\ \text{reward } r_t \\ \text{current state } s_{t+1} \\ \text{current action } a_{t+1} \end{cases} \\[8pt]
\textbf{persistent: } \begin{cases} Q(s,a)\ \text{current state--action value estimates} \\ \pi(s)\ \text{deterministic greedy policy} \\ \alpha\ \text{(step-size parameter)} \\ \gamma\ \text{(discount factor)} \\ A(s)\ \text{set of available actions} \end{cases} \\[10pt]
1.\hspace{0.5cm} \delta_t \leftarrow r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \\[6pt]
2.\hspace{0.5cm} Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\, \delta_t \\[8pt]
3.\hspace{0.5cm} \pi(s_t) \leftarrow \arg\max_{\hat{a} \in A(s_t)} Q(s_t, \hat{a}) \\[10pt]
4.\hspace{0.5cm} \textbf{return } Q,\ \pi
$$

$$
\textbf{Algorithm: TD learner Actuate (Exploration--Exploitation)} \\[10pt]
\textbf{Input: } \text{state } s \\[6pt]
\textbf{Persistent: } \begin{cases} \pi(s) \text{ (current policy, initially random)} \\ \varepsilon \text{ (exploration probability, initially } 1.0) \end{cases} \\[12pt]
1.\hspace{0.5cm} \beta \leftarrow \text{random number in } [0,1] \\[6pt]
2.\hspace{0.5cm} \textbf{if } \beta < \varepsilon \ \textbf{then} \\[6pt]
3.\hspace{1.2cm} a \leftarrow \text{random action from } A \\[6pt]
4.\hspace{0.5cm} \textbf{else} \\[6pt]
5.\hspace{1.2cm} a \leftarrow \pi(s) \\[6pt]
6.\hspace{0.5cm} \textbf{return } a
$$

$$
\textbf{Algorithm: Training TD learner (online SARSA)} \\[6pt]
\textbf{input: } \begin{cases} \text{MDP black box} \\ \text{TD learner (percept, actuate)} \\ N\ \text{training episodes} \\ s_0\ \text{starting state} \\ S_{\text{end}}\ \text{set of terminal states} \end{cases} \\[10pt]
1.\hspace{0.5cm} \text{initialize } R \text{ with zeros} \\[4pt]
2.\hspace{0.5cm} \text{initialize ifwin with false} \\[8pt]
3.\hspace{0.5cm} \textbf{for } i \leftarrow 1 \text{ to } N\ \textbf{do} \\[4pt]
4.\hspace{1.2cm} s \leftarrow s_0 \\[4pt]
5.\hspace{1.2cm} a \leftarrow \text{TD learner actuate}(s) \\[8pt]
6.\hspace{1.2cm} \textbf{while } s \notin S_{\text{end}}\ \textbf{do} \\[4pt]
7.\hspace{2.0cm} r,\ s' \leftarrow \text{MDP black box}(s,a) \\[4pt]
8.\hspace{2.0cm} R(i) \leftarrow R(i) + r \\[4pt]
9.\hspace{2.0cm} \textbf{if } s' \in S_{\text{end}}\ \textbf{then} \\[4pt]
10.\hspace{2.8cm} a' \leftarrow \text{dummy action (unused)} \\[4pt]
11.\hspace{2.8cm} \text{TD learner percept}(s,a,r,s',a') \\[4pt]
12.\hspace{2.8cm} \textbf{break} \\[6pt]
13.\hspace{2.0cm} \textbf{else} \\[4pt]
14.\hspace{2.8cm} a' \leftarrow \text{TD learner actuate}(s') \\[4pt]
15.\hspace{2.8cm} \text{TD learner percept}(s,a,r,s',a') \\[4pt]
16.\hspace{2.8cm} s \leftarrow s' \\[4pt]
17.\hspace{2.8cm} a \leftarrow a' \\[8pt]
18.\hspace{1.2cm} \textbf{if } s \in S_{\text{winning\_end}}\ \textbf{then} \\[4pt]
19.\hspace{2.0cm} \text{ifwin}(i) \leftarrow \text{true} \\[8pt]
20.\hspace{0.5cm} \textbf{return } R,\ \text{ifwin}
$$
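Putting percept and actuate together, here is a compact runnable version of the training loop on a made-up 4-state chain. The environment, the +1 reward at the right end, and all hyperparameters below are illustrative assumptions, not taken from the text:

```python
import random

# Toy chain: states 0..3, terminal at 3; actions move left (-1) or right (+1).
# Reaching the terminal state yields reward +1; every other step gives 0.
TERMINAL = 3
ACTIONS = (-1, 1)

def step(s, a):
    s_next = min(max(s + a, 0), TERMINAL)
    return (1.0 if s_next == TERMINAL else 0.0), s_next

def epsilon_greedy(Q, s, eps):
    # Explore with probability eps, otherwise act greedily w.r.t. Q.
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def train_sarsa(episodes=500, alpha=0.2, gamma=0.9, eps=0.1, seed=0):
    random.seed(seed)
    Q = {(s, a): 0.0 for s in range(TERMINAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, eps)
        while s != TERMINAL:
            r, s_next = step(s, a)
            if s_next == TERMINAL:
                Q[(s, a)] += alpha * (r - Q[(s, a)])  # terminal: no bootstrap
                break
            a_next = epsilon_greedy(Q, s_next, eps)   # on-policy action choice
            target = r + gamma * Q[(s_next, a_next)]  # SARSA TD target
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q

Q = train_sarsa()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(TERMINAL)}
print(greedy)  # the learned greedy policy should move right in every state
```

Note how the next action `a_next` is chosen by the same ε-greedy policy that generates behavior, and is then both used in the TD target and actually executed: that is exactly what makes this on-policy.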

What’s Next

When you first look at the SARSA update, it feels like everything finally fits together. On every step the agent adjusts its Q-value using the action it actually took, completing a tidy loop: act, observe, update, improve. Our drone or car now learns continuously rather than waiting for full episodes the way Monte Carlo does. But if you stare at SARSA’s update a little longer, something subtle appears. Because the next action comes from the same ε-greedy policy used to explore, every estimate is shaped by that small bit of randomness. In tricky or risky regions, SARSA ends up learning the value of a slightly jittery, exploration-heavy policy, not the clean, greedy policy we ultimately care about.

This naturally raises a deeper question. What if we want to keep exploring because exploration is essential, yet learn about a policy that behaves more confidently than our exploration-heavy one? Is there a way for the agent to act with curiosity but learn with conviction? In the next blog, we’ll follow this tension to a surprisingly elegant solution.