Reinforcement Learning: SARSA

A simple guide to TD learning, step-size dynamics, and SARSA’s on-policy updates.


Before we introduce SARSA, let’s step back and look carefully at how our Monte Carlo (MC) learner updates its knowledge. In the MC setting, whenever a particular state–action pair $(s,a)$ appears in an episode, we eventually obtain a sample return:

$$G = r_t + r_{t+1} + \cdots + r_T,$$

and after observing this return, we refine our estimate $Q(s,a)$. The more often $(s,a)$ appears, the more returns we accumulate, and the more accurate our estimate becomes. Suppose the pair $(s,a)$ has appeared $k$ times across all episodes. Let the observed returns be:

$$G_1,\; G_2,\; \ldots,\; G_k.$$

The Monte Carlo principle says: *the best estimate of $Q(s,a)$ is simply the average of all observed returns.* So after $k$ visits:

$$Q_k(s,a) = \frac{1}{k}\sum_{i=1}^{k} G_i.$$

This looks simple, but computing the full sum every time is inefficient. We can rewrite it in an incremental form:

$$Q_k = Q_{k-1} + \frac{1}{k}\big(G_k - Q_{k-1}\big)$$
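To see why this holds, split off the last return from the running average:

$$Q_k = \frac{1}{k}\sum_{i=1}^{k} G_i = \frac{1}{k}\Big(G_k + (k-1)\,Q_{k-1}\Big) = Q_{k-1} + \frac{1}{k}\big(G_k - Q_{k-1}\big).$$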

Now the update depends only on the previous estimate, the new sample return, and a single step size. The expression above can be viewed as a special case of a much more general learning rule used throughout reinforcement learning:

$$\text{NewEstimate} = \text{OldEstimate} + \text{StepSize}\,\big(\text{Target} - \text{OldEstimate}\big)$$

where the target is whatever new evidence we just observed. This “error correction” structure is central to almost every RL algorithm.
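As a quick sanity check, here is a minimal Python sketch of this rule (the variable names are illustrative, and the returns are made-up numbers). With StepSize $= 1/k$ the error-correction update reproduces the plain average:

```python
# Incremental update: NewEstimate = OldEstimate + StepSize * (Target - OldEstimate).
# With StepSize = 1/k, this reproduces the plain average of the targets.
returns = [10, -4, 8, 2, 0]  # hypothetical sample returns

q = 0.0
for k, g in enumerate(returns, start=1):
    q += (1.0 / k) * (g - q)  # error-correction step

print(q, sum(returns) / len(returns))  # both are 3.2 (up to floating point)
```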

Before the $k$-th visit, our estimate $Q_{k-1}(s,a)$ summarizes everything we’ve learned so far; it is our prior knowledge. From the new episode, we receive a fresh return $G_k$, a single snapshot of the future; that is our new information.

Our new estimate should blend these two sources of information.

If instead we took:

$$Q_k = G_k \qquad (\text{StepSize} = 1)$$

we would be throwing away all past experience and trusting a single sample. That’s extremely unstable. If we set:

$$Q_k = Q_{k-1} \qquad (\text{StepSize} = 0)$$

we would never learn anything new. To strike a balance, we use a step-size parameter $\alpha$ between 0 and 1:

$$Q_k = Q_{k-1} + \alpha\big(G_k - Q_{k-1}\big).$$

Here:

- $\alpha$ close to 1 trusts the new return $G_k$ and forgets the past quickly;
- $\alpha$ close to 0 trusts the old estimate $Q_{k-1}$ and barely moves.

This single formula is the core of many RL algorithms, and it is exactly the door through which we now enter SARSA and Q-learning (we will say more about Q-learning in the next blog).

Sample-average vs constant-step-size

In Monte Carlo methods, the step-size is strictly tied to how many times we have visited a state–action pair:

$$\alpha_k = \frac{1}{k}$$

This makes the estimate converge to the true average of all observed returns, which is ideal when the environment is stationary and we want long-term averaging. The first few samples influence the estimate a lot, and later samples influence it less and less, so MC becomes very stable after many samples but slow to adapt if the environment changes. In many RL problems (especially control), we want an agent that:

- keeps adapting to recent experience, and
- tracks a changing (non-stationary) environment instead of averaging over all of history.

That’s why SARSA switches to a constant step-size:

$$\alpha = 0.1,\; 0.2,\; 0.5,\; \text{etc.}$$

Suppose we’re estimating the value of some state–action pair $(s,a)$ and the sequence of observed returns is:

| Visit $k$ | Return $G_k$ |
|---|---|
| 1 | 10 |
| 2 | −4 |
| 3 | 8 |
| 4 | 2 |
| 5 | 0 |
| 6 | 6 |
| 7 | −2 |
| 8 | 4 |
| 9 | 1 |
| 10 | 3 |

Let’s start with an initial estimate $Q_0 = 0$ and compare both update rules, the MC rule $\alpha_k = 1/k$ and a constant $\alpha = 0.5$. The updates look like this:

| $k$ | Return $G_k$ | Monte Carlo ($\alpha = 1/k$) | Constant $\alpha = 0.5$ |
|---|---|---|---|
| 1 | 10 | $Q_1 = 0 + 1\cdot(10 - 0) = 10.00$ | $Q_1 = 0 + 0.5(10 - 0) = 5.00$ |
| 2 | −4 | $Q_2 = 10 + 0.5(-4 - 10) = 3.00$ | $Q_2 = 5 + 0.5(-4 - 5) = 0.50$ |
| 3 | 8 | $Q_3 = 3 + (1/3)(8 - 3) = 4.67$ | $Q_3 = 0.5 + 0.5(8 - 0.5) = 4.25$ |
| 4 | 2 | $Q_4 = 4.67 + 0.25(2 - 4.67) = 4.00$ | $Q_4 = 4.25 + 0.5(2 - 4.25) = 3.13$ |
| 5 | 0 | $Q_5 = 4.00 + 0.2(0 - 4.00) = 3.20$ | $Q_5 = 3.13 + 0.5(0 - 3.13) = 1.56$ |
| 6 | 6 | $Q_6 = 3.20 + (1/6)(6 - 3.20) = 3.67$ | $Q_6 = 1.56 + 0.5(6 - 1.56) = 3.78$ |
| 7 | −2 | $Q_7 = 3.67 + (1/7)(-2 - 3.67) = 2.86$ | $Q_7 = 3.78 + 0.5(-2 - 3.78) = 0.89$ |
| 8 | 4 | $Q_8 = 2.86 + (1/8)(4 - 2.86) = 3.00$ | $Q_8 = 0.89 + 0.5(4 - 0.89) = 2.45$ |
| 9 | 1 | $Q_9 = 3.00 + (1/9)(1 - 3.00) = 2.78$ | $Q_9 = 2.45 + 0.5(1 - 2.45) = 1.72$ |
| 10 | 3 | $Q_{10} = 2.78 + 0.1(3 - 2.78) = 2.80$ | $Q_{10} = 1.72 + 0.5(3 - 1.72) = 2.36$ |
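The two columns can be recomputed exactly in a few lines (a sketch; any tiny differences from a hand-rounded table come from rounding at each intermediate step):

```python
# Sample-average (alpha = 1/k) vs constant step-size (alpha = 0.5),
# both starting from Q0 = 0, on the ten returns above.
returns = [10, -4, 8, 2, 0, 6, -2, 4, 1, 3]

q_mc, q_const = 0.0, 0.0
for k, g in enumerate(returns, start=1):
    q_mc += (1.0 / k) * (g - q_mc)   # converges to the running mean
    q_const += 0.5 * (g - q_const)   # exponentially recency-weighted

print(round(q_mc, 2), round(q_const, 2))  # 2.8 2.36
```

Note that the $1/k$ column ends at exactly the mean of all ten returns ($28/10 = 2.80$), as the incremental-average derivation promises.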

Let’s put this into a diagram to visualize it better:

*(Figure: Monte Carlo $1/k$ vs constant $\alpha = 0.5$ estimates over the ten visits)*

As we can see, the Monte Carlo ($1/k$) estimate gradually stabilizes as more data is collected, converging toward the true average of the returns. After many samples it is very stable, but as a result it reacts slowly to new information.

With a constant step-size of $\alpha = 0.5$, the estimate adapts quickly and becomes more responsive to the latest returns. Instead of computing the true mean, it produces a running, exponentially weighted estimate, which makes it particularly effective in non-stationary environments.
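The “exponentially weighted” description can be made precise: unrolling the constant-$\alpha$ update shows that each past return is weighted by how recent it is:

$$Q_k = (1-\alpha)^k\, Q_0 + \sum_{i=1}^{k} \alpha\,(1-\alpha)^{k-i}\, G_i.$$

The most recent return gets weight $\alpha$, and each older return is discounted by another factor of $(1-\alpha)$.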

Now that we’ve seen how $\alpha = 0.5$ behaves, let’s look at two more extreme cases: a very small learning rate and a very large one. This will help us understand why SARSA and Q-learning rely heavily on choosing an appropriate step-size. When we set:

$$\alpha = 0.1$$

we’re telling the algorithm:

Trust your old estimate much more than the new sample.

This makes the Q-value update extremely conservative. The agent changes its belief slowly, almost reluctantly. This is good when:

- returns are noisy and we want a smooth, stable estimate, or
- the environment is stationary, so old experience remains relevant.

But it also means:

- learning is slow, and
- the agent reacts sluggishly when the environment changes.

*(Figure: Q-value estimates with a small step-size $\alpha = 0.1$)*

On the opposite end of the spectrum:

$$\alpha = 0.9$$

means:

Trust the new return almost completely. Forget most of the past.

This makes the Q-value jump dramatically with each new return, chasing the latest sample almost completely.

With $\alpha = 0.9$:

*(Figure: Q-value estimates with a large step-size $\alpha = 0.9$)*

From one episode at a time to one step at a time

Up to this point, everything we’ve done has followed the Monte Carlo philosophy: learn only after an entire episode finishes. In an MC learner:

- no Q-value changes while the episode is being played, and
- every update needs the full return $G$, which is known only at the end.

This creates a very particular learning rhythm:

Play whole game → compute all G-values → update all Q-values → start next game.

Imagine an MDP with four states:

$$s_1,\; s_2,\; s_3,\; s_4$$

and only one action:

$$a$$

Assume the agent plays a single episode and experiences the following transitions:

  1. From $s_1$, take action $a$, get reward 3, move to $s_2$
  2. From $s_2$, take action $a$, get reward −1, move to $s_3$
  3. From $s_3$, take action $a$, get reward 5, move to terminal $s_4$

The episode ends when the agent reaches $s_4$. In a Monte Carlo learner, nothing is updated yet. We simply record the trajectory:

| Time $t$ | State | Action | Reward | Next State |
|---|---|---|---|---|
| 1 | $s_1$ | $a$ | 3 | $s_2$ |
| 2 | $s_2$ | $a$ | −1 | $s_3$ |
| 3 | $s_3$ | $a$ | 5 | $s_4$ |

Now that the game is over, we compute the returns (taking $\gamma = 1$ here for simplicity):

- $G(s_3,a) = 5$
- $G(s_2,a) = -1 + 5 = 4$
- $G(s_1,a) = 3 - 1 + 5 = 7$

Then, and only then, do we update every visited state–action pair:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\big(G(s,a) - Q(s,a)\big)$$
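Here is that whole-episode update as a short sketch (assuming, purely for illustration, $\gamma = 1$, $\alpha = 0.2$, and the starting Q-values 3, 7, 2 used in the TD walkthrough below):

```python
# Monte Carlo: record the whole trajectory, compute returns at the end,
# then update every visited state-action pair at once.
gamma, alpha = 1.0, 0.2                 # illustrative choices
Q = {"s1": 3.0, "s2": 7.0, "s3": 2.0}   # one action per state, so key by state
episode = [("s1", 3.0), ("s2", -1.0), ("s3", 5.0)]  # (state, reward)

G, returns = 0.0, {}
for s, r in reversed(episode):          # G_t = r_t + gamma * G_{t+1}
    G = r + gamma * G
    returns[s] = G                      # G(s1)=7, G(s2)=4, G(s3)=5

for s, g in returns.items():            # only now do the updates happen
    Q[s] += alpha * (g - Q[s])

print(Q)  # approximately {'s1': 3.8, 's2': 6.4, 's3': 2.6}
```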

This deep-backup approach requires the full trajectory to be known: every return $G$ depends on all future rewards. Look again at the first step of the episode:

$$s_1 \xrightarrow{r=3} s_2$$

At that moment, we already know three important pieces of information:

  1. The immediate reward: $r = 3$
  2. The next state: $s' = s_2$
  3. And crucially, our current estimate of the future from that next state: $Q(s_2, a)$

But Monte Carlo learning ignores all of this until the game ends. What if we took advantage of something we do know right away? When the agent moves from $s_1$ to $s_2$, we have a perfectly valid estimate of the remaining return:

$$r + \gamma Q(s_2, a)$$

This is not the actual return; MC would call it incomplete. But it is a reasonable guess about the future based on our current knowledge, and unlike the Monte Carlo return, we don’t have to wait for the game to end to compute it. That raises a powerful idea: maybe we don’t need the full return $G(s_1,a)$. Maybe we can update immediately using

$$r + \gamma Q(s', a)$$

This is the heart of the transition from Monte Carlo (wait for the full return) to Temporal Difference (use a predicted return). For the walkthrough below, assume $\alpha = 0.2$, $\gamma = 0.9$, and initial estimates $Q(s_1,a) = 3$, $Q(s_2,a) = 7$, $Q(s_3,a) = 2$.

Step 1: Transition $s_1 \rightarrow s_2$

Reward = 3

Old value:

$$Q(s_1,a) = 3$$

TD (Temporal Difference) target:

$$r + \gamma Q(s_2,a) = 3 + 0.9 \cdot 7 = 3 + 6.3 = 9.3$$

Error:

$$9.3 - 3 = 6.3$$

Update:

$$Q(s_1,a) \leftarrow 3 + 0.2 \cdot 6.3 = 4.26$$

Table: After Step 1

| | $(s_1, a)$ | $(s_2, a)$ | $(s_3, a)$ |
|---|---|---|---|
| Old Q | 3 | 7 | 2 |
| Reward $r$ | 3 | – | – |
| TD target $r + \gamma Q(s')$ | 9.3 | – | – |
| Error | 9.3 − 3 = 6.3 | – | – |
| Update $\alpha \cdot$ error | 0.2 × 6.3 = 1.26 | – | – |
| New Q | 4.26 | – | – |

Step 2: Transition $s_2 \rightarrow s_3$

Reward = –1

Old value:

$$Q(s_2,a) = 7$$

TD (Temporal Difference) target:

$$-1 + 0.9 \cdot Q(s_3,a) = -1 + 0.9 \cdot 2 = -1 + 1.8 = 0.8$$

Error:

$$0.8 - 7 = -6.2$$

Update:

$$Q(s_2,a) \leftarrow 7 + 0.2 \cdot (-6.2) = 7 - 1.24 = 5.76$$

Table: After Step 2

| | $(s_1, a)$ | $(s_2, a)$ | $(s_3, a)$ |
|---|---|---|---|
| Old Q | 3 | 7 | 2 |
| Reward $r$ | 3 | −1 | – |
| TD target $r + \gamma Q(s')$ | 9.3 | 0.8 | – |
| Error | 9.3 − 3 = 6.3 | 0.8 − 7 = −6.2 | – |
| Update $\alpha \cdot$ error | 0.2 × 6.3 = 1.26 | 0.2 × (−6.2) = −1.24 | – |
| New Q | 4.26 | 5.76 | – |

Step 3: Transition $s_3 \rightarrow s_4$

Similarly, here is the table after Step 3:

| | $(s_1, a)$ | $(s_2, a)$ | $(s_3, a)$ |
|---|---|---|---|
| Old Q | 3 | 7 | 2 |
| Reward $r$ | 3 | −1 | 5 |
| TD target $r + \gamma Q(s')$ | 9.3 | 0.8 | 5 |
| Error | 9.3 − 3 = 6.3 | 0.8 − 7 = −6.2 | 5 − 2 = 3 |
| Update $\alpha \cdot$ error | 0.2 × 6.3 = 1.26 | 0.2 × (−6.2) = −1.24 | 0.2 × 3 = 0.6 |
| New Q | 4.26 | 5.76 | 2.6 |
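The three steps above can be replayed in code (a sketch using the walkthrough’s numbers: $\alpha = 0.2$, $\gamma = 0.9$, and terminal value $Q(s_4) = 0$):

```python
# One-step TD: update immediately after each transition, mid-episode,
# using the bootstrapped target r + gamma * Q(s').
alpha, gamma = 0.2, 0.9
Q = {"s1": 3.0, "s2": 7.0, "s3": 2.0, "s4": 0.0}  # s4 is terminal
transitions = [("s1", 3.0, "s2"), ("s2", -1.0, "s3"), ("s3", 5.0, "s4")]

for s, r, s_next in transitions:
    target = r + gamma * Q[s_next]      # TD target
    Q[s] += alpha * (target - Q[s])     # incremental error correction

print({s: round(v, 2) for s, v in Q.items() if s != "s4"})
# {'s1': 4.26, 's2': 5.76, 's3': 2.6}
```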

From One Action to Many Actions

So far, our TD examples involved a very simplified world: every state had only a single action. That made updates easy, because the agent never had to choose; there was only one possible $Q(s,a)$ to consider. But real reinforcement learning problems involve multiple actions, so each state carries several Q-values and the agent must actually pick one.

This is exactly where SARSA comes in. In the single-action case, the TD target was simply

$$r + \gamma Q(s',a)$$

because the agent can take only action $a$ in the next state. But with multiple actions, we must ask a new question:

After reaching a new state $s'$, which action will the agent actually take?

The answer depends on the policy the agent is following, usually something like an ε-greedy exploration policy. So now the TD target depends not only on the next state but also on the next chosen action, which we’ll denote as:

$$a'$$

This gives the SARSA update its name:

State — Action — Reward — State — Action

(the five elements involved in each update)

Update Rule

In the Monte-Carlo world, learning is “all-at-once”:

Finish the entire episode → compute all returns $G$ → update all $Q$’s.

Temporal-Difference learning, with SARSA as its 1-step control variant, fundamentally changes this learning rhythm. Instead of waiting for the full return $G_t$, we approximate it on the fly using what we already know about the future.

After our “From One Action to Many Actions” discussion, we now think in terms of full SARSA tuples:

$$(s_t, a_t, r_t, s_{t+1}, a_{t+1})$$

At time step $t$, the agent:

- is in state $s_t$ and takes action $a_t$,
- receives reward $r_t$ and lands in state $s_{t+1}$,
- then picks its next action $a_{t+1}$ using its current policy.

Right after this transition we can build a TD target:

$$\text{TD target} = r_t + \gamma\, Q(s_{t+1}, a_{t+1})$$

and then do an incremental update:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ \underbrace{r_t + \gamma Q(s_{t+1}, a_{t+1})}_{\text{TD target}} - Q(s_t, a_t) \right]$$

The bracketed term is the TD error:

$$\delta_t = r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$

So the update is simply:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\, \delta_t$$

Compare this with Monte Carlo:

- Monte Carlo target: the full return $G_t$, available only after the episode ends.
- SARSA target: $r_t + \gamma Q(s_{t+1}, a_{t+1})$, available immediately after one step.

And the core learning rule is the same pattern:

$$\text{NewEstimate} = \text{OldEstimate} + \alpha\big(\text{Target} - \text{OldEstimate}\big)$$

The only difference is what we use as the target.

That’s why TD (Temporal-Difference) learning is called bootstrapping: it learns from its own predictions.
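In code, the whole rule is a couple of lines (a sketch; the function name `sarsa_update` and the dictionary Q-table layout are illustrative choices, not from the text):

```python
# SARSA update: Q(s,a) <- Q(s,a) + alpha * delta_t, where
# delta_t = r + gamma * Q(s', a') - Q(s, a) is the TD error.
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]  # TD error
    Q[(s, a)] += alpha * delta
    return delta

Q = defaultdict(float)                       # unseen pairs default to 0
Q[("s1", "a")], Q[("s2", "a")] = 3.0, 7.0    # values from the walkthrough
delta = sarsa_update(Q, "s1", "a", 3.0, "s2", "a", alpha=0.2)
print(round(delta, 2), round(Q[("s1", "a")], 2))  # 6.3 4.26
```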

At this point it’s natural to worry that a TD learner is “less accurate” than a Monte Carlo learner, because its target

$$r_t + \gamma Q(s_{t+1}, a_{t+1})$$

is only a rough guess of the full return

$$G_t = r_t + r_{t+1} + \cdots + r_T.$$

For example, suppose the Monte Carlo return from the current episode is $G_t = 10 + 1 + \cdots = 41$, while our TD learner (like SARSA) uses a target of about 40. It’s tempting to say “41 is the true value and 40 is just an approximation.” But from a statistical point of view, the Q-value can actually be more reliable.

The quantity $Q(s_{t+1}, a_{t+1})$ we plug into the TD target summarizes what happened in the previous $k-1$ episodes after visiting $(s_{t+1}, a_{t+1})$. In contrast, the Monte Carlo return $G_t$ from this episode is just one noisy sample. So MC uses a single fresh sample, while TD blends many past samples into its estimate.

In practice this tradeoff (slightly biased but lower-variance targets) often makes TD methods like SARSA more data-efficient and more effective than pure Monte-Carlo learning.

With this SARSA update rule, our agent can adjust its Q-values step by step as it interacts with the environment. In our on-policy Monte Carlo setup, we already saw how to turn Q-values into behavior using an ε-greedy policy to balance exploration and exploitation.

Now we’re ready to see how SARSA fits into that same on-policy control picture.

SARSA as an On-Policy Control Algorithm

In the on-policy Monte Carlo blog, we already built the full control loop: act ε-greedily with respect to the current Q-values, collect experience, update Q from observed returns, and keep improving the policy from the updated Q-values.

SARSA keeps exactly the same high-level idea and the same ε-greedy exploration strategy. The difference is not how we choose actions, but how and when we update Q.

Instead of waiting until the end of the episode and using the full return $G_t$, SARSA updates on every step using the 1-step TD target:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left(r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right).$$

The tuple

$$(s_t,\; a_t,\; r_t,\; s_{t+1},\; a_{t+1})$$

is generated by the same ε-greedy policy. That’s why SARSA is called on-policy. It learns the value of the very policy it is using to act (ε-greedy w.r.t. Q), just like our on-policy MC learner but with bootstrapped, step-by-step TD updates instead of full-episode Monte Carlo returns.

$$
\textbf{Algorithm: TD learner percept (on-policy, local policy update)} \\[6pt]
\textbf{input: } \begin{cases} \text{previous state } s_t \\ \text{previous action } a_t \\ \text{reward } r_t \\ \text{current state } s_{t+1} \\ \text{current action } a_{t+1} \end{cases} \\[8pt]
\textbf{persistent: } \begin{cases} Q(s,a)\ \text{current state--action value estimates} \\ \pi(s)\ \text{deterministic greedy policy} \\ \alpha\ \text{(step-size parameter)} \\ \gamma\ \text{(discount factor)} \\ A(s)\ \text{set of available actions} \end{cases} \\[10pt]
1.\hspace{0.5cm} \delta_t \leftarrow r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \\[6pt]
2.\hspace{0.5cm} Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\, \delta_t \\[8pt]
3.\hspace{0.5cm} \pi(s_t) \leftarrow \arg\max_{\hat{a} \in A(s_t)} Q(s_t, \hat{a}) \\[10pt]
4.\hspace{0.5cm} \textbf{return } Q,\ \pi
$$

$$
\textbf{Algorithm: TD learner Actuate (Exploration--Exploitation)} \\[10pt]
\textbf{Input: } \text{state } s \\[6pt]
\textbf{Persistent: } \begin{cases} \pi(s) \text{ (current policy, initially random)} \\ \varepsilon \text{ (exploration probability, initially } 1.0) \end{cases} \\[12pt]
1.\hspace{0.5cm} \beta \leftarrow \text{random number in } [0,1] \\[6pt]
2.\hspace{0.5cm} \textbf{if } \beta < \varepsilon \ \textbf{then} \\[6pt]
3.\hspace{1.2cm} a \leftarrow \text{random action from } A \\[6pt]
4.\hspace{0.5cm} \textbf{else} \\[6pt]
5.\hspace{1.2cm} a \leftarrow \pi(s) \\[6pt]
6.\hspace{0.5cm} \textbf{return } a
$$

$$
\textbf{Algorithm: Training TD learner (online SARSA)} \\[6pt]
\textbf{input: } \begin{cases} \text{MDP black box} \\ \text{TD learner (percept, actuate)} \\ N\ \text{training episodes} \\ s_0\ \text{starting state} \\ S_{\text{end}}\ \text{set of terminal states} \end{cases} \\[10pt]
1.\hspace{0.5cm} \text{initialize } R \text{ with zeros} \\[4pt]
2.\hspace{0.5cm} \text{initialize ifwin with false} \\[8pt]
3.\hspace{0.5cm} \textbf{for } i \leftarrow 1 \text{ to } N\ \textbf{do} \\[4pt]
4.\hspace{1.2cm} s \leftarrow s_0 \\[4pt]
5.\hspace{1.2cm} a \leftarrow \text{TD learner actuate}(s) \\[8pt]
6.\hspace{1.2cm} \textbf{while } s \notin S_{\text{end}}\ \textbf{do} \\[4pt]
7.\hspace{2.0cm} r,\ s' \leftarrow \text{MDP black box}(s,a) \\[4pt]
8.\hspace{2.0cm} R(i) \leftarrow R(i) + r \\[4pt]
9.\hspace{2.0cm} \textbf{if } s' \in S_{\text{end}}\ \textbf{then} \\[4pt]
10.\hspace{2.8cm} a' \leftarrow \text{dummy action (unused)} \\[4pt]
11.\hspace{2.8cm} \text{TD learner percept}(s,a,r,s',a') \\[4pt]
12.\hspace{2.8cm} \textbf{break} \\[6pt]
13.\hspace{2.0cm} \textbf{else} \\[4pt]
14.\hspace{2.8cm} a' \leftarrow \text{TD learner actuate}(s') \\[4pt]
15.\hspace{2.8cm} \text{TD learner percept}(s,a,r,s',a') \\[4pt]
16.\hspace{2.8cm} s \leftarrow s' \\[4pt]
17.\hspace{2.8cm} a \leftarrow a' \\[8pt]
18.\hspace{1.2cm} \textbf{if } s \in S_{\text{winning\_end}}\ \textbf{then} \\[4pt]
19.\hspace{2.0cm} \text{ifwin}(i) \leftarrow \text{true} \\[8pt]
20.\hspace{0.5cm} \textbf{return } R,\ \text{ifwin}
$$
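Putting percept and actuate together, here is a compact runnable version of the training loop on a made-up 4-state chain. The environment, the +1 reward at the right end, and all hyperparameters below are illustrative assumptions, not taken from the text:

```python
import random

# Toy chain: states 0..3, terminal at 3; actions move left (-1) or right (+1).
# Reaching the terminal state yields reward +1; every other step gives 0.
TERMINAL = 3
ACTIONS = (-1, 1)

def step(s, a):
    s_next = min(max(s + a, 0), TERMINAL)
    return (1.0 if s_next == TERMINAL else 0.0), s_next

def epsilon_greedy(Q, s, eps):
    # Explore with probability eps, otherwise act greedily w.r.t. Q.
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def train_sarsa(episodes=500, alpha=0.2, gamma=0.9, eps=0.1, seed=0):
    random.seed(seed)
    Q = {(s, a): 0.0 for s in range(TERMINAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, eps)
        while s != TERMINAL:
            r, s_next = step(s, a)
            if s_next == TERMINAL:
                Q[(s, a)] += alpha * (r - Q[(s, a)])  # terminal: no bootstrap
                break
            a_next = epsilon_greedy(Q, s_next, eps)   # on-policy action choice
            target = r + gamma * Q[(s_next, a_next)]  # SARSA TD target
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q

Q = train_sarsa()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(TERMINAL)}
print(greedy)  # the learned greedy policy should move right in every state
```

Note how the next action `a_next` is chosen by the same ε-greedy policy that generates behavior, and is then both used in the TD target and actually executed: that is exactly what makes this on-policy.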

What’s Next

When you first look at the SARSA update, it feels like everything finally fits together. On every step the agent adjusts its Q-value using the action it actually took, completing a tidy loop: act, observe, update, improve. Our drone or car now learns continuously rather than waiting for full episodes the way Monte Carlo does. But if you stare at SARSA’s update a little longer, something subtle appears. Because the next action comes from the same ε-greedy policy used to explore, every estimate is shaped by that small bit of randomness. In tricky or risky regions, SARSA ends up learning the value of a slightly jittery, exploration-heavy policy, not the clean, greedy policy we ultimately care about.

This naturally raises a deeper question. What if we want to keep exploring because exploration is essential, yet learn about a policy that behaves more confidently than our exploration-heavy one? Is there a way for the agent to act with curiosity but learn with conviction? In the next blog, we’ll follow this tension to a surprisingly elegant solution.