Reinforcement Learning: N-Step SARSA

A clear walkthrough of n-step returns, showing how they unify Monte Carlo and temporal-difference methods through a controllable lookahead horizon.


Every method so far has made a very specific commitment about how far into the future it wants to look.

Monte Carlo says:

“I will wait. I will watch the entire episode unfold. Only when the story is finished will I decide what the first step was worth.”

One-step TD methods like SARSA and Q-learning say:

“I won’t wait that long. I’ll peek one step ahead, trust my current estimate for the rest, and update immediately.”

And suddenly a question feels unavoidable. Why only one step? Why all the way to the end? Why not something in between?

To see what this really means, let’s slow time again and zoom into a single episode in Grid City. Imagine the drone starts at state s_1. Over the next few moments, it experiences the following trajectory:

s_1 \xrightarrow{r_1=3} s_2 \xrightarrow{r_2=-1} s_3 \xrightarrow{r_3=5} s_4 \xrightarrow{r_4=2} s_5

Assume s_5 is terminal.

Now imagine you are standing at time t = 1, just after leaving s_1. You want to estimate how good s_1 was. Monte Carlo would say:

G_1 = 3 + \gamma(-1) + \gamma^2(5) + \gamma^3(2)

and only after reaching s_5 would it update:

V(s_1) \leftarrow V(s_1) + \alpha (G_1 - V(s_1))
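To make this concrete, pick an illustrative discount factor of 0.9 (the value is arbitrary; it is only here so we can do the arithmetic). The Monte Carlo return from s_1 then works out to

G_1 = 3 + 0.9(-1) + 0.9^2(5) + 0.9^3(2) = 3 - 0.9 + 4.05 + 1.458 = 7.608,

and that single number is the target the update above pulls V(s_1) toward.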

One-step TD would instead say:

\text{Target}_1^{(1)} = 3 + \gamma V(s_2).

It peeks one step ahead, then bootstraps. But now let’s try something different. Suppose we look two steps ahead before bootstrapping. From s_1, we observe:

r_1 = 3, \quad r_2 = -1,

and then instead of waiting for the whole episode, we stop at s_3 and bootstrap from there. That gives us the 2-step return:

G_1^{(2)} = 3 + \gamma(-1) + \gamma^2 V(s_3).
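To see the bootstrap at work numerically, keep the illustrative discount of 0.9 and suppose our current estimate happens to be V(s_3) = 4 (a made-up number, chosen only so there is something to bootstrap from). Then

G_1^{(2)} = 3 + 0.9(-1) + 0.9^2(4) = 3 - 0.9 + 3.24 = 5.34.

Two of those terms are real, observed rewards; only the last one is a guess.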

Notice what happened. We used two real rewards, then trusted our estimate for the rest. If we go three steps before bootstrapping, we get:

G_1^{(3)} = 3 + \gamma(-1) + \gamma^2(5) + \gamma^3 V(s_4).

And if we keep going until the episode ends, we get:

G_1^{(4)} = 3 + \gamma(-1) + \gamma^2(5) + \gamma^3(2),

which is exactly the Monte Carlo return. One-step TD is the 1-step return:

G_1^{(1)} = 3 + \gamma V(s_2)

Monte Carlo is the full n-step return, where n equals the remaining episode length. And in between? There is a whole spectrum of possibilities:

G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n}).
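Since the general formula is just a truncated, discounted sum of real rewards plus one bootstrapped term, it translates almost directly into code. Here is a minimal sketch in Python, assuming the episode has already been recorded as a list of rewards (rewards[t] being the reward received on leaving the t-th visited state, zero-indexed) and a list V of current value estimates for those states; the function name and these conventions are illustrative, not taken from any library.

```python
def n_step_return(rewards, V, t, n, gamma):
    """Illustrative n-step return G_t^(n) for one recorded episode.

    rewards[k]: reward received on leaving the k-th visited state.
    V[k]: current value estimate of the k-th visited state.
    The state reached after the last reward is terminal (value 0).
    """
    T = len(rewards)
    steps = min(n, T - t)                   # truncate n at the end of the episode
    G = sum(gamma**k * rewards[t + k] for k in range(steps))   # real rewards
    if t + steps < T:                       # stopped early: bootstrap from V(s_{t+n})
        G += gamma**steps * V[t + steps]
    return G


# The Grid City episode above; V(s_3) = 4 is the same made-up estimate as before
rewards = [3.0, -1.0, 5.0, 2.0]
V = [0.0, 0.0, 4.0, 0.0]
print(n_step_return(rewards, V, t=0, n=2, gamma=0.9))   # ~5.34, the 2-step return for s_1
print(n_step_return(rewards, V, t=0, n=4, gamma=0.9))   # ~7.608, the Monte Carlo return
```

The first call reproduces the 5.34 from the worked example, the second the full Monte Carlo return; dialing n between the two moves smoothly along the spectrum.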

Now pause and reflect. Each choice of n changes the personality of the learner.

If n = 1, we bootstrap quickly. Low variance. More bias, because we lean on our current estimate almost immediately.

If n is very large, we approach Monte Carlo. Unbiased. Higher variance.


We are no longer trapped between two extremes. We have a dial we can turn.

Now, how do we update? The n-step return G_t^{(n)} only becomes available once we have reached s_{t+n}, so the update for the state visited at time t can only be made n steps later. The learning signal travels backward with a delay of n steps.

Mathematically, the update is always

V(s_t) \leftarrow V(s_t) + \alpha \big(G_t^{(n)} - V(s_t)\big)

Conceptually, what changes is how long we allow reality to unfold before trusting our estimate.

If n = 1, we trust our current value function almost immediately. If n is large, we trust the actually observed rewards for longer, and if n spans the rest of the episode, we only update once the episode ends.
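To close the loop, here is an equally minimal sketch that applies this update to every state visited in a recorded episode, reusing the n_step_return helper from the sketch above. It is the offline, end-of-episode form; an online learner would perform the update for the state visited at time t only n steps later, which is exactly the delay described earlier.

```python
def n_step_td_episode(rewards, V, n, gamma, alpha):
    """One sweep of the n-step update over a recorded episode (offline sketch).

    Applies V(s_t) <- V(s_t) + alpha * (G_t^(n) - V(s_t)) to every visited
    state. Online, the update for s_t would only happen once s_{t+n} (or the
    terminal state) has been observed -- an n-step delay.
    """
    for t in range(len(rewards)):
        G = n_step_return(rewards, V, t, n, gamma)   # helper from the earlier sketch
        V[t] += alpha * (G - V[t])
    return V


# Same illustrative episode, with made-up gamma, alpha, and initial estimates
rewards = [3.0, -1.0, 5.0, 2.0]
V = [0.0, 0.0, 4.0, 0.0]
print(n_step_td_episode(rewards, V, n=2, gamma=0.9, alpha=0.5))
```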

Let’s keep it short this time. We will revisit the same concept in “Generalized Advantage Estimation (GAE)”, where we will derive a more general way of balancing the one-step TD bootstrap against the full Monte Carlo return.