Reinforcement Learning: Deep SARSA and Q-Learning
The Evolution of Reinforcement Learning: From Tables to Deep Q-Networks (DQN).
We have walked a long road to reach this point.
First, we had a table. Every state–action pair had its own little drawer in memory, and we updated one number at a time. Then the table became too large, and we replaced it with tiles. The value was no longer stored; it was computed:

$$\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s),$$

a weighted sum over the features (tiles) active in state $s$.
Then even the tiles became too rigid. Instead of hand-designed features, we let a neural network learn them. The update changed from nudging a few tile weights to nudging an entire parameter vector through gradients:

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha\,\delta_t\,\nabla_{\mathbf{w}}\,\hat{v}(S_t, \mathbf{w}).$$
At each step, the pattern remained the same:

$$\text{new estimate} \;\leftarrow\; \text{old estimate} \;+\; \text{step size} \times (\text{target} - \text{old estimate}).$$

Only the representation of that estimate changed.
Until now, SARSA and Q-learning were described either in tabular form or with simple function approximators. But modern reinforcement learning systems, such as game-playing agents, robotic controllers, and recommendation systems, do not operate in tiny grids. They see images, raw pixels, high-dimensional sensor streams. A table cannot hold that. Even tile coding collapses at that scale.
So what happens when we plug a deep neural network into SARSA or Q-learning?
Bridging the Gap
Up to now, most of our story was built around the state-value function $v(s)$,
and that means we were always asking a very particular kind of question:
“How good is this state?”
That question is useful, but it quietly assumes something else is already handling the choice of action. A state-value tells us the value of being at an intersection, but it does not by itself tell us whether we should go left, right, up, or down. It gives a single number for the whole situation.
Imagine our drone is hovering above one city block. If we are told that the value $v(s)$ of that block is a high number,
we know the situation is fairly promising. But promising under what behavior? If the drone moves north, maybe it reaches the destination quickly. If it moves south, maybe it flies into a windy corridor and loses time. A single scalar for the whole state hides those differences.
That is exactly why control methods like SARSA and Q-learning do not stop at state values. They move one level deeper and ask a sharper question:
“What is the value of taking a particular action in this state?”
Now the object is no longer $v(s)$, but $q(s, a)$.
This looks like a tiny change in notation, but conceptually it is a major upgrade. Instead of one number per state, we now want one number for each action available in that state.
If there are four actions, then the state no longer maps to one scalar. It maps to four scalars:

$$q(s, \text{UP}), \quad q(s, \text{DOWN}), \quad q(s, \text{LEFT}), \quad q(s, \text{RIGHT}).$$
You can think of this as replacing one opinion about the state with four separate “what if” estimates:
“What if I go up from here?”
“What if I go down from here?”
“What if I go left?”
“What if I go right?”
In the tabular world, this was easy to picture. Each state–action pair had its own entry in the table. In the function approximation world, the idea is still the same, but now those values are produced instead of stored.
Earlier, with neural value prediction, we wrote $\hat{v}(s, \mathbf{w})$, which means a network takes a state as input and emits one scalar. For SARSA or Q-learning, we simply ask the network to emit more than one scalar. The input is still the state, but the output becomes a collection of action-values $\hat{q}(s, a, \mathbf{w})$, one for each action $a$.

If the action set is finite, the most natural design is to let the network take in the state $s$ and output a vector:

$$\hat{q}(s, \cdot\,, \mathbf{w}) = \big(\hat{q}(s, a_1, \mathbf{w}), \ldots, \hat{q}(s, a_k, \mathbf{w})\big).$$
So in one forward pass, the network evaluates every action at once.
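As a rough sketch of this design (assuming PyTorch; the `QNetwork` name, layer sizes, and the 4096-dimensional flattened state are illustrative, not part of the original derivation):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action in a single forward pass."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one output unit per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, state_dim) -> Q-values: (batch, num_actions)
        return self.net(state)

# One forward pass scores every action at once.
q_net = QNetwork(state_dim=4096, num_actions=4)
q_values = q_net(torch.randn(1, 4096))  # e.g. [[q_up, q_down, q_left, q_right]]
```

The only structural choice that matters here is the last layer: one output unit per action is what lets a single forward pass evaluate the whole action set.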
Deep SARSA: The Same Idea, Larger Capacity
Imagine again our Grid City drone. But now, instead of representing a state as a number like 13, suppose the drone sees a grayscale image of the city from above. That image is the state. Every state is now a vector of 4096 numbers (for example, a 64 × 64 image, flattened). A table indexed by images is impossible. Tile coding over pixels is absurd.
So we define a neural network

$$\hat{q}(\cdot\,, \cdot\,, \mathbf{w}) : \mathbb{R}^{4096} \to \mathbb{R}^{4}$$

that takes as input the image of the city and outputs a value for each possible action. Suppose there are four actions: UP, DOWN, LEFT, RIGHT. The network outputs

$$\hat{q}(s, \text{UP}, \mathbf{w}), \quad \hat{q}(s, \text{DOWN}, \mathbf{w}), \quad \hat{q}(s, \text{LEFT}, \mathbf{w}), \quad \hat{q}(s, \text{RIGHT}, \mathbf{w}),$$
all computed in one forward pass. Now we replay what SARSA used to do in the tabular world.
At time step $t$, the drone is in state $S_t$. It chooses action $A_t$ using $\varepsilon$-greedy selection on the network’s outputs. It receives reward $R_{t+1}$, lands in $S_{t+1}$, and selects the next action $A_{t+1}$, again using the $\varepsilon$-greedy policy.
In tabular SARSA, the update was:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big].$$
Now replace the table entry $Q(S_t, A_t)$ with the network output $\hat{q}(S_t, A_t, \mathbf{w})$. Define the TD error:

$$\delta_t = R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}).$$
Just like before.
But we cannot directly overwrite a number in a table. We must adjust the parameters $\mathbf{w}$ so that the network’s output for $(S_t, A_t)$ moves closer to the target. So we define a loss:

$$L(\mathbf{w}) = \tfrac{1}{2}\,\delta_t^2 = \tfrac{1}{2}\big( R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}) \big)^2.$$
And then we perform gradient descent on this loss, treating the target as a constant:

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha\,\delta_t\,\nabla_{\mathbf{w}}\,\hat{q}(S_t, A_t, \mathbf{w}).$$
This is Deep SARSA.
Conceptually, nothing changed. It is still on-policy. It still bootstraps from the action actually taken. It still learns the value of the $\varepsilon$-greedy policy it follows. The only difference is that the Q-function is now represented by a deep neural network instead of a table.
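A minimal sketch of one Deep SARSA update, assuming PyTorch and the hypothetical `QNetwork` above; the function names, epsilon, and gamma values are illustrative. The `torch.no_grad()` block treats the target as a constant, which is exactly the gradient step written above:

```python
import torch
import torch.nn.functional as F

def epsilon_greedy(q_net, state, num_actions, eps=0.1):
    """Pick a random action with probability eps, otherwise the greedy action."""
    if torch.rand(1).item() < eps:
        return torch.randint(num_actions, (1,)).item()
    with torch.no_grad():
        return q_net(state).argmax(dim=-1).item()

def deep_sarsa_step(q_net, optimizer, s, a, r, s_next, a_next, done, gamma=0.99):
    """One Deep SARSA update on a single transition (S, A, R, S', A').

    s and s_next are tensors of shape (1, state_dim); a and a_next are ints;
    r and done are floats (done = 1.0 at episode end).
    """
    q_sa = q_net(s)[0, a]                    # q̂(S_t, A_t, w)
    with torch.no_grad():                    # target treated as a constant
        q_next = q_net(s_next)[0, a_next]    # q̂(S_{t+1}, A_{t+1}, w): the action actually taken
        target = r + gamma * q_next * (1.0 - done)
    loss = F.mse_loss(q_sa, target)          # squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```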
But something subtle and dangerous has entered the story.
Remember earlier when we discussed how TD targets are already “moving targets”? Even in simple neural TD learning, the target

$$R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w})$$

depends on $\mathbf{w}$, the same parameters we are updating.
In tabular SARSA, this coupling was mild. Updating one entry in the table did not drastically affect others. But in a deep network, changing $\mathbf{w}$ slightly changes all Q-values for all states, because they share parameters.
Deep Q-Learning and the Moving Target Problem
Now imagine what happens in Deep Q-learning. Q-learning’s target is:

$$Y_t = R_{t+1} + \gamma\, \max_{a'} \hat{q}(S_{t+1}, a', \mathbf{w}).$$
So the TD error becomes:

$$\delta_t = R_{t+1} + \gamma\, \max_{a'} \hat{q}(S_{t+1}, a', \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}),$$
and we again minimize squared TD error. This is Deep Q-Learning in its most naive form.
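As a hedged sketch reusing the hypothetical names from the Deep SARSA step above, the only change is how the bootstrap value is computed:

```python
import torch
import torch.nn.functional as F

def naive_deep_q_step(q_net, optimizer, s, a, r, s_next, done, gamma=0.99):
    """Like deep_sarsa_step, but bootstraps from the best next action, not the one taken."""
    q_sa = q_net(s)[0, a]                    # q̂(S_t, A_t, w)
    with torch.no_grad():
        q_next_max = q_net(s_next)[0].max()  # max_a' q̂(S_{t+1}, a', w): same network, same w
        target = r + gamma * q_next_max * (1.0 - done)
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```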
At first glance, it looks almost identical to Deep SARSA. Just replace $\hat{q}(S_{t+1}, A_{t+1}, \mathbf{w})$ with $\max_{a'} \hat{q}(S_{t+1}, a', \mathbf{w})$. But now notice something unsettling.
The network parameters $\mathbf{w}$ appear in three places:
- Inside the prediction $\hat{q}(S_t, A_t, \mathbf{w})$
- Inside the bootstrapped value $\hat{q}(S_{t+1}, a', \mathbf{w})$
- Inside the max operator, since which action attains the max also depends on $\mathbf{w}$
So the network is chasing a target that it itself is producing, while the target shifts every time the parameters shift. Imagine trying to shoot at a target that moves every time you adjust your aim, and, worse, it moves because you adjusted your aim.
In small problems, sometimes this works. But with large nonlinear networks, training becomes unstable. Q-values can explode. Learning can diverge. To understand why, imagine a simple numeric example. Suppose the network’s initial estimates for the two actions in the next state are:

$$\hat{q}(s', a_1, \mathbf{w}) = 5 \qquad \hat{q}(s', a_2, \mathbf{w}) = 6.$$
So the max is 6.
The target becomes:

$$r + \gamma \cdot 6.$$
Now suppose a small gradient update slightly increases both values to 5.5 and 6.5. The target increases too. The update chases it upward again. If the network systematically overestimates values due to noise, the max operator amplifies that bias.
This is known as overestimation bias in Q-learning.
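A tiny simulation makes the bias concrete (the values 5 and 6 mirror the example above; everything else is illustrative): even when each individual estimate is unbiased, the max of noisy estimates is systematically too high.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.array([5.0, 6.0])                     # true action values; the true max is 6
noise = rng.normal(0.0, 1.0, size=(100_000, 2))   # zero-mean estimation noise
noisy_q = true_q + noise

print(noisy_q.mean(axis=0))                       # each estimate is unbiased: ~[5.0, 6.0]
print(noisy_q.max(axis=1).mean())                 # the max is biased upward: noticeably above 6.0
```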
Stabilizing Deep Q-Learning: Replay and Target Networks
This is why modern Deep Q-Learning introduced two stabilizing ideas.
- Experience replay.
Instead of updating the network immediately using the latest transition, we store transitions

$$(S_t, A_t, R_{t+1}, S_{t+1})$$
in a replay buffer. During training, we sample random mini-batches from this buffer.
Why does this help?
Because consecutive transitions are highly correlated. If the drone is flying north for ten steps, those ten states are similar. Training on them sequentially makes the network chase local correlations. Random sampling breaks this correlation and makes training behave more like standard supervised learning.
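A minimal replay buffer sketch (Python standard library only; the class name and capacity are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall out first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)  # random draws break temporal correlation
        s, a, r, s_next, done = zip(*batch)
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```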
- Target networks.
We create a second network with parameters $\mathbf{w}^-$, initially copied from $\mathbf{w}$. When computing the target

$$R_{t+1} + \gamma\, \max_{a'} \hat{q}(S_{t+1}, a', \mathbf{w}^-),$$

we use the frozen target network parameters $\mathbf{w}^-$, not the current parameters $\mathbf{w}$.
The online network $\mathbf{w}$ is updated every step. The target network is updated only occasionally, say every few thousand steps, by copying $\mathbf{w}^- \leftarrow \mathbf{w}$.
Now the target moves slowly. It is no longer chasing itself at every tiny gradient step.
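In code, the target network is just a periodically refreshed copy of the online network. A hedged sketch, assuming PyTorch and the hypothetical `QNetwork` above:

```python
import copy
import torch.nn as nn

def make_target_net(q_net: nn.Module) -> nn.Module:
    """Create w⁻ as an exact copy of the online parameters w."""
    return copy.deepcopy(q_net)

def sync_target(q_net: nn.Module, target_net: nn.Module) -> None:
    """Called only every few thousand steps: w⁻ ← w."""
    target_net.load_state_dict(q_net.state_dict())
```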
With both ideas included, the Deep Q-Learning update becomes a gradient step on the loss

$$L(\mathbf{w}) = \mathbb{E}_{(s,\,a,\,r,\,s') \sim \mathcal{D}} \Big[ \big( r + \gamma\, \max_{a'} \hat{q}(s', a', \mathbf{w}^-) - \hat{q}(s, a, \mathbf{w}) \big)^2 \Big],$$

where $\mathcal{D}$ is the replay buffer.
This combination of a deep network, a replay buffer, and a target network is what made Deep Q-Networks (DQN) capable of learning directly from high-dimensional inputs like raw pixels.
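Putting the two ideas together, one mini-batch update might look like the following sketch (assuming PyTorch, the hypothetical `QNetwork`, `ReplayBuffer`, and target-network helpers above, and states stored as flat float arrays; batch size and gamma are illustrative):

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, buffer, batch_size=32, gamma=0.99):
    """One mini-batch update: sample from the replay buffer, bootstrap from the frozen target net."""
    s, a, r, s_next, done = buffer.sample(batch_size)
    s      = torch.stack([torch.as_tensor(x, dtype=torch.float32) for x in s])
    s_next = torch.stack([torch.as_tensor(x, dtype=torch.float32) for x in s_next])
    a      = torch.as_tensor(a, dtype=torch.int64)
    r      = torch.as_tensor(r, dtype=torch.float32)
    done   = torch.as_tensor(done, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # q̂(s, a, w) for each sampled pair
    with torch.no_grad():
        q_next_max = target_net(s_next).max(dim=1).values  # max_a' q̂(s', a', w⁻)
        target = r + gamma * q_next_max * (1.0 - done)     # target uses frozen parameters w⁻

    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A real training loop would interleave environment steps, buffer pushes, epsilon decay, and periodic target syncs; this sketch shows only the single update.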
Deep SARSA can also be implemented with replay and target networks, though historically it was DQN, an off-policy method, that popularized these stabilizers.
The Unchanging Backbone
Step back now and look at the pattern across everything we have done.
Monte Carlo: the target is the full return $G_t$.

SARSA: the target is $R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1})$.

Q-learning: the target is $R_{t+1} + \gamma\, \max_{a'} Q(S_{t+1}, a')$.

Deep SARSA: the target is $R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w})$. Same SARSA target, but $\hat{q}$ is a neural network.

Deep Q-learning: the target is $R_{t+1} + \gamma\, \max_{a'} \hat{q}(S_{t+1}, a', \mathbf{w}^-)$. Same Q-learning target, but stabilized with replay and target networks.
Across all of them, the backbone remains:

$$\text{new estimate} \;\leftarrow\; \text{old estimate} \;+\; \text{step size} \times (\text{target} - \text{old estimate}).$$
Only two ingredients ever change:
- What is the target?
- How is the value function represented?
When we replaced tables with tiles, and tiles with neural networks, we did not change the soul of reinforcement learning. We only changed its capacity.
But now a new tension emerges.
Deep Q-learning is off-policy and value-based. Deep SARSA is on-policy and value-based. Both estimate Q-values and derive policies from them. But what if we do not want to infer a policy indirectly from Q-values? What if we want to learn the policy directly?
That question opens the door to policy gradients, actor–critic methods, and eventually to algorithms like PPO and SAC where value functions and policies coexist inside deep networks in a more delicate balance.
And just like before, the core ideas will look surprisingly familiar once we watch one update at a time. Stay tuned!