Reinforcement Learning: Tile Coding to Neural Network
How Neural Networks Replace Q-Tables in Continuous State Space Reinforcement Learning.
We ended the previous discussion standing at a very interesting place.
We had taken a continuous world (angles, velocities, positions) and, instead of storing an impossibly infinite table, we built structure. First rough bins. Then overlapping tilings. Then a feature vector $\phi(s)$.
And finally we arrived at something that looks deceptively simple:

$$V(s) = \mathbf{w}^\top \phi(s)$$
At that moment, something profound happened. The table disappeared. The value of a state was no longer “stored.” It was computed. Now let’s continue that story. Imagine again the pole balancing robot. But this time, instead of 1 or 2 state variables, suppose we include:
- pole angle
- pole angular velocity
- cart position
- cart velocity
- maybe wind force
- maybe friction variation
Now we are easily in 5 or 6 dimensions.
You already saw what happens to tile coding in higher dimensions. With $n$ bins per dimension and $d$ state dimensions, each tiling needs

$$n^d$$

tiles.
Even if each update only touches a few weights (one per tiling), the total number of parameters explodes. Memory becomes painful. Worse, most tiles are never visited. So let’s pause and ask a different question.
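The arithmetic behind that explosion is easy to check. A quick sketch (the bin and tiling counts here are illustrative assumptions):

```python
# Tiles per tiling with n bins per dimension in d dimensions: n**d.
# With several overlapping tilings, multiply again by the tiling count.
bins_per_dim = 10   # assumed resolution per dimension
num_tilings = 8     # assumed number of overlapping tilings

for d in (2, 4, 6):
    total = num_tilings * bins_per_dim ** d
    print(f"{d} dims -> {total:,} weights")
```

With these numbers, two dimensions need 800 weights, but six dimensions already need 8,000,000.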
In tile coding, what were we really doing?
We were defining features manually. For a given state $s$, we constructed a vector:

$$\phi(s) = (0, 0, 1, 0, \dots, 1, \dots, 0)$$

mostly zeros, a few ones. Then we said:

$$V(s) = \mathbf{w}^\top \phi(s)$$
This is a linear model in disguise. Now imagine we do something radical. Instead of manually defining tiles, what if we let the system learn the features itself?
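To make the “linear model in disguise” concrete, here is a minimal tile-coding sketch for a one-dimensional state in $[0, 1)$. The tiling sizes and offsets are assumptions for illustration, not the exact scheme from earlier:

```python
import numpy as np

# Assumed setup: 2 tilings over a 1-D state in [0, 1), 10 tiles each.
NUM_TILINGS, TILES_PER_TILING = 2, 10

def phi(s):
    """Binary feature vector: one active tile per tiling (mostly zeros, a few ones)."""
    features = np.zeros(NUM_TILINGS * TILES_PER_TILING)
    for t in range(NUM_TILINGS):
        offset = t / (NUM_TILINGS * TILES_PER_TILING)    # shift each tiling slightly
        idx = int((s + offset) * TILES_PER_TILING) % TILES_PER_TILING
        features[t * TILES_PER_TILING + idx] = 1.0
    return features

w = np.zeros(NUM_TILINGS * TILES_PER_TILING)

def V(s):
    return w @ phi(s)   # a linear model in disguise
```

Every state activates exactly one tile per tiling, and the value is just a dot product with the weight vector.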
That is the bridge to neural networks.
From Manual Features to Learned Features
Forget reinforcement learning for a moment. Suppose I give you a simple regression task. You observe pairs:

$$(x_i, y_i)$$

And you want to predict $y$ from $x$. If you use a linear model, you assume:

$$\hat{y} = w x + b$$
That works if the relationship is linear. But suppose the true relationship is curved. Maybe:

$$y = a x^2 + b x + c$$

A straight line won’t capture it well. In classical machine learning, you would manually create nonlinear features:

$$\phi(x) = (x, x^2, x^3)$$

Then you would again compute:

$$\hat{y} = w_1 x + w_2 x^2 + w_3 x^3 + b$$
Do you see the pattern? First we design features. Then we do linear learning on top.
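A tiny regression sketch of that pattern (the quadratic target is an assumed toy example): design the features by hand, then fit a linear model on top of them.

```python
import numpy as np

# Toy curved relationship, assumed for illustration: y = 2x^2 + 1.
xs = np.linspace(-1, 1, 50)
ys = 2 * xs**2 + 1

# Manually designed nonlinear features: [1, x, x^2].
X = np.stack([np.ones_like(xs), xs, xs**2], axis=1)

# Linear learning on top: least-squares fit of w in y ≈ X w.
w, *_ = np.linalg.lstsq(X, ys, rcond=None)
print(np.round(w, 3))   # recovers roughly [1, 0, 2]
```

The “learning” is still linear; all the nonlinearity lives in the hand-designed features.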
Tile coding is exactly the same idea. We designed binary indicator features, then learned a linear combination. Now imagine instead of hard-coded tiles, we define:

$$\phi(s) = f_\theta(s)$$

where $f_\theta$ is a small neural network.
Now the value becomes:

$$V(s) = \mathbf{w}^\top f_\theta(s)$$

But we can simplify even further. Why keep two sets of parameters? Why not just define:

$$V_\theta(s)$$
directly as the output of a neural network? That is the leap.
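As a sketch of what that leap looks like in code, here is a tiny one-hidden-layer value network $V_\theta(s)$. The layer sizes, initialization scale, and state layout are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny value network V_theta(s): one hidden layer, tanh nonlinearity.
W1 = rng.normal(scale=0.1, size=(16, 4))   # 4 state variables -> 16 hidden units
b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=16)
b2 = 0.0

def V(s):
    h = np.tanh(W1 @ s + b1)    # learned features, no hand-designed tiles
    return W2 @ h + b2          # still a linear readout on top of those features

s = np.array([0.05, -0.1, 0.0, 0.2])   # e.g. [angle, angular vel., position, vel.]
print(V(s))
```

Notice the structure is the same as before, a linear combination of features, except the features themselves are now produced by learnable weights.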
TD Learning with a Neural Network
Let’s construct this step by step in a reinforcement learning setting.
Assume again a transition from state $s$ to state $s'$ with reward $r$.
In tabular TD learning, we wrote:

$$\delta = r + \gamma V(s') - V(s)$$

and updated:

$$V(s) \leftarrow V(s) + \alpha \delta$$
With tile coding, we wrote:

$$V(s) = \mathbf{w}^\top \phi(s)$$

and updated:

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \delta\, \phi(s)$$
Now suppose:

$$V_\theta(s)$$

is a neural network. Then the TD error becomes:

$$\delta = r + \gamma V_\theta(s') - V_\theta(s)$$
But what do we update?
This is the moment where everything shifts. Until now, updates were easy. In the tabular case, you changed a single number. In tile coding, you nudged a few weights corresponding to active tiles. But now the value is produced by an entire network, layers of weights, nonlinearities, interactions we didn’t explicitly design. Here the real question is:
how do we push the network so that $V_\theta(s)$ moves closer to the target $r + \gamma V_\theta(s')$?
From TD Error to Gradient Descent
For a moment imagine we have a neural network trying to predict house prices. It outputs $\hat{y}$. We observe the true price $y$. The natural thing to do is measure the squared error:

$$L = \tfrac{1}{2}(y - \hat{y})^2$$
Then we compute the gradient and adjust parameters:

$$\theta \leftarrow \theta - \alpha \nabla_\theta L$$
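One such step, written out for the simplest possible scalar model (all numbers here are toy assumptions):

```python
# One gradient-descent step on L = 0.5 * (y - y_hat)^2 with y_hat = theta * x.
theta, alpha = 0.0, 0.1
x, y = 2.0, 10.0                  # one (feature, true price) pair, toy values

y_hat = theta * x
grad = -(y - y_hat) * x           # dL/dtheta by the chain rule
theta = theta - alpha * grad      # theta moves from 0.0 to 2.0
print(theta)
```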
This is just gradient descent. Now come back to our reinforcement learning world. What is the “true” target? We don’t have a full Monte Carlo return. We have something bootstrapped:

$$y = r + \gamma V_\theta(s')$$

So define a temporary loss:

$$L(\theta) = \tfrac{1}{2}\big(r + \gamma V_\theta(s') - V_\theta(s)\big)^2$$

Look carefully at that expression. It is just the squared TD error:

$$L(\theta) = \tfrac{1}{2}\,\delta^2$$
So what should we update? Exactly what we always update in neural networks: the parameters $\theta$.
Let’s compute the gradient now. By the chain rule,

$$\nabla_\theta L = \delta \, \nabla_\theta \delta$$
Now expand $\delta$:

$$\delta = r + \gamma V_\theta(s') - V_\theta(s)$$

If we fully differentiated this expression, gradients would flow through both $V_\theta(s')$ and $V_\theta(s)$. But in standard TD learning, we make a simplifying choice: we treat the target

$$r + \gamma V_\theta(s')$$

as if it were a constant. We “freeze” it during differentiation. That means the only term that contributes a gradient is $-V_\theta(s)$.
So the gradient becomes

$$\nabla_\theta L = -\delta \, \nabla_\theta V_\theta(s)$$
Plug this back into the update rule:

$$\theta \leftarrow \theta - \alpha \nabla_\theta L$$

which simplifies beautifully to

$$\theta \leftarrow \theta + \alpha\, \delta\, \nabla_\theta V_\theta(s)$$
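That final update rule can be implemented directly. Below is a minimal semi-gradient TD(0) step for a one-hidden-layer network, with the backward pass written by hand so the role of $\nabla_\theta V_\theta(s)$ stays visible. Layer sizes, states, and rewards are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, gamma = 0.01, 0.99

# One-hidden-layer value network V_theta(s); sizes are illustrative assumptions.
W1 = rng.normal(scale=0.1, size=(16, 4)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=16);      b2 = 0.0

def forward(s):
    h = np.tanh(W1 @ s + b1)
    return W2 @ h + b2, h

def td_step(s, r, s_next):
    global W1, b1, W2, b2
    v, h = forward(s)
    v_next, _ = forward(s_next)          # target term: treated as a constant
    delta = r + gamma * v_next - v       # TD error
    # Gradient of V_theta(s) w.r.t. each parameter (chain rule by hand):
    dW2 = h
    db2 = 1.0
    dh = W2 * (1 - h**2)                 # backprop through tanh
    dW1 = np.outer(dh, s)
    db1 = dh
    # Semi-gradient update: theta <- theta + alpha * delta * grad V_theta(s)
    W2 = W2 + alpha * delta * dW2
    b2 = b2 + alpha * delta * db2
    W1 = W1 + alpha * delta * dW1
    b1 = b1 + alpha * delta * db1
    return delta

s, s_next = np.array([0.05, -0.1, 0.0, 0.2]), np.array([0.06, -0.05, 0.01, 0.18])
d = td_step(s, r=1.0, s_next=s_next)
```

Each call nudges every parameter in the direction that moves $V_\theta(s)$ toward $r + \gamma V_\theta(s')$, scaled by the TD error.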
And now something clicks. Compare this with tile coding. With tile coding we had

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha\, \delta\, \phi(s)$$

for a linear function $V(s) = \mathbf{w}^\top \phi(s)$, the gradient with respect to $\mathbf{w}$ is exactly $\phi(s)$.
So the neural network update is not a new idea at all. It is the exact same TD update except now the “feature vector” is no longer manually designed. It is replaced by

$$\nabla_\theta V_\theta(s)$$
The gradient itself plays the role of features.
Point to Note
But an important thing to note here is this. In supervised learning, the target never changes: when we visit the same sample again, $y$ is exactly as it was. In TD learning, the “target” is another prediction:

$$y = r + \gamma V_\theta(s')$$

When we visit the same state $s$ again, the target will generally not be the same, because $\theta$ has changed in the meantime. So even though we treat it as fixed during a single gradient step, it is not fundamentally fixed.
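The moving-target effect is easy to demonstrate with a one-parameter model $V_\theta(s) = \theta s$ (all numbers are toy assumptions): compute the target, take one semi-gradient step, then recompute the “same” target.

```python
gamma, alpha = 0.99, 0.1
theta = 0.5                             # a one-parameter "network": V(s) = theta * s

def V(s):
    return theta * s

s, r, s_next = 1.0, 1.0, 0.8            # one fixed transition, toy values

target_before = r + gamma * V(s_next)   # bootstrapped target
delta = target_before - V(s)            # TD error
theta = theta + alpha * delta * s       # semi-gradient step (grad V = s)
target_after = r + gamma * V(s_next)    # same transition, recomputed target

print(target_before, target_after)      # the "frozen" target has moved
```

Even though the transition is identical, the second target differs, because updating $\theta$ changed $V_\theta(s')$ too.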
Keep these ideas in mind as we move forward. What we’ve derived so far isn’t perfect, but as the series progresses, everything will start to make more sense.