Reinforcement Learning: Continuous State Space
From State Aggregation to Tile Coding: Handling Continuous State Spaces in Reinforcement Learning.
Imagine you are building a robot that balances a pole on its hand. The pole can lean slightly left, slightly right, or almost fall. The robot’s hand can move left or right with different speeds. Now pause for a second and think about the “state” of this world.
In a simple grid world, a state might just be a number like 1, 2, 3, or 4. But here, the pole angle could be $0.1$ radians, $0.15$ radians, $0.157$ radians… The velocity could be $0.2$, $0.25$, $0.253$… There are infinitely many possibilities.
This is what we call a continuous state space.
In small textbook examples, we happily write something like
$$V(s = 3) = 0.72$$
or
$$Q(s = 3, a = \text{left}) = 1.4$$
as if we can store a separate value for every state. But what does that mean when the angle of the pole can be $0.31827\ldots$ radians? Are we going to store a value for every possible decimal?
Let’s try to be naive for a moment.
Suppose the angle of the pole ranges from $-1$ to $1$ radians. We decide to store values with precision up to three decimal places. That means the possible angles are
$$-1.000,\ -0.999,\ -0.998,\ \ldots,\ 0.999,\ 1.000$$
That’s 2001 distinct angle values already. Now add angular velocity, also ranging from $-1$ to $1$ with the same precision. That’s another 2001 possibilities.
The total number of states becomes
$$2001 \times 2001 = 4{,}004{,}001$$
And this is just two variables. Real systems have more. Suddenly, our simple table-based idea collapses.
So the question becomes: if we cannot store a separate value for every continuous state, what can we do?
State Aggregation: Grouping the Continuous World
Let’s step back and think intuitively.
Imagine you are a teacher observing students’ heights. Heights are continuous. No two students are exactly the same height. But if someone asks, “How many short students are there?” you don’t say:
“There is one student who is 160.013 cm, one who is 160.027 cm…” Instead, you group them. You might say:
150–160 cm → short
160–170 cm → medium
170–180 cm → tall
You just discretized a continuous variable into bins. You lost some precision, but you gained something more important: simplicity.
This idea is the heart of state aggregation.
Return to the pole balancing example. Suppose the angle $\theta$ lies between $-1$ and $1$. Instead of remembering a value for every precise $\theta$, we divide the range into intervals:
$$[-1, -0.6),\quad [-0.6, -0.2),\quad [-0.2, 0.2),\quad [0.2, 0.6),\quad [0.6, 1]$$
Now, instead of asking:
“What is $V(\theta)$?”
we ask:
“Which interval does $\theta$ belong to?”
Suppose it falls in $[-0.2, 0.2)$. Then we treat all angles in that interval as if they are the same state.
Let’s make it concrete. Assume we divide the angle range into 5 bins:
$$B_1 = [-1, -0.6),\quad B_2 = [-0.6, -0.2),\quad B_3 = [-0.2, 0.2),\quad B_4 = [0.2, 0.6),\quad B_5 = [0.6, 1]$$
Suppose during learning we observe the angle $\theta = 0.05$ together with some update target $G$.
Instead of writing
$$V(0.05) \leftarrow V(0.05) + \alpha\big[G - V(0.05)\big]$$
we now write
$$V(B_3) \leftarrow V(B_3) + \alpha\big[G - V(B_3)\big]$$
We are updating the value of the entire bin $B_3 = [-0.2, 0.2)$.
What just happened?
We made a bold assumption: all states inside $B_3 = [-0.2, 0.2)$ are approximately equivalent. That is the approximation. And that approximation gives us generalization.
This is state aggregation: grouping continuous states into discrete clusters and learning a single value per group.
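Here is a minimal Python sketch of this idea, assuming the 5 bins over $[-1, 1]$ from the example above; the target value and step size are made-up illustrative numbers.

```python
import numpy as np

# One learned value per bin: 5 bins covering the angle range [-1, 1].
NUM_BINS = 5
THETA_MIN, THETA_MAX = -1.0, 1.0
values = np.zeros(NUM_BINS)

def bin_index(theta):
    """Map a continuous angle to the index of the bin that contains it."""
    frac = (theta - THETA_MIN) / (THETA_MAX - THETA_MIN)
    return min(int(frac * NUM_BINS), NUM_BINS - 1)  # clip so theta = 1.0 stays in the last bin

def update(theta, target, alpha=0.1):
    """Move the value of theta's bin toward the observed target."""
    i = bin_index(theta)
    values[i] += alpha * (target - values[i])

update(theta=0.05, target=1.0)             # updates the whole bin [-0.2, 0.2)
print(bin_index(0.199), bin_index(0.201))  # 2 and 3: near-identical angles land in different bins
```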
But something subtle happens here.
Suppose $\theta_1 = 0.199$ and $\theta_2 = 0.201$. They are only $0.002$ apart. Yet $\theta_1$ belongs to $B_3 = [-0.2, 0.2)$ and $\theta_2$ belongs to $B_4 = [0.2, 0.6)$. Two almost identical angles now belong to different bins and get completely different values.
That feels wrong.
Our bins introduced artificial boundaries.
If the optimal value function is smooth, we would prefer nearby states to have similar values. Hard bins break that smoothness.
So we ask: can we do better than rigid buckets?
Tile Coding: Overlapping Structure and Smooth Generalization
Imagine a long floor. Instead of dividing it into non-overlapping tiles like bathroom tiles, imagine laying multiple transparent grids on top of each other, each slightly shifted. This is the intuition behind tile coding.
We still discretize but we do it multiple times, with overlapping tilings.
Suppose $\theta \in [-1, 1]$. Instead of one partition into 5 bins, we create two tilings.
First tiling:
$$[-1, -0.6),\quad [-0.6, -0.2),\quad [-0.2, 0.2),\quad [0.2, 0.6),\quad [0.6, 1]$$
Second tiling, shifted by $0.1$:
$$[-0.9, -0.5),\quad [-0.5, -0.1),\quad [-0.1, 0.3),\quad [0.3, 0.7),\quad [0.7, 1.1)$$
Now consider $\theta = 0.18$.
In tiling 1, it falls into $[-0.2, 0.2)$.
In tiling 2, it falls into $[-0.1, 0.3)$.
We assign each tile a weight. The value of $\theta = 0.18$ is the sum of the weights of the active tiles.
Let
- $w_1$ = weight of the tile $[-0.2, 0.2)$ in tiling 1
- $w_2$ = weight of the tile $[-0.1, 0.3)$ in tiling 2
Then
$$\hat{V}(0.18) = w_1 + w_2$$
If we observe a TD error $\delta$, we update
$$w_1 \leftarrow w_1 + \alpha\,\delta, \qquad w_2 \leftarrow w_2 + \alpha\,\delta$$
(in practice the step size $\alpha$ is often divided by the number of tilings).
What does this accomplish?
Consider $\theta = 0.25$.
In tiling 1, it falls into $[0.2, 0.6)$.
In tiling 2, it still falls into $[-0.1, 0.3)$.
So it shares one tile (the tiling 2 tile) with $\theta = 0.18$.
That means $\theta = 0.18$ and $\theta = 0.25$ will have partially shared representations. Their values will be similar, but not identical.
We have achieved smooth generalization. With a few more tilings in place, three nearby values such as $0.18$, $0.21$, and $0.25$ can each end up with a distinct value, whereas with state aggregation one or two of them would be forced to share exactly the same value, depending on where the bin boundaries fall.
With more tilings, each shifted differently, the representation becomes more expressive. With enough tilings, we can approximate quite complex value functions.
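To see the overlap mechanically, here is a small sketch assuming the two tilings above (tile width $0.4$, second tiling shifted by $0.1$); the indexing scheme is just one reasonable choice.

```python
import math

# Two tilings of the angle range [-1, 1], each with tiles of width 0.4.
# Tiling 0 starts at -1.0; tiling 1 is shifted right by 0.1 and starts at -0.9.
# Edge handling at the very ends of the range is ignored in this sketch.
TILE_WIDTH = 0.4
TILING_STARTS = [-1.0, -0.9]

def active_tiles(theta):
    """Return one (tiling, tile_index) pair per tiling for the given angle."""
    return [(t, math.floor((theta - start) / TILE_WIDTH))
            for t, start in enumerate(TILING_STARTS)]

print(active_tiles(0.18))  # [(0, 2), (1, 2)]
print(active_tiles(0.25))  # [(0, 3), (1, 2)] -> shares the second tiling's tile with 0.18
```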
Mathematically, tile coding can be viewed as a linear function approximator. We define a feature vector $\mathbf{x}(s)$, where each component is 1 if the corresponding tile is active and 0 otherwise.
Then
$$\hat{V}(s) = \mathbf{w}^\top \mathbf{x}(s) = \sum_i w_i\, x_i(s)$$
For a given state, only a few components of $\mathbf{x}(s)$ are $1$, one per tiling. So updates are efficient. When we observe the transition $(S_t, R_{t+1}, S_{t+1})$,
the TD update becomes
$$\delta_t = R_{t+1} + \gamma\, \mathbf{w}^\top \mathbf{x}(S_{t+1}) - \mathbf{w}^\top \mathbf{x}(S_t)$$
and
$$\mathbf{w} \leftarrow \mathbf{w} + \alpha\, \delta_t\, \mathbf{x}(S_t)$$
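A minimal sketch of this semi-gradient TD(0) update with tile-coded binary features follows; the tilings, step size, discount factor, and sample transition are illustrative assumptions rather than values fixed by the text.

```python
import math
import numpy as np

NUM_TILINGS = 2
TILES_PER_TILING = 6                          # enough tiles to cover [-1, 1] in each tiling
TILE_WIDTH = 0.4
TILING_STARTS = [-1.0, -0.9]                  # second tiling shifted by 0.1; edge cases ignored

w = np.zeros(NUM_TILINGS * TILES_PER_TILING)  # one weight per tile

def features(theta):
    """x(s): exactly one component per tiling is 1, all others are 0."""
    x = np.zeros_like(w)
    for tiling, start in enumerate(TILING_STARTS):
        index = math.floor((theta - start) / TILE_WIDTH)
        x[tiling * TILES_PER_TILING + index] = 1.0
    return x

def value(theta):
    """V_hat(s) = w . x(s): the sum of the weights of the active tiles."""
    return float(w @ features(theta))

def td0_update(theta, reward, next_theta, alpha=0.1, gamma=0.99):
    """delta = R + gamma * V_hat(s') - V_hat(s);  w <- w + alpha * delta * x(s)."""
    x = features(theta)
    delta = reward + gamma * value(next_theta) - float(w @ x)
    w[:] += alpha * delta * x                 # only the active tiles' weights change

# One hypothetical transition: from angle 0.18, reward 1.0, to angle 0.25.
td0_update(0.18, reward=1.0, next_theta=0.25)
print(value(0.18), value(0.25))               # 0.25 moved too: it shares a tile with 0.18
```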
Notice something beautiful here.
In tabular learning, each state had its own independent parameter.
In state aggregation, each bin had its own parameter.
In tile coding, states share parameters. Learning about one state slightly influences nearby states.
As we increase the number of tilings, decrease tile width, or move to more advanced function approximators like neural networks, we are just refining this same idea: approximate the value function over a continuous space using shared parameters.
State aggregation is the first step: rough, bold grouping.
Tile coding is the next step: structured, overlapping, smoother approximation.
And from there, the path naturally leads to the broader world of function approximation, where the table disappears entirely, and what remains is a learned surface stretched over a continuous world.
The Computational Cost of Tile Coding
Now that we understand how tile coding helps us generalize smoothly over a continuous space, let us slow down and ask a practical question.
How expensive is this idea?
At first glance, tile coding feels lightweight. After all, for any given state, only a few tiles are active. If we use $t$ tilings, then exactly $t$ weights are involved in computing
$$\hat{V}(s) = \sum_{\text{active tiles } i} w_i$$
So computing the value of a single state costs roughly $t$ additions. Updating also touches only those same $t$ weights. That sounds beautifully efficient. But the hidden cost is not in the update. It is in how many parameters we must maintain. Suppose we have just one continuous variable, say the angle $\theta$.
We divide it into $10$ bins in a single tiling. That means one tiling contains $10$ tiles. If we use $8$ tilings, each shifted slightly, then the total number of parameters becomes
$$8 \times 10 = 80$$
That seems manageable. Now let’s add a second variable, angular velocity $\dot{\theta}$. We again divide it into $10$ bins. In two dimensions, each tiling is now a grid. If angle has $10$ bins and velocity has $10$ bins, then each tiling contains
$$10 \times 10 = 100$$
tiles.
With $8$ tilings, the total number of parameters becomes
$$8 \times 100 = 800$$
Now imagine three variables. Perhaps we also include cart position. Each variable has $10$ bins. Each tiling now has
$$10 \times 10 \times 10 = 1000$$
tiles.
With $8$ tilings, total parameters:
$$8 \times 1000 = 8000$$
See what is happening?
If we have $d$ continuous state variables, each divided into $b$ bins per tiling, then each tiling contains
$$b^d$$
tiles.
And with $t$ tilings, the total number of parameters becomes
$$t \cdot b^d$$
This exponential dependence on dimension is unavoidable. Let’s make it concrete. Suppose:
- $b = 10$ bins per dimension
- $d = 6$ state variables
- $t = 8$ tilings
Each tiling has
$$10^6 = 1{,}000{,}000$$
tiles.
Across 8 tilings:
$$8 \times 1{,}000{,}000 = 8{,}000{,}000$$
parameters.
The growth is explosive in $d$, not in $t$. So the time complexity for evaluating or updating a single state is
$$O(t)$$
But the memory complexity is
$$O(t \cdot b^d)$$
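As a quick sanity check of the $t \cdot b^d$ growth, here is a tiny sketch using the same illustrative numbers (10 bins per dimension, 8 tilings) across a few dimensions:

```python
def tile_coding_parameters(bins_per_dim: int, num_dims: int, num_tilings: int) -> int:
    """Total number of weights: num_tilings * bins_per_dim ** num_dims."""
    return num_tilings * bins_per_dim ** num_dims

for d in (1, 2, 3, 6):
    print(d, tile_coding_parameters(bins_per_dim=10, num_dims=d, num_tilings=8))
# 1 -> 80, 2 -> 800, 3 -> 8000, 6 -> 8000000: exponential in d, only linear in t
```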
This is why tile coding works beautifully for low-dimensional problems like CartPole. But as dimension grows, even tile coding begins to struggle.
And that tension, efficient updates but exponential growth in representation, is exactly what pushes us toward more compact function approximators like neural networks.
Appendix
In case you are wondering: how is tile coding a linear function approximator?
Imagine again our simple continuous state variable:
$$\theta \in [-1, 1]$$
Suppose we use two tilings, each dividing the interval into 5 bins, slightly shifted from each other. Now pick a concrete state:
$$\theta = 0.18$$
As we saw before, that state activates exactly one tile in each tiling. So if we have:
- Tiling 1 → 5 tiles
- Tiling 2 → 5 tiles
Then in total we have 10 possible tiles.
Now here is the key mental shift. Instead of thinking:
“State $\theta = 0.18$ belongs to tile A (in tiling 1) and tile B (in tiling 2)”
we think:
“State $\theta = 0.18$ is represented by a vector of zeros and ones.”
Let’s label the tiles: $T_1, T_2, T_3, T_4, T_5$ for tiling 1 and $T_6, T_7, T_8, T_9, T_{10}$ for tiling 2.
Now we define a feature vector:
$$\mathbf{x}(s) = (x_1, x_2, \ldots, x_{10})$$
For $\theta = 0.18$, suppose:
- In tiling 1, tile $T_3$ is active
- In tiling 2, tile $T_8$ is active
Then the feature vector looks like:
$$\mathbf{x}(s) = (0, 0, 1, 0, 0, 0, 0, 1, 0, 0)$$
Everything is zero except the active tiles. Now comes the crucial part. We assign a weight to every tile:
$$\mathbf{w} = (w_1, w_2, \ldots, w_{10})$$
The value of the state is defined as:
$$\hat{V}(s) = \mathbf{w}^\top \mathbf{x}(s)$$
What does that mean? It means:
$$\hat{V}(s) = w_1 x_1 + w_2 x_2 + \cdots + w_{10} x_{10}$$
But remember: $\mathbf{x}(s)$ is almost entirely zeros. So only the active tiles contribute.
If $w_3 = 0.4$ and $w_8 = 0.3$, then:
$$\hat{V}(s) = w_3 + w_8 = 0.4 + 0.3 = 0.7$$
That’s it. No multiplication between features. No nonlinear transformation. No squaring. No hidden layers. Just a weighted sum of features. That is exactly what a linear function approximator is:
$$\hat{V}(s) = \mathbf{w}^\top \mathbf{x}(s)$$
The function is linear in the parameters $\mathbf{w}$.
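The same computation in a few lines of NumPy, with the illustrative weights from above:

```python
import numpy as np

# Binary feature vector for theta = 0.18: tiles T3 and T8 are active.
x = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0], dtype=float)

# One weight per tile; only the weights of T3 and T8 matter here (illustrative values).
w = np.zeros(10)
w[2] = 0.4   # weight of T3
w[7] = 0.3   # weight of T8

# The value estimate is a plain dot product: linear in the parameters w.
v_hat = w @ x
print(v_hat)  # 0.7
```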