Reinforcement Learning: Continuous State Space

From State Aggregation to Tile Coding: Handling Continuous State Spaces in Reinforcement Learning.


Imagine you are building a robot that balances a pole on its hand. The pole can lean slightly left, slightly right, or almost fall. The robot’s hand can move left or right with different speeds. Now pause for a second and think about the “state” of this world.

In a simple grid world, a state might just be a number like 1, 2, 3, or 4. But here, the pole angle could be $0.01$ radians, $0.013$ radians, $0.0135$ radians… The velocity could be $0.2$, $0.201$, $0.2013$… There are infinitely many possibilities.

This is what we call a continuous state space.

In small textbook examples, we happily write something like

$$V(1), \quad V(2), \quad V(3)$$

or

$$Q(s, a)$$

as if we can store a separate value for every state. But what does that mean when the angle of the pole can be $0.001234$ radians? Are we going to store a value for every possible decimal?

Let’s try to be naive for a moment.

Suppose the angle of the pole ranges from $-1$ to $+1$ radians. We decide to store values with precision up to three decimal places. That means possible angles are

$$-1.000, -0.999, -0.998, \dots, 0.999, 1.000$$

That’s 2001 distinct angle values already. Now add angular velocity, also from $-1$ to $+1$ with the same precision. That’s another 2001 possibilities.

The total number of states becomes

$$2001 \times 2001 \approx 4{,}000{,}000$$

And this is just two variables. Real systems have more. Suddenly, our simple table-based idea collapses.
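To see the blow-up concretely, here is a minimal sketch of that counting argument in Python (the ranges and the three-decimal precision are the ones assumed above):

```python
# Three-decimal precision over [-1, 1]: -1.000, -0.999, ..., 1.000
values_per_variable = len(range(-1000, 1001))

print(values_per_variable)       # 2001
print(values_per_variable ** 2)  # 4004001 -- about 4 million states for two variables
print(values_per_variable ** 4)  # about 1.6e13 -- four variables is already hopeless
```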

So the question becomes: if we cannot store a separate value for every continuous state, what can we do?

State Aggregation: Grouping the Continuous World


Let’s step back and think intuitively.

Imagine you are a teacher observing students’ heights. Heights are continuous. No two students are exactly the same height. But if someone asks, “How many short students are there?” you don’t say:

“There is one student who is 160.013 cm, one who is 160.027 cm…” Instead, you group them. You might say:

150–160 cm → short

160–170 cm → medium

170–180 cm → tall

You just discretized a continuous variable into bins. You lost some precision, but you gained something more important: simplicity.

This idea is the heart of state aggregation.

Return to the pole balancing example. Suppose the angle $\theta$ lies between $-1$ and $1$. Instead of remembering a value for every precise $\theta$, we divide the range into intervals:

$$[-1, -0.8), \quad [-0.8, -0.6), \quad \dots, \quad [0.8, 1]$$

Now, instead of asking:

“What is $V(\theta = 0.0135)$?”

we ask:

“Which interval does $0.0135$ belong to?”

Suppose it falls in $[0, 0.2)$. Then we treat all angles in that interval as if they are the same state.

Let’s make it concrete. Assume we divide the angle range into 5 bins:

$$\begin{aligned} B_1 &: [-1, -0.6) \\ B_2 &: [-0.6, -0.2) \\ B_3 &: [-0.2, 0.2) \\ B_4 &: [0.2, 0.6) \\ B_5 &: [0.6, 1] \end{aligned}$$

Suppose during learning we observe:

$$\theta = 0.05, \quad r = 1, \quad \theta' = 0.1$$

Instead of writing

$$V(0.05) \leftarrow V(0.05) + \alpha \left( r + \gamma V(0.1) - V(0.05) \right)$$

we now write

$$V(B_3) \leftarrow V(B_3) + \alpha \left( r + \gamma V(B_3) - V(B_3) \right)$$

We are updating the value of the entire bin $B_3$.
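Here is a minimal sketch of this bin-based update in Python. The bin edges are the five intervals $B_1, \dots, B_5$ defined above; `alpha` and `gamma` are illustrative values, not prescribed ones:

```python
import numpy as np

# Interior edges of the five bins over [-1, 1]: B1 = [-1, -0.6), ..., B5 = [0.6, 1]
edges = np.array([-0.6, -0.2, 0.2, 0.6])

def bin_index(theta):
    """Map a continuous angle in [-1, 1] to a bin index 0..4 (B1..B5)."""
    return int(np.searchsorted(edges, theta, side="right"))

V = np.zeros(5)          # one learned value per bin
alpha, gamma = 0.1, 0.9  # illustrative step size and discount

# Observed transition: theta = 0.05, r = 1, theta' = 0.1
b, b_next = bin_index(0.05), bin_index(0.1)  # both land in B3 (index 2)
V[b] += alpha * (1.0 + gamma * V[b_next] - V[b])
```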

What just happened?

We made a bold assumption: all states inside $B_3$ are approximately equivalent. That is the approximation. And that approximation gives us generalization.

This is state aggregation: grouping continuous states into discrete clusters and learning a single value per group.

But something subtle happens here.

Suppose $\theta = -0.21$ and $\theta = -0.19$. They are only $0.02$ apart. Yet $-0.21$ belongs to $B_2$ and $-0.19$ belongs to $B_3$. Two almost identical angles now belong to different bins and get completely different values.

That feels wrong.

Our bins introduced artificial boundaries.

If the optimal value function is smooth, we would prefer nearby states to have similar values. Hard bins break that smoothness.
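Reusing the `bin_index` sketch from above, the boundary problem is two lines of code:

```python
print(bin_index(-0.21), bin_index(-0.19))  # 1 2 -> B2 vs. B3
# Angles only 0.02 apart get completely independent value estimates.
```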

So we ask: can we do better than rigid buckets?

Tile Coding: Overlapping Structure and Smooth Generalization


Imagine a long floor. Instead of dividing it into non-overlapping tiles like bathroom tiles, imagine laying multiple transparent grids on top of each other, each slightly shifted. This is the intuition behind tile coding.

We still discretize but we do it multiple times, with overlapping tilings.

Suppose $\theta \in [0, 1]$. Instead of one partition into 5 bins, we create two tilings.

First tiling:

$$[0, 0.2), \quad [0.2, 0.4), \quad [0.4, 0.6), \quad [0.6, 0.8), \quad [0.8, 1]$$

Second tiling, shifted by $0.1$ (with partial tiles at the two edges so the whole of $[0, 1]$ stays covered):

$$[0.1, 0.3), \quad [0.3, 0.5), \quad [0.5, 0.7), \quad [0.7, 0.9)$$

Now consider $\theta = 0.39$.

In tiling 1, it falls into $[0.2, 0.4)$.

In tiling 2, it falls into $[0.3, 0.5)$.

We assign each tile a weight. The value of $\theta$ is the sum of the weights of the active tiles.

Let $w_1$ be the weight of the active tile $[0.2, 0.4)$ in tiling 1, and $w_2$ the weight of the active tile $[0.3, 0.5)$ in tiling 2.

Then

$$V(\theta = 0.39) = w_1 + w_2$$

If we observe TD error $\delta$, we update

$$w_1 \leftarrow w_1 + \alpha \delta$$

$$w_2 \leftarrow w_2 + \alpha \delta$$

What does this accomplish?

Consider $\theta = 0.41$.

In tiling 1, it falls into $[0.4, 0.6)$.

In tiling 2, it still falls into $[0.3, 0.5)$.

So it shares one tile (tiling 2) with $0.39$.

That means $0.39$ and $0.41$ have a partially shared representation. Their values will be similar, but not identical.

We have achieved smooth generalization. With a few more tilings in place, the three values $0.37$, $0.39$, and $0.41$ would all receive different values, whereas under state aggregation, two of them (or all three) could collapse to the same value, depending on where the bin boundaries fall.
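Here is a minimal sketch of the two-tiling example above (tile width $0.2$, second tiling shifted by $0.1$); the array sizes and the step size `alpha` are illustrative choices:

```python
import numpy as np

width = 0.2
offsets = [0.0, 0.1]  # tiling 1 unshifted, tiling 2 shifted by 0.1

def active_tiles(theta):
    """One active tile index per tiling; shift by one width so edge tiles get index 0."""
    return [int((theta - off + width) // width) for off in offsets]

w = np.zeros((2, 8))  # one weight per tile per tiling, with spare tiles at the edges
alpha = 0.1

def value(theta):
    return sum(w[t, i] for t, i in enumerate(active_tiles(theta)))

# Apply a TD error delta to the two tiles active at theta = 0.39
delta = 1.0
for t, i in enumerate(active_tiles(0.39)):
    w[t, i] += alpha * delta

print(value(0.39))  # 0.2 -- both updated tiles are active here
print(value(0.41))  # 0.1 -- shares one tile with 0.39
print(value(0.75))  # 0.0 -- shares nothing with 0.39
```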

With more tilings, each shifted differently, the representation becomes more expressive. With enough tilings, we can approximate quite complex value functions.

Mathematically, tile coding can be viewed as a linear function approximator. We define a feature vector $\phi(s)$, where each component is 1 if a tile is active and 0 otherwise.

Then

$$V(s) = w^{\top} \phi(s)$$

For a given state, only a few components of $\phi(s)$ are $1$, one per tiling. So updates are efficient. When we observe the transition

$$s \rightarrow r \rightarrow s'$$

the TD update becomes

$$\delta = r + \gamma w^\top \phi(s') - w^\top \phi(s)$$

and

$$w \leftarrow w + \alpha \delta \phi(s)$$
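Putting it all together, here is a minimal sketch of one TD(0) step with tile-coded features for a single variable $s \in [0, 1]$. The number of tilings, the tile width, and the convention of dividing the step size by the number of tilings are common illustrative choices, not requirements:

```python
import numpy as np

n_tilings, n_tiles, width = 8, 12, 0.1  # illustrative sizes for s in [0, 1]
offsets = [t * width / n_tilings for t in range(n_tilings)]

def phi(s):
    """Binary tile-coding features: exactly one active tile per tiling."""
    features = np.zeros(n_tilings * n_tiles)
    for t, off in enumerate(offsets):
        idx = int((s - off + width) // width)  # shift by one width so idx >= 0
        features[t * n_tiles + idx] = 1.0
    return features

w = np.zeros(n_tilings * n_tiles)
alpha, gamma = 0.1 / n_tilings, 0.99  # step size spread across the k active tiles

# One TD(0) step for an observed transition s -> r -> s'
s, r, s_next = 0.42, 1.0, 0.47
delta = r + gamma * w @ phi(s_next) - w @ phi(s)
w += alpha * delta * phi(s)  # only the k active components of phi are nonzero
```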

Notice something beautiful here.

In tabular learning, each state had its own independent parameter.

In state aggregation, each bin had its own parameter.

In tile coding, states share parameters. Learning about one state slightly influences nearby states.

As we increase the number of tilings, decrease tile width, or move to more advanced function approximators like neural networks, we are just refining this same idea: approximate the value function over a continuous space using shared parameters.

State aggregation is the first step: rough, bold grouping.

Tile coding is the next step: structured, overlapping, smoother approximation.

And from there, the path naturally leads to the broader world of function approximation, where the table disappears entirely, and what remains is a learned surface stretched over a continuous world.

The Computational Cost of Tile Coding

Now that we understand how tile coding helps us generalize smoothly over a continuous space, let us slow down and ask a practical question.

How expensive is this idea?

At first glance, tile coding feels lightweight. After all, for any given state, only a few tiles are active. If we use $k$ tilings, then exactly $k$ weights are involved in computing

$$V(s) = w^\top \phi(s)$$

So computing the value of a single state costs roughly $k$ additions. Updating also touches only those same $k$ weights. That sounds beautifully efficient. But the hidden cost is not in the update. It is in how many parameters we must maintain.

Suppose we have just one continuous variable, say the angle

$$\theta \in [-1, 1]$$

We divide it into $n$ bins in a single tiling. That means one tiling contains $n$ tiles. If we use $k$ tilings, each shifted slightly, then the total number of parameters becomes

$$n \times k$$

That seems manageable. Now let’s add a second variable, angular velocity $\dot{\theta}$. We again divide it into $n$ bins. In two dimensions, each tiling is now a grid. If angle has $n$ bins and velocity has $n$ bins, then each tiling contains

$$n \times n = n^2$$

tiles.

With $k$ tilings, the total number of parameters becomes

$$k \cdot n^2$$

Now imagine three variables. Perhaps we also include cart position. Each variable has $n$ bins. Each tiling now has

$$n \times n \times n = n^3$$

tiles.

With $k$ tilings, total parameters:

$$k \cdot n^3$$

See what is happening?

If we have $d$ continuous state variables, each divided into $n$ bins per tiling, then each tiling contains

$$n^d$$

tiles.

And with $k$ tilings, the total number of parameters becomes

$$k \cdot n^d$$

This exponential dependence on the dimension $d$ is unavoidable. Let’s make it concrete. Suppose we have $d = 4$ state variables, each with $n = 20$ bins, and $k = 8$ tilings.

Each tiling has

$$20^4 = 160{,}000$$

tiles.

Across 8 tilings:

$$8 \times 160{,}000 = 1{,}280{,}000$$

parameters.
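The same bookkeeping in code, using the $d = 4$, $n = 20$, $k = 8$ example above:

```python
def tile_coding_weights(n_bins, n_dims, n_tilings):
    """Total number of parameters: k * n^d."""
    return n_tilings * n_bins ** n_dims

print(tile_coding_weights(20, 4, 8))  # 1280000

# The blow-up is in the number of dimensions, not the number of tilings:
for d in range(1, 7):
    print(d, tile_coding_weights(20, d, 8))
```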

The growth is explosive in $d$, not in $k$. So time complexity for evaluating or updating a single state is

$$\mathcal{O}(k)$$

But memory complexity is

$$\mathcal{O}(k n^d)$$

This is why tile coding works beautifully for low-dimensional problems like CartPole. But as dimension grows, even tile coding begins to struggle.

And that tension, efficient updates but exponential growth in representation, is exactly what pushes us toward more compact function approximators like neural networks.

Appendix

In case you are wondering: how is tile coding a linear function approximator?

Imagine again our simple continuous state variable:

$$s \in [0, 1]$$

Suppose we use two tilings, each dividing the interval into bins, slightly shifted from each other. Now pick a concrete state:

$$s = 0.39$$

As we saw before, that state activates exactly one tile in each tiling. So if we have 5 tiles in each of the two tilings, then in total we have 10 possible tiles.

Now here is the key mental shift. Instead of thinking:

“State $0.39$ belongs to tile A and tile B”

we think:

“State $0.39$ is represented by a vector of zeros and ones.”

Let’s label the tiles $1$ through $10$: tiles $1$–$5$ belong to tiling 1, and tiles $6$–$10$ belong to tiling 2.

Now we define a feature vector:

$$\phi(s) \in \mathbb{R}^{10}$$

For $s = 0.39$, suppose tile $2$ (in tiling 1) and tile $8$ (in tiling 2) are the active ones. Then the feature vector looks like:

$$\phi(s) = [0, 1, 0, 0, 0, 0, 0, 1, 0, 0]$$

Everything is zero except the active tiles. Now comes the crucial part. We assign a weight to every tile:

$$w = [w_1, w_2, \dots, w_{10}]$$

The value of the state is defined as:

$$V(s) = w^\top \phi(s)$$

What does that mean? It means:

$$V(s) = w_1 \phi_1 + w_2 \phi_2 + \dots + w_{10} \phi_{10}$$

But remember: $\phi(s)$ is almost entirely zeros. So only the active tiles contribute.

If $\phi_2 = 1$ and $\phi_8 = 1$, then:

$$V(s) = w_2 + w_8$$
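As a minimal sketch of this computation (the active indices 2 and 8 follow the example above; the weights are random placeholders):

```python
import numpy as np

# Ten tiles: indices 0-4 for tiling 1, indices 5-9 for tiling 2.
phi = np.zeros(10)
phi[1] = 1.0  # tile 2 (tiling 1) is active
phi[7] = 1.0  # tile 8 (tiling 2) is active

w = np.random.default_rng(0).normal(size=10)  # one weight per tile

v = w @ phi  # V(s) = w^T phi(s)
assert np.isclose(v, w[1] + w[7])  # only the active tiles contribute
```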

That’s it. No multiplication between features. No nonlinear transformation. No squaring. No hidden layers. Just a weighted sum of features. That is exactly what a linear function approximator is:

$$V(s) = w^\top \phi(s)$$

The function is linear in the parameters ww.