Reinforcement Learning: Value Iteration

Value Iteration: a faster path to optimal policies for MDPs.


Our drone has come a long way. It began with a random policy, wandering aimlessly. Then it learned to evaluate that policy, estimating how good each state was. After that, it improved the policy, updating arrows to point toward better futures. And finally, through repeated cycles of Policy Iteration,

evaluation → improvement → evaluation → improvement → …

the drone found a truly optimal plan: a set of actions that reliably guide it toward the rooftop while avoiding no-fly zones.

That’s thoughtful. It’s wise.

But it’s slow.

It requires sweeping through the grid dozens of times just to evaluate a single policy, before even improving it! What if, instead of waiting for an entire policy evaluation to converge, the drone could update its utilities and improve its decisions at the same time? This is where an even more powerful idea emerges: Value Iteration.

Let’s look closely at what happens inside Policy Iteration.

During Policy Evaluation, we repeatedly apply:

V(s) = r(s) + \gamma\sum_{s'} P(s'\mid s, a = \pi(s))\,V(s')

During Policy Improvement, we update:

\pi(s) = \arg\max_a \sum_{s'} P(s'\mid s,a)\, V(s')

But why keep these two phases separate? Why not insert the improvement step directly into the evaluation step? Whenever you update a value at state s, instead of using the policy’s action π(s), why not simply use:

\max_a \sum_{s'} P(s' \mid s,a)\, V(s')

In other words, “Don’t wait to improve the policy later. Improve the decision now, during the value update itself.” This transforms the Bellman equation from a policy-based evaluator into an optimality-based evaluator.

So we replace the fixed action with a max over all possible actions:

V^*(s) = r(s) + \gamma \max_a \left( \sum_{s'} P(s' \mid s, a)\, V^*(s') \right)

This is the Bellman Optimality Equation.
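The change from policy evaluation to the optimality backup is small enough to show side by side in a few lines of Python. This is only an illustrative sketch: `V`, `P`, `r`, and `pi` are assumed to be plain dictionaries (value estimates, transition lists of `(next_state, probability)` pairs, rewards, and a policy); none of these names come from a particular library.

```python
def evaluation_backup(s, pi, V, P, r, gamma):
    """Policy evaluation: back up state s under the policy's fixed action pi[s]."""
    return r[s] + gamma * sum(p * V[s2] for s2, p in P[s][pi[s]])

def optimality_backup(s, actions, V, P, r, gamma):
    """Value iteration: back up state s under the best action (Bellman optimality)."""
    return r[s] + gamma * max(
        sum(p * V[s2] for s2, p in P[s][a]) for a in actions
    )
```

The only difference is that the fixed action `pi[s]` is replaced by a `max` over all actions; everything else in the backup is identical.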

Suppose we want to update the value of state 0. At the start of Value Iteration, all non-terminal values are initialized to 0, just like before:

V(0) = 0,\quad V(1) = -1,\quad V(3) = +1,\quad V(\text{others}) = 0

The drone must consider all four possible actions from state 0, so we write:

V_{\text{new}}(0) = r(0) + \gamma \max \begin{bmatrix} \sum_{s'} P(s'\mid s=0,a=\text{UP})\,V(s') \\ \sum_{s'} P(s'\mid s=0,a=\text{LEFT})\,V(s') \\ \sum_{s'} P(s'\mid s=0,a=\text{DOWN})\,V(s') \\ \sum_{s'} P(s'\mid s=0,a=\text{RIGHT})\,V(s') \end{bmatrix}

So our four action-values become:

\begin{bmatrix} 0.8(0) + 0.1(0) + 0.1(-1) \\ 0.8(0) + 0.1(0) + 0.1(0) \\ 0.8(0) + 0.1(-1) + 0.1(0) \\ 0.8(-1) + 0.1(0) + 0.1(0) \end{bmatrix} = \begin{bmatrix} -0.1 \\ 0 \\ -0.1 \\ -0.8 \end{bmatrix}

The best action is the one with the highest value. Here it is LEFT, with value 0. The final value update:

\begin{aligned} V_{\text{new}}(0) &= r(0) + \gamma \max \begin{bmatrix} -0.1 \\ 0 \\ -0.1 \\ -0.8 \end{bmatrix} \\[6pt] &= -0.04 + 0.5 \times 0 \\[6pt] &= -0.04 \end{aligned}

Just like that, the drone has completed one Value Iteration update, improving the utility of state 0 by considering every possible action, not just the one suggested by a policy.
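The arithmetic above can be reproduced in a few lines of Python as a sanity check. The successor values for each action are copied straight from the worked example (state 1 has value −1, every other reachable state is 0), along with the 0.8/0.1/0.1 slip model, γ = 0.5, and r(0) = −0.04.

```python
GAMMA = 0.5
R0 = -0.04  # living reward r(0)

# (intended-outcome value, side-slip value, side-slip value) for each action,
# taken directly from the worked example for state 0
action_outcomes = {
    "UP":    (0.0, 0.0, -1.0),
    "LEFT":  (0.0, 0.0,  0.0),
    "DOWN":  (0.0, -1.0, 0.0),
    "RIGHT": (-1.0, 0.0, 0.0),
}

# expected next-state value for each action: 0.8/0.1/0.1 slip model
q = {a: 0.8 * main + 0.1 * slip1 + 0.1 * slip2
     for a, (main, slip1, slip2) in action_outcomes.items()}

# LEFT is best with value 0, so V_new(0) = -0.04 + 0.5 * 0 = -0.04
v_new = R0 + GAMMA * max(q.values())
```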

After a few sweeps of Value Iteration, the utilities across the grid begin to change rapidly. Good states grow brighter, dangerous states grow darker, and the “terrain of value” settles into its true shape. Eventually, Value Iteration converges:

\big|V_{\text{new}}(s) - V_{\text{old}}(s)\big| < \theta \quad\text{for all states } s

At this point, the drone has discovered the optimal value function

V^*(s)

the best possible estimate of long-term future reward from every state. But a value function alone doesn’t tell the drone what action to take. An important question remains:

“Once the values have converged, how do we actually get the final best actions, the optimal policy?”

Once we have the optimal value function V*(s), the drone doesn’t need to guess or experiment. For every state, it can now compute:

Which action leads to the highest long-term value?

Formally, the optimal policy is:

\pi^*(s) = \arg\max_a \left[ \sum_{s'} P(s' \mid s, a)\, V^*(s') \right]

This looks similar to the “max” inside the Bellman equation because it is the same idea, just reused for the final decision.

\textbf{Algorithm} \\[10pt]
\textbf{Input: } r(s),\ p(s' \mid s,a),\ \gamma,\ \theta \\[12pt]
1.\hspace{0.5cm} \text{Initialize } V(s) \leftarrow 0 \ \text{for all non-terminal states} \\[6pt]
2.\hspace{0.5cm} \text{Initialize } V(s) \leftarrow r(s) \ \text{for terminal states} \\[12pt]
3.\hspace{0.5cm} \text{Repeat until convergence:} \\[6pt]
4.\hspace{1.2cm} \Delta \leftarrow 0 \\[6pt]
5.\hspace{1.2cm} \text{for each state } s: \\[6pt]
6.\hspace{2cm} \text{temp} \leftarrow V(s) \\[6pt]
7.\hspace{2cm} V(s) \leftarrow r(s) + \gamma \max_{a} \left( \sum_{s'} p(s' \mid s,a)\, V(s') \right) \\[6pt]
8.\hspace{2cm} \Delta \leftarrow \max\big(\Delta,\ |\text{temp} - V(s)|\big) \\[10pt]
9.\hspace{1.2cm} \text{if } \Delta < \theta:\ \text{stop; values have converged} \\[12pt]
10.\hspace{0.5cm} \text{for each state } s: \\[6pt]
11.\hspace{1.2cm} \pi^*(s) \leftarrow \arg\max_{a} \left( \sum_{s'} p(s' \mid s,a)\, V(s') \right) \\[12pt]
12.\hspace{0.5cm} \text{return } \pi^*(s)
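The twelve steps above translate almost line for line into Python. This is a minimal sketch for a generic tabular MDP, assuming the model is handed over as plain dictionaries: rewards `r[s]`, transitions `P[s][a]` as a list of `(next_state, probability)` pairs, and a set of terminal states. None of these names come from a library.

```python
def value_iteration(states, actions, P, r, terminals, gamma=0.5, theta=1e-6):
    # Steps 1-2: zero for non-terminal states, r(s) for terminal states
    V = {s: (r[s] if s in terminals else 0.0) for s in states}

    # Steps 3-9: sweep until the largest change in any state falls below theta
    while True:
        delta = 0.0
        for s in states:
            if s in terminals:
                continue
            temp = V[s]
            # Step 7: Bellman optimality backup
            V[s] = r[s] + gamma * max(
                sum(p * V[s2] for s2, p in P[s][a]) for a in actions
            )
            delta = max(delta, abs(temp - V[s]))
        if delta < theta:
            break

    # Steps 10-12: extract the greedy policy from the converged values
    pi = {
        s: max(actions, key=lambda a: sum(p * V[s2] for s2, p in P[s][a]))
        for s in states if s not in terminals
    }
    return V, pi
```

Note that the sweep updates `V` in place, so states visited later in the same sweep already see the fresh values of earlier states; this matches step 7 of the algorithm and typically converges a little faster than keeping a separate copy of the old values.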

From a Perfect Map to the Real Sky

Over the past few chapters in our drone’s journey, we explored the full power of MDPs. We learned how a drone can navigate a city perfectly, as long as it has access to a perfectly defined world.

In our little grid-city, life is clean. Predictable. Fully observable. But the real sky is rarely so kind. The drone faces a world that does not hand over its transition probabilities in neat tables. No city official gives it a spreadsheet explaining the odds of every move.

If real life were that cooperative, we could solve everything with MDPs overnight. But the truth is harsher:

real-world environments do not reveal their rules.

Our drone must fly through a city whose rules it is never told.

It’s like releasing the drone into a foggy metropolis where the only feedback is “you moved… something happened… try again.” No clear map. No guaranteed model. No perfect grid.

In other words:

The MDPs we solved are fully observable. Reality is not.

And this is where the classical MDP toolbox reaches its limit. Policy Iteration and Value Iteration shine in worlds where the rules are known. But when the world is hidden or unpredictable, a new toolbox is needed, one built not on knowing the environment, but on learning it.

That toolbox is Reinforcement Learning.

In reinforcement learning, the drone learns the reward function, the transition model, or even the optimal policy directly through trial and error, exploration, and experience. It does not start with a map. It discovers the map. It reshapes its understanding with every flight. So as our story in the Grid City comes to an end, a new adventure begins:

What happens when the drone must learn to fly without ever being told the rules?

That is the world of reinforcement learning and that’s where we’ll go next.

Stay tuned. The sky is about to get unpredictable.