Reinforcement Learning: Model Based ADP Learner

From raw experience to learned MDP models: an introduction to model-based ADP.


In Policy Iteration and Value Iteration, life was easy. Our drone had the full rulebook of Grid City: the complete transition model and the complete reward function.

It could sit in a warm simulation lab, crunch equations, and compute the perfect policy without ever leaving home.

But now? Grid City becomes a black box.

The drone has no access to the reward function. It has no access to the transition model. It doesn’t even know why or how it moves the way it does. All it can do is interact: the drone sends an action in, and Grid City sends back only a next state and a reward.

And that’s it. No explanations. No diagrams. Just inputs and outputs. From the drone’s perspective, all it ever receives is a sequence:

s_0 = 8,\ a_0 = 2,\ r_0 = -0.04,\ s_1 = 9,\ a_1 = 1,\ r_1 = -0.04,\ \ldots

Just an endless stream of state, action, reward, state, action, reward…

The drone has no idea what “8” means. It doesn’t know that “2” means “LEFT” or that -0.04 is “battery drain.” It never sees the map or the labels. The meaning is hidden. Only the experience is visible.

Reconstructing the Invisible City

The drone’s world is now reduced to a string of symbols:

8,\ 2,\ -0.04,\ 9,\ 1,\ -0.04,\ 5,\ 4,\ -0.04,\ \ldots

A long, cryptic rhythm of: state → action → reward → next state

But here's the beautiful part:

Even when the world hides its rules, patterns always leak through. And the drone can learn these patterns. That’s exactly what model-based ADP is about.

Learning the Reward Function

Let’s start with something simple. The reward function r(s) depends only on the state.

If the drone observes a reward of -0.04 when it enters state 8, -0.04 when it enters state 9, and -1 when it enters state 1, then:

r(8) = -0.04,\quad r(9) = -0.04,\quad r(1) = -1

Each state reveals its reward simply by being visited.
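In code, this first-visit reward recording can be sketched as follows (the dictionary layout and the function name `record_reward` are illustrative choices, not part of the original algorithm):

```python
# Learned reward model: maps each state to the reward observed
# the first time that state was visited.
reward = {}

def record_reward(s, r):
    """Record r(s) on the first visit to state s; later visits are ignored."""
    if s not in reward:
        reward[s] = r

# Replaying a fragment of the drone's experience stream.
for s, r in [(8, -0.04), (9, -0.04), (1, -1.0), (8, -0.04)]:
    record_reward(s, r)

print(reward)  # {8: -0.04, 9: -0.04, 1: -1.0}
```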

Uncovering the Transition Model

The transition model is trickier. The drone wants to know:

p(s' \mid s, a)

But the MDP never tells it. Instead, the drone gathers triples:

(s, a, s'),\ (s, a, s''),\ \ldots

and simply counts them.

Let N(s, a, s') be the number of times that taking action a in state s led to state s', and N(s, a) the total number of times action a was taken in state s. Then:

p(s' \mid s,a) = \frac{N(s,a,s')}{N(s,a)}

That’s it. The transition model is reconstructed from pure experience.

For example, suppose that out of three times the drone is in state 9 and chooses action 1, it lands in state 5 twice and in state 8 once. The estimated probabilities are then:

p(5 \mid 9, 1) = \frac{2}{3},\qquad p(8 \mid 9, 1) = \frac{1}{3}

Just by counting, the drone discovers how the winds behave. It starts seeing structure inside the black box.
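The counting estimator itself is only a few lines. Here is a minimal sketch (the helper names `observe` and `p_hat` are my own) that reproduces the state-9 example above:

```python
from collections import Counter, defaultdict

# n_sas[(s, a)] counts, for each next state s', how often (s, a) led there.
n_sas = defaultdict(Counter)

def observe(s, a, s_next):
    """Record one observed transition triple (s, a, s')."""
    n_sas[(s, a)][s_next] += 1

def p_hat(s_next, s, a):
    """Estimated p(s' | s, a) = N(s, a, s') / N(s, a)."""
    total = sum(n_sas[(s, a)].values())
    return n_sas[(s, a)][s_next] / total if total else 0.0

# From state 9 with action 1 the drone lands in 5 twice and in 8 once.
for s_next in (5, 5, 8):
    observe(9, 1, s_next)

print(p_hat(5, 9, 1), p_hat(8, 9, 1))  # 2/3 and 1/3
```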

\textbf{Algorithm: Model-Based ADP Percept} \\[4pt]
\textbf{input: } \text{sequence element } (s, a, r, s') \\[4pt]
\textbf{persistent: } \begin{cases} r(s) \text{ initially zero} \\ p(s'|s,a) \text{ initially zero} \\ n(s,a,s') \text{ initially zero} \\ n(s,a) \text{ initially zero} \\ \text{visited}(s) \text{ initially false} \end{cases} \\[10pt]
1.\hspace{0.5cm} \textbf{if } \text{visited}(s') = \text{false then} \\[6pt]
2.\hspace{1.2cm} \text{visited}(s') \leftarrow \text{true} \\[6pt]
3.\hspace{1.2cm} r(s') \leftarrow r \\[6pt]
4.\hspace{1.2cm} V(s') \leftarrow r \\[6pt]
5.\hspace{0.5cm} n(s,a,s') \leftarrow n(s,a,s') + 1 \\[4pt]
6.\hspace{0.5cm} n(s,a) \leftarrow n(s,a) + 1 \\[4pt]
7.\hspace{0.5cm} \textbf{for each } \hat{s} \in S \textbf{ do} \\[6pt]
8.\hspace{1.2cm} p(\hat{s}|s,a) \leftarrow \dfrac{n(s,a,\hat{s})}{n(s,a)}
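Putting the pieces together, the percept step might be translated into Python like this (the `ADPModel` class and its attribute names are a layout I chose for illustration, not part of the original post):

```python
from collections import defaultdict

class ADPModel:
    """Incrementally learned MDP model: rewards, values, transition counts."""

    def __init__(self):
        self.r = {}                    # r(s'), recorded on first visit
        self.V = defaultdict(float)    # value estimate, seeded with r
        self.n_sas = defaultdict(int)  # n(s, a, s')
        self.n_sa = defaultdict(int)   # n(s, a)
        self.p = {}                    # estimated p(s' | s, a)
        self.visited = set()

    def percept(self, s, a, r, s_next):
        # First visit to s' reveals its reward and seeds its value estimate.
        if s_next not in self.visited:
            self.visited.add(s_next)
            self.r[s_next] = r
            self.V[s_next] = r
        # Update counts and re-normalize the transition estimates for (s, a).
        self.n_sas[(s, a, s_next)] += 1
        self.n_sa[(s, a)] += 1
        for (si, ai, sn) in list(self.n_sas):
            if (si, ai) == (s, a):
                self.p[(sn, s, a)] = self.n_sas[(si, ai, sn)] / self.n_sa[(s, a)]

model = ADPModel()
for s_next in (5, 5, 8):
    model.percept(9, 1, -0.04, s_next)
print(model.p[(5, 9, 1)], model.p[(8, 9, 1)])  # 2/3 and 1/3
```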

With the Model Learned… back to Dynamic Programming

Once the drone has estimated both the reward function r(s) and the transition model p(s' | s, a), it effectively has a complete reconstruction of the unseen MDP. From there, it can apply value iteration or policy iteration exactly like before.

The drone’s intelligence adapts as the model improves. The more it flies, the better the city reveals itself. It learns not by being told, but by discovering the rules behind its own experiences.

This family of approaches, learning the model from experience and then solving it with dynamic programming, is known as Adaptive Dynamic Programming (ADP). Because it relies on learning a model, we call it a model-based approach, and the agent a model-based ADP learner.

\textbf{Algorithm: Model-based ADP learner policy update} \\[6pt]
\textbf{input: } \text{convergence threshold } \theta \\[6pt]
\textbf{persistent: } \begin{cases} \pi(s) \text{ the current policy} \\ V(s) \text{ the current value estimate} \\ r(s) \text{ the learned reward model} \\ p(s' \mid s,a) \text{ the learned transition model} \\ \varepsilon \text{ exploration probability threshold} \\ \xi \text{ decay factor for } \varepsilon \end{cases} \\[12pt]
1.\hspace{0.5cm} stable \leftarrow false \\[10pt]
2.\hspace{0.5cm} \textbf{while } stable = false\ \textbf{do} \\[6pt]
3.\hspace{1.2cm} V(s) \leftarrow policy\_evaluation\big(r(s),\ p(s' \mid s,a),\ \gamma,\ \theta,\ \pi(s),\ V(s)\big) \\[8pt]
4.\hspace{1.2cm} \pi(s),\ change \leftarrow policy\_improvement\big(p(s' \mid s,a),\ \pi(s),\ V(s)\big) \\[10pt]
5.\hspace{1.2cm} \textbf{if } change = false\ \textbf{then} \\[4pt]
6.\hspace{1.8cm} stable \leftarrow true \\[8pt]
7.\hspace{0.5cm} \varepsilon \leftarrow \xi \varepsilon
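As a concrete sketch, here is a self-contained Python version of this policy update, with policy evaluation and greedy improvement inlined and exercised on a tiny two-state model (the model, the function name, and the parameter defaults are my own assumptions, not from the post):

```python
def policy_update(states, actions, p, r, pi, V,
                  gamma=0.9, theta=1e-6, eps=1.0, xi=0.99):
    """Repeat policy evaluation and greedy improvement on the learned
    model until the policy is stable, then decay the exploration rate."""
    def q(s, a):
        # One-step lookahead under the learned model:
        # q(s, a) = r(s) + gamma * sum_s' p(s' | s, a) * V(s')
        return r[s] + gamma * sum(p.get((sn, s, a), 0.0) * V[sn] for sn in states)

    stable = False
    while not stable:
        # Policy evaluation: sweep values until they settle under pi.
        while True:
            delta = 0.0
            for s in states:
                v_new = q(s, pi[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the learned model.
        change = False
        for s in states:
            best = max(actions, key=lambda a: q(s, a))
            if best != pi[s]:
                pi[s], change = best, True
        stable = not change
    return pi, V, xi * eps

# Tiny two-state model: from state 0, "go" reaches the rewarding,
# absorbing state 1, while "stay" loops in state 0 forever.
p = {(1, 0, "go"): 1.0, (0, 0, "stay"): 1.0,
     (1, 1, "go"): 1.0, (1, 1, "stay"): 1.0}
r = {0: 0.0, 1: 1.0}
pi, V, eps = policy_update([0, 1], ["stay", "go"], p, r,
                           pi={0: "stay", 1: "stay"}, V={0: 0.0, 1: 0.0})
print(pi[0])  # the improved policy chooses "go" out of state 0
```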

Gathering Knowledge

By now, our drone has learned how to reconstruct the invisible city, estimating both the reward function and transition probabilities simply by observing sequences like:

s,\ a,\ r,\ s,\ a,\ r,\ \ldots

In theory, if the drone collected infinitely many such sequences, its estimates r(s) and p(s' | s, a) would eventually become perfect. The reconstructed MDP would match the real Grid City exactly. But there’s a catch:

Exploring an unknown world is not free.

In Grid City, generating one trajectory means flying one full episode. Flying ten thousand episodes means burning battery, risking crashes, drifting into restricted airspace, and enduring constant turbulence.

If the drone over-explores, it wastes time and battery wandering aimlessly. If it over-exploits too early, it might settle on a wrong understanding of the world.

This should remind you of the dilemma in the multi-armed bandit casino story from an earlier blog post: should you keep pulling the arm you currently believe is best, or spend a pull on another arm that might turn out to be better?

The drone faces the exact same tradeoff. In practice, model-based ADP must choose carefully when to explore new actions to refine its model and when to exploit its current policy to collect reward.

Just like the ε-greedy strategy in the multi-armed bandit setting, the drone explores with probability ε and exploits with probability 1 − ε, gradually decaying ε as its model improves.

Balancing exploration with exploitation allows the drone to learn efficiently, without burning itself out.

\textbf{Algorithm: Model-Based ADP Actuate (Exploration--Exploitation)} \\[10pt]
\textbf{input: } \text{state } s \\[6pt]
\textbf{persistent: } \begin{cases} \pi(s) \text{ the current policy, initially random} \\ \varepsilon \text{ exploration probability, initially } 1.0 \end{cases} \\[12pt]
1.\hspace{0.5cm} \beta \leftarrow \text{random number in } [0,1] \\[6pt]
2.\hspace{0.5cm} \textbf{if } \beta < \varepsilon\ \textbf{then} \\[6pt]
3.\hspace{1.2cm} a \leftarrow \text{random action from } A \\[6pt]
4.\hspace{0.5cm} \textbf{else} \\[6pt]
5.\hspace{1.2cm} a \leftarrow \pi(s) \\[6pt]
6.\hspace{0.5cm} \textbf{return } a
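In Python, the actuate step is tiny (the function and argument names are mine):

```python
import random

def actuate(s, pi, actions, eps):
    """Epsilon-greedy action selection: with probability eps pick a random
    action (explore), otherwise follow the current policy (exploit)."""
    if random.random() < eps:
        return random.choice(actions)
    return pi[s]

pi = {0: "up"}
actions = ["up", "down", "left", "right"]
print(actuate(0, pi, actions, eps=0.0))  # always "up": pure exploitation
```

With eps = 1.0 every action is random; decaying eps over episodes shifts the drone from exploration toward exploitation.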

Combining Everything Together

The interaction between the learner and the MDP black box follows a simple but powerful loop. Within each episode, the learner repeatedly percepts incoming information (the state transition and the reward) and then actuates by choosing the next action. This percept–act cycle continues until the episode reaches one of the terminal states.

Once an episode is complete, the learner enters a different mode: it uses everything it has observed to update its policy and model estimates. As more episodes are played, these updates accumulate, allowing the policy to steadily improve. To measure how well training is going, the learner keeps a record of the total reward obtained in each episode, making it easy to track learning performance over time.

\textbf{Algorithm: Training a Model-based ADP learner for N episodes} \\[8pt]
\textbf{input: } \begin{cases} \text{MDP black box} \\ \text{ADP learner} \\ N \text{ training episodes} \\ s_0 \text{ starting state} \\ S_{\text{end}} \text{ set of terminal states} \end{cases} \\[16pt]
1.\hspace{0.5cm} \text{initialize } R \text{ with zeros} \\[6pt]
2.\hspace{0.5cm} \textbf{for } i \leftarrow 1 \text{ to } N\ \textbf{do} \\[6pt]
3.\hspace{1.2cm} s \leftarrow s_0 \\[10pt]
4.\hspace{1.2cm} \textbf{while } s \notin S_{\text{end}}\ \textbf{do} \\[4pt]
5.\hspace{2cm} a \leftarrow \text{ADP learner actuate}(s) \\[6pt]
6.\hspace{2cm} r,\ s' \leftarrow \text{MDP black box}(s,a) \\[6pt]
7.\hspace{2cm} R(i) \leftarrow R(i) + r \\[6pt]
8.\hspace{2cm} \text{ADP learner percept}(s,a,r,s') \\[6pt]
9.\hspace{2cm} s \leftarrow s' \\[12pt]
10.\hspace{1.2cm} \text{ADP learner policy update} \\[12pt]
11.\hspace{0.5cm} \textbf{return } R
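The training loop translates almost line for line into Python. The sketch below assumes a learner object exposing `actuate`, `percept`, and `policy_update` methods and a black-box callable returning `(reward, next_state)`; the dummy learner and the chain-shaped black box at the bottom exist only to make the example runnable:

```python
def train(black_box, learner, n_episodes, s0, terminal_states):
    """Train a model-based ADP learner for n_episodes against an MDP black
    box, returning the total reward collected in each episode."""
    R = [0.0] * n_episodes
    for i in range(n_episodes):
        s = s0
        while s not in terminal_states:
            a = learner.actuate(s)            # choose the next action
            r, s_next = black_box(s, a)       # query the black box
            R[i] += r                         # accumulate episode reward
            learner.percept(s, a, r, s_next)  # update the learned model
            s = s_next
        learner.policy_update()               # refine the policy per episode
    return R

class DummyLearner:
    def actuate(self, s): return 0
    def percept(self, s, a, r, s_next): pass
    def policy_update(self): pass

# Chain world: every action moves s -> s + 1 for a reward of -0.04;
# state 3 is terminal, so each episode lasts exactly three steps.
R = train(lambda s, a: (-0.04, s + 1), DummyLearner(), 2, 0, {3})
print(R)
```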

When the World Is Too Complicated to Model

Everything we built in this chapter rests on a single idea:

If the drone can learn the model, it can solve the model.

That is why ADP is called model-based.

It estimates the reward function and the transition probabilities, and then uses dynamic programming to compute an optimal policy. As long as the transition structure is clear and learnable, as in grid worlds, video games, puzzles, and controlled environments, ADP performs wonderfully.

But what about worlds where the rules are murky?

In these settings, the transition model p(s' | s, a) is enormous, noisy, or simply too complex to estimate from any realistic amount of experience.

Building a model in such environments is like trying to bottle the wind. For these domains, model-based learning becomes impractical, not because the idea is flawed, but because the world itself refuses to be neatly captured.

This opens the door to model-free learning methods. These methods do not attempt to estimate p(s' | s, a) at all. Instead, they learn values and policies directly from observed returns, bypassing the need for an internal model entirely.

In the next blog, we’ll begin exploring this model-free landscape, starting with one of the most fundamental techniques in reinforcement learning: the Monte Carlo (MC) learner.