Reinforcement Learning: Off-Policy Monte Carlo

Explore how importance sampling lets RL evaluate policies without executing them.


In the last blog, our drone learned by doing. And “doing” meant crashing into buildings, wasting battery, drifting into restricted airspace, and spiraling around in the wind until it finally figured things out. The agent had to live through its own mistakes in order to learn from them. That’s fine when we’re training a tiny drone in Grid City, but this approach quickly hits a wall when the world becomes more serious. Let’s look at a couple of scenarios to understand why.

Scenario 1: A Trading Bot

Imagine you’re training a trading agent. Unlike a drone, it can’t afford early crashes: one bad trade may cost millions. But luckily, your firm already has an existing automated trader running in production.

This existing trader isn’t optimal. It’s slow, cautious, maybe outdated, but it’s safe, battle-tested, and profitable enough to trust in production. If the bot learns on-policy, it must behave using its own early policy, and that policy will be awful.

What you actually want is:

Learn an optimal or improved policy from the trades already available without executing the bad ones.

Scenario 2: A self-driving car

Now imagine a self-driving car. The current system, deployed in the real world, follows:

Policy \pi_{\text{safe}} (safe-but-slow)

This policy is extremely safe, but traffic efficiency is terrible, trips take longer, users complain, and competitors are faster. Your research team has designed a new, more assertive driving policy:

Policy \pi_{\text{risky}} (fast-but-unsafe-to-test)

This policy might dramatically improve travel time, but there is a critical problem. You cannot put \pi_{\text{risky}} on the road to collect experience, because one bad decision can cost a life. So now the dilemma is:

I want to know how good \pi_{\text{risky}} is… but I don’t want my car to actually drive according to \pi_{\text{risky}} yet.

When you say “how good,” what you really mean is: if my self-driving car were to always follow the risky policy \pi_{\text{risky}}, what would its expected long-term return be?

This adds an important nuance. What you’re really saying is:

I want to evaluate the consequences of \pi_{\text{risky}} before letting my real car behave that way in the real world.

Prediction vs Control

In both of our stories, the trading bot and the self-driving car, there’s a subtle but very important shift in the question we’re asking. Up until now, our Monte Carlo drone was mostly answering:

What should I do in each state to get the best long-term return?

That’s the control problem. But in the self-driving example with \pi_{\text{safe}} and \pi_{\text{risky}}, we suddenly started asking a different question:

If I were to follow this policy all the time, how good would it be?

That’s the prediction problem.

So in reinforcement learning, we usually separate things into two tasks:

- Prediction: given a fixed policy, estimate how good it is.
- Control: find the policy that maximizes return.

In the last post we learned an on-policy control algorithm using action-values Q(s,a). In this post, we take a step sideways: we focus on off-policy prediction, evaluating a target policy from data generated by another policy. For clarity, we will write this in terms of state values V(s), but all of it can be done with Q-values as well. If the policy \pi is fixed, it already tells us which action will be used, so the value we care about is

V^\pi(s) = \mathbb{E}^\pi [ G_t \mid S_t = s ]

which means: if I start in state s, and from that point onward I strictly follow policy \pi, what total discounted reward should I expect?

Bridging the Gap

That equation may look familiar and unfamiliar at the same time. Familiar, because we are still talking about return. Unfamiliar, because in the previous blog our main object was not V_\pi(s). It was the action-value Q(s,a), and the whole story revolved around choosing the best action.

There, the drone stood in a state, looked at all available actions, and asked a control-style question:

“What should I do next?”

That is why the key relationship looked like

V^*(s) = \max_a Q^*(s,a)

The value of a state was obtained by looking across actions and picking the best one. The max was the mathematical expression of control. It encoded a very specific attitude: from this state, I am free to choose whichever action gives the highest long-term return.

But now the question has quietly changed.

In off-policy prediction, we are no longer asking what the best action is. We are asking what happens if we commit to a particular policy and keep following it. The policy may be cautious, reckless, random, or somewhere in between. Whatever it is, once the policy is fixed, we are no longer free to take a max over actions. The policy itself decides how actions are chosen.

To make this concrete, imagine the self-driving car reaches a yellow light. There are only two actions:

a_1 = \text{GO}, \quad a_2 = \text{BRAKE}

If we were solving a control problem, we would compare the two action-values and keep the larger one:

V^*(s) = \max \{ Q^*(s,\text{GO}),\ Q^*(s,\text{BRAKE}) \}

But suppose instead that the risky policy does not always do the same thing. Maybe in this state it chooses

\pi_{\text{risky}}(\text{GO} \mid s) = 0.8, \quad \pi_{\text{risky}}(\text{BRAKE} \mid s) = 0.2

Now the policy is not saying “always pick the better one.” It is saying “most of the time go, sometimes brake.” So the value of state s under this policy must reflect both possibilities. That means the max disappears and an average appears:

V_{\pi_{\text{risky}}}(s) = 0.8 \, Q_{\pi_{\text{risky}}}(s,\text{GO}) + 0.2 \, Q_{\pi_{\text{risky}}}(s,\text{BRAKE})

More generally,

V_\pi(s) = \sum_a \pi(a \mid s) \, Q_\pi(s,a)

And since Q_\pi(s,a) is just the expected return obtained by taking action a in state s and then continuing with policy \pi, we can write

V_\pi(s) = \sum_a \pi(a \mid s) \, \mathbb{E}\big[ G_t \mid S_t = s, A_t = a \big]

or equivalently,

V_\pi(s) = \mathbb{E}_{A \sim \pi(\cdot \mid s)} \big[ G_t \mid S_t = s \big]

This is the conceptual bridge from the previous blog to this one. In on-policy control, the state value came from the best action. In off-policy prediction, the state value comes from the policy’s action distribution. The moment we stop asking “what is the best move?” and start asking “what return does this policy produce?”, the \max naturally turns into an expectation.
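The whole contrast fits in one line of arithmetic. Here is a tiny sketch: the Q-values are made-up numbers, and only the 0.8/0.2 probabilities come from the yellow-light example above.

```python
# Control takes a max over actions; prediction takes an expectation
# under the policy's action distribution.
q = {"GO": 10.0, "BRAKE": -1.0}       # hypothetical action-values Q(s, a)
pi_risky = {"GO": 0.8, "BRAKE": 0.2}  # pi_risky(a | s) from the yellow-light example

v_control = max(q.values())                        # V*(s) = max_a Q(s, a)
v_prediction = sum(pi_risky[a] * q[a] for a in q)  # V_pi(s) = sum_a pi(a|s) Q(s, a)

print(v_control)     # 10.0
print(v_prediction)  # 0.8 * 10 + 0.2 * (-1) = 7.8
```

The gap between 10.0 and 7.8 is exactly the difference between “I choose the best action” and “the policy chooses for me.”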

Importance Sampling

So let’s slow things down and understand what it means to evaluate a policy we never executed.

Under the risky policy, the self-driving car would sometimes GO, sometimes STOP, in proportions dictated by \pi_{\text{risky}}. If we could run that policy, we would simply average the observed returns according to the actions it chose. In other words, the value of state s under \pi_{\text{risky}} is just the expected return produced by the actions that policy prefers.

V^{\pi_{\text{risky}}}(s) = \mathbb{E}_{A \sim \pi_{\text{risky}}(\cdot \mid s)}[G \mid s] = \sum_a \pi_{\text{risky}}(a \mid s) \, \mathbb{E}[G \mid s, a]

But we don’t actually have any samples drawn from \pi_{\text{risky}}. Every return we’ve ever seen came from the safe policy \pi_{\text{safe}}. So the question becomes:

How do we rewrite an expectation taken under one policy in terms of data generated by another?

To do this, we perform a small algebraic trick: multiply and divide each term by \pi_{\text{safe}} (which is non-zero for every action the safe policy might take). Since we’re multiplying by 1, nothing changes numerically:

V^{\pi_{\text{risky}}}(s) = \sum_a \pi_{\text{risky}}(a \mid s) \, G(a) \cdot \frac{\pi_{\text{safe}}(a \mid s)}{\pi_{\text{safe}}(a \mid s)}

Now rearrange the terms:

= \sum_a \pi_{\text{safe}}(a \mid s) \left( \frac{\pi_{\text{risky}}(a \mid s)}{\pi_{\text{safe}}(a \mid s)} \right) G(a)

This expression is beautiful because:

- the outer sum is now an expectation over actions drawn from \pi_{\text{safe}}, which is exactly the data we have;
- the only new ingredient is the ratio \frac{\pi_{\text{risky}}(a \mid s)}{\pi_{\text{safe}}(a \mid s)}, computed purely from known policy probabilities.

This weighting term is what allows off-policy Monte Carlo to evaluate \pi_{\text{risky}} without ever executing it. This correction factor is exactly where importance sampling weights come from. With this reformulation in hand, we can compute off-policy values using only the \pi_{\text{safe}} data.
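We can sanity-check the identity by exact enumeration. In this sketch the per-action returns and the policy probabilities are numbers chosen purely for illustration:

```python
# Verify: E_{pi_risky}[G] == E_{pi_safe}[(pi_risky / pi_safe) * G]
g = {"GO": 10.0, "STOP": -1.0}       # hypothetical expected return per action
pi_risky = {"GO": 0.9, "STOP": 0.1}  # target policy
pi_safe = {"GO": 0.2, "STOP": 0.8}   # behavior policy

direct = sum(pi_risky[a] * g[a] for a in g)  # expectation under the risky policy
corrected = sum(pi_safe[a] * (pi_risky[a] / pi_safe[a]) * g[a] for a in g)

print(direct, corrected)  # both approximately 8.9
```

The two sums agree term by term, because multiplying and dividing by \pi_{\text{safe}} changes nothing. The difference only appears once we replace the exact sum with samples.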

Putting it into action

Let’s shrink Scenario 2 (the self-driving car) into the simplest possible MDP. Just one state. Two actions. One reward. No transitions. This lets us focus on the core idea without algebra swallowing the message.

Environment

One decision at a traffic light:

- GO: reward +10 (a safe pass through the intersection)
- STOP: reward −1 (a small delay penalty)

We have two policies.

Behavior Policy (the safe policy used during data collection, where STOP is preferred)

b = \pi_{\text{safe}}: \quad b(\text{GO} \mid s_0) = 0.2, \qquad b(\text{STOP} \mid s_0) = 0.8

Target Policy (the risky policy we want to evaluate, where GO is preferred)

\pi = \pi_{\text{risky}}: \quad \pi(\text{GO} \mid s_0) = 0.9, \qquad \pi(\text{STOP} \mid s_0) = 0.1

What we want:

V^{\pi_{\text{risky}}}(s_0) = \mathbb{E} \big[ G \mid s_0 \text{ and act using } \pi_{\text{risky}} \big]

But we never actually drive using \pi_{\text{risky}}, because it’s unsafe. All real data comes from \pi_{\text{safe}}.

Suppose we let the car approach this intersection 3 times, each time following the safe policy. The collected episodes:

| Episode | Action Taken | Reward |
| --- | --- | --- |
| 1 | STOP | -1 |
| 2 | GO (safe outcome) | +10 |
| 3 | STOP | -1 |

Everything in this dataset was generated by \pi_{\text{safe}}, not \pi_{\text{risky}}. If we did on-policy MC for the safe policy, we’d simply average the rewards, because that’s exactly what we did in on-policy Monte Carlo: generate multiple episodes using the same policy and average the returns we observed. Every time a state (or state–action pair) appeared, we recorded its full return and updated its value as the mean of all those returns across episodes.

\hat{V}^{\pi_{\text{safe}}}(s_0) = \frac{-1 + 10 - 1}{3} = \frac{8}{3} \approx 2.67

That tells us how good the safe policy is. To answer “what if we had driven under \pi_{\text{risky}}?”, we must adjust each episode according to importance sampling:

w = \frac{ \pi_{\text{risky}}(a \mid s_0) }{ \pi_{\text{safe}}(a \mid s_0) }

This ratio tells us:

“How much more (or less) likely was this action under the risky policy compared to the safe one?”

Let’s compute them.

Episode 1 Action: STOP

b(\text{STOP}) = 0.8 and \pi(\text{STOP}) = 0.1

w_1 = \frac{0.1}{0.8} = 0.125

Episode 2 Action: GO

b(\text{GO}) = 0.2 and \pi(\text{GO}) = 0.9

w_2 = \frac{0.9}{0.2} = 4.5

Episode 3 Action: STOP

w_3 = \frac{0.1}{0.8} = 0.125

Now our dataset looks like this:

| Ep | Action | Reward G | Weight w |
| --- | --- | --- | --- |
| 1 | STOP | -1 | 0.125 |
| 2 | GO | +10 | 4.5 |
| 3 | STOP | -1 | 0.125 |

Now instead of averaging raw returns, we weight them.

V^{\pi_{\text{risky}}}(s_0) \approx \frac{1}{3} \big( w_1(-1) + w_2(10) + w_3(-1) \big)

Plug in numbers:

= \frac{1}{3} \big( 0.125(-1) + 4.5(10) + 0.125(-1) \big) = \frac{1}{3} \big( -0.125 + 45 - 0.125 \big) = \frac{44.75}{3} \approx 14.92

Now suddenly the estimate explodes upward. Why?

Because the one GO event gets amplified by 4.5 times.

In our three episodes, the safe policy produced two STOP actions and one GO. Under the safe policy that makes sense: it stops most of the time. But the risky policy behaves very differently. It almost always goes. Under the risky policy, GO is far more common than under the safe one. So every GO sample becomes extremely influential. Importance sampling compensates for this mismatch.
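The whole worked example fits in a few lines of Python (a sketch of ordinary importance sampling over the three logged episodes):

```python
# Three episodes logged under the safe policy: (action, return)
episodes = [("STOP", -1.0), ("GO", 10.0), ("STOP", -1.0)]
pi_risky = {"GO": 0.9, "STOP": 0.1}  # target policy
pi_safe = {"GO": 0.2, "STOP": 0.8}   # behavior policy

# On-policy estimate for the safe policy: a plain average of returns.
v_safe = sum(g for _, g in episodes) / len(episodes)

# Off-policy estimate for the risky policy: weight each return by pi / b.
weights = [pi_risky[a] / pi_safe[a] for a, _ in episodes]
v_risky = sum(w * g for w, (_, g) in zip(weights, episodes)) / len(episodes)

print(round(v_safe, 2))   # 2.67
print(round(v_risky, 2))  # 14.92
```

Same three numbers in, wildly different estimates out: the single reweighted GO sample dominates the off-policy average.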

From a Single State to a Full Episode

Real environments, whether drones, trading bots, or self-driving cars, don’t make just one decision. They make many decisions per episode.

So we now face the more general question:

How do we evaluate a target policy when each episode consists of multiple decisions, each taken under the behavior policy?

To extend importance sampling beyond a single action, we must look at the entire trajectory.

A real drone flight episode might look like:

s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} \cdots \xrightarrow{a_{T-1}} s_T

And the return we care about is no longer just a single reward; instead it is the whole discounted sum:

G = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots + \gamma^{T-1} r_{T-1}

When the behavior policy \pi_{\text{safe}} generated this trajectory, it selected each action according to its own probabilities:

b(a_0 \mid s_0), \quad b(a_1 \mid s_1), \quad \ldots, \quad b(a_{T-1} \mid s_{T-1})

Under the target policy, the same actions would have been chosen with probabilities:

\pi(a_0 \mid s_0), \quad \pi(a_1 \mid s_1), \quad \ldots, \quad \pi(a_{T-1} \mid s_{T-1})

We extend the same idea we used in the single-state case. Instead of one importance ratio, we now have one ratio per step, multiplied together:

\rho_{0:T-1} = \frac{\pi(a_0 \mid s_0)}{b(a_0 \mid s_0)} \cdot \frac{\pi(a_1 \mid s_1)}{b(a_1 \mid s_1)} \cdots \frac{\pi(a_{T-1} \mid s_{T-1})}{b(a_{T-1} \mid s_{T-1})} = \prod_{t=0}^{T-1} \frac{ \pi(a_t \mid s_t) }{ b(a_t \mid s_t) }

Why a product?

Because the probability of a trajectory under a policy is the product of action probabilities along the path. If we want to convert expectations from one policy’s trajectory distribution to another’s, we must correct every decision along the way.

So the Monte Carlo value update becomes (weighting the return from time t by the ratios from t onward):

V(S_t) \leftarrow V(S_t) + \alpha \Big( \rho_{t:T-1} \, G_t - V(S_t) \Big)

Look carefully at what changed compared to ordinary Monte Carlo.

Ordinary Monte Carlo:

V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))

Off-policy Monte Carlo:

V(S_t) \leftarrow V(S_t) + \alpha (\rho \, G_t - V(S_t))

Just one extra multiplier. That’s it. But that one multiplier changes everything.

Now comes the uncomfortable truth. Imagine a short 3-decision episode where the safe policy is conservative at every step, and the risky policy is aggressive at every step. Suppose at each of the three states, the risky policy chooses the aggressive action with probability 0.9, while the safe policy chooses it with probability 0.2. If one episode happens to contain aggressive actions all three times, then the ratio becomes

\rho = \left( \frac{0.9}{0.2} \right)^3 = 4.5^3 = 91.125

So one “lucky” aggressive trajectory can count as ninety-one episodes worth of evidence. That’s not a small correction anymore. That’s a hijacking of your estimate.
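The compounding is easy to verify with a few lines, using the 0.9/0.2 numbers from the 3-step example above:

```python
import math

# Three decisions, each 0.9 likely under the risky policy but only
# 0.2 likely under the safe policy that actually generated the data.
per_step_ratios = [0.9 / 0.2] * 3
rho = math.prod(per_step_ratios)  # product of the per-step importance ratios

print(rho)  # approximately 91.125
```

Add a fourth aggressive step and the weight passes 410; importance sampling weights grow multiplicatively with episode length, which is exactly why long-horizon off-policy MC has such high variance.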

And there’s an even harsher truth hidden inside that product: it tells us how much more (or less) likely the entire trajectory would have been under \pi_{\text{risky}} compared to \pi_{\text{safe}}. If the target policy can choose an action in a state, the behavior policy must assign that action non-zero probability. Formally:

\text{if} \quad \pi(a \mid s) > 0, \quad \text{then} \quad b(a \mid s) > 0

Otherwise the ratio \frac{ \pi(a \mid s) }{ b(a \mid s) } is undefined (division by zero), and you literally cannot evaluate that part of the target policy from your data. That means:

The behavior policy \pi_{\text{safe}} must sometimes do everything that \pi_{\text{risky}} might do. No data → no evaluation.

Weighted importance sampling

From here we have two ways to use importance sampling in the off-policy Monte Carlo method:

- Ordinary importance sampling: average the weighted returns, dividing by the number of episodes.
- Weighted importance sampling: divide by the sum of the weights instead, trading a little bias for much lower variance.

An important thing to keep in mind here: V^{\pi}(s_0) is “the expected return when we start in state s_0 and then behave according to policy \pi”, so a full episode needs episode-length weights instead of one-step weights.

\textbf{Algorithm: Off-policy MC prediction update} \\[6pt] \textbf{input: } \begin{cases} E = \big[(s_0,a_0,r_0), (s_1,a_1,r_1), \ldots, (s_{T-1},a_{T-1},r_{T-1})\big] \\ b(a \mid s)\ \text{behavior policy} \\ \pi(a \mid s)\ \text{target policy} \\ \gamma\ \text{discount factor} \end{cases} \\[10pt] \textbf{persistent: } \begin{cases} V(s)\ \text{current value estimates} \\ C(s)\ \text{cumulative normalizing weights} \end{cases} \\[12pt] 1.\hspace{0.5cm} G \leftarrow 0 \\[4pt] 2.\hspace{0.5cm} w \leftarrow 1 \\[10pt] 3.\hspace{0.5cm} \textbf{for } t \leftarrow T-1\ \text{down to}\ 0\ \textbf{do} \\[4pt] 4.\hspace{1.2cm} s_t,a_t,r_t \leftarrow E[t] \\[6pt] 5.\hspace{1.2cm} G \leftarrow r_t + \gamma G \\[6pt] 6.\hspace{1.2cm} w \leftarrow w \cdot \dfrac{\pi(a_t \mid s_t)}{b(a_t \mid s_t)} \\[10pt] 7.\hspace{1.2cm} C(s_t) \leftarrow C(s_t) + w \\[8pt] 8.\hspace{1.2cm} V(s_t) \leftarrow V(s_t) + \dfrac{w}{C(s_t)} \big( G - V(s_t) \big)
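A minimal Python version of this update (the function name and the dict-based V and C tables are my own choices; policies are passed as callables returning \pi(a \mid s) and b(a \mid s), and the early break when the weight hits zero is just a shortcut, since all remaining updates would be zero):

```python
def mc_prediction_update(episode, pi, b, gamma, V, C):
    """Weighted-IS backward pass over one episode, mirroring the pseudocode.

    episode: list of (s, a, r) tuples generated by the behavior policy b.
    pi(s, a), b(s, a): action probabilities under target / behavior policy.
    V, C: persistent dicts of value estimates and cumulative weights.
    """
    G, w = 0.0, 1.0
    for s, a, r in reversed(episode):  # t = T-1 down to 0
        G = r + gamma * G              # return from time t onward
        w *= pi(s, a) / b(s, a)        # accumulate the importance ratio
        if w == 0.0:                   # pi never takes a: no evidence earlier
            break
        C[s] = C.get(s, 0.0) + w       # cumulative normalizing weight
        V[s] = V.get(s, 0.0) + (w / C[s]) * (G - V.get(s, 0.0))
    return V, C
```

Feeding it single-decision traffic-light episodes, for example one GO (+10) and one STOP (−1), drives V(s_0) toward the weighted-IS estimate (4.5 \cdot 10 + 0.125 \cdot (-1)) / (4.5 + 0.125) \approx 9.70.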

\textbf{Algorithm: Training Off-policy MC predictor} \\[6pt] \textbf{input: } \begin{cases} \text{MDP black box} \\ b(a \mid s)\ \text{behavior policy (used to act)} \\ \pi(a \mid s)\ \text{target policy (to evaluate)} \\ \gamma\ \text{discount factor} \\ N\ \text{number of training episodes} \\ s_0\ \text{starting state} \\ S_{\text{end}}\ \text{set of terminal states} \end{cases} \\[10pt] \textbf{persistent: } \begin{cases} V(s)\ \text{value estimates for states, initially zero} \\ C(s)\ \text{cumulative normalizing weights, initially zero} \end{cases} \\[12pt] 1.\hspace{0.5cm} \textbf{for } i \leftarrow 1 \text{ to } N\ \textbf{do} \\[4pt] 2.\hspace{1.2cm} \text{initialize empty episode list } E \\[4pt] 3.\hspace{1.2cm} s \leftarrow s_0 \\[8pt] 4.\hspace{1.2cm} \textbf{while } s \notin S_{\text{end}}\ \textbf{do} \\[4pt] 5.\hspace{2cm} \text{sample } a \sim b(\cdot \mid s) \\[4pt] 6.\hspace{2cm} r,\ s' \leftarrow \text{MDP black box}(s,a) \\[4pt] 7.\hspace{2cm} \text{append } (s,a,r) \text{ to episode } E \\[4pt] 8.\hspace{2cm} s \leftarrow s' \\[8pt] 9.\hspace{1.2cm} \text{Off-policy MC prediction update (Weighted IS) using episode } E \\[10pt] 10.\hspace{0.5cm} \textbf{return } V
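Put together, the training loop for the one-state traffic light looks like this. It is a simulation sketch: the rewards and probabilities are the ones from the worked example, the episode count and function name are my own, and one state means plain scalars can stand in for the V and C tables:

```python
import random

def evaluate_risky_policy(n_episodes=20000, seed=0):
    """Estimate V^pi_risky(s0) from episodes generated only by pi_safe."""
    rng = random.Random(seed)
    pi = {"GO": 0.9, "STOP": 0.1}    # target (risky) policy
    b = {"GO": 0.2, "STOP": 0.8}     # behavior (safe) policy, used to act
    reward = {"GO": 10.0, "STOP": -1.0}

    V, C = 0.0, 0.0                  # one state, so scalars suffice
    for _ in range(n_episodes):
        a = "GO" if rng.random() < b["GO"] else "STOP"  # sample a ~ b
        G = reward[a]                # episode ends after one decision
        w = pi[a] / b[a]             # importance ratio for this episode
        C += w                       # cumulative normalizing weight
        V += (w / C) * (G - V)       # weighted-IS incremental update
    return V

print(evaluate_risky_policy())  # close to the true value 0.9*10 + 0.1*(-1) = 8.9
```

Even though roughly 80% of the logged episodes are STOPs, the estimate lands near 8.9 rather than near the safe policy’s 2.67, because the weights rebalance the data toward the risky policy’s action distribution.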

From Off-Policy Prediction to Off-Policy Control

So far in this post, we’ve stayed in the prediction world:

Given a fixed target policy \pi, how do we estimate V^\pi using episodes generated by another policy b?

We can now safely evaluate risky or experimental policies from logged data, but in reinforcement learning we rarely want to stop at prediction. Eventually, we want control:

Can I use off-policy Monte Carlo ideas not just to evaluate a policy, but to actually learn a better one?

Conceptually, off-policy MC control is “just” off-policy prediction + policy improvement:

  1. Behavior policy b:

    used to explore and generate episodes (e.g., the safe driving policy, the conservative trader).

  2. Target policy \pi:

    a greedy (or almost greedy) policy with respect to our current value estimates.

  3. Evaluate π\pi off-policy using importance sampling.

  4. Improve π\pi by making it greedier w.r.t. updated values.

  5. Repeat.

For control, it’s more natural to work with action-values Q(s,a) instead of just state values.

Once we have Q, we can easily define a greedy target policy:

\pi(s) = \arg\max_a Q(s,a)

Now the game looks like this:

We generate episodes under b, but we update Q as if we had followed \pi, using importance sampling at the state–action level.

For a single episode:

E = [(s_0,a_0,r_0),\ (s_1,a_1,r_1),\ \ldots,\ (s_{T-1},a_{T-1},r_{T-1})]

For each time step t, we can compute:

- the return G_t, the discounted sum of rewards from time t to the end of the episode;
- the importance weight w_t = \prod_{k=t+1}^{T-1} \frac{\pi(a_k \mid s_k)}{b(a_k \mid s_k)}, starting at t+1 because the action a_t is given, so we only correct for how the rest of the episode was generated.

Then w_t G_t is a corrected Monte Carlo return. The same old “full return from time t” idea, except now it’s been reweighted so that even though the episode was generated by b, it counts as evidence for \pi.

Now a weighted-IS style update for each visited pair (st,at)(s_t,a_t) looks like:

C(s_t,a_t) \leftarrow C(s_t,a_t) + w_t

Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \frac{w_t}{C(s_t,a_t)} \big( G_t - Q(s_t,a_t) \big)

In control, we usually make the target policy greedy with respect to our current Q. In its purest form it’s deterministic. After processing the whole episode, we update the target policy:

\pi(s) \leftarrow \arg\max_a Q(s,a)

We keep behaving with b (which must still explore all actions that \pi might want), but our evaluation target \pi keeps getting greedier.
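Here is a sketch of that control loop on the one-state traffic light. It is deliberately degenerate: episodes are one step long, so the correction for steps after taking (s_0, a) is an empty product and every weight is 1; the point is the structure (behave with b, learn Q, make \pi greedy), not the weighting. Names and episode counts are my own:

```python
import random

def off_policy_mc_control(n_episodes=5000, seed=1):
    """Learn a greedy target policy from safe-policy data (one-state MDP)."""
    rng = random.Random(seed)
    b = {"GO": 0.2, "STOP": 0.8}       # exploratory behavior policy
    reward = {"GO": 10.0, "STOP": -1.0}

    Q = {"GO": 0.0, "STOP": 0.0}       # Q(s0, a) for the single state
    C = {"GO": 0.0, "STOP": 0.0}
    for _ in range(n_episodes):
        a = "GO" if rng.random() < b["GO"] else "STOP"  # behave with b
        G = reward[a]
        # Episodes end after one step, so the correction for the steps
        # *after* (s0, a) is an empty product: w = 1 for every sample.
        w = 1.0
        C[a] += w
        Q[a] += (w / C[a]) * (G - Q[a])                 # weighted-IS update
    greedy_action = max(Q, key=Q.get)  # pi(s0) = argmax_a Q(s0, a)
    return Q, greedy_action

Q, a_star = off_policy_mc_control()
print(a_star)  # GO
```

The behavior policy stops 80% of the time, yet the learned target policy is “always GO”: the agent never had to act greedily to discover that greedy is better.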

From Episodes to Step-by-Step Learning

Off-policy Monte Carlo control gives us a powerful idea: we can learn a better policy without ever acting according to it. But MC has one big limitation: it only learns after an episode ends. In long or complicated tasks, this makes learning slow, unstable, and sometimes impractical.

So the natural next question is:

Can we update our estimates during an episode instead of waiting until the end?

This is exactly what Temporal-Difference (TD) learning does. Instead of waiting for the full return, TD methods learn step-by-step, adjusting their value estimates continuously.