Reinforcement Learning: Off-Policy Monte Carlo

Explore how importance sampling lets RL evaluate policies without executing them.


In the last blog, our drone learned by doing. And “doing” meant crashing into buildings, wasting battery, drifting into restricted airspace, and spiraling around in the wind until it finally figured things out. The agent had to live through its own mistakes in order to learn from them. That’s fine when we’re training a tiny drone in Grid City, but this approach quickly hits a wall when the world becomes more serious. Let’s look at a couple of scenarios to understand why.

Scenario 1: A Trading Bot

Imagine you’re training a trading agent. Unlike a drone, it can’t afford early crashes: one bad trade may cost millions. But luckily, your firm already has an existing automated trader running in production.

This existing trader isn’t optimal. It’s slow, cautious, maybe outdated, but it’s safe, battle-tested, and profitable enough to trust in production. If the bot learns on-policy, it must behave using its own early policy, and that policy will be awful.

What you actually want is:

Learn an optimal or improved policy from the trades already available without executing the bad ones.

Scenario 2: A self-driving car

Now imagine a self-driving car. The current system, deployed in the real world, follows:

Policy \pi_{\text{safe}} (safe-but-slow)

This policy is extremely safe, but traffic efficiency is terrible, trips take longer, users complain, and competitors are faster. Your research team has designed a new, more assertive driving policy:

Policy \pi_{\text{risky}} (fast-but-unsafe-to-test)

This policy might dramatically improve travel time, but there is a critical problem. You cannot put \pi_{\text{risky}} on the road to collect experience, because one bad decision can cost a life. So now the dilemma is:

I want to know how good \pi_{\text{risky}} is… but I don’t want my car to actually drive according to \pi_{\text{risky}} yet.

When you say “how good,” what you really mean is: if my self-driving car were to always follow the risky policy \pi_{\text{risky}}, what would its expected long-term return be?

This adds an important nuance. What you’re really saying is:

I want to evaluate the consequences of \pi_{\text{risky}} before letting my real car behave that way in the real world.

Prediction vs Control

In both of our stories, the trading bot and the self-driving car, there’s a subtle but very important shift in the question we’re asking. Up until now, our Monte Carlo drone was mostly answering:

What should I do in each state to get the best long-term return?

That’s the control problem. But in the self-driving example with \pi_{\text{safe}} and \pi_{\text{risky}}, we suddenly started asking a different question:

If I were to follow this policy all the time, how good would it be?

That’s the prediction problem.

So in reinforcement learning, we usually separate things into two tasks:

- Prediction: given a fixed policy, estimate how good it is.
- Control: find the policy that maximizes return.

In the last post we learned an on-policy control algorithm using action-values Q(s,a). In this post, we take a step sideways: we focus on off-policy prediction, evaluating a target policy from data generated by another policy. For clarity, we will write this in terms of state values V(s), but all of it can be done with Q-values as well. If the policy \pi is fixed, it already tells us which action will be used, so the value we care about is

V^\pi(s) = \mathbb{E}^\pi [ G_t \mid S_t = s ]

which means: if I start in state s, and from that point onward I strictly follow policy \pi, what total discounted reward should I expect?

Bridging the Gap

That equation may look familiar and unfamiliar at the same time. Familiar, because we are still talking about return. Unfamiliar, because in the previous blog our main object was not V_\pi(s). It was the action-value Q(s,a), and the whole story revolved around choosing the best action.

There, the drone stood in a state, looked at all available actions, and asked a control-style question:

“What should I do next?”

That is why the key relationship looked like

V^*(s) = \max_a Q^*(s,a)

The value of a state was obtained by looking across actions and picking the best one. The max was the mathematical expression of control. It encoded a very specific attitude: from this state, I am free to choose whichever action gives the highest long-term return.

But now the question has quietly changed.

In off-policy prediction, we are no longer asking what the best action is. We are asking what happens if we commit to a particular policy and keep following it. The policy may be cautious, reckless, random, or somewhere in between. Whatever it is, once the policy is fixed, we are no longer free to take a max over actions. The policy itself decides how actions are chosen.

To make this concrete, imagine the self-driving car reaches a yellow light. There are only two actions:

a_1 = \text{GO}, \quad a_2 = \text{BRAKE}

If we were solving a control problem, we would compare the two action-values and keep the larger one:

V^*(s) = \max \{ Q^*(s,\text{GO}),\ Q^*(s,\text{BRAKE}) \}

But suppose instead that the risky policy does not always do the same thing. Maybe in this state it chooses

\pi_{\text{risky}}(\text{GO} \mid s) = 0.8, \quad \pi_{\text{risky}}(\text{BRAKE} \mid s) = 0.2

Now the policy is not saying “always pick the better one.” It is saying “most of the time go, sometimes brake.” So the value of state s under this policy must reflect both possibilities. That means the max disappears and an average appears:

V_{\pi_{\text{risky}}}(s) = 0.8 \, Q_{\pi_{\text{risky}}}(s,\text{GO}) + 0.2 \, Q_{\pi_{\text{risky}}}(s,\text{BRAKE})

More generally,

V_\pi(s) = \sum_a \pi(a \mid s) \, Q_\pi(s,a)

And since Q_\pi(s,a) is just the expected return obtained by taking action a in state s and then continuing with policy \pi, we can write

V_\pi(s) = \sum_a \pi(a \mid s) \, \mathbb{E}\big[ G_t \mid S_t = s, A_t = a \big]

or equivalently,

V_\pi(s) = \mathbb{E}_{A \sim \pi(\cdot \mid s)} \big[ G_t \mid S_t = s \big]

This is the conceptual bridge from the previous blog to this one. In on-policy control, the state value came from the best action. In off-policy prediction, the state value comes from the policy’s action distribution. The moment we stop asking “what is the best move?” and start asking “what return does this policy produce?”, the \max naturally turns into an expectation.
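The whole contrast fits in one line of arithmetic. Here is a tiny sketch: the Q-values are made-up numbers, and only the 0.8/0.2 probabilities come from the yellow-light example above.

```python
# Control takes a max over actions; prediction takes an expectation
# under the policy's action distribution.
q = {"GO": 10.0, "BRAKE": -1.0}       # hypothetical action-values Q(s, a)
pi_risky = {"GO": 0.8, "BRAKE": 0.2}  # pi_risky(a | s) from the yellow-light example

v_control = max(q.values())                        # V*(s) = max_a Q(s, a)
v_prediction = sum(pi_risky[a] * q[a] for a in q)  # V_pi(s) = sum_a pi(a|s) Q(s, a)

print(v_control)     # 10.0
print(v_prediction)  # 0.8 * 10 + 0.2 * (-1) = 7.8
```

The gap between 10.0 and 7.8 is exactly the difference between “I choose the best action” and “the policy chooses for me.”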

Importance Sampling

So let’s slow things down and understand what it means to evaluate a policy we never executed.

Under the risky policy, the self-driving car would sometimes GO, sometimes STOP, in proportions dictated by \pi_{\text{risky}}. If we could run that policy, we would simply average the observed returns according to the actions it chose. In other words, the value of state s under \pi_{\text{risky}} is just the expected return produced by the actions that policy prefers.

V^{\pi_{\text{risky}}}(s) = \mathbb{E}_{A \sim \pi_{\text{risky}}(\cdot \mid s)}[G \mid s] = \sum_a \pi_{\text{risky}}(a \mid s) \, \mathbb{E}[G \mid s, a]

But we don’t actually have any samples drawn from \pi_{\text{risky}}. Every return we’ve ever seen came from the safe policy \pi_{\text{safe}}. So the question becomes:

How do we rewrite an expectation taken under one policy in terms of data generated by another?

To do this, we perform a small algebraic trick: multiply and divide each term by \pi_{\text{safe}} (which is non-zero for every action the safe policy might take). Since we’re multiplying by 1, nothing changes numerically:

V^{\pi_{\text{risky}}}(s) = \sum_a \pi_{\text{risky}}(a \mid s) \, G(a) \cdot \frac{\pi_{\text{safe}}(a \mid s)}{\pi_{\text{safe}}(a \mid s)}

Now rearrange the terms:

= \sum_a \pi_{\text{safe}}(a \mid s) \left( \frac{\pi_{\text{risky}}(a \mid s)}{\pi_{\text{safe}}(a \mid s)} \right) G(a)

This expression is beautiful because:

- the outer sum is now an expectation over actions drawn from \pi_{\text{safe}}, which is exactly the data we have;
- the only new ingredient is the ratio \frac{\pi_{\text{risky}}(a \mid s)}{\pi_{\text{safe}}(a \mid s)}, computed purely from known policy probabilities.

This weighting term is what allows off-policy Monte Carlo to evaluate \pi_{\text{risky}} without ever executing it. This correction factor is exactly where importance sampling weights come from. With this reformulation in hand, we can compute off-policy values using only the \pi_{\text{safe}} data.
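We can sanity-check the identity by exact enumeration. In this sketch the per-action returns and the policy probabilities are numbers chosen purely for illustration:

```python
# Verify: E_{pi_risky}[G] == E_{pi_safe}[(pi_risky / pi_safe) * G]
g = {"GO": 10.0, "STOP": -1.0}       # hypothetical expected return per action
pi_risky = {"GO": 0.9, "STOP": 0.1}  # target policy
pi_safe = {"GO": 0.2, "STOP": 0.8}   # behavior policy

direct = sum(pi_risky[a] * g[a] for a in g)  # expectation under the risky policy
corrected = sum(pi_safe[a] * (pi_risky[a] / pi_safe[a]) * g[a] for a in g)

print(direct, corrected)  # both approximately 8.9
```

The two sums agree term by term, because multiplying and dividing by \pi_{\text{safe}} changes nothing. The difference only appears once we replace the exact sum with samples.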

Putting it into action

Let’s shrink Scenario 2 (the self-driving car) into the simplest possible MDP. Just one state. Two actions. One reward. No transitions. This lets us focus on the core idea without algebra swallowing the message.

Environment

One decision at a traffic light:

- GO: reward +10 (a safe pass through the intersection)
- STOP: reward −1 (a small delay penalty)

We have two policies.

Behavior Policy (the safe policy used during data collection, where STOP is preferred)

b = \pi_{\text{safe}}: \quad b(\text{GO} \mid s_0) = 0.2, \qquad b(\text{STOP} \mid s_0) = 0.8

Target Policy (the risky policy we want to evaluate, where GO is preferred)

\pi = \pi_{\text{risky}}: \quad \pi(\text{GO} \mid s_0) = 0.9, \qquad \pi(\text{STOP} \mid s_0) = 0.1

What we want:

V^{\pi_{\text{risky}}}(s_0) = \mathbb{E} \big[ G \mid s_0 \text{ and act using } \pi_{\text{risky}} \big]

But we never actually drive using \pi_{\text{risky}}, because it’s unsafe. All real data comes from \pi_{\text{safe}}.

Suppose we let the car approach this intersection 3 times, each time following the safe policy. The collected episodes:

| Episode | Action Taken | Reward |
| --- | --- | --- |
| 1 | STOP | -1 |
| 2 | GO (safe outcome) | +10 |
| 3 | STOP | -1 |

Everything in this dataset was generated by \pi_{\text{safe}}, not \pi_{\text{risky}}. If we did on-policy MC for the safe policy, we’d simply average the rewards, because that’s exactly what we did in on-policy Monte Carlo: generate multiple episodes using the same policy and average the returns we observed. Every time a state (or state–action pair) appeared, we recorded its full return and updated its value as the mean of all those returns across episodes.

\hat{V}^{\pi_{\text{safe}}}(s_0) = \frac{-1 + 10 - 1}{3} = \frac{8}{3} \approx 2.67

That tells us how good the safe policy is. To answer “what if we had driven under \pi_{\text{risky}}?”, we must adjust each episode according to importance sampling:

w = \frac{ \pi_{\text{risky}}(a \mid s_0) }{ \pi_{\text{safe}}(a \mid s_0) }

This ratio tells us:

“How much more (or less) likely was this action under the risky policy compared to the safe one?”

Let’s compute them.

Episode 1 Action: STOP

b(\text{STOP}) = 0.8 and \pi(\text{STOP}) = 0.1

w_1 = \frac{0.1}{0.8} = 0.125

Episode 2 Action: GO

b(\text{GO}) = 0.2 and \pi(\text{GO}) = 0.9

w_2 = \frac{0.9}{0.2} = 4.5

Episode 3 Action: STOP

w_3 = \frac{0.1}{0.8} = 0.125

Now our dataset looks like this:

| Ep | Action | Reward G | Weight w |
| --- | --- | --- | --- |
| 1 | STOP | -1 | 0.125 |
| 2 | GO | +10 | 4.5 |
| 3 | STOP | -1 | 0.125 |

Now instead of averaging raw returns, we weight them.

V^{\pi_{\text{risky}}}(s_0) \approx \frac{1}{3} \big( w_1(-1) + w_2(10) + w_3(-1) \big)

Plug in numbers:

= \frac{1}{3} \big( 0.125(-1) + 4.5(10) + 0.125(-1) \big) = \frac{1}{3} \big( -0.125 + 45 - 0.125 \big) = \frac{44.75}{3} \approx 14.92

Now suddenly the estimate explodes upward. Why?

Because the one GO event gets amplified by 4.5 times.

In our three episodes, the safe policy produced two STOP actions and one GO. Under the safe policy that makes sense: it stops most of the time. But the risky policy behaves very differently. It almost always goes. Under the risky policy, GO is far more common than under the safe one. So every GO sample becomes extremely influential. Importance sampling compensates for this mismatch.
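The whole worked example fits in a few lines of Python (a sketch of ordinary importance sampling over the three logged episodes):

```python
# Three episodes logged under the safe policy: (action, return)
episodes = [("STOP", -1.0), ("GO", 10.0), ("STOP", -1.0)]
pi_risky = {"GO": 0.9, "STOP": 0.1}  # target policy
pi_safe = {"GO": 0.2, "STOP": 0.8}   # behavior policy

# On-policy estimate for the safe policy: a plain average of returns.
v_safe = sum(g for _, g in episodes) / len(episodes)

# Off-policy estimate for the risky policy: weight each return by pi / b.
weights = [pi_risky[a] / pi_safe[a] for a, _ in episodes]
v_risky = sum(w * g for w, (_, g) in zip(weights, episodes)) / len(episodes)

print(round(v_safe, 2))   # 2.67
print(round(v_risky, 2))  # 14.92
```

Same three numbers in, wildly different estimates out: the single reweighted GO sample dominates the off-policy average.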

From a Single State to a Full Episode

Real environments, whether drones, trading bots, or self-driving cars, don’t make just one decision. They make many decisions per episode.

So we now face the more general question:

How do we evaluate a target policy when each episode consists of multiple decisions, each taken under the behavior policy?

To extend importance sampling beyond a single action, we must look at the entire trajectory.

A real drone flight episode might look like:

s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} \cdots \xrightarrow{a_{T-1}} s_T

And the return we care about is no longer just a single reward; instead it is the whole discounted sum:

G = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots + \gamma^{T-1} r_{T-1}

When the behavior policy \pi_{\text{safe}} generated this trajectory, it selected each action according to its own probabilities:

b(a_0 \mid s_0), \quad b(a_1 \mid s_1), \quad \ldots, \quad b(a_{T-1} \mid s_{T-1})

Under the target policy, the same actions would have been chosen with probabilities:

\pi(a_0 \mid s_0), \quad \pi(a_1 \mid s_1), \quad \ldots, \quad \pi(a_{T-1} \mid s_{T-1})

We extend the same idea we used in the single-state case. Instead of one importance ratio, we now have one ratio per step, multiplied together:

\rho_{0:T-1} = \frac{\pi(a_0 \mid s_0)}{b(a_0 \mid s_0)} \cdot \frac{\pi(a_1 \mid s_1)}{b(a_1 \mid s_1)} \cdots \frac{\pi(a_{T-1} \mid s_{T-1})}{b(a_{T-1} \mid s_{T-1})} = \prod_{t=0}^{T-1} \frac{ \pi(a_t \mid s_t) }{ b(a_t \mid s_t) }

Why a product?

Because the probability of a trajectory under a policy is the product of action probabilities along the path. If we want to convert expectations from one policy’s trajectory distribution to another’s, we must correct every decision along the way.

So the Monte Carlo value update becomes (weighting the return from time t by the ratios from t onward):

V(S_t) \leftarrow V(S_t) + \alpha \Big( \rho_{t:T-1} \, G_t - V(S_t) \Big)

Look carefully at what changed compared to ordinary Monte Carlo.

Ordinary Monte Carlo:

V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))

Off-policy Monte Carlo:

V(S_t) \leftarrow V(S_t) + \alpha (\rho \, G_t - V(S_t))

Just one extra multiplier. That’s it. But that one multiplier changes everything.

Now comes the uncomfortable truth. Imagine a short 3-decision episode where the safe policy is conservative at every step, and the risky policy is aggressive at every step. Suppose at each of the three states, the risky policy chooses the aggressive action with probability 0.9, while the safe policy chooses it with probability 0.2. If one episode happens to contain aggressive actions all three times, then the ratio becomes

\rho = \left( \frac{0.9}{0.2} \right)^3 = 4.5^3 = 91.125

So one “lucky” aggressive trajectory can count as ninety-one episodes worth of evidence. That’s not a small correction anymore. That’s a hijacking of your estimate.
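The compounding is easy to verify with a few lines, using the 0.9/0.2 numbers from the 3-step example above:

```python
import math

# Three decisions, each 0.9 likely under the risky policy but only
# 0.2 likely under the safe policy that actually generated the data.
per_step_ratios = [0.9 / 0.2] * 3
rho = math.prod(per_step_ratios)  # product of the per-step importance ratios

print(rho)  # approximately 91.125
```

Add a fourth aggressive step and the weight passes 410; importance sampling weights grow multiplicatively with episode length, which is exactly why long-horizon off-policy MC has such high variance.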

And there’s an even harsher truth hidden inside that product: it tells us how much more (or less) likely the entire trajectory would have been under \pi_{\text{risky}} compared to \pi_{\text{safe}}. If the target policy can choose an action in a state, the behavior policy must assign that action non-zero probability. Formally:

\text{if} \quad \pi(a \mid s) > 0, \quad \text{then} \quad b(a \mid s) > 0

Otherwise the ratio \frac{ \pi(a \mid s) }{ b(a \mid s) } is undefined (division by zero), and you literally cannot evaluate that part of the target policy from your data. That means:

The behavior policy \pi_{\text{safe}} must sometimes do everything that \pi_{\text{risky}} might do. No data → no evaluation.

Weighted importance sampling

From here we have two ways to use importance sampling in the off-policy Monte Carlo method:

- Ordinary importance sampling: average the weighted returns, dividing by the number of episodes.
- Weighted importance sampling: divide by the sum of the weights instead, trading a little bias for much lower variance.

An important thing to keep in mind here: V^{\pi}(s_0) is “the expected return when we start in state s_0 and then behave according to policy \pi”, so a full episode needs episode-length weights instead of one-step weights.

\textbf{Algorithm: Off-policy MC prediction update} \\[6pt] \textbf{input: } \begin{cases} E = \big[(s_0,a_0,r_0), (s_1,a_1,r_1), \ldots, (s_{T-1},a_{T-1},r_{T-1})\big] \\ b(a \mid s)\ \text{behavior policy} \\ \pi(a \mid s)\ \text{target policy} \\ \gamma\ \text{discount factor} \end{cases} \\[10pt] \textbf{persistent: } \begin{cases} V(s)\ \text{current value estimates} \\ C(s)\ \text{cumulative normalizing weights} \end{cases} \\[12pt] 1.\hspace{0.5cm} G \leftarrow 0 \\[4pt] 2.\hspace{0.5cm} w \leftarrow 1 \\[10pt] 3.\hspace{0.5cm} \textbf{for } t \leftarrow T-1\ \text{down to}\ 0\ \textbf{do} \\[4pt] 4.\hspace{1.2cm} s_t,a_t,r_t \leftarrow E[t] \\[6pt] 5.\hspace{1.2cm} G \leftarrow r_t + \gamma G \\[6pt] 6.\hspace{1.2cm} w \leftarrow w \cdot \dfrac{\pi(a_t \mid s_t)}{b(a_t \mid s_t)} \\[10pt] 7.\hspace{1.2cm} C(s_t) \leftarrow C(s_t) + w \\[8pt] 8.\hspace{1.2cm} V(s_t) \leftarrow V(s_t) + \dfrac{w}{C(s_t)} \big( G - V(s_t) \big)
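A minimal Python version of this update (the function name and the dict-based V and C tables are my own choices; policies are passed as callables returning \pi(a \mid s) and b(a \mid s), and the early break when the weight hits zero is just a shortcut, since all remaining updates would be zero):

```python
def mc_prediction_update(episode, pi, b, gamma, V, C):
    """Weighted-IS backward pass over one episode, mirroring the pseudocode.

    episode: list of (s, a, r) tuples generated by the behavior policy b.
    pi(s, a), b(s, a): action probabilities under target / behavior policy.
    V, C: persistent dicts of value estimates and cumulative weights.
    """
    G, w = 0.0, 1.0
    for s, a, r in reversed(episode):  # t = T-1 down to 0
        G = r + gamma * G              # return from time t onward
        w *= pi(s, a) / b(s, a)        # accumulate the importance ratio
        if w == 0.0:                   # pi never takes a: no evidence earlier
            break
        C[s] = C.get(s, 0.0) + w       # cumulative normalizing weight
        V[s] = V.get(s, 0.0) + (w / C[s]) * (G - V.get(s, 0.0))
    return V, C
```

Feeding it single-decision traffic-light episodes, for example one GO (+10) and one STOP (−1), drives V(s_0) toward the weighted-IS estimate (4.5 \cdot 10 + 0.125 \cdot (-1)) / (4.5 + 0.125) \approx 9.70.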

\textbf{Algorithm: Training Off-policy MC predictor} \\[6pt] \textbf{input: } \begin{cases} \text{MDP black box} \\ b(a \mid s)\ \text{behavior policy (used to act)} \\ \pi(a \mid s)\ \text{target policy (to evaluate)} \\ \gamma\ \text{discount factor} \\ N\ \text{number of training episodes} \\ s_0\ \text{starting state} \\ S_{\text{end}}\ \text{set of terminal states} \end{cases} \\[10pt] \textbf{persistent: } \begin{cases} V(s)\ \text{value estimates for states, initially zero} \\ C(s)\ \text{cumulative normalizing weights, initially zero} \end{cases} \\[12pt] 1.\hspace{0.5cm} \textbf{for } i \leftarrow 1 \text{ to } N\ \textbf{do} \\[4pt] 2.\hspace{1.2cm} \text{initialize empty episode list } E \\[4pt] 3.\hspace{1.2cm} s \leftarrow s_0 \\[8pt] 4.\hspace{1.2cm} \textbf{while } s \notin S_{\text{end}}\ \textbf{do} \\[4pt] 5.\hspace{2cm} \text{sample } a \sim b(\cdot \mid s) \\[4pt] 6.\hspace{2cm} r,\ s' \leftarrow \text{MDP black box}(s,a) \\[4pt] 7.\hspace{2cm} \text{append } (s,a,r) \text{ to episode } E \\[4pt] 8.\hspace{2cm} s \leftarrow s' \\[8pt] 9.\hspace{1.2cm} \text{Off-policy MC prediction update (Weighted IS) using episode } E \\[10pt] 10.\hspace{0.5cm} \textbf{return } V
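Put together, the training loop for the one-state traffic light looks like this. It is a simulation sketch: the rewards and probabilities are the ones from the worked example, the episode count and function name are my own, and one state means plain scalars can stand in for the V and C tables:

```python
import random

def evaluate_risky_policy(n_episodes=20000, seed=0):
    """Estimate V^pi_risky(s0) from episodes generated only by pi_safe."""
    rng = random.Random(seed)
    pi = {"GO": 0.9, "STOP": 0.1}    # target (risky) policy
    b = {"GO": 0.2, "STOP": 0.8}     # behavior (safe) policy, used to act
    reward = {"GO": 10.0, "STOP": -1.0}

    V, C = 0.0, 0.0                  # one state, so scalars suffice
    for _ in range(n_episodes):
        a = "GO" if rng.random() < b["GO"] else "STOP"  # sample a ~ b
        G = reward[a]                # episode ends after one decision
        w = pi[a] / b[a]             # importance ratio for this episode
        C += w                       # cumulative normalizing weight
        V += (w / C) * (G - V)       # weighted-IS incremental update
    return V

print(evaluate_risky_policy())  # close to the true value 0.9*10 + 0.1*(-1) = 8.9
```

Even though roughly 80% of the logged episodes are STOPs, the estimate lands near 8.9 rather than near the safe policy’s 2.67, because the weights rebalance the data toward the risky policy’s action distribution.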

From Off-Policy Prediction to Off-Policy Control

So far in this post, we’ve stayed in the prediction world:

Given a fixed target policy \pi, how do we estimate V^\pi using episodes generated by another policy b?

We can now safely evaluate risky or experimental policies from logged data, but in reinforcement learning we rarely want to stop at prediction. Eventually, we want control:

Can I use off-policy Monte Carlo ideas not just to evaluate a policy, but to actually learn a better one?

Conceptually, off-policy MC control is “just” off-policy prediction + policy improvement:

  1. Behavior policy b:

    used to explore and generate episodes (e.g., the safe driving policy, the conservative trader).

  2. Target policy \pi:

    a greedy (or almost greedy) policy with respect to our current value estimates.

  3. Evaluate π\pi off-policy using importance sampling.

  4. Improve π\pi by making it greedier w.r.t. updated values.

  5. Repeat.

For control, it’s more natural to work with action-values Q(s,a) instead of just state values.

Once we have Q, we can easily define a greedy target policy:

\pi(s) = \arg\max_a Q(s,a)

Now the game looks like this:

We generate episodes under b, but we update Q as if we had followed \pi, using importance sampling at the state–action level.

For a single episode:

E = [(s_0,a_0,r_0),\ (s_1,a_1,r_1),\ \ldots,\ (s_{T-1},a_{T-1},r_{T-1})]

For each time step t, we can compute:

- the return G_t, the discounted sum of rewards from time t to the end of the episode;
- the importance weight w_t = \prod_{k=t+1}^{T-1} \frac{\pi(a_k \mid s_k)}{b(a_k \mid s_k)}, starting at t+1 because the action a_t is given, so we only correct for how the rest of the episode was generated.

Then w_t G_t is a corrected Monte Carlo return. The same old “full return from time t” idea, except now it’s been reweighted so that even though the episode was generated by b, it counts as evidence for \pi.

Now a weighted-IS style update for each visited pair (st,at)(s_t,a_t) looks like:

C(s_t,a_t) \leftarrow C(s_t,a_t) + w_t

Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \frac{w_t}{C(s_t,a_t)} \big( G_t - Q(s_t,a_t) \big)

In control, we usually make the target policy greedy with respect to our current Q. In its purest form it’s deterministic. After processing the whole episode, we update the target policy:

\pi(s) \leftarrow \arg\max_a Q(s,a)

We keep behaving with b (which must still explore all actions that \pi might want), but our evaluation target \pi keeps getting greedier.
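Here is a sketch of that control loop on the one-state traffic light. It is deliberately degenerate: episodes are one step long, so the correction for steps after taking (s_0, a) is an empty product and every weight is 1; the point is the structure (behave with b, learn Q, make \pi greedy), not the weighting. Names and episode counts are my own:

```python
import random

def off_policy_mc_control(n_episodes=5000, seed=1):
    """Learn a greedy target policy from safe-policy data (one-state MDP)."""
    rng = random.Random(seed)
    b = {"GO": 0.2, "STOP": 0.8}       # exploratory behavior policy
    reward = {"GO": 10.0, "STOP": -1.0}

    Q = {"GO": 0.0, "STOP": 0.0}       # Q(s0, a) for the single state
    C = {"GO": 0.0, "STOP": 0.0}
    for _ in range(n_episodes):
        a = "GO" if rng.random() < b["GO"] else "STOP"  # behave with b
        G = reward[a]
        # Episodes end after one step, so the correction for the steps
        # *after* (s0, a) is an empty product: w = 1 for every sample.
        w = 1.0
        C[a] += w
        Q[a] += (w / C[a]) * (G - Q[a])                 # weighted-IS update
    greedy_action = max(Q, key=Q.get)  # pi(s0) = argmax_a Q(s0, a)
    return Q, greedy_action

Q, a_star = off_policy_mc_control()
print(a_star)  # GO
```

The behavior policy stops 80% of the time, yet the learned target policy is “always GO”: the agent never had to act greedily to discover that greedy is better.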

From Episodes to Step-by-Step Learning

Off-policy Monte Carlo control gives us a powerful idea: we can learn a better policy without ever acting according to it. But MC has one big limitation: it only learns after an episode ends. In long or complicated tasks, this makes learning slow, unstable, and sometimes impractical.

So the natural next question is:

Can we update our estimates during an episode instead of waiting until the end?

This is exactly what Temporal-Difference (TD) learning does. Instead of waiting for the full return, TD methods learn step-by-step, adjusting their value estimates continuously.