Continuous Agent and Environment Dynamics
May 2026
Why long-horizon agents fail, and a small state table that recovers a surprising fraction of them.
When a long-horizon web agent fails in WebArena, the natural reaction is "the LLM can't plan." After staring at a hundred or so failure trajectories, I have come to a different view: the LLM can plan. What it cannot do is maintain a plan across a noisy, append-only context window. The failure mode looks like execution drift, repeated mistakes, and an inability to recover after the first slip.
This post sketches the argument and the preliminary evidence behind it.
The conjecture
Modern web agents typically receive an append-only stream: instruction, action₀, observation₀, action₁, observation₁, … Each turn extends the context. In principle a sufficiently capable LLM should track the full history and act accordingly. In practice three things go wrong:
- Append-only is not state. There is no stable, updatable representation of "where I am in the task." The model must re-derive the current stage from a long, growing string of tokens at every step.
- Important facts get buried. A confirmed price, address, or constraint from step 3 is lost in the haystack by step 25.
- Failure has no slot. When something goes wrong, there is no explicit "I just failed; here are my options" representation, so the model often repeats the failed action.
The complaint that "LLMs can't plan" is, on this view, a downstream symptom. The upstream cause is that the LLM has no execution state — only a transcript.
Error taxonomy
To check this, I labeled 100 failed trajectories from a WebArena-style setup. The breakdown:
| Error type | Mechanism | Count |
|---|---|---|
| Execution-state drift | History grows but no updatable state; loses track of "where am I" | 29 |
| Important facts buried | Lost-in-the-haystack on facts already observed | 23 |
| Repair failure | After a local error, no explicit failure type or recovery options | 17 |
| Environment dynamics gap | Missing UI / affordance priors | 18 |
| Genuine planning gap | Even with clean state, cannot decompose | 13 |
| Total |  | 100 |
The first three rows — 69 of 100 — are all symptoms of the missing-state hypothesis. Only the last 13 cases look like a genuine planning ceiling.
Counterfactual state intervention
If the diagnosis is right, then injecting a small, well-typed state slot at the moment of failure should rescue some of these trajectories. The setup:
For each failed trajectory, freeze the environment state at the first clearly-wrong step. Re-run from that point with the same model and the same page, varying only the context representation:
- A. Original append-only stream.
- B. Original stream plus longer context.
- C. Original stream plus an oracle state summary.
- D. Original stream plus goal + current stage + verified facts + uncertainty + last failure type.
- E. Original stream plus a same-length irrelevant summary (control).
The interesting condition is D: a short, structured state slot. The aggregate rescue rate across all five buckets was 46%, with the per-bucket numbers:
| Bucket | Minimal intervention | Rescue rate |
|---|---|---|
| Execution-state drift | Explicit stage state | 45% |
| Important facts buried | Verified-facts slot | 52% |
| Repair failure | Recovery slot (failure type + options) | 47% |
| Environment dynamics gap | Affordance hints | 50% |
| Planning gap | Plan sketch + decision checklist | 31% |
The four state-related buckets all rescue at 45–52%. The planning bucket — the only one that should be capacity-bound — rescues at 31%. The gap is consistent with the hypothesis: most "planning failures" are actually state failures wearing a planning costume.
Method sketch: an explicit state table
If a short structured slot recovers half of the state-related failures with oracle injection, the question becomes: can we maintain such a slot during execution, online, without an oracle?
The proposal is a state table with three families of entries:
- Goal state. Claims to verify, current sub-goal, completed sub-goals.
- Experience state. Stale-locator ledger, repeat-action counts, last failure and recovery options.
- Environment state. Commit affordances ("there is a Submit button at bid 789"), form-submission watermarks.
A small updater step maintains the table after each observation. The policy step reads the table alongside the recent context.
Four concrete failure buckets
To make the proposal less abstract, here are four real failures and the single typed slot that would have changed the action.
1. Sent-wrong-answer shortcut
Original. Task: "count reviews mentioning 'disappointed'." Agent opens the shopping-admin home, sees Reviews: 3214 on the dashboard, and at step 3 calls send_msg_to_user("3214"). It never enters the filter flow.
With state. Step 1 the updater extracts goal_claims = [{claim: "count reviews mentioning 'disappointed'", status: "unknown", evidence_step: null}]. At step 3 the policy sees status=unknown and evidence_step=null; the prompt directly states "there is an unverified claim, do not submit," and the policy keeps walking the search/filter path instead of taking the shortcut.
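The gate described above reduces to one check over the goal_claims slot. Claim extraction itself (an LLM call in practice) is out of scope here, and may_submit is a hypothetical helper name:

```python
def may_submit(goal_claims: list[dict]) -> bool:
    """Block send_msg_to_user while any claim is unverified.

    A claim counts as verified only when its status is 'verified'
    and it points at the step that produced the evidence.
    """
    return all(
        c.get("status") == "verified" and c.get("evidence_step") is not None
        for c in goal_claims
    )
```

With the step-1 slot from the example, may_submit returns False at step 3, so the dashboard number never gets shortcut-submitted.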
2. Stale-bid loop
Original. On /admin/catalog/product/, the agent clicks bid 233 thirteen times in a row. Neither the URL nor the accessibility tree changes between clicks; the model never registers that it is spinning in place.
With state. By step 3 the runtime has filled stale_locator_ledger = [{bid: "233", click_count: 3, last_url: "/admin/catalog/product/"}] and step_env_signature.steps_since_change = 3. By step 5, with click_count=5, the policy prompt explicitly notes "bid 233 has been clicked 5 times, URL has not changed in 5 steps," and the agent is steered toward back or a different locator.
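Assuming the runtime maintains the ledger and the steps-since-change counter as described, the staleness check itself is tiny; is_spinning and its threshold are illustrative:

```python
def is_spinning(ledger: dict[str, dict], env_steps_since_change: int,
                bid: str, threshold: int = 3) -> bool:
    """Flag a locator as stale once it has been clicked `threshold` times
    while the URL and accessibility tree stayed unchanged."""
    entry = ledger.get(bid, {"click_count": 0})
    return (entry["click_count"] >= threshold
            and env_steps_since_change >= threshold)
```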
3. Post-error recovery failure
Original. click [456] raises scroll_into_view_if_needed timeout 500ms. On the next step the model issues click [456] again, or clicks a neighbouring bid 457. No recovery action is attempted.
With state. The runtime fills post_error_flag = {active: true, error_step: 3, error_text: "TimeoutError 500ms", action_that_errored: "click [456]"}. At step 4 the policy prompt sees active=true and is required to draw from a recovery class — back, reload, scroll — rather than repeating the action.
4. Modal commit-button missed
Original. On a GitLab "create issue" page, the agent fills in the title and description, sees that the form is "filled," and replies send_msg_to_user("issue created"). The "Create issue" button was never clicked.
With state. The updater fills commit_affordance = {has_submit: true, submit_bid: "789", submit_label: "Create issue", enabled: true} and form_submission_watermark = {last_fill_step: 4, submitted: false}. At decision time the policy sees has_submit=true && submitted=false, clicks [789] first, and only then reports completion.
None of these are "more capable model" stories. The same model on the same page picks a different action, because one small typed slot is exposed to the policy.
What is left
The sketch above is the easy half. The harder questions:
- How is the state table built online? The oracle experiment shows a 46% rescue is reachable in principle. The online updater introduces noise: claim extraction errors, mis-classified failures, missed affordances. The interesting empirical question is how much of the oracle ceiling survives the updater.
- When does typed experience transfer across environments? Raw-trajectory replay is a known anti-pattern. Condition-matched replay — typed by stage and environment signature — is what we want. A sequential variant benchmark (E0 → E1 → E2) makes the transfer question testable.
- What is the right benchmark? Static WebArena measures a subset. Sequential variants of the same template, plus counterfactual perturbations (renamed fields, hidden prerequisites, transient failures), expose recovery and transfer far more directly than the static success rate does.
If the conjecture is right, the next step in agent research is not longer context windows or fancier planners. It is giving the agent a small, stable, typed place to keep its execution state, and the discipline to read from and write to it.