Continuous Agent and Environment Dynamics
May 2026
Why long-horizon agents fail, and a small state table that recovers a surprising fraction of them.
When a long-horizon web agent fails in WebArena, the natural reaction is "the LLM can't plan." After staring at a hundred or so failure trajectories, I have come to a different view: the LLM can plan. What it cannot do is maintain a plan across a noisy, append-only context window. The failure mode looks like execution drift, repeated mistakes, and an inability to recover after the first slip.
This post sketches the argument and the preliminary evidence behind it.
The conjecture
Modern web agents typically receive an append-only stream: instruction, action₀, observation₀, action₁, observation₁, … Each turn extends the context. In principle a sufficiently capable LLM should track the full history and act accordingly. In practice three things go wrong:
- Append-only is not state. There is no stable, updatable representation of "where I am in the task." The model must re-derive the current stage from a long, growing string of tokens at every step.
- Important facts get buried. A confirmed price, address, or constraint from step 3 is lost in the haystack by step 25.
- Failure has no slot. When something goes wrong, there is no explicit "I just failed; here are my options" representation, so the model often repeats the failed action.
The complaint that "LLMs can't plan" is, on this view, a downstream symptom. The upstream cause is that the LLM has no execution state — only a transcript.
Error taxonomy
To check this, I labeled 100 failed trajectories from a WebArena-style setup. The breakdown:
| Error type | Mechanism | Count |
|---|---|---|
| Execution-state drift | History grows but no updatable state; loses track of "where am I" | 29 |
| Important facts buried | Lost-in-the-haystack on facts already observed | 23 |
| Repair failure | After a local error, no explicit failure type or recovery options | 17 |
| Environment dynamics gap | Missing UI / affordance priors | 18 |
| Genuine planning gap | Even with clean state, cannot decompose | 13 |
| Total |  | 100 |
The first three rows — 69 of 100 — are all symptoms of the missing-state hypothesis. Only the last 13 cases look like a genuine planning ceiling.
Counterfactual state intervention
If the diagnosis is right, then injecting a small, well-typed state slot at the moment of failure should rescue some of these trajectories. The setup:
For each failed trajectory, freeze the environment state at the first clearly-wrong step. Re-run from that point with the same model and the same page, varying only the context representation:
- A. Original append-only stream.
- B. Original stream plus longer context.
- C. Original stream plus an oracle state summary.
- D. Original stream plus goal + current stage + verified facts + uncertainty + last failure type.
- E. Original stream plus a same-length irrelevant summary (control).
The interesting condition is D: a short, structured state slot. The aggregate rescue rate across all five buckets was 46%, with the per-bucket numbers:
| Bucket | Minimal intervention | Rescue rate |
|---|---|---|
| Execution-state drift | Explicit stage state | 45% |
| Important facts buried | Verified-facts slot | 52% |
| Repair failure | Recovery slot (failure type + options) | 47% |
| Environment dynamics gap | Affordance hints | 50% |
| Planning gap | Plan sketch + decision checklist | 31% |
The four state-related buckets all rescue at 45–52%. The planning bucket — the only one that should be capacity-bound — rescues at 31%. The gap is consistent with the hypothesis: most "planning failures" are actually state failures wearing a planning costume.
Method sketch: an explicit state table
If a short structured slot recovers half of the state-related failures with oracle injection, the question becomes: can we maintain such a slot during execution, online, without an oracle?
The proposal is a state table with three families of entries:
- Goal state. Claims to verify, current sub-goal, completed sub-goals.
- Experience state. Stale-locator ledger, repeat-action counts, last failure and recovery options.
- Environment state. Commit affordances ("there is a Submit button at bid 789"), form-submission watermarks.
A small updater step maintains the table after each observation. The policy step reads the table alongside the recent context.
Four concrete failure buckets
To make the proposal less abstract, here are four real failures and the single typed slot that would have changed the action.
1. Sent-wrong-answer shortcut
Original. Task: "count reviews mentioning 'disappointed'." Agent opens the shopping-admin home, sees Reviews: 3214 on the dashboard, and at step 3 calls send_msg_to_user("3214"). It never enters the filter flow.
With state. Step 1 the updater extracts goal_claims = [{claim: "count reviews mentioning 'disappointed'", status: "unknown", evidence_step: null}]. At step 3 the policy sees status=unknown and evidence_step=null; the prompt directly states "there is an unverified claim, do not submit," and the policy keeps walking the search/filter path instead of taking the shortcut.
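The gate described above reduces to one check over the goal_claims slot. Claim extraction itself (an LLM call in practice) is out of scope here, and may_submit is a hypothetical helper name:

```python
def may_submit(goal_claims: list[dict]) -> bool:
    """Block send_msg_to_user while any claim is unverified.

    A claim counts as verified only when its status is 'verified'
    and it points at the step that produced the evidence.
    """
    return all(
        c.get("status") == "verified" and c.get("evidence_step") is not None
        for c in goal_claims
    )
```

With the step-1 slot from the example, may_submit returns False at step 3, so the dashboard number never gets shortcut-submitted.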
2. Stale-bid loop
Original. On /admin/catalog/product/, the agent clicks bid 233 thirteen times in a row. Neither the URL nor the accessibility tree changes between clicks; the model never registers that it is spinning in place.
With state. By step 3 the runtime has filled stale_locator_ledger = [{bid: "233", click_count: 3, last_url: "/admin/catalog/product/"}] and step_env_signature.steps_since_change = 3. By step 5, with click_count=5, the policy prompt explicitly notes "bid 233 has been clicked 5 times, URL has not changed in 5 steps," and the agent is steered toward back or a different locator.
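Assuming the runtime maintains the ledger and the steps-since-change counter as described, the staleness check itself is tiny; is_spinning and its threshold are illustrative:

```python
def is_spinning(ledger: dict[str, dict], env_steps_since_change: int,
                bid: str, threshold: int = 3) -> bool:
    """Flag a locator as stale once it has been clicked `threshold` times
    while the URL and accessibility tree stayed unchanged."""
    entry = ledger.get(bid, {"click_count": 0})
    return (entry["click_count"] >= threshold
            and env_steps_since_change >= threshold)
```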
3. Post-error recovery failure
Original. click [456] raises scroll_into_view_if_needed timeout 500ms. On the next step the model issues click [456] again, or clicks a neighbouring bid 457. No recovery action is attempted.
With state. The runtime fills post_error_flag = {active: true, error_step: 3, error_text: "TimeoutError 500ms", action_that_errored: "click [456]"}. At step 4 the policy prompt sees active=true and is required to draw from a recovery class — back, reload, scroll — rather than repeating the action.
4. Modal commit-button missed
Original. On a GitLab "create issue" page, the agent fills in the title and description, sees that the form is "filled," and replies send_msg_to_user("issue created"). The "Create issue" button was never clicked.
With state. The updater fills commit_affordance = {has_submit: true, submit_bid: "789", submit_label: "Create issue", enabled: true} and form_submission_watermark = {last_fill_step: 4, submitted: false}. At decision time the policy sees has_submit=true && submitted=false, clicks [789] first, and only then reports completion.
None of these are "more capable model" stories. The same model on the same page picks a different action, because one small typed slot is exposed to the policy.
What is left
The sketch above is the easy half. The harder questions:
- How is the state table built online? The oracle experiment shows a 46% rescue is reachable in principle. The online updater introduces noise: claim extraction errors, mis-classified failures, missed affordances. The interesting empirical question is how much of the oracle ceiling survives the updater.
- When does typed experience transfer across environments? Raw-trajectory replay is a known anti-pattern. Condition-matched replay — typed by stage and environment signature — is what we want. A sequential variant benchmark (E0 → E1 → E2) makes the transfer question testable.
- What is the right benchmark? Static WebArena measures a subset. Sequential variants of the same template, plus counterfactual perturbations (renamed fields, hidden prerequisites, transient failures), expose recovery and transfer far more directly than the static success rate does.
If the conjecture is right, the next step in agent research is not longer context windows or fancier planners. It is giving the agent a small, stable, typed place to keep its execution state, and the discipline to read from and write to it.