Since our earlier “short horizons, fragile state, orchestration first” note,1 more data points have been published and the picture has become sharp enough to justify a follow-up post.
We now have:
- A more mature horizon curve for GPT-5.1-Codex-Max from METR.2
- Controlled experiments on chain-of-thought that isolate pattern replay from general reasoning.3 4
- Behavioral work showing LLM-first interfaces produce shallower understanding than web search.5
- Concrete datasets showing RAG still hallucinates.6
- Theory and evidence on model collapse under recursive training.7
- A dense cluster of agent security work that treats LLM agents as untrusted by default.8 9 10
None of this overturns the original thesis; rather, as we discuss below, it solidifies the boundaries that thesis drew.
New METR horizon numbers
METR’s GPT-5.1-Codex-Max report puts new numbers on the same stopwatch they used for GPT-5:2 take a suite of agentic software tasks (HCAST, RE-Bench, SWAA), obtain human time-to-complete estimates, let an agent scaffold drive the model with tools and large token budgets, then fit a logistic curve to find the human task duration at which the agent succeeds 50% (or 80%) of the time.2
The new headline numbers:
- 50% time horizon around 2h42m, with a 95% interval from 75 minutes to 5h50m.2
- 80% time horizon around 30 minutes, interval 9–80 minutes.2
They also run a worst-case extrapolation over their historical trend and land on an upper-bound 50% horizon of about 13h25m by April 2026, still under a 20-hour benchmark they use as a rough threshold for “catastrophic” AI R&D automation or rogue replication.2
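The fitting step can be sketched concretely. The snippet below is our reconstruction of the idea, not METR’s code: regress binary success outcomes on the log of human task duration with a plain logistic model, then invert the fitted curve to read off the duration where success probability crosses 50% (or 80%).

```python
import math

# Sketch of METR-style horizon fitting (our reconstruction, not METR's code):
# logistic regression of success (0/1) on x = log2(task duration in minutes),
# then invert the curve at a target success rate.

def fit_logistic(durations_min, successes, lr=0.1, steps=5000):
    """Plain gradient descent on mean log-loss; p = sigmoid(a + b*x)."""
    xs = [math.log2(d) for d in durations_min]
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, successes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += (p - y) / n
            gb += (p - y) * x / n
        a -= lr * ga
        b -= lr * gb
    return a, b

def horizon_minutes(a, b, target=0.5):
    """Duration at which the fitted curve crosses `target` success rate."""
    # target = sigmoid(a + b*x)  =>  x = (logit(target) - a) / b
    x = (math.log(target / (1 - target)) - a) / b
    return 2 ** x
```

On synthetic data where success drops off beyond some duration, the fitted 50% horizon lands between the last reliable and first unreliable durations, and the 80% horizon comes out shorter than the 50% one, matching the ordering in METR’s numbers.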
A few details matter for interpretation:
- GPT-5.1-Codex-Max received a 32M token budget per task attempt, the highest METR has used. Gains beyond ~5M tokens per attempt were modest.2
- Counting even obvious reward-hacking runs as legitimate success and removing a handful of broken tasks lifts the 50% horizon into the ~5 hour region, with a wide confidence interval stretching into double digits, yet still below their risk thresholds.2
So the model is evidently stronger than GPT-5 on the same metric, but the unit has not changed. We are still talking about hours, not weeks. The Goldilocks band11 from the original essay has expanded from “tens of minutes to low hours” to “tens of minutes to a small handful of hours”. If you squint at the extrapolation, you see a demanding workday; you do not see an autonomous research lab.
This matches the intuitive shape many people already had from hands-on use: long projects feel easier with GPT-5-class tooling, especially with good scaffolding, but you still do not hand over a week-long problem and walk away.
The important update is not “models plateaued”. The update is that a clean external metric, tuned to long tasks, continues to land in a range where orchestration dominates outcome. You live or die by how you break work into segments that fit under that horizon, how you monitor state between segments, and how you repair or restart when the agent drifts.
Chain-of-thought as mirage rather than backbone
The original post argued that long chain-of-thought often plays a cosmetic role. The model imitates patterns from training; practitioners interpret that as reasoning; small perturbations reveal how shallow the structure is. Zhao et al. took that suspicion and built a testbed around it.3 In the “DataAlchemy” framework they train models from scratch in an isolated environment, then probe chain-of-thought across three axes:
- Move within the same task distribution.
- Change surface form while preserving underlying structure.
- Cross into adjacent distributions that share high-level semantics but differ in details.
Their result is that chain-of-thought gains are distribution-bound. As you move away from the training distribution, the apparent reasoning advantage collapses, even when the new tasks are natural variations of the old ones.3 Apple’s GSM-Symbolic work does something similar for arithmetic and algebra.4 They generate large families of math problems by applying symbolic transformations that humans find trivial: changing numbers, permuting irrelevant clauses, altering superficial layout. Models that look strong on GSM8K suddenly miss these near-neighbors.4
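The perturbation idea is simple enough to sketch. This toy generator is our illustration of the transformation types, not the paper’s actual pipeline: it emits near-neighbor variants of a templated word problem by varying the numbers and optionally injecting an irrelevant clause, while the gold answer stays trivially computable.

```python
import random

# Toy GSM-Symbolic-style perturbation generator (illustrative only):
# change the numbers and optionally insert an irrelevant clause; the
# underlying structure, and hence the gold answer, is unchanged.

TEMPLATE = ("{name} has {a} apples and buys {b} more. "
            "{filler}How many apples does {name} have now?")

# Irrelevant clauses that a robust reasoner should ignore.
FILLERS = ["", "The market closes at 5 pm. ", "Her bag is bright blue. "]

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Return (question text, gold answer) for one sampled near-neighbor."""
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    question = TEMPLATE.format(name="Ava", a=a, b=b, filler=rng.choice(FILLERS))
    return question, a + b
```

A human solver treats every variant as the same problem; a model that has memorized surface patterns around specific numbers or phrasings does not, which is exactly the gap GSM-Symbolic measures.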
Taken together, these papers support a view where:
- Chain-of-thought is a format that can improve performance on tasks whose structure already matches training.
- The apparent “long reasoning” you see in verbose traces is often a mirage; it does not survive even moderate shifts in input format or compositional depth.3 4
Link this back to METR’s horizon results. A multi-hour HCAST or RE-Bench task implicitly forces the model to cross distributions: new tool sequences, new intermediate states, new failure modes. You cannot remain inside a narrow, well-rehearsed pattern in the way you can on GSM8K-style problems.
So we get two interacting limits:
- Time horizon: how far you can push the agent along a task axis before success probability drops through 50%.
- Distribution horizon: how far you can perturb format and structure before previously-learned reasoning patterns stop transferring.
You can try to extend the first with more tokens, retries, and scaffolds. The second is harder to patch, because it reflects how training data and model architecture encode the task. The new work suggests both ceilings are closer than rhetoric around “general reasoning” implies.
Benchmarks, perceived speed, and real work
The original essay argued that benchmarks and subjective experience both exaggerate productivity gains from current systems on multi-hour tasks. The new METR report leans into this gap more directly. Their horizon paper from March 2025 introduced the 50% time-horizon metric and compared models against human contractors. Their GPT-5.1-Codex-Max follow-up reiterates a pattern they have now seen multiple times:
- Agents look strongest on carefully instrumented benchmarks, particularly those that resemble their training distributions and scaffolding assumptions.
- Deployed agents in “messier real-world settings” achieve much less, even on tasks that look similar on paper.2
Their open-source developer trial is an uncomfortable example.2 Experienced maintainers working on their own repositories, with tools like Claude 3.5 and Cursor, reported feeling helped; instrumented measurements showed them slower on average under their trial conditions.2 The report mentions “no (or very limited) productivity benefits” even for shorter tasks, and explicitly flags the discrepancy between perceived uplift and measured uplift as a concern.2
There is no reason to treat these numbers as universal. Different teams, scaffolds, and codebases will get different results. The point is a bit narrower: once you benchmark against multi-hour, idiosyncratic tasks in a realistic environment, the headline “IQ points” from test suites and leaderboards lose most of their predictive power.
This aligns with the DataAlchemy and GSM-Symbolic story. If chain-of-thought is brittle under modest distribution shift, then every real codebase—with its own weird abstractions, build systems, and historical scars—is itself a distribution shift. Expect breakdowns, even when the same agent looks reliable on polished, self-contained tasks.
Under those conditions, orchestration remains paramount. You have to:
- constrain tasks to segments that fit inside the measured horizon
- build in verification that speaks the language of your code and infra
- track and correct drift over many interactions, because subjective fluency is not a trustworthy signal
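A concrete version of the second point is to gate every agent-proposed change behind a check in your own toolchain. A minimal sketch, where `test_cmd` is a placeholder for your real test suite or linter invocation:

```python
import subprocess

# Minimal verification gate (a sketch, not a full harness): an agent's
# output is accepted only if a concrete, project-native check passes.
# `test_cmd` stands in for your actual test or lint command.

def gate_on_checks(workdir: str, test_cmd: list[str]) -> bool:
    """Run the verification command in `workdir`; accept only on exit code 0."""
    result = subprocess.run(test_cmd, cwd=workdir, capture_output=True, text=True)
    return result.returncode == 0
```

The design choice matters more than the code: the gate speaks your project’s language (its tests, its linters), so subjective fluency in the agent’s output never substitutes for a passing check.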
The more horizon numbers and CoT analyses we get, the more this looks like standard practice (not pessimism!).
Retrieval, long context, and the discovery gap
The earlier “retrievability is not discovery” argument12 hinged on two observations:
- Embedding-based retrieval favors “things that look like this” in vector space.
- AI summaries compress diverse sources into a single answer that often loses structure and novelty.
Melumad and Yun’s recent PNAS Nexus work compares LLM-based syntheses against classic web search as learning tools.5 Participants were randomly assigned to learn using either an LLM chat interface or a search engine. Afterward, they were asked to give advice or explain the topic; independent human raters scored depth, originality, and practical usefulness.
The pattern demonstrates several limits of LLM-first learning:
- People who learn via LLM syntheses form shallower mental models than those who learn via search, even when both groups see the same underlying factual content.5
- Advice generated from LLM-trained knowledge is sparser, less original, and less likely to be adopted by others.5
The study is about structure and richness, not raw accuracy. It confirms that LLMs compress and smooth, and the devil lives in exactly those details: you get a fluently packaged perspective, but you lose the edges, outliers, and internal contradictions that matter later when something breaks.
On the retrieval side, RAGTruth offers a grounded view of hallucination under retrieval-augmented generation.6 The dataset contains about 18k RAG outputs with span-level annotations indicating where the model goes beyond or against the retrieved documents.6 Even with relevant evidence present, models still produce unsupported content in subtle ways. The gap shrinks compared with pure parametric generation, but it does not vanish.
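RAGTruth’s span-level framing suggests a cheap operational check even without the dataset: flag answer sentences with little lexical overlap against the retrieved evidence. This is a crude heuristic of ours, not the paper’s method, but it illustrates the span-level granularity the annotations work at.

```python
# Crude span-level support check in the spirit of RAGTruth's annotations
# (our heuristic, not the dataset's method): flag output sentences whose
# content words barely overlap with the retrieved evidence.

def unsupported_spans(answer: str, evidence: str, threshold: float = 0.5):
    """Return answer sentences whose word overlap with evidence is below threshold."""
    ev_words = set(evidence.lower().split())
    flagged = []
    for sent in [s.strip() for s in answer.split(".") if s.strip()]:
        words = [w for w in sent.lower().split() if len(w) > 3]  # skip stopword-ish tokens
        if not words:
            continue
        support = sum(w in ev_words for w in words) / len(words)
        if support < threshold:
            flagged.append(sent)
    return flagged
```

A production system would use entailment models rather than word overlap, but even this sketch localizes *where* the answer drifts from the evidence, which is the property that makes span-level annotation useful.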
Finally, Shumailov’s “Curse of Recursion” formalizes the concern that repeated training on model-generated content degrades distributions.7 Under recursive training, rare events disappear first (early collapse), then the model contracts toward a low-variance core (late collapse).7 Follow-up work confirms the effect and explores mitigations through careful mixing of real and synthetic data, but the basic dynamic remains.7
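The collapse dynamic is easy to reproduce in a toy setting: repeatedly fit a Gaussian to the previous generation’s samples and resample from the fit. Each step’s variance estimate is roughly unbiased, yet across generations the distribution contracts, which is the mechanism behind tail loss.

```python
import random
import statistics

# Toy model-collapse simulation: each "generation" fits a Gaussian to the
# previous generation's samples and draws a fresh synthetic dataset from
# the fit -- i.e., training on generated data, with no real data mixed in.

def generation_step(samples, n, rng):
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return [rng.gauss(mu, sigma) for _ in range(n)]

def simulate_collapse(generations=300, n=25, seed=0):
    """Return the per-generation variance of the synthetic distribution."""
    rng = random.Random(seed)
    samples = [rng.gauss(0.0, 1.0) for _ in range(n)]
    variances = []
    for _ in range(generations):
        variances.append(statistics.pvariance(samples))
        samples = generation_step(samples, n, rng)
    return variances
```

With a few hundred generations and small per-generation sample sizes, the variance shrinks by orders of magnitude: rare events vanish first, then the whole distribution narrows, mirroring the early/late collapse phases the paper describes.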
Put these together:
- LLM-first interfaces encourage a quick, shallow pass over a topic.
- RAG improves grounding but does not reliably prevent hallucination, and still inherits retrieval’s bias toward familiar, easy-to-embed content.
- Recursive training on LLM-dominated corpora threatens the distribution tails where novelty and fragile knowledge sit.
The METR report echoes this from a different direction. It notes that some HCAST tasks, especially those involving non-standard or emerging knowledge, remain hard for GPT-5-family systems even as coding benchmarks improve.2 Horizons lengthen for structured engineering tasks, yet discovery and novelty still stall in surprising places.
If you care about problems whose best answers emerge over months or years of fragmentary discussion, for example, new security vulnerabilities, weird library interactions, or methods that go against current consensus, this combination matters more than the headline “context window length”.
Orchestration first, with better evidence
The “orchestration first” part of the original thesis suggested that real systems will end up as workflow engines, routers, and verifiers wrapped around small and medium models, with large models acting as expensive escalation paths rather than central brains. We believe the new evidence still pushes in that direction from multiple sides.
Horizon measurements say that even frontier models settle around hours-scale reliable horizons under careful scaffolding, with realistic best-case projections in the low-double-digit hours.2 That makes them poor candidates as monolithic controllers of multi-day projects. They are, however, well suited to a sequence of bounded subtasks aligned with this band.
Chain-of-thought mirage results suggest relying on structure enforced by the environment, not on the model, to maintain a global plan across distribution shifts.3 4 If a critical property has to hold over time—security invariants, consistency rules, data constraints—encode that in code, schemas, and external checkers, not in a long prompt.
RAGTruth and the model-collapse work recommend keeping human-curated, high-quality external data in the loop, and treating summarization as a lossy operation that should not be the only view anyone sees for long-horizon questions.6 7
Agent security research and zero-trust guidelines say: constrain agents with sandboxing, explicit manifests, narrow tool interfaces, and patterns that separate decision from execution.8 9 10
Viewed through that lens, the “agent” that most organizations can safely run looks less like an autonomous research associate and more like:
- a workflow graph where each node is a small, sharply scoped model call
- a routing layer that chooses which node to activate based on cheap signals and previous results
- a verification layer that checks outputs with other tools (linters, interpreters, additional models) before changing any real system
- a security layer that enforces hard limits on what any single run can see or touch
SLM-centric architectures fit naturally into this shape. You accept that no individual component maintains a coherent view for more than N minutes and that serious reasoning comes from repeated re-entry into a narrow context, guided by external state and metrics.
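The shape described above can be sketched as code. Everything here is hypothetical scaffolding: `run`, `verify`, and `escalate` stand in for a sharply scoped small-model call, an external check, and an expensive frontier-model fallback, respectively.

```python
from dataclasses import dataclass
from typing import Callable

# Sketch of one node in the workflow graph described above. The callables
# are stand-ins: `run` for a scoped small-model call, `verify` for an
# external checker (linter, interpreter, schema), `escalate` for the
# expensive frontier-model fallback.

@dataclass
class Node:
    name: str
    run: Callable[[str], str]        # cheap, sharply scoped model call
    verify: Callable[[str], bool]    # external check before anything is applied
    escalate: Callable[[str], str]   # frontier model as special tool, not brain

def run_node(node: Node, task: str, max_retries: int = 2) -> str:
    """Run one bounded step: try the cheap path, verify, escalate on failure."""
    for _ in range(max_retries):
        out = node.run(task)
        if node.verify(out):
            return out
    out = node.escalate(task)
    if not node.verify(out):
        raise RuntimeError(f"{node.name}: no verified output; hand off to a human")
    return out
```

The key property is that nothing reaches a real system without passing `verify`, and the large model only enters when the cheap path has demonstrably failed, which is the escalation-path role the thesis assigns it.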
The large, frontier model becomes a special tool in that graph. You use it sparingly for steps where small models consistently fail, where context is unusually complex, or where you need cross-domain synthesis. You do not hand it root access and hope that bigger “intelligence” replaces orchestration. Hype aside, this is our straightforward reading of current evidence about horizons, robustness, and security.
Updated statement of the thesis
With these new measurements and papers on the table, the original “short horizons, fragile state, orchestration first” line still stands, with a more detailed backing:
Short horizons: Frontier agents like GPT-5.1-Codex-Max reliably operate within a horizon measured in a few hours, with aggressive extrapolations placing near-term upper bounds in the low-double-digit hours, not days.2 Chain-of-thought work indicates that the apparent “long reasoning” remains tied to training distributions and breaks under moderate shifts.3 4
Fragile state: LLM-first interfaces produce shallower knowledge and less original advice than web search, even when both expose the same facts.5 RAG reduces hallucination but still emits unsupported claims.6 Recursive training on model-generated data erodes rare events and shrinks distributions unless carefully mitigated.7 Agent systems are vulnerable to prompt injection, tool hijacking, and data exfiltration in ways that do not depend on long horizons.8 9 10
Orchestration first: Reliability, safety, and usefulness emerge from how we slice tasks into horizon-fitting segments, how we wire models to tools, and how we isolate and verify each step. The best available work on horizons, chain-of-thought, RAG, and agent security all point toward architectures that treat LLMs as stochastic components inside a larger, structured system rather than the system itself.2 3 4 6 8 9 10
If anything, the new data makes the thesis less speculative and more of a default engineering assumption. Models are stronger, but the shape of their strengths still demands orchestration. The choice is whether to treat that as an annoyance or as the design surface where most of the interesting work now happens.
Tools and Filters for Short Horizons, Fragile State, July 2025. Original essay introducing the “short horizons, fragile state, orchestration first” thesis. (nullmirror.com)
METR, Details about METR’s evaluation of OpenAI GPT-5.1-Codex-Max, November 2025. 50% time horizon ~2h42m, 80% ~30m, worst-case extrapolated 50% horizon ~13h25m by April 2026; report also emphasizes benchmark-to-real-world gaps and limited productivity uplift for experienced developers. (METR’s Autonomy Evaluation Resources)
Chengshuai Zhao et al., Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens, 2025. Introduces the DataAlchemy framework and shows that chain-of-thought gains vanish under modest distribution shifts. (arXiv)
Seyed Iman Mirzadeh et al., GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models, Apple Machine Learning Research, 2024. Demonstrates failures of compositional generalization under simple symbolic transformations of math problems. (ml-site.cdn-apple.com)
Shiri Melumad and Jin Ho Yun, Experimental evidence of the effects of large language models versus web search on depth of learning, PNAS Nexus, 2025. Finds that LLM-based learning yields shallower knowledge and less original, less adoptable advice compared with web search. (OUP Academic)
Chen Niu et al., RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Generation, ACL 2024. Provides ~18k RAG outputs with span-level hallucination annotations, showing persistent unsupported content even with correct retrieval. (ACL Anthology)
Ilia Shumailov et al., The Curse of Recursion: Training on Generated Data Makes Models Forget, 2023, and follow-up commentary on early and late model collapse. Shows that recursive training on model-generated data erodes distribution tails and leads to collapse without careful mixing. (arXiv)
Deng et al., AI Agents Under Threat. Surveys vulnerabilities specific to agent architectures and emphasizes the gap between agent benchmarks and real-world deployments. (ACM Digital Library)
Luca Beurer-Kellner et al., Design Patterns for Securing LLM Agents against Prompt Injections, 2025. Proposes architectural patterns (action-selector, plan-then-execute, code-then-execute, dual-LLM, context-minimization) that treat the LLM as untrusted and constrain agent behavior. (arXiv)
ANSSI & BSI, Design Principles for LLM-based Systems with Zero Trust, 2025, and related commentary. Extends zero-trust security principles to LLM systems, emphasizing least privilege, strong identity, and continuous monitoring of model inputs, outputs, and tool calls. (BSI)
The Goldilocks Horizon: Scaling, Reasoning, and the Stopwatch, August 2025. Discusses the concept of a “Goldilocks” band for model horizons and reasoning capabilities. (nullmirror.com)
Retrievability is not Discovery: Limits of RAG for Novelty and Structure, September 2025. Explores the limitations of retrieval-augmented generation in supporting discovery and maintaining structural richness. (nullmirror.com)