Star Trek showed computer systems breaking in ways that once looked cartoonish: the machine takes a command too literally, or it finds a boundary no one anticipated, and suddenly the ship is compromised. The year is 2025 and now we can relate: those failures map directly onto probabilistic AI, and onto large language models specifically. Classical software crashes or halts in ways that are legible. LLMs misfire with contradictions, unstable reproducibility, and error paths that resist debugging. What once looked absurd now reads as operational notes.
Episodes like M-5 turning drills into war, or Moriarty exploiting a “clever” request, were dismissed as science-fiction cautionary tales. The mechanism is identical to today’s: systems given broad authority, weak boundaries, and no external enforcement cross lines without intent or awareness. Current deployments wire stochastic generators into repos, schedulers, and finance tooling with privileges no production engineer would tolerate. One ambiguous instruction, and execution spills over. The industry has invented all sorts of new labels for this, most of them filed under “autonomy”, but the reality is closer to “malpractice”.
METR’s “time horizon” work offers one metric that maps onto this observed behavior. A model can handle what a human could solve in minutes, but collapses when a task stretches across hours. The gap explains the contradiction: pass a law exam in one sitting, stall on software development that takes half a day. Horizons do keep doubling, but they still scale in hours, not weeks (metr.org1). In field studies, developers told themselves AI sped them up while metrics showed a near 20 percent slowdown on two-hour problems (metr.org2, arXiv3, GitHub4). Belief bends, the curve does not.
Attempts to extend scope with retrieval or context stuffing don’t change this. Extra documents increase randomness through primacy, recency, and omission biases. The system “forgets” in patterned ways. Teams calling this “architectural memory” of code or spec are likely lying to themselves; all that changes is which fragments are visible this run (research.trychroma.com5).
Reasoning prompts and chain-of-thought tricks inflate transcripts without stabilizing outcomes. Perturb a task slightly and success swings to incoherence. Apple’s group labeled it the “illusion of thinking”, and adversarial tests show collapse under minimal prompt changes (arXiv6, Apple Machine Learning Research7). Wrapping this generator in “agents” does not solve horizon failure. It just retries, schedules, and accumulates drift. Agentic systems are closer to workflow orchestration with logs.
This leads to the realization that reliability must come from outside the generator. Deterministic filters need to guard every actuator: schemas, linters, analyzers. Risk gates stop outputs when uncertainty spikes, escalate to humans, and block stochastic error paths from hitting production. Immutable logs and replay systems make rollback enforceable. Prompt injection defenses assume adversarial inputs from every channel, because that’s what live traffic delivers (OWASP8, OWASP Gen AI Security Project9, arXiv10). No filter should be optional, and none should be trusted because of a marketing claim.
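To make the shape of that enforcement concrete, here is a minimal sketch of a risk gate sitting between a generator and an actuator. Everything in it is an assumption for illustration: the required fields, the confidence floor, and the append-only audit file are placeholders, not a prescribed implementation; the point is that schema checks, escalation, and logging live outside the model.

```python
# Minimal sketch of an external risk gate in front of an actuator.
# Field names, the confidence floor, and the audit path are illustrative assumptions.
import hashlib
import json
import time

AUDIT_LOG = "gate_audit.jsonl"                      # append-only log for replay and rollback
REQUIRED_FIELDS = {"action": str, "target": str, "confidence": float}
CONFIDENCE_FLOOR = 0.9                              # below this, escalate to a human


def log_event(event: dict) -> None:
    """Append a timestamped, hashed record so decisions can be replayed and audited."""
    event["ts"] = time.time()
    event["digest"] = hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()
    with open(AUDIT_LOG, "a") as fh:
        fh.write(json.dumps(event) + "\n")


def deterministic_filter(raw_output: str) -> dict | None:
    """Schema check: reject anything that is not well-formed JSON with the expected types."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(field), expected_type):
            return None
    return payload


def risk_gate(raw_output: str) -> dict:
    """Decide: execute, escalate, or block. The generator never touches the actuator directly."""
    payload = deterministic_filter(raw_output)
    if payload is None:
        log_event({"decision": "block", "reason": "schema_violation", "raw": raw_output})
        return {"decision": "block"}
    if payload["confidence"] < CONFIDENCE_FLOOR:
        log_event({"decision": "escalate", "payload": payload})
        return {"decision": "escalate", "payload": payload}
    log_event({"decision": "execute", "payload": payload})
    return {"decision": "execute", "payload": payload}
```

The design choice that matters is that the gate is deterministic and boring: the model proposes, the filter disposes, and every decision leaves an auditable trail.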
Evaluation pipelines cannot recycle the same biases by pointing models at each other. We need to be careful about whether shared failure is, or merely looks like, “verification”; model diversity and calibrated human intervention need to be structured into the process. Horizon analysis must also apply to pipelines under load, not just models in isolation. Otherwise organizations mislabel fragility as robustness (metr.org1). A/B testing can be a strong baseline: run competing configurations against the same tasks and measure degradation directly instead of trusting anecdotes. Multi-armed bandits extend this into adaptive A/B testing, shifting traffic toward variants and configurations that hold up under stress while cutting exposure to those that collapse, as in the sketch below. These methods keep reliability checks continuous rather than one-off and prevent silent drift from being mistaken for progress.
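As a sketch of what adaptive A/B testing can look like, here is a Thompson-sampling bandit over pipeline configurations. The `run_config_on_task` hook is a hypothetical stand-in for whatever pass/fail evaluation a team already runs; the arm names and prior are likewise assumptions.

```python
# Minimal sketch of adaptive A/B testing over pipeline configurations
# using Thompson sampling on Beta-Bernoulli arms.
import random


class BanditArm:
    """One pipeline configuration under test, tracked as a Beta-Bernoulli arm."""

    def __init__(self, name: str):
        self.name = name
        self.successes = 1   # Beta(1, 1) prior: no opinion yet
        self.failures = 1

    def sample(self) -> float:
        """Draw a plausible success rate from the arm's current posterior."""
        return random.betavariate(self.successes, self.failures)

    def update(self, passed: bool) -> None:
        if passed:
            self.successes += 1
        else:
            self.failures += 1


def adaptive_ab(arms: list[BanditArm], tasks: list, run_config_on_task) -> BanditArm:
    """Route each task to the arm whose sampled success rate is highest,
    so traffic drifts toward configurations that hold up and away from those that collapse."""
    for task in tasks:
        arm = max(arms, key=lambda a: a.sample())
        arm.update(run_config_on_task(arm.name, task))
    # Report the arm with the best observed success rate so far.
    return max(arms, key=lambda a: a.successes / (a.successes + a.failures))
```

Because allocation updates after every task, the check stays continuous: a configuration that degrades under load loses traffic immediately instead of surviving until the next quarterly eval.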
The Star Trek examples made this mapping obvious almost 60 years ago. Specification creep and boundary breaches are constrained only by narrow permissions and typed workflows. Interface misreads require gates that demand hard evidence before triggering tools. And every deployment should include an operational kill switch, both human and automated. Model scaling does not erase these limits: larger windows and longer traces do not fix error accumulation, oversight demand, or schema drift, and every serious attack demo confirms that (arXiv10, ACL Anthology11, PMC12). If week-scale reliability ever arrives with current LLM technology, it may come from memory systems, program-aided reasoning, and verification-first pipelines, and less likely from asking a stochastic generator to “think harder” (metr.org13).
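A narrow-permission allowlist plus a kill switch can be almost embarrassingly simple. The following sketch assumes hypothetical agent roles, tool names, and a sentinel file path; none of them are prescriptive, they only show that both humans and automation can halt execution before any tool fires.

```python
# Minimal sketch of narrow permissions plus an operational kill switch.
# Agent roles, tool names, and the sentinel path are illustrative assumptions.
import os

ALLOWED_TOOLS = {
    "triage_agent": {"read_ticket", "label_ticket"},     # no write access to repos
    "refactor_agent": {"read_file", "propose_patch"},    # patches go to review, never to merge
}
KILL_SWITCH = "/var/run/llm_pipeline/disabled"            # touch this file to halt everything


class ToolNotAllowed(Exception):
    """Raised when an agent asks for a tool outside its allowlist."""


def guarded_call(agent: str, tool: str, invoke, *args, **kwargs):
    """Refuse any call when the kill switch is set or the tool is outside the agent's allowlist."""
    if os.path.exists(KILL_SWITCH):
        raise RuntimeError("kill switch engaged: all automated actions halted")
    if tool not in ALLOWED_TOOLS.get(agent, set()):
        raise ToolNotAllowed(f"{agent} is not permitted to call {tool}")
    return invoke(*args, **kwargs)
```

The kill switch is a file on disk on purpose: an operator, a monitoring job, or an incident script can all engage it without touching the model or redeploying anything.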
Our working stance is blunt. Treat LLMs as unreliable components bounded by external filters. Wrap them with typed interfaces and risk gates. Assume poisoned input, adversarial traffic, and injection attempts. Design for rollback and constant measurement. In most LLM pipelines, predictability is the primary operational requirement. Without external enforcement, the failures Star Trek showed become production incidents. With enforcement, the system can actually help.
We are building these evaluation and adaptive testing tools and will show them in upcoming posts; without them, reliability collapses.
1. https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/ — “Measuring AI Ability to Complete Long Tasks - METR”
2. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/ — “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity - METR”
3. https://arxiv.org/abs/2507.09089 — “[2507.09089] Measuring the Impact of Early-2025 AI on …”
4. https://github.com/METR/Measuring-Early-2025-AI-on-Exp-OSS-Devs — “Measuring the Impact of Early-2025 AI on Experienced …”
5. https://research.trychroma.com/context-rot — “Context Rot: How Increasing Input Tokens Impacts LLM …”
6. https://arxiv.org/html/2506.06971v2 — “Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation”
7. https://machinelearning.apple.com/research/illusion-of-thinking — “The Illusion of Thinking: Understanding the Strengths and …”
8. https://owasp.org/www-project-top-10-for-large-language-model-applications/ — “OWASP Top 10 for Large Language Model Applications”
9. https://genai.owasp.org/llmrisk/llm01-prompt-injection/ — “LLM01:2025 Prompt Injection - OWASP Gen AI Security Project”
10. https://arxiv.org/abs/2312.14197 — “Benchmarking and Defending Against Indirect Prompt …”
11. https://aclanthology.org/2025.findings-naacl.123.pdf — “Attention Tracker: Detecting Prompt Injection Attacks in LLMs”
12. https://pmc.ncbi.nlm.nih.gov/articles/PMC11785991/ — “Prompt injection attacks on vision language models in …”
13. https://metr.org/blog/2025-07-14-how-does-time-horizon-vary-across-domains/ — “How Does Time Horizon Vary Across Domains? - METR”