LAUNCH ETA: 2026 May

Cheap Local, Expensive Global Correctness

The most dangerous AI-generated code is not obviously wrong. It compiles. It uses the right library. It follows the local pattern. It may even pass the nearby unit tests. The problem is that it can still be wrong at levels that local checks never exercise, especially those that rest on unstated assumptions: the system boundary, the retry path, the ordering guarantee, the source of truth, the state transition, or the invariant nobody wrote down.

That is the real shift LLMs create for engineering organizations. They do not remove the need for engineering judgment. They make judgment the bottleneck.

Before LLMs, execution was expensive enough to act as a brake. Writing code took time. Exploring five variants took time. Adding an abstraction took time. Building a new integration took time. That cost was not inherently good, but it created a pause in which someone usually had to think: Does this belong here? What does this depend on? What happens if it runs twice? What if messages arrive out of order? What owns this state?

LLMs reduce that pause. They make it easy to produce code that is locally plausible before anyone has re-derived the system-level constraints that make the code correct.

Our failure mode is discipline erosion.

From derivation to plausibility checking

A subtle review downgrade happens when AI enters the loop. The old standard, at least in good systems work, was:

Is this globally correct under the relevant interleavings, failures, and ownership boundaries?

The new default can become:

Does this look locally reasonable?

The shift is small enough to go unnoticed but large enough to rot a system. An LLM can correctly tell you that an upsert is idempotent. Locally, that may be true. But the system question is different. Is the operation commutative with other writes? Does it preserve ordering? Does it interact correctly with snapshot reads? Does retrying it after a timeout create a second externally visible effect? Does it collapse distinct business events into one state? Does the downstream consumer rely on seeing every transition, not just the final row?
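A minimal Python sketch of the upsert point, assuming a hypothetical in-memory table and made-up order statuses (none of these names come from a real system). It illustrates how a write can be idempotent and still be order-sensitive, and how a reader of the final row never sees an intermediate transition.

```python
# Hypothetical in-memory "table": order_id -> latest status.
def upsert(table, order_id, status):
    """Idempotent write: applying it twice leaves the same state."""
    table[order_id] = status


# Local property: idempotency holds.
table = {}
upsert(table, "o1", "paid")
upsert(table, "o1", "paid")
idempotent_state = dict(table)

# System property 1: the writes do not commute, so reordered
# delivery of the same two writes changes the final state.
in_order = {}
for status in ["paid", "refunded"]:
    upsert(in_order, "o1", status)

out_of_order = {}
for status in ["refunded", "paid"]:
    upsert(out_of_order, "o1", status)

# System property 2: a consumer that only reads the final row
# never observes the intermediate "paid" transition.
seen_by_row_reader = [in_order["o1"]]
```

The upsert is correct in isolation. Whether it is correct in the system depends on whether delivery order is guaranteed and whether any consumer needs the full transition history, which the code itself cannot tell you.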

Those are invariant questions. AI is good at generating the local implementation shape. It is much weaker at knowing which unstated invariants matter unless the surrounding system has made those invariants explicit. The model can autocomplete code; it cannot reliably infer organizational lore, incident history, implicit ownership, or the business meaning of state transitions buried across services.

That means the scarce resource moves. The bottleneck is no longer typing, scaffolding, or even first-pass implementation. The bottleneck is integration: deciding whether a plausible change is actually correct inside a living system.

Cognitive surrender, but for code review

Shaw and Nave describe a related phenomenon as cognitive surrender: the moment AI is no longer just assisting thought but replacing it, with the person adopting the output as their own judgment.[1] In their framing, AI becomes a third cognitive path: not fast intuition, not slow deliberation, but external automated reasoning that can bypass both.

That maps onto what we observe in software work. A model proposes a patch. The patch is coherent. The explanation is fluent. The tests it suggests are plausible. The reviewer’s posture can change from deriving correctness to “validating vibes”. The human remains in the loop, but no longer supplies the same kind of judgment. Engineers are not careless, but fluent output suppresses the signals that normally trigger scrutiny. Confusion, contradiction, missing pieces, and implementation effort all create useful friction. They force the reviewer to reconstruct the problem. AI removes many of those signals by producing a complete-looking answer immediately.

Much friction is waste. Waiting on builds, fighting bad tooling, manually wiring boilerplate, or hunting through tribal documentation is not rigor. It is drag. But some friction is load-bearing. The useful kind forces the system questions back into view:

  • What invariant is this change relying on?
  • Who owns the state being modified?
  • What happens under retry, reordering, duplication, or partial failure?
  • Which behavior must remain true across services?
  • How would we know this is wrong in production?

AI removes a lot of bad friction. The danger is that it also removes the friction that used to trigger those questions.

Generation is cheap; comprehension is not

Once code generation becomes cheap, engineering organizations are tempted to increase output volume. More tickets. More parallel work. More prototypes. More internal tools. More migrations. More “small” changes.

But comprehension does not scale with output volume. A team can easily generate more code than it can understand. It can create more variants than it can evaluate, add more surface area than anyone can own. When that happens, the system does not become more productive. It becomes less legible.

That changes what leadership should optimize for. The old instinct was to maximize throughput: keep people utilized, parallelize work, increase surface area, and reduce blockers. Under cheap execution, those instincts can destroy value. Parallelism without comprehension creates drift. Coordination does not scale just because code generation does.

The job shifts from maximizing activity to protecting cognition, and while this may sound abstract, it has concrete consequences. Teams need smaller ownership surfaces. Interfaces need to be explicit. Tests need to encode real invariants, not just local examples. Rollback paths need to be cheap. Observability needs to answer semantic questions, not just infrastructure questions. Review needs to force the author to state why the change is globally safe.
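As one illustration of a test that encodes an invariant instead of a single local example, here is a small Python sketch. The ledger functions and event tuples are hypothetical. The invariant assumed here is that the final balance must not depend on delivery order or on a redelivered event.

```python
from itertools import permutations


def apply_events_naive(events):
    """Hypothetical ledger: sums every delivered amount."""
    return sum(amount for _event_id, amount in events)


def apply_events_deduped(events):
    """Hypothetical ledger: sums amounts, ignoring repeated event IDs."""
    seen = set()
    balance = 0
    for event_id, amount in events:
        if event_id in seen:
            continue
        seen.add(event_id)
        balance += amount
    return balance


def holds_under_reorder_and_retry(apply_fn, events):
    """Check the invariant across every delivery order plus one retry."""
    expected = apply_fn(events)
    for order in permutations(events):
        # Redeliver the first event at the end to simulate a retry.
        redelivered = list(order) + [order[0]]
        if apply_fn(redelivered) != expected:
            return False
    return True


EVENTS = [("e1", 100), ("e2", -30), ("e3", 5)]
```

Both implementations pass the local example (the balance for `EVENTS` is 75 either way). Only the invariant check distinguishes them: the naive version fails as soon as an event is delivered twice.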

AI favors addition; good systems require deletion

LLMs are biased toward addition. Ask for a fix and they will often add a helper, a fallback, an abstraction, a compatibility layer, an edge-case branch, or a new configuration option. Each piece may be reasonable in isolation. Together they bury intent. This is why AI-assisted engineering needs a stronger deletion culture. The default question should become: what can we remove so that the correct solution becomes obvious?

That applies to code, architecture, process, and documentation. A system that requires long explanations to modify safely is not ready for high-volume AI-assisted change. If an agent needs three pages of caveats to touch a domain, the domain is probably too implicit, too coupled, or too large.

A practical response is defensive engineering:

  • Prefer smaller components with clear ownership.
  • Make sources of truth explicit.
  • Keep interfaces narrow and documented.
  • Encode invariants in tests, contracts, and monitors.
  • Delete abstractions that only exist because nobody remembers why they were added.
  • Treat generated code as disposable draft material, not as progress.

The goal is not to slow teams down. The goal is to keep the system understandable enough that speed remains useful.

Make the system legible to humans first, agents second

A lot of AI tooling discussion starts from agents: how to give them more context, more tools, more autonomy, more permissions. That may be the wrong order. The first question should be whether the system is legible. If meaning lives in people’s heads, old incidents, Slack threads, implicit coupling, and undocumented workflows, agents will rediscover ambiguity every time they operate. Long runs will produce churn. The model will generate locally reasonable changes because local reasonableness is all the system exposes.

A better approach is to make the smallest useful amount of system knowledge explicit. Not exhaustive documentation. Exhaustive documentation rots. The useful layer is thin, versioned, and close to the code. In many well-designed systems, not much of it is needed.

Reintroduce friction where correctness depends on context

A recent nontechnical piece from The School of Life proposes a simple three-step practice for thinking with AI: first write what you already think, then ask what you might be missing, then decide what you think now.[2] Stripped of its self-help framing, the useful pattern is this: do not let AI be the first cognitive move. Establish a position, introduce doubt, then integrate. Engineering teams need the same pattern, but expressed as review mechanics. Before AI output is accepted, the author should state:

  1. Position: what invariant or behavior the change is intended to preserve.
  2. Doubt: what failure mode, edge case, or cross-system interaction could make it wrong.
  3. Integration: what test, monitor, contract, rollback path, or owner sign-off makes acceptance safe.
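One way the three statements could be made mechanical is a small record that a review tool refuses to accept while any field is empty. This is only a sketch: the `ChangeRationale` class and `missing_fields` helper are hypothetical names, and the example values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRationale:
    """Hypothetical record an author fills in before review."""
    position: str     # invariant or behavior the change preserves
    doubt: str        # failure mode or interaction that could break it
    integration: str  # test, monitor, contract, rollback, or sign-off


def missing_fields(rationale):
    """Return the names of the statements left blank."""
    names = ("position", "doubt", "integration")
    return [n for n in names if not getattr(rationale, n).strip()]


# Illustrative values only.
complete = ChangeRationale(
    position="Each payment event changes the balance at most once.",
    doubt="A timeout followed by a retry could apply the event twice.",
    integration="Replay test that redelivers every event once.",
)
incomplete = ChangeRationale(
    position="Keeps the order status consistent.",
    doubt="   ",
    integration="",
)
```

A check like this cannot judge whether the statements are true. It only ensures the author wrote them down before asking a reviewer to evaluate the change.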

Add deliberate friction, but only the minimum structure needed to prevent plausible local output from bypassing system-level reasoning.

The same pattern applies to strategy. Romasanta, Thomas, and Levina call one LLM failure mode “trendslop”: polished recommendations that converge on fashionable generic ideas rather than context-specific judgment.[3] In engineering, the equivalent is architecture slop: plausible abstractions, familiar patterns, and confident explanations that ignore the actual constraints of the system. The antidote is to make context harder to skip.

What to change in practice

If an engineering organization is serious about AI-assisted development, the operating model should change in a few specific ways.

First, treat generated code as a draft. A draft can be useful, but it has no authority. The author still owns the reasoning.

Second, move review away from style and toward invariants. “Looks good” is not a review. “This preserves idempotency because retries are keyed by event ID and downstream consumers only observe committed transitions” is a review.
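To make the quoted review statement concrete, here is a minimal Python sketch of a consumer whose retries are keyed by event ID. The class name, event IDs, and transition labels are hypothetical, and the in-memory set stands in for whatever durable store a real system would need.

```python
class TransitionConsumer:
    """Hypothetical consumer: applies each event ID at most once."""

    def __init__(self):
        self.processed_ids = set()  # stand-in for a durable dedupe store
        self.committed = []         # transitions visible downstream

    def handle(self, event_id, transition):
        """Commit the transition unless this event ID was already handled."""
        if event_id in self.processed_ids:
            return False  # retry: no second visible effect
        self.committed.append(transition)
        self.processed_ids.add(event_id)
        return True


consumer = TransitionConsumer()
first = consumer.handle("evt-1", "paid")
retry = consumer.handle("evt-1", "paid")      # e.g. redelivery after a timeout
second = consumer.handle("evt-2", "shipped")
```

The review claim is then checkable: a retried event leaves `committed` unchanged. In a real system the dedupe record and the committed transition would have to be written atomically, which this sketch does not model.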

Third, make ownership narrower and stronger. Shared ownership sounds collaborative, but under high output volume it often means nobody has a complete enough model to make decisive calls. Someone must be able to say no, revert aggressively, and simplify the domain.

Fourth, prefer oracles over opinions. Tests, contracts, type constraints, property checks, monitors, and replayable scenarios are better than reviewer confidence. AI makes confidence cheap. Oracles make correctness inspectable.
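A small example of what an oracle can look like, sketched in Python under assumed names: an explicit table of allowed state transitions and a function that checks any recorded history against it. The states and the `ALLOWED` table are invented for illustration.

```python
# Hypothetical contract: which status may follow which.
ALLOWED = {
    "created": {"paid", "cancelled"},
    "paid": {"shipped", "refunded"},
    "shipped": set(),
    "refunded": set(),
    "cancelled": set(),
}


def first_illegal_transition(history):
    """Return the first (previous, next) pair that violates the contract,
    or None if the whole history is legal."""
    for prev, nxt in zip(history, history[1:]):
        if nxt not in ALLOWED.get(prev, set()):
            return (prev, nxt)
    return None


legal = first_illegal_transition(["created", "paid", "shipped"])
illegal = first_illegal_transition(["created", "shipped"])
```

The same function can run in a unit test, over replayed production histories, or inside a monitor. The reviewer no longer has to trust that a change "looks right", because the contract either holds on the recorded histories or it does not.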

Fifth, make deletion visible. If AI increases additive pressure, leaders need to reward negative work: removing branches, shrinking APIs, deleting unused abstractions, collapsing workflows, and reducing the amount of context required to make a safe change.

None of this is new engineering wisdom. LLMs did not change what good engineering is. They changed how quickly teams pay for ignoring it.

Our thesis

“Rigor requires friction” is a useful shorthand. More precisely, rigor requires triggered deliberation. Historically, implementation effort, uncertainty, and confusion triggered that deliberation. AI removes many of those triggers by making plausible answers immediate.

So the work now is to put friction where correctness depends on context:

  • at ownership boundaries,
  • around state transitions,
  • before irreversible changes,
  • where retries and ordering matter,
  • where business meaning is implicit,
  • where no strong oracle exists.

AI makes local correctness cheap, not global correctness. Until systems can express their own invariants completely and machines can verify them reliably, human judgment remains the scarce resource.

The organizations that benefit most from AI will not be the ones that generate the most code. They will be the ones that keep their systems simple enough, explicit enough, and observable enough that generated code can be safely judged.

Speed without that structure will become entropy at scale.


  1. Steven D. Shaw and Gideon Nave, “Thinking — Fast, Slow, and Artificial: How AI is Reshaping Human Reasoning and the Rise of Cognitive Surrender,” SSRN, 2026: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6097646

  2. The School of Life, “Thinking Well in the Age of AI,” published April 28, 2026: https://www.theschooloflife.com/web-article/thinking-well-in-the-age-of-ai/

  3. Angelo Romasanta, Llewellyn D. W. Thomas, and Natalia Levina, “Researchers Asked LLMs for Strategic Advice. They Got ‘Trendslop’ in Return,” Harvard Business Review, March 16, 2026: https://hbr.org/2026/03/researchers-asked-llms-for-strategic-advice-they-got-trendslop-in-return