
Agentic AI Raises The Floor More Than The Ceiling


Over the past year, reporting and anecdotes around “agentic” AI in software engineering have shifted from cautious optimism to near certainty. Posts on social media increasingly suggest that software development is on the verge of becoming largely autonomous: write a spec, unleash an agent, supervise lightly, and collect the output. This narrative is built on real progress. The problem, however, is that it extrapolates that progress far beyond the class of problems where it currently holds.

What follows is our attempt to pin down what has actually changed, where the limits remain, and why the public narrative continues to outrun day-to-day engineering experience.

What has actually improved

AI assistance has materially reduced the cost of a large class of software work. Boilerplate, scaffolding, much UI code, adapters, serialization layers, repetitive refactors, and CRUD logic behind stable interfaces are all faster to produce than they used to be, often dramatically so. The unifying property of this work is shallow dataflow: inputs and outputs are explicit, and the failure surface is small. When something is wrong, it usually looks wrong. Verification is cheap, and correction is faster than writing from scratch.
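
To make “shallow dataflow” concrete, here is a minimal, hypothetical sketch of the kind of adapter code that now generates cheaply. All names are illustrative; the point is that inputs and outputs are explicit and a single round-trip assertion covers essentially the whole failure surface.

```python
from dataclasses import dataclass

# A hypothetical serialization adapter: explicit input, explicit output,
# no hidden state. If the output is wrong, it looks wrong immediately.
@dataclass
class User:
    id: int
    name: str

def user_to_row(user: User) -> dict:
    """Serialize a User for storage."""
    return {"id": user.id, "name": user.name}

def row_to_user(row: dict) -> User:
    """Deserialize a storage row back into a User."""
    return User(id=row["id"], name=row["name"])

# Verification is local and cheap: one round-trip assertion
# exercises both directions of the dataflow.
u = User(id=7, name="Ada")
assert row_to_user(user_to_row(u)) == u
```

Generated code in this regime is easy to accept or reject on sight, which is exactly why it compresses so well.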

This pattern is visible even in how respected practitioners describe their own usage. Karpathy’s “vibe coding” framing emphasizes speed and altered ergonomics, but not correctness or autonomy1. Osmani is even more explicit, drawing a hard boundary between low-stakes experimentation and responsible software development, and warning against shipping code you do not understand2.

Fundamentally, shallow dataflow and low-leverage engineering tasks have seen a real reduction in friction.

Raising the floor, not the ceiling

In practice, these tools raise the floor far more reliably than they raise the ceiling. A developer who previously struggled with syntax, structure, or basic composition can now produce acceptable output quickly. That alone creates a striking subjective experience, especially compared to prior friction.

Mid-tier engineers also benefit. Routine work moves faster. Context switching becomes cheaper. Mechanical mistakes disappear. But this does not translate into dominance on hard problems. The work that differentiates strong engineers—domain modeling, architectural tradeoffs, long-lived invariants, subtle failure modes—is precisely the work that does not compress well into generation.

Easy work becomes easier. Cheap work becomes cheaper. The shape of difficult work, however, remains largely unchanged.

Marketing frequently conflates visible gains at the low and middle of the distribution with a general breakthrough in capability. Raising the floor feels dramatic, but it is not the same thing as raising the ceiling. The “10× developer” rhetoric of the past is now being recycled to describe this new leverage, even though the two are different in kind.

Another shift is how rarely code is removed. These tools dramatically reduce the friction of addition while leaving deletion, pruning, and scope reduction manual and deliberate. As a result, systems grow faster than teams’ ability to reason about them, and more often than not the escape hatch teams reach for is a rewrite from scratch.

Where autonomy breaks down

The failure mode now concentrates in deep dataflow with delayed consequences. Domain models, migrations, cross-module invariants, concurrency semantics, idempotency, authorization, and billing paths share a common trait: correctness cannot be verified locally, and mistakes surface far from their cause as errors compound across time and modules.
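
As a toy illustration of deep dataflow (a hypothetical sketch, not drawn from any particular codebase), consider a billing path with an idempotency guard. Deleting the guard would still pass a local, single-call unit test; the mistake only surfaces when a network retry replays the same request later, far from the code that caused it.

```python
# Hypothetical billing path. All names are illustrative.
ledger: dict[str, int] = {}   # account -> balance charged, in cents
processed: set[str] = set()   # idempotency keys already applied

def charge(account: str, cents: int, idempotency_key: str) -> None:
    # Without this guard, a retried request double-charges the account.
    # A local test (one call, one charge) would still pass, which is why
    # this invariant cannot be verified locally.
    if idempotency_key in processed:
        return
    processed.add(idempotency_key)
    ledger[account] = ledger.get(account, 0) + cents

charge("acct-1", 500, "req-42")
charge("acct-1", 500, "req-42")   # replayed request: must be a no-op
assert ledger["acct-1"] == 500
```

The correctness of this code is a property of the whole request lifecycle, including retries, not of any single call site.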

In this regime, AI systems are brittle. Larger contexts do not reliably help. More retries mostly increase variance. Looping until something passes a surface-level oracle often produces large diffs that require careful human unwinding.

This reflects a mismatch between short-horizon generation and long-horizon ownership. In short, as dataflow deepens, failure modes change.

On “spec-driven” agents

Spec-driven and loop-based agent systems are marketed as a way to scale engineering output. In practice, at this point they mostly displace effort rather than eliminate it, especially on non-greenfield projects.

Engineering work usually moves from code to spec maintenance, from local reasoning to oracle construction, from incremental diffs to batch outputs that must be reviewed defensively. Engineers spend time constraining prompts, repairing drift, resetting runs, and maintaining the harness and tools themselves. When requirements change—as they usually do—the spec becomes a liability instead of a guide.

These systems perform best when the task was already easy: bounded, repetitive, and strongly checkable. In those cases, the spec functions as a batch description. When judgment is required, it becomes a lossy proxy for understanding, and the loop optimizes for completion rather than correctness.

More subtly, agent workflows bias teams toward accretion over refinement. Specs optimize for coverage, not minimality. Generated diffs grow larger. Rewrites feel easier than incremental pruning. Over time, systems accumulate behavior faster than understanding, and the only safe-seeming operation becomes replacement rather than reduction.

Token-heavy background work and retries, for now at least, mostly create the appearance of progress.

The hidden tax teams underestimate

Teams experimenting with agent workflows consistently underestimate the cost of keeping them aligned. Specifically, we identify four asymmetric costs:

  • Setup cost: tooling, prompts, harnesses, CI wiring.
  • Oracle cost: tests that approximate intent but never fully encode it.
  • Drift cost: undoing large, low-signal diffs and reasserting invariants.
  • Deletion cost: deciding what not to keep, which remains manual and high-risk.
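
The oracle cost in particular is easy to underestimate. A small, hypothetical sketch: the tests a team actually writes approximate intent with examples, while the real invariant stays unstated, so a loop that optimizes for a passing suite can converge on code that violates the intent without any test failing.

```python
def apply_discount(total_cents: int, discount_cents: int) -> int:
    # A plausible generated implementation: it passes the oracle below,
    # but does no clamping, so totals can go negative.
    return total_cents - discount_cents

# The oracle the team wrote: example-based, happy-path only.
assert apply_discount(1000, 200) == 800
assert apply_discount(500, 0) == 500

# The intent the oracle never encodes: "a total is never negative".
# The loop has no reason to honor it, and indeed it does not:
assert apply_discount(100, 200) < 0   # suite green, invariant violated
```

Fully encoding intent would require property-style checks for every invariant, which is precisely the expensive, judgment-laden work the spec was supposed to remove.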

In many environments, these costs erase the raw speed of generation. Code appears faster, but convergence to something correct and maintainable slows. This aligns closely with the repeated practitioner point that effective use of LLMs requires more discipline, not less2 3.

This is why, despite aggressive claims, teams are not consistently faster end-to-end.

Maintenance pressure and trust erosion

A parallel line of evidence is emerging from maintainers. A recently coined term, “agent psychosis”, describes developers repeatedly prompting agents to generate changes that feel productive locally but impose high review and cleanup costs downstream4. When contributors cannot explain or defend their changes, maintainers lose a primary signal they rely on: trust. Other recent online discussions echo this, reporting growing frustration with high-volume, low-signal issues and pull requests that “look fine” at first glance but demand significant verification effort5.

The scarce resource in open source has never been raw code; it is maintainer attention and trust. Agent-mediated workflows increase output, but unless responsibility and provenance become more explicit, they degrade the very signals that keep those ecosystems healthy.

Porting works

Projects have been successfully translated with heavy AI involvement, including detailed public write-ups.

Willison’s account of porting justhtml is a good illustration6. A later January 2026 update reiterates that pointing a coding agent at an existing project and asking it to “port this and make the tests pass” now works reliably in practice7.

We should note that porting is unusually agent-friendly because the hardest parts of engineering are already solved. The original implementation is a complete, executable specification. The architecture is frozen. The data model exists. The invariants are implicit in running code. Success criteria are concrete.

Most importantly, the oracle for porting is especially strong. Tests, fixtures, and differential output comparisons allow tight convergence. The agent is translating behavior, not inventing a system.
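
A minimal sketch of what such a differential oracle can look like, with toy in-process stand-ins for the original implementation and the port. The names and logic are illustrative; a real harness (like the ones used in public porting write-ups) would run the two actual codebases against shared fixtures and diff their outputs.

```python
def original_count_tags(html: str) -> int:
    """Stand-in for the reference implementation's behavior."""
    return html.count("<") - html.count("</")

def ported_count_tags(html: str) -> int:
    """Stand-in for the agent-produced port under test."""
    return sum(1 for i, c in enumerate(html)
               if c == "<" and not html.startswith("</", i))

def differential_check(cases: list[str]) -> list[str]:
    """Return every input on which the port diverges from the original."""
    return [c for c in cases
            if original_count_tags(c) != ported_count_tags(c)]

cases = ["<p>hi</p>", "<div><span>x</span></div>", "plain text"]
assert differential_check(cases) == []   # port matches on every case
```

The strength of this oracle is exactly why porting converges so tightly: every fixture is a concrete, executable statement of intended behavior, so the loop is optimizing against the real target rather than a proxy.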

Porting sits squarely in the category of behavior-preserving transformation with cheap verification. That agents excel here is evidence of strength in translation and convergence, not evidence of autonomous system design.

Benchmarks and the production gap

The same pattern appears in benchmarks. SWE-bench measures the ability to resolve individual GitHub issues with clear pass/fail criteria8, and performance on SWE-bench Verified has improved substantially for top systems. SWE-EVO, by contrast, was introduced specifically to test longer-horizon software evolution with multi-file changes, large test suites, and broader system impact. The reported drop from SWE-bench to SWE-EVO is stark: success rates collapse once tasks begin to resemble real maintenance work9.

That gap reflects the difference between producing a plausible patch and owning a system over time. We argue that production engineering mostly lives in that gap.

Incentives and narrative pressure

It is also worth acknowledging the incentive environment surrounding these tools. As platforms move toward mass-market monetization (including advertising!), success becomes increasingly tied to scale, retention, and narrative clarity. In that environment, broad claims travel further than narrow ones. “Autonomous agents” is a simpler story than “AI accelerates bounded work under human ownership”, even if the latter is more accurate.

OpenAI’s move toward advertising indicates a response to cost and revenue pressure10. That business reality helps explain why public narratives continue to inflate even as practitioner experience remains uneven.

Practical implications for engineering leadership

The practical implications are less dramatic than the marketing suggests. We maintain our thesis that AI tools today raise the floor of engineering output far more than they raise the ceiling. Teams move faster on shallow, routine work, and fewer people are blocked on basics. Mid-tier engineers often see real gains in throughput. What does not change is who reliably handles the hard parts: architecture, data models, invariants, migrations, and failure semantics remain human-owned, and often expert-led.

Claims of autonomy generalize from bounded successes. Porting, refactoring, scaffolding, and translation work well because the target behavior is already defined and verification is strong. That success does not extend to open-ended system evolution, where tests are incomplete and mistakes compound over time.

Spec-driven agent workflows introduce a hidden tax. Setup, oracle construction, drift control, and defensive review frequently offset raw generation speed. Leaders should treat any workflow that increases code volume without proportionally increasing deletion safety as a long-term cost, regardless of short-term speedups.

The net effect is uneven acceleration. Some tasks get dramatically cheaper. Teams are not (yet) consistently faster on the work that actually determines system quality and long-term cost.

Closing thoughts

What has changed is the cost of producing code, not the cost of understanding systems.

Shallow dataflow work has become cheaper and faster. Deep dataflow work has not become autonomous. The tools compress variance at the low end of the skill distribution and help mid-tier engineers move more quickly on routine tasks, but they do not replace judgment or system ownership.

A grounded current posture is to use AI aggressively for bounded, local work where verification is cheap. Keep humans explicitly responsible for architecture, data models, invariants, and operational semantics. Prefer interactive, file-scoped collaboration over long autonomous loops. Treat agents as batch tools where appropriate, not as substitutes for engineering judgment.

Until deletion, constraint, and long-horizon ownership are meaningfully automated, autonomy will raise the floor while leaving the ceiling mostly where it is.


  1. Andrej Karpathy, “Vibe Coding”, https://x.com/karpathy/status/1886192184808149383

  2. Addy Osmani, “Vibe Coding is not an excuse for low-quality work”, https://addyo.substack.com/p/vibe-coding-is-not-an-excuse-for

  3. Simon Willison, “LLMs and programming”, https://simonwillison.net/tags/llms/

  4. Armin Ronacher, “Agent Psychosis: Are We Going Insane?”, https://lucumr.pocoo.org/2026/1/18/agent-psychosis/

  5. Hacker News discussion of Ronacher’s post, https://news.ycombinator.com/item?id=46666777

  6. Simon Willison, “Porting justhtml to Python with LLM assistance”, https://simonwillison.net/2025/Dec/15/porting-justhtml/

  7. Simon Willison, “Answers”, Jan 11, 2026, https://simonwillison.net/2026/Jan/11/answers/

  8. SWE-bench, https://www.swebench.com/

  9. “SWE-EVO: Evaluating Long-Horizon Software Evolution Tasks”, https://arxiv.org/abs/2512.18470

  10. Reuters, “OpenAI to test ads in ChatGPT”, Jan 2026, https://www.reuters.com/business/openai-begin-testing-ads-chatgpts-free-go-tiers-2026-01-16/