LAUNCH ETA: October 2025

Agent Systems with SLMs, Workflow Engines and Adaptive Routing

5 min read

NVIDIA’s June 2025 paper Small Language Models Are the Future of Agentic AI argues that agent systems run better when built around small language models (SLMs) rather than leaning exclusively on large ones1. A small model, by their definition, is one that can run with practical latency for a single user on consumer hardware. In practice today that means dense models under roughly ten billion parameters, or mixture-of-experts models that activate only a few billion parameters per token, so long as the full checkpoint fits on a single 24–32 GB GPU or a Mac Mini M4 Pro with 64 GB of memory. The dividing line is not parameter count, but whether deployment on commodity devices is feasible without cluster hardware.

NVIDIA frames small models as efficient, but real performance diverges from what parameter count suggests. In our benchmarks, llama3.1:8b averaged close to 20 GB of unified memory, while qwen3-coder:30b ran below 6 GB2. The same inversion appears in response times on our M4 Pro: qwen3-coder:30b averaged ~6s, faster than qwen3:4b at ~32s and llama3.1:8b at ~11s3. Scale alone does not predict latency, and any operational claim about “small” models needs to be grounded in measured wall-clock performance rather than parameter count.

Agent workloads decompose into narrow tasks such as parsing arguments, filling schema fields, or formatting a command for a tool. These tasks fail frequently, and reliability comes from retries and branching rather than from expecting one large inference to succeed. Small models make retries cheaper, and with high tokens-per-second throughput the orchestrator can run multiple candidates in parallel. NVIDIA cites inference-time scaling methods such as self-consistency and verifier feedback as mitigation, but the outcome matches the conclusion we published earlier in Tools and Filters for Short Horizons: control logic and schema enforcement sit outside the model, not inside it. The model is approximate and probabilistic; deterministic filters and external gates are mandatory, as in the sketch below.
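
As a concrete illustration, here is a minimal sketch of that division of labor: the model only proposes candidates, and a deterministic gate outside the model decides what survives. The generate callable, the SCHEMA contents, and the candidate count are illustrative assumptions, not taken from NVIDIA’s paper or our benchmark harness.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical argument schema for a single tool call: field name -> expected type.
# A real system would use a full JSON Schema validator; this keeps the gate visible.
SCHEMA = {"path": str, "max_lines": int}

def gate(raw: str) -> dict | None:
    """Deterministic filter: accept only output that parses and matches the schema."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if set(obj) != set(SCHEMA):
        return None
    if not all(isinstance(obj[key], typ) for key, typ in SCHEMA.items()):
        return None
    return obj

def first_valid(generate, prompt: str, n_candidates: int = 4) -> dict | None:
    """Sample several candidates in parallel and keep the first one the gate accepts."""
    with ThreadPoolExecutor(max_workers=n_candidates) as pool:
        candidates = list(pool.map(generate, [prompt] * n_candidates))
    for raw in candidates:
        parsed = gate(raw)
        if parsed is not None:
            return parsed
    return None  # the caller decides whether to retry, escalate, or fail the step
```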

The tool-use loop in this design is repetitive. The orchestrator selects the next step from an explicit workflow or compiled plan. The model fills schema-bound JSON, verification checks it, retries fire if it fails, and only valid outputs reach a deterministic tool. The tool itself is stateless and returns schema-bound output. Escalation to a larger model only occurs when retries fail or when the task is genuinely open-ended. This loop repeats until completion. As shown in The Goldilocks Horizon, even large models collapse at longer timescales, so reliability comes from iteration and validation, not size.
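
The loop itself can be stated in a few lines. This sketch builds on first_valid and gate from the snippet above; the Step dataclass, the retry budget, and the escalation policy are our illustrative choices rather than anything prescribed by the paper.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    prompt: str                  # instruction for filling the tool's argument schema
    tool: Callable[..., dict]    # deterministic, stateless, returns schema-bound output

def run_step(step: Step, small_model, large_model, max_retries: int = 3) -> dict:
    """Fill schema-bound JSON with the small model, verify, retry on failure,
    and escalate to the larger model only when the retry budget is exhausted."""
    for _ in range(max_retries):
        args = first_valid(small_model, step.prompt)    # gate from the sketch above
        if args is not None:
            return step.tool(**args)
    args = first_valid(large_model, step.prompt, n_candidates=1)  # escalation path
    if args is None:
        raise RuntimeError(f"step {step.name!r} failed validation even after escalation")
    return step.tool(**args)

def run_workflow(plan: list[Step], small_model, large_model) -> dict:
    """Execute an explicit plan step by step; the loop repeats until completion."""
    return {step.name: run_step(step, small_model, large_model) for step in plan}
```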

NVIDIA extends this by distinguishing code agency from language agency1. The model does not improvise freeform plans in text; it fills arguments and emits structured calls into bounded activities. Each activity is a tool or script with declared input and output schemas, executed by the orchestrator. They frame the workflow as a directed acyclic graph, though in practice bounded cycles are unavoidable: verification loops, retries, self-consistency iterations. The key property when cycles are allowed is that the number of steps is capped and each step is verifiable; otherwise loops may degenerate into uncontrolled language agency. In effect, agent systems reduce to workflow engines with activities, and the intelligence resides in orchestration and filters rather than in model improvisation.
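
A workflow engine that tolerates bounded cycles can be sketched as a graph of activities with declared schemas and a hard step cap. The Activity fields, the MAX_STEPS value, and the validate helper are assumptions made for illustration; the point is that every traversal is schema-checked and counted.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Activity:
    name: str
    input_schema: dict                           # declared argument schema the model must fill
    output_schema: dict                          # declared result schema the tool must satisfy
    run: Callable[[dict], dict]                  # the bounded tool or script itself
    next_step: Callable[[dict], Optional[str]]   # edge selection; may loop back for retries

MAX_STEPS = 20   # hard cap: bounded cycles, never open-ended language agency

def validate(payload: dict, schema: dict) -> None:
    """Hypothetical schema check; a real engine would use a proper validator."""
    missing = set(schema) - set(payload)
    if missing:
        raise ValueError(f"missing fields: {missing}")

def execute(graph: dict[str, Activity], start: str, args: dict) -> dict:
    """Walk the workflow graph. Cycles (verification loops, retries, self-consistency
    passes) are allowed, but every step is verified and the total count is capped."""
    current, steps = start, 0
    while current is not None:
        if steps >= MAX_STEPS:
            raise RuntimeError("step cap exceeded; refusing to drift into free-form planning")
        activity = graph[current]
        validate(args, activity.input_schema)
        result = activity.run(args)
        validate(result, activity.output_schema)
        current, args, steps = activity.next_step(result), result, steps + 1
    return args
```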

NVIDIA’s authors stop at that loop, but we argue the interfaces require stricter, bounded design, and that observability is required to measure tail latencies and enforce service-level objectives (SLOs). Our benchmark framework exposed this directly: stability scores and tail latency distributions diverge even for models with similar mean latencies3, and in workflows with SLO requirements predictability can be the deciding measure of viability. Operators are forced to design around the worst tail.
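
Observability here means measuring the distribution, not the average. A minimal sketch follows, assuming per-call wall-clock samples are already collected; the particular stability score below is one possible definition, not the one used in our benchmark framework.

```python
import statistics

def tail_profile(latencies_s: list[float]) -> dict[str, float]:
    """Summarize a model's latency distribution; the mean hides the tail that
    SLO-bound workflows have to design around."""
    cuts = statistics.quantiles(latencies_s, n=100)   # 99 cut points, i.e. percentiles 1..99
    p50, p95, p99 = cuts[49], cuts[94], cuts[98]
    return {
        "mean": statistics.fmean(latencies_s),
        "p50": p50,
        "p95": p95,
        "p99": p99,
        # One possible stability score: 1.0 when the tail never stretches past the median.
        "stability": p50 / p99 if p99 > 0 else 0.0,
    }

# Two models with identical means can produce very different profiles;
# the operator has to budget for the p99, not the mean.
```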

The hardware line is less crisp than NVIDIA presents it. They give ten billion parameters as a cutoff, but commodity devices in 2025 can run models up to around thirty billion parameters with quantization. Past that, multi-GPU or datacenter hardware is unavoidable and the economics of retries break down. Mixture-of-experts models complicate the classification: qwen3-coder:30b activates only about three billion parameters per token, so its compute profile resembles an SLM while its memory footprint resembles an LLM. By NVIDIA’s own standard the only valid measure is whether the model runs on commodity hardware with acceptable latency.
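
The arithmetic behind that cutoff is worth keeping in view. The sketch below estimates weight-only memory under quantization; it ignores KV cache and runtime overhead, and the bit widths are illustrative.

```python
def weight_memory_gib(params_billion: float, bits_per_param: float) -> float:
    """Weight-only footprint in GiB; KV cache and runtime overhead come on top."""
    return params_billion * 1e9 * bits_per_param / 8 / 2**30

# A dense 30B model at 4-bit (~14 GiB) still fits a 24-32 GB consumer GPU;
# the same weights at 16-bit (~56 GiB) do not. For mixture-of-experts, memory
# scales with total parameters even though compute scales with active ones.
print(weight_memory_gib(30, 4))    # ~13.97
print(weight_memory_gib(30, 16))   # ~55.88
```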

Vendors sell large models as more intelligent, but their own research suggests that agent systems need retries and orchestration, and retries at scale are often only economical with smaller or faster models. The stated goal is generality; the observed outcome is specialization and schema binding. Our own work in Fingerprints v1.0 showed that routing decisions are better informed by stability and refusal profiles than by leaderboard averages, and we argued for multi-armed bandits in an earlier post about guardrails4 to compare prompts and models across branches. NVIDIA frames the case in terms of efficiency, but the real shift is architectural: intelligence accrues from orchestration, control logic and gates rather than from parameter count.
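
A concrete way to compare prompts and models across branches is a simple bandit over variants. This is a generic UCB1 sketch, not the implementation behind Fingerprints v1.0; the arm names and the reward definition (schema-valid output delivered within the latency budget) are assumptions.

```python
import math

class UCB1Router:
    """Treat each (model, prompt) variant as an arm and pull the arm with the
    best upper confidence bound on observed reward."""

    def __init__(self, arms: list[str]):
        self.counts = {arm: 0 for arm in arms}
        self.rewards = {arm: 0.0 for arm in arms}

    def select(self) -> str:
        for arm, n in self.counts.items():
            if n == 0:
                return arm                        # try every arm at least once
        total = sum(self.counts.values())
        return max(
            self.counts,
            key=lambda a: self.rewards[a] / self.counts[a]
            + math.sqrt(2 * math.log(total) / self.counts[a]),
        )

    def update(self, arm: str, reward: float) -> None:
        self.counts[arm] += 1
        self.rewards[arm] += reward

# Usage: reward 1.0 if the gated call produced valid output inside the SLO, else 0.0.
router = UCB1Router(["qwen3-coder:30b/prompt-a", "qwen3-coder:30b/prompt-b", "llama3.1:8b/prompt-a"])
arm = router.select()
# ... run the gated call, check validity and the latency budget ...
router.update(arm, reward=1.0)
```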

Our position is that the architecture implied by both our benchmarks and NVIDIA’s research requires two concrete components. First, a workflow engine or flow-based programming layer that executes bounded activities under schema enforcement. Second, an adaptive multi-variant testing framework that treats prompts and models as arms to be compared and routed dynamically.

This latter point has independent support. A recent paper by Fujitsu and Microsoft frames model routing under budget constraints as a contextual bandit problem, enforcing user-defined cost and latency ceilings while dynamically allocating calls across a model pool5. Their method aligns with our proposal for adaptive routing: orchestration must not only bind activities but also enforce economic limits, testing and shifting between variants over time.
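
The sketch below shows only the budget-enforcement half of that idea, as we would combine it with the router above: user-defined ceilings prune the arm pool before any selection happens. The Arm fields, the numbers, and the combination with UCB1 are our illustration, not the paper’s contextual-bandit algorithm.

```python
from dataclasses import dataclass

@dataclass
class Arm:
    name: str
    est_cost_usd: float        # running estimate maintained from observed calls
    est_p95_latency_s: float   # running estimate of the latency tail

def eligible(arms: list[Arm], cost_ceiling: float, latency_ceiling: float) -> list[Arm]:
    """Enforce the user-defined ceilings first; the bandit only ever chooses
    among arms that currently fit the budget."""
    return [a for a in arms
            if a.est_cost_usd <= cost_ceiling and a.est_p95_latency_s <= latency_ceiling]

# Illustrative pool: a local SLM at negligible marginal cost versus a hosted model.
pool = [Arm("qwen3-coder:30b", est_cost_usd=0.0, est_p95_latency_s=9.0),
        Arm("hosted-frontier", est_cost_usd=0.01, est_p95_latency_s=4.0)]
candidates = eligible(pool, cost_ceiling=0.002, latency_ceiling=12.0)
router = UCB1Router([arm.name for arm in candidates])   # router from the previous sketch
```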

The stopping condition is hardware. If a model cannot run on a single consumer device with predictable latency and bounded memory, it cannot participate in the retry loop that makes agent systems reliable. Orchestration architects have to account for this.