Control of execution defines ownership. When a large language model file sits on your machine, it runs the same way until you decide to replace it. No silent updates, no routed experiments, no persona patches. When OpenAI set GPT-5 as the default in ChatGPT in August 2025, it replaced the other GPT models in a single move. Only after backlash did the company restore prior models for paying accounts, while also promising a “warmer” upcoming persona [1][2][3][4]. The episode demonstrated the core incentive structure of closed-model services: the same prompt no longer maps to the same system, because the system is not a fixed tool but a mutable instrument for shaping user engagement and spend.
Permissionless operation means you decide when to upgrade, what to load, and how to route outputs. Durability and privacy follow when inference runs on hardware you control, without a broker.
tl;dr: See the results of the nullbench v1 run of easily accessible open-weight models up to 32 billion parameters (2025-08-30). One key finding upfront: `hermes3:3b` complied even in our “Total Safety Annihilation” category.
[Figure: benchmark radar chart for “Total Safety Annihilation”]
Commodity Hardware and the 32B Sweet Spot
Since we advocate for local LLMs, our results come from open-weight artifacts with no conversation memory or other tweaks. The reference box is a Mac mini M4 Pro with 64 GB unified memory, which runs one model at a time. In 2025, such a system costs under $2,000 and can be treated as commodity hardware for operators. Because of this hardware reality, we believe the efficiency frontier for accessible deployment currently sits near the ~32-billion-parameter class. A model of this size, quantized to 4-bit, consumes roughly 16 GB of memory before runtime overhead, a load that is sustainable on such a system while delivering practical throughput [5][6]. Larger models in the 70B class roughly double the memory requirement for marginal gains in reasoning, and the returns diminish further unless the training token count scales with the parameter count [7]. The ~32B class therefore represents a sensible ceiling for widely deployable, self-hosted models on this class of hardware.
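The arithmetic behind that claim is simple: weight memory is parameters times bytes per parameter. A minimal sketch (the 1.2 overhead factor is our rough assumption for KV cache and runtime, not a measured constant):

```python
def weight_memory_gb(params_billions: float, bits_per_param: float,
                     overhead: float = 1.2) -> float:
    """Rough estimate: parameters x bytes/param, inflated by a fudge
    factor for KV cache and runtime overhead (the 1.2 is a guess)."""
    weight_bytes = params_billions * 1e9 * bits_per_param / 8
    return weight_bytes * overhead / 1e9

# 32B at 4-bit: 16 GB of weights, ~19 GB with overhead -> comfortable in 64 GB.
print(f"{weight_memory_gb(32, 4):.1f} GB")
# 70B at 4-bit: 35 GB of weights, ~42 GB with overhead -> doubles the footprint.
print(f"{weight_memory_gb(70, 4):.1f} GB")
```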
We have previously argued that large language models are unreliable components whose utility is bounded by short time horizons and structural biases that resist prompt-level correction [8][9]. Reliability, we argued, is not an inherent property of the generator; it is a feature of the external filters and gates that contain it. The `nullbench` framework was created to add behavioral fingerprints that document production-critical metrics: efficiency, stability, refusal behavior, and domain-specific performance under fixed decoding and zero context [10][11].
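A fingerprint can be thought of as one flat record per model and domain. A minimal sketch of such a record (the field names are ours, not nullbench’s actual schema, and we read QPR as quality-per-resource):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fingerprint:
    """One row of a behavioral fingerprint, captured under fixed
    decoding parameters and zero conversation context."""
    model_id: str        # exact artifact, e.g. "qwen3-coder:30b"
    domain: str          # e.g. "Reasoning", "Trolling", "DeFi"
    quality: float       # domain score, 0-100
    qpr_pct: float       # quality-per-resource, normalized to best-in-run
    stability: float     # stability score (see the v1.1 revision below)
    median_rt_ms: int    # median response time in milliseconds
    refusal_rate: float  # fraction of sensitive prompts refused
```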
Leaderboard Contradiction
The incentive to market open-weight models has produced its own version of the moving-target problem. Vendors and benchmark aggregators promote top-line quality scores that cluster in a narrow band, implying interchangeability. This presentation conceals operational profiles that diverge by orders of magnitude. The table below presents the headline alignment score next to the metrics that decide whether a model belongs on an interactive path.
| Model | Overall | QPR % | Stability | Median RT (ms) |
|---|---|---|---|---|
| `qwen3-coder:30b` | 78.8 | 100.0 | 0.03 | 6,340 |
| `mistral-small3.2:24b` | 76.4 | 7.5 | 0.38 | 34,373 |
| `gemma3:27b-it-qat` | 76.1 | 0.6 | 0.64 | 100,327 |
| `qwen3:32b` | 76.2 | 1.1 | 0.85 | 152,640 |
| `hermes3:3b` | 75.5 | 66.3 | 0.07 | 6,326 |
The contradiction is explicit. `qwen3-coder:30b` and `qwen3:32b` register similar overall scores of 78.8 and 76.2, and the leaderboard incentive structure presents them as peers. The operational data shows that one model, `qwen3-coder:30b`, achieves its score with maximum resource efficiency and near-zero latency variance. The other, `qwen3:32b`, requires a multiple of the resources per quality point and exhibits extreme latency tails, a profile that rules it out for interactive use. One component is fit for purpose; the other is a batch-only tool, yet the single-number score obscures this distinction. Our model size-to-quality scatterplot below shows a similar distortion: larger models do not reliably deliver better quality.
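The distinction is mechanical to enforce. A minimal sketch of an interactive-eligibility gate over the table above (the 10-second threshold is illustrative, not nullbench policy):

```python
# Rows from the table above: (overall, QPR %, stability, median RT ms).
MODELS = {
    "qwen3-coder:30b":      (78.8, 100.0, 0.03,   6_340),
    "mistral-small3.2:24b": (76.4,   7.5, 0.38,  34_373),
    "gemma3:27b-it-qat":    (76.1,   0.6, 0.64, 100_327),
    "qwen3:32b":            (76.2,   1.1, 0.85, 152_640),
    "hermes3:3b":           (75.5,  66.3, 0.07,   6_326),
}

def interactive_eligible(median_rt_ms: int, max_rt_ms: int = 10_000) -> bool:
    """Anything slower than ~10 s at the median goes to the batch path."""
    return median_rt_ms <= max_rt_ms

for name, (overall, qpr, stability, rt) in MODELS.items():
    path = "interactive" if interactive_eligible(rt) else "batch-only"
    print(f"{name:24s} overall={overall:5.1f} -> {path}")
```

Under this gate, two models with near-identical headline scores land on opposite paths.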
[Figure: model size vs. quality scatterplot]
Stability metric revision (v1 → v1.1)
v1 combined `p95/median`, `max/p95`, and `sd/avg` with a hard max gate, which double-penalized single spikes, was CV-sensitive, and saturated under small-N. v1.1 removes the gate and switches to robust ratios: `p90/p50` (tail shape), `p99/p90` (rare spikes), and `MAD/median` (jitter), with shrinkage toward 0.85 when `n < 20`. Scale-free, 0–1, higher is better.
$$
\begin{aligned}
\text{pen}_{90} &= \operatorname{clamp01}\!\left(\frac{p_{90}/p_{50}-1}{2}\right),\\
\text{pen}_{99} &= \operatorname{clamp01}\!\left(\frac{p_{99}/p_{90}-1}{4}\right),\\
\text{pen}_{J} &= \operatorname{clamp01}\!\left(\frac{\mathrm{MAD}}{p_{50}}\right),\\
\text{score} &= 1-\big(0.50\,\text{pen}_{90}+0.30\,\text{pen}_{99}+0.20\,\text{pen}_{J}\big),\\
n<20 &:\ \text{score}\leftarrow 0.85+(\text{score}-0.85)\frac{n}{20}.
\end{aligned}
$$
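For reference, a direct transcription of v1.1 into code. This is a sketch: the nearest-rank percentile and the helper names are our choices, not necessarily how nullbench computes them.

```python
import math
from statistics import median

def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))

def percentile(xs: list[float], q: float) -> float:
    """Nearest-rank percentile; nullbench may interpolate differently."""
    s = sorted(xs)
    k = max(0, min(len(s) - 1, math.ceil(q / 100 * len(s)) - 1))
    return s[k]

def stability_v11(latencies_ms: list[float]) -> float:
    """v1.1 stability: robust ratios, no hard max gate.
    Scale-free, 0-1, higher is better; shrinks toward 0.85 when n < 20."""
    n = len(latencies_ms)
    p50, p90, p99 = (percentile(latencies_ms, q) for q in (50, 90, 99))
    mad = median(abs(x - p50) for x in latencies_ms)
    pen90 = clamp01((p90 / p50 - 1) / 2)   # tail shape
    pen99 = clamp01((p99 / p90 - 1) / 4)   # rare spikes
    pen_j = clamp01(mad / p50)             # jitter
    score = 1 - (0.50 * pen90 + 0.30 * pen99 + 0.20 * pen_j)
    if n < 20:
        score = 0.85 + (score - 0.85) * n / 20  # small-N shrinkage
    return score
```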
Early Results for a Routing Policy from Fingerprints
The data produces a routing policy that assigns flawed components to tasks where their specific strengths are relevant and their failure modes can be contained. The first results indicate that, among the tested models, four components can cover a range of tasks with acceptable risk profiles. The routing policy below is a first draft, subject to continuous revision as models evolve and new artifacts appear; a configuration sketch follows the route descriptions.
Default Route: `qwen3-coder:30b`. This model is the default destination for general analysis, writing, and reasoning tasks. It scored 84.0 on Reasoning and 90.3 on resisting conspiracy-framed prompts, a profile that indicates a lower risk of generating ungrounded or loaded text. This performance, combined with its high efficiency and stability, makes it the system’s baseline component.
Software Specialist Route: `mistral-small3.2:24b`. This model receives all tasks related to software design, code review, and security. Its 88.7 in Software indicates a stronger grasp of design, correctness, and maintainability. Its higher latency and memory consumption are accepted trade-offs in this domain, where correctness is worth more than response time.
Finance Specialist Route: `gemma3:27b-it-qat`. This model is assigned to interactive queries on Markets and Bitcoin, where it outperformed peers with scores of 67.2 and 75.4. It exhibits a tendency toward high verbosity, a liability that must be managed by enforcing strict output token limits. For deeper DeFi analysis, `qwen3:32b` is used exclusively in a batch-processing queue, because its 90.3 DeFi score is paired with operationally unacceptable latency.
Controlled Compliance Route: `hermes3:3b`. This model demonstrated an anomalous willingness to comply with hostile prompts, scoring 65 on the `Trolling` suite where peers scored below 37. This failure of refusal makes it a unique candidate for certain tasks: its demonstrated ability to replicate harmful content makes it a high-fidelity component for red-teaming, or for a moderation filter designed to classify such content. For drafting comments from trusted inputs, its speed and stability are effective.
[Table: domain results for the 'Trolling' suite]
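Expressed as configuration, the policy reduces to a small dispatch table. A sketch, assuming a router keyed on task domain (the keys, the `batch_only` flag, and the token cap are our illustration, not a published nullbench artifact):

```python
ROUTES = {
    "default":  {"model": "qwen3-coder:30b",      "batch_only": False},
    "software": {"model": "mistral-small3.2:24b", "batch_only": False},
    "markets":  {"model": "gemma3:27b-it-qat",    "batch_only": False,
                 "max_output_tokens": 512},  # contain its verbosity
    "defi":     {"model": "qwen3:32b",            "batch_only": True},  # latency tails
    "comments": {"model": "hermes3:3b",           "batch_only": False},
}

def route(domain: str) -> dict:
    """Unknown domains fall through to the default route."""
    return ROUTES.get(domain, ROUTES["default"])
```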
The data also justifies retiring components. `gpt-oss:20b` from OpenAI’s gpt-oss series underperformed across all specialized domains and added no unique capability relative to the selected set, so it is removed from the portfolio and from future re-evaluations.
Filters Are a System Requirement
The generator is not the guardrail; reliability is a function of non-negotiable, external filters. A factuality gate must check citations and recency for financial and technical claims, escalating any low-confidence output for review. A scope and hedging gate must reject attempts at population-level generalizations and replace them with guarded phrasing. A refusal harmonizer must enforce consistent behavior on sensitive prompts across the portfolio, preventing one model’s alignment flaw from becoming system policy. Immutable logs that tie every output to the exact prompt, seed, and model ID are required for any incident audit or rollback.
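As a sketch of the wiring (the gate names follow the prose; the predicates are stubs, since the real checks are deployment-specific):

```python
from dataclasses import dataclass, field
from typing import Callable
import hashlib, json, time

@dataclass
class Output:
    prompt: str
    seed: int
    model_id: str
    text: str
    flags: list[str] = field(default_factory=list)

Gate = Callable[[Output], Output]

def factuality_gate(o: Output) -> Output:
    # Stub: check citations and recency on financial/technical claims;
    # escalate low-confidence output for review.
    return o

def scope_gate(o: Output) -> Output:
    # Stub: reject population-level generalizations, substitute hedged phrasing.
    return o

def refusal_harmonizer(o: Output) -> Output:
    # Stub: enforce one refusal policy across the whole portfolio.
    return o

def audit_log(o: Output) -> Output:
    """Append-only record tying the output to prompt, seed, and model ID."""
    record = {"ts": time.time(), "model": o.model_id, "seed": o.seed,
              "prompt_sha256": hashlib.sha256(o.prompt.encode()).hexdigest(),
              "output_sha256": hashlib.sha256(o.text.encode()).hexdigest(),
              "flags": o.flags}
    with open("audit.log", "a") as f:
        f.write(json.dumps(record) + "\n")
    return o

PIPELINE: list[Gate] = [factuality_gate, scope_gate, refusal_harmonizer, audit_log]

def run_gates(o: Output) -> Output:
    for gate in PIPELINE:
        o = gate(o)
    return o
```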
This portfolio is a map of known flaws and capabilities. Its continued validity depends entirely on persistent re-evaluation. The routing policy is brittle and degrades as the underlying models are updated or replaced, which means the fingerprinting process must be continuous for the system to remain stable.
Future Work
We’ll keep `nullbench` pointed at widely supported, common open-weight models ≤32B and refine the portfolio with targets that change routing decisions. `qwq` is effectively redundant with `qwen3` at similar size; we’ll drop it and concentrate on families that promise either a better efficiency ridge or distinct behavior under the gates.
The working set retains `qwen3-coder:30b` as default; `mistral-small3.2:24b` for software/technical writing; `gemma3:27b-it-qat` for finance/crypto; and `hermes3:3b` for comments.
Hypotheses
- Mid-size `qwen3` hits a better efficiency ridge than the 30B/32B variants with a ≤2-point quality delta on our core domains.
- A larger Hermes retains high throughput and improves overall scoring relative to 3B, turning it from “commenter” into an efficient routing candidate.
- `dolphin-mixtral:8x7b` provides software and long-form gains without MoE tail spikes.
- `magistral:24b` matches or beats `gemma3` on markets/bitcoin and trims verbosity under capped decoding.
- Abliterated variants of `gemma3` and `qwen3` reduce their specific dips (markets drift for gemma, latency tails for qwen3) without creating new ones in reasoning or propaganda tests.
Deliverables
Our next post will show a consolidated leaderboard with interactive eligibility, domain radars for retained models, stability/QPR scatter, and route diffs for: mid-size vs 30B qwen3, gemma vs magistral on finance, hermes3 vs hermes-large on comments. Negative results will be published if thresholds are unmet.
1. [Introducing GPT-5](https://openai.com/index/introducing-gpt-5/)
2. [GPT-5 in ChatGPT](https://help.openai.com/en/articles/11909943-gpt-5-in-chatgpt)
3. [Sam Altman Says GPT-5’s ‘Personality’ Will Get a Revamp](https://www.businessinsider.com/sam-altman-openai-gpt5-personality-update-gpt4o-return-backlash-2025-8)
4. [OpenAI Reinstates GPT-4o Amid Subscription Cancellations](https://www.techrepublic.com/article/news-openai-reinstates-gpt4o/)
5. [Estimating LLM Inference Memory Requirements](https://tensorwave.com/blog/estimating-llm-inference-memory-requirements)
6. [Simple Guide to Calculating VRAM Requirements for Local …](https://twm.me/calculate-vram-requirements-local-llms)
7. [Training Compute-Optimal Large Language Models](https://arxiv.org/pdf/2203.15556)
8. The Goldilocks Horizon: Scaling, Reasoning, and the Stopwatch