Building on the previous nullbench[^1] methodology and early refinements[^2], we expanded the pool, kept decoding and scoring fixed, and asked whether the additions change routing. The primary harness still scores answers only under zero context; refusals, scope drift, and tone distortion are debited; latency and memory do not change the grade but are captured on the evaluation machine (a Mac Mini with M4 Pro and 64 GB unified memory). The outcome is consistent with the earlier posts. Larger checkpoints with heavier “safety” layers often lose ground because they refuse, pad, or universalize where the spec asks for scoped, direct answers. Mid-size models that stay on-spec climb once we strip runtime from the metric.
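A minimal sketch of that grading shape, assuming a per-axis rubric in [0, 1]; the axis names and debit weights below are illustrative stand-ins, not the real harness values:

```python
# Illustrative only: not the harness code. Axis names and debit weights are
# invented to show the shape of the grade, not the real rubric.
from dataclasses import dataclass


@dataclass
class Result:
    axis_scores: dict[str, float]   # e.g. {"accuracy": 0.8, "scope": 0.7}
    refused: bool = False
    scope_drift: float = 0.0        # 0..1, drift from the requested scope
    tone_distortion: float = 0.0    # 0..1, padding / universalizing / smoothing
    latency_s: float = 0.0          # captured for the record, never graded
    peak_rss_gb: float = 0.0        # captured for the record, never graded


def answers_only_grade(r: Result) -> float:
    """Rubric mean minus behavioral debits; runtime fields never enter."""
    base = sum(r.axis_scores.values()) / max(len(r.axis_scores), 1)
    debit = 0.5 * r.refused + 0.25 * r.scope_drift + 0.25 * r.tone_distortion
    return max(base - debit, 0.0)
```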
Two newcomers stood out. GLM-4-9B[^3] [^4] lands in the mid field and behaves like a disciplined worker: concise schema-true outputs, solid axis balance for its size, and fewer template tics than similar small judges. EXAONE-Deep-32B behaves differently from the EXAONE-3.5 instruct[^5] [^6] line: the Deep variant is a distilled, reasoning-tuned model[^7] [^8] that, in our runs, produces the most repeatable zero-context summarization in the pool. It also shows an odd runtime profile: low observed peak footprint for its class, near-frozen tails, and glacial medians. That mix makes it especially useful as a reliable batch summarizer under caps.
We also tested community abliteration variants: instruction-tuned checkpoints with refusal paths disabled without a full retrain. The trade is the one we’d expect. They comply on content the untampered siblings would refuse, which helps on hostile commentary, and they give up hedging fidelity, language breadth, and tone control. On our axes that looks like quick “wins” on refusal gates followed by drops on scope and caveat preservation. Given the breadth of other models and fingerprints, abliterated checkpoints didn’t add much beyond serving as contrast instruments.[^9]
Smaller checkpoints inside a family still beat larger siblings on answers-only often enough to mention. The mechanism appears mechanical. Stronger refusal/hedging priors and stylized “helpfulness” templates come with scale; our rubric penalizes disclaimers that replace content, universalizing, and tone smoothing. The small sibling with lighter refusal and less padding lands closer to the requested shape and gets credit. When you add retrieval and gates in production, the larger model’s capacity may matter again; in this harness it often doesn’t.
We’re shrinking the target set for the next run. We drop OpenAI’s gpt-oss:120b and gpt-oss:20b for refusal craters and no answers-only upside. qwen2.5vl:32b, qwq:32b, qwen3:32b, and qwen3:30b duplicate behavior we already capture with qwen3-coder:30b. dolphin-mixtral:8x7b burns tokens for a steady mid-field placement. The two abliterated mids (qwen3-abliterated:16b, gemma3-abliterated:12b) are out; a single abliterated baseline is enough to expose the refusal-versus-quality trade, so we keep qwen3-abliterated:14b only. EXAONE-3.5-32B-Instruct and the distilled deepseek-r1:32b do not differentiate on our axes. Between the two 24B “M*stral” routes we keep magistral:24b and drop mistral-small3.2:24b for redundancy. llama3.1:8b exits for weak distinctiveness; qwen3:4b is optional and stays only if we need a small-Qwen anchor. We keep EXAONE-Deep-32B as the single summarization outlier worth the latency tax.
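Concretely, the next run’s target list reduces to a keep/drop split like the following. This is a hypothetical config sketch: tags follow the ollama-style naming used in this post, and the keep side folds in the models that survive into the routing section below.

```python
# Hypothetical shape of the next run's target list; not a generated artifact.
TARGETS_NEXT_RUN = {
    "keep": [
        "qwen3-coder:30b", "glm-4-9b", "hermes3:8b", "hermes3:3b",
        "gemma3:27b-it-qat", "magistral:24b",
        "qwen3-abliterated:14b",  # single abliterated baseline
        "exaone-deep:32b",        # summarization outlier, worth the latency tax
    ],
    "drop": [
        "gpt-oss:120b", "gpt-oss:20b",
        "qwen2.5vl:32b", "qwq:32b", "qwen3:32b", "qwen3:30b",
        "dolphin-mixtral:8x7b",
        "qwen3-abliterated:16b", "gemma3-abliterated:12b",
        "exaone-3.5:32b-instruct", "deepseek-r1:32b",
        "mistral-small3.2:24b", "llama3.1:8b",
    ],
    "optional": ["qwen3:4b"],     # only if a small-Qwen anchor is needed
}
```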
*Figure: benchmark scatterplot (by model size).*
## Routing updates that actually change work
We keep qwen3-coder:30b as default writer and general analyst; its answers land on-spec across axes even with efficiency ignored. GLM-4-9B moves into schema-fill and retry loops where speed and rubric adherence dominate breadth. hermes3:8b remains the fast commenter for short-form drafts. gemma3:27b-IT-QAT stays the translation route with hard token caps to suppress verbosity. hermes3:3b and EXAONE-Deep-32B run behind refusal harmonizers for hostile-commentary probes and red-team tasks; neither goes to user-visible paths without gates. magistral:24b fills the software/technical lane formerly covered by mistral-small3.2:24b.
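Expressed as configuration, these updates amount to a small task-to-model map. The sketch below is hypothetical, not the production router: the task labels are ours, and the tags are the ollama-style names used above.

```python
# Hypothetical routing map mirroring the prose above; gated routes must still
# pass refusal harmonizers and risk gates before anything is user-visible.
ROUTES = {
    "writer_general":     "qwen3-coder:30b",    # default writer / general analyst
    "schema_fill_retry":  "glm-4-9b",           # speed + rubric adherence
    "short_commentary":   "hermes3:8b",
    "translation":        "gemma3:27b-it-qat",  # run with hard token caps
    "hostile_commentary": "hermes3:3b",         # gated, never user-visible
    "batch_summaries":    "exaone-deep:32b",    # gated, correctness under caps
    "software_technical": "magistral:24b",      # replaces mistral-small3.2:24b
}


def route(task: str) -> str:
    # Unknown task labels fall back to the default writer.
    return ROUTES.get(task, ROUTES["writer_general"])
```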
## Judge refresh: reduce family correlation, keep costs low
Averaging removes order effects, not shared blind spots. Two judges from one family reinforce the same biases on edge items. We want a compact panel, low cost, diverse failure modes, reproducible labels.
| Judge | Role | Distinct signal | Cost profile | Typical failure |
|---|---|---|---|---|
| qwen3-coder:30b | Anchor for separation + continuity | High-capacity rubric fit; preserves ordering across runs | Moderate VRAM, fast for class | Over-refusal on sensitive templates if run without harmonizers |
| GLM-4-9B | Mid judge to reduce family correlation | Concise, schema-true answers; different alignment lineage than Qwen | Fast and efficient for size | Short-form bias on long, nuanced prompts |
| granite3.3:2b | Cheap order-keeper | Low-variance labels; stabilizes panel | Very low cost | Under-scoring subtle hedging preservation |
| hermes3:3b | Refusal-sensitive probe | Will comply on hostile commentary; exposes over/under-refusal in targets | Very fast; tiny footprint | Tone drift and over-compliance on sensitive content |
We remove qwen3:1.7b from the judge set; GLM-4-9B replaces it to cut family correlation and sharpen small-judge behavior. Ordering remains irrelevant since we average; diversity and repeatability are what matter.
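A minimal sketch of how the panel label is produced, assuming each judge returns a score in [0, 1] on the same rubric: the label is the plain mean, so ordering cannot matter, while shared family biases would survive averaging, which is why the lineages are mixed.

```python
# Panel label = mean of per-judge scores; averaging cancels order effects,
# not correlated blind spots, so the panel mixes alignment lineages.
from statistics import mean

JUDGES = ["qwen3-coder:30b", "glm-4-9b", "granite3.3:2b", "hermes3:3b"]


def panel_label(per_judge: dict[str, float]) -> float:
    missing = [j for j in JUDGES if j not in per_judge]
    if missing:
        raise ValueError(f"missing judge scores: {missing}")
    return mean(per_judge[j] for j in JUDGES)
```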
## Measurement note on memory and stability
We run models serially on a macOS host; ollama serves them through its Metal-accelerated backend. We sample ‘peak memory’ by summing RSS across the single ollama process tree at 100 ms intervals. This is a lower-bound approximation under Metal: driver-managed allocations don’t always surface in RSS, macOS compression and shared mappings skew totals, and coarse ticks can miss short spikes on load and teardown. For within-host comparisons it’s stable enough to rank models by relative footprint across many prompts. For capacity planning, treat it as a lower bound and apply headroom.
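The sampler itself is a few lines. Here is a sketch of the approach, assuming psutil is available and that the ollama server and its runner children form one process tree:

```python
# Sum RSS over the ollama process tree every 100 ms and keep the maximum.
# Under Metal this understates driver-managed allocations, so treat the
# result as a lower bound on true footprint.
import time
import psutil


def tree_rss_bytes(root: psutil.Process) -> int:
    try:
        procs = [root, *root.children(recursive=True)]
    except psutil.NoSuchProcess:
        return 0  # server exited between ticks
    total = 0
    for p in procs:
        try:
            total += p.memory_info().rss
        except psutil.NoSuchProcess:
            pass  # runner processes come and go between ticks
    return total


def sample_peak_rss(pid: int, duration_s: float, interval_s: float = 0.1) -> int:
    root = psutil.Process(pid)
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        peak = max(peak, tree_rss_bytes(root))
        time.sleep(interval_s)  # coarse ticks can miss short load/teardown spikes
    return peak
```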
## What changed our mind
EXAONE-Deep-32B. A 32B model showing a tiny observed peak footprint for its class, near-frozen tails, and the most reliable zero-context summarization in the set, paired with a ~180 s median. That mix forces a new slot: a summarizer kept for correctness under caps, not an interactive worker.
## Why we’re dropping the long tail
We don’t keep artifacts that don’t move routes. The OSS-branded large checkpoints add refusals and latency without answer gains. The multi-modal and near-duplicate 32Bs don’t change any decision our current set can’t already make. Multiple abliterated mids add noise once one abliterated reference demonstrates the refusal/quality trade. We keep a lean target set and re-run with a sharper judge panel.
## Where smaller siblings “win” and why
Larger variants in a family often carry stronger refusal/hedging layers and stylized “helpfulness” templates optimized for product raters. Our axes penalize disclaimers that replace content, universalizing, and tone smoothing. Scale also amplifies distributional priors, which shows up as overgeneralization. Smaller siblings tuned with lighter refusal and less stylization hit the requested shape more often. This is not a paradox and not a defense of tiny models; it’s a consequence of current instruction recipes. With retrieval and verification in production, capacity can dominate again. In a controlled, answers-only harness, it often does not.
## What stays constant
We still treat LLMs as probabilistic, generally unreliable components bounded by external enforcement. We still route by behavioral fingerprints, not leaderboard deltas. We still prefer typed interfaces, risk gates, and immutable logs over longer contexts and “think harder” prompts. The changes here are tactical: swap one judge to reduce correlation, and shrink the target list to components that shift routes or expose meaningful failure surfaces.
[^1]: /en/blog/2025-08-30-nullbench-fingerprints-starting-an-operational-playbook/ “nullbench Fingerprints: Starting an Operational Playbook”
[^2]: /en/blog/2025-08-31-nullbench-fingerprints-v1.1-stability-updates-and-new-routes/ “nullbench Fingerprints v1.1: Stability Updates and New Routes”
[^3]: https://huggingface.co/zai-org/glm-4-9b-chat-hf “GLM-4-9B-Chat — model card (Zhipu AI / Hugging Face)”
[^4]: https://huggingface.co/zai-org/glm-4-9b “GLM-4-9B — model card (Zhipu AI / Hugging Face)”
[^5]: https://huggingface.co/LGAI-EXAONE/EXAONE-3.5-32B-Instruct “EXAONE-3.5-32B-Instruct — model card (LG AI Research / Hugging Face)”
[^6]: https://github.com/LG-AI-EXAONE/EXAONE-3.5 “EXAONE-3.5 — official repository (LG AI Research)”
[^7]: https://arxiv.org/abs/2503.12524 “EXAONE Deep: Reasoning-Enhanced Language Models — arXiv”
[^8]: https://www.lgresearch.ai/data/upload/EXAONE_Deep__Model_Card.pdf “EXAONE Deep — model card PDF (LG AI Research)”
[^9]: https://huggingface.co/blog/mlabonne/abliteration “Uncensor any LLM with abliteration — Hugging Face blog”