The latest nullbench run used the revised judge panel with glm-4-9b as the mid judge. The swap cut judge-family correlation and exposed variance that earlier Qwen-anchored panels had smoothed over. The aggregate scores confirm the operational playbook set in v1.2¹, with sharper separation in stability and quality-per-resource (QPR).
See the results.
[Figure: benchmark scatterplot ('size')]
Bench outcomes
gemma3:27b-it-qat (78.44) and magistral:24b (78.17) are the top scorers, but their profiles diverge. Gemma3 posts strong category peaks (Languages 91.75) at the cost of ~100s response medians and near-zero QPR². Magistral balances high domain scores (Software 83.86, Tech 89.2) with ~25s medians and much higher QPR. Both justify narrow, role-specific routes rather than default placement.
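The exact QPR definition lives in the extended-metrics post²; the sketch below is only an illustration of a quality-per-resource style ratio, with hypothetical field names and unit weights rather than the harness's actual formula.

```python
from dataclasses import dataclass

@dataclass
class RunStats:
    score: float             # aggregate rubric score, 0-100
    median_latency_s: float  # median response time in seconds
    peak_mem_gb: float       # observed peak memory in GB

def qpr(run: RunStats, latency_weight: float = 1.0, mem_weight: float = 1.0) -> float:
    """Toy quality-per-resource ratio: score divided by a weighted resource cost.
    Not the nullbench formula; it only shows why ~100s medians push QPR toward zero."""
    cost = latency_weight * run.median_latency_s + mem_weight * run.peak_mem_gb
    return run.score / cost if cost > 0 else 0.0

# Scores and medians from the results above; the memory figures are placeholders.
gemma = RunStats(score=78.44, median_latency_s=100.0, peak_mem_gb=20.0)
magistral = RunStats(score=78.17, median_latency_s=25.0, peak_mem_gb=15.0)
print(f"gemma3 {qpr(gemma):.2f} vs magistral {qpr(magistral):.2f}")  # near-identical scores, very different QPR
```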
hermes3:3b continues to dominate efficiency: high QPR, low median response times (~5s), and good stability (0.81). qwen3-coder:30b holds the anchor position with high quality across axes (Software 84.43, Encyclopedic 93.71) and a good QPR. glm-4-9b landed as intended: a reliable mid-field signal not tied to Qwen biases.
exaone-deep:32b shows the same anomaly noted in v1.2: ~3.9 GB observed peak memory for a 32B-class model and near-frozen runtime tails, but a gigantic ~176s response median. It leads summarization by a wide margin (79.75) and scores unexpectedly high on hostile prompts (69.0), yet the latency makes it a poor fit for interactive tasks. The abliterated qwen3:14b curiously lifts summarization (72.25) while losing tone control and stability, reinforcing that refusal-disabled variants serve as contrast instruments.
Why Summarization Scores Collapse
The summarization tasks demand strict retention of constraints that models are trained to ignore. The rubric checks that sample sizes, locales, time windows, null results, and hedging terms are preserved, and it penalizes causal upgrades, prescriptive phrasing, and omission of adverse effects.
Models fail because their optimization target is fluency and generality. Pretraining and reinforcement favor confident continuations, short phrasing, and scope expansion. Under length pressure they drop caveats and null results, replace “may reduce” with “reduces”, or inflate “a randomized pilot in Berlin, n=38 adults with mild insomnia” into “patients”³. These are not random slips but the expected output of systems tuned to sound helpful rather than to preserve limits.
The uniform collapse across checkpoints shows this is a structural failure mode: current LLMs are aligned to overstate and simplify, while the test checks for fidelity to uncertainty, scope, and downside.⁴
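To make the failure mode concrete, here is a minimal sketch of the kind of fidelity check the rubric implies: flag hedges the summary dropped and causal claims it upgraded. The term lists and the fidelity_flags helper are illustrative assumptions, not the nullbench implementation.

```python
import re

# Illustrative term lists; the real rubric also covers sample sizes, locales,
# time windows, null results, and adverse effects.
HEDGES = ["may", "might", "suggests", "preliminary", "pilot"]
CAUSAL_UPGRADES = [
    (r"\bmay reduce\b", r"\breduces\b"),
    (r"\bis associated with\b", r"\bcauses\b"),
]

def fidelity_flags(source: str, summary: str) -> list[str]:
    """Flag hedges present in the source but missing from the summary,
    and hedged claims that the summary turned into causal ones."""
    src, out = source.lower(), summary.lower()
    flags = [f"dropped hedge: {h!r}" for h in HEDGES if h in src and h not in out]
    for hedged, strong in CAUSAL_UPGRADES:
        if re.search(hedged, src) and re.search(strong, out):
            flags.append(f"causal upgrade: {hedged} -> {strong}")
    return flags

source = ("A randomized pilot in Berlin, n=38 adults with mild insomnia, "
          "suggests the intervention may reduce sleep latency.")
summary = "The intervention reduces sleep latency in patients."
print(fidelity_flags(source, summary))  # dropped 'may', 'suggests', 'pilot'; one causal upgrade
```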
The exception is exaone-deep:32b, which posts unusually high marks on hedging, framing, and omission balance while still scoring weakly on the overgeneralization guardrail (66). This pattern implies tuning that preserves surface qualifiers and register but cannot suppress the deeper structural bias toward scope expansion. The result is a profile that looks strong in rubric subscores but still collapses on the axis that matters most for high-stakes use, leaving it the only “reliable” summarizer at an unusable operational cost.
[Table: benchmark results for domain 'Summarization']
Category notes
- Summarization: Every general model fails. exaone-deep:32b is the only consistent summarizer, confirming that summarization is a structurally broken axis for generic instruction-tuned models.
- Hostile prompts: hermes3:3b again shows the highest compliance (65.3) at interactive speed; exaone-deep:32b scores higher but with unusable latency.
- Compliance: All models remain weak. Gemma3 (64) and Qwen-Coder (69.1) are the best of a poor set.
- Throughput: Hermes and GLM saturate throughput; Gemma and EXAONE stall.
Routing implications
Route / Task | Primary Model | Justification
---|---|---
Generalist | qwen3-coder:30b | High rubric fit and QPR. |
High-efficiency / scale | hermes3:3b | Unmatched QPR and speed for interactive volume. |
Software / technical | magistral:24b | Strong category scores, balanced runtime, distinct from Qwen family. |
Batch summarization | exaone-deep:32b | Only reliable summarizer. |
Translation | gemma3:27b-it-qat | Highest Languages quality, despite poor ops. |
Schema-fill / retries | glm-4-9b | Balanced mid-field profile, distinct lineage. |
Hostile commentary | hermes3:3b | Fastest and most consistent refusal probe. |
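Operationally, these assignments reduce to a static dispatch table. The sketch below mirrors the routes above; the ROUTES mapping, the pick_model helper, and the generalist fallback are illustrative assumptions, not part of the harness.

```python
# Hypothetical route table derived from the results above.
ROUTES: dict[str, str] = {
    "generalist": "qwen3-coder:30b",
    "high_efficiency": "hermes3:3b",
    "software": "magistral:24b",
    "batch_summarization": "exaone-deep:32b",
    "translation": "gemma3:27b-it-qat",
    "schema_fill": "glm-4-9b",
    "hostile_commentary": "hermes3:3b",
}

def pick_model(route: str) -> str:
    """Return the primary model for a route, falling back to the generalist."""
    return ROUTES.get(route, ROUTES["generalist"])

print(pick_model("translation"))    # gemma3:27b-it-qat
print(pick_model("unknown_route"))  # falls back to qwen3-coder:30b
```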
Learnings
- Summarization remains structurally broken. General models conflate fluency with fidelity and lose hedging detail; only a specialized line like EXAONE-Deep performs, at operational cost.
- Efficiency inversion. Smaller checkpoints repeatedly outperform larger siblings when the metric debits padding and refusal. This harness falsifies the claim that parameter scale alone predicts quality.
- Abliteration trade is predictable but measurable. Gains in summarization and hostile compliance are offset by loss of tone control and stability; one abliterated reference is enough.
The unresolved condition: no generalist model produces reliable summaries under a rubric that checks hedging and scope fidelity. Until training aligns incentives with fidelity instead of fluency, summarization stays a specialist slot with a cost profile unsuitable for interactive systems.
1. /en/blog/2025-09-14-llm-fingerprints-v1.2-efficient-mid-tier-models-judge-and-lineup-refresh/ “LLM Fingerprints v1.2: Efficient Mid Tier Models, Judge and Lineup Refresh” ↩︎
2. /en/blog/2025-08-23-nullbench-bias-benchmarking-for-large-language-models/#extended-metrics “nullbench: Bias Benchmarking for Large Language Models” ↩︎
3. https://arxiv.org/pdf/2504.00025 “Generalization Bias in Large Language Model Summarization of Scientific Research (April 2025)” ↩︎
4. /en/blog/2025-08-10-structural-overgeneralization-in-llm-summarization/ “Structural overgeneralization in LLM summarization” ↩︎