We revised nullbench with a stability metric that no longer collapses on single spikes and a category set split into more granular writing-style and language-comprehension tracks. The overall portfolio remains the same [1], but two routes move under the new breakdown: writing now runs through qwen3-coder:30b, and translation shifts to gemma3:27b IT-QAT.
The default router does not change. qwen3-coder:30b still carries the highest quality-per-resource ratio, with throughput at 62.5 tokens per second and median response time at 6.2 seconds. Stability remains at 0.58, weaker than some larger models but predictable enough for routing. On the writing axes, it scores 89 for hard news, 92 for macro-polemics, 83 for market commentary, and 87 for niche topics. qwen3:30b posts a slightly higher coarse writing score at 90, but the five-fold increase in latency and lower efficiency ratio make it unattractive outside batch jobs.
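The 87.8 in the table below happens to line up with an unweighted mean of those four axes. A minimal sketch, assuming equal weights; the real nullbench aggregation may weight axes differently:

```python
# Composite writing score as an unweighted mean of the four writing axes.
# Axis keys and equal weighting are assumptions for illustration.
WRITING_AXES = {
    "hard_news": 89,
    "macro_polemics": 92,
    "market_commentary": 83,
    "niche_topics": 87,
}

def composite_writing_score(axes: dict[str, float]) -> float:
    """Average the per-axis writing scores into one composite value."""
    return sum(axes.values()) / len(axes)

print(composite_writing_score(WRITING_AXES))  # 87.75, reported as 87.8 in the table
```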
Language testing produces a clearer split. gemma3:27b IT-QAT is the most balanced across directions: Japanese 91, Chinese 94, German 91, Russian 93. mistral small3.2:24b peaks at 96 on Japanese production but drags on other languages. qwen3-coder:30b is strong in Chinese and Russian (93 and 94) but drops to 85 in Japanese. With stability at 0.84 and consistent tails, gemma3 takes the translation role, with mistral reserved when Japanese output must dominate.
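In routing terms the rule is tiny. A sketch, with the function name and model-tag spellings as assumptions (adjust to whatever your local registry calls them):

```python
# Translation routing: gemma3 is the default translator, mistral takes over
# only when the request targets Japanese. Tags are illustrative guesses.
def pick_translation_model(target_lang: str) -> str:
    """Route a translation request to the model chosen in this run."""
    if target_lang.lower() in {"ja", "japanese"}:
        return "mistral-small3.2:24b"   # Japanese production peak (96)
    return "gemma3:27b-it-qat"          # most balanced across directions

print(pick_translation_model("ja"))      # mistral-small3.2:24b
print(pick_translation_model("german"))  # gemma3:27b-it-qat
```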
See the results of the refined run 2025-08-30-2.
[Benchmark scatterplot ('size')]
Writing quality and operations
Model | Writing score | Median RT (s) | QPR % | Notes |
---|---|---|---|---|
qwen3-coder 30B | 87.8 | 6.2 | 100 | Selected writer, efficient |
qwen3 30B | 90.0 | 33.9 | 16 | Higher raw score, slower |
mistral small3.2 24B | 88.0 | 33.9 | 7 | Solid, heavier footprint |
gemma3 27B IT-QAT | 86.5 | 99.6 | 1 | Used for translation only |
Translation highlights
Model | Composite signal | Production scores |
---|---|---|
gemma3 27B IT-QAT | Highest balance | Japanese 91, Chinese 94, German 91, Russian 93 |
mistral small3.2 24B | Japanese peak | Japanese 96, Russian 89, Arabic 85 |
qwen3-coder 30B | Chinese/Russian | Chinese 93, Russian 94, Japanese 85 |
Stability snapshot v1.1
Model | Stability | p50 RT (s) | p95 RT (s) |
---|---|---|---|
qwq 32B | 0.93 | 156.9 | 180.0 |
qwen3 32B | 0.92 | 152.4 | 180.0 |
gemma3 27B IT-QAT | 0.84 | 99.6 | 141.8 |
hermes3 3B | 0.85 | 6.2 | 9.0 |
mistral small3.2 24B | 0.79 | 33.9 | 57.0 |
qwen3-coder 30B | 0.58 | 6.2 | 17.4 |
llama3.1 8B | 0.80 | 11.3 | 17.8 |
gpt-oss 20B | 0.49 | 49.0 | 151.4 |
The long tails on some 32B models remain, but the v1.1 adjustment prevents them from crushing the score and shrinks the estimate for small-sample runs toward 0.85.
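For a sense of what that adjustment does, here is a rough sketch, not the exact nullbench formula: a quantile-based tail ratio so a single spike cannot zero the score, plus shrinkage toward a 0.85 prior when the run is short. The prior, the pseudo-count, and the p50/p95 form are all assumptions.

```python
# Spike-resistant stability with small-sample shrinkage (illustrative only).
import statistics

def stability_v1_1(response_times: list[float], prior: float = 0.85,
                   pseudo_count: int = 10) -> float:
    """Estimate stability from response-time dispersion, leaning on a prior for short runs."""
    n = len(response_times)
    p50 = statistics.median(response_times)
    p95 = statistics.quantiles(response_times, n=20)[-1]  # ~95th percentile
    raw = p50 / p95 if p95 > 0 else 1.0                   # 1.0 means a perfectly flat tail
    weight = n / (n + pseudo_count)                       # small runs lean on the prior
    return weight * raw + (1 - weight) * prior
```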
Routing policy
Default and writing both run through qwen3-coder:30b. Translation and markets run through gemma3:27b IT-QAT, with mistral small3.2:24b for software and technical tasks. hermes3:3b remains the draft and comment generator. gpt-oss:20b is retired for lack of distinct upside, and qwq:32b is dropped as redundant with qwen3:32b.
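As a config, the policy collapses to a small lookup. A sketch, with the task labels and tag spellings as assumptions:

```python
# The routing table above as a simple mapping keyed by task label.
ROUTES = {
    "default":     "qwen3-coder:30b",
    "writing":     "qwen3-coder:30b",
    "translation": "gemma3:27b-it-qat",
    "markets":     "gemma3:27b-it-qat",
    "software":    "mistral-small3.2:24b",
    "technical":   "mistral-small3.2:24b",
    "draft":       "hermes3:3b",
    "comments":    "hermes3:3b",
}

def route(task: str) -> str:
    """Return the model for a task, falling back to the default route."""
    return ROUTES.get(task, ROUTES["default"])
```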
Notes
Summarization is still weak, with qwen3-coder:30b in the 36–53 range on granular tests. We tested specifically for overgeneralization [2] and saw the same failure patterns across models. This means strict length caps and retrieval checks remain mandatory. QPR percentage continues to serve as an early filter; qwen3-coder:30b still anchors efficiency at 100 percent, which is why it now carries both the default and writing routes.
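The filter itself is trivial. A sketch using the QPR percentages from the table above, with the 5 percent cutoff as an illustrative assumption; it is a screen rather than a hard rule, since gemma3 still carries translation despite its 1 percent QPR:

```python
# QPR as an early filter: drop models below a cutoff before deeper evaluation.
QPR = {
    "qwen3-coder:30b":      100,
    "qwen3:30b":             16,
    "mistral-small3.2:24b":   7,
    "gemma3:27b-it-qat":      1,
}

def qpr_filter(scores: dict[str, int], cutoff: int = 5) -> list[str]:
    """Keep only models efficient enough to be worth deeper evaluation."""
    return [model for model, qpr in scores.items() if qpr >= cutoff]

print(qpr_filter(QPR))  # gemma3 falls below the cutoff but keeps its translation role
```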