nullbench Fingerprints v1.1: Stability Updates and New Routes

We revised nullbench with a stability metric that avoids collapse on single spikes and a category set split into more granular writing-style and language-comprehension tracks. The overall portfolio remains the same [1], but two routes move under the new breakdown: writing now runs through qwen3-coder:30b and translation shifts to gemma3:27b IT-QAT.

The default router does not change. qwen3-coder:30b still carries the highest quality-per-resource ratio, with throughput at 62.5 tokens per second and median response time at 6.2 seconds. Stability remains at 0.58, weaker than some larger models but predictable enough for routing. On the writing axes, it scores 89 for hard news, 92 for macro-polemics, 83 for market commentary, and 87 for niche topics. qwen3:30b posts a slightly higher coarse writing score at 90, but the five-fold increase in latency and lower efficiency ratio make it unattractive outside batch jobs.
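As a rough illustration of how the QPR filter works: the cost term blends more than latency (note in the table below that mistral small3.2:24b and qwen3:30b share a median RT but not a QPR), so the sketch leaves the resource cost abstract rather than trying to reproduce the published percentages.

```python
# Sketch of the QPR early filter: quality per unit of resource,
# normalized so the most efficient model reads as 100 percent.
# The resource blend is left abstract because latency alone does not
# reproduce the published percentages (footprint enters the cost too).

def qpr_percent(quality: dict[str, float], resource: dict[str, float]) -> dict[str, int]:
    raw = {m: quality[m] / resource[m] for m in quality}
    best = max(raw.values())
    return {m: round(100 * v / best) for m, v in raw.items()}

# With writing score over median RT alone, qwen3-coder:30b already
# anchors the scale at 100:
print(qpr_percent(
    {"qwen3-coder:30b": 87.8, "qwen3:30b": 90.0},
    {"qwen3-coder:30b": 6.2,  "qwen3:30b": 33.9},
))  # {'qwen3-coder:30b': 100, 'qwen3:30b': 19}
```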

Language testing produces a clearer split. gemma3:27b IT-QAT is the most balanced across directions: Japanese 91, Chinese 94, German 91, Russian 93. mistral small3.2:24b peaks at 96 on Japanese production but drags on other languages. qwen3-coder:30b is strong in Chinese and Russian (93 and 94) but drops to 85 in Japanese. With stability at 0.84 and consistent tails, gemma3 takes the translation role, with mistral reserved when Japanese output must dominate.
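The route decision that falls out of these numbers is simple enough to sketch. The max-min balance criterion below is our shorthand for the composite signal, not the literal production formula:

```python
# The translation-route decision from the scores above. "Balance" is
# read as the highest worst-case language.

TRANSLATION_SCORES = {
    "gemma3:27b-it-qat":    {"ja": 91, "zh": 94, "de": 91, "ru": 93},
    "mistral-small3.2:24b": {"ja": 96, "ru": 89, "ar": 85},
    "qwen3-coder:30b":      {"zh": 93, "ru": 94, "ja": 85},
}

def pick_translator(scores, japanese_dominant: bool = False) -> str:
    if japanese_dominant:
        # Route on Japanese alone when that output must dominate.
        return max(scores, key=lambda m: scores[m].get("ja", 0))
    # Otherwise prefer balance: the best worst-case language.
    return max(scores, key=lambda m: min(scores[m].values()))

print(pick_translator(TRANSLATION_SCORES))                          # gemma3:27b-it-qat
print(pick_translator(TRANSLATION_SCORES, japanese_dominant=True))  # mistral-small3.2:24b
```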

See the results of the refined run 2025-08-30-2.

[Figure: benchmark scatterplot, points scaled by 'size']

Writing quality and operations

Model                  Writing score   Median RT (s)   QPR %   Notes
qwen3-coder 30B        87.8            6.2             100     Selected writer, efficient
qwen3 30B              90.0            33.9            16      Higher raw score, slower
mistral small3.2 24B   88.0            33.9            7       Solid, heavier footprint
gemma3 27B IT-QAT      86.5            99.6            1       Used for translation only

Translation highlights

Model                  Composite signal   Production scores
gemma3 27B IT-QAT      Highest balance    Japanese 91, Chinese 94, German 91, Russian 93
mistral small3.2 24B   Japanese peak      Japanese 96, Russian 89, Arabic 85
qwen3-coder 30B        Chinese/Russian    Chinese 93, Russian 94, Japanese 85

Stability snapshot v1.1

Model                  Stability   p50 RT (s)   p95 RT (s)
qwq 32B                0.93        156.9        180.0
qwen3 32B              0.92        152.4        180.0
gemma3 27B IT-QAT      0.84        99.6         141.8
hermes3 3B             0.85        6.2          9.0
mistral small3.2 24B   0.79        33.9         57.0
qwen3-coder 30B        0.58        6.2          17.4
llama3.1 8B            0.80        11.3         17.8
gpt-oss 20B            0.49        49.0         151.4

The long tails on some 32B models remain, but the v1.1 adjustment keeps a single spike from crushing the score and shrinks small-sample runs toward 0.85.
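A minimal sketch of those two properties, spike robustness and small-sample shrinkage. The raw term sqrt(p50/p95) lands close to most rows above (sqrt(6.2/17.4) ≈ 0.60 against qwen3-coder:30b's published 0.58), but the exact formula and the prior weight here are illustrative, not the production scorer:

```python
import math
import statistics

def stability_v11(latencies_s: list[float], prior: float = 0.85,
                  prior_weight: int = 8) -> float:
    """Sketch of the v1.1 stability score; treat the formula as illustrative.

    Two properties from the post are modeled:
      - spike robustness: the raw term compares p50 to p95, so a single
        outlier cannot crush the score the way a max or variance would;
      - small-sample shrinkage: short runs lean on the 0.85 prior.
    """
    ordered = sorted(latencies_s)
    n = len(ordered)
    p50 = statistics.median(ordered)
    p95 = ordered[min(n - 1, math.ceil(0.95 * n) - 1)]  # nearest-rank p95
    raw = math.sqrt(p50 / p95) if p95 > 0 else 1.0
    weight = n / (n + prior_weight)  # small n -> lean on the prior
    return weight * raw + (1 - weight) * prior
```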

Routing policy

Default and writing both run through qwen3-coder:30b. Translation and markets run through gemma3:27b IT-QAT, with mistral small3.2:24b for software and technical tasks. hermes3:3b remains the draft and comment generator. gpt-oss:20b is retired for lack of distinct upside, and qwq:32b is dropped as redundant with qwen3:32b.
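In config terms, the policy reduces to a small dispatch map. The model tags below are normalized to registry-style strings and the category keys are shorthand for our task labels; adjust both to your serving layer:

```python
# The v1.1 route table as described above, in dispatch-map form.

ROUTES = {
    "default":     "qwen3-coder:30b",
    "writing":     "qwen3-coder:30b",
    "translation": "gemma3:27b-it-qat",
    "markets":     "gemma3:27b-it-qat",
    "software":    "mistral-small3.2:24b",
    "technical":   "mistral-small3.2:24b",
    "drafts":      "hermes3:3b",
    "comments":    "hermes3:3b",
}

def route(task: str) -> str:
    # Unknown task categories fall back to the default route.
    return ROUTES.get(task, ROUTES["default"])
```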

Notes

Summarization is still weak, with qwen3-coder:30b in the 36–53 range on granular tests. We tested specifically for overgeneralization [2] and saw the same failure patterns across models. This means strict length caps and retrieval checks remain mandatory. QPR percentage continues to serve as an early filter; qwen3-coder:30b still anchors efficiency at 100 percent, which is why it now carries both the default and writing routes.
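As a sketch of what those mandatory guardrails can look like: a hard length cap plus a crude lexical support check. The helper and its thresholds are hypothetical, and a production retrieval check should compare against the actual source chunks rather than bag-of-words overlap:

```python
# Hypothetical guardrail for the weak summarization route: a hard
# length cap plus a crude lexical support check against the source.
# Thresholds and the overlap heuristic are illustrative only.

def check_summary(source: str, summary: str,
                  max_words: int = 120, min_support: float = 0.8) -> bool:
    words = summary.split()
    if len(words) > max_words:  # strict length cap
        return False
    strip = ".,;:!?\"'()"
    source_vocab = {w.strip(strip) for w in source.lower().split()}
    content = [w.lower().strip(strip) for w in words if len(w) > 3]
    if not content:
        return False
    supported = sum(w in source_vocab for w in content)
    # Retrieval check: most content words must appear in the source,
    # which catches the overgeneralization failures flagged above.
    return supported / len(content) >= min_support
```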