We revised nullbench with a stability metric that no longer collapses on single spikes and a category set split into more granular writing-style and language-comprehension tracks. The overall portfolio remains the same [1], but two routes move under the new breakdown: writing now runs through qwen3-coder:30b, and translation shifts to gemma3:27b IT-QAT.
The default router does not change. qwen3-coder:30b still carries the highest quality-per-resource ratio, with throughput at 62.5 tokens per second and median response time at 6.2 seconds. Stability remains at 0.58, weaker than some larger models but predictable enough for routing. On the writing axes, it scores 89 for hard news, 92 for macro-polemics, 83 for market commentary, and 87 for niche topics. qwen3:30b posts a slightly higher coarse writing score at 90, but the five-fold increase in latency and lower efficiency ratio make it unattractive outside batch jobs.
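The 87.8 in the table below happens to line up with an unweighted mean of those four axes. A minimal sketch, assuming equal weights; the real nullbench aggregation may weight axes differently:

```python
# Composite writing score as an unweighted mean of the four writing axes.
# Axis keys and equal weighting are assumptions for illustration.
WRITING_AXES = {
    "hard_news": 89,
    "macro_polemics": 92,
    "market_commentary": 83,
    "niche_topics": 87,
}

def composite_writing_score(axes: dict[str, float]) -> float:
    """Average the per-axis writing scores into one composite value."""
    return sum(axes.values()) / len(axes)

print(composite_writing_score(WRITING_AXES))  # 87.75, reported as 87.8 in the table
```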
Language testing produces a clearer split. gemma3:27b IT-QAT is the most balanced across directions: Japanese 91, Chinese 94, German 91, Russian 93. mistral small3.2:24b peaks at 96 on Japanese production but drags on other languages. qwen3-coder:30b is strong in Chinese and Russian (93 and 94) but drops to 85 in Japanese. With stability at 0.84 and consistent tails, gemma3 takes the translation role, with mistral reserved when Japanese output must dominate.
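In routing terms the rule is tiny. A sketch, with the function name and model-tag spellings as assumptions (adjust to whatever your local registry calls them):

```python
# Translation routing: gemma3 is the default translator, mistral takes over
# only when the request targets Japanese. Tags are illustrative guesses.
def pick_translation_model(target_lang: str) -> str:
    """Route a translation request to the model chosen in this run."""
    if target_lang.lower() in {"ja", "japanese"}:
        return "mistral-small3.2:24b"   # Japanese production peak (96)
    return "gemma3:27b-it-qat"          # most balanced across directions

print(pick_translation_model("ja"))      # mistral-small3.2:24b
print(pick_translation_model("german"))  # gemma3:27b-it-qat
```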
See the results of the refined run 2025-08-30-2.
[Benchmark scatterplot ('size')]
Writing quality and operations
Model | Writing score | Median RT (s) | QPR % | Notes |
---|---|---|---|---|
qwen3-coder 30B | 87.8 | 6.2 | 100 | Selected writer, efficient |
qwen3 30B | 90.0 | 33.9 | 16 | Higher raw score, slower |
mistral small3.2 24B | 88.0 | 33.9 | 7 | Solid, heavier footprint |
gemma3 27B IT-QAT | 86.5 | 99.6 | 1 | Used for translation only |
Translation highlights
Model | Composite signal | Production scores |
---|---|---|
gemma3 27B IT-QAT | Highest balance | Japanese 91, Chinese 94, German 91, Russian 93 |
mistral small3.2 24B | Japanese peak | Japanese 96, Russian 89, Arabic 85 |
qwen3-coder 30B | Chinese/Russian | Chinese 93, Russian 94, Japanese 85 |
Stability snapshot v1.1
Model | Stability | p50 RT (s) | p95 RT (s) |
---|---|---|---|
qwq 32B | 0.93 | 156.9 | 180.0 |
qwen3 32B | 0.92 | 152.4 | 180.0 |
gemma3 27B IT-QAT | 0.84 | 99.6 | 141.8 |
hermes3 3B | 0.85 | 6.2 | 9.0 |
mistral small3.2 24B | 0.79 | 33.9 | 57.0 |
qwen3-coder 30B | 0.58 | 6.2 | 17.4 |
llama3.1 8B | 0.80 | 11.3 | 17.8 |
gpt-oss 20B | 0.49 | 49.0 | 151.4 |
The long tails on some 32B models remain, but the v1.1 adjustment prevents them from crushing the score and shrinks the estimate for small-sample runs toward 0.85.
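For a sense of what that adjustment does, here is a rough sketch, not the exact nullbench formula: a quantile-based tail ratio so a single spike cannot zero the score, plus shrinkage toward a 0.85 prior when the run is short. The prior, the pseudo-count, and the p50/p95 form are all assumptions.

```python
# Spike-resistant stability with small-sample shrinkage (illustrative only).
import statistics

def stability_v1_1(response_times: list[float], prior: float = 0.85,
                   pseudo_count: int = 10) -> float:
    """Estimate stability from response-time dispersion, leaning on a prior for short runs."""
    n = len(response_times)
    p50 = statistics.median(response_times)
    p95 = statistics.quantiles(response_times, n=20)[-1]  # ~95th percentile
    raw = p50 / p95 if p95 > 0 else 1.0                   # 1.0 means a perfectly flat tail
    weight = n / (n + pseudo_count)                       # small runs lean on the prior
    return weight * raw + (1 - weight) * prior
```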
Routing policy
Default and writing both run through qwen3-coder:30b. Translation and markets run through gemma3:27b IT-QAT, with mistral small3.2:24b for software and technical tasks. hermes3:3b remains the draft and comment generator. gpt-oss:20b is retired for lack of distinct upside, and qwq:32b is dropped as redundant with qwen3:32b.
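As a config, the policy collapses to a small lookup. A sketch, with the task labels and tag spellings as assumptions:

```python
# The routing table above as a simple mapping keyed by task label.
ROUTES = {
    "default":     "qwen3-coder:30b",
    "writing":     "qwen3-coder:30b",
    "translation": "gemma3:27b-it-qat",
    "markets":     "gemma3:27b-it-qat",
    "software":    "mistral-small3.2:24b",
    "technical":   "mistral-small3.2:24b",
    "draft":       "hermes3:3b",
    "comments":    "hermes3:3b",
}

def route(task: str) -> str:
    """Return the model for a task, falling back to the default route."""
    return ROUTES.get(task, ROUTES["default"])
```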
Notes
Summarization is still weak, with qwen3-coder:30b in the 36–53 range on granular tests. We tested specifically for overgeneralization [2] and saw the same failure patterns across models. This means strict length caps and retrieval checks remain mandatory. QPR percentage continues to serve as an early filter; qwen3-coder:30b still anchors efficiency at 100 percent, which is why it now carries both the default and writing routes.
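The filter itself is trivial. A sketch using the QPR percentages from the table above, with the 5 percent cutoff as an illustrative assumption; it is a screen rather than a hard rule, since gemma3 still carries translation despite its 1 percent QPR:

```python
# QPR as an early filter: drop models below a cutoff before deeper evaluation.
QPR = {
    "qwen3-coder:30b":      100,
    "qwen3:30b":             16,
    "mistral-small3.2:24b":   7,
    "gemma3:27b-it-qat":      1,
}

def qpr_filter(scores: dict[str, int], cutoff: int = 5) -> list[str]:
    """Keep only models efficient enough to be worth deeper evaluation."""
    return [model for model, qpr in scores.items() if qpr >= cutoff]

print(qpr_filter(QPR))  # gemma3 falls below the cutoff but keeps its translation role
```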