LAUNCH ETA: October 2025

nullbench Update: Expanding the Mid-Field and Refining the Judge Panel


The nullbench framework continues to evolve toward reproducible, interpretable behavioral analysis of language models. Since version 1.3, which introduced a revised judge panel anchored by glm-4-9b, recent short exploratory runs with cogito:8b and devstral:24b have suggested two directions for further development. The first is an expansion of the tested model families to capture the increasingly competitive “mid-field”, where 3–14 billion-parameter systems show high goal alignment and efficiency. The second is a focused re-evaluation of small-model judges on the Compliance axis, the most sensitive measure for instruction fidelity and refusal balance.

Broadening the Mid-Field

The interim results made clear that the 8-billion-parameter Cogito checkpoint's strong performance was not an anomaly. Cogito models, produced by Deep Cogito in San Francisco, use an Iterated Distillation and Amplification (IDA) training regime that enables iterative reasoning refinement across checkpoints [1]. The 8B model preserved reasoning depth and epistemic restraint at runtime costs comparable to much smaller models. In parallel, devstral:24b reached the highest overall composite in its size class, exposing an inflection zone where scale no longer predicts fidelity.

These findings prompted a broader run covering both existing and newly released families. The next benchmark will include additional Gemma 3, Granite 4, Cogito, and Exaone-Deep checkpoints, together with Mistral-Nemo, a 2024 collaboration between Mistral AI and NVIDIA.

The Gemma 3 expansion fills the middle of the curve with 12b-it-qat and 4b-it-qat, variants that allow testing quantization-aware efficiency against the 27B reference. IBM's Granite 4 series enters as a new hybrid Mamba/Transformer architecture intended to cut memory use and latency while retaining long-context capability [2]. These models (small-h, tiny-h, micro-h) follow the strong operational profile of granite3.3, previously used as a reliable judge.

The Cogito family will now be sampled at 3b, 14b, and 32b, complementing the promising results already gathered from the 8b model. The aim is to trace how IDA-based self-reflection scales under null-conditioned testing, where introspection is neither rewarded nor prompted. Exaone-Deep contributes smaller 7.8b and 2.4b checkpoints from LG AI Research, whose larger 32B model previously delivered unmatched summarization fidelity but untenable latency [3]. These smaller versions have demonstrated competitive reasoning and mathematical accuracy on external benchmarks while operating within efficient runtime envelopes [4].

Finally, Mistral-Nemo, a 12-billion-parameter model with a 128k-token context window and a multilingual "Tekken" tokenizer, will serve as a long-context reference in the same weight class as magistral:24b [5].

Together these additions extend the testbed into the region where architectural design and training methodology—not raw scale—determine epistemic quality.

Candidates for removal include qwen3:30b, which lands at the bottom of most tests; qwen3:4b, whose 'thinking' overhead makes its responses very slow for a model in that size class; and exaone-deep:32b, whose response time is unworkable and whose summarization strength the smaller Exaone-Deep checkpoints may match. Further, gemma3:27b-it-qat has to prove its viability against this line-up given its slower median response time (~100 s) and higher memory requirements.
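For illustration, here is a minimal sketch of what the next run's roster could look like as a configuration, using the Ollama-style tags quoted above. The layout, field names, and the exact Mistral-Nemo tag are assumptions for readability, not nullbench's actual configuration format:

    # Hypothetical roster sketch for the next nullbench run (illustrative only,
    # not the real nullbench configuration schema).
    NEXT_RUN = {
        "additions": [
            "gemma3:12b-it-qat", "gemma3:4b-it-qat",        # Gemma 3 QAT mid-field
            "granite4:small-h", "granite4:tiny-h", "granite4:micro-h",  # hybrid Mamba/Transformer
            "cogito:3b", "cogito:14b", "cogito:32b",        # IDA scaling sweep
            "exaone-deep:7.8b", "exaone-deep:2.4b",         # smaller Exaone-Deep checkpoints
            "mistral-nemo",                                 # 12B, 128k-token long-context reference
        ],
        "removal_candidates": {
            "qwen3:30b": "bottom of most tests",
            "qwen3:4b": "'thinking' overhead, very slow for its size class",
            "exaone-deep:32b": "unworkable response time",
        },
        "on_probation": {
            "gemma3:27b-it-qat": "slow median response (~100 s), high memory use",
        },
    }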

Refining the Gatekeepers

The judge panel defined in nullbench: Judge Panel and Methodology (2025-08-24) has proven stable and cost-efficient. Its later evolution introduced glm-4-9b as a mid-field anchor, which improved variance exposure and balanced family correlation. Even so, compliance scoring—how faithfully a response follows task instructions without over-refusal—remains the most delicate component of the metric.

The next phase will revisit the Compliance axis exclusively, using a dense configuration of fast or small models as candidate judges. Each candidate will be evaluated against a higher-capacity reference set: qwen3-coder:30b, glm-4-9b, cogito:8b, hermes3:3b, and either granite3.3 or a smaller Granite 4 variant depending on runtime. Restricting the task to Compliance allows a larger judge ensemble without inflating total compute cost.
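To make the comparison concrete, here is a minimal sketch, under assumed data shapes, of how the reference set's Compliance ratings could be pooled into a consensus that candidate judges are measured against. The judge tags come from the list above; the function name, the score layout, and the choice of granite3.3 over a Granite 4 variant are illustrative assumptions, not nullbench internals:

    # Hypothetical sketch: pool the reference judges' 1-5 Compliance ratings
    # into a per-item consensus. The data layout (judge -> item -> rating)
    # is assumed, not nullbench's actual schema.
    from statistics import mean

    REFERENCE_JUDGES = [
        "qwen3-coder:30b", "glm-4-9b", "cogito:8b", "hermes3:3b", "granite3.3",
    ]

    def reference_consensus(ratings: dict[str, dict[str, int]]) -> dict[str, float]:
        """Average each scored item's Compliance ratings across the reference judges."""
        items = next(iter(ratings.values())).keys()
        return {item: mean(ratings[judge][item] for judge in REFERENCE_JUDGES)
                for item in items}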

The process mirrors the original meta-benchmark: judges will rate fixed prompts on the 1–5 axis schema, with refusals gated at two-thirds agreement. The metrics of interest are ordering preservation relative to the reference ranking, mean inflation Δ in compliance scores, and Krippendorff’s α for inter-judge agreement. A candidate maintaining correlation (τ ≥ 0.8) and low inflation (|Δ| ≤ 10) while matching baseline refusal rates will qualify for inclusion.
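The acceptance check itself could then look like the sketch below. Two readings are assumed and flagged as such: τ is taken to be Kendall's rank correlation over the scored items, and the inflation bound |Δ| ≤ 10 is read as a percentage difference in mean compliance score, since a raw difference of 10 is impossible on a 1-5 scale. Function names, the refusal tolerance, and anything else not quoted in the text are illustrative:

    # Hypothetical gating check for one candidate judge (illustrative only).
    from scipy.stats import kendalltau

    def refusal_gated(flags: list[bool]) -> bool:
        """Two-thirds agreement gate: treat an item as refused only if at
        least two-thirds of the judges flagged it as a refusal."""
        return sum(flags) / len(flags) >= 2 / 3

    def qualifies(candidate: dict[str, float],
                  reference: dict[str, float],
                  candidate_refusal_rate: float,
                  baseline_refusal_rate: float,
                  tau_min: float = 0.8,              # threshold quoted in the text
                  max_inflation_pct: float = 10.0,   # |delta| <= 10, read here as percent
                  refusal_tolerance: float = 0.05) -> bool:  # assumed tolerance
        """Check ordering preservation, mean inflation, and refusal-rate match."""
        items = sorted(candidate)
        cand = [candidate[i] for i in items]
        ref = [reference[i] for i in items]

        tau, _ = kendalltau(cand, ref)                          # rank correlation
        inflation = 100.0 * (sum(cand) - sum(ref)) / sum(ref)   # mean inflation, %
        refusal_gap = abs(candidate_refusal_rate - baseline_refusal_rate)

        return (tau >= tau_min
                and abs(inflation) <= max_inflation_pct
                and refusal_gap <= refusal_tolerance)

Krippendorff's α would be computed across the whole judge ensemble rather than per candidate, so it is left out of this per-candidate check.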

The objective is not to replace the existing panel outright but to confirm its resilience under the new anchor regime and identify small models that may replicate or improve its scoring fidelity. Models like granite4:micro-h or cogito:3b may serve as future lightweight judges if they preserve rank and demonstrate consistent gating behavior.

Looking Ahead

This dual initiative extends nullbench toward greater coverage and improved judgements. The next release will present updated behavioral fingerprints for the expanded families and revised judge-panel diagnostics. Results will clarify how modern architectures such as hybrid Mamba/Transformer designs and IDA-trained models behave under epistemic-integrity constraints, and whether smaller, faster judges can sustain the benchmark's reliability envelope.