
nullbench Update: Iterating the Compliance Judge Panel


The compliance benchmark within nullbench serves as a fine-grained audit of a model’s capacity to follow instructions under constraint. Unlike accuracy tests, it measures the intersection of obedience, neutrality, and judgment under refusal pressure. Across two full passes of evaluation, the judge panel itself has proven to be the most sensitive instrument in the system—small adjustments to composition meaningfully reshape the ordering of target models.

This update covers the second iteration of the compliance judge panel, the expansion to five members, and the transition from granite4:tiny-h to the steadier granite3.3:8b, along with the promotion from cogito:3b to cogito:8b as the ideological specialist. This post is a follow-up to recent developments and learnings from benchmark executions [1].

[Benchmark scatterplot: 'qpr']

See the full results.

Why the Second Pass Mattered

The first panel established that open-weight judges could reproduce consistent compliance scores at low cost. But it also surfaced volatility in some candidates: granite4:tiny-h and hermes3:3b produced sharp swings in ideological and refusal metrics between runs. The second pass confirmed that granite3.3:8b—larger, older, and better-tuned—delivers far steadier outputs across all categories.
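"Steadier" here means small run-to-run swings in category scores. Below is a minimal sketch of how that volatility can be quantified, assuming per-category scores are collected for each judge on each pass; the function name, field names, and the numbers in the example are illustrative, not nullbench internals or measured values.

```python
from statistics import pstdev

def run_to_run_volatility(runs: list[dict[str, float]]) -> dict[str, float]:
    """Per-category spread of a judge's scores across repeated benchmark passes.

    `runs` is a list of {category: score} dicts, one per pass.
    Returns {category: standard deviation}; lower means steadier.
    """
    categories = runs[0].keys()
    return {cat: pstdev([r[cat] for r in runs]) for cat in categories}

# Illustrative numbers only: a steady judge vs. a volatile one.
steady = run_to_run_volatility([
    {"ideology": 84, "refusal": 78},
    {"ideology": 83, "refusal": 79},
])
volatile = run_to_run_volatility([
    {"ideology": 70, "refusal": 81},
    {"ideology": 55, "refusal": 62},
])
print(steady)    # small spreads
print(volatile)  # large spreads -> disqualifying instability
```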

The larger panel (five judges instead of four) strengthened reliability. With five, the refusal gate, which requires four of the five judges to agree (the 4/5 gate), filters edge-case disagreements without flattening diversity. Each judge contributes a distinct diagnostic role rather than a redundant vote.
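A minimal sketch of that 4-of-5 gate: an output only counts as a refusal when at least four of the five judges flag it. The function name and the boolean-vote representation are assumptions for illustration.

```python
def refusal_gate(votes: list[bool], threshold: int = 4) -> bool:
    """Return True only when at least `threshold` judges flag the output as a refusal.

    With five judges and a threshold of 4, a single dissenting judge
    (an edge-case disagreement) cannot flip the refusal decision.
    """
    if len(votes) != 5:
        raise ValueError("expected exactly five judge votes")
    return sum(votes) >= threshold

# Four judges see a refusal, one does not -> counts as a refusal.
print(refusal_gate([True, True, True, True, False]))   # True
# Only three agree -> the gate holds.
print(refusal_gate([True, True, True, False, False]))  # False
```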

The Final Panel for Next Releases

| Role | Model | Function |
| --- | --- | --- |
| Anchor | qwen3-coder:30b | Defines the reference ordering and prevents drift. |
| Top All-Rounder | granite3.3:8b | Balanced across ideology, tone, and formatting; high internal consistency. |
| Quality Generalist | cogito:14b | Strong calibration and instruction compliance; maintains cross-family contrast. |
| Ideological Specialist | cogito:8b | Highest ideological symmetry; checks systemic bias. |
| Refusal Specialist | qwen3:1.7b | Fast, lightweight, and most accurate at refusal detection. |

The panel is deliberately heterogeneous: three model families, varied sizes, and dissimilar training lineages. This mix reduces correlated blind spots and keeps the composite signal robust when vendors shift alignment policies.
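To make the composition concrete, here is one way the panel could be expressed as configuration. The role names and models follow the table above, but the schema itself is hypothetical; nullbench's actual config format may differ.

```python
# Hypothetical panel configuration; the field names are illustrative.
COMPLIANCE_PANEL = [
    {"role": "anchor",                 "model": "qwen3-coder:30b"},
    {"role": "top_all_rounder",        "model": "granite3.3:8b"},
    {"role": "quality_generalist",     "model": "cogito:14b"},
    {"role": "ideological_specialist", "model": "cogito:8b"},
    {"role": "refusal_specialist",     "model": "qwen3:1.7b"},
]

# The 4/5 refusal gate assumes a five-member panel.
assert len(COMPLIANCE_PANEL) == 5
```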

Why cogito:14b over gemma3:4b-it-qat?

Although gemma3:4b-it-qat achieved a marginally higher overall score, cogito:14b is the better fit for the judge panel because it holds up in the critical sub-categories where Gemma3 fails. A judge must stay reliable when evaluating difficult content, and cogito:14b decisively outperforms gemma3:4b-it-qat on “Controversy & Refusal Boundaries” (79 vs. a failing 44), “Ideological Symmetry” (84 vs. 79), and “Consistency & Calibration” (77 vs. 70). Gemma3 is the stronger refusal detector, but that advantage matters less because the panel already includes qwen3:1.7b as a dedicated specialist for exactly that task. cogito:14b is therefore chosen for its robustness, neutrality, and judgment under pressure, which make it the more resilient and trustworthy component for the panel.
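The selection logic can be stated as a simple rule: a candidate judge must clear a floor in the reliability-critical sub-categories before its overall score is considered. The floor value and the sub-category keys below are illustrative assumptions, not nullbench internals; the scores are the ones quoted above.

```python
# Scores quoted in this post; the category keys and FLOOR value are illustrative.
CRITICAL = ["controversy_refusal_boundaries", "ideological_symmetry", "consistency_calibration"]
FLOOR = 60  # assumed minimum for a judge to be trusted on difficult content

candidates = {
    "cogito:14b":       {"controversy_refusal_boundaries": 79, "ideological_symmetry": 84, "consistency_calibration": 77},
    "gemma3:4b-it-qat": {"controversy_refusal_boundaries": 44, "ideological_symmetry": 79, "consistency_calibration": 70},
}

def passes_floor(scores: dict[str, int]) -> bool:
    """A judge qualifies only if no critical sub-category falls below the floor."""
    return all(scores[key] >= FLOOR for key in CRITICAL)

for name, scores in candidates.items():
    print(name, "qualifies" if passes_floor(scores) else "disqualified")
# cogito:14b qualifies; gemma3:4b-it-qat is disqualified by the 44 on refusal boundaries,
# regardless of its marginally higher overall score.
```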

Note on Exaone 4.0:1.2b

A late entrant, exaone4.0:1.2b, delivered an unexpected result in the second pass. Despite its compact 0.8 GB footprint (quantized model from the Ollama library), it achieved a GoalAlignment score above 66 — placing near the top tier. Its calibration (86) and ideological symmetry (85) were exceptional, showing how far efficient architectures have advanced. Yet the same run exposed severe fragility: low scores on Controversy & Refusal Boundaries (43) and Refusal Gate Sanity (35) disqualify it from serving as a judge. A model that refuses to evaluate or misclassifies refusals undermines the panel’s purpose. Its brilliance therefore reinforces the panel design: the five-judge configuration remains the most stable and complementary instrument for compliance benchmarking, while Exaone 4.0:1.2b stands as a noteworthy target model for future analysis.

On Metric Focus: Goal Alignment

From both operational and interpretive standpoints, GoalAlignment has emerged as the most informative single measure. Axis scores—clarity, tone, reasoning—capture stylistic variation but often obscure the fundamental question: Did the model actually do what it was asked to do?

GoalAlignment quantifies this directly. It reflects whether the output fulfills the user’s explicit intent, irrespective of rhetorical or tonal style. It cuts through subjective compliance noise and provides the clearest, high-signal indicator of model utility across task types. As a result, future nullbench reports will foreground GoalAlignment as the primary metric, with axis detail retained for diagnostic use.
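In reporting terms, foregrounding GoalAlignment simply means ordering results by that single field while keeping the axis scores attached for diagnosis. A minimal sketch, assuming per-model result records with hypothetical field names and illustrative numbers:

```python
from typing import TypedDict

class ModelResult(TypedDict):
    model: str
    goal_alignment: float    # primary metric: did the model do what was asked?
    axes: dict[str, float]   # clarity, tone, reasoning, ... retained for diagnostics

def rank_by_goal_alignment(results: list[ModelResult]) -> list[ModelResult]:
    """Order the report by GoalAlignment; axis detail stays available but secondary."""
    return sorted(results, key=lambda r: r["goal_alignment"], reverse=True)

# Illustrative records, not published numbers.
report = rank_by_goal_alignment([
    {"model": "model-a", "goal_alignment": 66.0, "axes": {"clarity": 80, "tone": 75}},
    {"model": "model-b", "goal_alignment": 71.5, "axes": {"clarity": 70, "tone": 82}},
])
print([r["model"] for r in report])  # model-b ranks first
```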

[Benchmark table: domain 'Compliance']

Outcome

The revised 5-judge panel is both more discerning and more stable. It captures finer differences between models, aligns better with practical task performance, and sustains reproducibility across hardware and time. The inclusion of both cogito variants ensures nuanced evaluation of instruction quality and ideological balance, while qwen3:1.7b continues to anchor refusal gating with unmatched speed.

This configuration will serve as the standard compliance panel for upcoming nullbench releases, pending any new evidence of drift or superior open-weight judges in future runs.