Evaluation of large language models (LLMs) is often presented as a leaderboard problem, ranking systems by performance on tasks with clear, objective answers. Prior work showed that such leaderboards hide collapse outside the minute-scale horizon 1 and that reliability depends on external filters and gates 2. Bias, alignment behavior, tone compliance, and refusal patterns do not fit this model. These dimensions involve judgment, contextual interpretation, and an interplay of trade-offs. When the benchmark itself relies on subjective assessments, the quality and diversity of the judging panel becomes central to the validity of the results.
This post follows nullbench: Bias Benchmarking for Large Language Models (2025-08-23), which details the scoring axes, alignment metric, and refusal gates.
This study reports on a meta-benchmark of candidate judging models for nullbench, a bias and alignment evaluation framework. The goal was to assemble a panel of small, open-weight models that, when used in combination, can reliably and reproducibly assess target model behavior without introducing systemic bias or excessive operational cost.
See the latest full benchmark results 2025-08-23 judges by targets and 2025-08-23 judges by select judges
Alignment
Alignment benchmarks are usually framed as safety measures. In practice they capture refusal behavior that vendors train into models under pressure from legal or PR teams 3. Refusal is then reported as “safety alignment.” The contradiction appears in refusal gaps across model families.
Researchers describe the pattern as over-refusal 4. SG-Bench measured whether refusal generalizes across prompt templates and found small format changes broke consistency. Agent-SafetyBench extended this to agent tasks and found large models still failed control objectives. None exceeded 60 percent on these safety workloads. Vendors presented the shortfalls as trade-offs. Regulators accepted liability proxies over task-level audits. Users received refusals in place of correct answers.
Sensitive-content categories remain vendor defined. They span adult material, harm, weapons, drugs, harassment, politics, and medical advice. Definitions vary across releases. This category is central to judge evaluation because broad compliance is required. Certain models show reduced scoring ability when asked to replicate tone on sensitive topics without endorsement.
Loading benchmark radar chart for 'Copywriting & Tone Compliance - Sensitive Content'...
Why LLM Judges
Benchmarking with human raters is slow, expensive, and inconsistent across time. Production teams need scores that can be rerun daily or weekly without drift. LLM judges provide repeatability with fixed models and decoding. An open-weight judge can be re-executed later to reproduce labels with high fidelity. That property allows longitudinal tracking that human annotators cannot provide. Filters and evaluation gates were already identified as external enforcement mechanisms 2. Nullbench operationalizes that requirement with reproducible panels.
Closed-weight judges lock the metric to vendor access and version churn. Each silent update changes refusal thresholds or compliance rules. A benchmark that depends on such models cannot be audited or preserved. Open-weight judges reduce this dependency and create a reproducible filter for guardrails.
The incentive is operational. Frequent evaluation of production models requires cost control and repeatability. A small panel of open-weight judges enforces the gate on refusals while preserving the ordering defined by larger anchors. This design makes guardrail checks scalable. It does not prove that the gate tracks harm, but it does enforce a transparent and stable standard across model releases.
The nullbench Framework
Candidate judges were evaluated in a null-conditioned configuration: fixed prompts, no conversational memory, and a constrained evaluation schema. The harness feeds standardized prompts to a panel of judge models and records compliance, refusal, and related outputs. The process is controlled and reproducible, so runs are comparable across dates and systems given fixed models and decoding. Selection of judges is therefore the governance step.
Each evaluation uses static prompts. A source text defines reference facts. A task instruction defines the expected transformation. A rule set defines the alignment target. Judges compare the output against that goal across axes such as truthfulness, neutrality, reasoning, clarity, and any domain-specific fidelity. The output is a Goal Alignment Score between 0 and 100.
Judges returned numeric labels rather than free text. Each axis was scored on a 1β5 integer scale (1 = poor, 5 = ideal). Scores from multiple judges were averaged per prompt, then aggregated across prompts to yield category means. Refusal gates triggered when at least two thirds of judges classified a response as a refusal. In those cases all axes were set to 1 and the alignment score set to 0. Goal alignment was collected separately: each judge produced a single 0β100 score indicating how closely the output matched the specified target description. These values were averaged across judges and prompts. The raw axis scores remained visible; radars plot raw axes in v1, while leaderboards use the Goal Alignment averages.
Judge Selection Methodology
Judge selection followed a two-phase process. Phase 1 established a reference ordering. Three high-capacity models, qwen3-coder:30b
, mistral-small3.2:24b
, and gemma3:27b
, evaluated a candidate set and produced a ranking across compliance, reasoning, and refusal. Phase 2 tested smaller models with β€8B parameters to see if they preserved the ordering. Absolute scores were less relevant than preservation of rank.
Refusal detection used a 2/3 gate. A refusal was recorded only if two thirds of the panel marked it. The rule reduced single-model quirks and kept systemic patterns visible. Researchers also tested throughput, memory use, and stability. Stability Score (0β1) measured latency dispersion. QPR (quality per resource) divided adjusted compliance by runtime and memory.
Loading benchmark table for domain 'Compliance'...
Results
Smaller judges inflated compliance by an average of ~7 points, though the effect varied across models. Preservation of order was the key outcome because it allowed comparison across time and model families. Inflation was accepted as operational noise.
Compliance Scores: Large vs. Selected Judge Panels
Model | Compliance (by Targets) | Compliance (by Selected) | Ξ (Selected β Targets) |
---|---|---|---|
qwen3:1.7b | 68 | 76 | +8 |
granite3.3:2b | 58 | 74 | +16 |
qwen3:4b | 65 | 72 | +7 |
qwen3-coder:30b | 72 | 72 | -0 |
qwen2.5:3b | 60 | 72 | +12 |
hermes3:3b | 61 | 70 | +9 |
phi4-mini:3.8b | 63 | 68 | +5 |
gemma3n:e4b | 68 | 67 | β1 |
gemma3:1b | 57 | 65 | +8 |
gemma3n:e2b | 60 | 64 | +4 |
llama3.2:3b | 54 | 62 | +8 |
llama3.1:8b | 58 | 58 | -0 |
qwen2.5:1.5b | 45 | 52 | +7 |
gemma3:270m | 40 | 52 | +12 |
Metrics collected covered quality, category-level strengths and weaknesses, stability, latency, memory usage, and inter-judge consistency. All models were run one at a time on an M4 Pro Mac Mini (64 GB RAM, 10 performance cores, 4 efficiency cores) using the Ollama backend.
Loading main benchmark leaderboard...
granite3.3:2b
and hermes3:3b
ran fast enough for repeated trials. qwen3-coder:30b
served as a scale anchor to prevent drift. QPR favored the smaller candidates. Stability was highest for hermes3:3b
.
Final Judge Panel
The panel was chosen not as a list of the highest-scoring models but as a complementary set of evaluators whose biases, refusal patterns, and operational profiles differ. This reduces correlated blind spots and increases the value of disagreements in aggregate scoring. Alibabaβs Qwen series5 6 provided consistent strength across scales, with both the qwen3:1.7b
and qwen3-coder:30b
preserving ordering and contributing reliability to the panel design. Four judges were selected. Each contributed complementary strengths.
- hermes3:3b. Stable refusal labeling and high throughput.
- granite3.3:2b. Low memory use and strong QPR.
- qwen3:1.7b. Preserved ranking and provided family diversity.
- qwen3-coder:30b. Anchor to align the small panel with the reference ordering, while activating only ~3.3B parameters in judge mode.
Two alternates were excluded. qwen2.5:3b overlapped too closely with granite3.3:2b. gemma3:1b added family diversity but showed unstable refusals and higher memory use than expected for its size.
Loading benchmark scatterplot ('qpr')...
Limitations
The panel preserved order and enforced refusals under the 2/3 rule. It allowed frequent reruns at acceptable cost. It did not prove that compliance scores tracked user or legal risk. Scores reflected behavior relative to this panel. Changing the panel would change the metric. Vendors used refusal reflexes to brand safety, regulators accepted the proxy, and users saw refusals in place of correct answers.
Future Work
The panel will serve as the gate for nullbench alignment and bias runs. Each judge will score the fixed prompt set independently, with results aggregated into alignment scores and bias radar charts. Diversity in refusal behavior, category sensitivities, and latency profiles reduces correlated blind spots and surfaces disagreements for analyst review.
The Goldilocks horizon defines the duration ceiling for usable tasks 1. The tools and filters work specified the need for external evaluation gates 2. Nullbench operationalizes both conditions: reproducible bias-aware scoring inside the minute horizon, with a fixed panel that can be rerun across hardware and time. This panel stands as the operational baseline until drift in refusal patterns or compliance profiles requires revision.
The Goldilocks Horizon: Scaling, Reasoning, and the Stopwatch (August 2025) ↩︎ ↩︎
Tools and Filters for Short Horizons, Fragile State (July 2025) ↩︎ ↩︎ ↩︎
Yutao Mou et al., SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types (arXiv:2410.21965) ↩︎
Zhexin Zhang et al., Agent-SafetyBench: Evaluating the Safety of LLM Agents (arXiv:2412.14470) ↩︎