AI remains an extreme speed and parallelism multiplier for tasks with clear specifications and cheap verification, but it is still not trustworthy autonomy. As METR's data shows, a one-shot success rate near 50% on multi-hour, human-equivalent work means silent failure remains the default without gates, checks, and retries1. What improves in practice, however, is recoverability under orchestration.
This nullbench run does not show a new generation of open models becoming more capable; what we see is primarily behavior redistribution across several fine-tuned variants. The capability ceiling on our test machine remains largely intact. The floor rises primarily when tooling and control systems compensate. Several recent claims of capability gains from aggressive fine-tuning, including 4chan-heavy datasets, fall into the same pattern we observe here: behavior redistribution that changes ergonomics and refusal dynamics without raising the underlying reasoning ceiling.
See the full results.
Judge Panel Rationale
This benchmark uses a three-model judge panel chosen to balance capability, speed, and lineage diversity under a constrained runtime budget. All judges score all prompts.
- Qwen3-Coder-30B-A3B-Instruct, as the capability and reasoning anchor: The model provides the panel’s highest raw reasoning capacity. It is strong at instruction following, multi-step evaluation, and detecting logical or factual errors in complex responses. This makes it the primary signal for overall answer quality and correctness. Its inclusion ensures that high-performing target models are not bottlenecked by an underpowered evaluator.
- → prevents under-scoring of genuinely strong reasoning.
- Granite-4.0-h-tiny, as the structure, compliance, and efficiency check: It is intentionally small and fast. Its value is not deep reasoning, but strictness: it is unforgiving toward padding, incoherent structure, evasive answers, and format violations. This prevents the panel from systematically rewarding verbosity or confident-sounding but shallow responses. Its speed also keeps total benchmark runtime within operational limits.
- → prevents over-scoring of bloated or performative answers.
- Gemma-3-4B-IT, as the stylistic and natural-language judge: It contributes a distinct linguistic and stylistic perspective from a different model family. It is comparatively sensitive to tone, phrasing, and naturalness, helping the panel penalize awkward, overly mechanical, or misaligned responses that may still pass pure correctness checks. This adds coverage on qualitative dimensions without the cost of a large specialist judge.
- → adds a human-language and stylistic lens from an independent lineage.
The panel spans three distinct model families, reducing correlated bias while remaining fast enough to run at scale. The result is a pragmatic, speed-constrained judging setup that favors consistent, well-reasoned, and well-structured outputs, rather than sheer length or rhetorical confidence.
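As a concrete illustration (not the actual nullbench implementation; judge names and scores here are hypothetical), the all-judges-score-all-prompts setup reduces to a simple per-prompt aggregation:

```python
from statistics import mean

# Hypothetical judge identifiers; the real harness uses its own naming.
JUDGES = ["qwen3-coder-30b", "granite-4.0-h-tiny", "gemma-3-4b-it"]

def panel_score(scores_by_judge):
    """Average each prompt's score across all judges.

    Every judge must score every prompt, which keeps any single
    judge's bias from applying unevenly across the prompt set."""
    n_prompts = len(next(iter(scores_by_judge.values())))
    for judge in JUDGES:
        if len(scores_by_judge[judge]) != n_prompts:
            raise ValueError(f"{judge} did not score every prompt")
    return [mean(scores_by_judge[j][i] for j in JUDGES) for i in range(n_prompts)]
```

An unweighted mean is the simplest choice; a real harness might weight the anchor judge more heavily or score per rubric dimension before aggregating.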
Known limitation
This panel is not optimized for detecting subtle summarization fidelity issues2, such as loss of hedging or shifts in uncertainty. Summarization scores should therefore be interpreted primarily as measures of fluent compression and surface faithfulness, not guaranteed preservation of epistemic nuance.
Target Model Notes
We’ve added three new models to this lineup, each illustrating different aspects of behavior redistribution via fine-tuning, abliteration, or long-context extension.
Seed-OSS-36B-Instruct
Seed-OSS-36B-Instruct is ByteDance’s open Apache-2.0 36B instruct model, notable less for any unique personality than for its systems posture. It appears to be a large generalist with native long context (up to 512K) and an explicit emphasis on agentic and long-horizon tasks, including claimed 'thinking budget' controls at the serving layer. In our run we evaluated a MagicQuant hybrid GGUF (mxfp4_moe-EHQKOUD-IQ4NL), which is interesting primarily as quant engineering: an attempt to make a 36B-class model locally tractable via mixed per-tensor schemes (e.g., MXFP4 for MoE components) rather than as evidence of a new capability regime.3 4
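The MXFP4 part of that scheme is worth unpacking. Per the OCP Microscaling format, a block of values shares one power-of-two scale, and each element is stored as a 4-bit E2M1 float. A minimal sketch of the idea (not MagicQuant’s actual kernel, and ignoring block-size and rounding-mode details):

```python
import math

# FP4 E2M1 representable magnitudes (per the OCP Microscaling spec)
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def mxfp4_quantize(block):
    """Quantize a block of floats to MXFP4: one shared power-of-two
    scale plus a 4-bit E2M1 value per element."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Smallest power of two such that amax / scale <= 6 (the FP4 max)
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    quantized = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)
        nearest = min(FP4_VALUES, key=lambda v: abs(v - mag))
        quantized.append(math.copysign(nearest, x))
    return scale, quantized

def mxfp4_dequantize(scale, quantized):
    # One multiply per element; the scale itself needs only 8 bits (E8M0).
    return [scale * q for q in quantized]
```

Because the shared scale is a power of two, dequantization is a single multiply per element, which is part of what makes the format attractive for the MoE expert weights that dominate a model like this.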
Nemotron-UltraLong-8B (UltraLong)
Nemotron-UltraLong-8B is an 8B-parameter LLM series that extends a Meta Llama 3.1 instruct base (Llama-3.1-8B-Instruct, 128K) to 1M / 2M / 4M token contexts via a two-stage recipe: (1) one-shot continued pretraining at the target length on a ~1B-token long-document corpus plus YaRN-based RoPE scaling, then (2) minimal short-context SFT to preserve instruction-following while avoiding degradation of the newly learned long-context behavior5 6.
Key implementation details during continued pretraining: documents are packed into ultra-long sequences, separated with special separators (not BOS/EOS), and the cross-document attention mask is disabled so attention can flow across packed documents; the corpus is length-rebalanced (downsample short docs, upsample long docs) to emphasize long-context learning.6 7
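The packing step can be sketched as follows (token ids and the separator value are placeholders; the paper’s exact separator token and sequence budget differ):

```python
SEP_TOKEN = -1  # stand-in for the special separator id (not BOS/EOS)

def pack_documents(docs, target_len):
    """Pack tokenized documents into ultra-long training sequences,
    separated by a special token. Attention is NOT masked across
    document boundaries, so the ordinary full-sequence causal mask
    applies to each packed sequence."""
    sequences, current = [], []
    for doc in docs:
        piece = doc + [SEP_TOKEN]
        if current and len(current) + len(piece) > target_len:
            sequences.append(current)
            current = []
        current.extend(piece)  # oversize docs pass through whole
    if current:
        sequences.append(current)
    return sequences
```

Length rebalancing happens upstream of this step: short documents are downsampled and long ones upsampled before packing, so long-context behavior dominates the training signal.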
Reported evaluation highlights (authors’ comparisons): on long-context suites (RULER, LV-Eval, InfiniteBench) UltraLong variants lead or tie the best Llama-based baselines they test; on standard short-context benchmarks (e.g., MMLU, MATH, GSM8K, HumanEval) they remain roughly competitive with the 128K base model5. In our harness, however, the expanded context does not translate into a higher reasoning ceiling once constraint density and procedural tracking are stressed. We’ve included it primarily as a comparison to the abliterated Nemotron base and the 4chan-fine-tuned variant below.
Assistant_Pepe_8B and 4chan Data
Assistant_Pepe_8B attracted our attention because early reports suggested a reversal of the usual fine-tuning pattern. The abliterated Nemotron base appeared to score above the original base, and a downstream fine-tune trained heavily on 4chan data appeared to score higher still. That sequence raised the possibility that alignment and platform-shaped data distributions impose a measurable performance cost rather than merely changing tone or style 8 9.
In our benchmark, however, Assistant_Pepe_8B did not surpass the abliterated Nemotron base. The total scores were Nemotron UltraLong Instruct at 60.16, abliterated Nemotron at 62.8, and Assistant_Pepe_8B Q8_0 at 62.2. The uplift from abliteration is consistent and survives the rerun. The fine-tune, however, does not add net score on top of that in our harness. Any gains observed elsewhere are either benchmark-specific or small enough to sit inside variance once summarization fidelity, boundary calibration, and robustness are explicitly scored.
What changes sharply is the internal distribution of behaviors. Relative to the abliterated base, Assistant_Pepe_8B shifts strongly toward assistant ergonomics. Format and instruction compliance improve, and refusal gate sanity moves from 37 to 65, eliminating many spurious refusals and topic-triggered misfires. Inference and logical conclusions rise from 69 to 85, and calibration improves modestly. The model becomes more decisive, cleaner in presentation, and easier to prompt in straightforward assistant tasks.
Those gains come with costs: procedural logic drops from 63 to 50. Performance under orthogonal constraint stress falls from 49 to 36, indicating reduced robustness when prompts impose conflicting requirements. Summarization fidelity degrades sharply: content omission falls from 31 to 22 and framing from 36 to 22, reflecting a systematic loss of careful compression and framing. The model answers faster and harder, but preserves less structure and nuance.
Seen through this lens, a popular claim that 4chan data improves truthfulness appears misaligned with what is actually moving in our tests. The effect visible in our numbers is a reduction in hedged verbosity, boilerplate uncertainty signaling, and refusal noise. Models that commit early and explain less tend to score well on many public benchmarks. A harness that explicitly measures summarization faithfulness, calibration, and boundary matching penalizes the same shift. Assistant_Pepe_8B lands inside that trade.
These results line up with long-standing qualitative observations from practitioners training on 4chan data. Anonymous, adversarial threads reward concise answers, rapid correction, and confidence. There is no upvote optimization, no persistent persona, and little incentive for careful framing. That distribution teaches models to answer first and argue later, often imprinting a stronger first-person voice and sense of speaker ego 9.
The contrast with Twitter-style data remains structurally instructive. Broadcast-optimized discourse emphasizes engagement and rhetoric over exchange and correction. Practitioners consistently report that even small amounts of that data degrade model utility, while large amounts of 4chan continue to push models toward low-friction assistance. The incentive structures point in that direction.
The fine-tuning effect is real and directional. Abliteration reliably removes behavior that suppresses benchmark scores. 4chan-heavy fine-tuning then reallocates behavior toward decisiveness, reduced refusal noise, and cleaner assistant ergonomics, while degrading summarization fidelity, procedural care, and robustness under conflicting constraints. Whether that trade is desirable depends entirely on what users actually value in practice.
Learnings
This benchmark does not show open models getting smarter but rather that behavior is being redistributed in predictable ways. Models in the 27–36B range lead through near-perfect comprehension and procedural reasoning. Small-model stylistic advantages do not overcome that in this rubric. Scale remains the primary predictor of ceiling capability, but fine-tuning, abliteration, and long-context extensions shift where models spend their behavioral budget. Long context expands reach rather than depth. Abliteration works by subtracting negative mass. It does not add new capability. We essentially clean the signal so the existing capability is easier to access.
If you want step-function improvements in shipped outcomes, the key lever is currently orchestration rather than base-model weights. Base models are extraordinarily effective components inside control systems that assume failure and recover cheaply.
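A minimal sketch of that stance (function names are illustrative, not a real library API): wrap generation in a loop that verifies cheaply and retries, so failures surface instead of passing silently:

```python
def orchestrate(generate, verify, max_attempts=3):
    """Treat the model as a fallible component: retry until a cheap
    verifier accepts the output, or fail loudly rather than silently."""
    last = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        if verify(candidate):
            return candidate
        last = candidate
    raise RuntimeError(
        f"all {max_attempts} attempts failed verification; last output: {last!r}"
    )
```

The verifier can be as cheap as a schema check, a unit test, or a compile step. Assuming detectable and roughly independent failures, three attempts at a 50% one-shot rate already push nominal success above 87%, which is the recoverability gain orchestration buys.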
Kwa, T., West, B., Becker, J., et al. Measuring AI Ability to Complete Long Tasks. METR Blog, Mar. 19, 2025. Updated Feb. 10, 2026. ↩︎
nullmirror. Structural Overgeneralization in LLM Summarization (blog post, Aug. 10, 2025). ↩︎
ByteDance Seed Team. Seed-OSS-36B-Instruct (model card, Aug. 2025). MagicQuant. Seed-OSS-36B-Instruct (Hybrid Quantizations) (GGUF release, 2025). ↩︎
nullmirror. Block-Floating FP4 for Local Inference in llama.cpp (blog post, Dec. 20, 2025). ↩︎
NVIDIA Research. UltraLong: Extending LLM Context from 128K to 4M Tokens (project overview, 2025). ↩︎ ↩︎
Chen, Y., et al. From 128K to 4M: Efficient Long-Context Training for Large Language Models (arXiv preprint, Apr. 2025). ↩︎ ↩︎
NVIDIA. Llama-3.1-Nemotron-8B-UltraLong-4M-Instruct (model release, 2025). ↩︎
Reddit (r/SillyTavernAI). Assistant_Pepe_8B 1M Context, Zero Slop (discussion thread, 2025). ↩︎
Reddit (r/LocalLLaMA). Can 4chan data really improve a model? Turns out… (discussion thread, 2025). ↩︎ ↩︎