nullbench Update: Iterating the Compliance Judge Panel
·4 min read
The compliance benchmark within nullbench serves as a fine-grained audit of a model’s capacity to follow instructions under constraint. Unlike accuracy tests, it measures the …
nullbench Update: Expanding the Mid-Field and Refining the Judge Panel
·4 min read
The nullbench framework continues to evolve toward reproducible, interpretable behavioral analysis of language models. Since version 1.3, which introduced a revised judge panel …
LLM Fingerprints v1.3: GLM-4 Judiciary, Summarization Collapse
·4 min read
The latest nullbench run used the revised judge panel with glm-4-9b as the mid judge. The swap cut family correlation and exposed variance that earlier Qwen-anchored panels …
LLM Fingerprints v1.2: efficient mid tier models, judge and lineup refresh
·6 min read
Building on the previous nullbench methodology and early refinements, we expanded the pool, kept decoding and scoring fixed, and asked whether these additions change routing. The …
Retrievability is not discovery
·4 min read
RAG pipelines retrieve a narrow slice of documents that look closest in vector space to a query, but treating that narrowing as discovery ignores the fact that anything framed …
The Reshaping of the Information Ecosystem
·6 min read
The way people reach information is undergoing a structural break. For two decades, the web grew on the back of search engines sending users outward: a query produced links, …
Agent Systems with SLMs, Workflow Engines and Adaptive Routing
·5 min read
NVIDIA’s June 2025 paper "Small Language Models Are the Future of Agentic AI" argues that agent systems run better when built around small language models (SLMs) rather than …
nullbench Fingerprints v1.1: Stability Updates and New Routes
·3 min read
We revised nullbench with a stability metric that avoids collapse on single spikes and a category set split into more granular writing style and language comprehension tracks. The …
Trust Boundaries for LLM Agents
·5 min read
Large language models capable of acting as agents introduce a new layer of risk. These systems are not just passive text generators but are often equipped with tools and can …
nullbench Fingerprints: Starting an Operational Playbook
·8 min read
Control of execution defines ownership. When a large language model file sits on your machine, it runs the same way until you decide to replace it. No silent updates, no routed …