This evaluation ran on an Apple Silicon M4 Pro (64GB unified memory), using Ollama with official library model builds. Target models were configured with a 32K context window. The run repeated our compliance benchmark for judge panel selection, with a focus on formatting, instruction adherence, tone transformation, and a revised ideological symmetry test designed to expose inconsistencies under inversion.
See the full results.
Figure: benchmark scatterplot ('qpr').
This benchmark does not measure general intelligence but how quickly and reliably a model can produce a compliant answer under local inference constraints. Outputs are short, often just a few tokens, which shifts the entire workload toward first-token latency and per-request overhead. In this regime, architectural efficiency on paper matters far less than how the model executes in the runtime.
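To make the distinction between first-token latency and total response time concrete, here is a minimal sketch of how the two could be timed for any streamed response. The `time_stream` helper and the `fake_stream` generator are hypothetical names used for illustration; they are not part of the benchmark harness, and a real run would iterate over chunks from the inference runtime instead.

```python
import time
from typing import Iterable, Iterator


def time_stream(chunks: Iterable[str]) -> tuple[float, float, int]:
    """Return (seconds to first chunk, total seconds, number of chunks)."""
    start = time.perf_counter()
    first_chunk_s = None
    count = 0
    for _ in chunks:
        if first_chunk_s is None:
            first_chunk_s = time.perf_counter() - start
        count += 1
    total_s = time.perf_counter() - start
    # An empty stream has no first chunk; fall back to the total time.
    return (first_chunk_s if first_chunk_s is not None else total_s, total_s, count)


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a model: slow first token, fast follow-ups."""
    time.sleep(0.05)  # simulated prompt processing and per-request overhead
    yield "Yes"
    for token in (",", " done"):
        time.sleep(0.005)  # simulated per-token decode time
        yield token


ttft, total, n_chunks = time_stream(fake_stream())
print(f"first token: {ttft:.3f}s, total: {total:.3f}s, chunks: {n_chunks}")
```

With only a few output tokens, the time to the first chunk makes up most of the total, which is the regime this benchmark operates in.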
A key detail in this setup is that responses taking longer than roughly one minute are scored as failures. For judge panel selection, this disproportionately penalizes larger models and models that generate longer or more careful responses.
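The timeout rule can be sketched as a small scoring function. This is an illustration under stated assumptions, not the actual harness: the 60-second cutoff stands in for "roughly one minute", and the `Result` fields and model names are hypothetical.

```python
from dataclasses import dataclass

TIMEOUT_SECONDS = 60.0  # assumption: "roughly one minute" taken as exactly 60 s


@dataclass
class Result:
    model: str
    latency_s: float  # wall-clock time for the full response
    compliant: bool   # whether the answer met the task's requirements


def passed(result: Result, timeout_s: float = TIMEOUT_SECONDS) -> bool:
    """A response counts only if it is compliant and arrived within the timeout."""
    return result.compliant and result.latency_s <= timeout_s


def pass_rate(results: list[Result]) -> float:
    """Fraction of results that count as passes; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(passed(r) for r in results) / len(results)


results = [
    Result("fast-model", 2.5, True),
    Result("slow-model", 75.0, True),  # correct but too slow: scored as a failure
    Result("fast-model", 3.1, False),  # fast but non-compliant: also a failure
]
print(pass_rate(results))
```

Under this rule a correct answer that arrives late is indistinguishable from a wrong one, which is why slower, more careful models score worse here than their answer quality alone would suggest.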
Across all tested models, Qwen 3 Coder (30B) and Qwen 3 Coder Next stand out as the most viable options. The older Qwen 3 Coder is the performance baseline: fast, stable, and predictable. The newer Qwen 3 Coder Next improves reasoning quality while maintaining some of that responsiveness. Everything else either introduces too much latency or fails to maintain consistent behavior under constraint.
Some models that appear comparable in size or architecture are an order of magnitude slower. This is most visible in Mixture-of-Experts (MoE) models, where theoretical efficiency does not translate into real-world performance on this hardware [1].
The core issue is execution, not architecture labels. MoE models reduce compute by activating only a subset of parameters per token, but they introduce routing overhead. Each token must be dispatched to selected experts, which requires additional computation and, more importantly, fragmented memory access. On Apple Silicon, inference is largely memory-bound. The cost is dominated by how efficiently weights and activations move through memory, not by raw floating-point operations.
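For readers unfamiliar with expert routing, the following is a minimal sketch of generic top-k gating in the style described by Shazeer et al. It is not the router used by Qwen, Gemma, or any other model tested here; the function name and the example logits are hypothetical.

```python
import math


def route_token(gate_logits: list[float], k: int = 2) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    if not 0 < k <= len(gate_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest gate scores, highest first.
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected experts only (shifted by the max for stability).
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Two tokens with different gate scores end up on different experts,
# so each token may need a different set of expert weights from memory.
token_a = route_token([0.1, 2.0, -1.0, 1.5], k=2)
token_b = route_token([1.8, -0.5, 2.2, 0.0], k=2)
print([i for i, _ in token_a], [i for i, _ in token_b])
```

The selection step itself is cheap. The cost the paragraph above describes comes from what follows it: the chosen experts change from token to token, so the weights that must be read change too.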
In a short-response workload, this routing overhead becomes the dominant cost. There is no opportunity to amortize it across long generations or large batches. The model spends most of its time deciding where to send the token and fetching the relevant weights, rather than performing useful computation.
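The amortization argument can be illustrated with a toy cost model: a fixed per-request overhead spread over the number of generated tokens, plus a constant per-token cost. The numbers below are made up for illustration and were not measured in this benchmark.

```python
def effective_seconds_per_token(fixed_overhead_s: float,
                                per_token_s: float,
                                n_tokens: int) -> float:
    """Average cost per generated token under a fixed-plus-linear cost model."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return fixed_overhead_s / n_tokens + per_token_s


# Hypothetical values: 0.5 s of overhead per request, 0.02 s per decoded token.
short_reply = effective_seconds_per_token(0.5, 0.02, n_tokens=4)
long_reply = effective_seconds_per_token(0.5, 0.02, n_tokens=500)
print(f"short: {short_reply:.3f} s/token, long: {long_reply:.3f} s/token")
```

In this model the overhead share shrinks as the output grows, so the same overhead that is negligible for a long generation dominates a reply of a few tokens.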
This is, at least, our explanation for why some MoE models perform poorly despite similar active parameter sizes. Labels like A3B or A4B describe how many parameters are used per token, but they say nothing about how those parameters are accessed. Two models with similar active sizes can have completely different memory access patterns, and that difference determines latency.
Qwen’s MoE implementation appears to be more optimized for this constraint. Its routing is simpler, its expert layout is more locality-friendly, and its execution path aligns well with the Ollama stack. The result is a model that behaves like a low-latency sparse transformer rather than a throughput-oriented MoE system. It avoids the worst penalties of expert routing and maintains high responsiveness even for very short outputs.
By contrast, models like Gemma MoE are designed for a different operating point. They make sense in environments where you can batch requests or generate longer outputs. In those scenarios, routing overhead is amortized and compute savings become meaningful. In this setup, those advantages never materialize. Instead, the overhead dominates, leading to significantly worse performance.
Another factor is runtime maturity. Ollama’s execution path is not uniform across models. Some architectures are better optimized than others, and Qwen models in particular benefit from strong support in the underlying stack. This includes better kernel fusion, more efficient quantization layouts, and generally more predictable execution. These details matter more than model architecture in this context.
Even with that advantage, Ollama itself is not the most efficient runtime. Compared to equivalent models running in llama.cpp, token throughput is consistently lower. From an operational standpoint, if the goal is maximum performance on Apple Silicon, llama.cpp remains the better choice. Ollama is still easier to use and integrates models cleanly, but it leaves performance on the table.
Overall, Qwen 3 Coder still provides the best baseline performance. Qwen 3 Coder Next offers a better quality/performance tradeoff while remaining sufficiently practical. Most other models, especially MoE variants optimized for throughput, do not perform well in this regime.
The broader lesson is that local LLM performance is determined less by theoretical architecture and more by execution characteristics: memory access patterns, routing overhead, and runtime optimization. Benchmarks that emphasize short responses make these factors dominant. Models that minimize overhead and maintain locality win, regardless of how efficient they appear on paper.
We will re-test these models with llama.cpp, alongside the top performers under Ollama, for a side-by-side comparison of execution efficiency. After judge panel selection, these new models will be integrated into the evaluation pipeline for ongoing monitoring and future benchmarks.
A note on Ideological Symmetry Adjustment
The original symmetry test produced flat results across models because it allowed for a generic neutral tone regardless of content, echoing the recent Trendslop analysis [2].
The next iteration fixes this by forcing paired transformations under controlled inversion. The model is required to apply the same structural and tonal transformation to two ideologically opposed inputs. This exposes whether the transformation process itself is consistent.
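One way such a paired check could be scored is sketched below. This is an assumption about how consistency might be measured, not the metric used in the benchmark: `profile` and `symmetry_gap` are hypothetical helpers that compare only the shape of two outputs (word and sentence counts), not their content.

```python
import re


def profile(text: str) -> dict[str, int]:
    """Crude structural profile of an output: word count and sentence count."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {"words": len(text.split()), "sentences": len(sentences)}


def symmetry_gap(output_a: str, output_b: str) -> float:
    """Largest relative difference across profile features (0.0 = same shape)."""
    pa, pb = profile(output_a), profile(output_b)
    gaps = []
    for key in pa:
        larger = max(pa[key], pb[key])
        gaps.append(0.0 if larger == 0 else abs(pa[key] - pb[key]) / larger)
    return max(gaps)


# Outputs of the same transformation applied to two opposed inputs.
balanced = symmetry_gap("Taxes fund services. They matter.",
                        "Taxes limit growth. They matter.")
lopsided = symmetry_gap("One two three four.", "One two.")
print(balanced, lopsided)
```

A gap near zero suggests the transformation was applied with the same structure to both inputs; a large gap flags a pair for closer review. Tone consistency would need a separate measure.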
[1] Shazeer et al., “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” (Google Research), https://arxiv.org/abs/1701.06538
[2] “Trendslop: Why LLMs Struggle with Strategy” (nullmirror, 2026), /en/blog/2026-04-12-trendslop-why-llms-struggle-with-strategy/