For pragmatic reasons, Ollama has been the default local backend in our prior benchmark runs. Our recent article on Ollama and Open WebUI practices¹ highlighted the need for an alternative. We have now added a llama.cpp backend so that we can re-check our assumptions. The working assumption is that once you strip away templates and variable defaults, model quality converges and the only real differences are operational: throughput, memory footprint, process model, and control surface. The experiments described here confirm exactly that, and they do it on the Apple Mac mini M4 Pro 64 GB hardware we actually run.
See the full results for why we are switching the default backend in nullbench from Ollama to llama.cpp.
[Figure: benchmark scatterplot of throughput per backend]
Benchmark setup
The earlier nullbench passes took the pragmatic path: install Ollama, pull the recommended models, and run them through the API. That gave quick coverage but made it impossible to tell how much of the behavior came from the model and how much from Ollama’s engine, templates, or default sampling parameters. To compare Ollama and llama.cpp fairly we changed the setup in several ways.
We now run each model once per backend, serially, and then keep them hot while firing requests. On the Ollama side the server runs as usual and we feed it a warm-up prompt before taking measurements. On the llama.cpp side we run llama-server and wait for a health check to pass before starting. Earlier experiments had no explicit warm-up, which penalized backends that load larger models more slowly. The new procedure removes that noise; cold start is still measurable, but our focus is the main throughput figures that come from steady-state runs.
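The readiness gate can be sketched as a small polling helper. The check itself would be an HTTP GET against llama-server's /health endpoint (which current llama-server builds expose); treat the wiring below as illustrative rather than our exact harness code:

```python
import time

def wait_until_ready(check, timeout=120.0, interval=0.5):
    """Poll `check()` until it returns True or the timeout expires.

    `check` is any callable returning a bool, e.g. an HTTP GET against
    llama-server's /health endpoint. Returns True once ready, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```

In the harness this would be called with something like `wait_until_ready(lambda: requests.get("http://127.0.0.1:8080/health").ok)` before the warm-up prompt is sent.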
The larger change is that we avoid vendor-specific models when comparing engines. Instead of picking whatever models Ollama exposes, we extract the GGUF files from Ollama’s local cache and load those exact files into llama.cpp. That aligns the model family, size, and quantization level by construction. Where possible we use Q4_K_M quantization across both backends.
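The extraction step can be sketched as below, assuming Ollama's current on-disk layout: a JSON manifest whose model layer points at a content-addressed blob. That layout is an implementation detail that may change between versions, and `gguf_path_from_manifest` is a hypothetical helper rather than nullbench's actual code:

```python
from pathlib import Path

# Media type Ollama currently uses to tag the GGUF weights layer in a manifest.
MODEL_MEDIA_TYPE = "application/vnd.ollama.image.model"

def gguf_path_from_manifest(manifest: dict, blobs_dir: Path) -> Path:
    """Return the path of the GGUF weights blob referenced by an Ollama manifest."""
    for layer in manifest.get("layers", []):
        if layer.get("mediaType") == MODEL_MEDIA_TYPE:
            # Digests look like "sha256:abcd..."; blob files are "sha256-abcd...".
            digest = layer["digest"].replace(":", "-", 1)
            return blobs_dir / digest
    raise ValueError("no model layer found in manifest")
```

The returned path can be handed to `llama-server -m <path>` directly, since the blob is a plain GGUF file.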
Finally, we fix decoding parameters ourselves rather than inheriting provider defaults. In earlier runs each backend used its own idea of top-p, top-k, repeat penalty, and stop strings from the Ollama model file. That made it difficult to know whether differences in quality came from the model, the sampler, or the template. In the new runs we set all of these explicitly and hold them constant across backends.
Temperature scales the randomness of the softmax: zero yields greedy decoding, values around 0.2–0.7 give focused sampling, values above 1 explore more. Top-p (nucleus sampling) restricts choices to the smallest prefix of tokens whose cumulative probability exceeds p; top-k keeps only the k most probable tokens. Repeat penalty down-weights recently used tokens, and stop strings terminate decoding. When both top-p and top-k are set, runtimes usually intersect the candidate sets, then apply temperature over that restricted distribution.
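The candidate-set intersection can be sketched in a few lines. This is an illustrative reimplementation over a toy token-to-logit map, not any engine's actual sampler:

```python
import math

def _softmax(logits):
    """Convert a token -> logit map into a token -> probability map."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def filter_logits(logits, top_k=0, top_p=1.0):
    """Apply top-k, then nucleus (top-p) filtering; return surviving tokens."""
    probs = _softmax(logits)
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]        # keep only the k most probable tokens
    if top_p < 1.0:
        kept, cum = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cum += p
            if cum >= top_p:           # smallest prefix covering probability p
                break
        ranked = kept
    return {tok for tok, _ in ranked}
```

Temperature would then rescale the logits of the surviving set before the final draw; with both knobs at their neutral values (`top_k=0`, `top_p=1.0`) the full vocabulary passes through unchanged.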
For the cleanest comparison we run with temperature=0, top_p=1, top_k=0 or unset, repeat_penalty=1.0, and no stop strings beyond those strictly required by the chat template. In llama.cpp that hits the greedy fast path and avoids unnecessary sampler work; in Ollama it gives the closest possible analogue. If two engines still diverge under those conditions, the difference must live somewhere between the tokenizer and the sampler.
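Concretely, every benchmark request pins those values. The field names below follow the OpenAI-style chat API that both backends accept (`repeat_penalty` and `top_k` are common extensions rather than core OpenAI fields), and `make_request` is an illustrative helper, not nullbench's exact code:

```python
# Decoding settings held constant across backends for the parity runs.
GREEDY_PARAMS = {
    "temperature": 0.0,     # greedy decoding: argmax token every step
    "top_p": 1.0,           # nucleus sampling disabled
    "top_k": 0,             # top-k disabled (llama.cpp treats 0 as "off")
    "repeat_penalty": 1.0,  # no repetition penalty
    "stop": [],             # no stop strings beyond the chat template's own
}

def make_request(model: str, prompt: str) -> dict:
    """Build one benchmark request body with decoding pinned across backends."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **GREEDY_PARAMS,
    }
```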
Hardware and models
All measurements discussed here were taken on an Apple Mac mini M4 Pro with 64 GB of unified memory. The main fairness runs use families where we can pair Ollama’s image with a GGUF that llama.cpp can load directly:
- exaone4.0 around 1.2–1.3B parameters
- cogito in the 3–3.6B range
- llama3.2 at 3B
- gemma3 around 270M
- granite4 in the 340–350M range
Across the matched models we record:
- quality_pct, a judged quality score from 0–100 computed by the nullbench grader.
- Median response time and tokens per second.
- Average and peak memory usage per process.
- A derived quality_per_resource_qpr score that combines the above.
The focus here is on the backend pairs where the weights and decode settings are aligned and the benchmark has a chance to answer “is this engine actually better” without confounding factors.
What the data actually say
Quality is backend-independent at parity
Once we extract the GGUF from Ollama’s image, load it into llama.cpp, and lock decoding, the quality numbers fall into a narrow band.
Look at the quality_pct differences (llama.cpp minus Ollama) across the matched pairs:
- exaone comes in with under a one-point advantage for llama.cpp, with both runs landing around 75–76 on a 0–100 scale.
- cogito shows about a three-point loss for llama.cpp relative to Ollama.
- llama3.2 3B gives llama.cpp a small positive edge of a couple of points.
- gemma3 270M matches exactly to two decimal places.
- granite4 in the 340–350M range also matches exactly.
Every delta sits within roughly three and a half points on a 0–100 scale. Subcategory scores behave similarly: individual sub-metrics shuffle a few points up or down, but there is no consistent pattern where one engine wins “Answer Checking” or “Disambiguation” across the board. For gemma3 and granite4 the curves are essentially identical.
On this evidence backend choice does not control quality once weights and decoding are aligned. The models behave the same.
Throughput favors llama.cpp
While quality collapses to noise, throughput does not. For each paired model, llama.cpp delivers more tokens per second than Ollama.
exaone sees llama.cpp reach about 180 tokens per second versus roughly 154 for Ollama, a gain on the order of seventeen percent. cogito lands around 102 for llama.cpp versus 92 for Ollama, close to a ten percent edge. llama3.2 3B comes out near 100 tokens per second on llama.cpp and mid-80s on Ollama, again around an eighteen percent advantage. gemma3 270M is more dramatic: roughly 278 tokens per second on llama.cpp compared with about 193 on Ollama, in the forty percent range. granite4 shows a similar pattern, with llama.cpp breaking a hundred tokens per second and Ollama staying in the low eighties.
Median latencies are harder to compare because the median output length differs between backends on some runs. That is exactly why tokens per second is the more reliable metric: it isolates engine throughput from prompt or answer length. On that axis llama.cpp consistently wins.
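The metric itself is plain division, which is exactly why it travels well across runs with different answer lengths (illustrative sketch):

```python
def tokens_per_second(output_tokens: int, generation_seconds: float) -> float:
    """Normalize engine speed by output length, isolating throughput."""
    if generation_seconds <= 0:
        raise ValueError("generation time must be positive")
    return output_tokens / generation_seconds

# Two runs with very different answer lengths are still directly comparable:
short_run = tokens_per_second(110, 1.1)  # short answer, ~100 tok/s
long_run = tokens_per_second(560, 5.6)   # long answer, also ~100 tok/s
```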
Memory footprint favors Ollama
The cost of that throughput is memory. Across the same pairs, Ollama’s daemon holds a much smaller resident set than llama.cpp, unless some of the allocation lands in a helper process we did not measure.
exaone shows llama.cpp averaging a bit above 3.2 GB while Ollama sits around 1.3 GB. cogito pushes llama.cpp toward 6.7 GB while Ollama uses roughly 1.8 GB. llama3.2 3B sees llama.cpp around 3.6 GB and Ollama again near 1.8 GB. gemma3 270M and granite4 show similar two-to-one or higher ratios. Peak memory follows the same pattern.
Earlier Mac mini runs done with a different llama.cpp build and template stack had a few cases where llama.cpp ended up lighter than Ollama, for example with Qwen3-Coder 30B and GPT-OSS 20B. Those results came from a different configuration and were part of the reason we went back to extract GGUFs and clean up sampling. In the aligned parity runs, Ollama is clearly more memory-efficient.
So the raw trade-off is straightforward:
- llama.cpp produces more tokens per second for the same model weights.
- Ollama consumes less memory doing the same work.
Which side matters more depends entirely on the constraint of the target system.
QPR and what it is really measuring
nullbench compresses quality, latency, and memory into a single figure, quality_per_resource_qpr. The goal is to approximate “quality per unit resource” with a compact scalar. That is useful when scanning leaderboards, but it can also hide where gains actually come from.
The current definition computes:
QPR = quality_pct / (T^1 * M^0.5)
where quality_pct is the judged quality from 0–100, T is average response time in seconds, and M is average memory in gigabytes. A small epsilon is added to the denominator to avoid division by zero in degenerate cases. Latency is weighted linearly, memory with a square-root factor, on the assumption that time dominates perceived cost while memory still matters but less sharply. For presentation, QPR values within each table are min–max normalized into a qpr_pct percentile so that scores are comparable inside a domain/category but not across unrelated tables.
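The definition translates directly into code. This is a sketch of the formula as stated, with `EPS` standing in for the unspecified epsilon:

```python
import math

EPS = 1e-9  # guard against zero time/memory in degenerate runs

def qpr(quality_pct: float, avg_seconds: float, avg_gb: float) -> float:
    """QPR = quality_pct / (T^1 * M^0.5), per the definition above."""
    return quality_pct / (avg_seconds * math.sqrt(avg_gb) + EPS)

def normalize_qpr(values):
    """Min-max normalize raw QPR values into 0-100 within one table."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [100.0 for _ in values]
    return [100.0 * (v - lo) / (hi - lo) for v in values]
```

The square root is what makes a halving of memory worth less than a halving of latency: cutting M from 4 GB to 1 GB only doubles the score, while cutting T by the same factor quadruples it.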
We recently changed QPR to use average memory instead of peak memory, because peak values were sensitive to short-lived spikes and made rankings noisy. This change raises scores for models with spiky peaks and reflects typical cost more accurately. It also made the Ollama versus llama.cpp trade-off visible in a cleaner way.
On these parity runs, QPR tends to favor Ollama for the exaone, cogito, llama3.2, and granite4 pairs, because its smaller memory footprint dominates the denominator even when quality is flat and tokens per second are lower. The exception is the tiny gemma3 270M model, where llama.cpp is both faster and still reasonably light; QPR reflects that and gives llama.cpp a much higher score.
That reveals more about our cost function than about engine quality. On a Mac mini with 64 GB of unified memory, the difference between 1.8 GB and 3.6 GB per process is not the limiting factor. Throughput and behavior tweaks matter more. QPR, as currently defined, implicitly encodes a world where memory is scarce enough that a two-to-one reduction always dominates a twenty percent speedup. That is reasonable in some deployment settings, but it does not match the laptop and dev-node hardware we care about for nullbench.
The conclusion is that QPR is useful for ranking models within a fixed backend but cannot decide backend choice by itself. For this decision we treat the raw dimensions directly.
Integrating llama.cpp into nullbench
The decision to treat llama.cpp as a first-class backend required some plumbing changes. We now supervise a dedicated llama-server process, wait for a readiness check before routing any traffic, and speak an OpenAI-compatible chat API to it. Requests and responses flow through the same abstraction used for other providers, so existing benchmarking, caching, and judging code paths remain unchanged.
On the telemetry side, timings from different llama.cpp builds are normalized into a single schema, and model size is reported directly from the GGUF file on disk. That makes llama.cpp runs visible in the same dashboards as Ollama, OpenAI, and other backends instead of as a special case.
Sampling controls are standardized across providers. Calls carry explicit temperature, top-k, top-p, repetition penalty, and stop lists. Those values are forwarded to llama.cpp, Ollama, or any remote backend that understands them. Seeded generation is hooked through the request context where backends support seeds, providing determinism for repeated runs. Logit bias is supported per request and as a default overlay: entries can reference token IDs or text fragments, and the merging logic combines several bias sources into a single map that the backend sees. This gives a consistent surface for suppressing or encouraging particular continuations, enforcing grammars, and tuning behavior without rewriting prompts.
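The bias overlay can be sketched as a last-writer-wins map merge, with per-request entries overriding defaults. `merge_logit_bias` is a hypothetical name and the real merging logic may differ in detail:

```python
def merge_logit_bias(*layers):
    """Merge several logit-bias maps, defaults first, per-request last.

    Each layer maps a token id (or a text fragment, resolved to ids
    upstream) to a bias value; later layers override earlier ones for
    the same key, so per-request settings win over configured defaults.
    """
    merged = {}
    for layer in layers:
        if layer:  # skip None or empty layers
            merged.update(layer)
    return merged
```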
The practical result is that llama.cpp lives in the same evaluation and accounting path as everything else. Switching the default backend is a one-line change in configuration, not a structural rewrite.
Why llama.cpp becomes the default
Once the experiments eliminated quality as a differentiator, the question became operational. On one side sits a C/C++ runtime that can be linked directly, exposes low-level logits and decode controls, supports grammar-constrained sampling and seeds, and can be run as a single process without an always-on network daemon. On the other side sits a vendor-maintained daemon with its own lifecycle, registry integration, GUI client, and network surface. The benchmark shows no quality advantage for the daemon and a consistent throughput disadvantage, offset by better memory efficiency.
For our usage, throughput, determinism, and control surface matter more than raw memory efficiency. Embedding or supervising llama.cpp gives us direct ownership of the decode loop, fewer hidden defaults, and a smaller attack surface. The open MIT licensing and repository-centric governance reduce drift risk across updates and avoid surprises from changing service behavior or GUI-driven defaults.
Ollama remains useful as a pragmatic quickstart with homebrew, and in contexts where its memory profile is critical or where its desktop application is explicitly desired. Given equal quality, higher tokens per second on the hardware we actually own, and better alignment with the way we want to own inference, llama.cpp is the backend we will run by default and the one future work will target first.
See /en/blog/2025-11-01-local-ai-capture-ollama-open-webui-and-llama.cpp/ “Open Weights and Local AI: Why Self-Hosting Matters” ↩︎