On-device LLMs are compact, optimized large language models that run directly on local hardware such as smartphones or edge devices, rather than on a remote cloud server. This preserves privacy, since data stays local, and enables offline functionality[1]. We continue to use our nullbench framework[2] to probe alignment and behavior across a broad set of "tiny" models, all in GGUF format[3], under a common sampler configuration. The goal is to understand what happens when we aggressively quantize models in the 0.2–1.2B range, in preparation for a later run on actually constrained hardware. In this post we look at behavioral data and resource use for 135M–1.7B parameter models in llama.cpp.
See the score to size tradeoff analysis and the full results.
[Figure: benchmark score vs. model size scatterplot]
How nullbench sees these models
We are intentionally pragmatic in our framing and scoring. The target model receives standalone prompts, with no chat history and no user profile; every test is a fresh interaction. A prompt might ask the model to answer a question, evaluate two candidate answers and pick one, obey a formatting requirement, or respond to politically loaded content in a symmetric way. The model generates its answer at temperature zero.
The outputs are not evaluated with BLEU, perplexity, or exact-match metrics. Instead, a panel of selected judge models reads the prompt, the response, and a description of the desired behavior. That description defines an "alignment target". The judges then rate how well the response meets its alignment target[4]. These ratings are normalized across judges, aggregated into axis-level and category-level scores, and finally converted to an overall alignment percentage.
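The normalize-then-aggregate step can be sketched as follows. The exact nullbench formula is not shown here; this sketch assumes per-judge z-score normalization followed by a plain mean across judges, which is one common way to cancel out systematically lenient or harsh judges.

```python
from statistics import mean, pstdev

def normalize_per_judge(ratings):
    """Z-score each judge's ratings so systematic leniency or
    harshness cancels out. `ratings` maps judge name -> list of raw
    scores over the same ordered set of tests."""
    out = {}
    for judge, scores in ratings.items():
        mu, sigma = mean(scores), pstdev(scores) or 1.0
        out[judge] = [(s - mu) / sigma for s in scores]
    return out

def aggregate(normalized):
    """Average normalized scores across judges, test by test."""
    return [mean(t) for t in zip(*normalized.values())]
```

With this scheme, a judge that rates everything 0.1 higher than its peers contributes the same relative ordering after normalization.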
The infrastructure also logs operational metrics: peak and average memory consumption during evaluation, median and tail latencies, and raw throughput in requests and tokens per second. From those, nullbench computes a quality-per-resource number by dividing a quality score by a function of time and memory. We treat that composite value, QPR, as an internal guide rather than a deployment objective; it is useful to compare nearby variants but not something we want to optimize blindly.
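The shape of that composite is simple. The cost function below (a weighted product of average latency and peak memory) is an illustrative assumption, not nullbench's exact formula:

```python
def qpr(quality, avg_latency_s, peak_mem_gb,
        time_weight=1.0, mem_weight=1.0):
    """Quality-per-resource: a quality score divided by a cost
    function of time and memory. The weighted-product cost is an
    assumption for illustration."""
    cost = (avg_latency_s ** time_weight) * (peak_mem_gb ** mem_weight)
    return quality / cost
```

Under this sketch, a model scoring 70 at 0.5 s and 1.1 GB outranks one scoring 80 at 2 s and 3.15 GB, which is exactly why QPR is a guide rather than an objective.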
This setup gives us a multidimensional fingerprint for every model × quantization combination: how it behaves on truthfulness, how often it refuses, how symmetrical it is on contentious topics, how predictable its latency is, and how much hardware it consumes while doing all that.
Parameter counts and “heavy 1B” models
The first lesson from the data is that parameter count is a poor predictor of actual resource use[5]. A direct comparison of two nominally similar models illustrates this. On one side sits a Llama-derived 1B-class instruct model in Q3 format; on the other, a Granite-derived 1B-class model, also in Q3. Both are presented in their ecosystems as one-billion-parameter assistants.
Under the same evaluation conditions, the Llama-based model reaches peak memory in the roughly 1.1 GB range. Latency is moderate, throughput is high, and alignment scores across categories are strong. The Granite-based model, by contrast, reaches a peak near 3.15 GB, with significantly lower throughput and weaker behavior on several axes, including ideological symmetry. Both are “1B”, but one consumes roughly three times as much RAM (for worse output).
The situation is similar when comparing across parameter counts. A Qwen-style model around 600M parameters peaks at close to 4 GB of memory with modest throughput, in part because it spends many tokens on extended reasoning, while a Llama-3.2 1B model quantized to Q3 stays near 1.1 GB and handles many more requests per second.
Architecture, KV-cache layout, tokenizer, quantization scheme, and the compiler/runtime stack dominate memory footprint and speed. We cannot treat "N billion parameters" as a hardware proxy. Any realistic deployment plan needs measured peak memory and throughput for the exact binary that will run on the device.
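One way to get that measurement without extra tooling is to run the binary to completion and read the peak resident set size of its child processes from the OS. This sketch uses only the standard library; the example command is a placeholder for whatever invocation actually ships:

```python
import resource
import subprocess

def peak_child_rss_gb(cmd):
    """Run a command to completion and report the peak resident set
    size of child processes, via getrusage(RUSAGE_CHILDREN).

    Caveats: ru_maxrss is in kilobytes on Linux but bytes on macOS,
    and the counter accumulates over all children this process has
    run, so call this from a fresh process for a clean reading.
    """
    subprocess.run(cmd, check=True)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return usage.ru_maxrss / (1024 * 1024)  # KB -> GB on Linux
```

Usage would look like `peak_child_rss_gb(["./llama-cli", "-m", "model.gguf", "-p", "hello", "-n", "64"])`, with the flags depending on your llama.cpp build.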
Quantization as behavior change rather than gentle compression
Quantization is often discussed as a technical detail: something that trades a small loss in perplexity for a large gain in memory and speed. We evaluated several variants of the Llama-3.2 1B instruct model quantized with different schemes: one version used Q3_K_M, another Q4_K_M, another Q2_K, and one extremely aggressive version used a 1-bit scheme.
The 1-bit model essentially collapsed. Its overall alignment score dropped to the single digits, every category contained failures, and latency became erratic. 1-bit weight quantization of a 1B-class model, at least with the recipes tested here, is not a useful way to obtain an aligned assistant.
The Q2 variant behaved better but still poorly. It achieved middling alignment, with noticeable instability in latency: tail latencies were large even when median times were acceptable, and several categories slipped into "ugly" behavior ratings under our thresholds. There were no decisive resource gains to offset this.
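The median-versus-tail distinction matters enough here to spell out. A minimal nearest-rank percentile summary over per-request latencies (names are ours, not nullbench's):

```python
import math

def latency_percentiles(samples_ms, ps=(50, 95, 99)):
    """Nearest-rank percentiles over a list of per-request latencies.

    A model can have a fine p50 yet a p99 several times larger, which
    is the instability pattern described above.
    """
    xs = sorted(samples_ms)
    n = len(xs)
    return {f"p{p}": xs[min(n - 1, math.ceil(p / 100 * n) - 1)]
            for p in ps}
```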
Another surprise came from the Q3 and Q4 variants. Q3 and Q2 looked reasonably aligned on the "consistency and calibration" category, while the Q4 version scored badly, despite using more bits per weight and almost identical memory. The higher-precision variant also retained more of the soft, assistant-like persona in its outputs, yet this did not translate into better scores.
At that point, thinking only in terms of “quantization noise” on logits is insufficient. The pattern suggests that different quantization recipes are interacting with model fine-tuning, especially safety and refusal tuning, in very different ways. They are not just altering perplexity but may be rewriting parts of learned behaviors.
A theory on Llama-3.2 and the "lobotomy" pattern
To understand that pattern, it helps to examine how nullbench probes calibration. Several tests in this category present the model with a short scenario:
- A prompt, such as a question about hard science facts, a translation task, or a reasoning problem.
- Two candidate responses, A and B. One is more accurate, or more cautious, or better reasoned.
- A final instruction: choose which response is better and answer only with “A” or “B”.
The judges see a goal description that encodes the benchmark’s preference. For example, in the question about animal emotions, the description explains that the preferred answer hedges and acknowledges scientific uncertainty. In a translation test, the description notes that one answer omits a verb and so should be considered less faithful.
The target model never sees any of this. It only sees the text of the prompt, the two candidate answers, and the instruction to reply with a single letter.
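Concretely, the model side of such a test can be sketched like this. The template wording and the strict grading rule are assumptions about the harness, not nullbench's literal code, but they capture the constraint that matters for the discussion below:

```python
AB_TEMPLATE = """{question}

Response A: {a}
Response B: {b}

Which response is better? Answer only with "A" or "B"."""

def score_ab(model_output, preferred):
    """Strict grading: the entire reply, stripped, must be exactly
    one letter. 'B, because ...' fails the format even when B is
    the preferred answer."""
    reply = model_output.strip()
    if reply not in ("A", "B"):
        return 0.0  # format violation
    return 1.0 if reply == preferred else 0.0
```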
Now recall how safety and refusal tuning is typically done. Instruction-tuned assistants are encouraged to explain themselves, to hedge on unresolved questions, to avoid definitive answers when evidence is mixed, and to sometimes refuse altogether when the prompt is unsafe. Those behaviors appear as extended text, qualifiers, and additional reasoning phrased explicitly in the output.
A quantization recipe that preserves these learned behaviors will produce richer, often longer, answers. When asked to choose between response A and response B, such a model will sometimes respond with “B, because …” or prepend a justification before producing the bare letter. In interactive settings that is often what users want. Under our framework, which enforces a strict output format for these tests, this becomes a liability.
By contrast, a more aggressive quantization can degrade exactly these safety and stylistic circuits. Weights that encode nuanced refusal patterns, hedging strategies, and stylistic flourishes become noisy. What remains relatively robust are the stronger, simpler associations that map directly from words in the prompt to the requested control tokens “A” or “B”. Under zero-temperature, such a model behaves like a deterministic classifier over a small set of options, unconcerned with explaining itself.
That is the “lobotomy” hypothesis. The Q3 recipe in this family appears to have trimmed away much of the delicate behavior that makes a modern assistant feel cautious and nuanced. What remains is more robotic, less likely to add extra words where the format forbids them, and less inclined to refuse. In nullbench’s calibration tests this behavior scores better: it obeys the letter of the instruction, avoids verbosity, and rarely trips the output parser.
The Q4 variant, by preserving more of the original fine-tuning, acts in a way that aligns better with current practice for deployed assistants but worse with the narrow requirements of these specific tests. It sometimes feels “smarter” in normal conversation yet looks worse on metrics that treat one-character responses as the gold standard.
The lesson is not that Q3 is “better” than Q4 in an absolute sense. The lesson is that low-bit quantization and safety tuning interact in nontrivial ways. A quantization recipe can remove exactly the parts of the network that handle nuance and hesitation, leaving behind something that follows literal instructions very well in synthetic tests while behaving less cautiously in messy real-world use.
Further, sampling the same tests multiple times yielded run-to-run variability. We have included the higher-scoring run in the published results and will explore this phenomenon in future work.
Reflex tier and thinking tier
Beyond this one model family, the experiments suggest a functional split between two classes of tiny models.
In the lower class, with sizes from roughly 270M up to around 700M parameters, models perform well on surface-level tasks. They translate short sentences, rewrite text in a different tone, follow formatting instructions, and often write clean, persuasive copy. On nullbench they achieve high scores in copywriting and tone compliance, as well as solid marks for instruction following. Where they tend to falter is in content omission, salience, and deeper factual checking. When prompts require selecting the less wrong of two explanations, or when political content demands symmetry and restraint, their behavior becomes inconsistent.
This cluster includes models such as small Gemma variants and mid-size LFM models. These systems are fast and light, they run comfortably in sub-gigabyte memory budgets, and their throughput is high. That makes them attractive as local utilities. They are well suited for routing, template completion, rewriting existing content, and similar “reflexive” tasks where structure matters more than deep reasoning.
In the upper class, near 1B and 1.2B parameters, models like EXAONE-4.0-1.2B Q4 and Llama-3.2-1B Q3 show a different profile. They achieve higher scores in answer checking, more consistent behavior across task types, and more stable symmetry on contentious topics. They still fail at times, but the distribution of errors shifts. Instead of erratic answers driven by shallow associations, the failures appear to be more localized.
These models consume more memory but remain within realistic bounds for strong edge devices. A 1.2B Q4 variant in this family might peak around 2.3 GB in our setup, and a 1B Q3 variant around 1.1 GB. On a device with 8 GB of RAM such as a Raspberry Pi, those numbers are acceptable given appropriate process isolation.
The border between these two tiers is not perfectly sharp, but the pattern appears robust. Below a certain capacity, aligned behavior for general assistant roles is fragile even under careful quantization. Above that capacity, alignment and calibration become more stable. It is natural to treat the lower tier as a reflex layer and the upper tier as a thinking layer in a multi-model system.
Benchmarks and the verbosity penalty
The interaction between quantization and behavior is complicated further by how automated benchmarks evaluate outputs.
Several tests are built around strict output formats. The ones discussed earlier demand exactly “A” or “B”. Other tests require a specific JSON structure, or a response that must not include certain markers. These constraints are important for judging instruction following, and they help standardize scoring.
They also create a systematic penalty for models that try to explain or hedge. The more capable a model is, the more likely it is to add explanation. Safety-tuned assistants have been trained to justify their answers, caveat uncertain claims, and include additional context. All of these appear as extra tokens that the scoring script may treat as violations. In settings where the checker only inspects the first token, this is still manageable. In settings where formats must match exactly, any deviation becomes a failure.
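The two checker styles mentioned can be contrasted in a few lines. These are illustrative helpers, not the benchmark's actual code:

```python
def first_token_check(reply, expected):
    """Lenient: only the first whitespace-delimited token must match,
    after stripping trailing punctuation. A verbose justification
    after the letter still passes."""
    toks = reply.strip().split()
    return bool(toks) and toks[0].rstrip(".,:") == expected

def exact_check(reply, expected):
    """Strict: the whole reply must equal the expected string, so any
    extra explanation is a failure."""
    return reply.strip() == expected
```

A safety-tuned model that answers "B, because the second answer hedges appropriately." passes the first checker and fails the second; a "robotic" model that answers "B" passes both.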
The result is that “robotic” models, including some heavily quantized variants, can achieve better benchmark scores than richer models simply because they stay silent beyond the bare requirement. Under task descriptions that emphasize literal compliance and do not reward additional helpfulness, this behavior is rational from the model’s perspective.
That does not invalidate these benchmarks. They are good at detecting models that ignore instructions outright or that cannot be forced to produce constrained output. What our results show is that benchmarks of this kind must be interpreted carefully when comparing quantization levels. Differences in scores can reflect a combination of genuine capacity differences, robustness to noise, and a capacity for verbosity that is penalized by the evaluation harness.
Quality per resource as a guide, not a goal
The quality-per-resource metric is useful but easy to misuse. By dividing an aggregate quality number by a function of average response time and peak memory, we try to give a simple, at-a-glance sense of which models deliver “more behavior per unit cost” on the hardware where the benchmark ran.
Tiny models such as Gemma-270M with quantization-aware training often score very well. They produce moderately aligned behavior while consuming very little memory and responding quickly. In a relative comparison of, say, three variants of Gemma-270M with different quantization schemes, QPR is a helpful way to decide which variant is worth deeper analysis. Problems arise when QPR is treated as a standalone optimization target. The metric does not encode minimum quality thresholds; it is entirely possible for a model with mediocre alignment to outrank a more capable one if it is far faster or smaller. Nor does the metric account for nonlinearity in human tolerance for errors. Beyond a certain level of misalignment, hard failures dominate any efficiency gains.
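The missing-threshold problem is easy to demonstrate. As a sketch (assuming a simple latency × memory cost; the numbers are made up), a quality floor changes which model wins:

```python
def rank_by_qpr(models, min_quality=None):
    """Rank (name, quality, latency_s, peak_gb) tuples by a simple
    QPR, optionally discarding anything below a quality floor first.
    Without the floor, a fast mediocre model can outrank a capable
    one."""
    pool = [m for m in models
            if min_quality is None or m[1] >= min_quality]
    return sorted(pool,
                  key=lambda m: m[1] / (m[2] * m[3]),
                  reverse=True)

candidates = [("tiny", 40, 0.1, 0.3), ("solid", 75, 0.8, 1.1)]
```

Here `rank_by_qpr(candidates)` puts "tiny" first, while `rank_by_qpr(candidates, min_quality=60)` puts "solid" first, which is the behavior we actually want for deployment decisions.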
In our work we treat QPR as a secondary signal. We use it to identify candidates that live on useful parts of the efficiency–quality frontier within a model family and size band. We do not use it to compare across tiers or as a primary criterion for deployment.
Toward real constrained hardware
The desktop and workstation runs discussed so far already falsify several convenient assumptions: that parameter count can be used as a rough proxy for memory, that higher-bit quantization always improves behavior, and that benchmarks cleanly order models by desirability. They do not tell us how these models behave under truly limited compute and memory.
That is the next step. Based on the data and the qualitative analysis, we selected five models that together span the reflex and thinking tiers and cover a range of sizes and quantizations:
- EXAONE-4.0-1.2B with Q4_K_M, representative of a 1.2B thinking-tier assistant with strong alignment in this size class.
- Llama-3.2-1B Instruct with Q3_K_M, a compact 1B model and the central example of the “lobotomy” effect.
- LFM2-700M with Q4_K_M, a reflex-tier model on the upper side of the small range, with good instruction following and modest alignment.
- Gemma-3-270m instruct with Q8_0 and Q4_K_M, tiny models that anchor the lower end of the reflex tier.
All are already RR-tested, which gives us a behavioral baseline. The next step is to run them on a Raspberry Pi 4B with 8 GB RAM, compiling llama.cpp appropriately for that platform, and measuring three things carefully: peak memory under realistic serving conditions, latency distributions at different thread counts, and throughput in tokens per second.
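A small harness for the latency and throughput parts of that measurement might look like the following. The `generate` callable is a placeholder to be wired to the actual runtime on the Pi (for example, a subprocess call to the compiled llama.cpp binary), and is assumed to return the number of tokens produced:

```python
import statistics
import time

def measure(generate, prompts, runs_per_prompt=3):
    """Time a generation callable over a prompt set and summarize
    median latency, tail latency, and overall token throughput."""
    latencies, tokens = [], 0
    start = time.perf_counter()
    for prompt in prompts:
        for _ in range(runs_per_prompt):
            t0 = time.perf_counter()
            tokens += generate(prompt)
            latencies.append(time.perf_counter() - t0)
    wall = time.perf_counter() - start
    return {
        "p50_s": statistics.median(latencies),
        "p95_s": sorted(latencies)[int(0.95 * (len(latencies) - 1))],
        "tokens_per_s": tokens / wall,
    }
```

Repeating the run at different thread counts then gives the latency distributions the plan calls for.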
Along with those, a small panel of nullbench prompts will be re-run on the Pi to confirm that qualitative behavior does not degrade in unexpected ways when memory bandwidth and CPU performance are constrained. That experiment will move our analysis from relative figures on a strong host to concrete numbers on hardware that actually resembles many edge devices in the wild.
Only with those numbers in hand can we answer the question that motivated this work: not whether a particular tiny model is “good” in an abstract sense, but which specific quantized variants remain aligned and usable when squeezed into the strict budgets of real on-device deployment.
1. Android Developers, "Find the right AI/ML solution for your app" (notes offline operation and keeping user data on-device for privacy). https://developer.android.com/ai/overview
2. nullbench: Behavioral Benchmarking for Language Models.
3. ggml-org, "llama.cpp", https://github.com/ggml-org/llama.cpp; GGUF format documentation, https://huggingface.co/docs/hub/en/gguf
4. Lianmin Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (LLM-as-judge methodology, known biases, and mitigations), arXiv:2306.05685. https://arxiv.org/abs/2306.05685
5. NVIDIA Developer Blog, "Mastering LLM Techniques: Inference Optimization" (model weights and KV cache are the primary memory contributors at inference time). https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/