This post reports a small benchmark run on a Raspberry Pi 4B (8 GB RAM) using llama.cpp across five compact GGUF models spanning ~270M to ~1.2B parameters, with aggressive quantization. The aim was to measure how “tiny assistant” or “agentic model” behavior and basic serving characteristics shift when the same evaluation harness is moved from a desktop-class machine to constrained CPU and memory bandwidth.
[Figure: benchmark scatterplot (throughput)]
See the full results.
Setup and what the score represents
Our pragmatic evaluation protocol is standalone prompts (no chat history), temperature zero, and a judge-panel scoring scheme that maps each response to an “alignment target” per test. The overall “score” is an aggregate across the compliance categories present in this run, alongside operational logging (memory, latency distribution, throughput). These operational metrics are not normalized for response length, so they must be interpreted alongside median output length and any evidence of truncation or omission-style failures.
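The operational half of the protocol can be sketched as a thin wrapper that records wall-clock latency per request; the endpoint, port, and log format here are illustrative assumptions, not the harness's actual code.

```shell
# Sketch of per-request operational logging (assumed endpoint and log format,
# not the real harness): record wall-clock latency alongside the raw response
# so the latency distribution and throughput can be derived afterwards.
timed_request() {
  # $1 = JSON payload for one standalone, temperature-zero prompt
  local start end resp
  start=$(date +%s)
  resp=$(curl -s http://127.0.0.1:34285/v1/chat/completions \
    -H 'Content-Type: application/json' -d "$1")
  end=$(date +%s)
  printf '%d\t%s\n' "$(( end - start ))" "$resp"   # "latency_s<TAB>response"
}
```

Response-length normalization can then be done offline from the logged responses, which is exactly why the raw body is kept next to the latency figure.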
Raspberry Pi Setup
We ran this benchmark on a Raspberry Pi 4 Model B with 8 GB RAM using Raspberry Pi OS
(previously called Raspbian) and llama.cpp built from source. The goal of this section is reproducibility: pin the toolchain, show the exact build steps, and record the runtime envelope used for all models.
Install a minimal build toolchain plus Go (for nullbench) and curl development headers (used by some local tooling):
sudo apt update
sudo apt install -y git build-essential cmake curl libcurl4-openssl-dev tmux golang
Build llama.cpp on-device:
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
Binaries land under:
./build/bin (e.g. ./build/bin/llama-server)
Models staged on the Pi (GGUF)
Copy the GGUF files onto the device. We used these files specifically:
- LGAI-EXAONE/EXAONE-4.0-1.2B-Q4_K_M.gguf
- unsloth/Llama-3.2-1B-Instruct-Q3_K_M.gguf
- unsloth/LFM2-700M-Q4_K_M.gguf
- unsloth/gemma-3-270m-it-Q8_0.gguf
- unsloth/gemma-3-270m-it-Q4_K_M.gguf
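Staging can be done with huggingface-cli; the repo id below is an assumption inferred from the file-name prefix above, so verify it before downloading.

```shell
# Hypothetical staging step -- the Hugging Face repo id is an assumption
# inferred from the "unsloth/..." prefix above; verify it before running.
mkdir -p ~/models/unsloth ~/models/LGAI-EXAONE
# huggingface-cli download unsloth/gemma-3-270m-it-GGUF \
#   gemma-3-270m-it-Q4_K_M.gguf --local-dir ~/models/unsloth
```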
Thermal + power handling
A Raspberry Pi 4B running at 60°C is generally considered normal, especially under load; in our configuration, ~50°C is close to idle under light load on Raspberry Pi OS. Targeting that band for inference, a 5s cooldown between invocations still let temperatures ramp up quickly. A 10s cooldown dropped them by ~4°C, but they shot back into the low 60s soon after inference resumed. We settled on a 15s cooldown between inference runs, which allowed an ~8-10°C drop into the low 50s between invocations. We monitored temperature with:
vcgencmd measure_temp
# temp=61.3'C
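The cooldown gate between invocations can be sketched as follows; parsing the vcgencmd output format shown above is the only assumption beyond the 15s figure we settled on.

```shell
# Cooldown gate between inference runs. read_temp parses vcgencmd output of
# the form "temp=61.3'C" down to an integer Celsius value.
read_temp() {
  vcgencmd measure_temp | sed "s/temp=\([0-9]*\).*/\1/"
}
cooldown() {
  # Fixed 15 s pause; log temperature before and after so the ramp-down
  # between invocations is visible in the run log.
  echo "pre-cooldown:  $(read_temp)C"
  sleep 15
  echo "post-cooldown: $(read_temp)C"
}
```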
The CPU won’t throttle due to thermals until it reaches 80-85°C, but throttling events can be monitored with:
vcgencmd get_throttled
# throttled=0x0
We observed initially that the Raspberry Pi was experiencing under-voltage and throttling events during intensive inference tasks. The get_throttled command returned a value of 0x50000, in binary 0101 0000 0000 0000 0000, which indicates that both under-voltage (bit 16) and throttling (bit 18) events have occurred since the last reboot.
To avoid under-voltage and throttling, ensure that the Raspberry Pi is powered with a quality power supply unit (PSU) and use a good-quality USB-C cable. After changing the cable, subsequent runs showed no under-voltage or throttling events.
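A small helper makes the bitfield self-documenting; the bit positions follow the firmware semantics described above (bit 16 = under-voltage has occurred, bit 18 = throttling has occurred since boot).

```shell
# Decode the get_throttled bitfield (occurred-since-boot bits only).
decode_throttled() {
  # $1 = hex value as printed by vcgencmd, e.g. 0x50000
  local v=$(( $1 ))
  (( v & (1 << 16) )) && echo "under-voltage has occurred"
  (( v & (1 << 18) )) && echo "throttling has occurred"
  (( v == 0 )) && echo "no events recorded"
  return 0
}
# decode_throttled "$(vcgencmd get_throttled | cut -d= -f2)"
```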
Our Raspberry Pi 4B power consumption was ~6.7W during inference, up from ~2.5W idle, as read from a power bank with power measurement capabilities.
llama-server baseline configuration
We served each model locally with a fixed context window and conservative concurrency to keep behavior and tail latency measurable on Pi hardware. Example for EXAONE:
./llama.cpp/build/bin/llama-server \
-m ~/models/LGAI-EXAONE/EXAONE-4.0-1.2B-Q4_K_M.gguf \
--ctx-size 2048 \
--alias EXAONE-4.0-1.2B-Q4_K_M \
--host 127.0.0.1 \
--port 34285 \
--threads 4 \
--parallel 1 \
--no-prefill-assistant \
--no-webui
This binds the server to localhost, uses four threads (matching the Pi 4B core count), and limits to one in-flight request (--parallel 1) to avoid hiding saturation effects behind queueing. The web UI is disabled to keep overhead out of the measurement path.
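Before timing anything, it is worth gating on server readiness; llama-server exposes a /health endpoint, and a bounded polling loop (a sketch, with an assumed retry budget) keeps model-load time out of the measured request latency.

```shell
# Readiness gate: poll llama-server's /health endpoint with a bounded number
# of attempts so model-load time is not accounted as request latency.
wait_ready() {
  # $1 = base URL, $2 = max attempts (default 30, an arbitrary budget)
  local url=$1 tries=${2:-30} i
  for (( i = 0; i < tries; i++ )); do
    curl -sf "$url/health" > /dev/null && return 0
    sleep 1
  done
  return 1
}
# wait_ready http://127.0.0.1:34285 && echo "server up"
```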
Headline results
Across this set, EXAONE-4.0-1.2B (Q4_K_M) is the clear leader on behavioral score, with strong performance on consistency/calibration and meta-instruction resolution. The smaller models run faster and lighter, but their low median response lengths and weak “answer checking”/task-type resolution suggest many “speed wins” coincide with under-answering or failing the task rather than efficient completion.
A compact view of the run:
[Table: main benchmark leaderboard]
Two immediate takeaways emerge.
Quality and latency decouple on constrained hardware. EXAONE’s aggregate behavior is substantially better, yet under this configuration its median interaction latency is measured in minutes, consistent with its far longer outputs.
“Fast” can mean “did not do the job.” The 270M-class Gemma variants and the 700M LFM2 variant show shorter median outputs and weak scores in omission/salience and answer checking. That profile sometimes corresponds to refusal-like behavior, format-only outputs, or premature termination, which can look great on latency charts while scoring poorly on the behavioral axes.
For local GGUF-style inference, a practical rule of thumb is to treat ~5 tokens per second of decode speed as “usable”: at that rate, interactive responses do not feel stalled, and higher rates mainly improve responsiveness and perceived snappiness. The same threshold, or even slightly lower (3–5 t/s), remains acceptable for edge devices running background workloads such as periodic operational summaries, where end users do not watch tokens stream. For example, an edge model processing a 3,000-token context at 20 tokens per second and generating a 200-token summary at 3 tokens per second would finish in a few minutes, which may be adequate for an hourly status report while keeping model size and hardware demands modest. This holds provided edge AI is genuinely needed for the use case.
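The arithmetic behind that example is just additive prefill and decode time; a throwaway helper makes it easy to check other envelopes.

```shell
# Back-of-envelope timing for the hourly-summary example above:
# prefill time + decode time, in whole seconds (integer division).
summary_seconds() {
  # $1 = context tokens, $2 = prefill t/s, $3 = output tokens, $4 = decode t/s
  echo $(( $1 / $2 + $3 / $4 ))
}
# summary_seconds 3000 20 200 3   -> 216 (about 3.6 minutes)
```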
Model profiles
EXAONE-4.0-1.2B (Q4_K_M): best behavior, heavy interaction cost
EXAONE achieves the strongest overall score (90.5) and is unusually consistent across several categories: consistency/calibration (98), ideological symmetry (99), and task-type/meta-instruction resolution (98). Its weaker point in this run is format and instruction compliance (68), which is still not disastrous but stands out given the rest of its profile.
Operationally, EXAONE is an outlier. Peak memory is ~2.14 GB (average ~1.63 GB), which is workable on an 8 GB device, but latency is extreme: ~136 s median with a long tail (P95 ~271 s, max ~445 s). This is the closest thing in the batch to a “general assistant,” but the default behavior is too slow for typical interactive use unless output is capped, prompts are constrained, or generation is offloaded.
Interpretation: EXAONE looks like a viable “thinking-tier” baseline for Pi-class deployment experiments, but only if the serving envelope is redesigned around short answers, streaming UX, or a two-stage pipeline where a smaller model handles most requests.
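One concrete way to impose that short-answer envelope is to cap generation per request through the OpenAI-compatible max_tokens field, rather than trusting the model to stop early; the cap value here is illustrative.

```shell
# Per-request output cap via the OpenAI-compatible API (the cap value is
# illustrative); this bounds worst-case decode time directly.
capped_payload() {
  # $1 = prompt, $2 = token cap (e.g. 128 for a short-answer envelope)
  printf '{"messages":[{"role":"user","content":"%s"}],"temperature":0,"max_tokens":%d}' "$1" "$2"
}
# curl -s http://127.0.0.1:34285/v1/chat/completions \
#   -H 'Content-Type: application/json' -d "$(capped_payload 'Summarize: ...' 128)"
```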
Llama-3.2-1B-Instruct (Q3_K_M): moderate behavior, moderate cost, uneven reliability
The Llama-3.2-1B Q3 variant lands near LFM2 on overall score (52.0) but with a different failure shape. It shows strong copywriting/tone compliance (93) and decent format compliance (85), while scoring poorly on answer checking (36), consistency/calibration (33), and especially task-type/meta-instruction resolution (25). That combination typically reads as “sounds like an assistant” while missing the core task more often than desired, particularly when prompts require strict role selection or constrained outputs.
Resource use is midrange: peak RAM ~1.06 GB (average ~0.95 GB) and median latency ~12.3 s with a P95 ~41.7 s. On this host, a 1B-class model can still be compute-bound even for short answers.
Interpretation: this variant looks like a plausible middle ground when EXAONE is too expensive, but the behavioral score indicates it needs either stronger routing (only ask it tasks where its “assistant persona” helps) or supplementary checks when correctness is required.
LFM2-700M (Q4_K_M): lightweight, but high omission risk
LFM2-700M matches Llama-3.2-1B on overall score (~52.29) while being substantially cheaper in memory (peak ~0.63 GB) and faster in median latency (~9.2 s). However, its category scores suggest a different kind of brittleness: answer checking (50) and consistency (51) are mediocre, ideological symmetry is low (30), and the run records content omission/salience fidelity at 0. Coupled with a short median response length, this implies that many outputs were minimal, incomplete, or structurally noncompliant in ways the judges penalized.
Interpretation: LFM2’s hardware profile is attractive, but this run indicates a need to audit completion behavior carefully. If the model is frequently returning short answers, throughput and latency will look excellent while utility collapses. It may still be useful in narrow “reflex” roles (rewrites, small formatting transforms) where short outputs are valid and correctness is easy to verify.
gemma-3-270m-it (Q8_0 and Q4_K_M): very fast, low behavioral ceiling here
Both Gemma 270M variants are the speed leaders (median ~2.0–2.2 s, toks/s ~4.25–4.44) with low memory footprints (peak ~0.48–0.53 GB). Their behavioral scores, though, are materially lower: 33.86 for Q8_0 and 17.57 for Q4_K_M in this run. Both show “format compliance” that is not terrible (79–86), but very weak task-type/meta-instruction resolution (10 and 0) and weak consistency/calibration.
The Q8_0 variant outperforms Q4_K_M on behavioral score despite being “less compressed,” which is consistent with quantization sometimes acting as a behavior change rather than a monotone “quality dial.” On these prompts, that difference shows up as Q8_0 being less degraded across categories than the Q4_K_M variant, even though both remain in a low-score regime overall.
Interpretation: Gemma-270M remains compelling as an edge utility model when the task is shallow and tightly constrained, but this run does not support using it as a general assistant on Pi-class hardware without strong guardrails and narrow task selection.
Efficiency metrics and the problem of “good-looking” speed
The run includes a quality-per-resource signal (QPR). The Gemma variants score extremely well on QPR by virtue of low latency and low memory, yet their behavioral scores are low. In other words, they are efficient at producing something quickly, but the benchmark suggests that “something” often fails the alignment target or omits key requirements.
Conversely, EXAONE’s QPR is low despite being the best-behaved model, largely because it generates far more tokens and does so slowly on this CPU. This is a reminder that on constrained hardware, “efficiency” metrics need a minimum-quality floor, and they need normalization that accounts for output length and completion rate.
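The floor-plus-normalization idea can be sketched numerically; the floor value, inputs, and formula below are illustrative assumptions, not the benchmark's actual QPR definition.

```shell
# Illustrative quality-per-resource with a minimum-quality floor: scores under
# the floor get QPR = 0, and latency is normalized by output length so terse
# answers stop looking "efficient". Floor and formula are assumptions.
qpr() {
  # $1 = behavioral score, $2 = median latency (s), $3 = median output tokens
  awk -v q="$1" -v lat="$2" -v toks="$3" 'BEGIN {
    floor = 40                            # illustrative minimum-quality floor
    if (q < floor || toks == 0) { print 0; exit }
    printf "%.2f\n", q / (lat / toks)     # quality per (seconds per token)
  }'
}
# qpr 90.5 136 900   vs   qpr 33.86 2.1 40
```

Under a rule like this, the Gemma variants' QPR collapses to zero at this floor, while EXAONE's long-but-complete outputs are no longer penalized for their raw latency alone.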
A practical way to read this run is to treat the models as occupying different operating regimes rather than a single quality ranking.
- EXAONE appears closer to an assistant, but has an interaction cost that is hard to justify without aggressive token caps and routing.
- Llama-3.2-1B Q3 and LFM2-700M Q4 are mid-tier options that may be workable for structured roles, but show meaningful gaps in answer checking and task-type resolution.
- Gemma-270M is a fast utility model with a low behavioral ceiling on this suite.
What this implies for deployment design
These results point again toward a systems conclusion rather than a single “best model” answer.
- If the edge model is expected to behave like a general assistant, EXAONE’s behavior is the only one in this set that clears that bar, but the latency profile demands product-level constraints (short answers, streamed output, tight timeouts) and likely a smaller front model for triage.
- If the model is expected to provide local automation primitives (rewrite, classify, template-fill), the tiny models remain attractive, but only when the task definition tolerates terse outputs and you can cheaply validate correctness.
- The mid-tier models are the most ambiguous. They are cheap enough to serve, but the benchmark suggests they can fail silently in ways that look like compliance (polished tone, correct format) while missing the underlying intent. That is a hard failure mode for automation workloads.
A useful next step within the same harness would be to re-run with explicit output caps and explicit “minimum length” requirements where appropriate, so that speed no longer benefits from producing near-empty answers. That would separate genuine efficiency from omission-driven performance and give a more deployment-relevant frontier for each model family.