We can run serious long-context inference on commodity Apple silicon, but long context is hard. In this post we’ll touch on what Grouped-Query Attention (GQA) changes, and how to size a context window with llama.cpp and Ollama on ~64 GB unified-memory Apple M-series machines, the class we treat as commodity hardware.
“Attention” and “heads” in brief
Attention is the step where each token decides which other tokens matter to it right now. The model makes a query vector for the current token, compares it to key vectors from allowed tokens (masking out the future in causal models), turns those similarities into weights, then blends the corresponding value vectors to produce an updated token representation. A head is one independent copy of this mechanism with its own small projection matrices, working on a slice of the model’s width; multi-head attention runs many heads in parallel so the model can learn different patterns at once (one might focus on nearby words, another on entities, another on punctuation). After all heads finish, their outputs are concatenated and mixed back into the model’s full width.
Per head, Q, K, V are just learned linear transforms of the token’s vector: q = xW_Q, k = xW_K, v = xW_V. In standard multi-head attention, every head has its own Q/K/V, so for long prompts we must store a lot of K and V data in the KV cache: one set per head per token. Grouped-Query Attention (GQA) keeps many query heads but shares a smaller number of key/value heads across them, which slashes KV memory and bandwidth while keeping diverse query patterns. If a config shows `num_key_value_heads` smaller than `num_attention_heads`, it’s using GQA.
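To make the sharing concrete, here is a toy NumPy sketch of one GQA layer. The shapes, head counts, and weights are illustrative choices, not any particular model’s: four query heads borrow K/V from two shared KV heads, so each KV head serves two query heads.

```python
import numpy as np

def gqa_attention(x, Wq, Wk, Wv, n_q_heads=4, n_kv_heads=2):
    """Toy GQA layer: n_q_heads query heads share n_kv_heads key/value heads."""
    T, d = x.shape
    hd = d // n_q_heads                       # per-head width
    q = (x @ Wq).reshape(T, n_q_heads, hd)    # one projection per query head
    k = (x @ Wk).reshape(T, n_kv_heads, hd)   # fewer K/V heads -> smaller KV cache
    v = (x @ Wv).reshape(T, n_kv_heads, hd)
    group = n_q_heads // n_kv_heads           # query heads served by each KV head
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # causal: hide future tokens
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                       # which shared KV head this Q head uses
        scores = (q[:, h] @ k[:, kv].T) / np.sqrt(hd)
        scores[mask] = -np.inf
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)    # softmax over allowed positions
        out[:, h] = w @ v[:, kv]
    return out.reshape(T, d)                  # concatenate heads back to model width

# 5 tokens, model width 8: Wk/Wv project to only n_kv_heads * hd = 4 dims
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
y = gqa_attention(x, rng.normal(size=(8, 8)),
                  rng.normal(size=(8, 4)), rng.normal(size=(8, 4)))
```

Note that only `k` and `v` ever need caching, and their head dimension is set by `n_kv_heads`, not `n_q_heads`; that is the entire memory saving.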
Why long context stresses memory and time
Self-attention is the operation where each token looks at other tokens and decides what to pay attention to. During prompt ingestion (“prefill”), vanilla attention compares every token against every earlier token, which is why time scales with the square of prompt length. During generation, we reuse past keys/values, so time per new token scales roughly linearly with the length so far.
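A back-of-the-envelope way to see the two regimes is to count token-pair comparisons (a hypothetical cost model, not real kernel timings):

```python
def prefill_comparisons(n):
    # causal prefill: token i attends to positions 0..i, so total ~ n^2 / 2
    return n * (n + 1) // 2

def decode_comparisons(context_len):
    # each new token attends to everything cached so far: linear in length
    return context_len + 1

# doubling the prompt ~quadruples prefill work but only ~doubles per-token decode
ratio = prefill_comparisons(2000) / prefill_comparisons(1000)
```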
The memory pressure comes from storing those past keys and values, the KV cache. KV memory grows linearly with context length and is independent of the number of tokens we generate later. FlashAttention accelerates attention and avoids materializing giant score matrices, which cuts memory traffic and improves speed, but it doesn’t remove the need to keep a KV cache around for the whole prompt[^1].
A useful mental model is: every extra token we admit into the window adds a fixed slice of KV data per layer. The size of that slice depends on the model’s hidden size, number of layers, and how many KV heads it keeps (that last part is where GQA helps).
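That fixed per-token slice is easy to compute from the config fields discussed above. The sketch below assumes fp16 K and V and a head dimension of hidden_size / num_attention_heads:

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V (factor of 2), one slice per layer, sized by KV heads (not Q heads)
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Llama 3.1 8B: 32 layers, 8 KV heads, head_dim = 4096 // 32 = 128, fp16 KV
print(kv_bytes_per_token(32, 8, 128) // 1024)  # prints 128 (KB per token)
```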
What GQA is and why it helps
Standard multi-head attention uses separate Q, K, and V per head. With GQA, several Q heads share a smaller set of K/V heads. If a layer has 32 attention heads but only 8 KV heads, then each KV head serves 4 Q heads. That reduces KV cache memory and bandwidth by a factor of 4 compared to “no GQA,” because the cache scales with KV heads, not Q heads. Frameworks expose this via `num_key_value_heads` in the model config; if it’s smaller than `num_attention_heads`, the model uses GQA[^2].
Plenty of open-weight models do this by default. Llama 3.1 8B ships with 32 attention heads and 8 KV heads, and its long-context variant sets position embeddings for 128k tokens[^3][^4]. Mistral 7B also uses 32 attention heads with 8 KV heads in common releases[^5].
What a 64 GB unified-memory Mac mini can really handle
Apple’s M4 Pro Mac mini can be configured with 64 GB of unified memory. Unified means CPU, GPU, and the Neural Engine all draw from the same pool, so weights + KV cache + scratch space must fit together[^6][^7].
A practical budgeting approach:
- we keep ~8 GB for the OS, drivers, and headroom;
- subtract our weight file size (e.g., a quantized 20–30 GB GGUF for a larger model);
- the remainder is our KV budget.
We do not need an equation to use this day to day. Treat KV cost per token as a constant once we fix the model. For an 8B GQA model like Llama 3.1 8B (32 layers, hidden size 4096, 8 KV heads out of 32 attention heads) with fp16 KV, a good rule of thumb is ≈128 KB per token. On a 64 GB machine with a ~25 GB model (larger or higher-precision than a typical 8B Q4 build), that leaves ~31 GB for KV, which corresponds to ~250k tokens at batch size 1. If our runtime enables int8 KV (1 byte per element instead of 2), the same setup stretches to ~500k tokens.
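The budgeting steps above can be sketched in a few lines of Python. The defaults mirror the worked example: 64 GB total, 8 GB of headroom, a ~25 GB weight file, and the ≈128 KB/token thumb-rule.

```python
GIB = 1024 ** 3

def max_context_tokens(total_gb=64, headroom_gb=8, weights_gb=25,
                       kv_bytes_per_token=128 * 1024):
    """KV budget = total - OS headroom - weights; divide by per-token KV cost."""
    kv_budget_bytes = (total_gb - headroom_gb - weights_gb) * GIB
    return kv_budget_bytes // kv_bytes_per_token

print(max_context_tokens())                              # fp16 KV: ~254k tokens
print(max_context_tokens(kv_bytes_per_token=64 * 1024))  # int8 KV: ~508k tokens
```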
A simpler rule of thumb on this 64 GB target: 16k always fits even with 25–30 GB weight files, and it will feel snappy.
Two caveats matter:
- Served window vs trained window. A model trained or fine-tuned for 128k (e.g., Llama 3.1 long-context) usually behaves well there. Pushing beyond the trained window with only a rope-scaling flag can degrade alignment or retrieval ability. When we truly need 1M-token windows, look for weights meant for it, such as Qwen 2.5-14B-Instruct-1M[^8].
- Time still grows with length. Even when memory fits, million-token prompts have slow prefill. FlashAttention helps; patience is still required for huge documents[^1].
Picking a model for long context on a Mac
For local, single-box inference, we want three things at once: GQA (to shrink KV), long-context-ready weights (trained or robustly rope-scaled), and reasonable size so weights leave room for KV.
- Llama 3.1 8B (128k) is a strong baseline: it uses GQA (8 KV heads) and has widely available quantizations. Its long-context releases commonly show `max_position_embeddings: 131072` and `num_key_value_heads: 8`[^3][^4].
- Mistral 7B/Mixtral variants generally use GQA (8 KV heads). If we grab a 32k/64k/128k tuned build, it behaves well on a Mac and leaves headroom for KV[^5].
If we truly need near-million-token memory and we’re okay with a medium model, use something with long-context training and GQA. The Qwen 2.5-14B-Instruct-1M weights exist specifically for that regime; on a 64 GB Mac with int8 KV, our memory budget supports very large windows provided we process a single request at a time[^8].
Regardless of model, keep our first target modest (e.g., 128k), measure responsiveness, then step up to 256k and beyond only when a workload needs it.
How to confirm a model uses GQA
Open the model’s `config.json` and compare `num_key_value_heads` to `num_attention_heads`. If KV heads are fewer, it’s GQA. Mistral 7B v0.2 shows `"num_attention_heads": 32` and `"num_key_value_heads": 8` in the public config. Llama 3.1 8B long-context configs also show `num_key_value_heads: 8` and a high `max_position_embeddings`. Most runtimes and libraries surface this field, including vLLM and Transformers[^2][^5][^4].
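A minimal check, assuming a locally downloaded Hugging Face-style `config.json` (the path is whatever our download tool produced):

```python
import json

def uses_gqa(config_path):
    """Return (is_gqa, q_heads_per_kv_head) from a HF-style config.json."""
    with open(config_path) as f:
        cfg = json.load(f)
    q = cfg["num_attention_heads"]
    kv = cfg.get("num_key_value_heads", q)  # field absent -> standard MHA
    return kv < q, q // kv

# e.g., a Mistral 7B-style config (32 Q heads, 8 KV heads) -> (True, 4)
```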
A path we can follow today
Start with a GQA model that advertises the window we want, enable FlashAttention, and turn on KV-cache quantization (`q8_0`). On a 64 GB Mac mini:
- 16k is trivial for any 7B–13B model with plenty of room to spare.
- 128k is realistic and responsive with Llama 3.1 8B long-context builds.
- ~250k–500k is a memory-feasible range for 8B–14B GQA models when KV is int8, at batch size 1.
- Million-token windows require purpose-built weights and patience; treat them as a special-case workflow rather than our default.
When we hit limits, don’t just push `num_ctx` higher. Trim the prompt, use retrieval to include only relevant chunks, or step down to a narrower, higher-quality window for day-to-day generation and keep a “long-context preset” around for archival or analysis tasks.
[^1]: FlashAttention (exact attention with IO-aware tiling) reduces memory traffic and speeds up attention, which helps long-context prefill. Tri Dao et al., 2022. (arXiv)
[^2]: GQA in configs. Libraries surface this as `num_key_value_heads`; if it’s smaller than `num_attention_heads`, the model uses GQA. See vLLM’s config notes. (docs.vllm.ai)
[^3]: Llama 3.1 announcement, including 128k context windows in the open models. Meta AI blog, 2024-07-23. (ai.meta.com)
[^4]: Llama 3.1 8B long-context config often shows `num_key_value_heads: 8` and `max_position_embeddings: 131072`. Example MLX/HF repo. (huggingface.co)
[^5]: Mistral 7B config with `num_key_value_heads: 8`. Hugging Face `config.json`. (huggingface.co)
[^6]: Mac mini M4 Pro can be configured with 64 GB unified memory. Apple Store page. (Apple)
[^7]: Unified memory architecture background from Apple’s WWDC session. Developer video. (Apple Developer)
[^8]: Qwen 2.5-14B-Instruct-1M model card describing 1M-token context support. Hugging Face model page, 2025. (huggingface.co)