MXFP4, the Microscaling Format for 4-bit Floating-Point, is essentially a very small floating-point format that borrows one big trick from signal processing and older “block floating point” designs: we don’t try to give every single weight its own exponent. The format groups weights into fixed-size blocks, gives the whole block one shared scale, then stores each element as a tiny float inside that block. The OCP Microscaling Formats (MX) spec standardizes this idea and its semantics. 1
In MX notation, a block holds a scale $X$ and per-element payloads $\{P_i\}$. Reconstruction is basically
$$ w_i \approx X \cdot P_i $$
where $X$ is shared across the block. That one shared scale is the reason “4-bit floats” become usable for real neural nets; without it, a raw 4-bit float would run out of dynamic range constantly.
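As a minimal sketch of that reconstruction (the block size and values below are purely illustrative, not taken from any real model):

```python
import numpy as np

# Minimal sketch of MX-style block reconstruction: one shared scale X per block,
# tiny per-element payloads P_i. Values are illustrative only.

def dequantize_block(X: float, P: np.ndarray) -> np.ndarray:
    """w_i ≈ X * P_i for every element of the block."""
    return X * P

P = np.array([1.5, -0.5, 6.0, 0.0])   # payloads drawn from the tiny FP4 value set
X = 2.0 ** -3                          # shared power-of-two scale for the whole block
print(dequantize_block(X, P))          # [ 0.1875 -0.0625  0.75    0.    ]
```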
MXFP4 at the bit level
The common MXFP4 configuration uses block size $k = 32$. Each block stores:
- one scale $X$ in E8M0 (8-bit exponent-only scale)
- 32 elements $P_i$ in FP4 E2M1 (4 bits each)
So per 32 weights we store $32 \times 4 = 128$ bits of element payload plus 8 bits of shared scale, totaling 136 bits. Divide by 32 and we get 4.25 bits/weight. That number is the core economic value proposition of MXFP4. 2
The E2M1 element format is extremely coarse. It’s a sign bit, two exponent bits, and a single mantissa bit, so each element is drawn from a small representable set. Precision comes primarily from choosing a good $X$ per block, not from rich mantissas per element. We thereby shift where the information lives: from per-weight mantissas to per-block scaling.
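To make that coarseness concrete, here is a short sketch that enumerates every value E2M1 can represent, assuming the layout just described (exponent bias 1, no infinities or NaNs):

```python
# Enumerate every value representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit),
# following the E2M1 definition in the OCP MX spec: exponent bias 1, no Inf/NaN codes.

def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (0..15) to its real value."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                                  # subnormal: 0 or 0.5
        return sign * 0.5 * man
    return sign * (2.0 ** (exp - 1)) * (1.0 + 0.5 * man)

values = sorted({decode_e2m1(c) for c in range(16)})
print(values)
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```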
Compute semantics and why kernels like it
The MX spec defines dot products in a scale-factored way: we can compute an inner sum over the element payloads and then apply scales at block granularity. Conceptually, for vectors $A$ and $B$ with the same block size, a block dot product behaves like
$$ \text{Dot}(A,B) \approx X_A X_B \sum_{i=1}^{k} P_{A,i} \, P_{B,i} $$
This is friendly to optimized inference because the expensive part is the inner accumulation; the scales are a low-overhead post-factor. It also maps well to vectorization because the block size is fixed and small.
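A toy version of that structure, with illustrative payload values and scales (nothing here is tied to a real kernel):

```python
import numpy as np

# Sketch of the scale-factored block dot product: accumulate the raw element
# payloads first, then apply the two shared scales once per block.

def mx_block_dot(X_A: float, P_A: np.ndarray, X_B: float, P_B: np.ndarray) -> float:
    inner = np.dot(P_A, P_B)   # the hot loop: low-precision payloads only
    return X_A * X_B * inner   # scales applied once, at block granularity

k = 32
rng = np.random.default_rng(0)
fp4 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]            # E2M1 magnitudes
P_A = rng.choice(fp4, size=k) * rng.choice([-1, 1], size=k)
P_B = rng.choice(fp4, size=k) * rng.choice([-1, 1], size=k)
print(mx_block_dot(2.0 ** -2, P_A, 2.0 ** -5, P_B))
```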
Conversion from FP16/BF16/FP32 into MX is not a single algorithm. The spec and accompanying literature give reference conversions (max-based scale selection, then quantization of the normalized values), while allowing implementation-defined variations. The practical implication is that conversion policy (how we pick $X$, how we round, how we clamp) can materially affect accuracy at 4-bit. The Microscaling Data Formats paper makes this explicit: it provides a working conversion algorithm that follows the OCP semantics, but notes that alternatives are allowed. 3
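Here is a sketch of one such max-based policy for a single block. It follows the spirit of the reference algorithm (power-of-two scale derived from the block maximum, nearest-value rounding, clamping to ±6), but it is only one of the policies the spec permits:

```python
import numpy as np

# One max-based conversion policy for a single block (a sketch, not the only valid choice):
# pick a power-of-two scale from the block's largest magnitude, then round each
# normalized element to the nearest E2M1 value, clamping to E2M1's range of ±6.

E2M1_VALUES = np.array(
    [-6, -4, -3, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 3, 4, 6], dtype=np.float32
)

def quantize_block(w: np.ndarray):
    amax = float(np.max(np.abs(w)))
    if amax == 0.0:
        return 1.0, np.zeros_like(w)
    # E8M0 scales are pure powers of two; 2 is E2M1's largest exponent (6 = 1.5 * 2^2)
    X = 2.0 ** (np.floor(np.log2(amax)) - 2.0)
    q = np.clip(w / X, -6.0, 6.0)                               # clamp into E2M1's range
    idx = np.argmin(np.abs(q[:, None] - E2M1_VALUES[None, :]), axis=1)
    return X, E2M1_VALUES[idx]                                  # shared scale, quantized payloads

w = np.random.default_rng(1).normal(scale=0.02, size=32).astype(np.float32)
X, P = quantize_block(w)
print(X, np.max(np.abs(w - X * P)))   # shared scale and worst-case error in the block
```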
MXFP4 for local inference
Historically, FP4 meant datacenter hardware paths, and consumer stacks often treated it as unsupported or emulated it in a way that erased most of the benefit. llama.cpp (via ggml) integrated native handling for OpenAI’s gpt-oss weights in MXFP4 across its major backends. The maintainer announcement states directly that the new model is supported “in native MXFP4 format” on CUDA, Vulkan, Metal, and CPU. 4
That matters because many users first encountered MXFP4 through other toolchains that gate it behind newer GPU requirements. A Japanese write-up shows Transformers throwing an error on an RTX 4090 saying MXFP4 needs compute capability ≥ 9.0 (H100/B100 class), pushing the author to run via Ollama/llama.cpp instead. 5
OpenAI’s own gpt-oss introduction frames the 20B variant as runnable with about 16 GB of memory, positioning it for on-device and local inference. 6 Combined with the 4.25-bit weight economics, higher-quality models can now fit inside memory footprints that used to cap us at much smaller parameter counts.
Inference Improvements
Local decoding at batch size 1 is often constrained by memory bandwidth and cache behavior more than by peak FLOPs. Each generated token needs many matrix-vector/matrix-matrix ops, and the dominant cost frequently becomes how fast we can stream weight tiles from VRAM/RAM into compute. Cutting weight storage from BF16 (16 bits) down to ~4.25 bits can cut weight traffic by roughly 3–4× for the tensors that use MXFP4. One empirical FP4/MX study noted that MXFP4(32) yields a 3.76× reduction in memory/communication overhead versus BF16. 2
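The headline ratio falls straight out of the storage arithmetic from earlier:
$$ \frac{16 \ \text{bits/weight (BF16)}}{4.25 \ \text{bits/weight (MXFP4, } k = 32)} \approx 3.76\times $$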
That doesn’t automatically mean 3.76× tokens/sec. Real kernels have overheads: dequantization, scale handling, layout transforms, and the fact that not every tensor is necessarily in MXFP4. Still, when we see speedups, the root cause is usually that the runtime is moving fewer bytes per token and hitting caches more effectively.
MXFP4 also changes the shape of practical model distribution. When weights are shipped already in MXFP4 (native-trained or at least carefully packaged), we avoid some of the brittle calibration problems that show up in aggressive post-training integer quants. The llama.cpp ecosystem leans into this by treating quantization type as a per-tensor property in GGUF: different tensors can be stored with different encodings, and the runtime dispatches matching kernels. The llama-quantize manpage even calls out an advanced option to selectively quantize tensors, which makes it possible to use MXFP4 in general while keeping a few tensors at Q8_0/Q6-style precision. 7
Even within MXFP4-branded distributions, it’s common to see mixtures where some tensors stay at higher precision because they’re disproportionately sensitive. llama.cpp’s own OpenCL backend docs note that the MXFP4_MOE quantization for OpenAI’s gpt-oss is a mixture of MXFP4 and Q8_0.
This explains why we can see two “MXFP4-ish” files with noticeably different quality and speed: the label describes the dominant encoding, not a guarantee that every tensor is FP4.
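One way to check this for a given file is to list the per-tensor storage types. A small sketch, assuming the gguf-py package that ships with llama.cpp (its GGUFReader API) and a placeholder file name:

```python
from collections import Counter

from gguf import GGUFReader  # gguf-py, the Python package in the llama.cpp repo

# Count how many tensors in a GGUF file use each quantization type.
# The file name is a placeholder; the reader API is assumed from gguf-py.
reader = GGUFReader("gpt-oss-20b-mxfp4.gguf")

counts = Counter(t.tensor_type.name for t in reader.tensors)
for qtype, n in counts.most_common():
    print(f"{qtype:>8}  {n} tensors")
```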
Limitations
MXFP4 targets weights, but local inference bottlenecks don’t come only from weights. The KV cache can dominate memory at long context lengths, and MXFP4 doesn’t automatically compress it. Some runtimes offer KV-cache quantization separately, but that’s an entirely different mechanism. So if the workload carries a very long context and a modest model, the biggest wins might come from KV decisions rather than weight bitwidth.
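A rough back-of-the-envelope estimate shows how quickly the KV cache grows; the model dimensions below are hypothetical, and this assumes a standard attention layout with an unquantized FP16 cache:

```python
# Rough KV-cache sizing: 2 tensors (K and V) per layer, each holding
# n_kv_heads * head_dim values per token, at bytes_per_value precision.
# All dimensions below are hypothetical, chosen only for illustration.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

# e.g. 32 layers, 8 KV heads of dim 128, 128k-token context, FP16 cache
gib = kv_cache_bytes(32, 8, 128, 131072) / 2**30
print(f"{gib:.1f} GiB")  # 16.0 GiB of KV cache alone for this hypothetical setup
```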
MXFP4 also isn’t a free lunch for accuracy. The element format E2M1 is extremely low precision; the shared scale makes it viable, but it can still break on tensors whose value distribution doesn’t play nicely with per-block scaling (outliers, heavy tails, mixed-scale structure inside a 32-value block). This is why hybrid packaging is so common. We keep the bulk in MXFP4 where it behaves well, and we protect the tensors that act like thin points in the network with higher precision (embeddings, output head, some attention-adjacent projections).
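The outlier failure mode is easy to demonstrate with the quantize_block sketch from the conversion section above (that function is assumed to be in scope; the data is synthetic):

```python
import numpy as np

# One large outlier in a 32-value block forces a big shared scale, and the remaining
# small weights collapse onto very few representable values, inflating per-block error.

rng = np.random.default_rng(2)
well_behaved = rng.normal(scale=0.02, size=32).astype(np.float32)
with_outlier = well_behaved.copy()
with_outlier[0] = 1.0                      # a single heavy-tailed weight

for name, w in [("well-behaved", well_behaved), ("with outlier", with_outlier)]:
    X, P = quantize_block(w)               # from the earlier conversion sketch
    err = np.abs(w - X * P)
    print(f"{name:13s} scale={X:.4g}  mean abs error={err.mean():.5f}")
```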
A useful mental model is that quantization errors are not uniformly harmful. If a tensor is used in a way that repeatedly amplifies small numeric biases (or sits on a sensitive pathway like logits), coarse per-element precision can cause disproportionate behavioral drift. If a tensor is part of a big overparameterized MLP block, the network can sometimes absorb much more error without noticeable output degradation.
Choosing and running models locally
MXFP4 shifts the consumer decision away from picking a global 4-bit preset and toward selecting the packaged weight format and runtime path that best match the hardware at hand.
We want to interpret benchmarks in a way that matches how MXFP4 helps. If a file is smaller and faster but only slightly worse on perplexity, that is consistent with a memory-bandwidth win. If a file is tiny and suddenly incoherent, that is consistent with scale misfit or excessive FP4 coverage in sensitive tensors. Model tables that show pure MXFP4 collapsing on a small model while hybrids remain near-lossless are exactly the pattern we’d expect from the above mechanics.
MXFP4 is a meaningful new baseline for local inference. It standardizes an FP4 representation that can travel across toolchains, it aligns with bandwidth-bound decoding realities, and in 2025 it crossed a threshold where a major local runtime can run native MXFP4 models broadly across consumer backends. 4
https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf “OCP Microscaling Formats (MX) Specification Version 1.0” ↩︎
https://aisystemcodesign.github.io/papers/FP4.pdf “An Empirical Study of Microscaling Formats for Low-Precision …” ↩︎ ↩︎
https://arxiv.org/pdf/2310.10537 “Microscaling Data Formats for Deep Learning” ↩︎
https://github.com/ggml-org/llama.cpp/discussions/15095 “llama.cpp supports the new gpt-oss model in native MXFP4 …” ↩︎ ↩︎
https://zenn.dev/kun432/scraps/2f7224893bfb1b “OpenAIのオープンウェイトモデル「gpt-oss」を試す” (Trying out OpenAI’s open-weight model “gpt-oss”) ↩︎
https://openai.com/index/introducing-gpt-oss/ “Introducing gpt-oss” ↩︎
https://manpages.debian.org/unstable/llama.cpp-tools/llama-quantize.1.en.html “llama-quantize(1) — llama.cpp-tools — Debian unstable” ↩︎