In search of wasted bits: how much information do LLM weights carry?
If you store a model’s weights in bfloat16, each parameter gets 16 bits. That’s the budget. The question is whether we’re spending it well.
Information theory gives us a clean way to ask this. Shannon entropy measures the average information content per symbol in a stream of data. If every possible byte value appears equally often, entropy is maximal and there’s nothing to squeeze out. If certain values dominate, entropy drops below the bit-width, and the difference is waste: bits allocated but carrying no information.
It’s the kind of question you can just answer empirically.
How?
We pulled weight files from as many open-weight models as we could get our hands on (coincidentally, we serve almost all of these models at doubleword; sign up today for $10 in free tokens) — different labs (Qwen, DeepSeek, Google, OpenAI, Moonshot, MiniMax, NVIDIA, StepFun, Zhipu), different scales (0.6B up to 1.4T parameters), different storage formats (BF16, FP8, MXFP8, MXFP4, NVFP4, INT4) — and computed the Shannon entropy of each weight distribution.
Shannon’s source coding theorem tells us that the minimum average number of bits needed to represent a single symbol from an i.i.d. (independently and identically distributed) source is exactly its Shannon entropy. That trained weights actually look i.i.d. surprised me: you might expect weights with similar roles (functional units within layers, MoE expert groups) to develop coupled distributions, but apparently they specialize without it showing up statistically. The gap between the entropy of a model’s weights and the bits a format actually allocates is the slack we’re looking for.
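Concretely, the measurement is just symbol-frequency counting. Here is a minimal sketch of the estimator (the helper name and the synthetic stand-in tensor are mine; the real measurement runs over the downloaded weight files, but the arithmetic is the same): treat each stored code as a symbol, count frequencies, and sum −p·log₂p.

```python
import numpy as np

def shannon_entropy_bits(symbols: np.ndarray) -> float:
    """Shannon entropy, in bits per symbol, of a 1-D array of discrete symbols."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / symbols.size
    return float(-(p * np.log2(p)).sum())

# Stand-in tensor so the snippet runs; in practice the symbols come from a
# model's weight files. A BF16 value is the top 16 bits of its float32 form.
w = np.random.default_rng(0).normal(0.0, 0.02, size=1_000_000).astype(np.float32)
bf16_codes = (w.view(np.uint32) >> 16).astype(np.uint16)

print(shannon_entropy_bits(bf16_codes))  # the measured models land near 10.6 of 16
```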
Baseline: 16 bits per weight
BF16 first. BF16 weights carry about 10.6 bits of entropy per element, out of the 16 the format allocates. A third of the budget is slack.
Why?
The charts are interactive — click a format chip to toggle its models in or out. Each bar stacks the entropy each field actually carries (sign, exponent, mantissa) against what the format allocates; the grey cap is unused budget, and the percentage above each bar is the total waste. Hover any segment for the per-bit numbers.
Across every BF16 model we measured, the slack is concentrated in one place. The exponent carries about 2.6 bits of entropy, out of 8 allocated. The mantissa carries roughly 7 of its 7. The sign behaves like a fair coin — 1 bit out of 1. Of the 16 bits we spend on a BF16 weight, the only wasteful one is the exponent.
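Reproducing the per-field split is a pair of bit masks on the same codes. The sketch below continues the snippet above, reusing `shannon_entropy_bits` and `bf16_codes`; it is illustrative, not the exact measurement code.

```python
def bf16_field_entropies(codes: np.ndarray) -> dict[str, float]:
    """Entropy carried by each BF16 field, given the raw uint16 codes (1/8/7 layout)."""
    return {
        "sign":     shannon_entropy_bits(codes >> 15),          # 1 bit allocated
        "exponent": shannon_entropy_bits((codes >> 7) & 0xFF),  # 8 bits allocated
        "mantissa": shannon_entropy_bits(codes & 0x7F),         # 7 bits allocated
    }

print(bf16_field_entropies(bf16_codes))  # real models land near sign 1.0, exponent 2.6, mantissa 7.0
```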
And the gap is remarkably stable. Two orders of magnitude in parameter count, four labs, different training recipes, and the exponent entropy lands in a band roughly 0.05 bits wide.
Why the exponent?
The mantissa uses its full budget; the sign behaves like a fair coin. Only the exponent has slack. So the question is what it is about the way these weights are distributed that makes the exponent, specifically, the wasteful field.
A floating-point number’s exponent is determined by its magnitude — roughly, log₂|w|. So the entropy of the exponent across a tensor is a direct function of how spread out the weight magnitudes are. If weights were distributed evenly across all the magnitudes BF16 can represent, the exponent would be near-maximally informative and we’d be using all 8 bits. If they’re all clustered in a narrow magnitude band, most of the 256 possible exponent values never appear, and the entropy collapses.
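You can see the mechanism with purely synthetic data. The snippet below assumes a Gaussian shape for illustration (the measured distributions have a longer left tail, so the real number comes out a touch higher): narrowly clustered magnitudes pin the BF16 exponent field to roughly 2.5 bits of entropy, and rescaling the weights only shifts which exponent codes get used, not how many.

```python
import numpy as np

rng = np.random.default_rng(0)
for scale in (0.5, 0.02, 1e-3):
    w = rng.normal(0.0, scale, size=2_000_000).astype(np.float32)
    exponent = ((w.view(np.uint32) >> 16) >> 7) & 0xFF           # BF16 exponent field
    p = np.unique(exponent, return_counts=True)[1] / exponent.size
    print(f"scale={scale}: {-(p * np.log2(p)).sum():.2f} bits")  # ~2.5 at every scale
```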
It turns out they’re clustered. Every trained model we measured has its weight magnitudes sharply peaked in a narrow band — a unimodal shape with a long left tail toward small magnitudes.
Each line is one model. Hue encodes lab; lightness within a hue encodes parameter count within that lab (darker = larger). Solid lines are BF16; dashed lines are quantized formats. Hover any line to see the model.
And it’s not just that each model is narrow — they’re all the same narrow. Shift each distribution by its mean, rescale by its standard deviation, and (almost) every model collapses onto the same curve.
μ and σ are the mean and standard deviation of log₂|w| for each model. Weights below a small-magnitude cutoff are bucketed into a separate “small” bin (visible at the left edge of the magnitude histogram) and excluded from μ and σ, so a handful of MoE models with dead expert weights don’t distort the collapse.
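For reference, the normalization behind the collapse is only an affine shift of log-magnitudes before histogramming; a sketch follows, with the near-zero cutoff as a stand-in for the “small” bin described above.

```python
import numpy as np

def normalized_log_magnitudes(w: np.ndarray, cutoff: float = 1e-30) -> np.ndarray:
    """(log2|w| - mean) / std per model; the cutoff stands in for the 'small' bin."""
    logmag = np.log2(np.abs(w[np.abs(w) > cutoff]))
    return (logmag - logmag.mean()) / logmag.std()

# Histogram this for each model and (almost) every curve lands on the same shape.
```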
So the slack in BF16 isn’t sensitive to specific architectural or training-recipe details. It’s a property of the format combined with a robust regularity in how trained weights are distributed: BF16’s exponent is sized for a wider range of magnitudes than any model actually uses.
We’re obviously not the first to notice that less precise number formats are useful for LLMs — NVIDIA’s headline FLOPS numbers have been doubling on the back of halving bitwidths for a couple of years now. Do these narrower formats cut out all the slack?
Half the bits: 8 bits per weight
The simplest narrower format is FP8. Same three-part structure as BF16, allocations halved: 1/4/3 instead of 1/8/7.
Format of the diagram stolen from this amazing article & adjusted for BF16.
The recipes that got these models to FP8 are all different — native FP8 pre-training, MXFP8 from QAT during post-training, BF16 plus a fine-grained quantization pass, others — but the magnitude distribution at the byte level is the same across all of them.¹
FP8 weights carry about 6.5 bits of entropy out of 8, vs BF16’s 10.6 of 16 — roughly 80% of the budget used, vs 66%. So shrinking the format does close some slack, in absolute terms. Interestingly, it does so by reducing the precision of the mantissa, and not by reducing the slack in the exponent.
Below the byte floor: 4 bits per weight
Up to FP8, the magnitude distribution didn’t have to move. Every byte-level format we’ve looked at — BF16, FP8, MXFP8, Qwen’s block-FP8 — gives the per-element exponent at least 4 bits, comfortably more than the ~2.6 bits of entropy the distribution wants. The slack is real but the format absorbs it; the distribution sits inside whatever budget is available.
FP4 is where this breaks. The per-element exponent has 2 bits — 4 codes for a distribution that wants something closer to 6. Either the distribution has to change, or the format has to factor its budget.
Sub-byte formats factor the budget. A per-block scale absorbs the absolute magnitude of each block of elements, leaving the per-element bits to cover variation within the block. (MXFP8 and MXFP4 use 32-element blocks with an E8M0 scale, an 8-bit power-of-two exponent. NVFP4 uses 16-element blocks with an E4M3 FP8 scale, plus a single FP32 per-tensor scale on top (double-quantization). Qwen’s block-FP8 uses 128-element blocks with BF16 scales. Moonshot’s INT4 packs four-bit integers with a higher-precision scale per group.) But the per-element budget stays at 2 bits, and the question simply moves from the global distribution to the within-block one. Can the model arrange itself so within-block ranges fit in 2 bits? If so, the distribution adapts. If not, information is lost.
“Magnitude” here means the reconstructed weight: per-element code × block scale (× any tensor scale). Within a single block you’d only see ~8 distinct magnitudes; across thousands of blocks with different scales, those discrete codebooks slide and overlap into the continuous-looking curve.
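As a concrete picture of the factoring, here is an MXFP4-style round trip under my reading of the format: one power-of-two scale per 32-element block, and a sign plus one of eight E2M1 magnitudes per element. The scale-selection and rounding rules here are illustrative choices, not any particular kernel’s behavior.

```python
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # FP4 (E2M1) magnitude codebook

def mxfp4_roundtrip(w: np.ndarray, block: int = 32) -> np.ndarray:
    """Quantize a flat tensor to MXFP4-style blocks, then reconstruct.

    Each block gets a power-of-two (E8M0-style) scale; each element keeps only
    a sign and one of the 8 E2M1 magnitudes."""
    blocks = w.reshape(-1, block)
    # Power-of-two scale chosen so the block's largest magnitude fits under the top code (6.0).
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-38) / 6.0))
    # Per-element 4-bit code: nearest E2M1 magnitude, sign kept separately.
    idx = np.abs(np.abs(blocks)[..., None] / scale[..., None] - E2M1).argmin(axis=-1)
    return (np.sign(blocks) * E2M1[idx] * scale).reshape(w.shape)

w = np.random.default_rng(0).normal(0.0, 0.02, size=4096)
print(np.abs(w - mxfp4_roundtrip(w)).max())  # reconstruction error from the 4-bit budget
```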
It adapts. The magnitude histogram for sub-byte models doesn’t sit on the same curve that BF16 and FP8 do — it’s narrower, often with a different peak. The distribution that gets you low loss at MXFP4 isn’t the distribution that gets you low loss at BF16.
So below 8 bits, for the first time, the format is shaping the distribution rather than the other way around. Up to FP8 the universality of the magnitude distribution was a property of the model: train under loose enough constraints and the weights settle into the same shape. Below FP8, the constraint tightens past the 2.6-bit floor, and the model has to give.
And we can see where the entropy went.
At FP4 the per-element bits get pushed close to saturation. The exponent is nearly fully used (~1.9 / 2) and the mantissa is fully used (~1 / 1) — the per-element slack has mostly drained out. What’s left over depends on the format. MXFP4 keeps a little headroom in both places: the per-element exponent isn’t quite full, and the per-block scales carry only 0.03–0.10 bits of entropy out of the 0.25 allocated. INT4 and NVFP4 push the per-element bits all the way to the floor, and the entire residual lives in the scales — about 0.26 bits per element out of 0.5 allocated.
Net per-element-plus-scale entropy lands around 93% used across all three formats, vs ~80% at FP8 and ~66% at BF16. Close to the floor, with the remaining slack now living in the scales for INT4 and NVFP4, and split between exponent and scales for MXFP4.
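The bookkeeping behind those percentages is just the field entropies summed against the allocated bits, with scale bits amortized over the block: 8 bits shared by 32 elements is 0.25 bits per element, shared by 16 elements is 0.5. A rough check using the approximate figures quoted above (small differences from the in-text percentages come from rounding those readings):

```python
# (entropy carried, bits allocated) per element, for the weight codes and the
# amortized block scales. Values are approximate readings from the charts.
budgets = {
    "BF16":  {"element": (10.6, 16.0), "scale": (0.00, 0.00)},
    "FP8":   {"element": (6.5,  8.0),  "scale": (0.00, 0.00)},
    "MXFP4": {"element": (3.9,  4.0),  "scale": (0.05, 8 / 32)},
    "NVFP4": {"element": (4.0,  4.0),  "scale": (0.26, 8 / 16)},
}
for fmt, b in budgets.items():
    used = b["element"][0] + b["scale"][0]
    alloc = b["element"][1] + b["scale"][1]
    print(f"{fmt}: {used:.2f} / {alloc:.2f} bits, {used / alloc:.0%} used")
```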
What’s left?
Narrow weight distributions leave ‘slack’ in the fixed-length formats we use to represent LLM weights. Lots of effort has gone into making that fixed length smaller, and the gap is closing with each quantization generation — but it isn’t closed. Shannon entropy sets a lower bound on how many bits per symbol you actually need, and depending on the format, we’ve still got between 7% and 30% extraneous bits sitting in our LLM weights.
That matters because so much of LLM inference is transferring data from one place to another and then computing on it when it’s there — weights and KV cache in inference kernels, KV cache between tiers, activations and weights between accelerators. The most frustrating bottlenecks in these systems are when you’re memory-bound: compute units sitting idle because the data bus feeding them isn’t fast enough.
The trick we need is to transform memory into compute — to transfer less data in total, and recover the original through additional computation on the other side. On some level, that’s exactly what decompression is. Quantization is a kind of compression (the information-theoretic framing for quantization is Shannon’s lossy source coding theorem, specifically vector quantization; TurboQuant is a recent implementation of these ideas), with the nice side effect that you don’t need to decompress: computation in the compressed format is also more efficient than in the original. But that side effect also means that you never actually trade memory for compute — you transfer half as much data to a place where you can do twice as much computation.
Fixed-length formats are great for hardware. Can we get out the last 7–30% of the slack by leaving them behind?
Footnotes
¹ How each low-precision model in the chart got to its precision:
- DeepSeek-V3.1 (FP8): pre-trained at FP8 using the UE8M0 scale data format on weights and activations. Model card.
- MiniMax-M2.7 (FP8): trained at FP8. HF discussion.
- DeepSeek-V4 attention (MXFP8) and experts (MXFP4): pre-trained at FP8, with MXFP4 quantization-aware training applied to MoE expert weights during post-training. V4 technical report §3.4.
- GLM-5.1 (FP8): INT4 QAT in SFT and FP8 used in RL rollouts. GLM-5 technical report.
- Step-3.5-Flash (FP8): shipped as a full FP8 model; paper is light on training-precision detail. Step-3 paper.
- Qwen FP8 family — Qwen3-14B, Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B, Qwen3.6-27B, Qwen3.6-35B-A3B, Qwen3-VL-30B-A3B, Qwen3-VL-235B-A22B: BF16 models with fine-grained FP8 quantization (block size 128) applied at release. Example.
- gpt-oss-20b / 120b (MXFP4): native MXFP4 post-training. gpt-oss-120b.
- Kimi-K2.6 experts (INT4): INT4 quantization-aware training. Kimi-K2-Thinking.
- Nemotron-3-Super-120B (NVFP4): pre-trained at NVFP4. Nemotron-3 Super technical report.
- Nemotron-3-Nano-30B (NVFP4): post-training quantization from BF16 with quantization-aware distillation. Nemotron QAD page.
Last modified: 5 May 2026