Speculative KV coding: losslessly compressing KV cache by up to ~4× using a predictor model


The size of LLM context grows by the day. KV caching is what makes running those long contexts affordable: it trades compute for memory so the model doesn’t re-prefill work it has already done. But as agentic workflows push contexts ever longer, storing and moving the cache starts to dominate everything. To get to the next order of magnitude of LLM capability, we need it to be smaller.

You can make it smaller lossily. TurboQuant is a recent (and somewhat controversial: performance exploration, accusations of academic misconduct) example, dropping the bit-width of K and V and absorbing the resulting quality loss. The cost of that route is that the loss isn’t something you specify in advance: you find out what degraded by running evals and hoping they catch whatever you killed. Lossless compression sidesteps the question entirely, by reconstructing the cache exactly.

We explored lossless compression of LLM weights in a recent post. The numbers for the KV cache look much the same: empirically, the bytewise entropy of a bf16 cache is about 11 bits per scalar, around 30% smaller than the raw representation. Worse, as we showed then (that post was about weight compression, but the KV cache is reasonably similar; see here for a nice paper), this slack collapses as the bitwidth comes down. FP4 weights are within 5-7% of saturating their format, and the same goes for caches stored at lower precision. Given that low bitwidths have such obvious benefits for performance, we ought to treat them as the baseline.

In this post we introduce speculative KV coding, a method for losslessly compressing the KV cache of a large target model by up to ~4× (this 4× is on top of lossy FP8 compression of the cache; the gross benefit is ~8×) using a cheaper predictor model: a faster model whose forward pass on the same prompt predicts what the target’s cache will be. By analogy with speculative decoding (Leviathan et al., 2022), the predictor runs in parallel on both encode and decode sides; an arithmetic coder then encodes the true cache at a bitrate set by how well the predictor fits the target.

A KV cache isn’t really random§

Quick refresher on entropy coding (I’ve written about the mechanics of arithmetic coding before; see the rANS and tANS posts. For this post all you need is the bitrate formula below; the coder is a black box that achieves it). You have a stream of symbols drawn from some distribution $p$. Shannon’s source coding theorem says the best any lossless coder can do, on average, is $H(p)$ bits per symbol. Real coders work with a model distribution $q$ instead of the true $p$, and end up paying $H(p, q) = H(p) + \mathrm{KL}(p \,\|\, q)$ bits per symbol. The KL term is the slack: the price you pay for your model not being the truth.
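As a quick numeric check of that decomposition, here is a toy four-symbol source (nothing KV-specific):

```python
import math

# Toy four-symbol source: true distribution p, mismatched (uniform) model q.
p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]

H_p = -sum(pi * math.log2(pi) for pi in p)               # Shannon limit: 1.75 bits
H_pq = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))  # what a coder built on q pays
kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

print(H_p, H_pq, kl)  # 1.75 2.0 0.25: the 0.25-bit gap is exactly the KL slack
```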

The KV cache isn’t really a list of samples from a random source. The whole cache is the deterministic output of a forward pass through known weights on a known prompt. There’s exactly one tensor it can be. So the “true” distribution $p$ is a delta on that tensor, and a delta has zero entropy. Every bit the coder spends is pure KL: $H(p, q) = -\ln q(\mathrm{KV}_\mathrm{true})$. The bitrate is a direct measurement of how much weight your model $q$ gives the correct KV cache.

What we need, then, is a calibrated model of the specific forward pass that produced the cache, in a form an arithmetic coder can consume.

What should $q$ look like?§

Suppose for the moment that we had access to something that, given the prompt, produced a reasonable per-scalar prediction $\mu$ of the KV cache and a calibrated sense $\sigma^2$ of how wrong that prediction tends to be. The natural way to turn that into a real distribution is a Gaussian centred on $\mu$ with variance $\sigma^2$, so that $q(x) = \mathcal{N}(x;\, \mu, \sigma^2)$, in which case the cost (in nats per scalar) of encoding the true value splits cleanly into two terms:

$$-\ln q(\text{KV}_\text{full}) \;=\; \underbrace{\tfrac{1}{2}\ln(2\pi\sigma^2)}_{\text{spread cost}} \;+\; \underbrace{\frac{(\text{KV}_\text{full} - \mu)^2}{2\sigma^2}}_{\text{miss cost}}.$$

The two terms pull in opposite directions: the spread cost wants $\sigma^2$ small, while the miss cost punishes us when $\text{KV}_\text{full}$ lands far from $\mu$ relative to $\sigma$, and blows up entirely if we set $\sigma^2$ too small for the actual error. Taking the expectation over the data and minimising in $\sigma^2$ puts the optimum at $\sigma^2 = \mathbb{E}[(\text{KV}_\text{full} - \mu)^2]$, with an expected per-scalar bitrate of $\tfrac{1}{2}\ln(2\pi e\, \sigma^2)$, which is just the log of the typical error magnitude. Better $\mu$ buys us bits directly; miscalibrated $\sigma^2$ wastes them either way, paying for spread we didn’t need if too large and seeing the miss cost run away if too small.
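A small numeric illustration of the calibration claim, on synthetic residuals: the expected cost is minimised exactly at $\sigma^2 = \mathbb{E}[(x - \mu)^2]$, and both under- and over-estimating it costs extra nats.

```python
import math
import random

random.seed(0)
# Synthetic per-scalar residuals (KV_full - mu) with true RMS error 0.1.
errs = [random.gauss(0.0, 0.1) for _ in range(100_000)]
mse = sum(e * e for e in errs) / len(errs)

def nats_per_scalar(sigma2):
    spread = 0.5 * math.log(2 * math.pi * sigma2)   # spread cost
    miss = mse / (2 * sigma2)                       # expected miss cost
    return spread + miss

best = 0.5 * math.log(2 * math.pi * math.e * mse)   # closed-form minimum

print(nats_per_scalar(mse) - best)                  # ~0 (up to float rounding)
print(nats_per_scalar(mse / 4) - best)              # positive: miss cost runs away
print(nats_per_scalar(mse * 4) - best)              # positive: paying for unused spread
```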

What predicts a KV cache?§

The natural class of things to use as a $\mu$ predictor is any model whose forward pass on the same prompt carries information about the target’s KV cache. The bitrate we can hope to reach is bounded below by the conditional entropy $H(\text{KV}_\text{full} \mid M_\text{pred}(\text{prompt}))$, and the question of how much compression is available reduces to how small that conditional entropy can be made by choosing $M_\text{pred}$ well.

The full pipeline is then:

Both sides re-run the predictor on the prompt to reconstruct the same per-scalar $(\mu, \sigma)$. Only the encoder touches the target model. The arithmetic coder turns $(\text{KV}_\text{full}, \mu, \sigma)$ into bits on the encode side, and turns $(\text{bits}, \mu, \sigma)$ back into $\text{KV}_\text{full}$ on the decode side.

[Figure: the encode/decode pipeline. Encode side: the prompt feeds the target model (producing KV_full) and the predictor (producing (μ, σ)); both feed the coder, which emits bits. Decode side: the prompt feeds the predictor, reconstructing the same (μ, σ); the coder consumes the bits and recovers KV_full.]

Both sides hold the prompt (or the prompt is transferred with the bits; it’s of negligible size relative to the KV cache itself). Both sides run the predictor and reconstruct the same $(\mu, \sigma)$ deterministically. The encoder additionally runs the target model, feeds $(\text{KV}_\text{full}, \mu, \sigma)$ into the arithmetic coder, and emits the bitstream; the decoder consumes the bitstream alongside its locally-reconstructed $(\mu, \sigma)$ and recovers $\text{KV}_\text{full}$ exactly. The scheme is lossless because the coder is, and because both sides reconstruct $(\mu, \sigma)$ from the same prompt and the same predictor.
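A minimal sketch of this symmetry, with toy stand-ins for the two models and ideal code lengths in place of a real arithmetic coder (`predictor` and `target` here are hypothetical placeholders, not real inference calls):

```python
import math
from statistics import NormalDist

def predictor(prompt):
    # Runs on BOTH sides; must be deterministic. Toy per-scalar (mu, sigma).
    return [0.10, -0.20, 0.05], [0.02, 0.02, 0.02]

def target(prompt):
    # Runs on the ENCODE side only; the true KV scalars.
    return [0.11, -0.21, 0.08]

def ideal_bits(kv, mu, sigma, step=2.0 ** -8):
    # Cost of coding each scalar to a step-wide bin under N(mu, sigma); an
    # arithmetic coder achieves this total to within a fraction of a bit.
    total = 0.0
    for x, m, s in zip(kv, mu, sigma):
        lo = math.floor(x / step) * step
        bin_mass = NormalDist(m, s).cdf(lo + step) - NormalDist(m, s).cdf(lo)
        total += -math.log2(bin_mass)
    return total

mu, sigma = predictor("the prompt")   # the decoder recomputes the identical pair
print(ideal_bits(target("the prompt"), mu, sigma))
```

The better the predictor centres its Gaussians on the true scalars, the fewer bits the coder spends.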

What this leaves us picking is the predictor, and the choice is a cost-versus-bits tradeoff bounded by two extremes. At one end the predictor is the target model itself, so $\mu = \text{KV}_\text{full}$ and $\sigma = 0$: no bits ever cross the wire, but the decode side has paid a full target-model forward pass, at which point it could have just computed the cache directly, which is silly. At the other end the predictor outputs pure noise, so $\mu$ is arbitrary and $\sigma$ has to be set large enough to cover the bf16 range; the predictor is essentially free, but the coder ends up paying close to 16 bits per scalar. Real predictors sit between the two, and the question for any particular construction is how cheaply we can drive the conditional entropy down.

An optimised version of the same model§

The most direct predictor we can pick is an optimised version of the same model: same architecture, same prompt, with some structure-preserving optimisation applied to make it cheaper to run while keeping its forward pass close to the original. The output $\text{KV}_\text{opt}$ has the same shape as $\text{KV}_\text{full}$ and lies close to it elementwise, because a small, well-behaved perturbation of the parameters perturbs the forward pass in a small, structured way. The residual $\text{KV}_\text{full} - \text{KV}_\text{opt}$ is small and structured, which is exactly the regime in which $\mu = \text{KV}_\text{opt}$ pays off cheaply under the Gaussian model from the previous section.

For example, take quantisation. The optimised model is the target with its weights cast to a narrower format (FP8, INT4, MXFP4, and so on), and the perturbation in question is the rounding noise introduced at every weight tensor. (The predictor doesn’t have to be specially constructed; quantised variants of major open-weights models ship alongside their full-precision counterparts, so both sides of the pipeline can pull the same artifact off the shelf.)

Quantisation is well-studied enough that we know its forward-pass error is small for any reasonable scheme; that property is what makes serving a quantised model a viable substitute for the full one in the first place, and the same property gives us $\text{KV}_\text{opt}$ as a low-residual $\mu$ for free, with no training step in sight. The per-channel statistics of the residual (which we plug in as $\sigma$) can be measured once on a small calibration set and then frozen, so encoding and decoding both reduce to running the quantised model on the prompt and feeding $(\mu, \sigma)$ into the coder.

Early results§

The simplest concrete instance: the predictor is the FP8 version of the target, $\mu = \text{KV}_\text{quant}$ taken directly, and $\sigma^2$ is fitted once on training data as the per-(kv, head, channel) empirical residual variance, pooled across positions.
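The calibration step can be sketched in a few lines (synthetic arrays stand in for the real caches; shapes and noise scale are illustrative, and we show just one of K/V, so σ² is per (head, channel)):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the two caches over a calibration set:
# (examples, positions, heads, head_dim).
kv_full = rng.normal(size=(16, 256, 8, 64)).astype(np.float32)
kv_quant = kv_full + rng.normal(scale=0.05, size=kv_full.shape).astype(np.float32)

resid = kv_full - kv_quant
sigma2 = (resid ** 2).mean(axis=(0, 1))   # pooled across examples and positions
# sigma2 has shape (8, 64) and is frozen after calibration; at encode/decode
# time both sides broadcast it against the quantised model's KV output.
print(float(np.sqrt(sigma2).mean()))      # ~0.05: recovers the injected noise scale
```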

We use the Qwen3 model family, since we have both:

  1. A wide range of model sizes
  2. Off-the-shelf FP8 block-quants for each model in the family

On top of $(\mu, \sigma)$, we found that encoding under a three-component mixture boosted compression rates (the idea is that in practice, the distribution of residuals $\text{KV}_\text{full} - \text{KV}_\text{quant}$ isn’t really a Gaussian; it’s long-tailed, with outliers):

$$q(x) \;=\; 0.95\,\mathcal{N}(x;\, \mu, \sigma^2) \;+\; 0.03\,\mathcal{N}(x;\, \mu, (3\sigma)^2) \;+\; 0.02\, \widehat{p}_\text{bf16}(x)$$

where $\widehat{p}_\text{bf16}$ is the empirical bf16-symbol distribution measured on training data. The wide-Gaussian component covers moderately-mispredicted values; the empirical-marginal component absorbs deep outliers at roughly the unconditional bf16 cost (~11 bits) instead of letting the Gaussian’s tail blow up on them.
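The effect is easy to reproduce on synthetic long-tailed residuals (a flat density stands in for $\widehat{p}_\text{bf16}$ here; the weights are the ones above, and the outlier fraction is made up):

```python
import math
import random

random.seed(0)
# Long-tailed residuals: mostly unit-scale, with occasional 10x outliers.
xs = [random.gauss(0, 1) if random.random() < 0.98 else random.gauss(0, 10)
      for _ in range(50_000)]

def gauss(x, s):
    return math.exp(-x * x / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def mean_bits(density):
    return sum(-math.log2(density(x)) for x in xs) / len(xs)

single = mean_bits(lambda x: gauss(x, 1.5))        # one Gaussian, inflated sigma
mixture = mean_bits(lambda x: 0.95 * gauss(x, 1)   # narrow: well-predicted bulk
                            + 0.03 * gauss(x, 3)   # wide: moderate misses
                            + 0.02 * 0.01)         # flat catch-all for deep outliers
print(single, mixture)   # the mixture is cheaper: its tails no longer blow up
```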

The following results are for the held-out C4 validation set, after calibrating on 128 training examples from C4 at 1024 tokens each, swept across Qwen3 target sizes (bitrates in bits per scalar; the ratio is raw bf16 over the mixture bitrate):

| target | $\mathcal{N}(\mu,\sigma)$ bits/scalar | mixture bits/scalar | ratio vs bf16 |
|--------|---------------------------------------|---------------------|---------------|
| 0.6B   | 6.8651                                | 6.7419              | 2.37×         |
| 1.7B   | 6.6385                                | 6.5303              | 2.45×         |
| 4B     | 6.4188                                | 6.3290              | 2.53×         |
| 8B     | 6.2576                                | 6.1793              | 2.59×         |
| 14B    | 6.0760                                | 6.0097              | 2.66×         |
| 32B    | 5.9785                                | 5.9185              | 2.70×         |

The bitrate falls monotonically with target size: bigger targets compress better, by roughly 0.8 bits/scalar from 0.6B to 32B.

Native FP8 KV caches§

Lossy compression of KV caches is already well-studied, and FP8 caches are increasingly the default: shipped in vLLM, SGLang, and TRT-LLM, and adopted in recent model releases like DeepSeek V4, which defaults to FP8 KV cache for non-RoPE entries. What’s perhaps most interesting here is not just that our method stacks with lossy quantization; it’s actually more effective when applied to pre-quantized caches. Re-running the same pipeline against an FP8 e4m3 target (specifically vLLM’s kv_cache_dtype="fp8", SGLang’s generic FP8 path, and TRT-LLM’s fp8_kv; one fp32 scale per (layer, kv), RTN to e4m3) and encoding the FP8 symbols under the bin-integrated $\mathcal{N}(\mu, \sigma^2)$ predictor:

| target | bits/FP8 element | vs raw FP8 (8 b) |
|--------|------------------|------------------|
| 0.6B   | 2.594            | 3.08×            |
| 1.7B   | 2.454            | 3.26×            |
| 4B     | 2.323            | 3.44×            |
| 8B     | 2.220            | 3.60×            |
| 14B    | 2.109            | 3.79×            |
| 32B    | 2.053            | 3.90×            |
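The bin-integrated cost is straightforward to write down: each FP8 code point owns an interval of the real line, and $q$ assigns it the Gaussian mass of that interval. A sketch (the e4m3 value set is reconstructed from the OCP FP8 layout; treat it as illustrative rather than a reference implementation):

```python
import math
from statistics import NormalDist

def e4m3_values():
    # OCP FP8 e4m3: 4 exponent bits (bias 7), 3 mantissa bits, max 448,
    # one NaN pattern per sign (e=15, m=7), no infinities.
    vals = {0.0}
    for e in range(16):
        for m in range(8):
            if e == 15 and m == 7:
                continue                      # NaN, not a value
            v = (m / 8) * 2.0 ** -6 if e == 0 else (1 + m / 8) * 2.0 ** (e - 7)
            vals |= {v, -v}
    return sorted(vals)

def bits_for_symbol(x, mu, sigma, grid):
    # q's mass on x's bin: integrate N(mu, sigma) between the midpoints
    # to x's neighbours on the grid; open-ended at the extremes.
    i = grid.index(x)
    lo = -math.inf if i == 0 else (grid[i - 1] + x) / 2
    hi = math.inf if i == len(grid) - 1 else (x + grid[i + 1]) / 2
    nd = NormalDist(mu, sigma)
    return -math.log2(nd.cdf(hi) - nd.cdf(lo))

grid = e4m3_values()   # 253 distinct values (+0 and -0 collapse)
print(bits_for_symbol(0.25, mu=0.24, sigma=0.02, grid=grid))
```

For a well-predicted symbol like this one, the cost comes out well under the raw 8 bits.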

Composed with the underlying bf16 → FP8 quantization, that’s between 6× and 8× total compression on the original bf16 cache, depending on target size.

What’s next§

This is currently an early research note, but the data are promising. How can we do better?

A better residual model§

The Gaussian-plus-mixture distribution above is a convenient first go, but not a very principled one. The residual the predictor leaves is heavier-tailed than a Gaussian, has meaningful joint structure across positions, channels, and layers, and varies enough in size from one prompt or position to another that a static per-channel $\sigma$ pays both ways. Modelling any of this, without touching the predictor itself, buys back further bits.

A different predictor model§

The FP8-of-target predictor used here is the easiest case and the most expensive one. The more interesting direction is a genuinely different model: different size, different family, different training, anything whose forward pass on the same prompt happens to carry information about the target’s KV cache. What’s missing is a shape match, since the predictor’s hidden size, head dimension, layer count, and number of attention heads all differ from the target’s; we can no longer take the predictor’s KV directly as $\mu$, and some small piece of machinery has to bridge the gap.

The predictor (left) is a smaller transformer with light fine-tuning; its per-layer K, V outputs are passed through a learned linear map W into the shape of the target’s per-layer KV cache, giving $\mu$. The target itself (right, dashed) is frozen and never run on the predictor side; only its KV-cache shape matters here.

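A sketch of what fitting that bridge could look like, with random matrices standing in for real predictor and target activations (the ridge fit, the shapes, and the noise floor are all assumptions for illustration, not measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
# Flattened per-position KV: predictor width 512, target width 1024 (hypothetical).
n_cal, d_pred, d_tgt = 4096, 512, 1024
H = rng.normal(size=(n_cal, d_pred))                    # predictor K/V features
W_true = rng.normal(scale=0.1, size=(d_pred, d_tgt))    # pretend linear relation
Y = H @ W_true + rng.normal(scale=0.01, size=(n_cal, d_tgt))  # target K/V + noise

# Ridge-regression fit of the bridge W on calibration prompts; both sides
# then hold the same frozen W and reuse it at encode and decode time.
lam = 1e-3
W = np.linalg.solve(H.T @ H + lam * np.eye(d_pred), H.T @ Y)

mu = H @ W
rmse = float(np.sqrt(((Y - mu) ** 2).mean()))
print(rmse)   # close to the 0.01 noise floor when a linear map really exists
```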

Engineering: throughput and bit-identical predictors§

For this to pay off in practice, the arithmetic coder has to keep up with whatever consumes the decompressed cache. If compressed bytes live behind a slow channel at bandwidth $B_\text{slow}$ and feed a consumer at $B_\text{fast}$, the framework needs a compression ratio of at least $B_\text{fast}/B_\text{slow}$ and a decoder sustained at $B_\text{fast}$.
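A back-of-envelope version of that constraint (the bandwidth figures are made up):

```python
def transfer_speedup(ratio, b_slow, b_decode):
    # Time per GB moved: raw transfer vs compressed transfer plus a serial
    # decode pass; pipelining the two would push this back toward `ratio`.
    raw = 1.0 / b_slow
    compressed = 1.0 / (ratio * b_slow) + 1.0 / b_decode
    return raw / compressed

# Hypothetical: 4x coder, 10 GB/s cross-DC link, 100 GB/s sustained decoder.
print(transfer_speedup(4, 10, 100))   # ~2.86x: decoder throughput eats into the win
```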

Separately, the predictor has to produce bit-identical $(\mu, \sigma)$ on both sides, since the arithmetic coder doesn’t tolerate even a one-ULP CDF disagreement. This is much more tractable than it used to be (see Thinking Machines on defeating nondeterminism in LLM inference).

What’s it good for§

Wherever a KV cache has to move across a slow channel or sit in scarce memory, compressing it trades compute on each end for bandwidth or capacity. Some places that could pay off:

  1. Cross-datacenter disaggregated prefill: Disaggregated prefill splits prefill from decode and transfers KV cache between them. That works inside a high-bandwidth domain but breaks across slower links, where the cache is too big to move. Recent work from the Kimi team argues the cross-DC version becomes viable once KV size is cut enough, getting 10–36× reductions from hybrid attention. Speculative KV coding compresses what’s left, losslessly, and stacks multiplicatively with that lever.
  2. Bigger prefix caches: LLM serving caches KV for shared prefixes (system prompts, retrieved documents, repeated turns in agentic loops) so requests don’t re-prefill them. Compressing stored entries grows the effective cache at the cost of a decompression pass on each hit. The same idea extended to host-RAM offload across PCIe pays for itself more cleanly: you spend decompression compute to shrink a transfer that’s already bandwidth-bound.

The question that comes up in each is: can we make the predictor pay for itself? Figuring that out is the work that comes next.


Last modified: 12 May 2026