Width vs. depth: speculating on the margin
Here’s a fun (for some non-universal definition of fun) interview question. Imagine you’re running Qwen3.6-35B-A3B at batch size 1, on a single GPU, and you decide you want to increase throughput. For whatever reason, your engine can only work on 2 tokens at a time. You have two choices (assume the draft model’s forward pass is free, that you don’t have to prefill anything to batch, and that the sequences are short enough that KV cache movement doesn’t come into play):
- Run at batch size 2 by batching random user sequences together.
- Run at batch size 1, speculating 1 token ahead, so the verify works on 2 positions (the token you just sampled plus one draft), with a per-token acceptance rate $\alpha$.
Which is better, assuming you only care about the total number of tokens being output per second?
Here’s a sensible answer:
Assuming that everything is memory bound, batching is always better for any acceptance rate $\alpha < 1$, because there’s no chance that a token added to the working set by increasing the batch size will be rejected.
But here’s something that comes out when you do the modelling:
Spending your 2 positions on one speculating sequence produces, globally, more output tokens per second than spending them on a batch of 2, even with an acceptance rate of $\alpha = 0.9$.
How come?
The first story: depth can be cheaper than width
To find out, we ought to look at the data. (Collected from our last post: half a million draft rounds for each of our two draft models, recording the drafter’s own per-depth confidence and the number of tokens actually committed, plus separate captures of which experts every token routed through, and the same routing captures for DeepSeek-V4-Flash, all published as specdec-calibration.) The answer lives in MoE routing.
MoE routing is a weird part of performance analysis of LLMs: one of the places
where the semantic content of the data affects what work gets done. In
principle, it can do confusing things, like make benchmarks on random data
unrepresentative.
First, the empirical distribution of routed experts (previously, we’ve just assumed uniform routing, which gives us coupon-collector maths):
Surprisingly non-uniform! Fitting the rank-vs-share curve, it decays roughly exponentially with rank: the busiest expert pulls several times its fair share. It varies by domain, by model and by layer. (I wonder if there’s a proxy metric in here for the data distribution: given pretraining with an expert load-balancing loss, can we determine the distribution of the training data by how balanced the experts are on different types of data?)
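One way to put a number on "roughly exponentially" is a least-squares line through log-share against rank. The following is a minimal sketch, assuming you have per-expert routing counts for a single layer; the function names are invented for illustration, and nothing here reproduces the published captures.

```python
import math

def rank_shares(counts):
    """Turn per-expert routing counts into shares, sorted busiest first."""
    total = sum(counts)
    return sorted((c / total for c in counts), reverse=True)

def fit_exponential_decay(shares):
    """Least-squares fit of log(share) = intercept - rate * rank.

    Returns (rate, intercept). Experts with zero share are skipped,
    because log(0) is undefined.
    """
    points = [(rank, math.log(s)) for rank, s in enumerate(shares) if s > 0]
    n = len(points)
    mean_r = sum(r for r, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((r - mean_r) ** 2 for r, _ in points)
    sxy = sum((r - mean_r) * (y - mean_y) for r, y in points)
    slope = sxy / sxx
    return -slope, mean_y - slope * mean_r

def busiest_over_fair_share(shares):
    """How many times its fair share (1 / num_experts) the top expert pulls."""
    return shares[0] * len(shares)
```

A larger fitted rate means routing is concentrated on fewer experts. Comparing the rate across domains, models and layers is one simple way to summarise the variation described above.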
This doesn’t by itself explain anything. But look at what separates the two choices on the table when we decide between width and depth. Width works on two randomly chosen tokens; depth works on two tokens that follow one from the other:
The distinct experts one verify forward touches as the number of positions grows, three ways: separate sequences (width), one sequence running consecutive positions (depth), and the uniform coupon-collector.
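For reference, the uniform coupon-collector baseline has a closed form. The sketch below assumes every token independently picks `top_k` distinct experts uniformly at random out of `num_experts` in a layer. The values 128 and 8 used in the note afterwards are placeholders, not the actual Qwen3.6-35B-A3B configuration.

```python
def expected_distinct_experts_uniform(num_experts, top_k, num_tokens):
    """Expected number of distinct experts touched in one layer.

    Assumes uniform, independent routing: a given expert is missed by one
    token with probability 1 - top_k / num_experts, and by all tokens with
    that probability raised to num_tokens.
    """
    miss_one_token = 1.0 - top_k / num_experts
    return num_experts * (1.0 - miss_one_token ** num_tokens)
```

With the placeholder values, one token touches 8 experts and two tokens touch 15.5 in expectation, so under uniform routing a second position costs almost a full extra set of expert weights. The empirical width and depth curves are measured against that baseline.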
So there’s the answer. At batch size 1 we’re memory-bound, and verifying a two-position speculative run moves less expert weight than adding a second sequence would. It does so by co-activation: speculated runs are more similar than randomly chosen data, so they activate more of the same experts. (Josh did some great work on trying to recover co-activation for batching here.) Even throwing away 10% of speculated tokens at $\alpha = 0.9$, depth beats width.
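The break-even point follows from simple arithmetic. This is a toy sketch under the puzzle's own assumptions (free drafter, memory-bound steps) plus one of mine: that each option can be summarised by a single step cost, in any consistent unit such as bytes of expert weight moved. The function names are hypothetical.

```python
def expected_tokens_depth(alpha, draft_len=1):
    """Expected tokens committed per verify for one speculating sequence.

    One token is always committed; draft k is accepted only if all earlier
    drafts were, which happens with probability alpha ** k.
    """
    return 1.0 + sum(alpha ** k for k in range(1, draft_len + 1))

def break_even_cost_ratio(alpha, width_batch=2, draft_len=1):
    """Depth wins when depth_step_cost / width_step_cost is below this."""
    return expected_tokens_depth(alpha, draft_len) / width_batch

def depth_beats_width(alpha, depth_step_cost, width_step_cost,
                      width_batch=2, draft_len=1):
    """Compare tokens per unit cost for the two options."""
    depth_rate = expected_tokens_depth(alpha, draft_len) / depth_step_cost
    width_rate = width_batch / width_step_cost
    return depth_rate > width_rate
```

At an acceptance rate of 0.9 the ratio is 0.95: the speculative step only has to be about 5% cheaper than the two-sequence step for depth to come out ahead. If both steps cost the same, width wins for any acceptance rate below 1, which is the "sensible answer" from earlier.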
This is a toy problem, though it’s interesting to think about how we could make use of the insight. (In real engines, it gets washed out by the cost of the draft model. It comes back if you do long sequences, where you’re bound on KV read, but then controlling across depth and width in order to tell the story here gets hard.) In practice, we run at larger batch sizes, expert activations saturate across the batch, and the effect ought to wash out. But the tax doesn’t: every speculated token still risks rejection. Do we have to pay it on every sequence equally, or can we spend depth only where it’s likely to pay off?
The second story: where should we spend our depth
Batches formed over the life of the inference engine are not homogeneous: some decode requests are much easier to speculate from than others. If we can get some sense of how confident the drafter is, we can spend depth only where we need it, saving both drafter and verifier compute. (The drafter saving applies to systems that propose tokens sequentially, like EAGLE. With DFlash you get all the tokens regardless.) Those are savings we can spend speculating further elsewhere, or on adding more sequences to the batch.
The first question is how much variation there is across sequences in how many tokens get accepted. If there’s none, this isn’t worth doing at all. It turns out there’s a lot. The number of committed tokens isn’t clustered around the mean: on a given round the drafter tends to be right about almost everything it drafts, or wrong almost immediately. Rounds split between ‘mostly misses’ and ‘clean sweeps’. (One question to explore: does this spike at the maximum draft length reflect the fact that the model could have gone further, but wasn’t able to?)
A single fixed depth across the batch has to compromise between those two populations: deep enough to cash in the easy rounds, shallow enough not to waste too much verification time on requests that will be almost all rejections.
In fact, the drafter already has an opinion about which kind of round it is before the verifier runs. Below, we plot the accept length the drafter’s own per-depth confidences imply — the sum of their cumulative products — against the total length that actually committed. The diagonal is ‘perfectly calibrated’: i.e. the empirical accepted tokens are the same as the drafter’s confidences predicted. Below the diagonal, the drafter was overconfident; above, not confident enough.
There’s lots of blue along the diagonal, and that’s free signal we could be using to improve our drafting policy.
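The quantity on the plot's horizontal axis is cheap to compute. Here is a minimal sketch of the sum of cumulative products, plus a signed gap against what actually committed; the function names and the `(confidences, committed)` pair format are my own assumptions about how one might store a round.

```python
from itertools import accumulate
from operator import mul

def implied_accept_length(confidences):
    """Accept length implied by per-depth confidences.

    Draft k survives only if drafts 1..k all survive, so its survival
    probability is the running product of the confidences so far. The
    implied length is the sum of those running products.
    """
    return sum(accumulate(confidences, mul))

def mean_calibration_gap(rounds):
    """Average of (committed - implied) over (confidences, committed) pairs.

    Negative means the drafter was overconfident (below the diagonal),
    positive means it was not confident enough (above it).
    """
    gaps = [committed - implied_accept_length(conf) for conf, committed in rounds]
    return sum(gaps) / len(gaps)
```

For example, confidences of 0.9, 0.8 and 0.5 give running products 0.9, 0.72 and 0.36, so an implied length of 1.98 out of 3 drafted tokens.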
How can we do that?
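Write round $i$’s expected committed tokens at depth $d$ from its own per-depth confidences $c_{i,1}, \dots, c_{i,d}$,
$$\hat{n}_i(d) = 1 + \sum_{k=1}^{d} \prod_{j=1}^{k} c_{i,j},$$
where the leading 1 is the token the verify commits even when every draft is rejected.
The homogeneous policy constrains the whole batch to one depth $d$ and maximises once, taking committed tokens over $T(d)$, the wall-clock cost of the step:
$$d^\star = \arg\max_{d} \frac{\sum_i \hat{n}_i(d)}{T(d)}.$$
The confidence gating idea is that you can do better, since the confidences are per sequence in the batch. For per-sequence gating, the object is the vector of depths $\mathbf{d} = (d_1, \dots, d_B)$:
$$\mathbf{d}^\star = \arg\max_{\mathbf{d}} \frac{\sum_i \hat{n}_i(d_i)}{T(\mathbf{d})}.$$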
So at each step, we solve for the depth vector, discounting each future token by how unlikely it is under the draft model’s distribution. In practice, solving the argmax exactly is too hard: we pick a single batch-wide depth from the live confidences, then trim it greedily per sequence, truncating a sequence whenever doing so raises the expected throughput. Running the simulations with depths picked in this way:
Gains over the best homogeneous depth, in three stages: one batch-wide depth picked from the live confidences, per-sequence (ragged) depths, and perfect calibration — the oracle case, where the policy only drafts tokens it knows ahead of time will be accepted. Simulated: Qwen3.6-35B-A3B on a B200, decode-only.
The gains are big at small batch, where the expert coupon-collector makes mis-speculation expensive. As the batch grows, the batch-averaged signal washes out, and a global depth chosen from the confidences stops beating the fixed one, because a bigger batch is more heterogeneous. Then the gains reappear around the compute-bound cliff, where speculation only pays in the parts of the model still memory bound.
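The two-stage heuristic described above (one batch-wide depth, then a greedy per-sequence trim) can be sketched in a few lines. This is not the simulator behind the figure. The cost model `step_cost` is a made-up stand-in (a fixed cost plus a linear cost per verified position), the drafter is treated as free, and counting one always-committed token per sequence is an assumption of the sketch.

```python
from itertools import accumulate
from operator import mul

def expected_committed(confidences, depth):
    """One always-committed token plus the implied accept length at `depth`."""
    return 1.0 + sum(accumulate(confidences[:depth], mul))

def step_cost(depths, fixed=1.0, per_position=0.05):
    """Hypothetical wall-clock cost: each sequence verifies depth + 1 positions."""
    return fixed + per_position * sum(d + 1 for d in depths)

def throughput(batch_conf, depths, cost_fn=step_cost):
    """Expected committed tokens per unit of step cost."""
    tokens = sum(expected_committed(c, d) for c, d in zip(batch_conf, depths))
    return tokens / cost_fn(depths)

def best_uniform_depth(batch_conf, max_depth, cost_fn=step_cost):
    """Stage 1: one depth for the whole batch, chosen from live confidences."""
    size = len(batch_conf)
    return max(range(max_depth + 1),
               key=lambda d: throughput(batch_conf, [d] * size, cost_fn))

def greedy_trim(batch_conf, depths, cost_fn=step_cost):
    """Stage 2: drop one draft token at a time wherever that raises throughput."""
    depths = list(depths)
    improved = True
    while improved:
        improved = False
        current = throughput(batch_conf, depths, cost_fn)
        for i in range(len(depths)):
            if depths[i] == 0:
                continue
            trial = depths.copy()
            trial[i] -= 1
            candidate = throughput(batch_conf, trial, cost_fn)
            if candidate > current:
                depths, current, improved = trial, candidate, True
    return depths

def gated_depths(batch_conf, max_depth, cost_fn=step_cost):
    """Batch-wide pick followed by the greedy per-sequence trim."""
    start = best_uniform_depth(batch_conf, max_depth, cost_fn)
    return greedy_trim(batch_conf, [start] * len(batch_conf), cost_fn)
```

Because the trim only ever accepts a change that raises the objective, the ragged result can never score below the batch-wide starting point under the same cost model. How much it gains depends entirely on the real `T`, which this sketch does not attempt to model.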
The simulations show confidence-gated speculation is worth real throughput at various operating points. The catch is that your engine has to support efficient ragged speculation: different draft depths inside one verify. vLLM and SGLang can’t do this well today.
So the answer to where we should spend our depth is: unevenly, based on where the drafter thinks we should be spending it.
Conclusion
Speculative decoding really is all you need: nothing else in inference optimization comes close to the throughput gains of a well-tuned speculator.
And the opportunity isn’t going away. Modern LLMs at decode time are almost designed to be hard on the hardware: MoEs, attention, and autoregressive sampling all stop us driving the FLOPs the silicon wants to do. We’re hitting the memory wall again at decode, even while the logic of training says to keep building models with FLOP headroom to spare.
The realisation is gathering pace — fast enough that ideas alive and well a week ago are outdated today.
Last modified: 2 Jul 2026