How fast can LLM inference go?

12 min read

How fast can an LLM go? Great resources for doing this kind of thinking: here, here

Reading this article made me curious about how good our inference software is actually getting. Let's make some assumptions, and then work it through.

Pick a configuration from the selector on the left to change the values throughout this article.

Configuration

Tensor parallelism: 2
Concurrent users: 64
ISL/OSL: 1024/1024
Accelerator: NVIDIA H100
Precision: FP8

Why do NVIDIA still do TFLOPs with sparsity? Is someone using this? Is everyone just going to be dividing their numbers by 2 from now until forever?

NVIDIA H100
Compute (FP8): 1979 TFLOPS
Memory bandwidth: 3.35 TB/s

Working it through§

Transformers have a lot of different operations in them: softmax, RMSNorm, matrix multiplications, biases, attention, SwiGLU, etc. etc. But the good news for doing stuff with rules of thumb is that almost none of them matter. Matrix multiplications $(F, D) \times (D, B)$ take $O(2BDF)$ FLOPs, and require us to transfer $O(n_{\mathrm{bytes}}(BD + DF + BF))$ values from HBM to & from the compute units. ($n_{\mathrm{bytes}}$ is the number of bytes per parameter - i.e. 1 for fp8, 2 for bf16 - so it changes with quantization. The FLOPs figure is the bitwidth-specific one for that accelerator - for example, fp8 FLOPS on NVIDIA are 2x fp16 FLOPS.) This scaling dominates everything else (softmaxes, norms, etc. are all linear).
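
To make the counting concrete, here's a minimal sketch of that FLOP and byte accounting for a single matmul (the hidden size of 8192 in the example is just an illustrative assumption):

```python
# Rough FLOP and byte counts for a single (F, D) x (D, B) matmul.
def matmul_flops(F: int, D: int, B: int) -> float:
    return 2.0 * B * D * F  # one multiply + one add per (f, d, b) triple

def matmul_bytes(F: int, D: int, B: int, n_bytes: int = 1) -> float:
    # read the (F, D) weights and (D, B) input, write the (F, B) output
    return n_bytes * (D * F + B * D + B * F)

def arithmetic_intensity(F: int, D: int, B: int, n_bytes: int = 1) -> float:
    return matmul_flops(F, D, B) / matmul_bytes(F, D, B, n_bytes)

# Example: a square 8192 x 8192 projection at fp8 (n_bytes = 1).
# With B = 1 (one decode token) the intensity is ~2 FLOPs/byte;
# with B = 512 it is ~910 FLOPs/byte, close to the 2B/n_bytes = 1024 approximation.
print(arithmetic_intensity(8192, 8192, 1), arithmetic_intensity(8192, 8192, 512))
```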

Since, for almost all models, the parameters are in the matmuls, and the number of FLOPs for an individual matmul is $2 \times \#\mathrm{parameters}$, we can get the total FLOPs for a forward pass through the transformer from just $2 \cdot P$, where $P$ is the total number of parameters. Read the article here to see this worked out. (The calculation in that article is for various 2022-era transformers, but Llama-3 isn't all that different. The only meaningful change is GQA, which doesn't change the formula.)
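
As a sanity check on the "$2P$ per token" rule, here's a quick parameter count using the published Llama-3-70B shapes (80 layers, hidden size 8192, 64 query / 8 KV heads of dimension 128, FFN size 28672, vocab 128256); the non-matmul parameters (norms etc.) are ignored, which is the whole point:

```python
# Back-of-envelope check that matmul parameters ~= total parameters,
# and that a forward pass costs ~2P FLOPs per token.
n_layers, d, kv_dim, ffn, vocab = 80, 8192, 8 * 128, 28672, 128_256

attn = d * d + 2 * d * kv_dim + d * d      # Q, K, V (GQA) and O projections
mlp = 3 * d * ffn                          # gate, up and down projections
matmul_params = n_layers * (attn + mlp) + 2 * vocab * d  # + embedding & LM head

print(f"matmul params ~ {matmul_params / 1e9:.1f}B")   # ~70.6B
print(f"FLOPs per token ~ {2 * matmul_params:.2e}")    # ~1.4e11, i.e. 2P
```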

There are two meaningful bottlenecks that well-optimized GPU software should hit: we can be memory bandwidth bound, or we can be compute bound. GPUs at the highest level do just two things - transfer data to compute units, and do computation on that data. Since the two can happen at the same time, we only need to care about the longer of the two. Which one we hit depends on our workload.

Which bucket does transformer inference fall into? Well, it depends. For a single $(F, D) \times (D, B)$ matmul ($B$ is doing double duty in this article - below, it's the batch size passing through the transformer, i.e. the number of sequences; here, it's the number of tokens passing through the matmul; also, the approximation comes from $D \gg B$) to be compute bound, we need the arithmetic intensity to exceed the accelerator's compute-to-bandwidth ratio:

$$\frac{2BDF}{n_\mathrm{bytes}(BD + DF + BF)} \approx \frac{2B}{n_\mathrm{bytes}} \geq \frac{\text{accelerator compute}}{\text{accelerator bandwidth}}$$

So, we get our first result, the threshold number of tokens at which transformer inference becomes completely compute bound (mouse over the underlined numbers to see their source; this number changes with the accelerator, but is, only because of the approximation, independent of the model):

Threshold for matmuls to be compute bound:
B ≥ 3.96E15 ÷ (6.70E12 ÷ 1) ÷ 2 = 295 tokens

Try changing the accelerator to see how this threshold changes.
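
If you want to plug in other datasheet numbers, the threshold calculation is a one-liner (the figures below are the 2 × H100, fp8 configuration used for the numbers above - treat them as assumptions and swap in your own):

```python
# Compute-bound threshold: smallest token count B at which the matmuls
# stop being memory bandwidth bound, from 2B / n_bytes >= compute / bandwidth.
compute = 2 * 1979e12      # FLOP/s (dense fp8, two H100s with TP = 2)
bandwidth = 2 * 3.35e12    # bytes/s (HBM bandwidth of two H100s)
n_bytes = 1                # fp8

threshold_tokens = n_bytes * compute / (2 * bandwidth)
print(f"compute bound above ~{threshold_tokens:.0f} tokens")   # ~295
```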

Prefill§

Transformer inference has two phases: prefill, where we process the input prompt and generate KV cache, and then decode, where we generate output tokens one at a time. Prefilling happens in parallel across all the input tokens. For all the configurations (change the sequence length selector and see), prefilling is compute bound, even without batching across multiple requests.

Compute bound means that we're bounded by the FLOP/s of the accelerator: i.e. it's calculating just as hard as it can.

We worked out the FLOPs in a transformer forward pass above: it's $2BP$, for a model of size $P$ and batch size $B$. So, given the FLOP/s of the accelerator, we can figure out how long prefill should take for a single sequence (since we're compute bound, computing for $N$ sequences just means multiplying this number by $N$):

FLOPs required:
1024 × 2 × 7.00E10 = 1.43E14 FLOPs
Total compute available:
3.96E15 FLOP/s
Prefill time per sequence:
1.43E14 ÷ 3.96E15 = 36.22 ms
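
The same prefill arithmetic as a sketch, with the currently selected configuration (70B parameters, ISL = 1024, 2 × H100 at fp8) hard-coded as assumptions:

```python
# Prefill is compute bound, so the time is just FLOPs / FLOP/s.
params = 70e9              # model parameters
isl = 1024                 # input sequence length (tokens)
compute = 2 * 1979e12      # FLOP/s across both GPUs (TP = 2)

prefill_flops = 2 * params * isl          # 2P FLOPs per token
prefill_time_s = prefill_flops / compute
print(f"{prefill_time_s * 1e3:.2f} ms")   # ~36.22 ms
```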

Decode§

What about decode? Since decode deals with 1 token at a time per sequence, if the batch size is below the compute bound threshold, we will be memory bandwidth bound.

Bandwidth bound means our compute units are starved of work, and we’re spending more time transferring data than computing on it. If we’re memory bandwidth bound, then, instead of comparing FLOPs, we need to compare the memory transfers to the accelerator memory bandwidth.

There are two things we need to transfer for each decode step: the weights and the KV cache. For each token, we store both a key and value vector for each layer. The sequences run from ISL to ISL+OSL tokens, so we’ll assume that the average sequence length is in the middle.

KV cache per token:
2 × 80 × 8 × 128 × 1 = 1.64E5 bytes
Average sequence length:
1536 tokens
KV cache per sequence:
2.52E8 bytes ≈ 251.66 MB
Total KV cache:
64 × 2.52E8 = 1.61E10 bytes
Model weights:
7.00E10 × 1 = 7.00E10 bytes
Total bytes per decode step:
7.00E10 + 64 × 2.52E8 = 8.61E10 bytes
Memory bandwidth:
6.70E12 bytes/s
Average decode time per step:
8.61E10 ÷ 6.70E12 = 12.85 ms
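
As a sketch, here's the per-step byte accounting behind those numbers, with the selected configuration (Llama-3-70B at fp8, 2 × H100, 64 concurrent users, ISL = OSL = 1024) written in as assumptions:

```python
# Decode is memory bandwidth bound: time per step = bytes moved / bandwidth.
n_layers, n_kv_heads, head_dim, n_bytes = 80, 8, 128, 1
users, isl, osl = 64, 1024, 1024
params = 70e9
bandwidth = 2 * 3.35e12                     # bytes/s across both GPUs

kv_per_token = 2 * n_layers * n_kv_heads * head_dim * n_bytes  # K and V
avg_seq_len = isl + osl // 2                # sequences average ISL + OSL/2 tokens
kv_bytes = users * avg_seq_len * kv_per_token
weight_bytes = params * n_bytes

step_time_s = (weight_bytes + kv_bytes) / bandwidth
print(f"{step_time_s * 1e3:.2f} ms per decode step")   # ~12.85 ms
```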

Theoretical results§

Time to first token§

The fastest possible time to first token is just the prefill time for the input sequence:

Time to first token:
36.22 ms

End to End latency§

The fastest end-to-end latency that’s possible for any single request is the prefill time plus the decode time multiplied by the number of output tokens:

Prefill time:
36.22 ms
Decode time per token:
12.85 ms
Output tokens:
1024 tokens
Total latency:
36.22 + (1024 × 12.85) = 13196.32 ms ≈ 13.20 s
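
Or, as a two-line sketch reusing the prefill and decode times worked out above (assumed configuration as before):

```python
# Minimum end-to-end latency: one prefill, then OSL decode steps.
prefill_ms, decode_ms, osl = 36.22, 12.85, 1024

latency_ms = prefill_ms + osl * decode_ms
print(f"{latency_ms / 1e3:.2f} s")   # ~13.2 s
```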

Throughput§

Throughput is a little harder. To calculate throughput, we need to account for both prefill and decode phases, and the calculation depends on how we schedule these phases. If we do what was done in popular inference engines in the last few years, we run decode until we get new sequences to prefill, and then pause to prefill them. This is easy to reason about - we just calculate how long it will take to prefill and decode all the sequences in a batch, and divide the total number of output tokens by that time.

Most engines don’t do this any more. What we do instead, is chunked prefilling. Instead, of segregating prefill and decode phases, we build ‘heterogeneous’ batches, that contain parts of a prefill and a number of ongoing decodes.

This means in principle we can use some of the idle compute during decodes to prefill new sequences. How much we can get 'for free' depends on how much spare compute there is across all the decodes we do. See the appendix for a broader discussion of how this changes the maths. Toggle the 'Chunked Prefilling' checkbox to see how this changes the theoretical throughput.

Total prefill time:
0 ms (fully overlapped)
Total decode time:
1024 steps × 12.85 ms = 13160.10 ms
Total time:
13160.10 ms ≈ 13.16 s
Output tokens per cycle:
64 × 1024 = 65536 tokens
Throughput:
65536 ÷ 13.16 s = 4980 tokens/s
Per GPU (TP = 2):
2490 tokens/s per GPU
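
The throughput calculation with fully overlapped chunked prefill, again with the selected configuration baked in as assumptions:

```python
# With chunked prefill fully overlapped, a 'cycle' is just OSL decode steps.
users, osl = 64, 1024
step_bytes = 70e9 + 64 * 1536 * 163_840   # weights + total KV cache (from above)
bandwidth = 2 * 3.35e12                   # bytes/s across both GPUs
decode_step_s = step_bytes / bandwidth

cycle_time_s = osl * decode_step_s        # prefill rides along for free
throughput = users * osl / cycle_time_s
print(f"{throughput:.0f} tok/s total, {throughput / 2:.0f} tok/s per GPU (TP = 2)")
# ~4980 tok/s total, ~2490 tok/s per GPU
```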

Real numbers§

We’ve been at the highest level - comparing the datasheet numbers with the theoretical work that the system has to do. In practise, there are overheads:

  1. Framework overhead: Kernel launch overheads, data movement overheads, scheduling overheads, etc.
  2. Imperfect comms overlapping: We haven’t talked about comms at all - we’ve been implicitly assuming that all data transfers - both between accelerators, and from HBM to compute units - for an op happen perfectly asynchronously with compute. This definitely doesn’t always happen.
  3. Chunked prefilling overheads: There’s more memory movement with chunked prefilling than we’ve accounted for. See the appendix for more details.
  4. Achievable numbers: The maximum achievable numbers for the hardware are usually lower than the numbers on the datasheet. See here for a great attempt to actually measure this, or this for the surprising sorts of things that can matter.
  5. Lots of other stuff: thermals, bugs, cosmic rays, etc.

With that all in mind, how do real world numbers compare?

For the configuration selected on the left, here are the results from the inferenceMAX benchmarks:

The real world gets pretty close! Most configurations, across most accelerators, are between 20 and 50% of the possible performance. It's worth noting that our two numbers should trade off against each other here - it's impossible, with our assumptions, to achieve minimal latency and maximal throughput at the same time. In fact, our latency & TTFT numbers are theoretical minimums, whereas the reported latency numbers are averages. It's possible that the actual minimum latencies over benchmark runs are much closer to our numbers, but they're not reported in the benchmark results. This is especially bad for time to first token, where some sequences will have to queue to achieve maximum throughput.

Across configurations§

Toggle the dropdowns on the right to change the X axis. Changing the value not on the x axis (in the configuration selector on the left) will change the lines shown.

Throughput§

  1. Performance drops as tensor parallelism increases. This is probably because we neglected comms, and it looks like popular frameworks don’t overlap it perfectly. I’m assuming it’s theoretically possible with the inter-accelerator bandwidths here, though we didn’t work it through.
  2. Relative performance decreases as the number of concurrent users increases. It's hard to say why this is the case - there are lots of 'framework things' that scale with the number of concurrent users (i.e. scheduling overheads). Also, the ceilings are much higher here, so harder to hit.
  3. AMD by and large does worse than NVIDIA, relative to their theoretical peak performance. This is borne out in the maximum achievable FLOPS numbers too - although those also still depend on the software stack. It's hard to say why - maybe AMD chips are structurally less able to be driven to maximum performance, or maybe it's just that the low-level libraries (i.e. the matmul kernels) are less well-optimized.
  4. For high ISL, you can see a ‘kink’ in the theoretical throughput, as the KV cache loading starts to dominate the weight loading.

End-to-End Latency§

  1. You can see the throughput-latency tradeoff here in both the benchmark data and the theoretical data - more concurrent users means higher latency. This comes from increasing the KV cache transfer time (i.e. model weight transfers are constant with batch size, but KV cache transfers aren't). In the real numbers, we probably also have more chunked prefills contending with decodes, and queueing effects too.
  2. Generally, the results aren’t too far off what you’d expect from just the differences between average and minimum E2E latencies. It would be interesting if the benchmarks reported minimum latencies too (also very easy to just do).

Outlook: how to get faster§

With the theoretical assumptions we’ve made here, a reasonable guess is that there’s single to low double digit percentages left on the table in software.

Can we break our assumptions to raise the ceilings? The two big ones are:

  1. Decode is memory bandwidth bound, and prefill is compute bound, and one chip has to do both. This is the most flexible one. Things that weaken this:

    • Speculative decoding, which increases the number of tokens processed per decode step, increasing the arithmetic intensity. Even for high decode batch sizes, we can start to become bandwidth bound because of KV cache transfer, so there’s space for good speculators to help here.
    • Disaggregated prefilling, with which we do the prefill and decode steps on different hardware. This means we don't need one chip to have the highest values of both bandwidth and compute. Mostly, this lets us optimize cost at the moment (since B200s have an absolute advantage in both - see here for how), but specialization will likely help in the future. E.g. Groq's LPUs are perfect decoding chips.
  2. The Llama-3 architecture.

    • Llama-3 is pretty old, and since it came out, almost everything has moved over to MoE models. For MoE models, the figure of merit for inference performance is active parameters, not total parameters, so roofline performances for fixed total parameters go up.
    • There are loads of other architectural improvements that help - flashMLA, linear attention, &c, &c.

All in, these benchmarks are a really useful resource. I’m looking forward to seeing how the numbers change as both software and hardware improve.


Appendix: Chunked Prefilling§

vLLM’s default scheduler (as of v0.10.0) uses chunked prefilling. The idea is that prefills and decodes have complementary resource usage: prefills are compute bound (wasting memory bandwidth), while decodes are memory bandwidth bound (wasting compute). By mixing them in the same batch, we can increase utilization. To do this, we need kernels that can handle both at the same time, but somebody has kindly done that already.

The scheduler works as follows: at each step, we have a token budget (max_num_batched_tokens). Ongoing decodes take priority, using 1 token per sequence. Any remaining budget is filled with prefill tokens from incoming requests. If a prefill is too large to fit entirely, it gets split into chunks that are processed across multiple steps.
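
Here's a toy sketch of that scheduling loop - not vLLM's actual implementation, just the shape of it, with illustrative names and a plain FIFO queue standing in for the real request queue (the 64 decodes and 295-token budget headroom in the example are the selected configuration's numbers):

```python
from collections import deque

def schedule_step(decoding, waiting, max_num_batched_tokens):
    """Build one heterogeneous batch: every running sequence gets its single
    decode token first, then leftover budget is spent on prefill chunks from
    the waiting queue, splitting a prompt across steps if it doesn't fit."""
    batch = [(seq, 1) for seq in decoding]         # 1 token per ongoing decode
    budget = max_num_batched_tokens - len(decoding)
    while budget > 0 and waiting:
        seq = waiting[0]
        chunk = min(budget, seq["remaining_prompt"])
        batch.append((seq, chunk))
        seq["remaining_prompt"] -= chunk
        budget -= chunk
        if seq["remaining_prompt"] == 0:           # prompt fully prefilled:
            waiting.popleft()                      # it decodes from the next step
            decoding.append(seq)
    return batch

# Example: 64 ongoing decodes plus one new 1024-token prompt.
decoding = [{"remaining_prompt": 0} for _ in range(64)]
waiting = deque([{"remaining_prompt": 1024}])
batch = schedule_step(decoding, waiting, max_num_batched_tokens=64 + 295)
print(sum(tokens for _, tokens in batch))   # 64 decode tokens + a 295-token chunk = 359
```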

In steady state, we have all the concurrent users continuously decoding. When a sequence completes, a new request will arrive to take its place. That new request will be prefilled (potentially in chunks) alongside the ongoing decodes.

In order not to interrupt decodes, we require the chunk size to be equal to:

$$\mathrm{Chunk~size} = \mathrm{Compute~bound~threshold} - B$$

The condition here is that decode batches are made perfectly compute bound by the addition of prefill chunks. (How good is this modelling? Mixed. The big thing that's missing is that chunked prefilling incurs extra memory transfer, which will slow down the decodes regardless of the compute threshold. In fact, it incurs more memory transfer in total than doing prefill and decode separately, since some parts of the KV cache have to be transferred multiple times. For example, for an $ISL$ of 8192 tokens and a chunk size of 1024, the KV cache for the first 1024 tokens is transferred 7 times. In principle it's $O(ISL^2)$ extra memory transfer, but the chunk size brings the scaling down by a lot.)

If this is the case, then we can do $OSL$ prefill chunks 'for free' in parallel with the decodes. There are:

$$B \times (ISL/\mathrm{Chunk~size})$$

chunks to process in total. So the number of non-overlapped prefill tokens is:

(B×(ISL/Chunk size)OSL)×Chunk sizeB×ISLOSL×Chunk sizeB×(ISL+OSL)OSL×Compute bound threshold\begin{aligned} (B \times (ISL/\mathrm{Chunk~size}) - OSL) \times \mathrm{Chunk~size} \\ B \times ISL - OSL \times \mathrm{Chunk~size} \\ B \times (ISL + OSL) - OSL \times \mathrm{Compute~bound~threshold} \end{aligned}

For the currently selected configuration:

Non-overlapped prefill tokens:
64 × (1024 + 1024) - 1024 × 295 = 0 tokens
All prefills can be completely overlapped with decodes!
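
The appendix formula as a sketch, clamped at zero, with the selected configuration (B = 64, ISL = OSL = 1024, threshold ≈ 295) as assumptions:

```python
# Prefill tokens that can't be hidden inside decode steps.
B, isl, osl, threshold = 64, 1024, 1024, 295

chunk_size = threshold - B                               # 231 tokens per free chunk
leftover = max(0, B * (isl + osl) - osl * threshold)     # clamp: can't go negative
print(chunk_size, leftover)   # 231 0 -> every chunk rides along with a decode step
```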

Last modified: 23 Oct 2025