Inference Arithmetic

Understanding LLM capacity from first principles

Fergus Finn

Doubleword

We're building an LLM API designed for high volume.

You send us requests; we return the completions within an SLA. Not real-time, but substantially cheaper.

Running this at scale means knowing how many GPUs can serve how many requests. Experimentation is slow and expensive. We need to understand what we can do from first principles.

That requires some inference arithmetic.

What do LLMs do?

Matmuls, mostly. Everything else is negligible.

Total FLOPs for a single sequence, with ISL the input sequence length and P the parameter count (numbers throughout are for Llama 70B):

2 × ISL × P

= 2 × 1,024 × 69.35B = 142.0 TFLOPs

Why 2 × ISL × P?

A weight matrix of shape (D_in, D_out) has D_in × D_out parameters. Multiplying it by ISL vectors does one multiply and one add per weight per vector. That's 2 × D_in × D_out × ISL FLOPs.

Since D_in × D_out is just the number of parameters in that matrix, each weight contributes 2 × ISL FLOPs. Sum over the whole model: 2 × ISL × P.

Everything else (norms, activations, softmax) does O(D) work per layer. The matmuls do O(D²). The ratio is ~8000x for Llama 70B, so we ignore everything else.
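The matmul count fits in a couple of lines of Python, as a quick sketch (P is the Llama 70B figure used above):

```python
# Matmul FLOPs for one forward pass over a prompt:
# each parameter does one multiply and one add per input token.
P = 69.35e9   # Llama 70B parameter count (as used above)
ISL = 1024    # input sequence length

matmul_flops = 2 * ISL * P
print(f"{matmul_flops / 1e12:.1f} TFLOPs")  # ~142.0 TFLOPs
```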

What about attention?

Self-attention adds 4 × S × T × D × L FLOPs on top of the matmul count, where S is the query length, T the key/value length, D the model dimension, and L the number of layers.

At prefill time (S = T = ISL), the arithmetic intensity is O(T), so attention is compute-bound. At decode time (S = 1), it's O(1), so attention is bandwidth-bound. Same regimes as the matmuls.

For prefill, attention FLOPs are much smaller than matmul FLOPs at the sequence lengths we care about, but not negligible. We include them in the prefill calculation.
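A sketch of that attention term, assuming Llama 70B's shape (D = 8192, L = 80; these dimensions aren't stated above, but they reproduce the 2.7×10¹² figure used in the prefill calculation below):

```python
# Attention FLOPs at prefill: Q@K^T and A@V each cost 2*S*T*D per layer.
D = 8192      # model dimension (Llama 70B, assumed)
L = 80        # number of layers (Llama 70B, assumed)
S = T = 1024  # prefill: query and key/value lengths both equal ISL

attn_flops = 4 * S * T * D * L
print(f"{attn_flops / 1e12:.2f} TFLOPs")  # ~2.75 TFLOPs, vs ~142 for the matmuls
```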

Two bottlenecks

Memory-bandwidth bound: compute units are starved of data. We're waiting for weights to arrive.

Compute bound: data arrives fast enough. Compute units are fully saturated.

Which regime we're in depends on the number of tokens processed simultaneously. Above a threshold, we're compute-bound; below it, bandwidth-bound. The threshold is the point where doing the math takes as long as streaming the weights once:

compute ÷ (2 × bandwidth ÷ bytes_per_param)

= 3958 TFLOP/s ÷ (2 × 6.7 TB/s ÷ 1 byte per parameter)

= 295 tokens
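A sketch of that threshold with the accelerator figures used here (3958 TFLOP/s, 6.7 TB/s, 1-byte weights):

```python
compute = 3958e12     # FLOP/s
bandwidth = 6.7e12    # bytes/s
bytes_per_param = 1   # 8-bit weights

# Compute-bound once the math for a batch of tokens takes at least as long
# as streaming the weights once: 2*P*tokens/compute >= P*bytes/bandwidth.
threshold = compute / (2 * bandwidth / bytes_per_param)
print(f"{threshold:.0f} tokens")  # ~295
```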

Prefill and decode

Inference has two phases. They land in different regimes:

Prefill: process the entire input prompt in parallel. Tokens at once = ISL. Well above the threshold. Compute-bound.

Decode: generate one token per sequence per step. Tokens at once = number of concurrent users. Typically well below the threshold. Bandwidth-bound.

How long to process a prompt?

Prefill is compute-bound. Time = work ÷ compute:

(2 × ISL × P + attention) ÷ compute

= (2 × 1,024 × 69.35B + 2.7×10¹²) ÷ 3958 TFLOP/s

= 36.6 ms
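The same calculation as a quick sketch (the attention term is the estimate from earlier):

```python
P = 69.35e9          # Llama 70B parameters
ISL = 1024           # input sequence length
attn_flops = 2.7e12  # attention FLOPs at prefill, estimated above
compute = 3958e12    # FLOP/s

t_prefill = (2 * ISL * P + attn_flops) / compute
print(f"{t_prefill * 1e3:.1f} ms")  # ~36.6 ms
```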

How long to generate a token?

Decode is bandwidth-bound. Each step reads all weights plus KV cache for every concurrent sequence. Time = bytes ÷ bandwidth:

(weights + B × kv_cache) ÷ bandwidth

= (69.3 GB weights + 16.1 GB KV cache for B = 64 sequences) ÷ 6.7 TB/s

= 12.75 ms per token
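A sketch of the decode step, assuming roughly 0.25 GB of KV cache per sequence (a figure not stated above, but it reproduces the 16.1 GB total at B = 64):

```python
weights_gb = 69.3      # model weights at 1 byte per parameter
kv_per_seq_gb = 0.252  # assumed KV cache per sequence; 64 * 0.252 ≈ 16.1 GB
B = 64                 # concurrent sequences
bandwidth_gb_s = 6700  # 6.7 TB/s

t_decode = (weights_gb + B * kv_per_seq_gb) / bandwidth_gb_s
print(f"{t_decode * 1e3:.2f} ms per token")  # ~12.75 ms
```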

Putting it together

End-to-end latency for a single request, with OSL the output sequence length:

t_prefill + OSL × t_decode

= 36.6 ms + 1024 × 12.75 ms

= 13.10 s
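The same sum as a quick sketch:

```python
t_prefill = 36.6e-3  # s, from the prefill calculation
t_decode = 12.75e-3  # s per token, from the decode calculation
OSL = 1024           # output sequence length

latency = t_prefill + OSL * t_decode
print(f"{latency:.1f} s")  # ~13.1 s
```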

Throughput for B = 64 concurrent users:

(B × OSL) ÷ (B × t_prefill + OSL × t_decode)

= (64 × 1024) ÷ (64 × 36.6 ms + 1024 × 12.75 ms)

= 4255 tok/s
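And the same throughput accounting: B prefills at t_prefill each, then OSL batched decode steps:

```python
B, OSL = 64, 1024
t_prefill, t_decode = 36.6e-3, 12.75e-3  # s, from above

throughput = (B * OSL) / (B * t_prefill + OSL * t_decode)
print(f"{throughput:.0f} tok/s")  # ~4,256 with these rounded inputs (≈ the 4,255 above)
```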

Reality check

How close do real systems get to these limits?

InferenceMAX benchmarks (Llama 70B, various accelerators): real systems achieve 20-50% of theoretical peak throughput.

The gap comes from framework overhead, imperfect comms overlap between GPUs, non-ideal kernels, and scheduling inefficiencies. But the model gives us the right ceiling.

[Charts: throughput and latency, theory vs. reality]

Why batch is cheaper

A real-time API is latency-constrained. Each user is waiting, so you can't increase the batch size without increasing their wait. Decode stays bandwidth-bound. Compute units sit idle.

A batch API doesn't have this constraint. We can accumulate requests and run at much higher batch sizes. Push decode toward compute-bound. Both bottlenecks saturated.

Better utilization, same hardware, lower cost per token.

Try the batch API

Same models, lower cost, higher throughput.

app.doubleword.ai