70x faster cold(ish) starts for SGLang

17 min read

It takes nearly twelve minutes to start serving a 122 billion parameter MoE model (Qwen3.5-122B-A10B-FP8: 10B active of 122B total, FP8, served with SGLang v0.5.10) on a B200 in Kubernetes (MicroK8s v1.35.0, containerd 2.1.3, runc 1.2.5, NVIDIA GPU Operator v26.3.0 with CDI enabled, driver 580.126.20; 8x B200, Blackwell, 192GB each).

That’s a long time.

Who knows where the time goes§

First, measure.

Cold start 695s
Container start
Python imports
HF config + tokenizer
Weight loading
FlashInfer autotune + DeepGEMM
CUDA graph capture
Server warmup

A lot of slow stuff! 21 seconds of Python imports. Autotuning, JIT compilation. 531s of weight loading! (Measured after echo 3 > /proc/sys/vm/drop_caches clears all the kernel caches, then loading from scratch. I thought this number was off, but it reproduces.) It’s a genuinely cold load, i.e. no page caching of the downloaded SGLang image, no page caching of the downloaded model weights. You’d likely never see a start this cold except post-reboot. To be clear, this all makes sense given SGLang’s API contract: getting 10% more tokens per second is worth way more than launching 10s faster. What happens over time is that the launch time creeps up and up.

Most of this startup work produces artifacts that SGLang already knows how to cache. Compiled kernels, autotuned configs, JIT outputs. If we persist them across restarts1, we get a warm start:

Warm start 88s
Container start
Python imports
HF config + tokenizer
Weight loading
FlashInfer autotune + DeepGEMM
CUDA graph capture
Server warmup

Much better. The kernel caches cut most of the compilation time, and the OS page cache does the rest. The weights are still on disk, but once they’ve been read once, Linux keeps them in RAM. Second load reads from page cache instead of NVMe.

We’re still not doing great though. 88s! Only 31s of that is actually loading weights. The other 57s is imports, config parsing, autotuning, warmup. All overhead we’re paying every time, even though the results are the same. The only irreducible part is moving 117 GB of weights onto the GPU.

On this machine we have PCIe gen 5 x16 (64 GB/s theoretical to GPU) and an NVMe gen4 SSD (7-8 GB/s to RAM). So the bandwidth floor for weight transfer is somewhere between 117/8 ≈ 15s from disk and 117/64 ≈ 1.8s from RAM. Can we get close?
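Worth sanity-checking the floor arithmetic:

```python
# Back-of-envelope floors for moving the weights (numbers from this machine).
WEIGHTS_GB = 117
NVME_GB_S = 8    # NVMe gen4, disk -> RAM
PCIE_GB_S = 64   # PCIe gen5 x16, RAM -> GPU

from_disk_s = WEIGHTS_GB / NVME_GB_S
from_ram_s = WEIGHTS_GB / PCIE_GB_S
print(f"disk floor: {from_disk_s:.1f}s, RAM floor: {from_ram_s:.1f}s")
# -> disk floor: 14.6s, RAM floor: 1.8s
```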

Here’s one I made earlier§

One way to approach this problem is to take the timeline above and start hitting it with the optimization hammer. None of this stuff has to take this much time, and where it does need to take time, it could be cached, parallelised, pipelined.

It’s hard though. SGLang is a big project, with a lot of moving parts. And the current structure is rational! It makes sense to spend plenty of time at startup to be as performant as possible at runtime. We should expect it to stay rational: if we make startup fast now, it will later become slow again.

What we really want to do is just take everything that SGLang does on startup (including whatever they might choose to do in the future) and cache it, all together in some nice package that’s fast to restore from. In the systems world this is called checkpoint/restore. The leading solution on Linux is CRIU, which can checkpoint and restore a process tree through existing kernel APIs. Its CUDA plugin calls NVIDIA’s cuda-checkpoint tool to capture and restore GPU device state.

One catch: a naive CRIU checkpoint of SGLang would include all GPU memory, 192GB on our B200. Most of that is weights and KV cache that SGLang already knows how to reload. So we strip them out, shrinking the checkpoint from 192GB to 6.6GB, but we pay for it with a weight reload step on wake. Here’s the restore timeline:

Restore baseline 32.1s
Container setup
CRIU + CUDA plugin
Wake + reload

Lots better! (You might have noticed the weight reload dropped from 31s to 19s. More on this later.) We’ve glossed over some frustrating work here: stripping weights and KV cache from the checkpoint2, packaging it as an OCI image for kubernetes3, GPU device remapping4, waking up after restore5, and getting CRIU and SGLang to play nice6.

Down to the wire§

32s, but still far from the bandwidth floor. Of the 32s, only 19s is actually loading weights. The other 13s is overhead: container setup (6s) and CRIU process restore (7s). Before we tackle the weight loading, we can cut this overhead roughly in half.

Containerd was doing redundant work on every checkpoint restore that it doesn’t do on normal pod launches7. Two patches cut container setup from 6s to 3s. CRIU itself was also slower than it needed to be. The main fix is zero-copy page restore: mmap the checkpoint pages directly instead of copying them into fresh allocations8. After that, most of the remaining CRIU time is the cuda-checkpoint driver call itself.

+ containerd & CRIU patches 24.5s
Container setup
CRIU + CUDA plugin
Wake + reload

Overhead is down to about 6.5s. That leaves the weights.

Keep the home fires burning§

It’s finally time to tackle the big green bit. We’ve got lots of RAM on the device (if we had a spare GPU already running the same model, we could load P2P over NVLink at 1.8TB/s, but we’re scaling from zero here), so the best we can do is PCIe bandwidth: 64GB/s for this machine setup. How are we going to get there?

First, the weight reload. We sneakily dropped it from 31s in the warm start to 19s in the restore baseline. Instead of loading from the safetensors files on disk (like on the initial load), on checkpoint we dump a serialized representation of the actual allocations. (torch_memory_saver interposes SGLang’s CUDA allocations via virtual memory APIs. It already supports backing weights to CPU during release_memory_occupation, but not to disk. We add a batched device-to-host transfer that writes a single flat file (doublewordai/torch_memory_saver#2). On reload, we just need to get the allocations back into their proper places. SGLang supports weight reload via update_weights_from_disk, but 1. it requires reloading from safetensors, which can be slow, and 2. the reload path for lots of models drifts from the original load path; it’s actually broken for this model.) Then, when we load back, we load through a ring buffer with direct IO. We hit the disk ceiling pretty handily for this device.
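The ring buffer itself is torch_memory_saver internals, but the direct IO side is easy to sketch in the stdlib. A minimal reader, assuming Linux: O_DIRECT needs block-aligned buffers, so we read into an anonymous mmap (always page-aligned), and fall back to buffered IO on filesystems that refuse O_DIRECT.

```python
import mmap
import os

CHUNK = 64 << 20  # 64 MiB slabs; a multiple of the block size, as O_DIRECT requires

def read_flat_file(path):
    """Yield a flat weights file in large chunks, bypassing the page cache
    where possible. Not the torch_memory_saver code, just the same idea."""
    try:
        # O_DIRECT: DMA straight from the NVMe, no page-cache copy.
        fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    except OSError:
        # tmpfs and some overlay filesystems don't support O_DIRECT.
        fd = os.open(path, os.O_RDONLY)
    buf = mmap.mmap(-1, CHUNK)  # anonymous mmap => page-aligned buffer
    try:
        while True:
            n = os.readv(fd, [buf])  # readv lets us supply our aligned buffer
            if n == 0:
                break
            yield bytes(buf[:n])
    finally:
        buf.close()
        os.close(fd)
```

In the real path each chunk would be handed off to an async H2D copy while the next read is in flight; that overlap is what the ring buys.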

But the disk ceiling is pretty low. We want the RAM ceiling: 1.8s. We can add faster disks, GPU Direct Storage, all this stuff. But if we don’t want to re-spec our machine, and, like I do here, we’ve just got an NVMe-backed virtio disk, we need to get the weights into RAM before the restore starts.

The problem is, by definition, the process doesn’t exist before it starts restoring. So it’s going to have to get this RAM-backed weights file from someone else.

This ‘someone else’ is a daemon on the node, whose job is to watch the weights checkpoint directory and stage its contents into RAM. It exposes a unix socket to restored containers. On wake, torch_memory_saver queries the socket. If the daemon has the weights staged, it passes the file descriptor over and reload happens from RAM instead of disk. Otherwise, we fall back to regular restore. Depending on how much memory you give the daemon, the chance of a cache hit can be very high or very low. (The daemon can also partially stage: it hands out a pre-populated ring buffer, which it refills from disk as the consumer drains it.)
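The fd handoff is standard SCM_RIGHTS ancillary data over a Unix socket. A self-contained sketch (the framing is made up; only the cmsg mechanics match what any such daemon must do):

```python
import array
import socket

def send_fd(sock, fd):
    """Pass an open file descriptor to the peer process via SCM_RIGHTS."""
    sock.sendmsg(
        [b"staged"],  # token payload; a real protocol would describe the file
        [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))],
    )

def recv_fd(sock):
    """Receive a file descriptor sent by the peer. Returns the new local fd."""
    fds = array.array("i")
    msg, ancdata, flags, addr = sock.recvmsg(64, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[:fds.itemsize])
    return fds[0]
```

On the daemon side the fd would back the staged weights (a memfd or hugetlbfs file); the restored process mmaps it and reloads at RAM speed.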

The remaining problem is getting 118GB from RAM to the GPU quickly. A naive cudaMemcpy from an unpinned buffer gets nowhere near PCIe speeds. You need the driver to know the buffer is stable in RAM, so it can issue DMA directly. But the registration call (cudaHostRegister) is slow, scaling with the number of pages backing the buffer. We solve this with hugepages (fewer pages to register) and by pipelining registration with the transfer. (The Linux kernel can back a buffer with regular 4KiB pages, or with 2MB (or 1GiB) hugepages. Fewer pages per GB means lower registration overhead. The daemon owns a pool of hugepages, allocates into them, and hands them out over its socket. For the pipeline: you register a block, then issue an async H2D copy on that block. While that copy is executing, you register the next block. I didn’t think this would work (the advice I’ve seen is that cudaHostRegister overhead kills the transfer benefit), but it really does seem to with hugepages. And apparently registration and copying don’t serialize. I expected them to; otherwise why don’t we do this all the time instead of CPU bounce buffers? The consumer side is a short ceremony: open a Unix socket, receive the shared memory fd via SCM_RIGHTS, mmap it in-process, cudaHostRegister each 1 GiB span (pipelined with the H2D DMA), and copy to GPU. Operationally it’s a DaemonSet that owns the hugepage reservation, and restored pods just mount the checkpoint directory plus the Unix socket (doublewordai/torch_memory_saver#3).)
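The register/copy overlap is an ordinary double-buffered pipeline. Here it is with the CUDA calls stubbed out as parameters (a sketch of the pattern, not the torch_memory_saver code: `register` stands in for cudaHostRegister, `copy_async` for an async H2D memcpy, `synchronize` for a stream sync):

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_upload(blocks, register, copy_async, synchronize):
    """Register block i+1 on a helper thread while block i's copy is in flight."""
    if not blocks:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(register, blocks[0])
        for i, block in enumerate(blocks):
            pending.result()      # block i is now pinned/registered
            copy_async(block)     # kick off its DMA; returns immediately
            if i + 1 < len(blocks):
                pending = pool.submit(register, blocks[i + 1])
        synchronize()             # drain the copy stream
```

The single-worker pool keeps registrations ordered; at steady state one block is always being registered while another is being copied.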

The net effect is big:

+ TMS daemon 9.6s
Container setup
CRIU + CUDA plugin
Wake + reload

We’re getting 38GB/s effective loading from RAM to the GPU. It’s not 64GB/s (~50 GB/s is perhaps a more realistic expectation), but it’s dead fast.

Conclusion§

Here are all the reload paths on the same absolute scale:

Cold start 695.0s
Warm start 88.0s
Restore baseline 32.1s
+ containerd & CRIU patches 24.5s
+ TMS daemon 9.6s

9.6s. 70x faster than cold, 9x faster than a warm start. Most of what’s left is the cuda-checkpoint driver call (3.5s) and the weight DMA (3.1s at 38 GB/s). The theoretical floor is around 1.8s. We’re not there yet, but we’re getting close9.

Footnotes§

  1. Kernel caches to persist across restarts:

    • FlashInfer kernel cache (~/.cache/flashinfer)
    • Triton kernel cache (TRITON_HOME, defaults to ~/.triton)
    • Torchinductor cache (TORCHINDUCTOR_CACHE_DIR)
    • TVM FFI cache (~/.cache/tvm-ffi)
    • FlashAttention CUTE DSL cache (FLASH_ATTENTION_CUTE_DSL_CACHE_ENABLED=1, FLASH_ATTENTION_CUTE_DSL_CACHE_DIR)
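Wiring these up means pointing each cache at storage that outlives the pod (a hostPath or PVC mount). A sketch: the mount path is mine; the variable names are the ones above.

```python
import os

CACHE_ROOT = "/mnt/kernel-caches"  # illustrative persistent mount

os.environ.update({
    "TRITON_HOME": f"{CACHE_ROOT}/triton",
    "TORCHINDUCTOR_CACHE_DIR": f"{CACHE_ROOT}/torchinductor",
    "FLASH_ATTENTION_CUTE_DSL_CACHE_ENABLED": "1",
    "FLASH_ATTENTION_CUTE_DSL_CACHE_DIR": f"{CACHE_ROOT}/fa-cute-dsl",
})
# FlashInfer (~/.cache/flashinfer) and TVM FFI (~/.cache/tvm-ffi) default to
# $HOME's cache dir, so mount persistent storage over ~/.cache for those.
```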
  2. CRIU’s integration with cuda-checkpoint goes in stages - the plugin first calls cuda-checkpoint to move the GPU allocations & device state into host RAM, and then that host RAM gets serialized by CRIU to disk. Since the B200s we’re running this experiment on have 192GB of device memory each, this means we get a 192GB checkpoint (actually substantially larger, since we store all the host memory the process holds too). Writing and reading these whole-device checkpoints to and from disk takes an age. And cuda-checkpoint by itself gets nowhere near PCIe bandwidth even for its part of the checkpoint process. The problem is that we’re pulling a lot of stuff that SGLang already does really well into our checkpointing process. KV cache: we’re checkpointing the entire state of the KV cache. This is nice! But it’s also completely unnecessary. The KV cache memory hierarchy - where we have KV cache pools on peers, RAM, disk, and object storage, paged in on demand into a running SGLang server - is one of the core functionalities of modern distributed inference frameworks. So we should just throw away our KV cache - if we need it back, there are better places to get it. Once we do this, our checkpoint size scales with the size of the weights, but not the size of the GPU. Weights: putting the weights into the checkpoint runs them through a process that, again, SGLang is tightly optimized to do better. Just throw them away! SGLang knows how to load weights at near NVMe speeds. It can even do it via its RL API - just add --enable-memory-saver to the SGLang invocation, then call /release_memory_occupation, /resume_memory_occupation and /update_weights_from_disk. Once we do this, the checkpoint size is only the host memory, the CUDA + torch state that’s left on the GPU when no weights are allocated. The average checkpoint size is about 6.6GB - about 30x smaller.
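The release/reload ceremony from that footnote, as a sketch with the HTTP transport abstracted away (the endpoint names are SGLang’s; the payloads are simplified, so check the docs before relying on them):

```python
def release_for_checkpoint(post):
    """Ask a server started with --enable-memory-saver to drop weights + KV cache.

    `post(path, payload)` is any HTTP POST callable (e.g. wrapping requests.post).
    """
    post("/release_memory_occupation", {})

def reload_after_restore(post, model_path):
    """Re-occupy GPU memory, then reload weights from disk."""
    post("/resume_memory_occupation", {})
    post("/update_weights_from_disk", {"model_path": model_path})
```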

  3. A CRIU checkpoint is a specific set of files on disk. But we want to get back to where we started. criu restore will take those files and rebuild the process on the host - we get a running process that looks exactly like our original one. But our original target wasn’t a process running happily unisolated on the host, it was a container running in Kubernetes. So we don’t just need to restart the process; we need to ‘reschedule a pod’. We need a packaging format and orchestration flow that lets us tell Kubernetes to schedule a ‘checkpoint restore’, not to start a process from scratch. Checkpoint/restore has been moving into Kubernetes core for a while. The kubelet even has an HTTP API for creating a checkpoint. I found that this generally didn’t expose enough control to make restores performant (it tells a lower layer (runc) to checkpoint, but doesn’t let you configure how it tells it). So instead, we build our checkpointing flow on runc itself, run from the host on which we make our checkpoint. Creating a checkpoint that can be restored by containerd (the Kubernetes container runtime I’m using) requires a specific image format, and an annotation to indicate that containerd should restore it using CRIU instead of just launching its entrypoint. The result is a standard OCI image - it can be pushed & pulled from a repository, run on different nodes, etc.

  4. The restored container, by default, looks exactly like the original one. This is obviously kinda the goal - but it has a few interesting implications. For one - say you have a machine with 8 GPUs, and you take your checkpoint on device 4. When you restore, your checkpointed container might get device 2 from the GPU operator. But it will still restore onto device 4! All the information on what device it owns gets baked into the checkpoint, and all the clever CDI kubernetes stuff gets skipped completely. So you have to build a flow to remap the GPU ids inside the checkpoint. This requires support from the driver for ‘device migration’ - which only landed in driver version 580. And you have to remember in an accessible format in the checkpoint image what devices are baked into the cuda context, so you can then pass them to the cuda-checkpoint tool and remap them to their new, CDI allocated GPUs when CRIU runs cuda-checkpoint restore. We just write the device identities into a container layer (since we have it at checkpoint time).

  5. Ideally, ‘reschedule a pod’ means just that: write a kubernetes manifest and apply it. You shouldn’t have to do any extra work to load these checkpoints above and beyond scheduling the original pods. (For example, dynamo’s snapshot framework runs a persistent root daemon that takes over more control of the snapshot restore. This gives them lots more control, but has more moving parts.) But we just said in the last section that we want SGLang to cooperate with us - which means we need to call its HTTP API after restore! We get around this with a neat trick: the container can ‘wake itself up’. When we make the checkpoint, we inject a process (with kubectl exec) into the container’s process tree. This process polls constantly for the existence of a specific file, and when that file exists, calls /resume_memory_occupation (and whatever other post-restore hooks we might want to call). When CRIU checkpoints the process tree, this helper gets quiesced and bundled up with the SGLang process. When the checkpoint is packaged, we write the file the helper polls for into the OCI image. Then, on restore, since the file exists, the helper immediately enters its wake flow!
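The injected helper is a few lines; a sketch (the trigger path and wake hook are stand-ins):

```python
import os
import time

def wait_for_wake(trigger_path, on_wake, poll_s=0.1, timeout_s=None):
    """Poll for a trigger file; run the post-restore hook once it appears.

    Before checkpoint the file never exists, so this loop just gets quiesced
    and bundled into the checkpoint. The file is written into the OCI image,
    so on restore it exists and the hook fires immediately.
    """
    deadline = None if timeout_s is None else time.monotonic() + timeout_s
    while not os.path.exists(trigger_path):
        if deadline is not None and time.monotonic() > deadline:
            return False
        time.sleep(poll_s)
    on_wake()  # e.g. POST /resume_memory_occupation, then reload weights
    return True
```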

  6. Lots of things can make containers and pods hard to checkpoint/restore. Important ones: (1) Threads that hold a CUDA context but don’t have anything resident on the device. cuda-checkpoint ends up choking on them. For SGLang these come from torchinductor’s compilation process, so we set TORCHINDUCTOR_COMPILE_THREADS=1 so no spare threads get created. (2) In k8s, lots of stuff in the pod will bind to the pod’s IP instead of localhost. When the checkpoint is restored, the pod will have a different IP, and everything dies. The solution is to get everything in SGLang that listens on TCP to bind to either 0.0.0.0 (i.e. all interfaces) or localhost. For single-GPU pods, GLOO_SOCKET_IFNAME=lo suffices. Multi-GPU pods are a whole other kettle of fish that we can go into later. (3) io_uring: SGLang uses io_uring via uvloop. io_uring can’t be checkpointed yet. There’s an environment variable to force uvloop onto epoll, but it’s easier (if nothing else needs it) to just disable it on the host: sudo sysctl -w kernel.io_uring_disabled=2. (4) HF_HUB_OFFLINE=1: prevents outbound TCP connections to Hugging Face Hub. With tcp-established in runc.conf, any active TCP connection at checkpoint time gets preserved. Outbound connections to external hosts would fail on restore in a new pod network context, so models should be downloaded on the host first.
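Pulled together, the single-GPU knobs from this footnote (io_uring is a host-level sysctl, so it appears only as a comment):

```python
import os

# Checkpoint-friendly environment for a single-GPU SGLang pod.
os.environ.update({
    "TORCHINDUCTOR_COMPILE_THREADS": "1",  # no spare threads holding a CUDA context
    "GLOO_SOCKET_IFNAME": "lo",            # bind gloo to loopback, not the pod IP
    "HF_HUB_OFFLINE": "1",                 # no live TCP to the Hub at checkpoint time
})
# On the host, not in the pod:  sudo sysctl -w kernel.io_uring_disabled=2
```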

  7. Containerd re-extracts the checkpoint tarball to a staging directory on every restore (6GB of disk writing, identical each time), and resolves the base image from a remote registry regardless of pull policy. For normal container launches both are skipped if they’ve been done before. My B200 machine is in Sydney, so the registry round-trip is especially painful. doublewordai/containerd#1 caches the extracted-checkpoint rootfs so subsequent restores skip the re-unpack. doublewordai/containerd#2 resolves the base image locally instead of unconditionally calling Pull(config.RootfsImageRef) inside CRImportCheckpoint.

  8. 7s to restore a 6.6GB checkpoint seems like a lot. Starting processes is milliseconds, and transferring that data shouldn’t take more than a second even cold from disk. CRIU makes a lot of syscalls, and does a lot of very cool stuff, but still. The biggest win is zero-copy page restore: instead of allocating anonymous pages and reading into them with preadv (thousands of syscalls), we mmap the pages-img file directly with MAP_PRIVATE|MAP_FIXED. Pages that are never touched stay as zero-cost file-backed references; pages that get written trigger copy-on-write (doublewordai/criu#2). The other wins are smaller: parallelising the cuda-checkpoint plugin across SGLang’s three GPU-owning processes instead of running them serially (doublewordai/criu#1), and caching per-process restore_tid values at checkpoint time so we don’t have to probe for them on restore. Each live probe costs ~1.1s; caching saves about 0.9s (doublewordai/criu#4).
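The copy-on-write behaviour that makes this free is observable from the stdlib. A demonstration (CRIU additionally uses MAP_FIXED to land pages at their original addresses, which Python’s mmap doesn’t expose):

```python
import mmap
import os
import tempfile

# A stand-in for CRIU's pages-img file.
fd, path = tempfile.mkstemp()
os.write(fd, b"A" * mmap.PAGESIZE * 4)
os.close(fd)

with open(path, "rb") as f:
    # MAP_PRIVATE: pages start as zero-cost file-backed references.
    m = mmap.mmap(f.fileno(), 0, flags=mmap.MAP_PRIVATE,
                  prot=mmap.PROT_READ | mmap.PROT_WRITE)

m[0] = ord("B")  # first write to a page triggers copy-on-write
assert m[0] == ord("B")
with open(path, "rb") as f:
    assert f.read(1) == b"A"  # the file on disk is untouched
m.close()
os.unlink(path)
```

Pages the restored process never writes stay as references into the checkpoint file, so the kernel only materializes what’s actually touched.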

  9. Where could each phase go from here? Container setup (3.0s) is containerd and runc doing their thing. If we didn’t want to run in k8s, this would all disappear. Smaller or flatter container images might help too; this is general-purpose container orchestration optimization. CRIU + cuda-checkpoint (3.5s): zero-copy page restore means CRIU itself is fast. We can make it faster by being more invasive in SGLang with regards to what ends up in the checkpoint (reducing threads, being parsimonious with host-side memory), but most of this time is the cuda-checkpoint driver call restoring GPU context, which is in NVIDIA’s hands. Weight reload (3.1s): 38 GB/s effective out of a theoretical 64 GB/s. The gap is mostly cudaHostRegister overhead that doesn’t fully overlap with the DMA (empirically doesn’t improve with 1 GiB pages either). The larger refactor is to give the daemon its own CUDA context and have it start transferring to the device as the container is starting (or before, if no other container is running), passing device-side allocations over CUDA IPC. The daemon could keep its buffers registered between loads, bringing the transfer down to the PCIe ceiling and potentially hiding it behind the cuda-checkpoint restore. Another trick: this machine has 8 PCIe x16 slots connected via NVLink (1.8 TB/s). We could load 1/8 of the weights through each slot and all-gather.


Last modified: 22 Apr 2026