<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel><title>Fergus&apos;s blog</title><description>LLM inference thoughts</description><link>https://fergusfinn.com/</link><item><title>Control Layer Benchmarking</title><link>https://fergusfinn.com/blog/control-layer-benchmarking/</link><guid isPermaLink="true">https://fergusfinn.com/blog/control-layer-benchmarking/</guid><description>Benchmarking the Doubleword Control Layer
</description><pubDate>Tue, 21 Oct 2025 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/control-layer-benchmarking/md</mdUrl></item><item><title>Bringing up DeepSeek-V4-Flash on AMD MI300X</title><link>https://fergusfinn.com/blog/deepseek-v4-flash-mi300x/</link><guid isPermaLink="true">https://fergusfinn.com/blog/deepseek-v4-flash-mi300x/</guid><description>A story of sharp edges, segfaults, and standards</description><pubDate>Mon, 01 Jun 2026 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/deepseek-v4-flash-mi300x/md</mdUrl></item><item><title>The Doubleword Control Layer</title><link>https://fergusfinn.com/blog/control-layer/</link><guid isPermaLink="true">https://fergusfinn.com/blog/control-layer/</guid><description>We&apos;re releasing our AI Gateway. Why are AI gateways hard to do right, and why do we think we&apos;ve done it right?
</description><pubDate>Tue, 21 Oct 2025 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/control-layer/md</mdUrl></item><item><title>The economics of speculative decoding</title><link>https://fergusfinn.com/blog/economics-of-speculative-decoding/</link><guid isPermaLink="true">https://fergusfinn.com/blog/economics-of-speculative-decoding/</guid><description>Two underexplored axes: what MoE routing does to the decode roofline, and how
compressed attention takes away the slack that used to make speculated tokens free.
</description><pubDate>Mon, 08 Jun 2026 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/economics-of-speculative-decoding/md</mdUrl></item><item><title>Anatomy of a high performance EP kernel</title><link>https://fergusfinn.com/blog/anatomy-of-a-high-performance-ep-kernel/</link><guid isPermaLink="true">https://fergusfinn.com/blog/anatomy-of-a-high-performance-ep-kernel/</guid><description>A walkthrough of the anatomy of an expert-parallel dispatch and combine kernel.
</description><pubDate>Fri, 05 Jun 2026 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/anatomy-of-a-high-performance-ep-kernel/md</mdUrl></item><item><title>Pushing memory bound kernels beyond the speed of light with lossless decompression</title><link>https://fergusfinn.com/blog/faster-than-speed-of-light/</link><guid isPermaLink="true">https://fergusfinn.com/blog/faster-than-speed-of-light/</guid><description>The fastest a memory bound kernel can go is set by the time required to transfer the data to the SMs. How can we do better?
</description><pubDate>Tue, 26 May 2026 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/faster-than-speed-of-light/md</mdUrl></item><item><title>Cloudburst: 70x faster cold(ish) starts for SGLang</title><link>https://fergusfinn.com/blog/fast-sglang-starts/</link><guid isPermaLink="true">https://fergusfinn.com/blog/fast-sglang-starts/</guid><description>Checkpoint/restore with CRIU and cuda-checkpoint, from 12 minutes to 10 seconds on a B200.
</description><pubDate>Mon, 06 Apr 2026 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/fast-sglang-starts/md</mdUrl></item><item><title>How fast can an LLM go?</title><link>https://fergusfinn.com/blog/inference-arithmetic/</link><guid isPermaLink="true">https://fergusfinn.com/blog/inference-arithmetic/</guid><description>Comparing InferenceMAX to the hardware limits
</description><pubDate>Wed, 22 Oct 2025 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/inference-arithmetic/md</mdUrl></item><item><title>Speculative KV coding: losslessly compressing KV cache by up to ~4× using a predictor model</title><link>https://fergusfinn.com/blog/kv-entropy-coder/</link><guid isPermaLink="true">https://fergusfinn.com/blog/kv-entropy-coder/</guid><description>Lossless compression of a target model&apos;s KV cache by up to 4×, using a cheaper predictor model to drive an arithmetic coder.
</description><pubDate>Fri, 08 May 2026 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/kv-entropy-coder/md</mdUrl></item><item><title>LLM guided scheduling</title><link>https://fergusfinn.com/blog/llm-guided-scheduling/</link><guid isPermaLink="true">https://fergusfinn.com/blog/llm-guided-scheduling/</guid><description>An idea on how to use LLMs to help with scheduling
</description><pubDate>Tue, 23 Sep 2025 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/llm-guided-scheduling/md</mdUrl></item><item><title>Paged attention</title><link>https://fergusfinn.com/blog/paged-attention/</link><guid isPermaLink="true">https://fergusfinn.com/blog/paged-attention/</guid><description>A discussion of paged attention
</description><pubDate>Mon, 22 Sep 2025 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/paged-attention/md</mdUrl></item><item><title>Scheduling in inference engines</title><link>https://fergusfinn.com/blog/scheduling-in-inference-engines/</link><guid isPermaLink="true">https://fergusfinn.com/blog/scheduling-in-inference-engines/</guid><description>A look into how inference engines choose which requests to process
</description><pubDate>Mon, 22 Sep 2025 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/scheduling-in-inference-engines/md</mdUrl></item><item><title>Using caching for fast speculative decoding</title><link>https://fergusfinn.com/blog/spacelike-speculative-decoding/</link><guid isPermaLink="true">https://fergusfinn.com/blog/spacelike-speculative-decoding/</guid><description>Speculative decoding speeds up LLM inference, but using another model works poorly.
</description><pubDate>Mon, 22 Sep 2025 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/spacelike-speculative-decoding/md</mdUrl></item><item><title>Scaling Curation with LLM Comparisons</title><link>https://fergusfinn.com/blog/llm-powered-content-discovery/</link><guid isPermaLink="true">https://fergusfinn.com/blog/llm-powered-content-discovery/</guid><description>Building a content discovery system using parallel primitives and BST-based ranking with LLM comparisons
</description><pubDate>Fri, 16 Jan 2026 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/llm-powered-content-discovery/md</mdUrl></item><item><title>In search of wasted bits: how much information do LLM weights carry?</title><link>https://fergusfinn.com/blog/weight-entropy/</link><guid isPermaLink="true">https://fergusfinn.com/blog/weight-entropy/</guid><description>An empirical investigation into the byte-level entropy of model weights
across numeric formats and model families.
</description><pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/weight-entropy/md</mdUrl></item><item><title>tANS: precomputing rANS</title><link>https://fergusfinn.com/blog/understanding-tans/</link><guid isPermaLink="true">https://fergusfinn.com/blog/understanding-tans/</guid><description>An intro to tANS: a table-based entropy coder that removes rANS&apos;s per-symbol division while keeping Shannon-optimal compression.
</description><pubDate>Mon, 27 Apr 2026 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/understanding-tans/md</mdUrl></item><item><title>Also-rANS: Asymmetric Numeral Systems for entropy coding</title><link>https://fergusfinn.com/blog/understanding-rans/</link><guid isPermaLink="true">https://fergusfinn.com/blog/understanding-rans/</guid><description>An intro to rANS: an entropy coding method for losslessly encoding &amp; decoding streams of bytes quickly.
</description><pubDate>Tue, 21 Apr 2026 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/understanding-rans/md</mdUrl></item><item><title>Weighted random fallback flattens to uniform under high error rates</title><link>https://fergusfinn.com/blog/weighted-fallback-flattening/</link><guid isPermaLink="true">https://fergusfinn.com/blog/weighted-fallback-flattening/</guid><description>When a weighted-random fallback rejects samples and retries without replacement, high error rates cause low-weight models to be selected far more often than their weights suggest.
</description><pubDate>Tue, 17 Feb 2026 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/weighted-fallback-flattening/md</mdUrl></item><item><title>Parallel Primitives for Multi-Agent Workflows</title><link>https://fergusfinn.com/blog/parallel-primitives-blog/</link><guid isPermaLink="true">https://fergusfinn.com/blog/parallel-primitives-blog/</guid><description>Exploring coordination patterns from parallel computing for multi-agent LLM systems
</description><pubDate>Wed, 31 Dec 2025 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/parallel-primitives-blog/md</mdUrl></item><item><title>LLM powered data structures: A concurrent, lock-free binary search tree</title><link>https://fergusfinn.com/blog/bst-expensive-comparisons/</link><guid isPermaLink="true">https://fergusfinn.com/blog/bst-expensive-comparisons/</guid><description>A lock-free binary search tree optimized for expensive async comparisons, with a threaded linked list for O(1) sorted iteration
</description><pubDate>Tue, 13 Jan 2026 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/bst-expensive-comparisons/md</mdUrl></item><item><title>Large-Scale Semantic Search Without Embeddings</title><link>https://fergusfinn.com/blog/arxiv-llm-search/</link><guid isPermaLink="true">https://fergusfinn.com/blog/arxiv-llm-search/</guid><description>Applying parallel primitives to search and rank 2.4 million arXiv papers using LLM judgments
</description><pubDate>Thu, 01 Jan 2026 00:00:00 GMT</pubDate><mdUrl>https://fergusfinn.com/blog/arxiv-llm-search/md</mdUrl></item></channel></rss>