A lock-free binary search tree optimized for expensive async comparisons, with a threaded linked list for O(1) sorted iteration
Applying parallel primitives to search and rank 2.4 million arXiv papers using LLM judgments
Exploring coordination patterns from parallel computing for multi-agent LLM systems
Comparing InferenceMAX to the hardware limits
Benchmarking the Doubleword Control Layer
We're releasing our AI Gateway. Why AI gateways are hard to do right, and why we think we've done it right.
An idea for using LLMs to help with scheduling
A discussion of paged attention
A look into how inference engines choose which requests to process
Speculative decoding speeds up LLM inference, but using a separate draft model works poorly.