InfiniBand, RoCE, and all that

The standard path for sending data over a network works like this:

The standard path. On each host the data is copied through a kernel buffer (the orange hops), so the CPU is in the critical path on both ends.

There are variations and optimizations, but the basic shape has the kernel involved on both sides, and the data being copied into a kernel buffer. For a web server handling HTTP requests this is fine, since the per-message overhead is negligible relative to all the stuff the application wants to do with the data.

There are lots of places where this is not so fine. The ones that matter most nowadays are in AI training and inference. In training, a gradient all-reduce (the step where hundreds of GPUs each combine their gradients with everyone else’s before the next step can start) is a barrier that every GPU has to wait on. In distributed inference, an expert-parallel (EP) kernel synchronizes the waiting GPUs in much the same way. (EP is just an example; all the other parallelisms, except data-parallel, have the same property.)

What these workloads need is for data to move directly from an application buffer on one machine to an application buffer on another, without either CPU touching it in the critical path, and without a memory copy. This is Remote Direct Memory Access (RDMA):

RDMA. The NIC reads and writes the application buffers directly, so the kernel is bypassed and nothing is copied.

Building hardware that provides it reliably and efficiently is the problem InfiniBand is designed to solve.

InfiniBand as the answer

In 1999, the industry agreed that RDMA was something that needed doing. Two competing proposals merged into a single effort and produced the InfiniBand Trade Association: Future I/O (a switched-fabric interconnect for host-to-host and host-to-I/O communication; see Future I/O (IEEE)), backed by Compaq, HP, and IBM, and Next Generation I/O (NGIO, which Intel announced in November 1998; see Intel Introduces Next Generation I/O for Computing Servers), backed by Intel, Microsoft, and Sun. Version 1.0 of the spec followed in 2000. The ambition was extraordinary: InfiniBand was not designed as a networking technology but as a replacement for the entire server I/O stack, meaning the PCI bus for device I/O, Ethernet for networking, and Fibre Channel for storage. The vision went further than even the I/O-stack framing suggests: devices would attach to the fabric as endpoints rather than as slots on a local bus. (Wikipedia has a good summary of the early history.)

The result was designed to be technically coherent from the ground up. The big idea is credit-based flow control at the link layer: a sender cannot transmit unless the receiver has signaled it has buffer space. This makes the fabric inherently lossless. Losslessness isn’t strictly required for RDMA, but it makes the transport much simpler: nothing in the fast path has to recover from a dropped packet. The programming model that grew up around this, “the verbs API”, is a coherent stack built on the guarantees the fabric provides. (“Verbs” is not a formal API specification. The IBTA spec defines a set of abstract operations, such as opening a device or posting a send request, that must exist and behave in certain ways, without prescribing an exact interface. The de facto implementation is libibverbs, developed by the OpenFabrics Alliance, which exposes those operations as functions such as ibv_open_device and ibv_post_send; the kernel-side support it relies on was merged into Linux in 2005.)
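To make the credit mechanism concrete, here is a minimal Python sketch of a single link. It is a toy model under stated assumptions, not a description of real InfiniBand hardware: the names CreditLink and simulate, the one-credit-per-buffer-slot accounting, and the tick-based timing are all invented for illustration. What it tries to show is that a slow receiver makes the sender stall rather than lose a packet.

```python
from collections import deque


class CreditLink:
    """Toy model of one link with credit-based flow control (hypothetical)."""

    def __init__(self, receiver_buffer_slots: int):
        # Assumption: the receiver grants one credit per free buffer slot.
        self.capacity = receiver_buffer_slots
        self.credits = receiver_buffer_slots
        self.receiver_buffer = deque()

    def try_send(self, packet) -> bool:
        """Transmit only if the sender holds a credit."""
        if self.credits == 0:
            return False  # the sender waits; the receiver never has to drop
        self.credits -= 1
        self.receiver_buffer.append(packet)
        return True

    def receiver_drain(self, n: int = 1) -> int:
        """Receiver consumes up to n packets and returns one credit for each."""
        drained = 0
        while drained < n and self.receiver_buffer:
            self.receiver_buffer.popleft()
            self.credits += 1
            drained += 1
        return drained


def simulate(num_packets: int, buffer_slots: int, drain_every: int) -> dict:
    """Sender offers one packet per tick; receiver drains one every drain_every ticks."""
    link = CreditLink(buffer_slots)
    sent = delivered = stalls = tick = 0
    max_depth = 0
    while delivered < num_packets:
        tick += 1
        if sent < num_packets:
            if link.try_send(sent):
                sent += 1
            else:
                stalls += 1
        max_depth = max(max_depth, len(link.receiver_buffer))
        if tick % drain_every == 0:
            delivered += link.receiver_drain(1)
    return {"sent": sent, "delivered": delivered,
            "stalls": stalls, "max_depth": max_depth}


if __name__ == "__main__":
    print(simulate(num_packets=20, buffer_slots=4, drain_every=3))
```

In this model every packet that is sent is eventually delivered, and the receive buffer never holds more than its advertised slots. The only cost of a slow receiver is the stall count on the sender side.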

InfiniBand then lost almost every battle it entered. PCIe won the device I/O bus. Ethernet held general networking. Fibre Channel held storage. The main place InfiniBand survived and thrived was high-performance computing: fluid dynamics, molecular dynamics, climate models. At that kind of scale and coupling, interconnect latency is a direct ceiling on how fast the simulation runs, and people would pay for a dedicated fabric to push that ceiling up.

The founding consortium members mostly lost interest as the empire shrank to a single niche they were not primarily in the business of serving. The company that remained was Mellanox, founded in 1999 specifically to build InfiniBand silicon, which ended up dominating the InfiniBand NIC and switch market and consolidated it further by buying rivals like the switch vendor Voltaire. The people who still needed InfiniBand really needed it, and they had nowhere else to go. That concentration built slowly over twenty years until NVIDIA acquired Mellanox in 2020 for $6.9 billion, on the logic that controlling the GPU and the interconnect together meant owning the full AI infrastructure stack. (At the time, this was NVIDIA’s largest acquisition. Intel had made a late attempt to compete with its own fabric, Omni-Path, after acquiring QLogic’s InfiniBand assets in 2012, and then exited the market entirely in 2019, selling the business to Cornelis Networks. That left Mellanox as the uncontested monopolist just as AI training demand was beginning to make InfiniBand switches and NICs very valuable hardware.)

The island quality of InfiniBand is inseparable from its technical coherence. It is a complete, purpose-built stack: its own physical layer, its own link-layer flow control, its own routing, its own transport, its own programming model. Each layer was designed against the one below it, and the result works extremely well. The cost is that every machine on an InfiniBand fabric needs InfiniBand cables, InfiniBand switches, InfiniBand NICs, and operators who understand all of it. Anywhere the benefit was not compelling enough to justify a dedicated fabric, it simply did not go.

The allure of Ethernet

The cost of InfiniBand’s island quality was more than just the high cost of Mellanox hardware. Every server needed two sets of cables, two NICs (one Ethernet for TCP traffic, one InfiniBand for RDMA), two switch fabrics, and teams with expertise in different systems. Storage then also ran on a third fabric, Fibre Channel, with its own specialists. Large datacenters in the mid-2000s were managing three distinct networks that didn’t talk to each other, each with its own cabling, spares, and operational procedures.

The obvious solution was to consolidate everything onto Ethernet, which everyone already owned and understood. The problem is that standard Ethernet is lossy by design: when a switch buffer fills up, it drops packets. For TCP traffic this is acceptable, because TCP detects loss at its own layer and retransmits. For the RDMA transport, however, it is not. The transport was built assuming a lossless fabric, so its recovery is crude: a single dropped packet forces the connection (a queue pair, in RDMA terms) to retransmit the lost packet and everything sent after it, a scheme known as go-back-N, and sustained loss tips the queue pair into an error state the application has to tear down and reconnect.
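The cost of go-back-N is easy to underestimate, so here is a small Python sketch that counts wire transmissions. It is a simplified model, not the actual RDMA transport state machine: the function names, the fixed window of in-flight packets, and the assumption that each listed packet is lost exactly once are all illustrative choices.

```python
def go_back_n_transmissions(num_packets: int, lost_once: set, window: int) -> int:
    """Count packets put on the wire under a toy go-back-N sender.

    Assumptions (hypothetical model): the sender emits bursts of up to
    `window` packets, the receiver accepts strictly in order, and every
    sequence number in `lost_once` is dropped the first time it is sent.
    """
    pending_loss = set(lost_once)
    base = 0              # first sequence number not yet accepted in order
    transmissions = 0
    while base < num_packets:
        burst = range(base, min(base + window, num_packets))
        transmissions += len(burst)
        for seq in burst:
            if seq in pending_loss:
                pending_loss.discard(seq)  # the retransmission will get through
                break                      # the rest of the burst is thrown away
            base = seq + 1
    return transmissions


def selective_repeat_transmissions(num_packets: int, lost_once: set) -> int:
    """Baseline for comparison: only the lost packets are sent again."""
    return num_packets + len(lost_once)


if __name__ == "__main__":
    print(go_back_n_transmissions(100, {10}, window=32))
    print(selective_repeat_transmissions(100, {10}))
```

With 100 packets, a window of 32, and a single loss at packet 10, the go-back-N model sends 122 packets where a selective scheme would send 101. The extra 21 are packets that arrived intact and were discarded anyway, and the penalty grows with the number of packets in flight, which is exactly what a fast link has a lot of.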

People really did want to run RDMA over Ethernet though. The fix was Data Center Bridging (DCB), a set of IEEE extensions developed in the late 2000s specifically to make Ethernet lossless enough for storage and RDMA traffic. (Data Center Bridging on Wikipedia is a good overview; the IEEE 802.1 DCB task group pages collect the standards and project history.) The key piece is Priority Flow Control (PFC). When a switch’s buffer begins to fill, the switch sends a PAUSE frame upstream on a per-priority basis, telling the sender to stop transmitting that traffic class before any packets are dropped. Traffic classes that don’t need lossless behavior (ordinary TCP) get their own priority lane and are unaffected. Two supporting standards also helped: Enhanced Transmission Selection (ETS) schedules bandwidth across priority classes, and DCBX handles discovery and negotiation so that switches and NICs agree on which classes exist and what rules apply to each.
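The per-priority part is what separates PFC from a plain link-wide pause, and a short Python sketch may help. This is a toy model with invented names (IngressPort, offer, the xoff threshold) and arbitrary numbers. It also ignores the time a PAUSE frame takes to reach the sender, which real switches have to cover with extra buffer headroom.

```python
class IngressPort:
    """Toy switch port with one queue per priority class (hypothetical)."""

    def __init__(self, capacity: int, xoff: int, lossless: set):
        self.capacity = capacity  # buffer slots per priority queue
        self.xoff = xoff          # depth at which a lossless class sends PAUSE
        self.lossless = lossless  # priorities protected by PFC
        self.depth = {}           # priority -> packets currently queued
        self.dropped = {}         # priority -> packets dropped so far

    def paused(self, priority: int) -> bool:
        """True while this port is asking upstream to pause the priority."""
        return (priority in self.lossless
                and self.depth.get(priority, 0) >= self.xoff)

    def offer(self, priority: int) -> str:
        """Upstream offers one packet; report what happens to it."""
        if self.paused(priority):
            return "held upstream"  # the sender obeys PAUSE and keeps the packet
        if self.depth.get(priority, 0) >= self.capacity:
            self.dropped[priority] = self.dropped.get(priority, 0) + 1
            return "dropped"        # lossy class: ordinary tail drop
        self.depth[priority] = self.depth.get(priority, 0) + 1
        return "queued"


if __name__ == "__main__":
    # Assumption: priority 3 carries RDMA, priority 0 carries ordinary TCP.
    port = IngressPort(capacity=8, xoff=6, lossless={3})
    print([port.offer(3) for _ in range(10)])
    print([port.offer(0) for _ in range(10)])
```

With nothing draining the port, the lossless class queues six packets and holds the remaining four at the sender, while the lossy class queues eight and drops two. The pause on one priority does not change what happens to the other.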

PFC achieves losslessness by backpressure signaling. Once a buffer fills far enough, a PAUSE frame gets sent upstream to the sender. If that sender’s buffers are also filling, it has to PAUSE its upstream neighbor in turn, so the PAUSE can cascade backward through the fabric. In pathological cases, these cascading pauses can spread congestion across links that had nothing to do with the original problem. This failure mode is called a PFC pause storm, and some designs sidestep it by giving up on lossless entirely. AWS’s Elastic Fabric Adapter (EFA) does exactly that: its SRD (Scalable Reliable Datagram) transport, exposed through a custom libfabric provider, sprays each flow across many paths and reorders in software, which gives reliable delivery over ordinary lossy Ethernet with no PFC required. See A Cloud-Optimized Transport Protocol for Elastic and Scalable HPC (IEEE Micro, 2020).
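The cascade can be seen in a few lines of Python. The sketch below is a deliberately crude model, assuming a single priority class, a straight chain of switches feeding one slow receiver, and made-up thresholds; simulate_pfc_chain and its parameters are hypothetical names, not anything from a real switch.

```python
def simulate_pfc_chain(num_switches=3, capacity=8, xoff=6, xon=2,
                       ticks=60, drain_every=4):
    """Toy chain: sender -> switch 0 -> ... -> switch N-1 -> slow receiver.

    Returns (first_pause_tick, dropped). first_pause_tick[i] is the tick at
    which switch i first PAUSEd its upstream neighbor (None if it never did).
    """
    queues = [0] * num_switches       # queue depth per switch
    pausing = [False] * num_switches  # pausing[i]: switch i has paused its upstream
    first_pause_tick = [None] * num_switches
    dropped = 0

    for tick in range(1, ticks + 1):
        # The receiver is the bottleneck: it takes one packet every drain_every ticks.
        if tick % drain_every == 0 and queues[-1] > 0:
            queues[-1] -= 1
        # Each switch forwards one packet per tick unless its downstream paused it.
        for i in range(num_switches - 2, -1, -1):
            if queues[i] > 0 and not pausing[i + 1]:
                queues[i] -= 1
                queues[i + 1] += 1
        # The sender injects one packet per tick unless switch 0 paused it.
        if not pausing[0]:
            queues[0] += 1
        # Each switch sets or clears PAUSE based only on its own queue depth.
        for i in range(num_switches):
            if queues[i] > capacity:  # overflow would mean a dropped packet
                dropped += queues[i] - capacity
                queues[i] = capacity
            if queues[i] >= xoff:
                pausing[i] = True
                if first_pause_tick[i] is None:
                    first_pause_tick[i] = tick
            elif queues[i] <= xon:
                pausing[i] = False
    return first_pause_tick, dropped


if __name__ == "__main__":
    print(simulate_pfc_chain())
```

In this model nothing is ever dropped, but the switch next to the slow receiver pauses first, then the one behind it, and finally the sender itself. In a real fabric each of those paused links may also be carrying traffic for other destinations, which is how the congestion spreads.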

But it works well enough. Once you have a lossless Ethernet substrate, the obstacle to running RDMA over it is removed.

RoCE: IB on Ethernet

To run RDMA on Ethernet, RoCE (RDMA over Converged Ethernet) just takes the InfiniBand transport layer and runs it over Ethernet, changing as little as possible. A NIC that speaks RoCE and a NIC that speaks InfiniBand implement the same programming model and differ only in the layers beneath the transport.

The first version, RoCE v1, put the IB transport directly inside an Ethernet frame. It was not IP-routable and worked only within a single broadcast domain. RoCE v2 wrapped the same transport in UDP/IP (UDP destination port 4791), making it routable across subnets and compatible with standard ECMP load balancing across fat-tree topologies. (Equal-Cost Multi-Path (ECMP) hashes flows across parallel paths, which is how a fat-tree distributes traffic. With RoCE v2, a flow is a standard IP 5-tuple and hashes normally. RoCE v1 is Ethernet-only and doesn’t hash across IP paths, which limits the topologies it can use.) RoCE v2 is what everyone uses today. The lossless substrate is still required (DCB and PFC configured end to end), but the traffic now looks like UDP to every piece of infrastructure between the endpoints, so it routes easily across standard networks.
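As a rough illustration of why the UDP wrapper matters for load balancing, here is a Python sketch of ECMP path selection over a 5-tuple. It is an assumption-heavy stand-in: real switches use their own hardware hash functions rather than SHA-256, and the function name ecmp_path, the addresses, and the source-port range are all invented for the example. Only the destination port, 4791, comes from RoCE v2 itself.

```python
import hashlib

ROCE_V2_UDP_PORT = 4791  # UDP destination port used by RoCE v2
UDP_PROTOCOL = 17        # IP protocol number for UDP


def ecmp_path(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              protocol: int, num_paths: int) -> int:
    """Pick one of num_paths equal-cost paths from a flow's 5-tuple.

    Assumption: SHA-256 stands in for whatever hash a real switch applies.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{protocol}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


if __name__ == "__main__":
    # The same flow always hashes to the same path, so its packets stay in order.
    flow = ("10.0.0.1", "10.0.1.1", 50000, ROCE_V2_UDP_PORT, UDP_PROTOCOL)
    print(ecmp_path(*flow, num_paths=4), ecmp_path(*flow, num_paths=4))

    # Flows that differ only in UDP source port can land on different paths.
    used = {ecmp_path("10.0.0.1", "10.0.1.1", sp, ROCE_V2_UDP_PORT,
                      UDP_PROTOCOL, num_paths=4)
            for sp in range(49152, 49152 + 64)}
    print(sorted(used))
```

Because the destination port is fixed, the UDP source port is the main field left to vary between flows that share a pair of hosts, so it is what gives the hash something to spread. The flip side is that a single large flow stays pinned to one path.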

The economic argument for this at scale is straightforward. InfiniBand switches are specialty hardware sold by one company, whereas Ethernet switches are commodity products with many vendors competing on price. Ethernet also brings an operational advantage: the people who run these networks already know it, already have monitoring and tooling built around it, and don’t need a separate team to manage a parallel IB fabric.

This is why much of the hyperscale buildout converged on RoCE for AI training. Meta, for instance, built a matched pair of 24k-GPU clusters for Llama 3 — one on RoCE, one on InfiniBand — and ran its largest model on the RoCE side, a fair signal that Ethernet RDMA holds up at the top end. At the same link rate the two perform comparably enough that the choice comes down to economics and operational simplicity.

The competitive pressure this created is visible in NVIDIA’s own product line. Having acquired Mellanox and with it a monopoly on InfiniBand, NVIDIA simultaneously sells Spectrum-X, an Ethernet networking platform optimized for AI workloads, with its own congestion control and adaptive routing layered on top of standard Ethernet. (SemiAnalysis covers the competitive dynamics in detail: Nvidia’s InfiniBand Problem.)

Where things stand

The market has not settled on a single answer.

InfiniBand remains strong in traditional HPC and in NVIDIA’s own reference architectures for large GPU clusters. Where island performance is the primary constraint, or NVIDIA gets to choose the design (its DGX SuperPOD reference clusters), InfiniBand still wins. NVIDIA sells both the GPU and the interconnect into these environments, and has every incentive to keep IB competitive.

RoCE is winning at hyperscale. The large cloud providers and AI labs building the biggest GPU clusters have converged on Ethernet-based RDMA, driven by economics and the operational leverage of a fabric they already understand.

Proprietary fabrics occupy the remaining niches: Slingshot for HPE’s government HPC systems (Frontier at Oak Ridge, Aurora at Argonne, and Isambard at Bristol), and EFA for AWS, whose SRD transport was designed for their own cloud network rather than adopted from IB or RoCE. (Slingshot comes out of Cray’s interconnect group, the team behind Gemini and Aries, which HPE inherited when it acquired Cray in 2019, though Slingshot itself is a ground-up Ethernet-compatible design rather than a continuation of the proprietary Aries fabric.)

There are now lots of different ways to do RDMA. Which one you pick comes down to what you can afford, what you already run, and who you’re willing to depend on.