RoCEv2, lossless Ethernet, DCQCN congestion control, Spectrum-4 switch architecture, and GPUDirect RDMA — the Ethernet path to AI-scale performance.
Standard Ethernet was designed for bursty web traffic, not the sustained, latency-sensitive all-to-all communication of GPU training. Spectrum-X adds hardware-level adaptive routing, lossless fabric, SHARP in-network compute, and ultra-low latency to deliver InfiniBand-class performance over familiar Ethernet infrastructure.
The heart of the fabric — 51.2 Tbps, 128 × 400GbE or 64 × 800GbE ports, hardware adaptive routing, SHARP compute, and built-in telemetry.
Not just a NIC — a SmartNIC with Arm cores optimized for RoCEv2 offload, GPUDirect RDMA, and ASAP² traffic acceleration.
MLNX-OS switch firmware, UFM (Unified Fabric Manager), NCCL plugins, SHARP daemons, and monitoring via NVIDIA DCGM and NetQ.
Each GPU NIC connects to its dedicated rail leaf switch — full bisection bandwidth, no oversubscription at the leaf layer
Unlike TCP, which handles loss with retransmit timers, RoCEv2 Reliable Connected mode expects an ordered, lossless path. A single dropped packet triggers a go-back-N retransmission that stalls the entire transfer. PFC + ECN are mandatory, not optional.
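One practical consequence: loss shows up as retry symptoms on the NIC rather than as TCP-style drop statistics. A quick way to spot it (a sketch — counter names as exposed by the mlx5 driver; paths can vary by driver version):

```bash
# Non-zero values here usually mean the "lossless" guarantee is broken somewhere
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/out_of_sequence
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/packet_seq_err
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/local_ack_timeout_err
```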
IEEE 802.1Qbb. Sends per-priority PAUSE frames upstream when a switch buffer fills, stopping traffic before drops occur.
IP-level congestion signaling. Switches mark packets with CE (Congestion Experienced) bits before buffers overflow.
The rate-control algorithm that ties ECN marks to sender-side rate reduction. Combines ideas from QCN (802.1Qau) and DCTCP.
Sent by the receiver NIC to the sender when an ECN-marked (CE-bit) packet is received. The sender uses CNPs to drive rate reduction via the α (alpha) parameter.
Controls rate reduction aggressiveness. On CNP reception: α = (1-g)×α + g. New rate = old rate × (1 − α/2). Higher α = more aggressive reduction.
When no CNPs received, sender probes upward: first with byte counter (fast), then with timer (slow). Prevents over-aggressive back-off.
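A worked instance of the reduction step from the update rule above (values purely illustrative — in practice g is a small constant such as 1/256):

```latex
% A CNP arrives with g = 1/16 and current \alpha = 0.5 (illustrative values)
\alpha' = (1-g)\,\alpha + g = \tfrac{15}{16}(0.5) + \tfrac{1}{16} = 0.53125
R' = R\left(1 - \tfrac{\alpha'}{2}\right) = R\,(1 - 0.2656) \approx 0.734\,R
```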
| QP Type | Description |
|---|---|
| RC | Reliable Connected — ordered, no loss. Used for NCCL point-to-point. Most common. |
| UD | Unreliable Datagram — connectionless. Used for multicast and control traffic. |
| UC | Unreliable Connected — connection-oriented but no reliability. Rarely used. |
| XRC | Extended RC — shared QP for many-to-one patterns. Reduces QP explosion in large clusters. |
| Operation | Pattern |
|---|---|
| RDMA WRITE | Sender pushes data to remote memory — remote CPU uninvolved. Used in NCCL. |
| RDMA READ | Sender pulls data from remote memory. Initiator drives. Lower GPU utilization impact. |
| SEND/RECV | Both sides participate. Required for connection setup and control. |
| Atomic | Fetch-and-add, compare-and-swap on remote memory. Used for distributed locks. |
RoCEv2 uses GIDs instead of InfiniBand LIDs for addressing. A GID is a 128-bit identifier derived from the interface's IPv6 address (using EUI-64 or a mapped IPv4 address). When configuring NCCL, NCCL_IB_GID_INDEX selects which GID to use — critical because RoCEv1 and RoCEv2 entries share the GID table (index 0 is typically RoCEv1; RoCEv2 entries appear at higher indices).
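A minimal inspection-and-selection sketch (the device name and index layout are illustrative — the GID table varies per host, so always verify before setting the index):

```bash
# List the GID table: index, GID, derived IP, and RoCE version per entry
show_gids mlx5_0

# On many systems the IPv4-mapped RoCEv2 entry sits at index 3; verify, then:
export NCCL_IB_GID_INDEX=3    # point NCCL at the RoCEv2 entry
export NCCL_IB_HCA=mlx5_0     # optionally pin NCCL to this HCA
```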
| Spectrum-4 Key Specifications | |
|---|---|
| Aggregate BW | 51.2 Tbps |
| Port configurations | 128 × 400GbE or 64 × 800GbE |
| Forwarding mode | Cut-through (ultra-low latency) |
| Buffer architecture | Shared packet buffer with VOQ |
| Adaptive Routing | Hardware per-packet, no SM |
| SHARP support | Yes — in-network AllReduce |
| Telemetry | INT, gRPC streaming, SNMP |
| PFC priorities | 8 traffic classes (IEEE 802.1Qbb) |
| ECN marking | Per-queue threshold, WRED |
| Predecessor | Spectrum-3 (25.6 Tbps) |
| BlueField-3 SuperNIC Specs | |
|---|---|
| Network speed | Up to 400GbE (1 × 400GbE or 2 × 200GbE ports) |
| RDMA protocol | RoCEv2 hardware offload |
| GPUDirect | RDMA + Storage (GDS) |
| Arm cores | 16× Armv8.2+ A78 |
| PCIe | Gen5 × 16 |
| ASAP² | Accelerated Switch and Packet Processing |
| Controller | ConnectX-7 |
| Predecessor | BlueField-2 (up to 200GbE) |
Each GPU NIC is assigned to a dedicated "rail" leaf switch. All intra-rail GPU communications stay at the leaf layer. Inter-node traffic uses spine. No oversubscription at leaf layer.
Classic 3-tier (edge-aggregation-core) or 2-tier Clos. Provides full bisection bandwidth. More flexible port utilization but requires ECMP/AR tuning for AI traffic.
InfiniBand AR is enabled by the Subnet Manager per-port. Spectrum-X AR is fully hardware-driven inside the Spectrum-4 ASIC — no SM equivalent needed. The switch monitors per-port congestion in real time and reroutes packets to less-loaded paths. Works at per-packet granularity.
Instead of GPU → CPU → network → CPU → GPU for AllReduce, SHARP delegates the aggregation to the switch ASIC itself: gradient data is summed as it streams through the switch, so each tensor crosses the fabric once instead of bouncing between endpoints.
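How this is typically switched on from the NCCL side (a sketch — assumes the nccl-rdma-sharp plugin and the SHARP aggregation manager are already deployed per NVIDIA's documentation):

```bash
export NCCL_COLLNET_ENABLE=1    # allow NCCL's CollNet path, which backs onto SHARP
export NCCL_ALGO=CollnetDirect  # optionally force the in-network algorithm for testing
```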
What is the aggregate bandwidth of Spectrum-4?
51.2 Tbps
128 × 400GbE or 64 × 800GbE port configurations. Cut-through forwarding for minimum latency.
What makes BlueField-3 a "SuperNIC" vs a standard NIC?
Arm cores + ConnectX-7
Embedded Arm compute allows offloading security, storage, and network functions. Standard NICs are just forwarding devices.
Why is rail-optimized topology preferred for AI training?
GPU-to-leaf 1:1 mapping
Each GPU NIC connects to a dedicated leaf rail. AllReduce ring traffic stays within a single rail — no spine traversal for intra-step gradients.
Where does SHARP perform AllReduce aggregation?
Inside the switch ASIC
Spectrum-4 performs SUM/MAX/MIN on gradient data as it flows through the switch — no CPU or GPU involved in the aggregation step.
Correct DSCP-to-PFC-priority mapping is the single most common RoCEv2 misconfiguration. Every hop must agree.
| Traffic Class | DSCP Value | PFC Priority | Lossless? | Notes |
|---|---|---|---|---|
| RoCEv2 / RDMA | 26 (AF31) | Priority 3 | ✅ Yes | NCCL default; must match on NIC, switch, and spine |
| Storage (NVMe-oF) | 24 (CS3) | Priority 3 | ✅ Yes | Can share priority 3 with RoCEv2 in mixed clusters |
| Network Control | 48 (CS6) | Priority 6 | ⚠️ Partial | BGP, LACP, STP — must not be blocked by PFC |
| General (lossy) | 0 (BE) | Priority 0 | ❌ No | Management, SSH, HTTP — unaffected by RoCEv2 PFC |
| Cluster Management | 16 (CS2) | Priority 2 | ❌ No | Kubernetes control plane, health checks |
Set trust DSCP on NIC-facing switch ports (hosts mark their own DSCP). Set trust PCP on trunk/uplink ports between switches. A mismatched trust mode causes DSCP re-marking and PFC on the wrong priority.
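A host-side sketch of the above (the netdev name is illustrative; switch-side syntax depends on the NOS):

```bash
mlnx_qos -i enp1s0f0 --trust dscp             # NIC-facing behavior: trust host DSCP marks
mlnx_qos -i enp1s0f0 --pfc 0,0,0,1,0,0,0,0    # lossless service on priority 3 only
mlnx_qos -i enp1s0f0                          # re-read config to confirm both took effect
```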
Allows the NIC's DMA engine to directly access GPU frame buffer memory — skipping the CPU and system DRAM entirely.
- cudaMalloc / cuMemAlloc — GPU buffers allocated this way are what nvidia-peermem exposes for GPUDirect RDMA
- ip link show, ethtool -k — verify link state and offload settings
- mlnx_qos -i <netdev> to inspect PFC/priority configuration
- cpupower idle-set -D 0 — disable deep CPU idle states for consistent latency
- ethtool -S for pause frame counters

```bash
# Note: ip/ethtool/mlnx_qos take the Ethernet netdev name (e.g. enp1s0f0);
# rdma, ibv_*, perfquery, and perftest tools take the RDMA device name (e.g. mlx5_0).

# QoS inspection (NIC side)
mlnx_qos -i <netdev>                              # Show PFC and priority settings
mlnx_qos -i <netdev> --pfc 0,0,0,1,0,0,0,0        # Enable PFC on priority 3 only
ethtool -k <netdev>                               # Check offload settings (GRO, LRO, etc.)
ethtool -S <netdev> | grep pfc                    # PFC pause frame counters

# RDMA statistics
rdma stat show                                    # All RDMA port counters
rdma stat show -a                                 # Extended auto-mode counters
perfquery -x -C mlx5_0 -P 1                       # Extended port counters (InfiniBand compat)

# RDMA performance testing
ib_write_bw -d mlx5_0 -D 30 --report_gbits        # RDMA write bandwidth (server)
ib_write_bw -d mlx5_0 <server_ip> --report_gbits  # Client side
ib_read_lat -d mlx5_0 <server_ip>                 # Read latency test

# DCQCN / CNP counters
ethtool -S <netdev> | grep cnp                    # CNP sent/received (congestion signals)
ethtool -S <netdev> | grep ecn                    # ECN-marked packet counters

# RoCEv2 GID inspection
show_gids                                         # Show all GID table entries
ibv_devinfo -d mlx5_0 -v                          # Device info + GIDs verbose

# Switch-side (MLNX-OS)
show qos interface ethernet 1/1                   # Interface QoS counters
show roce cnp-rx                                  # CNP receive stats on switch
show interfaces ethernet 1/1 counters             # Port counters
```
Before declaring a Spectrum-X cluster production-ready: (1) ib_write_bw ≥ 380 Gbps per 400GbE link; (2) CNP counter stays at 0 under normal load (non-zero = congestion, check thresholds); (3) PFC pause counters increment only under extreme load — frequent pausing means ECN thresholds are set too high; (4) MTU consistency verified with ping -M do -s 8972 across all fabric hops.
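The same checklist as runnable one-liners (a sketch — device and peer names are placeholders):

```bash
ib_write_bw -d mlx5_0 -D 30 --report_gbits    # on server; run client with <server_ip>
ethtool -S <netdev> | grep -E 'cnp|ecn'       # expect ~0 CNPs under normal load
ethtool -S <netdev> | grep pfc                # pause counters should stay near-static
ping -M do -s 8972 -c 3 <peer_ip>             # 8972 B payload + 28 B headers = 9000 MTU
```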
| Feature | 🔵 Spectrum-X (RoCEv2) | 🟣 InfiniBand (NDR/HDR) |
|---|---|---|
| Transport Protocol | UDP/IP (L3-routable) — works across IP subnets | Native IB transport — subnet-scoped, needs SM |
| Max Link Speed | 800GbE (Spectrum-4 ready) | NDR: 400 Gb/s per port |
| Latency (port-to-port) | ~300–500 ns (cut-through) | ~100–300 ns (native IB) |
| Lossless Fabric | Requires PFC + ECN + DCQCN tuning | Built-in — IB transport is inherently lossless |
| Congestion Control | DCQCN (ECN + rate adaptation) | Hardware FECN → BECN → rate reduction at the HCA (configured via SM) |
| Adaptive Routing | Hardware per-packet (Spectrum-4 ASIC, no SM) | SM-computed per-QP (configurable AR in Quantum-2) |
| In-Network Compute | SHARP on Spectrum-4 switches | SHARP on Quantum-2 switches |
| Network Management | Standard Ethernet (SNMP, gNMI, BGP, LLDP) | OpenSM, UFM (NVIDIA-proprietary) |
| Subnet Manager | Not required — L3 IP routing | Required — OpenSM or UFM |
| Routing Algorithm | ECMP, Adaptive Routing, Flowlet | ftree, MINHOP, DFSSSP, Adaptive Routing |
| Addressing | IPv4/IPv6 + GID (derived from IP) | LID (local) + GID (global), assigned by SM |
| Ecosystem | Broad — any Ethernet vendor for uplinks, standard tooling | NVIDIA (Mellanox) proprietary ecosystem |
| Operational Complexity | Lower — familiar Ethernet operations | Higher — SM config, LID space, IB partitions |
| GPUDirect RDMA | Yes — RoCEv2 GPUDirect | Yes — native IB GPUDirect |
| Best for | Brownfield Ethernet, multi-tenant, scale-out LLM training | MPI-heavy HPC, minimum latency, tightly coupled jobs |
| Aspect | RoCEv2 | Native InfiniBand |
|---|---|---|
| L3 protocol | UDP/IP (port 4791) | IB transport (no IP) |
| Addressing | IP address + GID | LID (local) + GID (global) |
| Routing scope | Inter-subnet (L3 routable) | Intra-subnet only (LID scope) |
| Congestion response | DCQCN (α-based rate control on the NIC) | Hardware FECN/BECN (HCA-driven) |
| Lossless requirement | Must configure PFC + ECN | Built-in at protocol level |
| MTU typical | 9000 bytes (jumbo) | 4096 bytes (IB standard) |
| Verbs API | Same libibverbs / ibv_* calls | Same libibverbs / ibv_* calls |
| NCCL support | NCCL IB plugin (same code path) | NCCL IB plugin (same code path) |
NVIDIA positions Spectrum-X as achieving "IB-class performance on Ethernet" for AI training workloads. This is achieved through: hardware adaptive routing (no SM jitter), SHARP AllReduce (in-network aggregation), and lossless fabric (PFC + DCQCN). The claim is validated for NCCL-based AllReduce at scale — but native IB still wins on raw latency for MPI-heavy workloads.
The three mechanisms that make RoCEv2 work: PFC (stop traffic before drops), ECN (mark before overflow), and DCQCN (tie the marks to sender-side rate control).
When congestion happens: CE bit marked → receiver sends CNP → sender increases α → sender reduces TX rate. No CNP = rate recovers.
SHARP does the AllReduce inside the Spectrum-4 switch ASIC — not in the GPU, not in the CPU. Think: the switch is doing your gradient addition for you.
Each GPU NIC connects to its own dedicated leaf switch rail. 8 NICs per node → 8 rails. AllReduce ring stays within a single rail — never touches the spine.
In IB, addresses are LIDs (SM-assigned). In RoCEv2, addresses are GIDs derived from IP/IPv6. GID index 0 is typically RoCEv1; RoCEv2 entries sit at higher indices. Set NCCL_IB_GID_INDEX accordingly — commonly 3 for RoCEv2 over IPv4, but verify with show_gids.
DSCP 26 (AF31) → PFC Priority 3 → Lossless queue. This mapping must be consistent on every NIC, every leaf, and every spine. One mismatch = lossy RDMA.
GPUDirect RDMA: the NIC DMA engine reads/writes GPU memory directly. CPU is completely out of the data path. Requires: nvidia-peermem loaded, sufficient BAR1 aperture, and NIC + GPU on the same PCIe root/NUMA node for full bandwidth.
Spectrum-4 = 51.2 Tbps aggregate. 128 × 400GbE or 64 × 800GbE ports. Its predecessor Spectrum-3 was 25.6 Tbps — exactly half. Easy doubling pattern to remember.
Common scenarios with targeted checks:

**Diagnosing slow or unstable NCCL performance**
- Run ib_write_bw between two hosts — target ≥380 Gbps per 400GbE link. If well below, check MTU (must be 9000 jumbo on all hops).
- ethtool -S <netdev> | grep cnp — non-zero CNP counters mean active congestion. Review ECN/DCQCN thresholds.
- mlnx_qos -i <netdev> — PFC must be enabled on priority 3 for RoCEv2 traffic.
- Confirm the nvidia-peermem module is loaded for GPUDirect: lsmod | grep nvidia_peermem.
- Disable deep CPU idle states (cpupower idle-set -D 0) and set the governor to performance.
- NCCL_IB_GID_INDEX should be 3 (RoCEv2), not 0 (RoCEv1/IB) — verify with show_gids.

**Bringing up a lossless RoCEv2 fabric**
- Enable PFC on priority 3: mlnx_qos -i <netdev> --pfc 0,0,0,1,0,0,0,0.
- Set jumbo MTU: ip link set <netdev> mtu 9000.
- Validate path MTU end-to-end: ping -M do -s 8972 <remote_ip> — should not fragment or drop.
- Monitor: ethtool -S <netdev> | grep -E 'pfc|cnp|ecn' — healthy = low/zero CNP counts.