NCP-AIN · Topic 3 · Ethernet AI Networking

Spectrum-X & Ethernet for AI

RoCEv2, lossless Ethernet, DCQCN congestion control, Spectrum-4 switch architecture, and GPUDirect RDMA — the Ethernet path to AI-scale performance.

51.2 Tbps Spectrum-4
3 Lossless Mechanisms
7 Study Sections
10 Practice Questions

Spectrum-X: Ethernet Reinvented for AI
NVIDIA's answer to "can Ethernet compete with InfiniBand for AI workloads?" — and the answer is yes, with the right stack.
💡

Why a New Ethernet Platform?

Standard Ethernet was designed for bursty web traffic, not the sustained, latency-sensitive all-to-all communication of GPU training. Spectrum-X adds hardware-level adaptive routing, lossless fabric, SHARP in-network compute, and ultra-low latency to deliver InfiniBand-class performance over familiar Ethernet infrastructure.

Platform Components
Three pillars work together to make Spectrum-X perform.
🔀

Spectrum-4 Switch ASIC

The heart of the fabric — 51.2 Tbps, 128 × 400GbE or 64 × 800GbE ports, hardware adaptive routing, SHARP compute, and built-in telemetry.

  • Cut-through forwarding
  • Lossless fabric support (PFC + ECN)
  • Hardware-based Adaptive Routing
  • In-network SHARP AllReduce
  • INT (In-band Network Telemetry)
🧩

BlueField-3 SuperNIC

Not just a NIC — a SmartNIC with Arm cores optimized for RoCEv2 offload, GPUDirect RDMA, and ASAP² traffic acceleration.

  • 400GbE connectivity (2 × 200GbE)
  • Hardware RoCEv2 transport offload
  • GPUDirect RDMA support
  • ASAP² datapath acceleration
  • ConnectX-7 network controller
🛠️

NVIDIA AI Networking Software

MLNX-OS switch firmware, UFM (Unified Fabric Manager), NCCL plugins, SHARP daemons, and monitoring via NVIDIA DCGM and NetQ.

  • UFM: topology + monitoring
  • NCCL plugin for RoCEv2
  • SHARP daemon on switches
  • DCGM for GPU-network co-monitoring
  • NetQ for fabric telemetry

The Spectrum-X Value Proposition

What Spectrum-X Adds to Ethernet

Lossless fabric: PFC + ECN eliminate the retransmissions that cripple RoCEv2 performance on standard Ethernet.
Hardware Adaptive Routing: Per-packet load balancing at line rate — no external SM required.
SHARP AllReduce: In-network gradient aggregation offloads the most communication-heavy GPU training operation.
GPUDirect RDMA: GPU memory to GPU memory across the fabric with zero CPU involvement.
Ethernet ecosystem: Standard SNMP/gNMI management, existing tooling, no proprietary SM.

When to Choose Spectrum-X

✅ Brownfield data centers with existing Ethernet infrastructure
✅ Multi-tenant environments needing standard network management
✅ Scale-out AI training clusters (ResNet, BERT, LLM fine-tuning)
✅ Hybrid HPC/AI workloads on shared fabric
✅ Organizations prioritizing Ethernet tooling familiarity
⚠️ For absolute minimum latency (MPI-heavy HPC), InfiniBand still leads
⚠️ SHARP requires Spectrum switches throughout the spine-leaf path

Spectrum-X in a Scale-Out AI Cluster

Spine Layer — Spectrum-4 Spine Switches: Spine-1 through Spine-4, 51.2 Tbps each
Leaf Layer — Spectrum-4 Leaf Switches (Rail-Optimized): Rail 0 through Rail 7, one rail per GPU NIC (GPU-0 → Rail 0 … GPU-7 → Rail 7)
Server Layer — DGX / HGX Nodes (BlueField-3 SuperNIC): Node A through Node D, 8× H100 each

Each GPU NIC connects to its dedicated rail leaf switch — full bisection bandwidth, no oversubscription at the leaf layer


RoCEv2 & Lossless Ethernet
How RDMA operates over UDP/IP, and the three-layer mechanism that keeps the fabric lossless.

🔷 RoCEv2 Protocol Stack vs Standard TCP

Application / NCCL: NCCL AllReduce / MPI — GPU training communication primitives. Same API regardless of transport. [same for both]
RDMA Verbs (libibverbs): Queue Pairs (QPs), Work Requests (WRs), Completion Queues (CQs) — zero-copy, kernel-bypass I/O. [RoCEv2 only]
IB Transport: Reliable Connected (RC), Unreliable Datagram (UD) — same QP model as native InfiniBand. [RoCEv2 only]
UDP/IP: RoCEv2 uses UDP port 4791 — routable across L3 boundaries, no InfiniBand subnet manager needed. [RoCEv2 key difference]
Ethernet + PFC: Priority Flow Control (PFC) pause frames on Priority 3 create per-priority lossless queues at L2. [hardware-enforced]
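
A quick sanity check that traffic really is RoCEv2 over UDP 4791, and that it carries the expected DSCP/ECN bits, is a short capture on the RoCE netdev (a minimal sketch; ens1f0np0 is a placeholder interface name):

# Capture a handful of RoCEv2 packets (UDP destination port 4791)
tcpdump -i ens1f0np0 -c 20 -nn udp port 4791
# Verbose mode prints the IP ToS byte, which encodes the DSCP and ECN bits
tcpdump -i ens1f0np0 -c 20 -nn -v udp port 4791 | grep -i tos
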
⚠️

RoCEv2 is Fragile Without a Lossless Fabric

Unlike TCP, which recovers from loss with retransmissions, RoCEv2 Reliable Connected mode expects an ordered, lossless path: its go-back-N style recovery means a single dropped packet forces retransmission of everything sent after it and stalls the entire transfer. PFC + ECN are mandatory, not optional.

The Lossless Ethernet Triad

🔒 PFC — Priority Flow Control

IEEE 802.1Qbb. Sends per-priority PAUSE frames upstream when a switch buffer fills, stopping traffic before drops occur.

Priority 3 → RoCEv2 / RDMA traffic
Priority 6 → Network control
Priority 0 → General (lossy) traffic
⚠️ Risk: Head-of-line blocking if poorly configured. Deadlocks possible in non-tree topologies.

📉 ECN — Explicit Congestion Notification

IP-level congestion signaling. Switches mark packets with CE (Congestion Experienced) bits before buffers overflow.

ECN bits 10 → ECT(0) — ECN capable
ECN bits 11 → CE — Congestion Experienced
Threshold: ~20-80KB buffer for mark
✅ No head-of-line blocking; operates alongside PFC for defence-in-depth.

DCQCN — Data Center Quantized Congestion Notification

The rate-control algorithm that ties ECN marks to sender-side rate reduction. Combines ideas from QCN (802.1Qau) and DCTCP.

DCQCN Congestion Response Flow

Buffer fills at switch → Switch marks CE bit (ECN) → Receiver NIC detects CE bit → Receiver sends CNP packet → Sender reduces TX rate (α) → Sender probes for recovery

CNP — Congestion Notification Packet

Sent by the receiver NIC to the sender when an ECN-marked (CE-bit) packet is received. The sender uses CNPs to drive rate reduction via the α (alpha) parameter.

Alpha (α) Parameter

Controls rate reduction aggressiveness. On CNP reception: α = (1-g)×α + g. New rate = old rate × (1 − α/2). Higher α = more aggressive reduction.

Rate Recovery

When no CNPs received, sender probes upward: first with byte counter (fast), then with timer (slow). Prevents over-aggressive back-off.
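
To make the update rule concrete, a purely illustrative example with assumed values (g = 1/16, current α = 0.5): on receiving a CNP, α becomes (1 − 1/16) × 0.5 + 1/16 ≈ 0.53, so the new rate is old rate × (1 − 0.53/2) ≈ 0.73 × old rate, roughly a 27% cut. If no further CNPs arrive, the byte-counter stage steps the rate back up quickly and the timer stage completes the recovery toward line rate.
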

RoCEv2 Queue Pair Model

Queue Pair Types

QP Type | Description
RC | Reliable Connected — ordered, no loss. Used for NCCL point-to-point. Most common.
UD | Unreliable Datagram — connectionless. Used for multicast and control traffic.
UC | Unreliable Connected — connection-oriented but no reliability. Rarely used.
XRC | Extended RC — shared QP for many-to-one patterns. Reduces QP explosion in large clusters.

Key RDMA Operations

Operation | Pattern
RDMA WRITE | Sender pushes data to remote memory — remote CPU uninvolved. Used in NCCL.
RDMA READ | Sender pulls data from remote memory. Initiator drives. Lower GPU utilization impact.
SEND/RECV | Both sides participate. Required for connection setup and control.
Atomic | Fetch-and-add, compare-and-swap on remote memory. Used for distributed locks.
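
The perftest suite maps one tool to each of these operations, which makes it an easy way to exercise them on a live fabric (a sketch; mlx5_0 and <server_ip> are placeholders, and each tool runs once with no address as the server and once with the server's IP as the client):

ib_write_bw -d mlx5_0 --report_gbits     # RDMA WRITE bandwidth
ib_read_bw -d mlx5_0 --report_gbits      # RDMA READ bandwidth
ib_send_bw -d mlx5_0 --report_gbits      # SEND/RECV bandwidth
ib_atomic_bw -d mlx5_0                   # Atomic (fetch-and-add / compare-and-swap) rate
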
🔑

GID — Global Identifier for RoCEv2

RoCEv2 uses GIDs instead of InfiniBand LIDs for addressing. A GID is a 128-bit identifier derived from the interface's IPv6 address (using EUI-64 or a mapped IPv4 address). When configuring NCCL, NCCL_IB_GID_INDEX selects which GID to use — on most systems index 0 holds the RoCEv1 entry and index 3 the RoCEv2/IPv4 entry, so selecting the wrong index silently drops you back to RoCEv1; confirm the mapping with show_gids.
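
To see which index actually holds the RoCEv2 entry on a given host, the GID table can be read directly (a sketch; mlx5_0, port 1, and index 3 are placeholders):

show_gids                                                    # GID table with RoCE version and netdev per entry
cat /sys/class/infiniband/mlx5_0/ports/1/gids/3              # GID value at index 3
cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/3   # should report "RoCE v2" for a RoCEv2 entry
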


Spectrum-X Architecture Deep Dive
Spectrum-4 switch internals, BlueField-3 SuperNIC, topology options, and SHARP in-network compute.

🔀 Spectrum-4 Switch ASIC

Spectrum-4 Key Specifications
Aggregate BW: 51.2 Tbps
Port configurations: 128 × 400GbE or 64 × 800GbE
Forwarding mode: Cut-through (ultra-low latency)
Buffer architecture: Shared packet buffer with VOQ
Adaptive Routing: Hardware per-packet, no SM
SHARP support: Yes — in-network AllReduce
Telemetry: INT, gRPC streaming, SNMP
PFC priorities: 8 traffic classes (IEEE 802.1Qbb)
ECN marking: Per-queue threshold, WRED
Predecessor: Spectrum-3 (25.6 Tbps)

Forwarding Pipeline

1. Packet ingress — parse L2/L3/L4 headers
2. Lookup — FDB / LPM / ACL in TCAM
3. Adaptive Routing — select egress port
4. PFC / ECN marking check
5. Egress scheduling — WRR / DWRR

🧩 BlueField-3 SuperNIC

SuperNIC vs DPU Mode

BlueField-3 can operate in two modes:

SuperNIC mode: Arm cores run vendor firmware only. Full NIC performance (400GbE RoCEv2) exposed to host. Optimized for Spectrum-X AI clusters — pure network offload, no packet processing overhead.

DPU mode: Arm cores run a full Linux OS. Host sees a controlled NIC. Ideal for security/storage offload. Used in Morpheus, DOCA applications.

For NCP-AIN: SuperNIC mode = AI networking. DPU mode = SmartNIC/security (Topic 4).
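
The operating mode is a firmware setting that can be inspected with mlxconfig (a hedged sketch: the MST device path and the exact parameter names differ by BlueField-3 SKU and firmware release, so treat them as assumptions and confirm against the BlueField documentation before changing anything):

mst start                                            # load the MST kernel modules
mst status -v                                        # list /dev/mst/* device paths
mlxconfig -d /dev/mst/mt41692_pciconf0 query | grep -i internal_cpu   # current CPU/mode-related settings
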
BlueField-3 SuperNIC Specs
Network speed: 400GbE (2 × 200GbE ports)
RDMA protocol: RoCEv2 hardware offload
GPUDirect: RDMA + Storage (GDS)
Arm cores: 16× Armv8.2+ A78
PCIe: Gen5 × 16
ASAP²: Accelerated Switch and Packet Processing
Controller: ConnectX-7
Predecessor: BlueField-2 (up to 200GbE)

🏗️ AI Cluster Topology Options

Rail-Optimized (Preferred for AI)

Each GPU NIC is assigned to a dedicated "rail" leaf switch. All intra-rail GPU communications stay at the leaf layer. Inter-node traffic uses spine. No oversubscription at leaf layer.

Formula: N rails = N NICs per GPU server
Example: 8-GPU node → 8 NICs → 8 rails
Benefit: AllReduce traffic pattern (ring/all-to-all) maps naturally — each GPU ring uses one rail
SHARP: Works optimally on rail topology

Fat-Tree (Flexible Scale-Out)

Classic 3-tier (edge-aggregation-core) or 2-tier Clos. Provides full bisection bandwidth. More flexible port utilization but requires ECMP/AR tuning for AI traffic.

Oversubscription: Can accept 2:1 or 4:1 for cost savings
Adaptive Routing: Critical to balance elephant flows
ECMP hash: 5-tuple by default; may need flowlet switching
SHARP: Works but requires tree-topology SHARP trees
🔗

Adaptive Routing on Spectrum-X vs InfiniBand

InfiniBand AR is enabled by the Subnet Manager per-port. Spectrum-X AR is fully hardware-driven inside the Spectrum-4 ASIC — no SM equivalent needed. The switch monitors per-port congestion in real time and reroutes packets to less-loaded paths. Works at per-packet granularity.

⚡ SHARP — Scalable Hierarchical Aggregation and Reduction Protocol

How SHARP Works on Spectrum-X

Instead of GPU → CPU → network → CPU → GPU for AllReduce, SHARP delegates the aggregation to the switch ASIC itself:

GPUs send partial gradients to leaf switch
Spectrum-4 aggregates (SUM/MAX/MIN) in-switch
Passes aggregated result up to spine (tree)
Final result broadcast back to all GPUs

SHARP Benefits & Requirements

Cuts AllReduce traffic on the fabric: gradients cross each link once up and once down the SHARP tree instead of circulating repeatedly as in ring-AllReduce
Offloads GPU compute for gradient reduction
Lower latency for AllReduce vs ring-allreduce
NCCL integration via NCCL SHARP plugin (environment sketch after this list)

⚠️ Requires Spectrum switches throughout the path (leaf + spine)
⚠️ Requires SHARP daemon (sharpd) running on all switches
⚠️ Requires SHARP-capable firmware on NICs
⚠️ Works on Reliable Connected QPs only
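
A minimal environment sketch for enabling SHARP collectives under NCCL (these are the commonly documented variables, but names and defaults vary across NCCL and SHARP releases, so treat them as assumptions to verify):

export NCCL_COLLNET_ENABLE=1       # allow NCCL to use the CollNet/SHARP path
export SHARP_COLL_ENABLE_SAT=1     # enable streaming aggregation for large AllReduce
export NCCL_IB_GID_INDEX=3         # RoCEv2 GID, as used elsewhere in this topic
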

Architecture Flash Cards

🔀

What is the aggregate bandwidth of Spectrum-4?

51.2 Tbps

128 × 400GbE or 64 × 800GbE port configurations. Cut-through forwarding for minimum latency.

🧩

What makes BlueField-3 a "SuperNIC" vs a standard NIC?

Arm cores + ConnectX-7

Embedded Arm compute allows offloading security, storage, and network functions. Standard NICs are just forwarding devices.

🛤️

Why is rail-optimized topology preferred for AI training?

GPU-to-leaf 1:1 mapping

Each GPU NIC connects to a dedicated leaf rail. AllReduce ring traffic stays within a single rail — no spine traversal for intra-step gradients.

Where does SHARP perform AllReduce aggregation?

Inside the switch ASIC

Spectrum-4 performs SUM/MAX/MIN on gradient data as it flows through the switch — no CPU or GPU involved in the aggregation step.


Tuning & Configuration
QoS/DSCP mapping, GPUDirect RDMA setup, adaptive routing config, and the essential command toolkit.

QoS & DSCP Configuration for RoCEv2

Correct DSCP-to-PFC-priority mapping is the single most common RoCEv2 misconfiguration. Every hop must agree.

Traffic Class | DSCP Value | PFC Priority | Lossless? | Notes
RoCEv2 / RDMA | 26 (AF31) | Priority 3 | ✅ Yes | NCCL default; must match on NIC, switch, and spine
Storage (NVMe-oF) | 24 (CS3) | Priority 3 | ✅ Yes | Can share priority 3 with RoCEv2 in mixed clusters
Network Control | 48 (CS6) | Priority 6 | ⚠️ Partial | BGP, LACP, STP — must not be blocked by PFC
General (lossy) | 0 (BE) | Priority 0 | ❌ No | Management, SSH, HTTP — unaffected by RoCEv2 PFC
Cluster Management | 16 (CS2) | Priority 2 | ❌ No | Kubernetes control plane, health checks
⚠️

Trust Mode Matters

Set trust DSCP on NIC-facing switch ports (hosts mark their own DSCP). Set trust PCP on trunk/uplink ports between switches. Mismatching trust mode causes DSCP re-marking and PFC on wrong priority.
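
The host-side half of that contract looks roughly like this (a sketch: the ToS value 106 encodes DSCP 26 with ECT(0); mlx5_0 and <netdev> are placeholders, and switch-side trust commands depend on the switch OS):

cma_roce_tos -d mlx5_0 -t 106                               # mark RDMA-CM traffic with ToS 106 (DSCP 26 + ECT)
echo 106 > /sys/class/infiniband/mlx5_0/tc/1/traffic_class  # mark verbs traffic on port 1
mlnx_qos -i <netdev> --trust dscp                           # NIC classifies on DSCP, not PCP
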

🎯 GPUDirect RDMA Setup

How GPUDirect RDMA Works

Allows the NIC's DMA engine to directly access GPU frame buffer memory — skipping the CPU and system DRAM entirely.

GPU memory allocated in BAR1 window
NIC DMA engine maps GPU BAR1 via PCIe P2P
RDMA WRITE goes GPU mem → NIC → fabric → remote NIC
Remote NIC DMA writes directly into remote GPU mem

GPUDirect Prerequisites

GPU & NIC on same PCIe root complex (or NVSwitch P2P)
nvidia-peermem kernel module loaded
IOMMU disabled or configured for P2P DMA
BAR1 aperture large enough to map the registered GPU buffers (large / resizable BAR)
GPU buffers allocated with cudaMalloc / cuMemAlloc and registered for RDMA
⚠️ Hyperthreading and C-state disable recommended
⚠️ NUMA affinity: GPU, NIC, and CPU should share a NUMA node (validation sketch below)
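
A few commands that check the prerequisites above (a sketch; device names are placeholders):

lsmod | grep nvidia_peermem                           # peer-memory module loaded?
nvidia-smi topo -m                                    # GPU/NIC PCIe paths and NUMA affinity matrix
nvidia-smi -q | grep -A 3 BAR1                        # BAR1 aperture size and usage
cat /sys/class/infiniband/mlx5_0/device/numa_node     # NUMA node of the NIC
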

Key Tuning Parameters

📦

MTU & Frame Size

  • MTU=9000 (jumbo frames) — mandatory for RoCEv2 performance
  • Must match on all hops: NIC, leaf, spine, remote NIC
  • Mismatch → fragmentation or drops
  • Verify MTU with ip link show; check offloads with ethtool -k
🔄

Queue Depths

  • tx/rx queue depth=1024 per QP for large transfers
  • CQ (Completion Queue) size ≥ QP depth
  • CQ moderation: coalesce completions (reduce interrupts)
  • ethtool -g <netdev> to inspect ring sizes
😴

CPU Power States

  • Disable C-states: cpupower idle-set -D 0
  • C-state wakeup latency adds jitter to RDMA completions
  • Set CPU governor: performance
  • Disable irqbalance; pin IRQs to specific CPUs
📡

NCCL Environment

  • NCCL_IB_GID_INDEX=3 — force RoCEv2 GID
  • NCCL_SOCKET_IFNAME — select correct interface
  • NCCL_IB_DISABLE=0 — enable IB/RoCEv2 transport
  • SHARP_COLL_ENABLE_SAT=1 — enable SHARP AllReduce
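
Putting those variables in front of a standard nccl-tests run gives a quick end-to-end check (a sketch; the binary path, HCA list, and interface name are placeholders):

NCCL_IB_GID_INDEX=3 NCCL_IB_HCA=mlx5_0 NCCL_SOCKET_IFNAME=ens1f0np0 NCCL_DEBUG=INFO \
  ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 8    # AllReduce sweep from 8 B to 4 GB on 8 GPUs
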
📊

ECN Thresholds

  • ECN mark start: ~20KB buffer depth
  • ECN mark probability ramp to 100% at ~80KB
  • DCQCN α start: 1/128 (low aggressiveness)
  • CNP rate: max 1 CNP per 64µs per QP

PFC Configuration

  • Enable PFC on priority 3 only (RoCEv2)
  • PFC pause threshold: headroom = RTT × BW
  • Typical headroom: 80-100KB per port at 400GbE
  • Watch: ethtool -S for pause frame counters
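
The host-side items above (MTU, C-states, governor, IRQ handling) condense to a handful of commands (a sketch; the interface name is a placeholder and cpupower packaging varies by distro):

ip link set ens1f0np0 mtu 9000            # jumbo frames on the RoCE netdev
cpupower idle-set -D 0                    # disable deep C-states (completion-latency jitter)
cpupower frequency-set -g performance     # pin the CPU frequency governor
systemctl stop irqbalance                 # then pin NIC IRQs to CPUs on the NIC's NUMA node
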

Essential Spectrum-X / RoCEv2 Commands

# QoS inspection (NIC side)
# Note: ethtool and mlnx_qos take the Ethernet netdev; ib_*/ibv_* tools take the
# RDMA device name (mlx5_0). Map between the two with: ibdev2netdev
mlnx_qos -i <netdev>                        # Show PFC and priority settings
mlnx_qos -i <netdev> --pfc 0,0,0,1,0,0,0,0  # Enable PFC on priority 3 only
ethtool -k <netdev>                         # Check offload settings (GRO, LRO, etc.)
ethtool -S <netdev> | grep pfc              # PFC pause frame counters

# RDMA statistics
rdma stat show                    # All RDMA port counters
rdma stat show -a                  # Extended auto-mode counters
perfquery -x -C mlx5_0 -P 1      # Extended port counters (InfiniBand compat)

# RDMA performance testing
ib_write_bw -d mlx5_0 -D 30 --report_gbits     # RDMA write bandwidth (server)
ib_write_bw -d mlx5_0 <server_ip> --report_gbits # Client side
ib_read_lat -d mlx5_0 <server_ip>               # Read latency test

# DCQCN / CNP counters
ethtool -S <netdev> | grep cnp    # CNP sent/received (congestion signals)
ethtool -S <netdev> | grep ecn    # ECN-marked packet counters

# RoCEv2 GID inspection
show_gids                         # Show all GID table entries
ibv_devinfo -d mlx5_0 -v         # Device info + GIDs verbose

# Switch-side (MLNX-OS)
show qos interface ethernet 1/1   # Interface QoS counters
show roce cnp-rx                  # CNP receive stats on switch
show interfaces ethernet 1/1 counters # Port counters
🧪

Baseline Validation Checklist

Before declaring a Spectrum-X cluster production-ready: (1) ib_write_bw ≥ 380 Gbps per 400GbE link; (2) CNP counter stays at 0 under normal load (non-zero = congestion, check thresholds); (3) PFC pause counters increment only under extreme load — frequent pausing means ECN thresholds are set too high; (4) MTU consistency verified with ping -M do -s 8972 across all fabric hops.


Spectrum-X vs InfiniBand
A side-by-side look at every dimension that matters for an AI networking decision.
Feature | 🔵 Spectrum-X (RoCEv2) | 🟣 InfiniBand (NDR/HDR)
Transport Protocol | UDP/IP (L3-routable), works across IP subnets | Native IB transport, subnet-scoped, needs SM
Max Link Speed | 800GbE (Spectrum-4 ready) | NDR: 400 Gb/s per port
Latency (port-to-port) | ~300–500 ns (cut-through) | ~100–300 ns (native IB)
Lossless Fabric | Requires PFC + ECN + DCQCN tuning | Built-in; IB transport is inherently lossless
Congestion Control | DCQCN (ECN + rate adaptation) | Hardware FECN/BECN with HCA rate reduction (CC parameters set by the SM)
Adaptive Routing | Hardware per-packet (Spectrum-4 ASIC, no SM) | SM-computed per-QP (configurable AR in Quantum-2)
In-Network Compute | SHARP on Spectrum-4 switches | SHARP on Quantum-2 switches
Network Management | Standard Ethernet (SNMP, gNMI, BGP, LLDP) | OpenSM, UFM (NVIDIA-proprietary)
Subnet Manager | Not required; L3 IP routing | Required; OpenSM or UFM
Routing Algorithm | ECMP, Adaptive Routing, Flowlet | ftree, MINHOP, DFSSSP, Adaptive Routing
Addressing | IPv4/IPv6 + GID (derived from IP) | LID (local) + GID (global), assigned by SM
Ecosystem | Broad: any Ethernet vendor for uplinks, standard tooling | NVIDIA (Mellanox) proprietary ecosystem
Operational Complexity | Lower: familiar Ethernet operations | Higher: SM config, LID space, IB partitions
GPUDirect RDMA | Yes (RoCEv2 GPUDirect) | Yes (native IB GPUDirect)
Best for | Brownfield Ethernet, multi-tenant, scale-out LLM | MPI HPC, minimum latency, tightly coupled

RoCEv2 vs Native IB RDMA — Protocol Differences

Aspect | RoCEv2 | Native InfiniBand
L3 protocol | UDP/IP (port 4791) | IB transport (no IP)
Addressing | IP address + GID | LID (local) + GID (global)
Routing scope | Inter-subnet (L3 routable) | Intra-subnet only (LID scope)
Congestion response | DCQCN (software α) | Hardware FECN/BECN (SM driven)
Lossless requirement | Must configure PFC + ECN | Built-in at protocol level
MTU typical | 9000 bytes (jumbo) | 4096 bytes (IB standard)
Verbs API | Same libibverbs / ibv_* calls | Same libibverbs / ibv_* calls
NCCL support | NCCL IB plugin (same code path) | NCCL IB plugin (same code path)

✅ Choose Spectrum-X When…

• Existing Ethernet fabric that can be reused or extended
• Operations team knows Ethernet — not IB-certified
• Multi-tenant cluster: Ethernet VLANs/VXLANs for isolation
• Need L3-routable RDMA across racks/pods
• Scale-out LLM training (transformer models, fine-tuning)
• SHARP AllReduce available — Spectrum-4 throughout path
• Target: DGX H100/GB200 with Spectrum-X networking

✅ Choose InfiniBand When…

• Absolute minimum latency is critical (tightly-coupled MPI)
• Greenfield HPC cluster — no Ethernet legacy
• Team has IB expertise and UFM familiarity
• Existing Quantum-2 fabric with AR and SHARP in place
• Sub-microsecond jitter required for financial/HPC workloads
• Need IB partitioning (P_Key) for strong isolation
• Deep RDMA semantics: XRC, SRQ, multicast — mature IB features
📊

Exam Framing: The Spectrum-X Claim

NVIDIA positions Spectrum-X as achieving "IB-class performance on Ethernet" for AI training workloads. This is achieved through: hardware adaptive routing (no SM jitter), SHARP AllReduce (in-network aggregation), and lossless fabric (PFC + DCQCN). The claim is validated for NCCL-based AllReduce at scale — but native IB still wins on raw latency for MPI-heavy workloads.


Practice Quiz
10 exam-style questions on Spectrum-X, RoCEv2, lossless Ethernet, and GPUDirect RDMA.


Memory Hooks & Ethernet Advisor
Mnemonics for the exam, plus a guided advisor to troubleshoot your Spectrum-X setup.
P-E-D
Lossless Ethernet Triad

The three mechanisms that make RoCEv2 work:

PFC — Priority Flow Control (pause frames)
ECN — Explicit Congestion Notification (CE bit)
DCQCN — rate control algorithm
CNP → α → Rate ↓
DCQCN Response Chain

When congestion happens: CE bit marked → receiver sends CNP → sender increases α → sender reduces TX rate. No CNP = rate recovers.

SHARP = Switch Math
In-Network Compute

SHARP does the AllReduce inside the Spectrum-4 switch ASIC — not in the GPU, not in the CPU. Think: the switch is doing your gradient addition for you.

Rail = One NIC, One Leaf
Rail-Optimized Topology

Each GPU NIC connects to its own dedicated leaf switch rail. 8 NICs per node → 8 rails. AllReduce ring stays within a single rail — never touches the spine.

GID = RoCEv2 Address
Addressing in RoCEv2

In IB, addresses are LIDs (SM-assigned). In RoCEv2, addresses are GIDs derived from IP/IPv6. On most systems GID index 0 is the RoCEv1 entry and index 3 the RoCEv2/IPv4 entry; confirm with show_gids, then set NCCL_IB_GID_INDEX accordingly (commonly 3).

DSCP 26 = Priority 3
QoS Mapping

DSCP 26 (AF31) → PFC Priority 3 → Lossless queue. This mapping must be consistent on every NIC, every leaf, and every spine. One mismatch = lossy RDMA.

GPUDirect = No CPU
GPU-to-GPU RDMA

GPUDirect RDMA: the NIC DMA engine reads/writes GPU memory directly. CPU is completely out of the data path. Requires: nvidia-peermem, BAR1, same NUMA node.

Spectrum-4: 51.2T
Switch Bandwidth

Spectrum-4 = 51.2 Tbps aggregate. 128 × 400GbE or 64 × 800GbE ports. Its predecessor Spectrum-3 was 25.6 Tbps — exactly half. Easy doubling pattern to remember.

🤖 Ethernet Advisor

Targeted guidance for four common Spectrum-X scenarios.

🚀 RDMA Performance Troubleshooting

  • Run ib_write_bw between two hosts — target ≥380 Gbps per 400GbE link. If well below, check MTU (must be 9000 jumbo on all hops).
  • Check ethtool -S <netdev> | grep cnp — non-zero CNP counters mean active congestion. Review ECN/DCQCN thresholds.
  • Check PFC priority: mlnx_qos -i <netdev> — PFC must be enabled on priority 3 for RoCEv2 traffic.
  • Verify DSCP trust mode on switch ports — NIC-facing ports should be trust DSCP.
  • Ensure nvidia-peermem module is loaded for GPUDirect: lsmod | grep nvidia_peermem.
  • Disable C-states: cpupower idle-set -D 0 and set governor to performance.
  • Verify NCCL GID: NCCL_IB_GID_INDEX should be 3 (RoCEv2), not 0 (RoCEv1/IB).

🏗️ Spectrum-X Topology Design Guidance

  • For AI training: use rail-optimized topology — 1 NIC per GPU maps to 1 dedicated leaf switch rail. 8-GPU node = 8 rails per node.
  • Each leaf rail connects to all spine switches with equal-weight uplinks — full bisection bandwidth at the leaf layer.
  • Size leaf-to-spine uplinks at ≥ (N_servers × NIC_speed) / N_spines with headroom for SHARP and control traffic (worked example after this list).
  • Deploy Spectrum-4 throughout leaf and spine if SHARP AllReduce is required — SHARP needs Spectrum switches in the entire path.
  • Enable Adaptive Routing on all Spectrum-4 switches — provides per-packet load balancing without an SM.
  • For multi-tenant: use VLANs or VXLANs on the Ethernet fabric for traffic isolation between tenants.
  • Plan switch management network separately (out-of-band) — do not mix management and RoCEv2 traffic on same priority.
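
A worked example of that sizing rule with assumed numbers: a rail leaf serving 32 servers at 400GbE per NIC carries 32 × 400 = 12,800 Gbps of downlink. With 4 spines and no oversubscription, each spine needs 12,800 / 4 = 3,200 Gbps of uplink from that leaf, i.e. 8 × 400GbE or 4 × 800GbE links per spine, before adding headroom for SHARP and control traffic.
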

⚙️ QoS & Lossless Ethernet Configuration

  • Step 1: Configure DSCP marking on NIC hosts — RoCEv2 traffic should be marked DSCP 26 (AF31).
  • Step 2: Set trust DSCP on all switch ports facing servers; set trust PCP on trunk/uplink ports.
  • Step 3: Map DSCP 26 → PFC priority 3 on both switch and NIC: mlnx_qos -i <netdev> --pfc 0,0,0,1,0,0,0,0.
  • Step 4: Enable ECN marking on the switch at ~20KB buffer threshold for priority 3 queue.
  • Step 5: Set MTU=9000 (jumbo) on all server NICs and switch ports: ip link set <netdev> mtu 9000.
  • Step 6: Validate end-to-end with ping -M do -s 8972 <remote_ip> — should not fragment or drop.
  • Step 7: Monitor ongoing with ethtool -S <netdev> | grep -E 'pfc|cnp|ecn' — healthy = low/zero CNP counts.

⚖️ Spectrum-X vs InfiniBand Decision Guide

  • Existing Ethernet infra? → Spectrum-X. Reuse switches, same tooling, same team skills.
  • Need L3-routable RDMA across subnets? → Spectrum-X (RoCEv2 is UDP/IP; IB LIDs are subnet-scoped).
  • Multi-tenant isolation required? → Spectrum-X (VLAN/VXLAN) or IB with P_Key partitioning — IB has stronger isolation primitives.
  • Absolute minimum latency for MPI-heavy HPC? → InfiniBand (roughly 100–300 ns per switch hop vs 300–500 ns on Spectrum-X; see the comparison table above).
  • SHARP AllReduce needed? → Both Quantum-2 (IB) and Spectrum-4 (Ethernet) support SHARP.
  • Greenfield LLM training at scale (≥1000 GPUs)? → Spectrum-X with rail-optimized topology is the NVIDIA recommended path for scale-out Ethernet AI.
  • Team expertise? → IB requires OpenSM/UFM knowledge. Spectrum-X uses standard Ethernet tooling anyone with networking background can operate.

Keep Building Your NCP-AIN Mastery

Topic 3 complete — continue with BlueField DPUs & DOCA next.
