RoCEv2, lossless Ethernet, DCQCN congestion control, Spectrum-4 switch architecture, and GPUDirect RDMA — the Ethernet path to AI-scale performance.
Standard Ethernet was designed for bursty web traffic, not the sustained, latency-sensitive all-to-all communication of GPU training. Spectrum-X adds hardware-level adaptive routing, lossless fabric, SHARP in-network compute, and ultra-low latency to deliver InfiniBand-class performance over familiar Ethernet infrastructure.
The heart of the fabric — 51.2 Tbps, 128 × 400GbE or 64 × 800GbE ports, hardware adaptive routing, SHARP compute, and built-in telemetry.
Not just a NIC — a SmartNIC with Arm cores optimized for RoCEv2 offload, GPUDirect RDMA, and ASAP² traffic acceleration.
MLNX-OS switch firmware, UFM (Unified Fabric Manager), NCCL plugins, SHARP daemons, and monitoring via NVIDIA DCGM and NetQ.
Each GPU NIC connects to its dedicated rail leaf switch — full bisection bandwidth, no oversubscription at the leaf layer
Unlike TCP, which handles loss with retransmit timers, RoCEv2 Reliable Connected mode expects an ordered, lossless path. A single dropped packet triggers a go-back-N retransmission that stalls the entire transfer. PFC + ECN are mandatory, not optional.
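One practical consequence: loss shows up as retry symptoms on the NIC rather than as TCP-style drop statistics. A quick way to spot it (a sketch — counter names as exposed by the mlx5 driver; paths can vary by driver version):

```bash
# Non-zero values here usually mean the "lossless" guarantee is broken somewhere
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/out_of_sequence
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/packet_seq_err
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/local_ack_timeout_err
```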
IEEE 802.1Qbb. Sends per-priority PAUSE frames upstream when a switch buffer fills, stopping traffic before drops occur.
IP-level congestion signaling. Switches mark packets with CE (Congestion Experienced) bits before buffers overflow.
The rate-control algorithm that ties ECN marks to sender-side rate reduction. Combines ideas from QCN (802.1Qau) and DCTCP.
Sent by the receiver NIC to the sender when an ECN-marked (CE-bit) packet is received. The sender uses CNPs to drive rate reduction via the α (alpha) parameter.
Controls rate reduction aggressiveness. On CNP reception: α = (1-g)×α + g. New rate = old rate × (1 − α/2). Higher α = more aggressive reduction.
When no CNPs received, sender probes upward: first with byte counter (fast), then with timer (slow). Prevents over-aggressive back-off.
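A worked instance of the reduction step from the update rule above (values purely illustrative — in practice g is a small constant such as 1/256):

```latex
% A CNP arrives with g = 1/16 and current \alpha = 0.5 (illustrative values)
\alpha' = (1-g)\,\alpha + g = \tfrac{15}{16}(0.5) + \tfrac{1}{16} = 0.53125
R' = R\left(1 - \tfrac{\alpha'}{2}\right) = R\,(1 - 0.2656) \approx 0.734\,R
```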
| QP Type | Description |
|---|---|
| RC | Reliable Connected — ordered, no loss. Used for NCCL point-to-point. Most common. |
| UD | Unreliable Datagram — connectionless. Used for multicast and control traffic. |
| UC | Unreliable Connected — connection-oriented but no reliability. Rarely used. |
| XRC | Extended RC — shared QP for many-to-one patterns. Reduces QP explosion in large clusters. |
| Operation | Pattern |
|---|---|
| RDMA WRITE | Sender pushes data to remote memory — remote CPU uninvolved. Used in NCCL. |
| RDMA READ | Sender pulls data from remote memory. Initiator drives. Lower GPU utilization impact. |
| SEND/RECV | Both sides participate. Required for connection setup and control. |
| Atomic | Fetch-and-add, compare-and-swap on remote memory. Used for distributed locks. |
RoCEv2 uses GIDs instead of InfiniBand LIDs for addressing. A GID is a 128-bit identifier derived from the interface's IPv6 address (using EUI-64 or a mapped IPv4 address). When configuring NCCL, NCCL_IB_GID_INDEX selects which GID to use — critical because RoCEv1 and RoCEv2 entries share the GID table (index 0 is typically RoCEv1; RoCEv2 entries appear at higher indices).
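A minimal inspection-and-selection sketch (the device name and index layout are illustrative — the GID table varies per host, so always verify before setting the index):

```bash
# List the GID table: index, GID, derived IP, and RoCE version per entry
show_gids mlx5_0

# On many systems the IPv4-mapped RoCEv2 entry sits at index 3; verify, then:
export NCCL_IB_GID_INDEX=3    # point NCCL at the RoCEv2 entry
export NCCL_IB_HCA=mlx5_0     # optionally pin NCCL to this HCA
```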
| Spectrum-4 Key Specifications | |
|---|---|
| Aggregate BW | 51.2 Tbps |
| Port configurations | 128 × 400GbE or 64 × 800GbE |
| Forwarding mode | Cut-through (ultra-low latency) |
| Buffer architecture | Shared packet buffer with VOQ |
| Adaptive Routing | Hardware per-packet, no SM |
| SHARP support | Yes — in-network AllReduce |
| Telemetry | INT, gRPC streaming, SNMP |
| PFC priorities | 8 traffic classes (IEEE 802.1Qbb) |
| ECN marking | Per-queue threshold, WRED |
| Predecessor | Spectrum-3 (25.6 Tbps) |
| BlueField-3 SuperNIC Specs | |
|---|---|
| Network speed | Up to 400GbE (1 × 400GbE or 2 × 200GbE ports) |
| RDMA protocol | RoCEv2 hardware offload |
| GPUDirect | RDMA + Storage (GDS) |
| Arm cores | 16× Armv8.2+ A78 |
| PCIe | Gen5 × 16 |
| ASAP² | Accelerated Switch and Packet Processing |
| Controller | ConnectX-7 |
| Predecessor | BlueField-2 (up to 200GbE) |
Each GPU NIC is assigned to a dedicated "rail" leaf switch. All intra-rail GPU communications stay at the leaf layer. Inter-node traffic uses spine. No oversubscription at leaf layer.
Classic 3-tier (edge-aggregation-core) or 2-tier Clos. Provides full bisection bandwidth. More flexible port utilization but requires ECMP/AR tuning for AI traffic.
InfiniBand AR is enabled by the Subnet Manager per-port. Spectrum-X AR is fully hardware-driven inside the Spectrum-4 ASIC — no SM equivalent needed. The switch monitors per-port congestion in real time and reroutes packets to less-loaded paths. Works at per-packet granularity.
Instead of GPU → CPU → network → CPU → GPU for AllReduce, SHARP delegates the aggregation to the switch ASIC itself: gradient data is summed as it streams through the switch, so each tensor crosses the fabric once instead of bouncing between endpoints.
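How this is typically switched on from the NCCL side (a sketch — assumes the nccl-rdma-sharp plugin and the SHARP aggregation manager are already deployed per NVIDIA's documentation):

```bash
export NCCL_COLLNET_ENABLE=1    # allow NCCL's CollNet path, which backs onto SHARP
export NCCL_ALGO=CollnetDirect  # optionally force the in-network algorithm for testing
```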
What is the aggregate bandwidth of Spectrum-4?
51.2 Tbps
128 × 400GbE or 64 × 800GbE port configurations. Cut-through forwarding for minimum latency.
What makes BlueField-3 a "SuperNIC" vs a standard NIC?
Arm cores + ConnectX-7
Embedded Arm compute allows offloading security, storage, and network functions. Standard NICs are just forwarding devices.
Why is rail-optimized topology preferred for AI training?
GPU-to-leaf 1:1 mapping
Each GPU NIC connects to a dedicated leaf rail. AllReduce ring traffic stays within a single rail — no spine traversal for intra-step gradients.
Where does SHARP perform AllReduce aggregation?
Inside the switch ASIC
Spectrum-4 performs SUM/MAX/MIN on gradient data as it flows through the switch — no CPU or GPU involved in the aggregation step.
Correct DSCP-to-PFC-priority mapping is the single most common RoCEv2 misconfiguration. Every hop must agree.
| Traffic Class | DSCP Value | PFC Priority | Lossless? | Notes |
|---|---|---|---|---|
| RoCEv2 / RDMA | 26 (AF31) | Priority 3 | ✅ Yes | NCCL default; must match on NIC, switch, and spine |
| Storage (NVMe-oF) | 24 (CS3) | Priority 3 | ✅ Yes | Can share priority 3 with RoCEv2 in mixed clusters |
| Network Control | 48 (CS6) | Priority 6 | ⚠️ Partial | BGP, LACP, STP — must not be blocked by PFC |
| General (lossy) | 0 (BE) | Priority 0 | ❌ No | Management, SSH, HTTP — unaffected by RoCEv2 PFC |
| Cluster Management | 16 (CS2) | Priority 2 | ❌ No | Kubernetes control plane, health checks |
Set trust DSCP on NIC-facing switch ports (hosts mark their own DSCP). Set trust PCP on trunk/uplink ports between switches. A mismatched trust mode causes DSCP re-marking and PFC on the wrong priority.
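A host-side sketch of the above (the netdev name is illustrative; switch-side syntax depends on the NOS):

```bash
mlnx_qos -i enp1s0f0 --trust dscp             # NIC-facing behavior: trust host DSCP marks
mlnx_qos -i enp1s0f0 --pfc 0,0,0,1,0,0,0,0    # lossless service on priority 3 only
mlnx_qos -i enp1s0f0                          # re-read config to confirm both took effect
```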
Allows the NIC's DMA engine to directly access GPU frame buffer memory — skipping the CPU and system DRAM entirely.
- cudaMalloc / cuMemAlloc — GPU buffers allocated this way are what nvidia-peermem exposes for GPUDirect RDMA
- ip link show, ethtool -k — verify link state and offload settings
- mlnx_qos -i <netdev> to inspect PFC/priority configuration
- cpupower idle-set -D 0 — disable deep CPU idle states for consistent latency
- ethtool -S for pause frame counters

```bash
# Note: ip/ethtool/mlnx_qos take the Ethernet netdev name (e.g. enp1s0f0);
# rdma, ibv_*, perfquery, and perftest tools take the RDMA device name (e.g. mlx5_0).

# QoS inspection (NIC side)
mlnx_qos -i <netdev>                              # Show PFC and priority settings
mlnx_qos -i <netdev> --pfc 0,0,0,1,0,0,0,0        # Enable PFC on priority 3 only
ethtool -k <netdev>                               # Check offload settings (GRO, LRO, etc.)
ethtool -S <netdev> | grep pfc                    # PFC pause frame counters

# RDMA statistics
rdma stat show                                    # All RDMA port counters
rdma stat show -a                                 # Extended auto-mode counters
perfquery -x -C mlx5_0 -P 1                       # Extended port counters (InfiniBand compat)

# RDMA performance testing
ib_write_bw -d mlx5_0 -D 30 --report_gbits        # RDMA write bandwidth (server)
ib_write_bw -d mlx5_0 <server_ip> --report_gbits  # Client side
ib_read_lat -d mlx5_0 <server_ip>                 # Read latency test

# DCQCN / CNP counters
ethtool -S <netdev> | grep cnp                    # CNP sent/received (congestion signals)
ethtool -S <netdev> | grep ecn                    # ECN-marked packet counters

# RoCEv2 GID inspection
show_gids                                         # Show all GID table entries
ibv_devinfo -d mlx5_0 -v                          # Device info + GIDs verbose

# Switch-side (MLNX-OS)
show qos interface ethernet 1/1                   # Interface QoS counters
show roce cnp-rx                                  # CNP receive stats on switch
show interfaces ethernet 1/1 counters             # Port counters
```
Before declaring a Spectrum-X cluster production-ready: (1) ib_write_bw ≥ 380 Gbps per 400GbE link; (2) CNP counter stays at 0 under normal load (non-zero = congestion, check thresholds); (3) PFC pause counters increment only under extreme load — frequent pausing means ECN thresholds are set too high; (4) MTU consistency verified with ping -M do -s 8972 across all fabric hops.
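The same checklist as runnable one-liners (a sketch — device and peer names are placeholders):

```bash
ib_write_bw -d mlx5_0 -D 30 --report_gbits    # on server; run client with <server_ip>
ethtool -S <netdev> | grep -E 'cnp|ecn'       # expect ~0 CNPs under normal load
ethtool -S <netdev> | grep pfc                # pause counters should stay near-static
ping -M do -s 8972 -c 3 <peer_ip>             # 8972 B payload + 28 B headers = 9000 MTU
```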
| Feature | 🔵 Spectrum-X (RoCEv2) | 🟣 InfiniBand (NDR/HDR) |
|---|---|---|
| Transport Protocol | UDP/IP (L3-routable) — works across IP subnets | Native IB transport — subnet-scoped, needs SM |
| Max Link Speed | 800GbE (Spectrum-4 ready) | NDR: 400 Gb/s per port |
| Latency (port-to-port) | ~300–500 ns (cut-through) | ~100–300 ns (native IB) |
| Lossless Fabric | Requires PFC + ECN + DCQCN tuning | Built-in — IB transport is inherently lossless |
| Congestion Control | DCQCN (ECN + rate adaptation) | Hardware FECN → BECN → rate reduction at the HCA (configured via SM) |
| Adaptive Routing | Hardware per-packet (Spectrum-4 ASIC, no SM) | SM-computed per-QP (configurable AR in Quantum-2) |
| In-Network Compute | SHARP on Spectrum-4 switches | SHARP on Quantum-2 switches |
| Network Management | Standard Ethernet (SNMP, gNMI, BGP, LLDP) | OpenSM, UFM (NVIDIA-proprietary) |
| Subnet Manager | Not required — L3 IP routing | Required — OpenSM or UFM |
| Routing Algorithm | ECMP, Adaptive Routing, Flowlet | ftree, MINHOP, DFSSSP, Adaptive Routing |
| Addressing | IPv4/IPv6 + GID (derived from IP) | LID (local) + GID (global), assigned by SM |
| Ecosystem | Broad — any Ethernet vendor for uplinks, standard tooling | NVIDIA (Mellanox) proprietary ecosystem |
| Operational Complexity | Lower — familiar Ethernet operations | Higher — SM config, LID space, IB partitions |
| GPUDirect RDMA | Yes — RoCEv2 GPUDirect | Yes — native IB GPUDirect |
| Best for | Brownfield Ethernet, multi-tenant, scale-out LLM training | MPI-heavy HPC, minimum latency, tightly coupled jobs |
| Aspect | RoCEv2 | Native InfiniBand |
|---|---|---|
| L3 protocol | UDP/IP (port 4791) | IB transport (no IP) |
| Addressing | IP address + GID | LID (local) + GID (global) |
| Routing scope | Inter-subnet (L3 routable) | Intra-subnet only (LID scope) |
| Congestion response | DCQCN (α-based rate control on the NIC) | Hardware FECN/BECN (HCA-driven) |
| Lossless requirement | Must configure PFC + ECN | Built-in at protocol level |
| MTU typical | 9000 bytes (jumbo) | 4096 bytes (IB standard) |
| Verbs API | Same libibverbs / ibv_* calls | Same libibverbs / ibv_* calls |
| NCCL support | NCCL IB plugin (same code path) | NCCL IB plugin (same code path) |
NVIDIA positions Spectrum-X as achieving "IB-class performance on Ethernet" for AI training workloads. This is achieved through: hardware adaptive routing (no SM jitter), SHARP AllReduce (in-network aggregation), and lossless fabric (PFC + DCQCN). The claim is validated for NCCL-based AllReduce at scale — but native IB still wins on raw latency for MPI-heavy workloads.
The three mechanisms that make RoCEv2 work: PFC (stop traffic before drops), ECN (mark before overflow), and DCQCN (tie the marks to sender-side rate control).
When congestion happens: CE bit marked → receiver sends CNP → sender increases α → sender reduces TX rate. No CNP = rate recovers.
SHARP does the AllReduce inside the Spectrum-4 switch ASIC — not in the GPU, not in the CPU. Think: the switch is doing your gradient addition for you.
Each GPU NIC connects to its own dedicated leaf switch rail. 8 NICs per node → 8 rails. AllReduce ring stays within a single rail — never touches the spine.
In IB, addresses are LIDs (SM-assigned). In RoCEv2, addresses are GIDs derived from IP/IPv6. GID index 0 is typically RoCEv1; RoCEv2 entries sit at higher indices. Set NCCL_IB_GID_INDEX accordingly — commonly 3 for RoCEv2 over IPv4, but verify with show_gids.
DSCP 26 (AF31) → PFC Priority 3 → Lossless queue. This mapping must be consistent on every NIC, every leaf, and every spine. One mismatch = lossy RDMA.
GPUDirect RDMA: the NIC DMA engine reads/writes GPU memory directly. CPU is completely out of the data path. Requires: nvidia-peermem loaded, sufficient BAR1 aperture, and NIC + GPU on the same PCIe root/NUMA node for full bandwidth.
Spectrum-4 = 51.2 Tbps aggregate. 128 × 400GbE or 64 × 800GbE ports. Its predecessor Spectrum-3 was 25.6 Tbps — exactly half. Easy doubling pattern to remember.
Common scenarios with targeted checks:

**Diagnosing slow or unstable NCCL performance**
- Run ib_write_bw between two hosts — target ≥380 Gbps per 400GbE link. If well below, check MTU (must be 9000 jumbo on all hops).
- ethtool -S <netdev> | grep cnp — non-zero CNP counters mean active congestion. Review ECN/DCQCN thresholds.
- mlnx_qos -i <netdev> — PFC must be enabled on priority 3 for RoCEv2 traffic.
- Confirm the nvidia-peermem module is loaded for GPUDirect: lsmod | grep nvidia_peermem.
- Disable deep CPU idle states (cpupower idle-set -D 0) and set the governor to performance.
- NCCL_IB_GID_INDEX should be 3 (RoCEv2), not 0 (RoCEv1/IB) — verify with show_gids.

**Bringing up a lossless RoCEv2 fabric**
- Enable PFC on priority 3: mlnx_qos -i <netdev> --pfc 0,0,0,1,0,0,0,0.
- Set jumbo MTU: ip link set <netdev> mtu 9000.
- Validate path MTU end-to-end: ping -M do -s 8972 <remote_ip> — should not fragment or drop.
- Monitor: ethtool -S <netdev> | grep -E 'pfc|cnp|ecn' — healthy = low/zero CNP counts.