Master the foundational networking layer behind GPU clusters — fat-tree topologies, rail-optimized designs, InfiniBand generations, RoCE v2, RDMA, SHARP, and GPU-optimized storage.
AI training is dominated by east-west GPU-to-GPU AllReduce operations, not north-south data ingestion. RDMA eliminates the CPU from the data path. NCCL orchestrates ring and tree algorithms across thousands of GPUs.
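To ground this, here is a minimal sketch of an NCCL-backed AllReduce from PyTorch; the tensor size, filename, and `torchrun` launch line are illustrative.

```python
# Minimal NCCL AllReduce in PyTorch; launch with, e.g.:
#   torchrun --nproc_per_node=8 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    # NCCL backend: data moves GPU-to-GPU over NVLink / InfiniBand / RoCE
    # via RDMA, without staging through host CPU memory.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    rank = dist.get_rank()
    # Stand-in for a gradient tensor; in real training this is each
    # rank's local gradient shard.
    grad = torch.full((1024,), float(rank), device="cuda")

    # Sum across all ranks and distribute the result back to every rank.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: grad[0] = {grad[0].item()}")  # sum of 0..world_size-1

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```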
Fat-tree provides full bisection bandwidth when uplinks = downlinks at every tier. Rail-optimized assigns each GPU NIC index to its own rail switch — AllReduce on GPU rank N stays within Rail Switch N.
InfiniBand spans EDR 100G → XDR 800G. RoCE v2 brings RDMA to Ethernet with PFC + ECN for lossless delivery. SHARP performs AllReduce inside switch ASICs, cutting latency up to 50%.
GPUDirect Storage eliminates the CPU from the storage-to-GPU path. NVMe-oF delivers <20µs remote NVMe latency over RDMA fabrics. Parallel file systems such as Lustre and GPFS provide high aggregate bandwidth for checkpointing and data ingestion.
NVLink & NVSwitch connect GPUs inside one DGX system. NVLink 4.0 delivers up to 900 GB/s bidirectional per GPU on H100. No OS or CPU involvement; pure hardware crossbar switching.
NVLink 4.0 · NVSwitch 3.0 · NVLink Switch
InfiniBand or Ethernet connects GPU servers across racks and pods. RDMA operations bypass the CPU. NCCL orchestrates AllReduce across potentially thousands of GPU nodes.
ConnectX-7 · NDR 400G IB · Spectrum-X · RoCE v2
GPU ↔ GPU collective communications — AllReduce, AllGather, ReduceScatter. Can be 80%+ of all cluster traffic during training.
Data ingestion from storage, checkpoint writes, management, and model-serving API responses. Important but secondary to training collectives.
GPU 0 from every server → Rail Switch 0. AllReduce across GPU rank 0 never crosses rail switch boundaries — minimizing hop count and latency for collective operations.
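A small sketch of that cabling rule makes the mapping concrete; the rail count and switch names below are illustrative, not from any vendor tool.

```python
# Illustrative rail-optimized cabling plan: GPU/NIC index N on every server
# lands on Rail Switch N, so rank-aligned collectives stay on one switch.
NUM_RAILS = 8  # e.g., 8 ConnectX-7 NICs per DGX H100

def rail_switch_for(server: int, gpu_index: int) -> str:
    # The rail is determined by the GPU/NIC index alone, never by the server.
    return f"rail-switch-{gpu_index % NUM_RAILS}"

for server in range(4):
    print([rail_switch_for(server, g) for g in range(NUM_RAILS)])
# Every row is identical: GPU 0 of every server shares rail-switch-0, etc.
```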
Pauses a specific traffic class (not all traffic) when switch buffers approach a threshold. Creates per-priority back-pressure at Layer 2. Reactive — triggers after congestion is detected. Required for lossless RoCE v2 delivery.
Switches mark packets with congestion signals before buffers overflow. Endpoints throttle their injection rate (via DCQCN). Proactive — prevents loss before it happens. Works alongside PFC for complete lossless coverage.
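To see how the reactive/proactive split plays out, here is a toy rate-control loop in the spirit of DCQCN. The threshold, drain rate, and update rules are invented for illustration; real DCQCN signals congestion with CNP packets and uses a more elaborate increase/decrease schedule.

```python
# Toy model of ECN-driven rate control in the spirit of DCQCN. It only
# illustrates the proactive idea: throttle injection *before* buffers
# overflow and PFC has to fire.
ECN_THRESHOLD = 0.7  # mark packets once the buffer is 70% full (illustrative)
DRAIN_RATE = 0.8     # how fast the switch drains the buffer (illustrative)

rate, buffer_fill = 1.0, 0.0
for step in range(10):
    buffer_fill = max(0.0, buffer_fill + rate - DRAIN_RATE)
    marked = buffer_fill > ECN_THRESHOLD  # switch marks; nothing is dropped
    # Endpoint reaction: multiplicative decrease on a mark, gentle probe up otherwise.
    rate = rate * 0.5 if marked else min(1.0, rate * 1.05)
    print(f"step {step}: fill={buffer_fill:.2f} marked={marked} rate={rate:.2f}")
```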
Eliminates CPU and system RAM from the storage→GPU data path. DMA engines move data directly between NVMe/storage fabric and GPU HBM. Dramatically faster checkpoint restore and dataset loading for large models.
Extends the NVMe protocol over fabrics: RDMA (InfiniBand, RoCE) or Fibre Channel (FC-NVMe). Achieves <20µs latency for remote NVMe access. Essential for shared checkpoint storage in multi-node GPU clusters.
| Concept | Pillar | Key Detail | Exam Tip |
|---|---|---|---|
| AI Factory | Foundations | Purpose-built facility optimized end-to-end for AI training — power, cooling, networking, storage co-designed | Distinct from general data center; full-stack GPU optimization |
| RDMA | Foundations | Remote Direct Memory Access — NIC reads/writes remote GPU memory without involving remote CPU | Zero-copy, kernel-bypass; enables µs-latency GPU collectives |
| AllReduce | Foundations | Collective op summing gradient tensors across all GPUs and distributing result to all | Ring-AllReduce and tree-AllReduce are common algorithms |
| NCCL | Foundations | NVIDIA Collective Communications Library — optimized collectives for GPU clusters over IB or RoCE | Topology-aware rings; auto-negotiates transport (IB or RoCE) |
| East-West Traffic | Foundations | GPU-to-GPU collective traffic flowing laterally; dominates AI training bandwidth requirements | Must size fabric for east-west, not north-south |
| Fat-Tree | Topology | Multi-tier tree where uplinks = downlinks at each tier; provides full bisection bandwidth | Non-blocking = no oversubscription = no congestion penalty |
| Rail-Optimized | Topology | Each GPU NIC index N connects to Rail Switch N across all servers in the cluster | GPU0 of every server → Rail Switch 0; AllReduce stays within one switch |
| Multi-Rail | Topology | Multiple NICs per GPU server; NCCL uses all rails simultaneously for aggregate bandwidth | DGX H100 ships with 8 × ConnectX-7 NICs = 8 rails |
| Scale-Up (NVLink) | Topology | NVLink 4.0 connects GPUs within one node at 900 GB/s bidirectional; NVSwitch is the crossbar | No OS or CPU involvement; hardware switch fabric |
| Scale-Out (IB/Eth) | Topology | ConnectX NICs provide RDMA over InfiniBand or RoCE between nodes in a cluster | NCCL orchestrates; each GPU NIC connects to a rail switch |
| InfiniBand NDR | IB vs Eth | Next Data Rate — 400 Gb/s per port; current-generation for DGX H100 clusters | HDR=200G, NDR=400G, XDR=800G. ConnectX-7 = NDR. |
| RoCE v2 | IB vs Eth | RDMA over Converged Ethernet — routable at Layer 3; requires PFC + ECN for lossless operation | Spectrum-X uses Advanced Congestion Control (ACC) + RoCE v2 |
| PFC | IB vs Eth | Priority Flow Control — pauses specific traffic class at Layer 2 when buffers approach overflow | Reactive; creates back-pressure; needed alongside ECN |
| ECN / DCQCN | IB vs Eth | Explicit Congestion Notification — marks packets before loss; DCQCN is end-to-end rate control | Proactive; works with PFC for a complete lossless solution |
| SHARP | IB vs Eth | In-network AllReduce inside IB switch ASICs; hierarchical aggregation cuts AllReduce latency ≤50% | Unique to InfiniBand; programmed via UFM aggregation trees |
| Subnet Manager | IB vs Eth | Controls IB fabric — assigns LIDs, computes routing, detects link failures; OpenSM or NVIDIA UFM | Every IB subnet must have exactly one active SM |
| UFM | IB vs Eth | Unified Fabric Manager — enterprise SM + telemetry + WJH (What Just Happened) diagnostics | Goes beyond OpenSM; provides real-time fabric visibility and adaptive routing |
| GPUDirect Storage | Storage | Direct DMA path from NVMe/SAN to GPU HBM; eliminates CPU and system RAM from copy path | Speeds checkpoint restore; requires compatible NIC + cuFile API |
| NVMe-oF | Storage | NVMe over Fabrics — extends the NVMe protocol over RDMA or FC; <20µs latency for remote NVMe | Transport options: RDMA (RoCE/IB), FC-NVMe, NVMe/TCP |
| Lustre | Storage | POSIX-compliant parallel file system; splits files across OSTs; widely used in HPC and AI clusters | High aggregate bandwidth via striping; common in large training clusters |
| GPFS / Spectrum Scale | Storage | IBM parallel clustered file system; global namespace, tiering, and AFM caching support | Enterprise-grade; supports both HPC and AI workloads |
| Checkpoint Bandwidth | Storage | Storage bandwidth for saving model state; 70B-param model in BF16 = ~140 GB; all nodes write simultaneously | Plan for N nodes × per-node write rate as burst storage requirement |
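The checkpoint-bandwidth row rewards a worked example. The node count and target checkpoint time below are illustrative; the 140 GB figure follows directly from 70B parameters at 2 bytes each in BF16.

```python
# Worked checkpoint-bandwidth estimate with illustrative cluster numbers.
PARAMS = 70e9            # 70B-parameter model
BYTES_PER_PARAM = 2      # BF16 weights
NODES = 128              # GPU servers writing the checkpoint concurrently
TARGET_SECONDS = 60      # acceptable checkpoint stall (illustrative)

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9   # ≈ 140 GB of weights
aggregate_gbps = weights_gb / TARGET_SECONDS  # aggregate GB/s the storage must absorb
per_node_gbps = aggregate_gbps / NODES        # sustained write rate per node

print(f"weights: {weights_gb:.0f} GB")
print(f"aggregate: {aggregate_gbps:.2f} GB/s, per node: {per_node_gbps:.3f} GB/s")
# Optimizer state (e.g., FP32 Adam moments) can multiply this several-fold;
# size burst storage for the full training state, not just the weights.
```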
A compact but production-grade GPU cluster showing rail-optimized topology, SHARP, and UFM working together.
RoCE v2 deployment for cloud providers offering GPU compute to multiple tenants simultaneously.
XDR InfiniBand fat-tree for a large-scale LLM training cluster requiring full bisection bandwidth.
Planning storage network bandwidth for concurrent checkpoint writes from a large GPU cluster.
AI Factory = purpose-built facility where power, cooling, fabric, and storage are all co-designed end-to-end for AI training at scale. AI Data Center = general facility hosting AI among other workloads. AI Factory maximizes GPU utilization through integrated, holistic design.
GPU N from every server → Rail Switch N. GPU 0 of all servers share Rail Switch 0. AllReduce for GPU rank 0 never crosses to another switch. Benefit: minimizes hop count and latency for collective operations across the entire cluster.
Non-blocking condition: uplinks = downlinks at every switch tier. Full bisection bandwidth = cutting the network in half, each half retains 100% of edge bandwidth for cross-half communication. No oversubscription = no congestion penalty during AllReduce.
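A quick arithmetic check on the non-blocking condition, with illustrative port and leaf counts:

```python
# Non-blocking fat-tree check: a 64-port leaf is non-blocking only when
# uplinks equal downlinks (32 up, 32 down). Numbers are illustrative.
LEAF_PORTS = 64
PORT_GBPS = 400  # NDR InfiniBand per port

def oversubscription(downlinks: int) -> float:
    uplinks = LEAF_PORTS - downlinks
    return downlinks / uplinks  # 1.0 = non-blocking, >1.0 = oversubscribed

print(oversubscription(32))  # 1.0 -> full bisection bandwidth
print(oversubscription(48))  # 3.0 -> 3:1 oversubscribed; AllReduce pays a congestion penalty

# Bisection bandwidth of the non-blocking split: half the servers can push
# line rate to the other half simultaneously.
servers_per_leaf, leaves = 32, 16
print(f"bisection ≈ {servers_per_leaf * leaves // 2 * PORT_GBPS / 1000} Tb/s")
```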
RDMA = NIC transfers directly to remote GPU memory, no CPU. RoCE v2 = RDMA over Ethernet at L3. Lossless requires: PFC (pause per priority class, reactive) + ECN (mark before drop, proactive). Without both, retransmissions collapse RDMA throughput.
EDR → 100 Gb/s (ConnectX-4)
HDR → 200 Gb/s (ConnectX-6)
NDR → 400 Gb/s (ConnectX-7)
XDR → 800 Gb/s (ConnectX-8)
Mnemonic: Every Hundred's Nicely Doubled eXponentially (EDR → HDR → NDR → XDR)
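These per-port speeds compound with multi-rail designs; a quick lookup shows the per-server aggregate an 8-rail server would see at each generation (the 8-rail figure is illustrative, matching the DGX H100 example above).

```python
# InfiniBand generation speeds from the list above, plus the per-server
# aggregate that multi-rail designs buy you.
IB_GBPS = {"EDR": 100, "HDR": 200, "NDR": 400, "XDR": 800}

RAILS = 8  # e.g., 8 × ConnectX-7 in a DGX H100
for gen, gbps in IB_GBPS.items():
    print(f"{gen}: {gbps} Gb/s per port -> {RAILS * gbps / 1000:.1f} Tb/s per 8-rail server")
```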
SHARP = Scalable Hierarchical Aggregation and Reduction Protocol. Runs inside InfiniBand switch ASICs — not on host CPUs. Reduces AllReduce time by up to 50%. Unique to InfiniBand. Programmed via UFM aggregation trees. Not available on standard Ethernet.
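A toy step-count comparison hints at why hierarchical aggregation scales. The formulas below are a simplification: SHARP's real benefit comes from performing the reduction inside switch ASICs, which halves the data crossing the fabric, not from this exact math.

```python
# Toy step-count comparison: ring AllReduce vs hierarchical (SHARP-style)
# in-network aggregation. Purely illustrative.
import math

def ring_steps(n: int) -> int:
    # Classic ring AllReduce: reduce-scatter + all-gather, 2(N-1) steps.
    return 2 * (n - 1)

def tree_steps(n: int) -> int:
    # Hierarchical aggregation: one pass up the tree, one pass down.
    return 2 * math.ceil(math.log2(n))

for n in (8, 64, 512, 4096):
    print(f"{n:5d} GPUs: ring={ring_steps(n):5d} steps, tree={tree_steps(n):3d} steps")
```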
GDS creates a direct DMA path: Storage → NIC → GPU HBM, bypassing both CPU and system RAM. Result: dramatically faster checkpoint saves and dataset loads. Requires: compatible ConnectX NIC, cuFile API, CUDA 11.4+. Pairs with NVMe-oF for remote storage access.
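One way to exercise the cuFile path from Python is RAPIDS KvikIO, which wraps the cuFile API. This is a sketch only: it assumes kvikio and cupy are installed, the file path is hypothetical, and the storage/NIC path must be GDS-capable (otherwise KvikIO falls back to a POSIX bounce-buffer copy).

```python
# Sketch of a GPUDirect Storage read via RAPIDS KvikIO (cuFile bindings).
import cupy
import kvikio

# Destination buffer lives in GPU HBM (~140 MB of FP16 here, illustrative).
checkpoint = cupy.empty(70_000_000, dtype=cupy.float16)

with kvikio.CuFile("/mnt/ckpt/shard0.bin", "r") as f:  # hypothetical path
    nbytes = f.read(checkpoint)  # DMA: storage -> GPU HBM, no CPU bounce copy

print(f"read {nbytes / 1e6:.0f} MB directly into GPU memory")
```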
Scale-Up: NVLink/NVSwitch within one node. 900 GB/s bidir. No CPU/OS. Hardware crossbar. Scale-Out: IB or Ethernet between nodes. RDMA via ConnectX. NCCL manages. Together: NVLink handles intra-node, IB/Ethernet handles inter-node collectives.
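In practice the scale-up/scale-out split is something NCCL negotiates automatically: NVLink for intra-node peers, IB or RoCE across nodes. A few real NCCL environment variables let you pin or inspect that choice; the device and interface names below are illustrative for a ConnectX-based cluster.

```python
# NCCL transport selection knobs, set before initializing the process group.
import os

os.environ["NCCL_DEBUG"] = "INFO"          # log transport selection at init
os.environ["NCCL_IB_HCA"] = "mlx5"         # restrict RDMA to ConnectX (mlx5_*) devices
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # bootstrap/control interface (illustrative)

# ...then call dist.init_process_group(backend="nccl") as in the AllReduce
# sketch above; NCCL picks NVLink/P2P within the node and IB/RoCE across nodes.
```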