Master the foundational networking layer behind GPU clusters — fat-tree topologies, rail-optimized designs, InfiniBand generations, RoCE v2, RDMA, SHARP, and GPU-optimized storage.
AI training is dominated by east-west GPU-to-GPU AllReduce operations, not north-south data ingestion. RDMA eliminates the CPU from the data path. NCCL orchestrates ring and tree algorithms across thousands of GPUs.
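To ground this, here is a minimal sketch of an NCCL-backed AllReduce from PyTorch; the tensor size, filename, and `torchrun` launch line are illustrative.

```python
# Minimal NCCL AllReduce in PyTorch; launch with, e.g.:
#   torchrun --nproc_per_node=8 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    # NCCL backend: data moves GPU-to-GPU over NVLink / InfiniBand / RoCE
    # via RDMA, without staging through host CPU memory.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    rank = dist.get_rank()
    # Stand-in for a gradient tensor; in real training this is each
    # rank's local gradient shard.
    grad = torch.full((1024,), float(rank), device="cuda")

    # Sum across all ranks and distribute the result back to every rank.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: grad[0] = {grad[0].item()}")  # sum of 0..world_size-1

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```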
Fat-tree provides full bisection bandwidth when uplinks = downlinks at every tier. Rail-optimized assigns each GPU NIC index to its own rail switch — AllReduce on GPU rank N stays within Rail Switch N.
InfiniBand spans EDR 100G → XDR 800G. RoCE v2 brings RDMA to Ethernet with PFC + ECN for lossless delivery. SHARP performs AllReduce inside switch ASICs, cutting latency up to 50%.
GPUDirect Storage eliminates the CPU from the storage-to-GPU path. NVMe-oF delivers <20µs remote NVMe latency over RDMA fabrics. Parallel file systems such as Lustre and GPFS provide high aggregate bandwidth for checkpointing and data ingestion.
NVLink & NVSwitch connect GPUs inside one DGX system. NVLink 4.0 delivers up to 900 GB/s bidirectional per GPU on H100. No OS or CPU involvement; pure hardware crossbar switching.
NVLink 4.0 · NVSwitch 3.0 · NVLink Switch
InfiniBand or Ethernet connects GPU servers across racks and pods. RDMA operations bypass the CPU. NCCL orchestrates AllReduce across potentially thousands of GPU nodes.
ConnectX-7 · NDR 400G IB · Spectrum-X · RoCE v2
GPU ↔ GPU collective communications — AllReduce, AllGather, ReduceScatter. Can be 80%+ of all cluster traffic during training.
Data ingestion from storage, checkpoint writes, management, and model-serving API responses. Important but secondary to training collectives.
GPU 0 from every server → Rail Switch 0. AllReduce across GPU rank 0 never crosses rail switch boundaries — minimizing hop count and latency for collective operations.
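A small sketch of that cabling rule makes the mapping concrete; the rail count and switch names below are illustrative, not from any vendor tool.

```python
# Illustrative rail-optimized cabling plan: GPU/NIC index N on every server
# lands on Rail Switch N, so rank-aligned collectives stay on one switch.
NUM_RAILS = 8  # e.g., 8 ConnectX-7 NICs per DGX H100

def rail_switch_for(server: int, gpu_index: int) -> str:
    # The rail is determined by the GPU/NIC index alone, never by the server.
    return f"rail-switch-{gpu_index % NUM_RAILS}"

for server in range(4):
    print([rail_switch_for(server, g) for g in range(NUM_RAILS)])
# Every row is identical: GPU 0 of every server shares rail-switch-0, etc.
```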
Pauses a specific traffic class (not all traffic) when switch buffers approach a threshold. Creates per-priority back-pressure at Layer 2. Reactive — triggers after congestion is detected. Required for lossless RoCE v2 delivery.
Switches mark packets with congestion signals before buffers overflow. Endpoints throttle their injection rate (via DCQCN). Proactive — prevents loss before it happens. Works alongside PFC for complete lossless coverage.
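To see how the reactive/proactive split plays out, here is a toy rate-control loop in the spirit of DCQCN. The threshold, drain rate, and update rules are invented for illustration; real DCQCN signals congestion with CNP packets and uses a more elaborate increase/decrease schedule.

```python
# Toy model of ECN-driven rate control in the spirit of DCQCN. It only
# illustrates the proactive idea: throttle injection *before* buffers
# overflow and PFC has to fire.
ECN_THRESHOLD = 0.7  # mark packets once the buffer is 70% full (illustrative)
DRAIN_RATE = 0.8     # how fast the switch drains the buffer (illustrative)

rate, buffer_fill = 1.0, 0.0
for step in range(10):
    buffer_fill = max(0.0, buffer_fill + rate - DRAIN_RATE)
    marked = buffer_fill > ECN_THRESHOLD  # switch marks; nothing is dropped
    # Endpoint reaction: multiplicative decrease on a mark, gentle probe up otherwise.
    rate = rate * 0.5 if marked else min(1.0, rate * 1.05)
    print(f"step {step}: fill={buffer_fill:.2f} marked={marked} rate={rate:.2f}")
```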
Eliminates CPU and system RAM from the storage→GPU data path. DMA engines move data directly between NVMe/storage fabric and GPU HBM. Dramatically faster checkpoint restore and dataset loading for large models.
Extends the NVMe protocol over fabrics: RDMA (InfiniBand, RoCE) or Fibre Channel (FC-NVMe). Achieves <20µs latency for remote NVMe access. Essential for shared checkpoint storage in multi-node GPU clusters.
| Concept | Pillar | Key Detail | Exam Tip |
|---|---|---|---|
| AI Factory | Foundations | Purpose-built facility optimized end-to-end for AI training — power, cooling, networking, storage co-designed | Distinct from general data center; full-stack GPU optimization |
| RDMA | Foundations | Remote Direct Memory Access — NIC reads/writes remote GPU memory without involving remote CPU | Zero-copy, kernel-bypass; enables µs-latency GPU collectives |
| AllReduce | Foundations | Collective op summing gradient tensors across all GPUs and distributing result to all | Ring-AllReduce and tree-AllReduce are common algorithms |
| NCCL | Foundations | NVIDIA Collective Communications Library — optimized collectives for GPU clusters over IB or RoCE | Topology-aware rings; auto-negotiates transport (IB or RoCE) |
| East-West Traffic | Foundations | GPU-to-GPU collective traffic flowing laterally; dominates AI training bandwidth requirements | Must size fabric for east-west, not north-south |
| Fat-Tree | Topology | Multi-tier tree where uplinks = downlinks at each tier; provides full bisection bandwidth | Non-blocking = no oversubscription = no congestion penalty |
| Rail-Optimized | Topology | Each GPU NIC index N connects to Rail Switch N across all servers in the cluster | GPU0 of every server → Rail Switch 0; AllReduce stays within one switch |
| Multi-Rail | Topology | Multiple NICs per GPU server; NCCL uses all rails simultaneously for aggregate bandwidth | DGX H100 ships with 8 × ConnectX-7 NICs = 8 rails |
| Scale-Up (NVLink) | Topology | NVLink 4.0 connects GPUs within one node at 900 GB/s bidirectional; NVSwitch is the crossbar | No OS or CPU involvement; hardware switch fabric |
| Scale-Out (IB/Eth) | Topology | ConnectX NICs provide RDMA over InfiniBand or RoCE between nodes in a cluster | NCCL orchestrates; each GPU NIC connects to a rail switch |
| InfiniBand NDR | IB vs Eth | Next Data Rate — 400 Gb/s per port; current-generation for DGX H100 clusters | HDR=200G, NDR=400G, XDR=800G. ConnectX-7 = NDR. |
| RoCE v2 | IB vs Eth | RDMA over Converged Ethernet — routable at Layer 3; requires PFC + ECN for lossless operation | Spectrum-X uses Advanced Congestion Control (ACC) + RoCE v2 |
| PFC | IB vs Eth | Priority Flow Control — pauses specific traffic class at Layer 2 when buffers approach overflow | Reactive; creates back-pressure; needed alongside ECN |
| ECN / DCQCN | IB vs Eth | Explicit Congestion Notification — marks packets before loss; DCQCN is end-to-end rate control | Proactive; works with PFC for a complete lossless solution |
| SHARP | IB vs Eth | In-network AllReduce inside IB switch ASICs; hierarchical aggregation cuts AllReduce latency ≤50% | Unique to InfiniBand; programmed via UFM aggregation trees |
| Subnet Manager | IB vs Eth | Controls IB fabric — assigns LIDs, computes routing, detects link failures; OpenSM or NVIDIA UFM | Every IB subnet must have exactly one active SM |
| UFM | IB vs Eth | Unified Fabric Manager — enterprise SM + telemetry + WJH (What Just Happened) diagnostics | Goes beyond OpenSM; provides real-time fabric visibility and adaptive routing |
| GPUDirect Storage | Storage | Direct DMA path from NVMe/SAN to GPU HBM; eliminates CPU and system RAM from copy path | Speeds checkpoint restore; requires compatible NIC + cuFile API |
| NVMe-oF | Storage | NVMe over Fabrics — extends the NVMe protocol over RDMA or FC; <20µs latency for remote NVMe | Transport options: RDMA (RoCE/IB), FC-NVMe, NVMe/TCP |
| Lustre | Storage | POSIX-compliant parallel file system; splits files across OSTs; widely used in HPC and AI clusters | High aggregate bandwidth via striping; common in large training clusters |
| GPFS / Spectrum Scale | Storage | IBM parallel clustered file system; global namespace, tiering, and AFM caching support | Enterprise-grade; supports both HPC and AI workloads |
| Checkpoint Bandwidth | Storage | Storage bandwidth for saving model state; 70B-param model in BF16 = ~140 GB; all nodes write simultaneously | Plan for N nodes × per-node write rate as burst storage requirement |
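The checkpoint-bandwidth row rewards a worked example. The node count and target checkpoint time below are illustrative; the 140 GB figure follows directly from 70B parameters at 2 bytes each in BF16.

```python
# Worked checkpoint-bandwidth estimate with illustrative cluster numbers.
PARAMS = 70e9            # 70B-parameter model
BYTES_PER_PARAM = 2      # BF16 weights
NODES = 128              # GPU servers writing the checkpoint concurrently
TARGET_SECONDS = 60      # acceptable checkpoint stall (illustrative)

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9   # ≈ 140 GB of weights
aggregate_gbps = weights_gb / TARGET_SECONDS  # aggregate GB/s the storage must absorb
per_node_gbps = aggregate_gbps / NODES        # sustained write rate per node

print(f"weights: {weights_gb:.0f} GB")
print(f"aggregate: {aggregate_gbps:.2f} GB/s, per node: {per_node_gbps:.3f} GB/s")
# Optimizer state (e.g., FP32 Adam moments) can multiply this several-fold;
# size burst storage for the full training state, not just the weights.
```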
A compact but production-grade GPU cluster showing rail-optimized topology, SHARP, and UFM working together.
RoCE v2 deployment for cloud providers offering GPU compute to multiple tenants simultaneously.
XDR InfiniBand fat-tree for a large-scale LLM training cluster requiring full bisection bandwidth.
Planning storage network bandwidth for concurrent checkpoint writes from a large GPU cluster.
AI Factory = purpose-built facility where power, cooling, fabric, and storage are all co-designed end-to-end for AI training at scale. AI Data Center = general facility hosting AI among other workloads. AI Factory maximizes GPU utilization through integrated, holistic design.
GPU N from every server → Rail Switch N. GPU 0 of all servers share Rail Switch 0. AllReduce for GPU rank 0 never crosses to another switch. Benefit: minimizes hop count and latency for collective operations across the entire cluster.
Non-blocking condition: uplinks = downlinks at every switch tier. Full bisection bandwidth = cutting the network in half, each half retains 100% of edge bandwidth for cross-half communication. No oversubscription = no congestion penalty during AllReduce.
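A quick arithmetic check on the non-blocking condition, with illustrative port and leaf counts:

```python
# Non-blocking fat-tree check: a 64-port leaf is non-blocking only when
# uplinks equal downlinks (32 up, 32 down). Numbers are illustrative.
LEAF_PORTS = 64
PORT_GBPS = 400  # NDR InfiniBand per port

def oversubscription(downlinks: int) -> float:
    uplinks = LEAF_PORTS - downlinks
    return downlinks / uplinks  # 1.0 = non-blocking, >1.0 = oversubscribed

print(oversubscription(32))  # 1.0 -> full bisection bandwidth
print(oversubscription(48))  # 3.0 -> 3:1 oversubscribed; AllReduce pays a congestion penalty

# Bisection bandwidth of the non-blocking split: half the servers can push
# line rate to the other half simultaneously.
servers_per_leaf, leaves = 32, 16
print(f"bisection ≈ {servers_per_leaf * leaves // 2 * PORT_GBPS / 1000} Tb/s")
```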
RDMA = NIC transfers directly to remote GPU memory, no CPU. RoCE v2 = RDMA over Ethernet at L3. Lossless requires: PFC (pause per priority class, reactive) + ECN (mark before drop, proactive). Without both, retransmissions collapse RDMA throughput.
EDR → 100 Gb/s (ConnectX-4)
HDR → 200 Gb/s (ConnectX-6)
NDR → 400 Gb/s (ConnectX-7)
XDR → 800 Gb/s (ConnectX-8)
Mnemonic: Every Hundred's Nicely Doubled eXponentially (EDR → HDR → NDR → XDR)
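These per-port speeds compound with multi-rail designs; a quick lookup shows the per-server aggregate an 8-rail server would see at each generation (the 8-rail figure is illustrative, matching the DGX H100 example above).

```python
# InfiniBand generation speeds from the list above, plus the per-server
# aggregate that multi-rail designs buy you.
IB_GBPS = {"EDR": 100, "HDR": 200, "NDR": 400, "XDR": 800}

RAILS = 8  # e.g., 8 × ConnectX-7 in a DGX H100
for gen, gbps in IB_GBPS.items():
    print(f"{gen}: {gbps} Gb/s per port -> {RAILS * gbps / 1000:.1f} Tb/s per 8-rail server")
```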
SHARP = Scalable Hierarchical Aggregation and Reduction Protocol. Runs inside InfiniBand switch ASICs — not on host CPUs. Reduces AllReduce time by up to 50%. Unique to InfiniBand. Programmed via UFM aggregation trees. Not available on standard Ethernet.
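A toy step-count comparison hints at why hierarchical aggregation scales. The formulas below are a simplification: SHARP's real benefit comes from performing the reduction inside switch ASICs, which halves the data crossing the fabric, not from this exact math.

```python
# Toy step-count comparison: ring AllReduce vs hierarchical (SHARP-style)
# in-network aggregation. Purely illustrative.
import math

def ring_steps(n: int) -> int:
    # Classic ring AllReduce: reduce-scatter + all-gather, 2(N-1) steps.
    return 2 * (n - 1)

def tree_steps(n: int) -> int:
    # Hierarchical aggregation: one pass up the tree, one pass down.
    return 2 * math.ceil(math.log2(n))

for n in (8, 64, 512, 4096):
    print(f"{n:5d} GPUs: ring={ring_steps(n):5d} steps, tree={tree_steps(n):3d} steps")
```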
GDS creates a direct DMA path: Storage → NIC → GPU HBM, bypassing both CPU and system RAM. Result: dramatically faster checkpoint saves and dataset loads. Requires: compatible ConnectX NIC, cuFile API, CUDA 11.4+. Pairs with NVMe-oF for remote storage access.
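One way to exercise the cuFile path from Python is RAPIDS KvikIO, which wraps the cuFile API. This is a sketch only: it assumes kvikio and cupy are installed, the file path is hypothetical, and the storage/NIC path must be GDS-capable (otherwise KvikIO falls back to a POSIX bounce-buffer copy).

```python
# Sketch of a GPUDirect Storage read via RAPIDS KvikIO (cuFile bindings).
import cupy
import kvikio

# Destination buffer lives in GPU HBM (~140 MB of FP16 here, illustrative).
checkpoint = cupy.empty(70_000_000, dtype=cupy.float16)

with kvikio.CuFile("/mnt/ckpt/shard0.bin", "r") as f:  # hypothetical path
    nbytes = f.read(checkpoint)  # DMA: storage -> GPU HBM, no CPU bounce copy

print(f"read {nbytes / 1e6:.0f} MB directly into GPU memory")
```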
Scale-Up: NVLink/NVSwitch within one node. 900 GB/s bidir. No CPU/OS. Hardware crossbar. Scale-Out: IB or Ethernet between nodes. RDMA via ConnectX. NCCL manages. Together: NVLink handles intra-node, IB/Ethernet handles inter-node collectives.
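In practice the scale-up/scale-out split is something NCCL negotiates automatically: NVLink for intra-node peers, IB or RoCE across nodes. A few real NCCL environment variables let you pin or inspect that choice; the device and interface names below are illustrative for a ConnectX-based cluster.

```python
# NCCL transport selection knobs, set before initializing the process group.
import os

os.environ["NCCL_DEBUG"] = "INFO"          # log transport selection at init
os.environ["NCCL_IB_HCA"] = "mlx5"         # restrict RDMA to ConnectX (mlx5_*) devices
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # bootstrap/control interface (illustrative)

# ...then call dist.init_process_group(backend="nccl") as in the AllReduce
# sketch above; NCCL picks NVLink/P2P within the node and IB/RoCE across nodes.
```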