NVIDIA NCP-AIN Exam Prep · Topic 1 of 5

AI Network Architecture & Fabric Design

Master the foundational networking layer behind GPU clusters — fat-tree topologies, rail-optimized designs, InfiniBand generations, RoCE v2, RDMA, SHARP, and GPU-optimized storage.

Four Pillars of AI Network Architecture & Fabric Design
GPU utilization in large AI training jobs is dominated by communication overhead. A cluster with poor network fabric design can spend 40–60% of training time waiting for AllReduce. These four pillars cover every architecture concept tested on the NCP-AIN exam.
Pillar 1 · AI Fabric Foundations

East-West Dominance, RDMA & Collectives

AI training is dominated by east-west GPU-to-GPU AllReduce operations — not north-south data ingestion. RDMA eliminates CPU from the data path. NCCL orchestrates ring and tree algorithms across thousands of GPUs.

East-west share: 80% · CPU hops (RDMA): 0 · Target latency: µs-scale
Pillar 2 · Network Topologies

Fat-Tree, Rail-Optimized & Multi-Rail

Fat-tree provides full bisection bandwidth when uplinks = downlinks at every tier. Rail-optimized assigns each GPU NIC index to its own rail switch — AllReduce on GPU rank N stays within Rail Switch N.

Rails per DGX H100: 8 · Fat-tree tiers (max): 3 · Non-blocking ratio: 1:1
Pillar 3 · InfiniBand vs Ethernet

IB Generations, RoCE v2, SHARP & UFM

InfiniBand spans EDR 100G → XDR 800G. RoCE v2 brings RDMA to Ethernet with PFC + ECN for lossless delivery. SHARP performs AllReduce inside switch ASICs, cutting latency up to 50%.

XDR per port: 800G · SHARP latency cut: up to 50% · IB end-to-end latency: ~1µs
Pillar 4 · Storage Networking for AI

GPUDirect, NVMe-oF & Parallel File Systems

GPUDirect Storage eliminates the CPU from the storage-to-GPU path. NVMe-oF delivers <20µs remote NVMe latency over RDMA fabrics. Lustre and GPFS stripe data across servers to deliver high aggregate bandwidth for checkpointing and data ingestion.

CPU bounces (GDS): 0 · NVMe-oF latency: <20µs · Lustre aggregate: TB/s-scale

Scale-Up vs Scale-Out — The Two Dimensions of AI Cluster Networking

Scale-Up
Within a Single Node

NVLink and NVSwitch connect GPUs inside one DGX system, with up to 900 GB/s of bidirectional NVLink 4.0 bandwidth per H100 GPU. No OS or CPU involvement: pure hardware crossbar switching.

NVLink 4.0 · NVSwitch 3.0 · NVLink Switch

Scale-Out
Between Multiple Nodes

InfiniBand or Ethernet connects GPU servers across racks and pods. RDMA operations bypass the CPU. NCCL orchestrates AllReduce across potentially thousands of GPU nodes.

ConnectX-7 · NDR 400G IB · Spectrum-X · RoCE v2

East-West vs North-South Traffic

🔄 East-West (Dominates)

GPU ↔ GPU collective communications — AllReduce, AllGather, ReduceScatter. Can be 80%+ of all cluster traffic during training.

⬆ North-South (Secondary)

Data ingestion from storage, checkpoint writes, management, and model-serving API responses. Important but secondary to training collectives.

Exam tip: Traditional data center designs optimized for north-south are a poor fit for AI training. AI fabric designs must prioritize east-west bandwidth first.
How AI Fabric Design Works
From topology selection through protocol choice to storage integration.

Rail-Optimized Topology

Each GPU NIC Index Connects to Its Dedicated Rail Switch

Rail Switches:  Rail 0 · Rail 1 · Rail 2 · Rail 3
                  ↕        ↕        ↕        ↕
Server A:       G0 NIC · G1 NIC · G2 NIC · G3 NIC
Server B:       G0 NIC · G1 NIC · G2 NIC · G3 NIC
Server C:       G0 NIC · G1 NIC · G2 NIC · G3 NIC

GPU 0 from every server → Rail Switch 0. AllReduce across GPU rank 0 never crosses rail switch boundaries — minimizing hop count and latency for collective operations.
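The cabling rule is simple enough to express as a one-line mapping. Below is a minimal sketch with hypothetical server names and the 4-rail layout from the diagram above (a DGX H100 deployment would use 8 rails):

```python
# Minimal sketch of the rail-optimized cabling rule: NIC index -> rail switch.
# Server names and the 4-rail count are hypothetical; DGX H100 uses 8 rails.

def rail_switch_for(nic_index: int) -> str:
    """Every server's NIC with this index plugs into the same rail switch."""
    return f"Rail Switch {nic_index}"

servers = ["Server A", "Server B", "Server C"]
num_rails = 4

for server in servers:
    for nic in range(num_rails):
        print(f"{server} / GPU{nic} NIC -> {rail_switch_for(nic)}")

# Same-rank GPUs (e.g., GPU 0 on every server) always share one switch,
# so their AllReduce traffic never crosses a rail boundary.
```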

Fat-Tree & Full Bisection Bandwidth

Non-blocking rule: A fat-tree is non-blocking when the number of uplinks equals the number of downlinks at every tier. Full bisection bandwidth means either half of the network can communicate with the other half at full edge bandwidth — no oversubscription penalty.

3-Tier Fat-Tree: Core → Aggregation → Leaf → Servers

Core:         Core SW 1 · Core SW 2 · Core SW 3 · Core SW 4
                ↕ ↕ ↕ ↕
Aggregation:  Agg SW 1 · Agg SW 2 · Agg SW 3 · Agg SW 4
                ↕ ↕ ↕ ↕
Leaf:         Leaf SW 1 · Leaf SW 2 · Leaf SW 3 · Leaf SW 4
                ↕ ↕ ↕ ↕
Servers:      GPU Server · GPU Server · GPU Server · GPU Server
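To check the non-blocking rule for a given switch configuration, compare uplink and downlink counts per tier. The sketch below uses hypothetical port counts and assumes NDR 400 Gb/s links; it is a back-of-the-envelope check, not a design tool:

```python
# Back-of-the-envelope check of the fat-tree non-blocking rule.
# Port counts are hypothetical; links are assumed to be NDR 400 Gb/s.

LINK_GBPS = 400          # per-port bandwidth (NDR)

leaf_downlinks = 32      # ports facing GPU servers
leaf_uplinks = 32        # ports facing the aggregation tier
num_leaves = 16

non_blocking = leaf_uplinks == leaf_downlinks
oversubscription = leaf_downlinks / leaf_uplinks

# Bisection bandwidth: traffic that can cross the "cut" between the two halves
# of the fabric. With uplinks == downlinks it equals half the total edge bandwidth.
edge_bw_tbps = num_leaves * leaf_downlinks * LINK_GBPS / 1000
bisection_tbps = edge_bw_tbps / 2

print(f"non-blocking: {non_blocking}, oversubscription {oversubscription:.1f}:1")
print(f"edge bandwidth: {edge_bw_tbps:.1f} Tb/s, bisection: {bisection_tbps:.1f} Tb/s")
```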

InfiniBand Generations

InfiniBand Bandwidth Evolution — Each Generation ~2× the Previous

EDR · 100 Gb/s · ~2015 · ConnectX-4
HDR · 200 Gb/s · ~2018 · ConnectX-6
NDR · 400 Gb/s · ~2022 · ConnectX-7 · current AI standard
XDR · 800 Gb/s · ~2024+ · ConnectX-8 · next-gen
Memory hook: Every Hundred's Nicely Doubled eXponentially — EDR 100G, HDR 200G, NDR 400G, XDR 800G.

RoCE v2 Lossless Requirements

RDMA = Remote Direct Memory Access. The NIC reads/writes directly into remote GPU memory without involving the remote CPU — zero-copy, kernel-bypass. For RoCE v2 (RDMA over Ethernet), the network must be lossless: even a single dropped packet triggers RDMA retransmission storms that collapse throughput.
PFC — Priority Flow Control

Pauses a specific traffic class (not all traffic) when switch buffers approach a threshold. Creates per-priority back-pressure at Layer 2. Reactive — triggers after congestion is detected. Required for lossless RoCE v2 delivery.

ECN — Explicit Congestion Notification

Switches mark packets with congestion signals before buffers overflow. Endpoints throttle their injection rate (via DCQCN). Proactive — prevents loss before it happens. Works alongside PFC for complete lossless coverage.
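To see how the two mechanisms divide the work, the toy model below mimics DCQCN-style sender behavior: cut the injection rate when ECN marks arrive, recover when they stop. The constants and mark pattern are invented for illustration and do not reflect the actual DCQCN specification or any NIC firmware:

```python
# Toy model of DCQCN-style reaction to ECN marks at a sender.
# Constants and the mark pattern are invented for illustration only.

line_rate_gbps = 400.0
rate = line_rate_gbps          # current injection rate
alpha = 0.0                    # running estimate of congestion level

G = 0.06                       # gain for the alpha estimate (made up)
RECOVERY_STEP = 20.0           # additive increase per quiet interval (made up)

ecn_marked = [False, True, True, False, False, False, True, False, False, False]

for step, marked in enumerate(ecn_marked):
    if marked:
        # ECN mark seen: raise the congestion estimate and cut the rate.
        alpha = (1 - G) * alpha + G
        rate = rate * (1 - alpha / 2)
    else:
        # No marks: decay the estimate and recover bandwidth additively.
        alpha = (1 - G) * alpha
        rate = min(line_rate_gbps, rate + RECOVERY_STEP)
    print(f"step {step}: marked={marked} rate={rate:6.1f} Gb/s alpha={alpha:.3f}")

# PFC is not modeled here: it only kicks in as a per-priority pause if
# buffers still fill despite this end-to-end rate control.
```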

SHARP: In-Network AllReduce

SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) pushes AllReduce computation into InfiniBand switch ASICs. Instead of gradient data making multiple passes between GPU endpoints for reduction, the switches aggregate it hierarchically as it flows through the fabric. Result: up to 50% reduction in AllReduce latency for large distributed training jobs. Unique to InfiniBand and not available on standard Ethernet; managed via aggregation trees in UFM.
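The latency benefit is easiest to see by counting sequential communication steps. The sketch below compares a ring AllReduce against SHARP-style aggregation through a switch tree; the switch radix and the step counting are idealized assumptions, not measured behavior:

```python
# Idealized latency-term comparison: ring AllReduce vs SHARP-style in-network
# reduction. Counts sequential communication steps only; per-step latency,
# bandwidth terms, and the tree radix are illustrative assumptions.

import math

def ring_steps(n_gpus: int) -> int:
    # Ring AllReduce = reduce-scatter + all-gather = 2 * (N - 1) sequential steps.
    return 2 * (n_gpus - 1)

def sharp_steps(n_gpus: int, switch_radix: int = 64) -> int:
    # SHARP aggregates in the switches: one pass up the aggregation tree
    # and one pass back down, roughly 2 * tree depth steps.
    depth = max(1, math.ceil(math.log(n_gpus, switch_radix)))
    return 2 * depth

for n in (8, 64, 1024, 16384):
    print(f"N={n:6d}: ring ≈ {ring_steps(n):6d} steps, SHARP ≈ {sharp_steps(n):2d} steps")
```

The ring's latency term grows linearly with GPU count, while the in-network tree stays at a handful of switch hops, which is where the quoted "up to 50%" AllReduce improvement comes from at scale.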

GPUDirect Storage & NVMe-oF

GPUDirect Storage (GDS)

Eliminates CPU and system RAM from the storage→GPU data path. DMA engines move data directly between NVMe/storage fabric and GPU HBM. Dramatically faster checkpoint restore and dataset loading for large models.

NVMe-oF (NVMe over Fabrics)

Extends NVMe protocol over RDMA fabrics (InfiniBand, RoCE, FC-NVMe). Achieves <20µs latency for remote NVMe access. Essential for shared checkpoint storage in multi-node GPU clusters.
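A minimal sketch of what the GDS path looks like from Python, using the kvikio bindings for the cuFile API. The file path, buffer size, and the availability of kvikio, CuPy, and GDS-capable storage are all assumptions here:

```python
# Hypothetical sketch: read a checkpoint shard straight into GPU memory via
# GPUDirect Storage, using the kvikio Python bindings for cuFile.
# Assumes kvikio + CuPy are installed and the filesystem/NIC support GDS.

import cupy as cp
import kvikio

# 1 GiB destination buffer already resident in GPU HBM (size is illustrative).
shard = cp.empty(1024**3, dtype=cp.uint8)

with kvikio.CuFile("/mnt/lustre/ckpt/shard_0.bin", "r") as f:
    # DMA from storage directly into GPU memory; no bounce through host RAM.
    nbytes = f.read(shard)

print(f"read {nbytes} bytes into GPU memory without a CPU copy")
```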

Concept Comparison Table
All 22 key NCP-AIN networking concepts, compared side by side.
Concept | Pillar | Key Detail | Exam Tip
AI Factory | Foundations | Purpose-built facility optimized end-to-end for AI training — power, cooling, networking, storage co-designed | Distinct from a general data center; full-stack GPU optimization
RDMA | Foundations | Remote Direct Memory Access — NIC reads/writes remote GPU memory without involving the remote CPU | Zero-copy, kernel-bypass; enables µs-latency GPU collectives
AllReduce | Foundations | Collective op summing gradient tensors across all GPUs and distributing the result to all | Ring-AllReduce and tree-AllReduce are common algorithms
NCCL | Foundations | NVIDIA Collective Communications Library — optimized collectives for GPU clusters over IB or RoCE | Topology-aware rings; auto-negotiates transport (IB or RoCE)
East-West Traffic | Foundations | GPU-to-GPU collective traffic flowing laterally; dominates AI training bandwidth requirements | Size the fabric for east-west, not north-south
Fat-Tree | Topology | Multi-tier tree where uplinks = downlinks at each tier; provides full bisection bandwidth | Non-blocking = no oversubscription = no congestion penalty
Rail-Optimized | Topology | Each GPU NIC index N connects to Rail Switch N across all servers in the cluster | GPU 0 of every server → Rail Switch 0; AllReduce stays within one switch
Multi-Rail | Topology | Multiple NICs per GPU server; NCCL uses all rails simultaneously for aggregate bandwidth | DGX H100 ships with 8 × ConnectX-7 NICs = 8 rails
Scale-Up (NVLink) | Topology | NVLink 4.0 connects GPUs within one node at 900 GB/s bidirectional; NVSwitch is the crossbar | No OS or CPU involvement; hardware switch fabric
Scale-Out (IB/Eth) | Topology | ConnectX NICs provide RDMA over InfiniBand or RoCE between nodes in a cluster | NCCL orchestrates; each GPU NIC connects to a rail switch
InfiniBand NDR | IB vs Eth | Next Data Rate — 400 Gb/s per port; current generation for DGX H100 clusters | HDR = 200G, NDR = 400G, XDR = 800G; ConnectX-7 = NDR
RoCE v2 | IB vs Eth | RDMA over Converged Ethernet — routable at Layer 3; requires PFC + ECN for lossless operation | Spectrum-X uses Advanced Congestion Control (ACC) + RoCE v2
PFC | IB vs Eth | Priority Flow Control — pauses a specific traffic class at Layer 2 when buffers approach overflow | Reactive; creates back-pressure; needed alongside ECN
ECN / DCQCN | IB vs Eth | Explicit Congestion Notification — marks packets before loss; DCQCN is end-to-end rate control | Proactive; works with PFC for a complete lossless solution
SHARP | IB vs Eth | In-network AllReduce inside IB switch ASICs; hierarchical aggregation cuts AllReduce latency ≤50% | Unique to InfiniBand; programmed via UFM aggregation trees
Subnet Manager | IB vs Eth | Controls the IB fabric — assigns LIDs, computes routing, detects link failures; OpenSM or NVIDIA UFM | Every IB subnet must have exactly one active SM
UFM | IB vs Eth | Unified Fabric Manager — enterprise SM + telemetry + WJH (What Just Happened) diagnostics | Goes beyond OpenSM; provides real-time fabric visibility and adaptive routing
GPUDirect Storage | Storage | Direct DMA path from NVMe/SAN to GPU HBM; eliminates CPU and system RAM from the copy path | Speeds checkpoint restore; requires a compatible NIC + cuFile API
NVMe-oF | Storage | NVMe over Fabrics — extends the NVMe protocol over RDMA or FC; <20µs latency for remote NVMe | Transport options: RDMA (RoCE/IB), FC-NVMe, NVMe/TCP
Lustre | Storage | POSIX-compliant parallel file system; stripes files across OSTs; widely used in HPC and AI clusters | High aggregate bandwidth via striping; common in large training clusters
GPFS / Spectrum Scale | Storage | IBM parallel clustered file system; global namespace, tiering, and AFM caching support | Enterprise-grade; supports both HPC and AI workloads
Checkpoint Bandwidth | Storage | Storage bandwidth for saving model state; a 70B-param model in BF16 = ~140 GB of weights; all nodes write simultaneously | Plan for N nodes × per-node write rate as the burst storage requirement
Real-World Examples
How AI fabric design concepts apply in production GPU cluster deployments.
Pillar 1 · AI Fabric Foundations

8-Node DGX H100 Rail-Optimized InfiniBand Cluster

A compact but production-grade GPU cluster showing rail-optimized topology, SHARP, and UFM working together.

  • 8 × DGX H100 servers, each with 8 × ConnectX-7 NICs = 64 NIC connections total
  • 8 × NDR 400G InfiniBand rail switches — one per GPU rank across all servers
  • GPU 0 from each of the 8 servers connects exclusively to Rail Switch 0
  • AllReduce on GPU rank 0 never crosses to another rail switch — single-hop completion
  • SHARP aggregation trees programmed on all 8 rail switches via UFM
  • UFM manages subnet, LID assignment, adaptive routing, and WJH diagnostics
Result: 3.2 Tb/s of NDR injection bandwidth per node (25.6 Tb/s across the 8-node cluster) with SHARP-accelerated AllReduce, delivering near-theoretical GPU utilization for distributed training.
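The bandwidth figure follows directly from the NIC counts; a quick recomputation:

```python
# Recompute the example's fabric bandwidth from its NIC counts (NDR 400 Gb/s).
nodes = 8
nics_per_node = 8          # ConnectX-7, one per rail
ndr_gbps = 400

per_node_tbps = nics_per_node * ndr_gbps / 1000
cluster_tbps = nodes * per_node_tbps
print(f"per node: {per_node_tbps:.1f} Tb/s, cluster total: {cluster_tbps:.1f} Tb/s")
# -> per node: 3.2 Tb/s, cluster total: 25.6 Tb/s
```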
Pillar 2 · Network Topologies

Spectrum-X Ethernet AI Cloud (Multi-Tenant)

RoCE v2 deployment for cloud providers offering GPU compute to multiple tenants simultaneously.

  • Spectrum-4 switches with Advanced Congestion Control (ACC) enabled
  • RoCE v2 with PFC on priority class 3 + ECN marking at 30% buffer fill
  • DCQCN end-to-end rate limiting to prevent incast collapse during AllReduce
  • ConnectX-7 NICs in each GPU server with full hardware RoCE offload
  • NCCL configured for RoCE transport, 8 rails per server, multi-rail enabled
  • VXLAN-based tenant isolation — each tenant gets a dedicated RDMA domain
Result: Near-InfiniBand collective performance at Ethernet manageability cost — enabling scalable multi-tenant GPU cloud offerings.
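Below is a hedged sketch of how a PyTorch job might point NCCL at the RoCE rails in a deployment like this. Device and interface names, the GID index, and the launcher setup are site-specific assumptions; consult the NCCL and Spectrum-X documentation for validated settings.

```python
# Hedged sketch: pointing NCCL at multiple RoCE rails from a PyTorch job.
# Device names, GID index, and interface names are site-specific assumptions.

import os
import torch
import torch.distributed as dist

# Use all eight ConnectX rails (hypothetical device names mlx5_0..mlx5_7).
os.environ["NCCL_IB_HCA"] = ",".join(f"mlx5_{i}" for i in range(8))
os.environ["NCCL_IB_GID_INDEX"] = "3"        # commonly the RoCE v2 GID on ConnectX NICs
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"    # out-of-band bootstrap interface (assumed name)
os.environ["NCCL_DEBUG"] = "INFO"            # print the transport NCCL actually selects

# Rank, world size, and master address normally come from the launcher (e.g., torchrun).
dist.init_process_group(backend="nccl")

x = torch.ones(1024, device="cuda")
dist.all_reduce(x)                           # AllReduce rides the RoCE rails via RDMA
```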
Pillar 3 · InfiniBand vs Ethernet

3-Tier Fat-Tree Petascale Cluster (1024+ GPUs)

XDR InfiniBand fat-tree for a large-scale LLM training cluster requiring full bisection bandwidth.

  • 128 × DGX H100 nodes × 8 rails = 1024 GPU endpoints
  • 3-tier fat-tree: leaf → aggregation → core, all XDR 800G switches
  • Full bisection bandwidth maintained — uplinks = downlinks at every tier
  • SHARP aggregation trees programmed from leaf through core switches via UFM
  • UFM adaptive routing dynamically balances traffic around link failures
  • Separate 200G HDR storage network to dedicated Lustre cluster
Result: Linear scaling of AllReduce bandwidth as GPUs are added, with SHARP cutting effective latency by up to 50% across the full 1024-GPU topology.
Pillar 4 · Storage Networking for AI

Storage Sizing for 70B LLM Training Checkpoint

Planning storage network bandwidth for concurrent checkpoint writes from a large GPU cluster.

  • 70B parameter model in BF16 = ~140 GB of weights alone; with optimizer state ≈ 420 GB total
  • 512-GPU cluster (64 nodes) checkpointing every 10 minutes during training
  • All 64 nodes write simultaneously → peak burst: 64 × ~4.5 GB/s = ~288 GB/s
  • Lustre file system with 32 SSD-backed OSTs provides ~320 GB/s aggregate write
  • GPUDirect Storage bypasses CPU during the write path from GPU HBM to Lustre
  • NVMe-oF provides secondary storage layer at <20µs latency for fast restart
Result: GPUDirect Storage + dedicated storage network eliminates the CPU bottleneck, enabling 420 GB checkpoint saves in under 2 minutes even at petascale.
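Recomputing the sizing from the bullets above. This covers raw transfer time only; real checkpoints add serialization, metadata, and straggler overhead, which is why the 2-minute budget leaves generous headroom:

```python
# Rough checkpoint-sizing arithmetic for the example above.
# Raw transfer time only; serialization and metadata overhead are ignored.

params = 70e9
bytes_per_param_bf16 = 2
weights_gb = params * bytes_per_param_bf16 / 1e9     # ~140 GB of weights
checkpoint_gb = 3 * weights_gb                       # ~420 GB with optimizer state (example's ratio)

nodes = 64
per_node_write_gbs = 4.5
burst_gbs = nodes * per_node_write_gbs               # ~288 GB/s aggregate burst demand

lustre_aggregate_gbs = 320                           # 32 SSD-backed OSTs
raw_transfer_s = checkpoint_gb / min(burst_gbs, lustre_aggregate_gbs)

print(f"weights ≈ {weights_gb:.0f} GB, checkpoint ≈ {checkpoint_gb:.0f} GB")
print(f"burst demand ≈ {burst_gbs:.0f} GB/s vs Lustre ≈ {lustre_aggregate_gbs} GB/s")
print(f"raw transfer ≈ {raw_transfer_s:.1f} s, well inside the 2-minute budget")
```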
Memory Hooks — Flip Cards
Quick question-and-answer cards for last-minute exam review.
AI Fabric Foundations

AI Factory vs AI Data Center

What is the key distinction?

AI Factory = purpose-built facility where power, cooling, fabric, and storage are all co-designed end-to-end for AI training at scale. AI Data Center = general facility hosting AI among other workloads. AI Factory maximizes GPU utilization through integrated, holistic design.

Network Topologies

Rail-Optimized Topology Rule

How does GPU NIC assignment work?

GPU N from every server → Rail Switch N. GPU 0 of all servers share Rail Switch 0. AllReduce for GPU rank 0 never crosses to another switch. Benefit: minimizes hop count and latency for collective operations across the entire cluster.

Network Topologies

Fat-Tree Bisection Bandwidth

What makes it non-blocking?

Non-blocking condition: uplinks = downlinks at every switch tier. Full bisection bandwidth = cutting the network in half, each half retains 100% of edge bandwidth for cross-half communication. No oversubscription = no congestion penalty during AllReduce.

InfiniBand vs Ethernet

RDMA & RoCE v2 Lossless

What two mechanisms are required?

RDMA = NIC transfers directly to remote GPU memory, no CPU. RoCE v2 = RDMA over Ethernet at L3. Lossless requires: PFC (pause per priority class, reactive) + ECN (mark before drop, proactive). Without both, retransmissions collapse RDMA throughput.

InfiniBand vs Ethernet

InfiniBand Generations

EDR / HDR / NDR / XDR speeds?

EDR → 100 Gb/s (ConnectX-4)
HDR → 200 Gb/s (ConnectX-6)
NDR → 400 Gb/s (ConnectX-7)
XDR → 800 Gb/s (ConnectX-8)
Memory: Every Hundred's Nicely Doubled eXponentially

InfiniBand vs Ethernet

SHARP In-Network AllReduce

Where does the computation happen?

SHARP = Scalable Hierarchical Aggregation and Reduction Protocol. Runs inside InfiniBand switch ASICs — not on host CPUs. Reduces AllReduce time by up to 50%. Unique to InfiniBand. Programmed via UFM aggregation trees. Not available on standard Ethernet.

Storage Networking

GPUDirect Storage (GDS)

What does it eliminate from the path?

GDS creates a direct DMA path: Storage → NIC → GPU HBM, bypassing both CPU and system RAM. Result: dramatically faster checkpoint saves and dataset loads. Requires: compatible ConnectX NIC, cuFile API, CUDA 11.4+. Pairs with NVMe-oF for remote storage access.

AI Fabric Foundations

Scale-Up vs Scale-Out

NVLink vs InfiniBand — when does each apply?

Scale-Up: NVLink/NVSwitch within one node. 900 GB/s bidir. No CPU/OS. Hardware crossbar. Scale-Out: IB or Ethernet between nodes. RDMA via ConnectX. NCCL manages. Together: NVLink handles intra-node, IB/Ethernet handles inter-node collectives.

NCP-AIN Exam Series · 5 Topics

Ready to Master the Full NCP-AIN Exam?

Continue with InfiniBand Configuration, Spectrum-X Ethernet, BlueField DPUs, and AI Cluster Orchestration — all five topics with interactive quizzes and decision tools.