🖥️ NCP-AII  ·  Topic 2  ·  AI Server Systems

AI Server Systems & Platform Design

DGX/HGX/MGX platforms, GPU form factors (SXM vs PCIe), PCIe Gen 5, ConnectX-7 networking, GPUDirect Storage, and DGX POD scale-out. Every number and spec the NCP-AII exam tests.

8 H100s per DGX · 32 PFLOPS FP8 · 128 GB/s PCIe 5 · 10.2 kW DGX H100

🗺️ The AI Server Platform Landscape

NVIDIA offers a layered ecosystem of AI server platforms — from complete turnkey systems to OEM reference designs. Understanding which platform maps to which use case, and the specs that differentiate them, is central to the NCP-AII exam.

Platform | What It Is | Target | GPU Config
DGX | Complete NVIDIA turnkey AI server (CPU + GPU + storage + networking) | AI training, research, enterprise | 8× SXM GPUs + 4× NVSwitch
HGX | OEM GPU baseboard (GPUs + NVSwitch, no CPU/storage) | Cloud providers, OEM servers | 4 or 8× SXM GPUs + NVSwitch
MGX | NVIDIA modular GPU server reference architecture | OEM edge & inference servers | 1–8× PCIe or SXM GPUs
OVX | Omniverse/simulation server reference design | Industrial AI, digital twins | RTX / L40S GPUs
GB200 NVL72 | 72× B200 GPUs + 36 Grace CPUs in one liquid-cooled rack | Hyperscale AI training | 72× B200 via NVLink 5
  • 🖥️ DGX Platforms — H100 / H200 / B200: full specs, NVSwitch count, FP8 PFLOPS
  • 🔌 GPU Form Factors — SXM vs PCIe: TDP, NVLink availability, bandwidth
  • 🌐 Networking — ConnectX-7, InfiniBand NDR, GPUDirect RDMA
  • 💾 Storage — NVMe, GPUDirect Storage, Lustre/WEKA shared FS
  • 📡 Scale-Out — DGX POD, DGX SuperPOD, 256-GPU configurations

📊 At-a-Glance: DGX Generation Comparison

System | GPU | Total HBM | FP8 AI | NVLink/GPU | Power
DGX A100 | 8× A100 SXM | 320 GB | — (no FP8; ~5 PFLOPS FP16) | 600 GB/s | 6.5 kW
DGX H100 | 8× H100 SXM5 | 640 GB | 32 PFLOPS | 900 GB/s | 10.2 kW
DGX H200 | 8× H200 SXM | 1,128 GB | ~32 PFLOPS* | 900 GB/s | 10.2 kW
DGX B200 | 8× B200 SXM | 1,536 GB | ~72 PFLOPS | 1.8 TB/s | ~14.3 kW

* H200 has same GH100 compute die as H100; improvement is HBM3e capacity/bandwidth (4.8 TB/s vs 3.35 TB/s), not peak TFLOPS.

🖥️ DGX H100 — The Exam Reference System

The DGX H100 is the most exam-tested platform. Memorize these specs completely — they appear directly in NCP-AII questions about system capacity, power, and interconnect design.

DGX H100
Hopper · HBM3 · 8U
GPUs: 8× H100 SXM5 (80 GB HBM3)
Total GPU Memory: 640 GB HBM3
HBM Bandwidth: 3.35 TB/s per GPU
NVSwitch: 4× NVSwitch 3rd Gen
NVLink BW/GPU: 900 GB/s bidirectional
FP8 AI Performance: 32 PFLOPS
CPUs: 2× Xeon Platinum 8480C · 2 TB DDR5
Local Storage: 30 TB NVMe
Networking: 8× ConnectX-7 (1 per GPU)
Peak Power: 10.2 kW
DGX H200
Hopper+ · HBM3e Upgrade · 8U
GPUs: 8× H200 SXM (141 GB HBM3e)
Total GPU Memory: 1,128 GB HBM3e
HBM Bandwidth: 4.8 TB/s per GPU
NVSwitch: 4× NVSwitch 3rd Gen
FP8 AI Performance: ~32 PFLOPS (same GH100 die)
Best For: Large model inference (bigger KV cache)
Peak Power: 10.2 kW
DGX B200
Blackwell · 2025 · NVLink 5
GPUs: 8× B200 SXM (192 GB HBM3e)
Total GPU Memory: 1,536 GB HBM3e
HBM Bandwidth: 8.0 TB/s per GPU
NVSwitch: 4× NVSwitch 4th Gen
NVLink BW/GPU: 1.8 TB/s bidirectional
FP8 AI Performance: ~72 PFLOPS
FP4 AI Performance: ~144 PFLOPS
Peak Power: ~14.3 kW
GB200 NVL72
Blackwell · Full-Rack Liquid-Cooled
GPUs: 72× B200 GPUs
CPUs: 36× Grace CPUs (ARM)
Total GPU Memory: 13.8 TB HBM3e
NVLink Domain: All 72 GPUs in 1 NVLink domain
FP4 AI Performance: ~1.44 ExaFLOPS
Cooling: Liquid-cooled (direct liquid)
Rack Power: ~120 kW per rack

🏗️ HGX — OEM GPU Baseboard

HGX is NVIDIA's GPU subsystem for OEM partners. It includes the GPU tray (H100/H200/B200 SXM GPUs + NVSwitch chips) without the CPU, DRAM, storage, or networking. OEM partners like Dell, HPE, Lenovo, Supermicro, and Inspur integrate HGX into their own server chassis, adding their choice of CPU platform, networking, and cooling.

🔑 DGX vs HGX vs MGX — Quick Rule: DGX = NVIDIA builds everything. HGX = NVIDIA provides the GPU sled; the OEM builds the rest. MGX = NVIDIA provides a modular reference spec; the OEM builds any compliant server (1U–8U, air or liquid, PCIe or SXM).

HGX H100 Configurations

  • HGX H100 8-GPU: 8× H100 SXM5 + 4× NVSwitch 3rd gen — identical GPU/interconnect to DGX H100, different chassis/CPU/networking
  • HGX H100 4-GPU: 4× H100 SXM5 + 2× NVSwitch — used in half-width 2U form factors for cloud providers with denser rack deployments
  • OEM adds: CPUs (typically 2× Xeon Sapphire Rapids or AMD EPYC Genoa), system DRAM, NVMe SSDs, and ConnectX NICs
📌 Exam tip: The GPU and NVSwitch hardware in HGX H100 8-GPU is identical to DGX H100. The performance difference (if any) comes from the OEM's CPU, cooling, and network choices — not the GPU/NVSwitch subsystem.

🔌 SXM vs PCIe: GPU Form Factor Comparison

NVIDIA offers H100 (and other GPUs) in two physical form factors. The choice between them is one of the most exam-critical design decisions in AI server architecture. SXM is for maximum performance; PCIe is for flexibility and standard server integration.

⚡ SXM (Socket Module)
TDP: 700 W
NVLink 4 Bandwidth: 900 GB/s
NVSwitch Required: Yes (4 in DGX H100)
PCIe to Host: PCIe Gen 5 x16
Memory Bandwidth: 3.35 TB/s (HBM3)
Multi-GPU Topology: Full NVLink mesh
Typical System: DGX H100, HGX H100
Cooling Required: Active (high-volume airflow)
🔗 PCIe (Add-in Card)
TDP: 350 W
NVLink Bandwidth: None (or NVLink Bridge: limited)
NVSwitch Required: No — standard PCIe slot
PCIe to Host: PCIe Gen 5 x16
Memory Bandwidth: 2.0 TB/s (HBM2e, same GH100 die)
Multi-GPU Topology: PCIe switch fabric
Typical System: Standard 2U/4U OEM servers
Cooling Required: Standard 1U/2U airflow
⚠️ Critical Distinction — Same Die, Different System Impact: H100 SXM5 and H100 PCIe use the same GH100 silicon. The SXM form factor gets NVLink connectivity (900 GB/s GPU-to-GPU), HBM3, and a higher TDP (700 W vs 350 W), enabling sustained full Tensor Core utilization. The PCIe variant is capped at PCIe bandwidth (128 GB/s) for GPU-to-GPU communication and at a tighter thermal envelope.
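
One practical way to see the difference on a live system is to query NVLink state through NVML: an SXM part behind NVSwitch reports active links, while a bare PCIe card with no bridge reports none. A minimal illustrative sketch — not exam material — with error handling omitted:

// Sketch: count active NVLink links via NVML (error handling omitted)
#include <nvml.h>
#include <stdio.h>

int main(void) {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);
    unsigned int active = 0;
    for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
        nvmlEnableState_t state;
        if (nvmlDeviceGetNvLinkState(dev, link, &state) == NVML_SUCCESS &&
            state == NVML_FEATURE_ENABLED)
            ++active;                                   // each enabled link counts toward the NVLink total
    }
    printf("GPU 0: %u active NVLink links\n", active);  // 18 on H100 SXM5; 0 on a bridgeless PCIe card
    nvmlShutdown();
    return 0;
}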

📊 PCIe Gen 5 — The CPU-GPU Data Path

PCIe (Peripheral Component Interconnect Express) is the standard interface connecting GPUs to the CPU and host system. Understanding PCIe bandwidth — and its bottleneck effects — is essential for AI system design.

  • NVLink 4 (H100, GPU↔GPU): 900 GB/s
  • HBM3 memory bandwidth (H100): 3,350 GB/s — off the chart, ~26× PCIe 5
  • PCIe 5.0 x16 (GPU↔CPU): 128 GB/s
  • PCIe 4.0 x16 (previous gen): 64 GB/s
  • InfiniBand NDR (per port, host): 50 GB/s
🔑 The PCIe Bottleneck — Why It Matters: H100 HBM3 bandwidth = 3,350 GB/s. PCIe 5.0 x16 = 128 GB/s. That's a 26× gap — every byte that moves between CPU and GPU over PCIe travels at roughly 1/26th the speed of the GPU's own memory. AI training must therefore minimize host–GPU data movement: use pinned memory, async prefetch, and DALI for data loading.
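
To make the mitigation concrete, here is a minimal sketch of staging input batches in pinned (page-locked) host memory and issuing the PCIe copy asynchronously so it overlaps with compute on another stream. The 256 MB buffer size is illustrative and error checking is omitted:

// Sketch: pinned host memory + async H2D copy to hide PCIe latency (no error checks)
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 256UL << 20;             // 256 MB batch — illustrative size
    float *h_batch, *d_batch;
    cudaMallocHost((void**)&h_batch, bytes);      // pinned (page-locked): enables true async DMA over PCIe
    cudaMalloc((void**)&d_batch, bytes);

    cudaStream_t copy_stream;
    cudaStreamCreate(&copy_stream);

    // Prefetch the next batch over PCIe while kernels run on other streams
    cudaMemcpyAsync(d_batch, h_batch, bytes, cudaMemcpyHostToDevice, copy_stream);
    // ... launch training kernels on a separate stream here ...
    cudaStreamSynchronize(copy_stream);           // block only when the batch is actually needed

    cudaFree(d_batch);
    cudaFreeHost(h_batch);
    return 0;
}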

PCIe Version Bandwidth Table

PCIe Gen | GT/s per lane | x8 BW (BiDir) | x16 BW (BiDir) | First GPU Gen
PCIe 3.0 | 8 GT/s | 16 GB/s | 32 GB/s | Pascal (P100)
PCIe 4.0 | 16 GT/s | 32 GB/s | 64 GB/s | Ampere (A100)
PCIe 5.0 | 32 GT/s | 64 GB/s | 128 GB/s | Hopper (H100)
PCIe 6.0 | 64 GT/s | 128 GB/s | 256 GB/s | Future / Blackwell+
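
Where the x16 numbers come from (a back-of-envelope derivation that ignores the small encoding and protocol overhead): 32 GT/s per lane × 16 lanes ÷ 8 bits per byte ≈ 64 GB/s in each direction, so ~128 GB/s bidirectional for Gen 5. Each generation doubles the transfer rate, which halves these figures for Gen 4 and doubles them for Gen 6.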

NVLink Bridge (PCIe GPUs)

For PCIe H100 deployments needing some GPU-to-GPU bandwidth above PCIe, NVIDIA offers the NVLink Bridge — a physical connector linking 2 or 4 PCIe GPUs via NVLink. However, the NVLink Bridge provides far less bandwidth than a full NVSwitch topology (limited to 2–4 GPU pairs, not a full mesh), and is not a substitute for the SXM/NVSwitch architecture in large training workloads.

🌐 AI Server Network Architecture

AI servers require multiple distinct network fabrics for different traffic types. Mixing them onto a single network causes congestion and performance collapse. The NCP-AII exam tests your understanding of which network handles which traffic, and the hardware involved.

  • 🔥 Compute Fabric (AI Data Plane) — InfiniBand NDR (400 Gb/s) or Spectrum-X Ethernet (400 GbE); carries AllReduce, gradient sync, and NCCL traffic between GPUs across nodes (a minimal NCCL sketch follows this list). 1 ConnectX-7 NIC per GPU.
  • 💾 Storage Fabric — 100–400 GbE to a shared filesystem (Lustre, WEKA, GPFS); carries training data reads and checkpoint writes. Often separate from the compute fabric.
  • ⚙️ Management / BMC Network — 1 GbE or 10 GbE out-of-band management; IPMI, BMC, iDRAC, iLO access. Used for server lifecycle management, not AI traffic.
  • 🔗 NVLink (Intra-Node) — Not a network: a direct GPU-to-GPU fabric within the DGX node via NVSwitch. 900 GB/s per H100 GPU, full mesh, handled by NVSwitch hardware automatically.
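
For context on what that compute-fabric traffic actually is: collective libraries such as NCCL issue the AllReduce. The sketch below is illustrative only — a single process, assuming up to 8 visible GPUs as in a DGX H100, with error handling omitted. Intra-node the transfers ride NVLink/NVSwitch; the same call generates ConnectX-7/InfiniBand traffic once ranks span nodes.

// Sketch: NCCL AllReduce across all visible GPUs in one process (no error checking)
#include <nccl.h>
#include <cuda_runtime.h>

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 8) ndev = 8;                        // cap at one DGX H100's worth of GPUs
    ncclComm_t comms[8];
    ncclCommInitAll(comms, ndev, NULL);            // one communicator per local GPU

    float *grad[8];
    cudaStream_t streams[8];
    const size_t count = 1 << 20;                  // 1M floats standing in for gradients
    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaMalloc((void**)&grad[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    ncclGroupStart();                              // sum the gradient buffers across all GPUs
    for (int i = 0; i < ndev; ++i)
        ncclAllReduce(grad[i], grad[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) { cudaSetDevice(i); cudaStreamSynchronize(streams[i]); }
    for (int i = 0; i < ndev; ++i) ncclCommDestroy(comms[i]);
    return 0;
}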

🔧 ConnectX-7 — The AI Server NIC

ConnectX-7 is NVIDIA's 7th-generation network adapter, standard in DGX H100 (one per GPU). It supports both InfiniBand NDR (400 Gb/s) and 400GbE, making it dual-mode capable. Each ConnectX-7 directly services one H100 GPU, ensuring the GPU's network bandwidth is not shared with other GPUs.

Feature | ConnectX-7 Spec
Max port speed | 400 Gb/s (NDR InfiniBand or 400GbE)
Protocol support | InfiniBand NDR, 100/200/400GbE, RoCEv2
GPUDirect RDMA | Yes — NIC DMA direct to/from GPU HBM
PCIe interface | PCIe Gen 5 x16
RDMA latency | <600 ns (InfiniBand)
Count in DGX H100 | 8× (1 per GPU) + 1× management
Offloads | RDMA, GPUDirect, RoCEv2, SHARP, TCP offload

GPUDirect RDMA

GPUDirect RDMA (Remote Direct Memory Access) allows a ConnectX-7 NIC to transfer data directly between a remote server's GPU memory and the local GPU's HBM — completely bypassing the CPU and system DRAM. This eliminates two PCIe crossings per transfer and can double effective network bandwidth for GPU-to-GPU operations across nodes.

🔑 GPUDirect RDMA Data Path — Without: Remote GPU → Remote NIC → Network → Local NIC → CPU (PCIe) → System DRAM → GPU (PCIe). With GPUDirect RDMA: Remote GPU → Remote NIC → Network → Local NIC → Local GPU HBM directly. Eliminates 2 PCIe hops + a CPU copy.
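
What makes this path possible is that, with the nvidia-peermem module loaded, the RDMA verbs stack can register a buffer that lives in GPU HBM just as it would host memory; the NIC then DMAs to or from it directly. A minimal sketch of only that registration step — the protection domain pd and buffer length are assumed to exist already, and this is not a complete RDMA program:

// Sketch: registering GPU memory for RDMA — the enabling step of GPUDirect RDMA
#include <infiniband/verbs.h>
#include <cuda_runtime.h>

struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t len) {
    void *gpu_buf = NULL;
    cudaMalloc(&gpu_buf, len);                    // buffer lives in GPU HBM, not host DRAM
    // With nvidia-peermem loaded, the verbs stack pins and DMA-maps these HBM pages,
    // so the ConnectX NIC reads/writes them directly — no CPU bounce buffer.
    return ibv_reg_mr(pd, gpu_buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}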

InfiniBand NDR vs Spectrum-X for AI

Feature | InfiniBand NDR | Spectrum-X (Ethernet)
Port speed | 400 Gb/s | 400 GbE
Latency | ~600 ns (lowest) | ~2–5 µs
Protocol | IB native (lossless) | RoCEv2 over UDP/IP
Congestion control | Hardware-native ECN + credit-based | DCQCN (PFC + ECN)
SHARP in-network compute | Yes (AllReduce in switch ASIC) | Yes (Spectrum-4 ASIC)
Use case | Tightly coupled HPC/AI training | Cloud-native, Ethernet-standard AI
NIC | ConnectX-7 (dual-mode) | ConnectX-7 (dual-mode)
📌 Exam tip: ConnectX-7 supports both InfiniBand NDR and 400GbE in the same hardware — the mode is selected by software. DGX H100 ships with ConnectX-7 configured to the customer's choice of fabric.

💾 Storage for AI Workloads

AI training has extreme storage demands: large dataset reads during training, frequent checkpointing of multi-hundred-GB model weights, and burst I/O patterns. The storage architecture must match these demands — a slow filesystem becomes the training bottleneck even with 8 fast H100s.

  • Tier 1 — Fastest: GPU HBM3 — 3,350 GB/s. Active model weights, activations, gradients. 80 GB per H100.
  • Tier 2 — Local: NVMe SSD — ~50 GB/s. DGX H100: 30 TB (8× 3.84 TB NVMe). Scratch and checkpoint staging.
  • Tier 3 — Shared: Lustre / WEKA — 100–400 GB/s. Cluster-wide parallel filesystem. Training datasets, shared checkpoints.
  • Tier 4 — Cold: Object storage — 1–10 GB/s. S3-compatible (AWS S3, WEKA S3). Datasets, model archives, long-term storage.

GPUDirect Storage (GDS)

GPUDirect Storage enables a DMA engine to transfer data directly between NVMe SSDs and GPU HBM memory, bypassing the CPU and system DRAM entirely. This is critical for checkpointing large models (e.g., saving a 140 GB LLaMA 70B checkpoint) without saturating PCIe with CPU-routed copies.

🔑 GDS Data Path Comparison — Without GDS: NVMe → PCIe → CPU → system DRAM → PCIe → GPU HBM (2× PCIe crossings, CPU involved). With GDS: NVMe → PCIe DMA → GPU HBM directly (1× PCIe crossing, CPU-free). Uses the cuFile API and requires the nvidia-fs kernel driver.
// GPUDirect Storage read example (cuFile API)
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>

int fd = open("/path/to/checkpoint.bin", O_RDONLY | O_DIRECT);  // GDS requires O_DIRECT
CUfileDescr_t cf_desc = {};
cf_desc.handle.fd = fd;
cf_desc.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
CUfileHandle_t cf_handle;
cuFileDriverOpen();                                   // initialize the nvidia-fs driver context
cuFileHandleRegister(&cf_handle, &cf_desc);           // bind the file descriptor to cuFile
void *gpu_buffer_ptr; cudaMalloc(&gpu_buffer_ptr, read_size);
// Direct NVMe → GPU DMA — no CPU data copy
cuFileRead(cf_handle, gpu_buffer_ptr, read_size, file_offset, 0);

Shared Filesystems for AI Clusters

Filesystem | Type | Bandwidth | Best For | Notes
Lustre | Open-source parallel FS | 100s of GB/s | HPC + AI training datasets | Standard in DGX SuperPOD reference
WEKA | Flash-native parallel FS | Up to 1 TB/s | All-flash AI clusters | Native S3 + POSIX, NVIDIA validated
IBM Spectrum Scale (GPFS) | Enterprise parallel FS | 100s of GB/s | Enterprise AI, finance | Mature, complex management
BeeGFS | Open parallel FS | 10–100 GB/s | Academic clusters | Easy setup, lower performance ceiling
NFS v4 | Network FS | 1–10 GB/s | Not recommended for training | Single metadata server bottleneck

📡 Scale-Out: DGX POD → SuperPOD

Single DGX nodes are the building block. NVIDIA provides reference architectures — DGX POD and DGX SuperPOD — that define how to interconnect multiple DGX nodes into validated AI clusters with known-good performance, networking, and storage configurations.

1. DGX H100 Node — 8 GPUs · 640 GB HBM3 · 32 PFLOPS FP8 · 900 GB/s NVLink mesh · 8× ConnectX-7
2. DGX BasePOD (8 nodes) — 64 GPUs · 5.12 TB HBM3 · 256 PFLOPS FP8 · Quantum-2 NDR IB leaf switches · Shared Lustre/WEKA storage
3. DGX SuperPOD (32 nodes) — 256 GPUs · 20.48 TB HBM3 · 1,024 PFLOPS FP8 · Non-blocking fat-tree IB NDR fabric · 2-tier leaf–spine switching
4. Multi-SuperPOD / AI Datacenter — Thousands of GPUs · Multiple SuperPODs connected via IB core switches · Hierarchical storage (NVMe → Lustre → Object) · DCGM + UFM observability
🔑 DGX SuperPOD Key Numbers (H100): 32 DGX H100 nodes = 256 H100 GPUs = 1,024 PFLOPS FP8 = 20.48 TB total HBM3. Network: Quantum-2 NDR InfiniBand non-blocking fat-tree. Storage: WEKA or Lustre at 1+ TB/s aggregate. Validated reference design — no guesswork on integration.

GB200 NVL72 — Rack-Scale Computing

The GB200 NVL72 takes scale-up further: 72 B200 GPUs and 36 Grace CPUs in a single liquid-cooled rack, all connected in one NVLink 5 domain. Every GPU can communicate with every other GPU at 1.8 TB/s without leaving the rack — eliminating the inter-node InfiniBand bottleneck for models that fit within 72 GPUs.

  • Total GPU memory: 13.8 TB HBM3e — holds multi-trillion-parameter models entirely in HBM, with room left for KV cache
  • FP4 performance: ~1.44 ExaFLOPS — enables real-time inference of trillion-parameter models
  • Rack power: ~120 kW — requires facility-level direct liquid cooling infrastructure
  • CPU: Grace (ARM-based) — tight CPU-GPU memory bandwidth via NVLink-C2C (900 GB/s coherent interface)

🧪 Practice Quiz — AI Server Systems

Question 1 of 10
How many H100 SXM5 GPUs are housed in a single DGX H100 system?
A) 4
B) 6
C) 8
D) 16
Question 2 of 10
What is the TDP (thermal design power) of the H100 GPU in SXM form factor?
A) 350 W
B) 500 W
C) 700 W
D) 1,000 W
Question 3 of 10
What is the bidirectional bandwidth of PCIe Gen 5 x16 — the interface connecting H100 to the host CPU?
A) 64 GB/s
B) 96 GB/s
C) 128 GB/s
D) 256 GB/s
Question 4 of 10
Which NIC is standard in the DGX H100 (one per GPU) for compute network connectivity?
A) ConnectX-5
B) ConnectX-6 Dx
C) ConnectX-7
D) BlueField-3 DPU
Question 5 of 10
What does GPUDirect Storage (GDS) enable?
A) Faster GPU-to-GPU NVLink transfers within a node
B) Direct DMA between NVMe storage and GPU HBM, bypassing the CPU
C) Encrypted GPU memory for multi-tenant isolation
D) Automatic data tiering from HBM to NVMe when GPU memory is full
Question 6 of 10
What is the approximate total FP8 AI performance of a DGX H100?
A) 16 PFLOPS
B) 24 PFLOPS
C) 32 PFLOPS
D) 64 PFLOPS
Question 7 of 10
What is the primary advantage of the SXM GPU form factor over the PCIe form factor for multi-GPU AI training?
A) Lower power consumption (350W vs 700W)
B) Compatible with standard PCIe server slots
C) NVLink connectivity enabling 900 GB/s GPU-to-GPU bandwidth via NVSwitch
D) Support for more CUDA cores per chip
Question 8 of 10
How does HGX differ from DGX?
A) HGX supports more GPUs than DGX
B) HGX is an OEM GPU baseboard (GPU + NVSwitch only); DGX is a complete NVIDIA turnkey system
C) HGX uses PCIe GPUs; DGX uses SXM GPUs exclusively
D) HGX is only for inference; DGX is only for training
Question 9 of 10
How many NVSwitch chips are in a DGX H100 system?
A) 2
B) 4
C) 6
D) 8
Question 10 of 10
For large-scale AI training clusters, which shared filesystem options are commonly used for high-throughput scratch and dataset storage?
A) NFS v4 — widely supported network filesystem
B) SMB/CIFS — standard Windows-compatible shares
C) HDFS — Hadoop distributed filesystem
D) Lustre or WEKA — high-throughput parallel filesystems

🃏 Flashcards

DGX H100
GPU count, total HBM, FP8 performance, and peak power?
8× H100 SXM5 · 640 GB HBM3 · 32 PFLOPS FP8 · 10.2 kW peak power
8 GPUs, 640 GB, 32 PFLOPS
SXM vs PCIe TDP
H100 SXM form factor TDP vs H100 PCIe form factor TDP?
SXM: 700 W (with NVLink, full NVSwitch mesh) · PCIe: 350 W (no NVSwitch, PCIe-only GPU-GPU)
700W SXM · 350W PCIe
PCIe Gen 5
PCIe Gen 5 x16 bidirectional bandwidth vs PCIe Gen 4?
Gen 5 x16: 128 GB/s bidirectional · Gen 4 x16: 64 GB/s · Gen 5 = 2× Gen 4
5 → 128 GB/s; 4 → 64 GB/s
GPUDirect Storage
What path does GDS enable, and what API does it use?
NVMe → PCIe DMA → GPU HBM directly (no CPU). Uses cuFile API + nvidia-fs kernel driver.
NVMe → GPU, cuFile, no CPU copy
DGX H100 NVSwitches
How many NVSwitch chips in DGX H100 and what do they provide?
4× NVSwitch 3rd gen · Full all-to-all GPU mesh · 900 GB/s between any GPU pair · 57.6 TB/s aggregate switching BW
4 NVSwitches = full mesh
HGX vs DGX
What is the key difference between HGX and DGX?
HGX = GPU baseboard only (GPU + NVSwitch). OEM adds CPU, RAM, storage, NICs. DGX = NVIDIA complete turnkey system with everything included.
HGX = GPU sled only; DGX = full system
DGX SuperPOD
How many DGX H100 nodes, total GPUs, and FP8 PFLOPS in a DGX SuperPOD?
32 DGX H100 nodes · 256 H100 GPUs · ~1,024 PFLOPS FP8 · Non-blocking NDR IB fat-tree fabric
32 nodes × 8 GPUs = 256 GPUs
ConnectX-7
ConnectX-7 max speed, protocol support, and count in DGX H100?
400 Gb/s per port · InfiniBand NDR or 400GbE (dual-mode) · 8× in DGX H100 (1 per GPU) + 1 management NIC
CX-7: 400 Gb/s, dual-mode, 8+1 per DGX

🤖 AI Server Systems Advisor


🖥️ DGX System Deep Dive — Key Exam Points

  • DGX H100 = 8× H100 SXM5 + 4× NVSwitch 3G + 2× Xeon Platinum 8480C (56C each) + 2 TB DDR5 + 30 TB NVMe + 8× ConnectX-7 + 10.2 kW power. Every number is exam-critical.
  • DGX H200 has same GH100 compute die as H100 — compute TFLOPS are similar. Key upgrade: HBM3e 141 GB per GPU (vs 80 GB HBM3), and 4.8 TB/s bandwidth (vs 3.35 TB/s). Choose H200 when model size or KV cache exceeds 80 GB per GPU.
  • DGX B200: 8× B200, 192 GB HBM3e each, 1,536 GB total, NVLink 5 at 1.8 TB/s, ~72 PFLOPS FP8. Air-cooled despite the higher ~1,000 W per-GPU TDP; liquid cooling arrives with the rack-scale GB200 NVL72.
  • GB200 NVL72: not a DGX — it's a full-rack design with 72 B200 + 36 Grace CPUs, all connected in one NVLink 5 domain. 13.8 TB HBM3e, ~120 kW rack power, direct liquid cooling required.
  • The 4 NVSwitch chips in DGX H100 create a non-blocking all-to-all GPU mesh. With 8 GPUs × 900 GB/s NVLink = 7.2 TB/s aggregate NVLink bandwidth within the chassis.
  • HGX H100 8-GPU baseboard has identical GPU/NVSwitch hardware to DGX H100. Expect exam questions asking which components are "OEM-configurable" (answer: CPU, RAM, NIC, storage) vs "fixed by NVIDIA" (GPU, NVSwitch).

🔌 PCIe & SXM Selection — Design Guide

  • Choose SXM when: training large models across multiple GPUs (needs NVLink AllReduce at 900 GB/s), running tightly coupled parallel workloads, or deploying in DGX/HGX reference systems. SXM = 700W, NVLink 4.
  • Choose PCIe when: deploying in standard OEM servers, inference workloads with single-GPU jobs, budget-constrained deployments, or server chassis lacking NVLink baseboard. PCIe = 350W, no NVSwitch.
  • PCIe Gen 5 bandwidth: 128 GB/s x16 bidirectional. PCIe Gen 4: 64 GB/s. Gen 5 is 2× Gen 4. H100 introduced PCIe Gen 5. A100 used PCIe Gen 4.
  • The PCIe bottleneck: H100 HBM3 = 3,350 GB/s. PCIe 5 = 128 GB/s. The host-GPU interface is 26× slower than GPU memory. Minimize CPU↔GPU data movement in training pipelines.
  • NVLink Bridge (PCIe): connects 2 or 4 PCIe H100 cards directly via NVLink lanes. Useful for small multi-GPU servers but far less capable than full NVSwitch topology (no full mesh beyond 4 GPUs).
  • Verify the topology with nvidia-smi topo -m: NV# entries (e.g., NV18 on SXM + NVSwitch) indicate NVLink paths, while SYS indicates PCIe traffic crossing the CPU socket. A programmatic check is sketched below.
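
As a programmatic complement, CUDA can report whether two GPUs have a direct peer-to-peer path at all — note this is a sketch only, and it flags P2P capability, not whether the link is NVLink or PCIe:

// Sketch: query and enable CUDA peer-to-peer access between GPU 0 and GPU 1
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("P2P 0->1: %d  1->0: %d\n", can01, can10);
    if (can01) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);         // map GPU 1's memory into GPU 0's address space
    }
    return 0;
}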

🌐 Network Architecture — Design Guide

  • Use 3 separate networks in an AI cluster: (1) Compute fabric for GPU-GPU AllReduce, (2) Storage fabric for dataset reads + checkpoint writes, (3) Management/BMC for out-of-band access. Mixing these causes congestion.
  • ConnectX-7 is dual-mode: InfiniBand NDR (400 Gb/s) or 400GbE with RoCEv2. Same hardware, mode selected at configuration time. DGX H100 ships with 8× ConnectX-7 (1 per GPU).
  • GPUDirect RDMA: remote GPU → local GPU data transfer via NIC DMA, bypassing CPU and system DRAM. Requires nvidia-peermem driver module + ConnectX-7. Verify with lsmod | grep nvidia_peermem.
  • InfiniBand NDR: ~600 ns latency, hardware-native congestion control, SHARP in-network AllReduce. Best for tightly-coupled HPC-style AI training.
  • Spectrum-X (RoCEv2): DCQCN congestion control (PFC + ECN), SHARP AllReduce in Spectrum-4 ASIC, compatible with standard Ethernet tooling. Choose for cloud-native or Ethernet-first environments.
  • Rail-optimized topology: each GPU in a multi-GPU server connects to a different top-of-rack switch "rail", so inter-node traffic from GPU N always goes to the same rail switch — avoiding local switch oversubscription.

💾 Storage Configuration — Guide

  • DGX H100 local NVMe: 30 TB (8× 3.84 TB NVMe SSDs), ~50 GB/s aggregate sequential read. Use for checkpoint staging, temporary scratch during training iterations.
  • GPUDirect Storage: enables NVMe → GPU DMA without CPU. Use cuFile API. Requires nvidia-fs kernel driver + GDS-compatible NVMe. Dramatically reduces checkpoint overhead for large models.
  • Lustre: open-source parallel filesystem, standard in DGX SuperPOD reference. Metadata server (MDS) + object storage servers (OSS). Scales to 100s GB/s with enough OSS nodes.
  • WEKA: all-flash parallel filesystem, native S3 + POSIX. NVIDIA-validated for DGX deployments. Can deliver 1+ TB/s aggregate with all-NVMe backend. Best-in-class for AI training I/O.
  • Checkpoint sizing: LLaMA 70B = ~140 GB in BF16. Full checkpoint write to local NVMe at 50 GB/s = ~3 seconds. Writing to slow NFS at 1 GB/s = ~140 seconds. Storage BW matters enormously for training efficiency.
  • Data pipeline: use NVIDIA DALI (Data Loading Library) for image/video preprocessing directly on GPU. Eliminates CPU bottleneck in vision training pipelines. Combined with GPUDirect Storage for end-to-end CPU-free data loading.

📡 Scale-Out: POD to SuperPOD — Guide

  • DGX BasePOD (8 nodes): 64 GPUs, Quantum-2 NDR IB leaf switches, shared storage. Minimum validated AI cluster configuration. Good starting point for LLM fine-tuning at scale.
  • DGX SuperPOD (32 nodes): 256 H100 GPUs, 1,024 PFLOPS FP8, non-blocking 2-tier fat-tree IB NDR fabric (leaf + spine). The standard reference for large LLM pre-training facilities.
  • SuperPOD storage: WEKA or Lustre configured for 1+ TB/s aggregate throughput. Each DGX node's 8 ConnectX-7 NICs connect to storage and compute fabric simultaneously via different switch planes.
  • GB200 NVL72 is a scale-up alternative to SuperPOD: 72 GPUs in 1 NVLink domain vs 256 GPUs across 32 nodes with InfiniBand. NVLink (1.8 TB/s) >> InfiniBand NDR (400 Gb/s) for inter-GPU bandwidth, but NVL72 is limited to 72 GPUs per domain.
  • Multi-rack clusters: connect SuperPODs via IB core switches (Quantum-2 or Quantum-X800). Full bisection bandwidth non-blocking design is preferred for training; oversubscribed designs (e.g., 4:1 oversubscription) are acceptable for inference.
  • Observability at scale: DCGM monitors GPU health (temperature, power, ECC, SM utilization) across all nodes. UFM (Unified Fabric Manager) monitors InfiniBand fabric health. Both feed into Prometheus + Grafana dashboards.

🧩 Platform Mnemonics

Fact | Mnemonic / Hook
DGX H100 = 8 GPUs, 640 GB, 32 PFLOPS | "8 GPUs × 80 GB = 640 GB. 8 × 3,958 TFLOPS ÷ 1,000 ≈ 32 PFLOPS"
DGX H100 power = 10.2 kW | "8 GPUs × 700 W + CPU/overhead ≈ 10.2 kW — roughly 20 gaming PCs at full load"
PCIe 5 = 128 GB/s | "32 GT/s per lane × 16 lanes ÷ 8 bits ≈ 64 GB/s each way → 128 GB/s bidirectional"
SXM = 700 W; PCIe = 350 W | "SXM = double the power, plus NVLink. PCIe = half the power, PCIe-only GPU-to-GPU"
ConnectX-7 = 400 Gb/s dual-mode | "CX-7 = both roads: InfiniBand or Ethernet, same NIC"
SuperPOD = 32 nodes = 256 GPUs | "32 × 8 = 256 — 32 DGX nodes make one SuperPOD"
GDS = cuFile API | "GPUDirect Storage = cuFile + nvidia-fs kernel module"
NCP-AII · Topic 2 Complete