🖥️ NCP-AII  ·  Topic 2  ·  AI Server Systems

AI Server Systems & Platform Design

DGX/HGX/MGX platforms, GPU form factors (SXM vs PCIe), PCIe Gen 5, ConnectX-7 networking, GPUDirect Storage, and DGX POD scale-out. Every number and spec the NCP-AII exam tests.

8 H100s per DGX · 32 PFLOPS FP8 · 128 GB/s PCIe 5 · 10.2 kW DGX H100

🗺️ The AI Server Platform Landscape

NVIDIA offers a layered ecosystem of AI server platforms — from complete turnkey systems to OEM reference designs. Understanding which platform maps to which use case, and the specs that differentiate them, is central to the NCP-AII exam.

Platform | What It Is | Target | GPU Config
DGX | Complete NVIDIA turnkey AI server (CPU + GPU + storage + networking) | AI training, research, enterprise | 8× SXM GPUs + 4× NVSwitch
HGX | OEM GPU baseboard (GPUs + NVSwitch, no CPU/storage) | Cloud providers, OEM servers | 4 or 8× SXM GPUs + NVSwitch
MGX | NVIDIA modular GPU server reference architecture | OEM edge & inference servers | 1–8× PCIe or SXM GPUs
OVX | Omniverse/simulation server reference design | Industrial AI, digital twins | RTX / L40S GPUs
GB200 NVL72 | 72× B200 GPUs + 36 Grace CPUs in one liquid-cooled rack | Hyperscale AI training | 72× B200 via NVLink 5
  • 🖥️ DGX Platforms — H100 / H200 / B200: full specs, NVSwitch count, FP8 PFLOPS
  • 🔌 GPU Form Factors — SXM vs PCIe: TDP, NVLink availability, bandwidth
  • 🌐 Networking — ConnectX-7, InfiniBand NDR, GPUDirect RDMA
  • 💾 Storage — NVMe, GPUDirect Storage, Lustre/WEKA shared FS
  • 📡 Scale-Out — DGX POD, DGX SuperPOD, 256-GPU configurations

📊 At-a-Glance: DGX Generation Comparison

System | GPU | Total HBM | FP8 AI | NVLink/GPU | Power
DGX A100 | 8× A100 SXM | 320 GB | — (no FP8; ~5 PFLOPS FP16) | 600 GB/s | 6.5 kW
DGX H100 | 8× H100 SXM5 | 640 GB | 32 PFLOPS | 900 GB/s | 10.2 kW
DGX H200 | 8× H200 SXM | 1,128 GB | ~32 PFLOPS* | 900 GB/s | 10.2 kW
DGX B200 | 8× B200 SXM | 1,536 GB | ~72 PFLOPS | 1.8 TB/s | ~14.3 kW

* H200 has same GH100 compute die as H100; improvement is HBM3e capacity/bandwidth (4.8 TB/s vs 3.35 TB/s), not peak TFLOPS.

🖥️ DGX H100 — The Exam Reference System

The DGX H100 is the most exam-tested platform. Memorize these specs completely — they appear directly in NCP-AII questions about system capacity, power, and interconnect design.

DGX H100
Hopper · HBM3 · 8U
GPUs: 8× H100 SXM5 (80 GB HBM3)
Total GPU Memory: 640 GB HBM3
HBM Bandwidth: 3.35 TB/s per GPU
NVSwitch: 4× NVSwitch 3rd Gen
NVLink BW/GPU: 900 GB/s bidirectional
FP8 AI Performance: 32 PFLOPS
CPUs: 2× Xeon Platinum 8480C · 2 TB DDR5
Local Storage: 30 TB NVMe
Networking: 8× ConnectX-7 (1 per GPU)
Peak Power: 10.2 kW
DGX H200
Hopper+ · HBM3e Upgrade · 8U
GPUs: 8× H200 SXM (141 GB HBM3e)
Total GPU Memory: 1,128 GB HBM3e
HBM Bandwidth: 4.8 TB/s per GPU
NVSwitch: 4× NVSwitch 3rd Gen
FP8 AI Performance: ~32 PFLOPS (same GH100 die)
Best For: Large model inference (bigger KV cache)
Peak Power: 10.2 kW
DGX B200
Blackwell · 2025 · NVLink 5
GPUs: 8× B200 SXM (192 GB HBM3e)
Total GPU Memory: 1,536 GB HBM3e
HBM Bandwidth: 8.0 TB/s per GPU
NVSwitch: 4× NVSwitch 4th Gen
NVLink BW/GPU: 1.8 TB/s bidirectional
FP8 AI Performance: ~72 PFLOPS
FP4 AI Performance: ~144 PFLOPS
Peak Power: ~14.3 kW
GB200 NVL72
Blackwell · Full-Rack Liquid-Cooled
GPUs: 72× B200 GPUs
CPUs: 36× Grace CPUs (ARM)
Total GPU Memory: 13.8 TB HBM3e
NVLink Domain: All 72 GPUs in 1 NVLink domain
FP4 AI Performance: ~1.44 ExaFLOPS
Cooling: Liquid-cooled (direct liquid)
Rack Power: ~120 kW per rack

🏗️ HGX — OEM GPU Baseboard

HGX is NVIDIA's GPU subsystem for OEM partners. It includes the GPU tray (H100/H200/B200 SXM GPUs + NVSwitch chips) without the CPU, DRAM, storage, or networking. OEM partners like Dell, HPE, Lenovo, Supermicro, and Inspur integrate HGX into their own server chassis, adding their choice of CPU platform, networking, and cooling.

🔑 DGX vs HGX vs MGX — Quick Rule: DGX = NVIDIA builds everything. HGX = NVIDIA provides the GPU sled; the OEM builds the rest. MGX = NVIDIA provides a modular reference spec; the OEM builds any compliant server (1U–8U, air or liquid, PCIe or SXM).

HGX H100 Configurations

  • HGX H100 8-GPU: 8× H100 SXM5 + 4× NVSwitch 3rd gen — identical GPU/interconnect to DGX H100, different chassis/CPU/networking
  • HGX H100 4-GPU: 4× H100 SXM5 + 2× NVSwitch — used in half-width 2U form factors for cloud providers with denser rack deployments
  • OEM adds: CPUs (typically 2× Xeon Sapphire Rapids or AMD EPYC Genoa), system DRAM, NVMe SSDs, and ConnectX NICs
📌 Exam tip: The GPU and NVSwitch hardware in HGX H100 8-GPU is identical to DGX H100. The performance difference (if any) comes from the OEM's CPU, cooling, and network choices — not the GPU/NVSwitch subsystem.

🔌 SXM vs PCIe: GPU Form Factor Comparison

NVIDIA offers H100 (and other GPUs) in two physical form factors. The choice between them is one of the most exam-critical design decisions in AI server architecture. SXM is for maximum performance; PCIe is for flexibility and standard server integration.

⚡ SXM (Socket Module)
TDP: 700 W
NVLink 4 Bandwidth: 900 GB/s
NVSwitch Required: Yes (4 in DGX H100)
PCIe to Host: PCIe Gen 5 x16
Memory Bandwidth: 3.35 TB/s (HBM3)
Multi-GPU Topology: Full NVLink mesh
Typical System: DGX H100, HGX H100
Cooling Required: Active (high-volume airflow)
🔗 PCIe (Add-in Card)
TDP: 350 W
NVLink Bandwidth: None (or NVLink Bridge: limited)
NVSwitch Required: No — standard PCIe slot
PCIe to Host: PCIe Gen 5 x16
Memory Bandwidth: 2.0 TB/s (HBM2e, same GH100 die)
Multi-GPU Topology: PCIe switch fabric
Typical System: Standard 2U/4U OEM servers
Cooling Required: Standard 1U/2U airflow
⚠️ Critical Distinction — Same Die, Different System Impact: H100 SXM5 and H100 PCIe use the same GH100 silicon. The SXM form factor gets NVLink connectivity (900 GB/s GPU-to-GPU), HBM3, and a higher TDP (700 W vs 350 W), enabling sustained full Tensor Core utilization. The PCIe variant is capped at PCIe bandwidth (128 GB/s) for GPU-to-GPU communication and at a tighter thermal envelope.
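
One practical way to see the difference on a live system is to query NVLink state through NVML: an SXM part behind NVSwitch reports active links, while a bare PCIe card with no bridge reports none. A minimal illustrative sketch — not exam material — with error handling omitted:

// Sketch: count active NVLink links via NVML (error handling omitted)
#include <nvml.h>
#include <stdio.h>

int main(void) {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);
    unsigned int active = 0;
    for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
        nvmlEnableState_t state;
        if (nvmlDeviceGetNvLinkState(dev, link, &state) == NVML_SUCCESS &&
            state == NVML_FEATURE_ENABLED)
            ++active;                                   // each enabled link counts toward the NVLink total
    }
    printf("GPU 0: %u active NVLink links\n", active);  // 18 on H100 SXM5; 0 on a bridgeless PCIe card
    nvmlShutdown();
    return 0;
}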

📊 PCIe Gen 5 — The CPU-GPU Data Path

PCIe (Peripheral Component Interconnect Express) is the standard interface connecting GPUs to the CPU and host system. Understanding PCIe bandwidth — and its bottleneck effects — is essential for AI system design.

  • NVLink 4 (H100, GPU↔GPU): 900 GB/s
  • HBM3 memory bandwidth (H100): 3,350 GB/s — off the chart, ~26× PCIe 5
  • PCIe 5.0 x16 (GPU↔CPU): 128 GB/s
  • PCIe 4.0 x16 (previous gen): 64 GB/s
  • InfiniBand NDR (per port, host): 50 GB/s
🔑 The PCIe Bottleneck — Why It Matters: H100 HBM3 bandwidth = 3,350 GB/s. PCIe 5.0 x16 = 128 GB/s. That's a 26× gap — every byte that moves between CPU and GPU over PCIe travels at roughly 1/26th the speed of the GPU's own memory. AI training must therefore minimize host–GPU data movement: use pinned memory, async prefetch, and DALI for data loading.
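
To make the mitigation concrete, here is a minimal sketch of staging input batches in pinned (page-locked) host memory and issuing the PCIe copy asynchronously so it overlaps with compute on another stream. The 256 MB buffer size is illustrative and error checking is omitted:

// Sketch: pinned host memory + async H2D copy to hide PCIe latency (no error checks)
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 256UL << 20;             // 256 MB batch — illustrative size
    float *h_batch, *d_batch;
    cudaMallocHost((void**)&h_batch, bytes);      // pinned (page-locked): enables true async DMA over PCIe
    cudaMalloc((void**)&d_batch, bytes);

    cudaStream_t copy_stream;
    cudaStreamCreate(&copy_stream);

    // Prefetch the next batch over PCIe while kernels run on other streams
    cudaMemcpyAsync(d_batch, h_batch, bytes, cudaMemcpyHostToDevice, copy_stream);
    // ... launch training kernels on a separate stream here ...
    cudaStreamSynchronize(copy_stream);           // block only when the batch is actually needed

    cudaFree(d_batch);
    cudaFreeHost(h_batch);
    return 0;
}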

PCIe Version Bandwidth Table

PCIe Gen | GT/s per lane | x8 BW (BiDir) | x16 BW (BiDir) | First GPU Gen
PCIe 3.0 | 8 GT/s | 16 GB/s | 32 GB/s | Pascal (P100)
PCIe 4.0 | 16 GT/s | 32 GB/s | 64 GB/s | Ampere (A100)
PCIe 5.0 | 32 GT/s | 64 GB/s | 128 GB/s | Hopper (H100)
PCIe 6.0 | 64 GT/s | 128 GB/s | 256 GB/s | Future / Blackwell+
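
Where the x16 numbers come from (a back-of-envelope derivation that ignores the small encoding and protocol overhead): 32 GT/s per lane × 16 lanes ÷ 8 bits per byte ≈ 64 GB/s in each direction, so ~128 GB/s bidirectional for Gen 5. Each generation doubles the transfer rate, which halves these figures for Gen 4 and doubles them for Gen 6.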

NVLink Bridge (PCIe GPUs)

For PCIe H100 deployments needing some GPU-to-GPU bandwidth above PCIe, NVIDIA offers the NVLink Bridge — a physical connector linking 2 or 4 PCIe GPUs via NVLink. However, the NVLink Bridge provides far less bandwidth than a full NVSwitch topology (limited to 2–4 GPU pairs, not a full mesh), and is not a substitute for the SXM/NVSwitch architecture in large training workloads.

🌐 AI Server Network Architecture

AI servers require multiple distinct network fabrics for different traffic types. Mixing them onto a single network causes congestion and performance collapse. The NCP-AII exam tests your understanding of which network handles which traffic, and the hardware involved.

  • 🔥 Compute Fabric (AI Data Plane) — InfiniBand NDR (400 Gb/s) or Spectrum-X Ethernet (400 GbE); carries AllReduce, gradient sync, and NCCL traffic between GPUs across nodes (a minimal NCCL sketch follows this list). 1 ConnectX-7 NIC per GPU.
  • 💾 Storage Fabric — 100–400 GbE to a shared filesystem (Lustre, WEKA, GPFS); carries training data reads and checkpoint writes. Often separate from the compute fabric.
  • ⚙️ Management / BMC Network — 1 GbE or 10 GbE out-of-band management; IPMI, BMC, iDRAC, iLO access. Used for server lifecycle management, not AI traffic.
  • 🔗 NVLink (Intra-Node) — Not a network: a direct GPU-to-GPU fabric within the DGX node via NVSwitch. 900 GB/s per H100 GPU, full mesh, handled by NVSwitch hardware automatically.
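
For context on what that compute-fabric traffic actually is: collective libraries such as NCCL issue the AllReduce. The sketch below is illustrative only — a single process, assuming up to 8 visible GPUs as in a DGX H100, with error handling omitted. Intra-node the transfers ride NVLink/NVSwitch; the same call generates ConnectX-7/InfiniBand traffic once ranks span nodes.

// Sketch: NCCL AllReduce across all visible GPUs in one process (no error checking)
#include <nccl.h>
#include <cuda_runtime.h>

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 8) ndev = 8;                        // cap at one DGX H100's worth of GPUs
    ncclComm_t comms[8];
    ncclCommInitAll(comms, ndev, NULL);            // one communicator per local GPU

    float *grad[8];
    cudaStream_t streams[8];
    const size_t count = 1 << 20;                  // 1M floats standing in for gradients
    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaMalloc((void**)&grad[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    ncclGroupStart();                              // sum the gradient buffers across all GPUs
    for (int i = 0; i < ndev; ++i)
        ncclAllReduce(grad[i], grad[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) { cudaSetDevice(i); cudaStreamSynchronize(streams[i]); }
    for (int i = 0; i < ndev; ++i) ncclCommDestroy(comms[i]);
    return 0;
}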

🔧 ConnectX-7 — The AI Server NIC

ConnectX-7 is NVIDIA's 7th-generation network adapter, standard in DGX H100 (one per GPU). It supports both InfiniBand NDR (400 Gb/s) and 400GbE, making it dual-mode capable. Each ConnectX-7 directly services one H100 GPU, ensuring the GPU's network bandwidth is not shared with other GPUs.

Feature | ConnectX-7 Spec
Max port speed | 400 Gb/s (NDR InfiniBand or 400GbE)
Protocol support | InfiniBand NDR, 100/200/400GbE, RoCEv2
GPUDirect RDMA | Yes — NIC DMA direct to/from GPU HBM
PCIe interface | PCIe Gen 5 x16
RDMA latency | <600 ns (InfiniBand)
Count in DGX H100 | 8× (1 per GPU) + 1× management
Offloads | RDMA, GPUDirect, RoCEv2, SHARP, TCP offload

GPUDirect RDMA

GPUDirect RDMA (Remote Direct Memory Access) allows a ConnectX-7 NIC to transfer data directly between a remote server's GPU memory and the local GPU's HBM — completely bypassing the CPU and system DRAM. This eliminates two PCIe crossings per transfer and can double effective network bandwidth for GPU-to-GPU operations across nodes.

🔑 GPUDirect RDMA Data Path — Without: Remote GPU → Remote NIC → Network → Local NIC → CPU (PCIe) → System DRAM → GPU (PCIe). With GPUDirect RDMA: Remote GPU → Remote NIC → Network → Local NIC → Local GPU HBM directly. Eliminates 2 PCIe hops + a CPU copy.
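
What makes this path possible is that, with the nvidia-peermem module loaded, the RDMA verbs stack can register a buffer that lives in GPU HBM just as it would host memory; the NIC then DMAs to or from it directly. A minimal sketch of only that registration step — the protection domain pd and buffer length are assumed to exist already, and this is not a complete RDMA program:

// Sketch: registering GPU memory for RDMA — the enabling step of GPUDirect RDMA
#include <infiniband/verbs.h>
#include <cuda_runtime.h>

struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t len) {
    void *gpu_buf = NULL;
    cudaMalloc(&gpu_buf, len);                    // buffer lives in GPU HBM, not host DRAM
    // With nvidia-peermem loaded, the verbs stack pins and DMA-maps these HBM pages,
    // so the ConnectX NIC reads/writes them directly — no CPU bounce buffer.
    return ibv_reg_mr(pd, gpu_buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}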

InfiniBand NDR vs Spectrum-X for AI

Feature | InfiniBand NDR | Spectrum-X (Ethernet)
Port speed | 400 Gb/s | 400 GbE
Latency | ~600 ns (lowest) | ~2–5 µs
Protocol | IB native (lossless) | RoCEv2 over UDP/IP
Congestion control | Hardware-native ECN + credit-based | DCQCN (PFC + ECN)
SHARP in-network compute | Yes (AllReduce in switch ASIC) | Yes (Spectrum-4 ASIC)
Use case | Tightly coupled HPC/AI training | Cloud-native, Ethernet-standard AI
NIC | ConnectX-7 (dual-mode) | ConnectX-7 (dual-mode)
📌 Exam tip: ConnectX-7 supports both InfiniBand NDR and 400GbE in the same hardware — the mode is selected by software. DGX H100 ships with ConnectX-7 configured to the customer's choice of fabric.

💾 Storage for AI Workloads

AI training has extreme storage demands: large dataset reads during training, frequent checkpointing of multi-hundred-GB model weights, and burst I/O patterns. The storage architecture must match these demands — a slow filesystem becomes the training bottleneck even with 8 fast H100s.

  • Tier 1 — Fastest: GPU HBM3 — 3,350 GB/s. Active model weights, activations, gradients. 80 GB per H100.
  • Tier 2 — Local: NVMe SSD — ~50 GB/s. DGX H100: 30 TB (8× 3.84 TB NVMe). Scratch and checkpoint staging.
  • Tier 3 — Shared: Lustre / WEKA — 100–400 GB/s. Cluster-wide parallel filesystem. Training datasets, shared checkpoints.
  • Tier 4 — Cold: Object storage — 1–10 GB/s. S3-compatible (AWS S3, WEKA S3). Datasets, model archives, long-term storage.

GPUDirect Storage (GDS)

GPUDirect Storage enables a DMA engine to transfer data directly between NVMe SSDs and GPU HBM memory, bypassing the CPU and system DRAM entirely. This is critical for checkpointing large models (e.g., saving a 140 GB LLaMA 70B checkpoint) without saturating PCIe with CPU-routed copies.

🔑 GDS Data Path Comparison — Without GDS: NVMe → PCIe → CPU → system DRAM → PCIe → GPU HBM (2× PCIe crossings, CPU involved). With GDS: NVMe → PCIe DMA → GPU HBM directly (1× PCIe crossing, CPU-free). Uses the cuFile API and requires the nvidia-fs kernel driver.
// GPUDirect Storage read example (cuFile API)
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>

int fd = open("/path/to/checkpoint.bin", O_RDONLY | O_DIRECT);  // GDS requires O_DIRECT
CUfileDescr_t cf_desc = {};
cf_desc.handle.fd = fd;
cf_desc.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
CUfileHandle_t cf_handle;
cuFileDriverOpen();                                   // initialize the nvidia-fs driver context
cuFileHandleRegister(&cf_handle, &cf_desc);           // bind the file descriptor to cuFile
void *gpu_buffer_ptr; cudaMalloc(&gpu_buffer_ptr, read_size);
// Direct NVMe → GPU DMA — no CPU data copy
cuFileRead(cf_handle, gpu_buffer_ptr, read_size, file_offset, 0);

Shared Filesystems for AI Clusters

Filesystem | Type | Bandwidth | Best For | Notes
Lustre | Open-source parallel FS | 100s of GB/s | HPC + AI training datasets | Standard in DGX SuperPOD reference
WEKA | Flash-native parallel FS | Up to 1 TB/s | All-flash AI clusters | Native S3 + POSIX, NVIDIA validated
IBM Spectrum Scale (GPFS) | Enterprise parallel FS | 100s of GB/s | Enterprise AI, finance | Mature, complex management
BeeGFS | Open parallel FS | 10–100 GB/s | Academic clusters | Easy setup, lower performance ceiling
NFS v4 | Network FS | 1–10 GB/s | Not recommended for training | Single metadata server bottleneck

📡 Scale-Out: DGX POD → SuperPOD

Single DGX nodes are the building block. NVIDIA provides reference architectures — DGX POD and DGX SuperPOD — that define how to interconnect multiple DGX nodes into validated AI clusters with known-good performance, networking, and storage configurations.

1. DGX H100 Node — 8 GPUs · 640 GB HBM3 · 32 PFLOPS FP8 · 900 GB/s NVLink mesh · 8× ConnectX-7
2. DGX BasePOD (8 nodes) — 64 GPUs · 5.12 TB HBM3 · 256 PFLOPS FP8 · Quantum-2 NDR IB leaf switches · Shared Lustre/WEKA storage
3. DGX SuperPOD (32 nodes) — 256 GPUs · 20.48 TB HBM3 · 1,024 PFLOPS FP8 · Non-blocking fat-tree IB NDR fabric · 2-tier leaf–spine switching
4. Multi-SuperPOD / AI Datacenter — Thousands of GPUs · Multiple SuperPODs connected via IB core switches · Hierarchical storage (NVMe → Lustre → Object) · DCGM + UFM observability
🔑 DGX SuperPOD Key Numbers (H100): 32 DGX H100 nodes = 256 H100 GPUs = 1,024 PFLOPS FP8 = 20.48 TB total HBM3. Network: Quantum-2 NDR InfiniBand non-blocking fat-tree. Storage: WEKA or Lustre at 1+ TB/s aggregate. Validated reference design — no guesswork on integration.

GB200 NVL72 — Rack-Scale Computing

The GB200 NVL72 takes scale-up further: 72 B200 GPUs and 36 Grace CPUs in a single liquid-cooled rack, all connected in one NVLink 5 domain. Every GPU can communicate with every other GPU at 1.8 TB/s without leaving the rack — eliminating the inter-node InfiniBand bottleneck for models that fit within 72 GPUs.

  • Total GPU memory: 13.8 TB HBM3e — holds multi-trillion-parameter models entirely in HBM, with room left for KV cache
  • FP4 performance: ~1.44 ExaFLOPS — enables real-time inference of trillion-parameter models
  • Rack power: ~120 kW — requires facility-level direct liquid cooling infrastructure
  • CPU: Grace (ARM-based) — tight CPU-GPU memory bandwidth via NVLink-C2C (900 GB/s coherent interface)

🧪 Practice Quiz — AI Server Systems

Question 1 of 10
How many H100 SXM5 GPUs are housed in a single DGX H100 system?
A) 4
B) 6
C) 8
D) 16
Question 2 of 10
What is the TDP (thermal design power) of the H100 GPU in SXM form factor?
A) 350 W
B) 500 W
C) 700 W
D) 1,000 W
Question 3 of 10
What is the bidirectional bandwidth of PCIe Gen 5 x16 — the interface connecting H100 to the host CPU?
A) 64 GB/s
B) 96 GB/s
C) 128 GB/s
D) 256 GB/s
Question 4 of 10
Which NIC is standard in the DGX H100 (one per GPU) for compute network connectivity?
A) ConnectX-5
B) ConnectX-6 Dx
C) ConnectX-7
D) BlueField-3 DPU
Question 5 of 10
What does GPUDirect Storage (GDS) enable?
A) Faster GPU-to-GPU NVLink transfers within a node
B) Direct DMA between NVMe storage and GPU HBM, bypassing the CPU
C) Encrypted GPU memory for multi-tenant isolation
D) Automatic data tiering from HBM to NVMe when GPU memory is full
Question 6 of 10
What is the approximate total FP8 AI performance of a DGX H100?
A) 16 PFLOPS
B) 24 PFLOPS
C) 32 PFLOPS
D) 64 PFLOPS
Question 7 of 10
What is the primary advantage of the SXM GPU form factor over the PCIe form factor for multi-GPU AI training?
A) Lower power consumption (350W vs 700W)
B) Compatible with standard PCIe server slots
C) NVLink connectivity enabling 900 GB/s GPU-to-GPU bandwidth via NVSwitch
D) Support for more CUDA cores per chip
Question 8 of 10
How does HGX differ from DGX?
A) HGX supports more GPUs than DGX
B) HGX is an OEM GPU baseboard (GPU + NVSwitch only); DGX is a complete NVIDIA turnkey system
C) HGX uses PCIe GPUs; DGX uses SXM GPUs exclusively
D) HGX is only for inference; DGX is only for training
Question 9 of 10
How many NVSwitch chips are in a DGX H100 system?
A) 2
B) 4
C) 6
D) 8
Question 10 of 10
For large-scale AI training clusters, which shared filesystem options are commonly used for high-throughput scratch and dataset storage?
A) NFS v4 — widely supported network filesystem
B) SMB/CIFS — standard Windows-compatible shares
C) HDFS — Hadoop distributed filesystem
D) Lustre or WEKA — high-throughput parallel filesystems

🃏 Flashcards

DGX H100
GPU count, total HBM, FP8 performance, and peak power?
8× H100 SXM5 · 640 GB HBM3 · 32 PFLOPS FP8 · 10.2 kW peak power
8 GPUs, 640 GB, 32 PFLOPS
SXM vs PCIe TDP
H100 SXM form factor TDP vs H100 PCIe form factor TDP?
SXM: 700 W (with NVLink, full NVSwitch mesh) · PCIe: 350 W (no NVSwitch, PCIe-only GPU-GPU)
700W SXM · 350W PCIe
PCIe Gen 5
PCIe Gen 5 x16 bidirectional bandwidth vs PCIe Gen 4?
Gen 5 x16: 128 GB/s bidirectional · Gen 4 x16: 64 GB/s · Gen 5 = 2× Gen 4
5 → 128 GB/s; 4 → 64 GB/s
GPUDirect Storage
What path does GDS enable, and what API does it use?
NVMe → PCIe DMA → GPU HBM directly (no CPU). Uses cuFile API + nvidia-fs kernel driver.
NVMe → GPU, cuFile, no CPU copy
DGX H100 NVSwitches
How many NVSwitch chips in DGX H100 and what do they provide?
4× NVSwitch 3rd gen · Full all-to-all GPU mesh · 900 GB/s between any GPU pair · 57.6 TB/s aggregate switching BW
4 NVSwitches = full mesh
HGX vs DGX
What is the key difference between HGX and DGX?
HGX = GPU baseboard only (GPU + NVSwitch). OEM adds CPU, RAM, storage, NICs. DGX = NVIDIA complete turnkey system with everything included.
HGX = GPU sled only; DGX = full system
DGX SuperPOD
How many DGX H100 nodes, total GPUs, and FP8 PFLOPS in a DGX SuperPOD?
32 DGX H100 nodes · 256 H100 GPUs · ~1,024 PFLOPS FP8 · Non-blocking NDR IB fat-tree fabric
32 nodes × 8 GPUs = 256 GPUs
ConnectX-7
ConnectX-7 max speed, protocol support, and count in DGX H100?
400 Gb/s per port · InfiniBand NDR or 400GbE (dual-mode) · 8× in DGX H100 (1 per GPU) + 1 management NIC
CX-7: 400 Gb/s, dual-mode, 8+1 per DGX

🤖 AI Server Systems Advisor


🖥️ DGX System Deep Dive — Key Exam Points

  • DGX H100 = 8× H100 SXM5 + 4× NVSwitch 3G + 2× Xeon Platinum 8480C (56C each) + 2 TB DDR5 + 30 TB NVMe + 8× ConnectX-7 + 10.2 kW power. Every number is exam-critical.
  • DGX H200 has same GH100 compute die as H100 — compute TFLOPS are similar. Key upgrade: HBM3e 141 GB per GPU (vs 80 GB HBM3), and 4.8 TB/s bandwidth (vs 3.35 TB/s). Choose H200 when model size or KV cache exceeds 80 GB per GPU.
  • DGX B200: 8× B200, 192 GB HBM3e each, 1,536 GB total, NVLink 5 at 1.8 TB/s, ~72 PFLOPS FP8. Air-cooled despite the higher ~1,000 W per-GPU TDP; liquid cooling arrives with the rack-scale GB200 NVL72.
  • GB200 NVL72: not a DGX — it's a full-rack design with 72 B200 + 36 Grace CPUs, all connected in one NVLink 5 domain. 13.8 TB HBM3e, ~120 kW rack power, direct liquid cooling required.
  • The 4 NVSwitch chips in DGX H100 create a non-blocking all-to-all GPU mesh. With 8 GPUs × 900 GB/s NVLink = 7.2 TB/s aggregate NVLink bandwidth within the chassis.
  • HGX H100 8-GPU baseboard has identical GPU/NVSwitch hardware to DGX H100. Expect exam questions asking which components are "OEM-configurable" (answer: CPU, RAM, NIC, storage) vs "fixed by NVIDIA" (GPU, NVSwitch).

🔌 PCIe & SXM Selection — Design Guide

  • Choose SXM when: training large models across multiple GPUs (needs NVLink AllReduce at 900 GB/s), running tightly coupled parallel workloads, or deploying in DGX/HGX reference systems. SXM = 700W, NVLink 4.
  • Choose PCIe when: deploying in standard OEM servers, inference workloads with single-GPU jobs, budget-constrained deployments, or server chassis lacking NVLink baseboard. PCIe = 350W, no NVSwitch.
  • PCIe Gen 5 bandwidth: 128 GB/s x16 bidirectional. PCIe Gen 4: 64 GB/s. Gen 5 is 2× Gen 4. H100 introduced PCIe Gen 5. A100 used PCIe Gen 4.
  • The PCIe bottleneck: H100 HBM3 = 3,350 GB/s. PCIe 5 = 128 GB/s. The host-GPU interface is 26× slower than GPU memory. Minimize CPU↔GPU data movement in training pipelines.
  • NVLink Bridge (PCIe): connects 2 or 4 PCIe H100 cards directly via NVLink lanes. Useful for small multi-GPU servers but far less capable than full NVSwitch topology (no full mesh beyond 4 GPUs).
  • Verify the topology with nvidia-smi topo -m: NV# entries (e.g., NV18 on SXM + NVSwitch) indicate NVLink paths, while SYS indicates PCIe traffic crossing the CPU socket. A programmatic check is sketched below.
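
As a programmatic complement, CUDA can report whether two GPUs have a direct peer-to-peer path at all — note this is a sketch only, and it flags P2P capability, not whether the link is NVLink or PCIe:

// Sketch: query and enable CUDA peer-to-peer access between GPU 0 and GPU 1
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("P2P 0->1: %d  1->0: %d\n", can01, can10);
    if (can01) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);         // map GPU 1's memory into GPU 0's address space
    }
    return 0;
}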

🌐 Network Architecture — Design Guide

  • Use 3 separate networks in an AI cluster: (1) Compute fabric for GPU-GPU AllReduce, (2) Storage fabric for dataset reads + checkpoint writes, (3) Management/BMC for out-of-band access. Mixing these causes congestion.
  • ConnectX-7 is dual-mode: InfiniBand NDR (400 Gb/s) or 400GbE with RoCEv2. Same hardware, mode selected at configuration time. DGX H100 ships with 8× ConnectX-7 (1 per GPU).
  • GPUDirect RDMA: remote GPU → local GPU data transfer via NIC DMA, bypassing CPU and system DRAM. Requires nvidia-peermem driver module + ConnectX-7. Verify with lsmod | grep nvidia_peermem.
  • InfiniBand NDR: ~600 ns latency, hardware-native congestion control, SHARP in-network AllReduce. Best for tightly-coupled HPC-style AI training.
  • Spectrum-X (RoCEv2): DCQCN congestion control (PFC + ECN), SHARP AllReduce in Spectrum-4 ASIC, compatible with standard Ethernet tooling. Choose for cloud-native or Ethernet-first environments.
  • Rail-optimized topology: each GPU in a multi-GPU server connects to a different top-of-rack switch "rail", so inter-node traffic from GPU N always goes to the same rail switch — avoiding local switch oversubscription.

💾 Storage Configuration — Guide

  • DGX H100 local NVMe: 30 TB (8× 3.84 TB NVMe SSDs), ~50 GB/s aggregate sequential read. Use for checkpoint staging, temporary scratch during training iterations.
  • GPUDirect Storage: enables NVMe → GPU DMA without CPU. Use cuFile API. Requires nvidia-fs kernel driver + GDS-compatible NVMe. Dramatically reduces checkpoint overhead for large models.
  • Lustre: open-source parallel filesystem, standard in DGX SuperPOD reference. Metadata server (MDS) + object storage servers (OSS). Scales to 100s GB/s with enough OSS nodes.
  • WEKA: all-flash parallel filesystem, native S3 + POSIX. NVIDIA-validated for DGX deployments. Can deliver 1+ TB/s aggregate with all-NVMe backend. Best-in-class for AI training I/O.
  • Checkpoint sizing: LLaMA 70B = ~140 GB in BF16. Full checkpoint write to local NVMe at 50 GB/s = ~3 seconds. Writing to slow NFS at 1 GB/s = ~140 seconds. Storage BW matters enormously for training efficiency.
  • Data pipeline: use NVIDIA DALI (Data Loading Library) for image/video preprocessing directly on GPU. Eliminates CPU bottleneck in vision training pipelines. Combined with GPUDirect Storage for end-to-end CPU-free data loading.

📡 Scale-Out: POD to SuperPOD — Guide

  • DGX BasePOD (8 nodes): 64 GPUs, Quantum-2 NDR IB leaf switches, shared storage. Minimum validated AI cluster configuration. Good starting point for LLM fine-tuning at scale.
  • DGX SuperPOD (32 nodes): 256 H100 GPUs, 1,024 PFLOPS FP8, non-blocking 2-tier fat-tree IB NDR fabric (leaf + spine). The standard reference for large LLM pre-training facilities.
  • SuperPOD storage: WEKA or Lustre configured for 1+ TB/s aggregate throughput. Each DGX node's 8 ConnectX-7 NICs connect to storage and compute fabric simultaneously via different switch planes.
  • GB200 NVL72 is a scale-up alternative to SuperPOD: 72 GPUs in 1 NVLink domain vs 256 GPUs across 32 nodes with InfiniBand. NVLink (1.8 TB/s) >> InfiniBand NDR (400 Gb/s) for inter-GPU bandwidth, but NVL72 is limited to 72 GPUs per domain.
  • Multi-rack clusters: connect SuperPODs via IB core switches (Quantum-2 or Quantum-X800). Full bisection bandwidth non-blocking design is preferred for training; oversubscribed designs (e.g., 4:1 oversubscription) are acceptable for inference.
  • Observability at scale: DCGM monitors GPU health (temperature, power, ECC, SM utilization) across all nodes. UFM (Unified Fabric Manager) monitors InfiniBand fabric health. Both feed into Prometheus + Grafana dashboards.

🧩 Platform Mnemonics

Fact | Mnemonic / Hook
DGX H100 = 8 GPUs, 640 GB, 32 PFLOPS | "8 GPUs × 80 GB = 640 GB. 8 × 3,958 TFLOPS ÷ 1,000 ≈ 32 PFLOPS"
DGX H100 power = 10.2 kW | "8 GPUs × 700 W + CPU/overhead ≈ 10.2 kW — roughly 20 gaming PCs at full load"
PCIe 5 = 128 GB/s | "32 GT/s per lane × 16 lanes ÷ 8 bits ≈ 64 GB/s each way → 128 GB/s bidirectional"
SXM = 700 W; PCIe = 350 W | "SXM = double the power, plus NVLink. PCIe = half the power, PCIe-only GPU-to-GPU"
ConnectX-7 = 400 Gb/s dual-mode | "CX-7 = both roads: InfiniBand or Ethernet, same NIC"
SuperPOD = 32 nodes = 256 GPUs | "32 × 8 = 256 — 32 DGX nodes make one SuperPOD"
GDS = cuFile API | "GPUDirect Storage = cuFile + nvidia-fs kernel module"
NCP-AII · Topic 2 Complete