What You'll Master
Training vs Inference Hardware
Why training demands maximum GPU memory and NVLink bandwidth (H100/H200 SXM), while inference prioritizes cost efficiency and latency (A10G, L4, L40S PCIe). Memory sizing formulas for both.
GPU Memory and Model Sizing
HBM (High Bandwidth Memory), the model memory formula (parameters × bytes), training overhead (4–6× inference memory), gradient checkpointing, and KV cache challenges for inference.
GPU Scaling Strategies
Single GPU → Data Parallelism → Tensor Parallelism (within node, NVLink) → Pipeline Parallelism (cross-node, InfiniBand) → 3D Parallelism for trillion-parameter models.
AI Cluster Architecture
Four pillars: compute nodes (DGX/HGX), high-speed interconnect fabric (InfiniBand/RoCE), parallel storage (Lustre/WEKA), and management network (BMC + Base Command Manager).
On-Premises vs Cloud
Data sovereignty, predictable cost (on-prem) vs instant scalability, zero CapEx (cloud). Hybrid approach: steady-state on-prem, burst to cloud. Key decision factors for each scenario.
NVIDIA Reference Architectures
BasePOD (validated reference, OEM-agnostic), DGX BasePOD, DGX POD (20× DGX H100 = 160 GPUs), DGX SuperPOD (thousands of GPUs, AI factory scale).
Exam Weight
| Domain | Coverage | Exam Questions (est.) |
|---|---|---|
| AI Infrastructure — Hardware & Systems (this page) | 40% | ~20 questions |
| Essential AI Knowledge | ~38% | ~19 questions |
| AI Operations & MLOps | ~22% | ~11 questions |
Total exam: 50 questions, 60 minutes, passing score ~70%.
Concept 1 — Hardware Requirements: Training vs Inference
Training Hardware Priorities
- Maximum GPU memory (to hold model weights and activations)
- High memory bandwidth (HBM3/HBM3e)
- High FP16/BF16/FP8 throughput (Tensor Cores)
- Fast GPU-to-GPU interconnect (NVLink)
- Fast node-to-node networking (InfiniBand)
Training-Optimized GPUs
- H100 SXM5 — 80GB HBM3; preferred for training clusters
- H200 SXM5 — 141GB HBM3e; larger models per node
- B200 — 192GB HBM3e; next-gen frontier training
- Always prefer SXM form factor for training (NVLink connectivity via NVSwitch)
Inference Hardware Priorities
- Low latency (time-to-first-token, tokens per second)
- High throughput (concurrent requests)
- Cost efficiency (maximize requests per dollar)
- Support for quantized models (INT8/FP8)
Inference-Suitable GPUs
- A10G (24GB, PCIe) — cloud inference workhorse
- L4 (24GB, PCIe) — energy-efficient inference
- L40S (48GB, PCIe) — large model inference
- H100 PCIe (80GB) — large model inference, commodity servers
- Quantization (INT8, FP8) reduces memory requirements significantly
Memory Requirement Estimation
Parameters × bytes per parameter: FP16 = 2 bytes, FP32 = 4 bytes, INT8 = 1 byte.
70B model in FP16 = 140GB for weights alone → needs multiple H100s (8× 80GB = 640GB available). Optimizer states add 2–4× during training.
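These rules of thumb are easy to encode. A minimal sketch in Python, assuming the 4–6× training multiplier from this section (5× used as a midpoint, not a measured value):

```python
# Back-of-the-envelope GPU memory estimator. All figures are decimal GB.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion: float, dtype: str = "fp16") -> float:
    """Weights alone: parameters (billions) x bytes per parameter = GB."""
    return params_billion * BYTES_PER_PARAM[dtype]

def training_memory_gb(params_billion: float, overhead: float = 5.0) -> float:
    """Rule of thumb from this section: training ~= 4-6x the FP16 weight memory."""
    return weight_memory_gb(params_billion, "fp16") * overhead

print(weight_memory_gb(70))           # 140.0 GB -> spans multiple H100 80GB GPUs
print(training_memory_gb(70))         # 700.0 GB -> at least one 8x H100 node (640GB is tight)
print(weight_memory_gb(70, "int4"))   # 35.0 GB  -> fits a single H200 141GB
```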
Batch Size Effect
Larger batch = better GPU utilization but more memory needed. Training: use the largest batch that fits. Inference: small or batch-size-1 for low latency; larger batches for throughput optimization.
Concept 2 — GPU Memory and Model Sizing
HBM (High Bandwidth Memory)
Stacked DRAM directly on GPU package — far higher bandwidth than GDDR. Key specs:
- H100 SXM5: 80GB HBM3 at 3.35 TB/s
- H200: 141GB HBM3e at 4.8 TB/s
- B200: 192GB HBM3e at 8 TB/s
Model Memory Formula
Parameters (billions) × 2 bytes (FP16) = GB for weights:
- 7B model = 14GB
- 13B model = 26GB
- 70B model = 140GB
- 405B model = 810GB
Training Overhead
Adam optimizer adds 2× more memory (momentum + variance states). Activations for backpropagation add memory proportional to batch size and model depth.
Total training memory ≈ 4–6× inference memory.
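One loose decomposition behind the 4–6× figure, consistent with the "Adam adds 2×" claim above (the activation term is an assumption — it grows with batch size and model depth):

```python
# All terms are multiples of the FP16 weight memory (assumed decomposition).
weights     = 1.0   # FP16 model weights
gradients   = 1.0   # FP16 gradients, same size as the weights
optimizer   = 2.0   # Adam momentum + variance states ("2x more", per this section)
activations = 1.0   # assumption: varies with batch size and model depth

multiplier = weights + gradients + optimizer + activations
print(multiplier)                      # ~5x -> inside the 4-6x range
print(multiplier * 140, "GB for 70B")  # ~700 GB of GPU memory during training
```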
Gradient Checkpointing
Trade compute for memory during training — recompute activations on the backward pass instead of storing them. Allows larger batch sizes or models at a cost of ~30% more compute. Key technique for memory-constrained training.
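A minimal PyTorch sketch of the technique, assuming torch is installed (`torch.utils.checkpoint` is the standard API for this):

```python
import torch
from torch.utils.checkpoint import checkpoint

# A transformer-style block whose intermediate activations we choose not to store.
block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)
x = torch.randn(8, 1024)

# Activations inside `block` are discarded after the forward pass and
# recomputed during backward -- trading extra compute for less memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```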
KV Cache (Inference)
Transformer inference requires storing key-value pairs for attention across the context window. Size grows with context length and batch size. Managing the KV cache is the primary inference memory challenge for long-context models.
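The cache size follows directly from the attention geometry. A sketch where the model dimensions are illustrative assumptions for a 70B-class transformer, not official specs:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per=2):
    # 2x for keys AND values, stored per layer, per token, per KV head (FP16 = 2 bytes)
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per / 1e9

# Grouped-query attention (8 KV heads) vs full multi-head attention (64 heads):
print(kv_cache_gb(80, 8, 128, seq_len=4096, batch=1))    # ~1.3 GB
print(kv_cache_gb(80, 64, 128, seq_len=32768, batch=8))  # ~687 GB -- the long-context problem
```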
Practical Sizing Examples
- 70B training (FP16): 140GB weights + optimizer + activations ≈ 560–840GB → minimum 8× H100 80GB (one DGX node)
- 70B with tensor parallelism: split across 8 GPUs
- 70B inference (INT4): 70B × 0.5 bytes = 35GB → fits in 1× H200 141GB
Concept 3 — GPU Scaling Strategies
Single GPU
Baseline; simplest deployment. Limited by GPU memory. Fine for models <30B parameters in FP16, or smaller models with quantization. Suitable for inference, small model development, and experimentation.
Data Parallelism (DP)
Same model replicated across all GPUs; each processes a different data batch. Gradients averaged via all-reduce (NCCL). Scales training throughput linearly with GPU count. Limitation: full model must fit in one GPU.
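A minimal sketch of the pattern with PyTorch DDP (assumes torch built with NCCL, launched as one process per GPU via `torchrun --nproc_per_node=8 train.py`):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                  # one process per GPU
rank = int(os.environ["LOCAL_RANK"])             # set by torchrun
torch.cuda.set_device(rank)

model = torch.nn.Linear(1024, 1024).cuda(rank)
model = DDP(model, device_ids=[rank])            # full model replica on every GPU

x = torch.randn(32, 1024, device=rank)           # each rank sees a different batch
loss = model(x).sum()
loss.backward()                                  # gradients all-reduced via NCCL here
dist.destroy_process_group()
```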
Tensor Parallelism (TP)
Split individual layers (weight matrices) across GPUs — each holds a shard of every layer. Requires high-bandwidth intra-node interconnect (NVLink). Best within a single node (up to 8 GPUs). Megatron-LM pioneered this approach.
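A conceptual sketch of the sharding math — a column-parallel matmul split across two "GPUs" (plain CPU tensors here, just to show why the shards recombine into the full result):

```python
import torch

x = torch.randn(4, 1024)            # activations, replicated on every shard
W = torch.randn(1024, 4096)         # full weight matrix no single GPU would hold

W0, W1 = W.chunk(2, dim=1)          # each device stores half the columns
y0 = x @ W0                         # computed on device 0
y1 = x @ W1                         # computed on device 1
y = torch.cat([y0, y1], dim=1)      # all-gather over NVLink in a real system

assert torch.allclose(y, x @ W)     # sharded result matches the full matmul
```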
Pipeline Parallelism (PP)
Split model layers into stages across nodes — each node handles a contiguous set of layers. Overlaps forward/backward passes to reduce "pipeline bubble" overhead. Enables models too large for tensor parallelism alone. Uses InfiniBand between nodes.
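A toy single-process sketch of the stage split and micro-batching idea (real frameworks run the stages on different nodes and overlap them; this only shows the schedule):

```python
import torch

stage1 = torch.nn.Linear(256, 256)    # layers 0..N/2 -- lives on node A in reality
stage2 = torch.nn.Linear(256, 256)    # layers N/2..N -- lives on node B (InfiniBand hop)

batch = torch.randn(16, 256)
outputs = []
for micro in batch.chunk(4):          # micro-batches shrink the pipeline bubble
    hidden = stage1(micro)            # node A forward on micro-batch i
    outputs.append(stage2(hidden))    # activations handed to node B, which works
                                      # on micro-batch i while node A starts i+1
result = torch.cat(outputs)
```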
3D Parallelism
Combine all three strategies: TP within node (NVLink) + PP across nodes (InfiniBand) + DP across pipeline replicas. Used by Megatron-DeepSpeed for trillion-parameter model training.
Parallelism Selection Guide
- Model fits one GPU → Data Parallel
- Too large for one GPU → Tensor Parallel within node
- Too large for one node → Pipeline Parallel + Tensor Parallel across nodes
- Largest models → 3D Parallelism
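The escalation above is mechanical enough to encode. A sketch where the 80GB/8-GPU thresholds are assumptions matching the H100 nodes discussed on this page, not hard limits:

```python
def pick_parallelism(model_gb: float, gpu_gb: float = 80, gpus_per_node: int = 8) -> str:
    """Map FP16 model size to a parallelism strategy (rules of thumb only)."""
    if model_gb <= gpu_gb:
        return "Data Parallel: replicate the model, split the batches"
    if model_gb <= gpu_gb * gpus_per_node:
        return "Tensor Parallel within a node (NVLink)"
    return "Pipeline + Tensor Parallel across nodes (InfiniBand), or full 3D"

print(pick_parallelism(14))    # 7B FP16   -> Data Parallel
print(pick_parallelism(140))   # 70B FP16  -> Tensor Parallel
print(pick_parallelism(810))   # 405B FP16 -> Pipeline + Tensor / 3D
```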
Concept 4 — AI Cluster Architecture and Components
Compute Nodes
GPU servers (DGX, HGX-based, or custom). Each node typically houses 8 GPUs + 2 CPUs + NVSwitch/NVLink fabric. One DGX H100 = 640GB total GPU memory (8× 80GB). The fundamental building block of any AI cluster.
High-Speed Interconnect Fabric
InfiniBand (NDR 400Gb/s or HDR 200Gb/s) or RoCE (RDMA over Converged Ethernet) connects nodes for gradient all-reduce during distributed training. Latency and bandwidth directly determine scaling efficiency.
Storage Subsystem
High-performance parallel storage for datasets and checkpoints: Lustre, GPFS, WEKA, NFS. Local NVMe SSDs for scratch. Must deliver data faster than GPUs consume it — storage bandwidth is a frequent bottleneck.
Management Network
Separate 1/10GbE out-of-band network for BMC/iDRAC access, provisioning, and monitoring. Must not share the data plane. NVIDIA Base Command Manager provides cluster provisioning, job scheduling, and monitoring.
Cooling and Power
High-density GPU nodes generate 5–10+ kW per server. Air or liquid cooling required depending on density. Power distribution must support high PDU density. Key operational consideration for on-premises clusters.
The Four Pillars (Exam Focus)
Every AI cluster requires all four: Compute (GPUs/nodes) + Storage (parallel filesystem) + Network (IB/RoCE fabric) + Management (BMC + Base Command). Miss one and you get a bottleneck.
Concept 5 — On-Premises vs Cloud GPU Infrastructure
On-Premises Advantages
- Data sovereignty — keep sensitive data internal
- Predictable long-term cost (no egress fees)
- Maximum performance customization
- No internet dependency for training runs
- NVIDIA DGX systems provide turnkey on-prem AI
On-Premises Challenges
- High upfront CapEx (DGX H100 ~$300K+)
- Long procurement lead times (weeks to months)
- Requires specialized staff (sysadmins, network engineers)
- Difficult to scale up/down rapidly
- Hardware becomes outdated over time
Cloud Advantages
- No upfront cost — pure OpEx model
- Instant scalability (spin up 1000 GPUs, release when done)
- Access to latest GPUs on day one
- Managed AI services (SageMaker, Vertex AI, Azure ML)
- Ideal for variable/bursty workloads
Cloud Challenges
- High sustained cost at scale (expensive per GPU-hour)
- Data egress fees and residency concerns
- Latency to cloud data centers
- Vendor lock-in risk
- Shared hardware (noisy neighbor effect)
Hybrid Approach
Steady-state training on-prem for predictable workloads; burst to cloud during peak demand; cloud for dev/test while on-prem handles production. NVIDIA DGX Cloud bridges on-prem and cloud environments.
Decision Factors
- Data sensitivity → on-prem
- Steady workload, long-term → on-prem (TCO advantage)
- Bursty/variable → cloud
- No CapEx budget → cloud
- No specialized staff → cloud managed services
Concept 6 — Storage Requirements for AI Infrastructure
Capacity Planning
- Training datasets can be petabytes
- Model checkpoints saved every N steps
- Logs and experiment artifacts accumulate
- Plan for 3–5× dataset size for working copies and augmentation
Bandwidth Requirements
GPUs consume data extremely fast — an 8-GPU training node can demand several GB/s of input data. Storage must deliver data faster than the GPUs can consume it. Parallel filesystems (Lustre, WEKA) provide aggregate bandwidth across many drives simultaneously.
Storage Tiers
- Hot: NVMe SSDs — active datasets and checkpoints
- Warm: SAS/SATA HDDs or NFS — datasets in use
- Cold: Object storage (S3, MinIO) — archival
- Training pipelines stream from the hot tier
NVIDIA Magnum IO
Suite of libraries and protocols for GPU-accelerated I/O. GPUDirect Storage (GDS) allows the GPU to read/write NVMe/InfiniBand storage directly, bypassing CPU memory copies entirely — dramatically improving storage-to-GPU bandwidth.
Checkpoint Strategy
Save model weights periodically during training to enable recovery from failure without full restart. Storage must support fast checkpoint writes (hundreds of GB in minutes). NVIDIA Base Command Manager handles checkpoint orchestration.
Object Storage for Datasets
S3-compatible storage (MinIO on-prem, AWS S3 in cloud) enables streaming datasets directly during training. NVIDIA DALI (Data Loading Library) provides GPU-accelerated data pipeline for maximum throughput.
Concept 7 — Cluster Networking: InfiniBand vs Ethernet for AI
InfiniBand
- Purpose-built HPC/AI networking
- Ultra-low latency (~1 microsecond)
- RDMA (zero-copy, kernel-bypass transfers)
- HDR: 200Gb/s per port | NDR: 400Gb/s per port
- NVIDIA acquired Mellanox — now NVIDIA Networking
RoCE (RDMA over Converged Ethernet)
- RDMA semantics over standard Ethernet
- Uses 100/200/400GbE switches
- Lower cost than InfiniBand infrastructure
- Slightly higher latency than IB
- Requires lossless Ethernet (Priority Flow Control)
- Increasingly viable for large-scale AI
Fat-Tree Topology
Standard for AI training clusters. Hierarchical leaf-spine topology whose links get "fatter" toward the root, so aggregate capacity at each higher tier matches the tier below — providing full bisection bandwidth between any two nodes and avoiding congestion during all-reduce operations. Implemented as 2-tier or 3-tier leaf-spine.
Rail-Optimized Networking
In 8-GPU nodes, each GPU connects to a different leaf switch ("rail"). Ensures traffic between GPU pairs routes through different switches. Maximizes bandwidth for all-to-all communication patterns. Prevents single-switch bottlenecks.
Storage Networking
Separate storage network (often InfiniBand or 25/100GbE Ethernet) for dataset access. GPUDirect Storage enables direct GPU-to-storage transfers over InfiniBand, eliminating CPU as bottleneck in the data path.
Bandwidth Planning
All-reduce volume per step = gradient size × 2 × (N-1)/N. For a 70B model at FP16: ~280GB per all-reduce step. InfiniBand NDR at 400Gb/s = 50 GB/s per link. Multiple links per node required for large-scale training.
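Plugging the numbers from this paragraph into the formula — a sketch of the ring all-reduce estimate (link counts and compute/communication overlap vary in practice):

```python
def allreduce_gb_per_gpu(params_billion, bytes_per_param=2, n_gpus=1024):
    grad_gb = params_billion * bytes_per_param       # FP16 gradient size
    return grad_gb * 2 * (n_gpus - 1) / n_gpus       # ring all-reduce traffic per GPU

volume = allreduce_gb_per_gpu(70)                    # ~280 GB per step for a 70B model
ndr_link_gbs = 400 / 8                               # NDR 400Gb/s = 50 GB/s per link
print(volume, "GB;", round(volume / ndr_link_gbs, 1), "s on a single NDR link")
# ~5.6 s of pure communication per step on one link -> why nodes carry multiple links
```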
Concept 8 — NVIDIA Reference Architectures: BasePOD and DGX POD
NVIDIA BasePOD
Validated reference architecture for AI/HPC clusters using NVIDIA H100 HGX servers. Includes compute, storage, networking, and management specifications. Vendor-agnostic (any OEM HGX server). Designed for 8–64 GPU nodes.
DGX BasePOD
Same reference architecture as BasePOD but specifically using DGX servers instead of OEM HGX-based servers. Full NVIDIA software stack, enterprise support, and single-vendor accountability. Turnkey AI infrastructure.
DGX POD
20× DGX H100 servers = 160 GPUs. Connected via QM9700 Quantum-2 InfiniBand switches (64× 400Gb/s NDR ports per switch). Fully characterized for LLM training workloads. NVIDIA's standard reference AI factory unit.
DGX SuperPOD
Multiple DGX PODs with thousands of GPUs and full fat-tree InfiniBand fabric. AI factory scale. Used by cloud providers and large enterprises for frontier model training. Requires NVIDIA Professional Services for design and deployment.
Scaling Guidance
- 1–8 GPUs → single DGX/HGX node
- 16–64 GPUs → small cluster (BasePOD)
- 160–1000 GPUs → DGX POD / SuperPOD
- 1000+ GPUs → custom cluster with NVIDIA Professional Services
HGX vs DGX — Key Distinction
Both include NVSwitch and SXM5 GPU slots with full NVLink bandwidth. Difference: DGX adds CPU board, chassis, PSU, validated OS/software stack, and NVIDIA enterprise support. Same intra-node GPU performance.
Training RAM Rule
Model size (GB) = parameters (billions) × 2 (FP16). Training overhead = 4–6× that.
70B model training ≈ 560–840GB GPU RAM. Minimum 8× H100 80GB.
Inference with INT4 = 70B × 0.5 = 35GB — one H200 handles it alone.
3 Parallelism Types — DTP Escalation
Data Parallel → replicate model, split data → single GPU fits model.
Tensor Parallel → split layers across GPUs, needs NVLink, within node.
Pipeline Parallel → split layer blocks across nodes, needs InfiniBand.
Scale from D → T → P as model grows beyond what one GPU, then one node, can hold.
On-Prem vs Cloud Decision
Sensitive data / steady workload / long-term cost → On-Prem.
Variable / bursty / start-up / no specialized staff → Cloud.
Best of both worlds → Hybrid (on-prem for steady state, cloud burst for peaks).
Cluster 4 Components
Every AI cluster needs all four pillars:
Compute (GPU nodes) + Storage (parallel filesystem) + Network (IB/RoCE fabric) + Management (BMC + Base Command Manager).
Miss one = guaranteed bottleneck and underutilized GPUs.
IB vs RoCE
InfiniBand: lowest latency (~1µs), purpose-built for HPC/AI, NVIDIA-owned (Mellanox), NDR = 400Gb/s.
RoCE: RDMA over Ethernet, cheaper switch infrastructure, slightly higher latency, needs lossless Ethernet.
Both support RDMA. IB = premium AI training choice.
SXM for Training, PCIe for Inference
SXM = NVLink-connected via NVSwitch = full intra-node bandwidth = best for training all-reduce.
PCIe = standard slot = flexible = good for inference deployment in commodity servers and cloud VMs.
H100 SXM5 for training. A10G/L4/L40S PCIe for inference.
70B parameter model — GPU memory needed for training vs inference?
Training (FP16): 140GB weights × 4–6 overhead ≈ 560–840GB → minimum 8× H100 80GB.
Inference (INT4): 70B × 0.5 bytes = 35GB → fits easily in 1× H200 141GB.
Tensor Parallelism vs Pipeline Parallelism — key differences?
Tensor Parallel: split weight matrices horizontally within each layer, across GPUs in a node; uses NVLink.
Pipeline Parallel: split layers vertically (layer groups) across nodes; uses InfiniBand.
Combine both = 3D Parallelism.
InfiniBand NDR speed + key property?
NDR = 400Gb/s per port.
Key property: RDMA — zero-copy, kernel-bypass data transfer.
Latency ≈ 1 microsecond. Essential for all-reduce in large-scale distributed training.
What is a DGX POD?
20× DGX H100 servers = 160 GPUs, connected via Quantum-2 InfiniBand switches.
Fully validated and characterized for LLM training workloads. NVIDIA's reference AI factory unit.
Why use gradient checkpointing?
Trade compute for memory: recompute activations on the backward pass instead of storing them.
Costs ~30% more compute but allows larger batch sizes or models that wouldn't otherwise fit in GPU memory.
GPUDirect Storage — what problem does it solve?
Without GDS: Storage → CPU RAM → GPU RAM (extra copy through CPU memory).
With GDS: Storage → GPU RAM directly (0 CPU copies).
Eliminates the CPU bottleneck and effectively doubles storage-to-GPU bandwidth for training data loading.
Rail-optimized networking — what is it?
Each GPU in an 8-GPU node connects to a different leaf switch ("rail").
Ensures traffic between specific GPU pairs routes through different switches. Maximizes all-to-all bandwidth and prevents any single switch from becoming a bottleneck.
HGX vs DGX — which has NVSwitch?
Both — HGX baseboards and DGX systems alike include NVSwitch and SXM5 slots with full NVLink bandwidth.
DGX adds: CPU board, chassis, PSU, validated software stack, and NVIDIA enterprise support. Same intra-node GPU-to-GPU bandwidth.
Beginners
- Start with the single-GPU setup; understand the model memory formula (parameters × 2 for FP16 = GB required) before anything else
- Learn why NVLink matters vs PCIe: NVLink ≈ 900 GB/s GPU-to-GPU on H100 vs PCIe Gen5 ≈ 64 GB/s — the difference defines why SXM is preferred for training
- Explore the NVIDIA DGX H100 product page to understand what "turnkey AI infrastructure" means in practice
- Focus on the four cluster pillars (Compute + Storage + Network + Management) — ensure you can name all four from memory
- Use the Memory Hooks tab to internalize the Training RAM Rule before moving to parallelism strategies
Official NVIDIA Resources
- NVIDIA DGX H100 Systems — official product page: specifications, memory, NVLink bandwidth, and use cases for the DGX H100 server platform
- NVIDIA HGX H100 — HGX baseboard platform: NVSwitch fabric, SXM5 GPU slots, OEM integration for custom AI servers
- NVIDIA BasePOD Reference Architecture — validated reference architecture for AI clusters: compute, storage, networking, and management specifications
- NVIDIA Magnum IO — GPU-accelerated I/O suite including GPUDirect Storage: eliminates CPU bottlenecks in the storage-to-GPU data path
- Megatron-LM — NVIDIA's open-source distributed training framework for tensor, pipeline, and 3D parallelism, used for large-scale LLM training
NCA-AIIO Official Certification
- NCA-AIIO Official Certification Page — NVIDIA's official exam page: blueprint, objectives, registration, and recommended study resources