What You'll Master
Training vs Inference Hardware
Why training demands maximum GPU memory and NVLink bandwidth (H100/H200 SXM), while inference prioritizes cost efficiency and latency (A10G, L4, L40S PCIe). Memory sizing formulas for both.
GPU Memory and Model Sizing
HBM (High Bandwidth Memory), the model memory formula (parameters × bytes), training overhead (4–6× inference memory), gradient checkpointing, and KV cache challenges for inference.
GPU Scaling Strategies
Single GPU → Data Parallelism → Tensor Parallelism (within node, NVLink) → Pipeline Parallelism (cross-node, InfiniBand) → 3D Parallelism for trillion-parameter models.
AI Cluster Architecture
Four pillars: compute nodes (DGX/HGX), high-speed interconnect fabric (InfiniBand/RoCE), parallel storage (Lustre/WEKA), and management network (BMC + Base Command Manager).
On-Premises vs Cloud
Data sovereignty, predictable cost (on-prem) vs instant scalability, zero CapEx (cloud). Hybrid approach: steady-state on-prem, burst to cloud. Key decision factors for each scenario.
NVIDIA Reference Architectures
BasePOD (validated reference, OEM-agnostic), DGX BasePOD, DGX POD (20× DGX H100 = 160 GPUs), DGX SuperPOD (thousands of GPUs, AI factory scale).
Exam Weight
| Domain | Coverage | Exam Questions (est.) |
|---|---|---|
| AI Infrastructure — Hardware & Systems (this page) | 40% | ~20 questions |
| Essential AI Knowledge | ~38% | ~19 questions |
| AI Operations & MLOps | ~22% | ~11 questions |
Total exam: 50 questions, 60 minutes, passing score ~70%.
Concept 1 — Hardware Requirements: Training vs Inference
Training Hardware Priorities
- Maximum GPU memory (to hold model weights and activations)
- High memory bandwidth (HBM3/HBM3e)
- High FP16/BF16/FP8 throughput (Tensor Cores)
- Fast GPU-to-GPU interconnect (NVLink)
- Fast node-to-node networking (InfiniBand)
Training-Optimized GPUs
- H100 SXM5 — 80GB HBM3; preferred for training clusters
- H200 SXM5 — 141GB HBM3e; larger models per node
- B200 — 192GB HBM3e; next-gen frontier training
- Always prefer SXM form factor for training (NVLink connectivity via NVSwitch)
Inference Hardware Priorities
- Low latency (time-to-first-token, tokens per second)
- High throughput (concurrent requests)
- Cost efficiency (maximize requests per dollar)
- Support for quantized models (INT8/FP8)
Inference-Suitable GPUs
- A10G (24GB, PCIe) — cloud inference workhorse
- L4 (24GB, PCIe) — energy-efficient inference
- L40S (48GB, PCIe) — large model inference
- H100 PCIe (80GB) — large model inference, commodity servers
- Quantization (INT8, FP8) reduces memory requirements significantly
Memory Requirement Estimation
Parameters × bytes per parameter: FP16 = 2 bytes, FP32 = 4 bytes, INT8 = 1 byte.
70B model in FP16 = 140GB for weights alone → needs multiple H100s (8× 80GB = 640GB available). Optimizer states add 2–4× during training.
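These rules of thumb are easy to encode. A minimal sketch in Python, assuming the 4–6× training multiplier from this section (5× used as a midpoint, not a measured value):

```python
# Back-of-the-envelope GPU memory estimator. All figures are decimal GB.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion: float, dtype: str = "fp16") -> float:
    """Weights alone: parameters (billions) x bytes per parameter = GB."""
    return params_billion * BYTES_PER_PARAM[dtype]

def training_memory_gb(params_billion: float, overhead: float = 5.0) -> float:
    """Rule of thumb from this section: training ~= 4-6x the FP16 weight memory."""
    return weight_memory_gb(params_billion, "fp16") * overhead

print(weight_memory_gb(70))           # 140.0 GB -> spans multiple H100 80GB GPUs
print(training_memory_gb(70))         # 700.0 GB -> at least one 8x H100 node (640GB is tight)
print(weight_memory_gb(70, "int4"))   # 35.0 GB  -> fits a single H200 141GB
```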
Batch Size Effect
Larger batch = better GPU utilization but more memory needed. Training: use the largest batch that fits. Inference: small or batch-size-1 for low latency; larger batches for throughput optimization.
Concept 2 — GPU Memory and Model Sizing
HBM (High Bandwidth Memory)
Stacked DRAM directly on GPU package — far higher bandwidth than GDDR. Key specs:
- H100 SXM5: 80GB HBM3 at 3.35 TB/s
- H200: 141GB HBM3e at 4.8 TB/s
- B200: 192GB HBM3e at 8 TB/s
Model Memory Formula
Parameters (billions) × 2 bytes (FP16) = GB for weights:
- 7B model = 14GB
- 13B model = 26GB
- 70B model = 140GB
- 405B model = 810GB
Training Overhead
Adam optimizer adds 2× more memory (momentum + variance states). Activations for backpropagation add memory proportional to batch size and model depth.
Total training memory ≈ 4–6× inference memory.
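One loose decomposition behind the 4–6× figure, consistent with the "Adam adds 2×" claim above (the activation term is an assumption — it grows with batch size and model depth):

```python
# All terms are multiples of the FP16 weight memory (assumed decomposition).
weights     = 1.0   # FP16 model weights
gradients   = 1.0   # FP16 gradients, same size as the weights
optimizer   = 2.0   # Adam momentum + variance states ("2x more", per this section)
activations = 1.0   # assumption: varies with batch size and model depth

multiplier = weights + gradients + optimizer + activations
print(multiplier)                      # ~5x -> inside the 4-6x range
print(multiplier * 140, "GB for 70B")  # ~700 GB of GPU memory during training
```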
Gradient Checkpointing
Trade compute for memory during training — recompute activations on the backward pass instead of storing them. Allows larger batch sizes or models at a cost of ~30% more compute. Key technique for memory-constrained training.
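A minimal PyTorch sketch of the technique, assuming torch is installed (`torch.utils.checkpoint` is the standard API for this):

```python
import torch
from torch.utils.checkpoint import checkpoint

# A transformer-style block whose intermediate activations we choose not to store.
block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)
x = torch.randn(8, 1024)

# Activations inside `block` are discarded after the forward pass and
# recomputed during backward -- trading extra compute for less memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```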
KV Cache (Inference)
Transformer inference requires storing key-value pairs for attention across the context window. Size grows with context length and batch size. Managing the KV cache is the primary inference memory challenge for long-context models.
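The cache size follows directly from the attention geometry. A sketch where the model dimensions are illustrative assumptions for a 70B-class transformer, not official specs:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per=2):
    # 2x for keys AND values, stored per layer, per token, per KV head (FP16 = 2 bytes)
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per / 1e9

# Grouped-query attention (8 KV heads) vs full multi-head attention (64 heads):
print(kv_cache_gb(80, 8, 128, seq_len=4096, batch=1))    # ~1.3 GB
print(kv_cache_gb(80, 64, 128, seq_len=32768, batch=8))  # ~687 GB -- the long-context problem
```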
Practical Sizing Examples
- 70B training (FP16): 140GB weights + optimizer + activations ≈ 560–840GB → minimum 8× H100 80GB (one DGX node)
- 70B with tensor parallelism: split across 8 GPUs
- 70B inference (INT4): 70B × 0.5 bytes = 35GB → fits in 1× H200 141GB
Concept 3 — GPU Scaling Strategies
Single GPU
Baseline; simplest deployment. Limited by GPU memory. Fine for models <30B parameters in FP16, or smaller models with quantization. Suitable for inference, small model development, and experimentation.
Data Parallelism (DP)
Same model replicated across all GPUs; each processes a different data batch. Gradients averaged via all-reduce (NCCL). Scales training throughput linearly with GPU count. Limitation: full model must fit in one GPU.
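A minimal sketch of the pattern with PyTorch DDP (assumes torch built with NCCL, launched as one process per GPU via `torchrun --nproc_per_node=8 train.py`):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                  # one process per GPU
rank = int(os.environ["LOCAL_RANK"])             # set by torchrun
torch.cuda.set_device(rank)

model = torch.nn.Linear(1024, 1024).cuda(rank)
model = DDP(model, device_ids=[rank])            # full model replica on every GPU

x = torch.randn(32, 1024, device=rank)           # each rank sees a different batch
loss = model(x).sum()
loss.backward()                                  # gradients all-reduced via NCCL here
dist.destroy_process_group()
```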
Tensor Parallelism (TP)
Split individual layers (weight matrices) across GPUs — each holds a shard of every layer. Requires high-bandwidth intra-node interconnect (NVLink). Best within a single node (up to 8 GPUs). Megatron-LM pioneered this approach.
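A conceptual sketch of the sharding math — a column-parallel matmul split across two "GPUs" (plain CPU tensors here, just to show why the shards recombine into the full result):

```python
import torch

x = torch.randn(4, 1024)            # activations, replicated on every shard
W = torch.randn(1024, 4096)         # full weight matrix no single GPU would hold

W0, W1 = W.chunk(2, dim=1)          # each device stores half the columns
y0 = x @ W0                         # computed on device 0
y1 = x @ W1                         # computed on device 1
y = torch.cat([y0, y1], dim=1)      # all-gather over NVLink in a real system

assert torch.allclose(y, x @ W)     # sharded result matches the full matmul
```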
Pipeline Parallelism (PP)
Split model layers into stages across nodes — each node handles a contiguous set of layers. Overlaps forward/backward passes to reduce "pipeline bubble" overhead. Enables models too large for tensor parallelism alone. Uses InfiniBand between nodes.
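A toy single-process sketch of the stage split and micro-batching idea (real frameworks run the stages on different nodes and overlap them; this only shows the schedule):

```python
import torch

stage1 = torch.nn.Linear(256, 256)    # layers 0..N/2 -- lives on node A in reality
stage2 = torch.nn.Linear(256, 256)    # layers N/2..N -- lives on node B (InfiniBand hop)

batch = torch.randn(16, 256)
outputs = []
for micro in batch.chunk(4):          # micro-batches shrink the pipeline bubble
    hidden = stage1(micro)            # node A forward on micro-batch i
    outputs.append(stage2(hidden))    # activations handed to node B, which works
                                      # on micro-batch i while node A starts i+1
result = torch.cat(outputs)
```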
3D Parallelism
Combine all three strategies: TP within node (NVLink) + PP across nodes (InfiniBand) + DP across pipeline replicas. Used by Megatron-DeepSpeed for trillion-parameter model training.
Parallelism Selection Guide
- Model fits one GPU → Data Parallel
- Too large for one GPU → Tensor Parallel within node
- Too large for one node → Pipeline Parallel + Tensor Parallel across nodes
- Largest models → 3D Parallelism
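The escalation above is mechanical enough to encode. A sketch where the 80GB/8-GPU thresholds are assumptions matching the H100 nodes discussed on this page, not hard limits:

```python
def pick_parallelism(model_gb: float, gpu_gb: float = 80, gpus_per_node: int = 8) -> str:
    """Map FP16 model size to a parallelism strategy (rules of thumb only)."""
    if model_gb <= gpu_gb:
        return "Data Parallel: replicate the model, split the batches"
    if model_gb <= gpu_gb * gpus_per_node:
        return "Tensor Parallel within a node (NVLink)"
    return "Pipeline + Tensor Parallel across nodes (InfiniBand), or full 3D"

print(pick_parallelism(14))    # 7B FP16   -> Data Parallel
print(pick_parallelism(140))   # 70B FP16  -> Tensor Parallel
print(pick_parallelism(810))   # 405B FP16 -> Pipeline + Tensor / 3D
```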
Concept 4 — AI Cluster Architecture and Components
Compute Nodes
GPU servers (DGX, HGX-based, or custom). Each node typically houses 8 GPUs + 2 CPUs + NVSwitch/NVLink fabric. One DGX H100 = 640GB total GPU memory (8× 80GB). The fundamental building block of any AI cluster.
High-Speed Interconnect Fabric
InfiniBand (NDR 400Gb/s or HDR 200Gb/s) or RoCE (RDMA over Converged Ethernet) connects nodes for gradient all-reduce during distributed training. Latency and bandwidth directly determine scaling efficiency.
Storage Subsystem
High-performance parallel storage for datasets and checkpoints: Lustre, GPFS, WEKA, NFS. Local NVMe SSDs for scratch. Must deliver data faster than GPUs consume it — storage bandwidth is a frequent bottleneck.
Management Network
Separate 1/10GbE out-of-band network for BMC/iDRAC access, provisioning, and monitoring. Must not share the data plane. NVIDIA Base Command Manager provides cluster provisioning, job scheduling, and monitoring.
Cooling and Power
High-density GPU nodes generate 5–10+ kW per server. Air or liquid cooling required depending on density. Power distribution must support high PDU density. Key operational consideration for on-premises clusters.
The Four Pillars (Exam Focus)
Every AI cluster requires all four: Compute (GPUs/nodes) + Storage (parallel filesystem) + Network (IB/RoCE fabric) + Management (BMC + Base Command). Miss one and you get a bottleneck.
Concept 5 — On-Premises vs Cloud GPU Infrastructure
On-Premises Advantages
- Data sovereignty — keep sensitive data internal
- Predictable long-term cost (no egress fees)
- Maximum performance customization
- No internet dependency for training runs
- NVIDIA DGX systems provide turnkey on-prem AI
On-Premises Challenges
- High upfront CapEx (DGX H100 ~$300K+)
- Long procurement lead times (weeks to months)
- Requires specialized staff (sysadmins, network engineers)
- Difficult to scale up/down rapidly
- Hardware becomes outdated over time
Cloud Advantages
- No upfront cost — pure OpEx model
- Instant scalability (spin up 1000 GPUs, release when done)
- Access to latest GPUs on day one
- Managed AI services (SageMaker, Vertex AI, Azure ML)
- Ideal for variable/bursty workloads
Cloud Challenges
- High sustained cost at scale (expensive per GPU-hour)
- Data egress fees and residency concerns
- Latency to cloud data centers
- Vendor lock-in risk
- Shared hardware (noisy neighbor effect)
Hybrid Approach
Steady-state training on-prem for predictable workloads; burst to cloud during peak demand; cloud for dev/test while on-prem handles production. NVIDIA DGX Cloud bridges on-prem and cloud environments.
Decision Factors
- Data sensitivity → on-prem
- Steady workload, long-term → on-prem (TCO advantage)
- Bursty/variable → cloud
- No CapEx budget → cloud
- No specialized staff → cloud managed services
Concept 6 — Storage Requirements for AI Infrastructure
Capacity Planning
- Training datasets can be petabytes
- Model checkpoints saved every N steps
- Logs and experiment artifacts accumulate
- Plan for 3–5× dataset size for working copies and augmentation
Bandwidth Requirements
GPUs consume data extremely fast — an 8-GPU training node can demand several GB/s of input data. Storage must deliver data faster than the GPUs can consume it. Parallel filesystems (Lustre, WEKA) provide aggregate bandwidth across many drives simultaneously.
Storage Tiers
- Hot: NVMe SSDs — active datasets and checkpoints
- Warm: SAS/SATA HDDs or NFS — datasets in use
- Cold: Object storage (S3, MinIO) — archival
- Training pipelines stream from the hot tier
NVIDIA Magnum IO
Suite of libraries and protocols for GPU-accelerated I/O. GPUDirect Storage (GDS) allows the GPU to read/write NVMe/InfiniBand storage directly, bypassing CPU memory copies entirely — dramatically improving storage-to-GPU bandwidth.
Checkpoint Strategy
Save model weights periodically during training to enable recovery from failure without full restart. Storage must support fast checkpoint writes (hundreds of GB in minutes). NVIDIA Base Command Manager handles checkpoint orchestration.
Object Storage for Datasets
S3-compatible storage (MinIO on-prem, AWS S3 in cloud) enables streaming datasets directly during training. NVIDIA DALI (Data Loading Library) provides GPU-accelerated data pipeline for maximum throughput.
Concept 7 — Cluster Networking: InfiniBand vs Ethernet for AI
InfiniBand
- Purpose-built HPC/AI networking
- Ultra-low latency (~1 microsecond)
- RDMA (zero-copy, kernel-bypass transfers)
- HDR: 200Gb/s per port | NDR: 400Gb/s per port
- NVIDIA acquired Mellanox — now NVIDIA Networking
RoCE (RDMA over Converged Ethernet)
- RDMA semantics over standard Ethernet
- Uses 100/200/400GbE switches
- Lower cost than InfiniBand infrastructure
- Slightly higher latency than IB
- Requires lossless Ethernet (Priority Flow Control)
- Increasingly viable for large-scale AI
Fat-Tree Topology
Standard for AI training clusters. Hierarchical leaf-spine topology whose links get "fatter" toward the root, so aggregate capacity at each higher tier matches the tier below — providing full bisection bandwidth between any two nodes and avoiding congestion during all-reduce operations. Implemented as 2-tier or 3-tier leaf-spine.
Rail-Optimized Networking
In 8-GPU nodes, each GPU connects to a different leaf switch ("rail"). Ensures traffic between GPU pairs routes through different switches. Maximizes bandwidth for all-to-all communication patterns. Prevents single-switch bottlenecks.
Storage Networking
Separate storage network (often InfiniBand or 25/100GbE Ethernet) for dataset access. GPUDirect Storage enables direct GPU-to-storage transfers over InfiniBand, eliminating CPU as bottleneck in the data path.
Bandwidth Planning
All-reduce volume per step = gradient size × 2 × (N-1)/N. For a 70B model at FP16: ~280GB per all-reduce step. InfiniBand NDR at 400Gb/s = 50 GB/s per link. Multiple links per node required for large-scale training.
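Plugging the numbers from this paragraph into the formula — a sketch of the ring all-reduce estimate (link counts and compute/communication overlap vary in practice):

```python
def allreduce_gb_per_gpu(params_billion, bytes_per_param=2, n_gpus=1024):
    grad_gb = params_billion * bytes_per_param       # FP16 gradient size
    return grad_gb * 2 * (n_gpus - 1) / n_gpus       # ring all-reduce traffic per GPU

volume = allreduce_gb_per_gpu(70)                    # ~280 GB per step for a 70B model
ndr_link_gbs = 400 / 8                               # NDR 400Gb/s = 50 GB/s per link
print(volume, "GB;", round(volume / ndr_link_gbs, 1), "s on a single NDR link")
# ~5.6 s of pure communication per step on one link -> why nodes carry multiple links
```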
Concept 8 — NVIDIA Reference Architectures: BasePOD and DGX POD
NVIDIA BasePOD
Validated reference architecture for AI/HPC clusters using NVIDIA H100 HGX servers. Includes compute, storage, networking, and management specifications. Vendor-agnostic (any OEM HGX server). Designed for 8–64 GPU nodes.
DGX BasePOD
Same reference architecture as BasePOD but specifically using DGX servers instead of OEM HGX-based servers. Full NVIDIA software stack, enterprise support, and single-vendor accountability. Turnkey AI infrastructure.
DGX POD
20× DGX H100 servers = 160 GPUs. Connected via QM9700 Quantum-2 InfiniBand switches (64× 400Gb/s NDR ports per switch). Fully characterized for LLM training workloads. NVIDIA's standard reference AI factory unit.
DGX SuperPOD
Multiple DGX PODs with thousands of GPUs and full fat-tree InfiniBand fabric. AI factory scale. Used by cloud providers and large enterprises for frontier model training. Requires NVIDIA Professional Services for design and deployment.
Scaling Guidance
- 1–8 GPUs → single DGX/HGX node
- 16–64 GPUs → small cluster (BasePOD)
- 160–1000 GPUs → DGX POD / SuperPOD
- 1000+ GPUs → custom cluster with NVIDIA Professional Services
HGX vs DGX — Key Distinction
Both include NVSwitch and SXM5 GPU slots with full NVLink bandwidth. Difference: DGX adds CPU board, chassis, PSU, validated OS/software stack, and NVIDIA enterprise support. Same intra-node GPU performance.
Training RAM Rule
Model size (GB) = parameters (billions) × 2 (FP16). Training overhead = 4–6× that.
70B model training ≈ 560–840GB GPU RAM. Minimum 8× H100 80GB.
Inference with INT4 = 70B × 0.5 = 35GB — one H200 handles it alone.
3 Parallelism Types — DTP Escalation
Data Parallel → replicate model, split data → single GPU fits model.
Tensor Parallel → split layers across GPUs, needs NVLink, within node.
Pipeline Parallel → split layer blocks across nodes, needs InfiniBand.
Scale from D → T → P as model grows beyond what one GPU, then one node, can hold.
On-Prem vs Cloud Decision
Sensitive data / steady workload / long-term cost → On-Prem.
Variable / bursty / start-up / no specialized staff → Cloud.
Best of both worlds → Hybrid (on-prem for steady state, cloud burst for peaks).
Cluster 4 Components
Every AI cluster needs all four pillars:
Compute (GPU nodes) + Storage (parallel filesystem) + Network (IB/RoCE fabric) + Management (BMC + Base Command Manager).
Miss one = guaranteed bottleneck and underutilized GPUs.
IB vs RoCE
InfiniBand: lowest latency (~1µs), purpose-built for HPC/AI, NVIDIA-owned (Mellanox), NDR = 400Gb/s.
RoCE: RDMA over Ethernet, cheaper switch infrastructure, slightly higher latency, needs lossless Ethernet.
Both support RDMA. IB = premium AI training choice.
SXM for Training, PCIe for Inference
SXM = NVLink-connected via NVSwitch = full intra-node bandwidth = best for training all-reduce.
PCIe = standard slot = flexible = good for inference deployment in commodity servers and cloud VMs.
H100 SXM5 for training. A10G/L4/L40S PCIe for inference.
70B parameter model — GPU memory needed for training vs inference?
Training (FP16): 140GB weights × 4–6 overhead ≈ 560–840GB → minimum 8× H100 80GB.
Inference (INT4): 70B × 0.5 bytes = 35GB → fits easily in 1× H200 141GB.
Tensor Parallelism vs Pipeline Parallelism — key differences?
Tensor Parallel: split weight matrices horizontally within each layer, across GPUs in a node; uses NVLink.
Pipeline Parallel: split layers vertically (layer groups) across nodes; uses InfiniBand.
Combine both = 3D Parallelism.
InfiniBand NDR speed + key property?
NDR = 400Gb/s per port.
Key property: RDMA — zero-copy, kernel-bypass data transfer.
Latency ≈ 1 microsecond. Essential for all-reduce in large-scale distributed training.
What is a DGX POD?
20× DGX H100 servers = 160 GPUs, connected via Quantum-2 InfiniBand switches.
Fully validated and characterized for LLM training workloads. NVIDIA's reference AI factory unit.
Why use gradient checkpointing?
Trade compute for memory: recompute activations on the backward pass instead of storing them.
Costs ~30% more compute but allows larger batch sizes or models that wouldn't otherwise fit in GPU memory.
GPUDirect Storage — what problem does it solve?
Without GDS: Storage → CPU RAM → GPU RAM (extra copy through CPU memory).
With GDS: Storage → GPU RAM directly (0 CPU copies).
Eliminates the CPU bottleneck and effectively doubles storage-to-GPU bandwidth for training data loading.
Rail-optimized networking — what is it?
Each GPU in an 8-GPU node connects to a different leaf switch ("rail").
Ensures traffic between specific GPU pairs routes through different switches. Maximizes all-to-all bandwidth and prevents any single switch from becoming a bottleneck.
HGX vs DGX — which has NVSwitch?
Both — HGX baseboards and DGX systems alike include NVSwitch and SXM5 slots with full NVLink bandwidth.
DGX adds: CPU board, chassis, PSU, validated software stack, and NVIDIA enterprise support. Same intra-node GPU-to-GPU bandwidth.
Beginners
- Start with the single-GPU setup; understand the model memory formula (parameters × 2 for FP16 = GB required) before anything else
- Learn why NVLink matters vs PCIe: NVLink ≈ 900 GB/s GPU-to-GPU on H100 vs PCIe Gen5 ≈ 64 GB/s — the difference defines why SXM is preferred for training
- Explore the NVIDIA DGX H100 product page to understand what "turnkey AI infrastructure" means in practice
- Focus on the four cluster pillars (Compute + Storage + Network + Management) — ensure you can name all four from memory
- Use the Memory Hooks tab to internalize the Training RAM Rule before moving to parallelism strategies
Official NVIDIA Resources
- NVIDIA DGX H100 Systems — official product page: specifications, memory, NVLink bandwidth, and use cases for the DGX H100 server platform
- NVIDIA HGX H100 — HGX baseboard platform: NVSwitch fabric, SXM5 GPU slots, OEM integration for custom AI servers
- NVIDIA BasePOD Reference Architecture — validated reference architecture for AI clusters: compute, storage, networking, and management specifications
- NVIDIA Magnum IO — GPU-accelerated I/O suite including GPUDirect Storage: eliminates CPU bottlenecks in the storage-to-GPU data path
- Megatron-LM — NVIDIA's open-source distributed training framework for tensor, pipeline, and 3D parallelism, used for large-scale LLM training
NCA-AIIO Official Certification
- NCA-AIIO Official Certification Page — NVIDIA's official exam page: blueprint, objectives, registration, and recommended study resources