
AI Infrastructure: Hardware & Systems

NCA-AIIO · AI Infrastructure · 40% of Exam

GPU Memory · Scaling Strategies · DGX/HGX · NVLink · NVSwitch · InfiniBand · Cluster Architecture

GPU Memory (HBM) · NVLink · NVSwitch · InfiniBand · Data Parallelism · Tensor Parallelism · Pipeline Parallelism · HGX · DGX · NVIDIA BasePOD · Training Cluster · Inference Cluster · On-Prem vs Cloud
AI Infrastructure is the highest-weight domain on the NCA-AIIO exam — 40% of all questions. This page covers hardware sizing for training vs inference, GPU parallelism strategies, cluster architecture, storage and networking requirements, on-premises vs cloud trade-offs, and NVIDIA's reference architectures (BasePOD, DGX POD, SuperPOD). Mastering these concepts is essential for exam success.

What You'll Master

Training vs Inference Hardware

Why training demands maximum GPU memory and NVLink bandwidth (H100/H200 SXM), while inference prioritizes cost efficiency and latency (A10G, L4, L40S PCIe). Memory sizing formulas for both.

GPU Memory and Model Sizing

HBM (High Bandwidth Memory), the model memory formula (parameters × bytes), training overhead (4–6× inference memory), gradient checkpointing, and KV cache challenges for inference.

GPU Scaling Strategies

Single GPU → Data Parallelism → Tensor Parallelism (within node, NVLink) → Pipeline Parallelism (cross-node, InfiniBand) → 3D Parallelism for trillion-parameter models.

AI Cluster Architecture

Four pillars: compute nodes (DGX/HGX), high-speed interconnect fabric (InfiniBand/RoCE), parallel storage (Lustre/WEKA), and management network (BMC + Base Command Manager).

On-Premises vs Cloud

Data sovereignty, predictable cost (on-prem) vs instant scalability, zero CapEx (cloud). Hybrid approach: steady-state on-prem, burst to cloud. Key decision factors for each scenario.

NVIDIA Reference Architectures

BasePOD (validated reference, OEM-agnostic), DGX BasePOD, DGX POD (20× DGX H100 = 160 GPUs), DGX SuperPOD (thousands of GPUs, AI factory scale).

Exam Weight

Domain | Coverage | Exam Questions (est.)
AI Infrastructure — Hardware & Systems (this page) | 40% | ~20 questions
Essential AI Knowledge | ~38% | ~19 questions
AI Operations & MLOps | ~22% | ~11 questions

Total exam: 50 questions, 60 minutes, passing score ~70%

Concept 1 — Hardware Requirements: Training vs Inference

Training Hardware Priorities

  • Maximum GPU memory (to hold model weights and activations)
  • High memory bandwidth (HBM3/HBM3e)
  • High FP16/BF16/FP8 throughput (Tensor Cores)
  • Fast GPU-to-GPU interconnect (NVLink)
  • Fast node-to-node networking (InfiniBand)

Training-Optimized GPUs

  • H100 SXM5 — 80GB HBM3; preferred for training clusters
  • H200 SXM5 — 141GB HBM3e; larger models per node
  • B200 — 192GB HBM3e; next-gen frontier training
  • Always prefer SXM form factor for training (NVLink connectivity via NVSwitch)

Inference Hardware Priorities

  • Low latency (time-to-first-token, tokens per second)
  • High throughput (concurrent requests)
  • Cost efficiency (maximize requests per dollar)
  • Support for quantized models (INT8/FP8)

Inference-Suitable GPUs

  • A10G (24GB, PCIe) — cloud inference workhorse
  • L4 (24GB, PCIe) — energy-efficient inference
  • L40S (48GB, PCIe) — large model inference
  • H100 PCIe (80GB) — large model inference, commodity servers
  • Quantization (INT8, FP8) reduces memory requirements significantly

Memory Requirement Estimation

Parameters × bytes per parameter: FP16 = 2B, FP32 = 4B, INT8 = 1B.

70B model in FP16 = 140GB for weights alone → needs multiple H100s (8× 80GB = 640GB available). Optimizer states add 2–4× during training.
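As a sanity check, the sizing arithmetic above can be written as a short script; the bytes-per-parameter values are the ones given in this section, and the 80GB-per-GPU figure assumes H100-class cards.

```python
import math

# Bytes per parameter for common precisions (as listed above).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(params_billion: float, precision: str = "fp16") -> float:
    """GB needed just to hold the model weights."""
    return params_billion * BYTES_PER_PARAM[precision]

def gpus_for_weights(total_gb: float, gpu_memory_gb: float = 80.0) -> int:
    """Minimum GPU count to hold that many GB, ignoring all overhead."""
    return math.ceil(total_gb / gpu_memory_gb)

weights = weight_memory_gb(70, "fp16")     # 140 GB for a 70B model
print(weights, gpus_for_weights(weights))  # 140.0 GB -> 2 GPUs for weights alone
# Training adds optimizer states and activations: budget roughly 4-6x this figure.
```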

Batch Size Effect

Larger batch = better GPU utilization but more memory needed. Training: use the largest batch that fits. Inference: small or batch-size-1 for low latency; larger batches for throughput optimization.

Concept 2 — GPU Memory and Model Sizing

HBM (High Bandwidth Memory)

Stacked DRAM directly on GPU package — far higher bandwidth than GDDR. Key specs:

  • H100 SXM5: 80GB HBM3 at 3.35 TB/s
  • H200: 141GB HBM3e at 4.8 TB/s
  • B200: 192GB HBM3e at 8 TB/s

Model Memory Formula

Parameters (billions) × 2 bytes (FP16) = GB for weights:

  • 7B model = 14GB
  • 13B model = 26GB
  • 70B model = 140GB
  • 405B model = 810GB

Training Overhead

Adam optimizer adds 2× more memory (momentum + variance states). Activations for backpropagation add memory proportional to batch size and model depth.

Total training memory ≈ 4–6× inference memory.

Gradient Checkpointing

Trade compute for memory during training — recompute activations on the backward pass instead of storing them. Allows larger batch sizes or models at a cost of ~30% more compute. Key technique for memory-constrained training.
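A minimal PyTorch sketch of the trade-off, assuming a recent PyTorch release; the block shape and batch size are placeholders.

```python
import torch
from torch.utils.checkpoint import checkpoint

# A placeholder block whose activations we choose not to store.
block = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(),
    torch.nn.Linear(4096, 4096), torch.nn.GELU(),
)

x = torch.randn(8, 4096, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # forward pass without saving intermediates
y.sum().backward()                             # activations are recomputed here instead
```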

KV Cache (Inference)

Transformer inference requires storing key-value pairs for attention across the context window. Size grows with context length and batch size. Managing the KV cache is the primary inference memory challenge for long-context models.
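The cache size follows directly from the attention shapes. The numbers below assume a Llama-2-70B-like configuration (80 layers, 8 KV heads via grouped-query attention, head dimension 128) purely for illustration.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, batch_size, bytes_per_elem=2):
    """Keys + values stored for every layer, head, and token in the context window."""
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len * batch_size
    return elems * bytes_per_elem / 1e9

# Assumed 70B-class model shape, 8K context, batch of 16 concurrent requests, FP16 cache:
print(kv_cache_gb(80, 8, 128, context_len=8192, batch_size=16))  # ~43 GB on top of weights
```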

Practical Sizing Examples

  • 70B training (FP16): 140GB weights + optimizer states + activations ≈ 560–840GB → minimum 8× H100 80GB (one full node)
  • 70B with tensor parallelism: split across 8 GPUs
  • 70B inference (INT4): 70B × 0.5B = 35GB → fits in 1× H200 141GB

Concept 3 — GPU Scaling Strategies

Single GPU

Baseline; simplest deployment. Limited by GPU memory. Fine for models <30B parameters in FP16, or smaller models with quantization. Suitable for inference, small model development, and experimentation.

Data Parallelism (DP)

Same model replicated across all GPUs; each processes a different data batch. Gradients averaged via all-reduce (NCCL). Scales training throughput linearly with GPU count. Limitation: full model must fit in one GPU.
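A minimal data-parallel training sketch with PyTorch DistributedDataParallel (DDP), assuming the script is launched with torchrun so one process runs per GPU; the model, data, and hyperparameters are placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")        # NCCL performs the gradient all-reduce
local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(4096, 4096).cuda()     # placeholder model; must fit on one GPU
model = DDP(model, device_ids=[local_rank])    # replicate the model, average gradients

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for step in range(10):
    x = torch.randn(32, 4096, device="cuda")   # each rank would read a different data shard
    loss = model(x).pow(2).mean()
    loss.backward()                            # gradient all-reduce happens here
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()
```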

Tensor Parallelism (TP)

Split individual layers (weight matrices) across GPUs — each holds a shard of every layer. Requires high-bandwidth intra-node interconnect (NVLink). Best within a single node (up to 8 GPUs). Megatron-LM pioneered this approach.
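A toy illustration of the idea using CPU tensors (not Megatron-LM code): one linear layer's weight matrix is split column-wise across two shards and the partial outputs are reassembled, which is what happens over NVLink in a real tensor-parallel setup.

```python
import torch

x = torch.randn(4, 1024)            # activations, replicated on every shard
w = torch.randn(1024, 4096)         # full weight matrix of a single layer

w0, w1 = w.chunk(2, dim=1)          # each "GPU" holds half of the columns
y0 = x @ w0                         # computed on GPU 0 in a real deployment
y1 = x @ w1                         # computed on GPU 1
y = torch.cat([y0, y1], dim=1)      # all-gather reassembles the full output

assert torch.allclose(y, x @ w, atol=1e-4)
```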

Pipeline Parallelism (PP)

Split model layers into stages across nodes — each node handles a contiguous set of layers. Overlaps forward/backward passes to reduce "pipeline bubble" overhead. Enables models too large for tensor parallelism alone. Uses InfiniBand between nodes.

3D Parallelism

Combine all three strategies: TP within node (NVLink) + PP across nodes (InfiniBand) + DP across pipeline replicas. Used by Megatron-DeepSpeed for trillion-parameter model training.
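A quick back-of-envelope showing how the three degrees multiply into a total GPU budget; the specific degrees below are illustrative, not a recommended configuration.

```python
# Total GPUs = tensor-parallel degree x pipeline-parallel degree x data-parallel degree.
tensor_parallel = 8       # within one node, over NVLink
pipeline_parallel = 16    # 16 pipeline stages spread across nodes, over InfiniBand
data_parallel = 8         # 8 replicas of the whole pipeline

total_gpus = tensor_parallel * pipeline_parallel * data_parallel
nodes = total_gpus // 8   # assuming 8-GPU nodes
print(total_gpus, nodes)  # 1024 GPUs across 128 nodes
```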

Parallelism Selection Guide

  • Model fits one GPU → Data Parallel
  • Too large for one GPU → Tensor Parallel within node
  • Too large for one node → Pipeline Parallel + Tensor Parallel across nodes
  • Largest models → 3D Parallelism

Concept 4 — AI Cluster Architecture and Components

Compute Nodes

GPU servers (DGX, HGX-based, or custom). Each node typically houses 8 GPUs + 2 CPUs + NVSwitch/NVLink fabric. One DGX H100 = 640GB total GPU memory (8× 80GB). The fundamental building block of any AI cluster.

High-Speed Interconnect Fabric

InfiniBand (NDR 400Gb/s or HDR 200Gb/s) or RoCE (RDMA over Converged Ethernet) connects nodes for gradient all-reduce during distributed training. Latency and bandwidth directly determine scaling efficiency.

Storage Subsystem

High-performance parallel storage for datasets and checkpoints: Lustre, GPFS, WEKA, NFS. Local NVMe SSDs for scratch. Must deliver data faster than GPUs consume it — storage bandwidth is a frequent bottleneck.

Management Network

Separate 1/10GbE out-of-band network for BMC/iDRAC access, provisioning, and monitoring. Must not share the data plane. NVIDIA Base Command Manager provides cluster provisioning, job scheduling, and monitoring.

Cooling and Power

High-density GPU nodes generate 5–10+ kW per server. Air or liquid cooling required depending on density. Power distribution must support high PDU density. Key operational consideration for on-premises clusters.

The Four Pillars (Exam Focus)

Every AI cluster requires all four: Compute (GPUs/nodes) + Storage (parallel filesystem) + Network (IB/RoCE fabric) + Management (BMC + Base Command). Miss one and you get a bottleneck.

Concept 5 — On-Premises vs Cloud GPU Infrastructure

On-Premises Advantages

  • Data sovereignty — keep sensitive data internal
  • Predictable long-term cost (no egress fees)
  • Maximum performance customization
  • No internet dependency for training runs
  • NVIDIA DGX systems provide turnkey on-prem AI

On-Premises Challenges

  • High upfront CapEx (DGX H100 ~$300K+)
  • Long procurement lead times (weeks to months)
  • Requires specialized staff (sysadmins, network engineers)
  • Difficult to scale up/down rapidly
  • Hardware becomes outdated over time

Cloud Advantages

  • No upfront cost — pure OpEx model
  • Instant scalability (spin up 1000 GPUs, release when done)
  • Access to latest GPUs on day one
  • Managed AI services (SageMaker, Vertex AI, Azure ML)
  • Ideal for variable/bursty workloads

Cloud Challenges

  • High sustained cost at scale (expensive per GPU-hour)
  • Data egress fees and residency concerns
  • Latency to cloud data centers
  • Vendor lock-in risk
  • Shared hardware (noisy neighbor effect)

Hybrid Approach

Steady-state training on-prem for predictable workloads; burst to cloud during peak demand; cloud for dev/test while on-prem handles production. NVIDIA DGX Cloud bridges on-prem and cloud environments.
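To make the steady-state trade-off concrete, here is a rough 3-year TCO sketch for a single 8-GPU node running 24/7; every price is an assumption for illustration only (the ~$300K DGX figure is the ballpark quoted earlier on this page).

```python
# Illustrative 3-year TCO for one 8-GPU node running 24/7 (all prices are assumptions).
years = 3
hours = years * 365 * 24

onprem_capex = 300_000              # DGX H100 ballpark from this page
onprem_opex_per_year = 60_000       # assumed power, cooling, space, support
onprem_total = onprem_capex + onprem_opex_per_year * years

cloud_rate_per_gpu_hour = 4.0       # assumed on-demand H100 price
cloud_total = 8 * cloud_rate_per_gpu_hour * hours

print(onprem_total, cloud_total)    # ~$480K on-prem vs ~$840K cloud for a steady workload
```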

Decision Factors

  • Data sensitivity → on-prem
  • Steady workload, long-term → on-prem (TCO advantage)
  • Bursty/variable → cloud
  • No CapEx budget → cloud
  • No specialized staff → cloud managed services

Concept 6 — Storage Requirements for AI Infrastructure

Capacity Planning

  • Training datasets can be petabytes
  • Model checkpoints saved every N steps
  • Logs and experiment artifacts accumulate
  • Plan for 3–5× dataset size for working copies and augmentation

Bandwidth Requirements

GPUs consume data extremely fast: a single H100 can process thousands of training samples per second, and a multi-node cluster can process millions. Storage must deliver data faster than GPUs can consume it. Parallel filesystems (Lustre, WEKA) provide aggregate bandwidth across many drives simultaneously.
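A rough way to budget storage throughput is samples per second per GPU × sample size × GPU count; the numbers below are assumptions for illustration.

```python
# Rough storage bandwidth budget for a training cluster (all inputs are assumptions).
gpus = 64
samples_per_sec_per_gpu = 2000      # assumed per-GPU throughput for a vision model
sample_size_mb = 0.15               # assumed average compressed image size

required_gb_per_s = gpus * samples_per_sec_per_gpu * sample_size_mb / 1000
print(f"~{required_gb_per_s:.1f} GB/s sustained reads from the hot tier")  # ~19.2 GB/s
```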

Storage Tiers

  • Hot: NVMe SSDs — active datasets and checkpoints
  • Warm: SAS/SATA HDDs or NFS — datasets in use
  • Cold: Object storage (S3, MinIO) — archival
  • Training pipelines stream from the hot tier

NVIDIA Magnum IO

Suite of libraries and protocols for GPU-accelerated I/O. GPUDirect Storage (GDS) allows the GPU to read/write NVMe/InfiniBand storage directly, bypassing CPU memory copies entirely — dramatically improving storage-to-GPU bandwidth.

Checkpoint Strategy

Save model weights periodically during training to enable recovery from failure without a full restart. Storage must support fast checkpoint writes (hundreds of GB in minutes). The training framework writes the checkpoints; cluster software such as NVIDIA Base Command Manager schedules and monitors the jobs that produce them.
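A quick estimate of checkpoint size and write time, using the weights-plus-optimizer overhead described earlier on this page; the write bandwidth is an assumed parallel-filesystem figure.

```python
# Checkpoint size and write-time estimate (illustrative).
params_billion = 70
ckpt_gb = params_billion * 2 * 3        # FP16 weights plus roughly 2x more for Adam states
write_bandwidth_gb_per_s = 20           # assumed aggregate parallel-filesystem write speed

print(ckpt_gb, "GB,", ckpt_gb / write_bandwidth_gb_per_s, "s per checkpoint")  # 420 GB, ~21 s
```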

Object Storage for Datasets

S3-compatible storage (MinIO on-prem, AWS S3 in cloud) enables streaming datasets directly during training. NVIDIA DALI (Data Loading Library) provides a GPU-accelerated data pipeline for maximum throughput.
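A minimal DALI pipeline sketch, assuming a recent DALI install and a local ./data directory of images; batch size and resolution are placeholders.

```python
from nvidia.dali import pipeline_def, fn

@pipeline_def(batch_size=64, num_threads=4, device_id=0)
def image_pipeline():
    jpegs, labels = fn.readers.file(file_root="./data", random_shuffle=True)
    images = fn.decoders.image(jpegs, device="mixed")        # JPEG decode on the GPU
    images = fn.resize(images, resize_x=224, resize_y=224)   # GPU-side resize
    return images, labels

pipe = image_pipeline()
pipe.build()
images, labels = pipe.run()   # batches land in GPU memory, ready for the training loop
```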

Concept 7 — Cluster Networking: InfiniBand vs Ethernet for AI

InfiniBand

  • Purpose-built HPC/AI networking
  • Ultra-low latency (~1 microsecond)
  • RDMA (zero-copy, kernel-bypass transfers)
  • HDR: 200Gb/s per port | NDR: 400Gb/s per port
  • NVIDIA acquired Mellanox — now NVIDIA Networking

RoCE (RDMA over Converged Ethernet)

  • RDMA semantics over standard Ethernet
  • Uses 100/200/400GbE switches
  • Lower cost than InfiniBand infrastructure
  • Slightly higher latency than IB
  • Requires lossless Ethernet (Priority Flow Control)
  • Increasingly viable for large-scale AI

Fat-Tree Topology

Standard for AI training clusters. Hierarchical leaf-spine topology in which aggregate uplink bandwidth at each tier matches the downlink bandwidth below it, providing full bisection bandwidth between any two nodes and minimizing congestion during all-reduce operations. Implemented as 2-tier or 3-tier leaf-spine.
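The port arithmetic behind a non-blocking two-tier fat-tree, assuming fixed-radix switches (a 64-port NDR switch is used as the example).

```python
# Endpoint capacity of a non-blocking two-tier (leaf-spine) fat-tree.
radix = 64                      # ports per switch, e.g. a 64-port NDR switch
leaf_down_ports = radix // 2    # half of each leaf's ports face the GPU nodes
max_leaves = radix              # each spine connects once to every leaf
endpoints = leaf_down_ports * max_leaves
print(endpoints)                # 2048 node ports at full bisection bandwidth
```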

Rail-Optimized Networking

In 8-GPU nodes, each GPU connects to a different leaf switch ("rail"). Ensures traffic between GPU pairs routes through different switches. Maximizes bandwidth for all-to-all communication patterns. Prevents single-switch bottlenecks.

Storage Networking

Separate storage network (often InfiniBand or 25/100GbE Ethernet) for dataset access. GPUDirect Storage enables direct GPU-to-storage transfers over InfiniBand, eliminating CPU as bottleneck in the data path.

Bandwidth Planning

All-reduce volume per step = gradient size × 2 × (N-1)/N. For a 70B model at FP16: ~280GB per all-reduce step. InfiniBand NDR at 400Gb/s = 50 GB/s per link. Multiple links per node required for large-scale training.
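The same formula as a short calculation; the GPU count and link speed below are illustrative.

```python
# Per-step all-reduce traffic for ring all-reduce (formula from above).
params_billion = 70
grad_gb = params_billion * 2                      # FP16 gradients
n_gpus = 512
link_gb_per_s = 50                                # one NDR 400 Gb/s link

volume_gb = grad_gb * 2 * (n_gpus - 1) / n_gpus   # data each GPU sends and receives
print(volume_gb, volume_gb / link_gb_per_s)       # ~280 GB -> ~5.6 s on a single link
```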

Concept 8 — NVIDIA Reference Architectures: BasePOD and DGX POD

NVIDIA BasePOD

Validated reference architecture for AI/HPC clusters using NVIDIA H100 HGX servers. Includes compute, storage, networking, and management specifications. Vendor-agnostic (any OEM HGX server). Designed for 8–64 GPU nodes.

DGX BasePOD

Same reference architecture as BasePOD but specifically using DGX servers instead of OEM HGX-based servers. Full NVIDIA software stack, enterprise support, and single-vendor accountability. Turnkey AI infrastructure.

DGX POD

20× DGX H100 servers = 160 GPUs. Connected via QM9700 Quantum-2 InfiniBand switches (64× 400Gb/s NDR ports per switch). Fully characterized for LLM training workloads. NVIDIA's standard reference AI factory unit.

DGX SuperPOD

Multiple DGX PODs with thousands of GPUs and full fat-tree InfiniBand fabric. AI factory scale. Used by cloud providers and large enterprises for frontier model training. Requires NVIDIA Professional Services for design and deployment.

Scaling Guidance

  • 1–8 GPUs → single DGX/HGX node
  • 16–64 GPUs → small cluster (BasePOD)
  • 160–1000 GPUs → DGX POD / SuperPOD
  • 1000+ GPUs → custom cluster with NVIDIA Professional Services

HGX vs DGX — Key Distinction

Both include NVSwitch and SXM5 GPU slots with full NVLink bandwidth. Difference: DGX adds CPU board, chassis, PSU, validated OS/software stack, and NVIDIA enterprise support. Same intra-node GPU performance.

Six memory hooks to lock in the most exam-critical hardware and infrastructure concepts — each designed to stick under pressure.
🧮

Training RAM Rule

Model size (GB) = Parameters × 2 (FP16). Training overhead = 4–6× that.

70B model training ≈ 560–840GB GPU RAM. Minimum 8× H100 80GB.

Inference with INT4 = 70B × 0.5 = 35GB — one H200 handles it alone.

📐

3 Parallelism Types — DTP Escalation

Data Parallel → replicate model, split data → model must fit on one GPU.
Tensor Parallel → split layers across GPUs, needs NVLink, within node.
Pipeline Parallel → split layer blocks across nodes, needs InfiniBand.

Scale from D → T → P as model grows beyond what one GPU, then one node, can hold.

⚖️

On-Prem vs Cloud Decision

Sensitive data / steady workload / long-term cost → On-Prem.
Variable / bursty / start-up / no specialized staff → Cloud.
Best of both worlds → Hybrid (on-prem for steady state, cloud burst for peaks).

🏗️

Cluster 4 Components

Every AI cluster needs all four pillars:
Compute (GPU nodes) + Storage (parallel filesystem) + Network (IB/RoCE fabric) + Management (BMC + Base Command Manager).

Miss one = guaranteed bottleneck and underutilized GPUs.

🔌

IB vs RoCE

InfiniBand: lowest latency (~1µs), purpose-built for HPC/AI, NVIDIA-owned (Mellanox), NDR = 400Gb/s.

RoCE: RDMA over Ethernet, cheaper switch infrastructure, slightly higher latency, needs lossless Ethernet.

Both support RDMA. IB = premium AI training choice.

🔧

SXM for Training, PCIe for Inference

SXM = NVLink-connected via NVSwitch = full intra-node bandwidth = best for training all-reduce.

PCIe = standard slot = flexible = good for inference deployment in commodity servers and cloud VMs.

H100 SXM5 for training. A10G/L4/L40S PCIe for inference.

10 exam-style questions on AI Infrastructure: Hardware & Systems.
Question 1 of 10
A company is sizing GPU memory for training a 13 billion parameter model in FP16 precision, including Adam optimizer states and activations. What is the MINIMUM approximate GPU memory required?
Question 2 of 10
A large language model with 70 billion parameters needs to be distributed across multiple GPUs within a single DGX node. Each layer's weight matrices are split across all 8 GPUs so each GPU holds a shard. Which parallelism strategy is this?
Question 3 of 10
Which GPU form factor and generation is MOST appropriate for a large-scale LLM training cluster requiring maximum NVLink bandwidth between GPUs within a node?
Question 4 of 10
A startup wants to train a medium-size model (~7B parameters) on their own data but has variable training schedules — weeks of heavy usage followed by months of inactivity. What infrastructure approach offers the MOST cost advantage?
Question 5 of 10
During distributed training across 64 nodes with InfiniBand networking, performance is bottlenecked at the gradient synchronization step. What is the MOST likely cause?
Question 6 of 10
What is the primary purpose of GPUDirect Storage (GDS) in an AI training cluster?
Question 7 of 10
A research team is training a 175B parameter model that exceeds the memory of any single node (8× H100 = 640GB). How should parallelism be configured?
Question 8 of 10
What defines the "fat-tree" network topology used in AI training clusters?
Question 9 of 10
Which component of an AI cluster is responsible for preventing GPU compute from stalling due to insufficient data supply during training?
Question 10 of 10
A company runs continuous, predictable AI training workloads 24/7 for the next 3 years and has data that must remain on-premises. Comparing total cost of ownership over 3 years, which statement is MOST accurate?


8 flashcards covering the most exam-critical hardware and infrastructure concepts.


GPU Memory Sizing

70B parameter model — GPU memory needed for training vs inference?

Training (FP16): 70B × 2B = 140GB weights + 4–6× training overhead ≈ 560–840GB → needs 8–12× H100 80GB.

Inference (INT4): 70B × 0.5B = 35GB → fits easily in 1× H200 141GB.
Parallelism

Tensor Parallelism vs Pipeline Parallelism — key differences?

Tensor Parallel: split layers horizontally (weight matrices sharded) across GPUs within a node; requires NVLink.

Pipeline Parallel: split layers vertically (layer groups) across nodes; uses InfiniBand.

Combine both = 3D Parallelism.
Networking

InfiniBand NDR speed + key property?

NDR = 400 Gb/s per port (= 50 GB/s).

Key property: RDMA — zero-copy, kernel-bypass data transfer.

Latency ≈ 1 microsecond. Essential for all-reduce in large-scale distributed training.
Reference Architecture

What is a DGX POD?

20× DGX H100 servers = 160 GPUs connected via QM9700 Quantum-2 InfiniBand switches (64× 400Gb/s NDR ports per switch).

Fully validated and characterized for LLM training. NVIDIA's reference AI factory unit.
Memory Optimization

Why use gradient checkpointing?

Trade compute for memory — don't store activations during the forward pass; recompute them during the backward pass.

Costs ~30% more compute but allows larger batch sizes or models that wouldn't otherwise fit in GPU memory.
Storage

GPUDirect Storage — what problem does it solve?

Without GDS: Storage → CPU RAM → GPU RAM (2 copies, CPU is bottleneck).

With GDS: Storage → GPU RAM directly (0 CPU copies).

Eliminates the CPU bottleneck and effectively doubles storage-to-GPU bandwidth for training data loading.
Networking

Rail-optimized networking — what is it?

In an 8-GPU node, each GPU connects to a different leaf switch ("rail").

Ensures traffic between specific GPU pairs routes through different switches. Maximizes all-to-all bandwidth and prevents any single switch from becoming a bottleneck.
Hardware

HGX vs DGX — which has NVSwitch?

Both — HGX H100 baseboard includes NVSwitch + SXM5 GPU slots, delivering the same NVLink performance as DGX.

DGX adds: CPU board, chassis, PSU, validated software stack, and NVIDIA enterprise support. Same intra-node GPU-to-GPU bandwidth.
Study recommendations for AI Infrastructure: Hardware & Systems, organized by experience level and exam timing.

Beginners

  • Start with the single-GPU setup; understand the model memory formula (parameters × 2 for FP16 = GB required) before anything else
  • Learn why NVLink matters vs PCIe: NVLink delivers up to 900 GB/s GPU-to-GPU on H100 (600 GB/s on A100) vs PCIe Gen5 x16 at ~64 GB/s — the difference defines why SXM is preferred for training
  • Explore the NVIDIA DGX H100 product page to understand what "turnkey AI infrastructure" means in practice
  • Focus on the four cluster pillars (Compute + Storage + Network + Management) — ensure you can name all four from memory
  • Use the Memory Hooks tab to internalize the Training RAM Rule before moving to parallelism strategies

Official NVIDIA Resources

NCA-AIIO Official Certification

Disclaimer

Not affiliated with NVIDIA. NVIDIA® is a registered trademark of NVIDIA Corporation. This page is an independent study resource. Official certification information: nvidia.com/en-us/learn/certification/ai-infrastructure-operations-associate/

Start Studying Free →