NVIDIA Solutions & GPU Architecture
This topic covers the hardware and software that powers modern AI — from the silicon (CUDA cores, Tensor Cores, HBM) to complete systems (DGX, HGX) and the software stack that makes them productive (CUDA-X, NIM, NGC). Together these form roughly 38% of the NCA-AIIO exam.
What You'll Master
GPU vs CPU Architecture
- Why thousands of small GPU cores beat a few powerful CPU cores for AI
- Memory bandwidth: HBM3 at 3.35 TB/s vs DDR5 at ~100 GB/s
- SIMT execution model (SIMD-style parallelism) — neural networks are embarrassingly parallel
- Complementary roles: CPU orchestrates, GPU computes
CUDA & Compute Primitives
- CUDA: NVIDIA's parallel computing platform (C/C++, Python, Fortran)
- CUDA Cores: general-purpose FP/INT parallel execution units
- Tensor Cores: specialized matrix-multiply-accumulate hardware for DL
- Mixed precision: FP16, BF16, TF32, INT8, FP8, FP4 support
NVIDIA GPU Families
- A100 (Ampere, 2020): 80GB HBM2e, 312 TFLOPS TF32, 3rd Gen Tensor Cores
- H100 (Hopper, 2022): 80GB HBM3, 989 TFLOPS FP16, Transformer Engine
- H200 (2024): 141GB HBM3e — memory-optimized refresh of H100
- B200 (Blackwell, 2024): FP4, 20 PetaFLOPS, NVLink 5th Gen (1.8 TB/s)
- Jetson: edge AI platform for embedded/robotics deployments
NVIDIA Systems
- DGX: complete, turnkey NVIDIA AI server (8× H100 SXM5, 640 GB GPU RAM)
- HGX: GPU baseboard for OEM server builders (Dell, HPE, Lenovo)
- MGX: modular platform — mix Grace CPU, BlueField DPU, any NVIDIA GPU
- DGX POD (20 DGX H100s) → DGX SuperPOD (32+ PODs)
NVIDIA Software Stack
- CUDA-X libraries: cuDNN, cuBLAS, NCCL, TensorRT, Triton, RAPIDS
- NVIDIA AI Enterprise: enterprise-grade optimized containers + SLA
- NIM (Inference Microservices): pre-packaged, API-compatible model containers
- NGC catalog: GPU-optimized containers, models, Helm charts
- NeMo: end-to-end LLM training and customization platform
- Run:ai: Kubernetes GPU orchestration and scheduling
AI Development Lifecycle
- Data Collection & Curation → NeMo Curator, object storage
- Preprocessing & Feature Engineering → RAPIDS (GPU-accelerated)
- Distributed Training → NeMo Trainer, data/model/pipeline parallel
- Evaluation & Validation → benchmark suites, NVIDIA eval frameworks
- Optimization → TensorRT quantization (INT8/FP8), pruning, distillation
- Deployment & Serving → NIM, Triton, Kubernetes auto-scaling
Key Products & Services
GPU vs CPU — At a Glance
| Attribute | CPU | GPU (NVIDIA H100) |
|---|---|---|
| Core count | 4–128 powerful cores | 16,896 CUDA cores |
| Optimization target | Low latency, complex control flow | High throughput, data-parallel compute |
| Memory bandwidth | ~100 GB/s (DDR5) | 3.35 TB/s (HBM3) — 33× faster |
| FP32 peak | ~1–2 TFLOPS | 67 TFLOPS (CUDA cores) |
| FP16 Tensor Core peak | N/A | 989 TFLOPS |
| Best for | OS, databases, orchestration, control flow | Matrix multiply, convolutions, AI training |
Core Concepts
Eight detailed concept cards covering every major area tested on the NCA-AIIO GPU Architecture topic.
GPU vs CPU Architecture
- CPU (Central Processing Unit): designed for sequential, low-latency tasks; few powerful cores (4–128); large cache hierarchy; optimized for complex control flow (branch prediction, out-of-order execution); best for OS tasks, databases, web servers
- GPU (Graphics Processing Unit): designed for parallel, high-throughput computation; thousands of smaller cores (H100: 16,896 CUDA cores); minimal cache per core; optimized for data-parallel tasks — same operation on many data points simultaneously
- Why GPUs dominate AI: neural network operations (matrix multiplications, convolutions) are embarrassingly parallel — a perfect fit for the GPU's SIMT (Single Instruction, Multiple Threads) execution model, NVIDIA's SIMD-style architecture
- Memory bandwidth: GPUs have far higher memory bandwidth than CPUs — H100 SXM: 3.35 TB/s HBM3 vs CPU DDR5: ~100 GB/s; critical because many AI workloads, large-model inference in particular, are memory-bandwidth bound
- Complementarity: CPUs handle control flow, data loading, orchestration; GPUs handle the heavy compute; a typical AI server has both connected via PCIe or NVLink
- FLOPS comparison: modern CPUs peak at ~1–2 TFLOPS FP32; NVIDIA H100 SXM reaches 989 TFLOPS FP16 (Tensor Core), 3,958 TFLOPS FP8 — orders of magnitude difference; the timing sketch below makes the gap concrete
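To make the throughput gap concrete, here is a minimal timing sketch (assuming PyTorch and a CUDA-capable GPU; exact numbers vary by hardware) that runs the same matrix multiply on CPU cores, on GPU CUDA cores in FP32, and on Tensor Cores in FP16:

```python
# Minimal sketch: time the same matmul on CPU vs GPU (assumes PyTorch + CUDA GPU)
import time
import torch

def bench(device, dtype=torch.float32, n=4096, iters=3):
    a = torch.randn(n, n, device=device, dtype=dtype)
    b = torch.randn(n, n, device=device, dtype=dtype)
    if device == "cuda":
        torch.cuda.synchronize()      # finish setup before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        a @ b                         # the data-parallel workload
    if device == "cuda":
        torch.cuda.synchronize()      # GPU launches are async; drain the queue
    return (time.perf_counter() - start) / iters

print(f"CPU FP32: {bench('cpu'):.4f} s/matmul")
print(f"GPU FP32: {bench('cuda'):.4f} s/matmul (CUDA cores)")
print(f"GPU FP16: {bench('cuda', torch.float16):.4f} s/matmul (Tensor Cores)")
```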
CUDA: The Foundation of NVIDIA's Ecosystem
- CUDA (Compute Unified Device Architecture): NVIDIA's parallel computing platform and programming model (launched 2006); C/C++ extensions to write GPU kernels; now supports Python, Fortran, Java via libraries
- CUDA Cores: general-purpose parallel processing units; execute floating-point and integer operations; H100 has 16,896 CUDA cores; the workhorses for any non-matrix-math computation
- Tensor Cores: specialized matrix-math accelerators introduced in Volta (V100); dramatically accelerate matrix multiply-accumulate (MMA) operations central to deep learning; support mixed precision (FP16, BF16, TF32, INT8, FP8, FP4 in Blackwell)
- CUDA ecosystem lock-in: CUDA libraries (cuDNN, cuBLAS, cuFFT, NCCL) are NVIDIA-exclusive; most AI frameworks (PyTorch, TensorFlow) are CUDA-optimized; competing platforms (AMD ROCm, Intel oneAPI) offer compatibility layers but lag in performance
- CUDA toolkit: compiler (nvcc), debugger (cuda-gdb), profilers (Nsight Systems/Compute; legacy nvprof), libraries; developers write kernels that run on the GPU (a minimal kernel example follows this list); frameworks abstract this away for most data scientists
- Importance for exam: understanding that CUDA enabled the AI revolution by giving developers a productive GPU programming model is key NCA-AIIO knowledge
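Most practitioners never write raw CUDA C++, but seeing a kernel demystifies the model. Below is a minimal sketch of a CUDA kernel written from Python via Numba's CUDA bindings (an assumption of this example; in CUDA C++ the same idea uses `__global__` functions and `<<<blocks, threads>>>` launches):

```python
# Minimal sketch of the CUDA programming model via Numba (assumes numba + CUDA GPU)
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)              # this thread's global index in the 1-D grid
    if i < out.size:              # guard: the grid may be larger than the data
        out[i] = a[i] + b[i]      # one thread handles one element (SIMT)

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)

d_a = cuda.to_device(a)                  # explicit host -> device copies
d_b = cuda.to_device(b)
d_out = cuda.device_array_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](d_a, d_b, d_out)   # launch a grid of blocks
out = d_out.copy_to_host()               # device -> host copy of the result
```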
NVIDIA GPU Families: Data Center
- A100 (Ampere, 2020): 80GB HBM2e; 312 TFLOPS TF32; NVLink 3rd gen (600 GB/s); workhorse of the cloud GPU market; still widely deployed; 3rd Gen Tensor Cores
- H100 (Hopper, 2022): 80GB HBM3; 989 TFLOPS FP16 (Tensor); NVLink 4th gen (900 GB/s/GPU); 4th Gen Tensor Cores with FP8; Transformer Engine (dynamic FP8/BF16 switching); confidential computing; SXM5 (server) and PCIe form factors
- H200 (Hopper refresh, 2024): 141GB HBM3e (75% more memory); higher bandwidth; same compute as H100; ideal for large model inference where memory is the bottleneck (see the sizing arithmetic after the spec table below)
- B200 (Blackwell, 2024): 5th Gen Tensor Cores; FP4 precision; 20 PetaFLOPS in FP4; NVLink 5th gen (1.8 TB/s); designed for trillion-parameter model training; GB200 = Grace CPU + B200 GPU on NVLink-C2C
- L4 / L40S (Ada Lovelace): PCIe; designed for inference and content creation; L4 = 24GB GDDR6, good for video AI; L40S = 48GB, excellent for LLM inference at scale
- A10G: PCIe, 24GB; AWS's variant of the A10, popular for cloud inference (G5 instances); good price/performance for medium-scale deployments
| GPU | Architecture | Memory | Peak Tensor Throughput | NVLink Gen |
|---|---|---|---|---|
| A100 | Ampere (2020) | 80GB HBM2e | 312 TFLOPS (TF32) | 3rd — 600 GB/s |
| H100 | Hopper (2022) | 80GB HBM3 | 989 TFLOPS (FP16) | 4th — 900 GB/s |
| H200 | Hopper (2024) | 141GB HBM3e | 989 TFLOPS (FP16) | 4th — 900 GB/s |
| B200 | Blackwell (2024) | 192GB HBM3e | 20 PFLOPS (FP4) | 5th — 1.8 TB/s |
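A quick back-of-the-envelope calculation shows why H200's 141GB matters. Model weights alone need parameter count × bytes per parameter; the sketch below is illustrative only (real deployments also need memory for the KV cache and activations):

```python
# Rough sizing sketch: GPU memory needed just to hold model weights
def weights_gb(params_billion, bytes_per_param):
    # 1e9 params * bytes, expressed in decimal GB
    return params_billion * bytes_per_param

print(weights_gb(70, 2))    # 70B model in FP16 -> 140 GB: > one H100, fits one H200
print(weights_gb(70, 1))    # 70B model in INT8/FP8 -> 70 GB: fits one 80GB H100
print(weights_gb(70, 0.5))  # 70B model in FP4 -> 35 GB: fits an L40S-class GPU
```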
NVIDIA Systems: DGX, HGX, MGX
- DGX: NVIDIA's complete, turnkey AI server; fully validated hardware + software (DGX OS, NVIDIA AI Enterprise); DGX H100 = 8× H100 SXM5 connected via NVLink Switch System; 640GB total GPU memory; 3.2 TB/s NVLink bandwidth; DGX B200 uses 8× B200
- DGX POD: 20 DGX H100 servers connected via Quantum-2 InfiniBand; 160 GPUs; full software stack; NVIDIA-validated networking for AI training at scale
- DGX SuperPOD: 32+ DGX PODs; thousands of GPUs; full fat-tree InfiniBand fabric; enterprise-scale AI factory
- HGX (Hyperscale GPU): building block for OEM server vendors (Dell, HPE, Lenovo); H100 HGX baseboard contains 4 or 8 H100 SXMs + NVSwitch; OEMs add CPUs, cooling, chassis; more flexible than DGX for custom deployments
- MGX (Modular GPU eXtensible): modular server architecture; mix-and-match NVIDIA GPUs, CPUs (Grace, x86), DPUs (BlueField); designed for diverse AI and HPC workloads with standardized form factor
- Key distinction: DGX = complete product, NVIDIA-sold; HGX = OEM component/baseboard; MGX = modular platform for partners
NVLink, NVSwitch, and GPU Interconnects
- NVLink: high-bandwidth, low-latency GPU-to-GPU interconnect; bypasses PCIe for GPU communication; NVLink 4 (H100): 900 GB/s bidirectional per GPU; enables GPUs to share memory and communicate directly
- NVSwitch: non-blocking switch chip for all-to-all NVLink connectivity; allows any GPU to communicate with any other at full NVLink speed; H100 DGX uses 4× NVSwitch chips providing 3.2 TB/s total
- Why this matters for AI: large model training requires frequent gradient exchange between GPUs (all-reduce); NVLink/NVSwitch bandwidth directly determines how well training scales (a rough cost model follows this list)
- PCIe vs SXM: PCIe = standard slot, lower power/bandwidth, more flexible; SXM = mezzanine form factor, higher TDP, direct NVLink to NVSwitch, maximum performance; DGX and HGX use SXM
- NVLink-C2C (Chip-to-Chip): connects a Grace CPU directly to a Hopper or Blackwell GPU with a 900 GB/s coherent link; CPU and GPU share a memory address space; used in the GH200 and GB200 superchips
- Multi-node scaling: within a node = NVLink; across nodes = InfiniBand or Ethernet; NVLink (intra-node) + InfiniBand (inter-node) enables clusters of thousands of GPUs
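Here is a rough cost model for the gradient all-reduce mentioned above, using the standard ring all-reduce communication volume of 2(N−1)/N × the buffer size (a simplification that ignores latency and overlap with compute):

```python
# Rough sketch: ring all-reduce time for exchanging gradients across N GPUs
def allreduce_seconds(grad_bytes, n_gpus, link_bytes_per_s):
    # each GPU sends and receives 2*(N-1)/N of the buffer in a ring all-reduce
    volume = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    return volume / link_bytes_per_s

grads = 7e9 * 2      # e.g. 7B parameters of FP16 gradients = 14 GB
nvlink4 = 900e9      # H100 NVLink 4: 900 GB/s per GPU
print(f"{allreduce_seconds(grads, 8, nvlink4) * 1e3:.1f} ms per step over NVLink")
```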
NVIDIA Software Stack: CUDA-X Libraries
- cuDNN: deep neural network library; optimized primitives (convolutions, pooling, normalization, activation); used by all major DL frameworks (PyTorch, TensorFlow) under the hood
- cuBLAS: optimized BLAS (Basic Linear Algebra Subprograms) for GPU; matrix multiply at the core of transformer attention layers; uses Tensor Cores automatically
- NCCL (NVIDIA Collective Communications Library): optimized all-reduce, broadcast, gather for multi-GPU and multi-node training; critical for distributed training performance; uses NVLink intra-node, RDMA InfiniBand inter-node
- TensorRT: inference optimization SDK; takes trained model, applies layer fusion, precision calibration (INT8/FP8), kernel auto-tuning; produces optimized inference engine; integrates with Triton
- Triton Inference Server: open-source, production inference serving; multi-framework (TensorRT, PyTorch, ONNX, TensorFlow); dynamic batching; multi-GPU and multi-node; REST/gRPC API
- RAPIDS: GPU-accelerated data science library suite; cuDF (pandas-like DataFrames on GPU), cuML (ML on GPU), cuGraph (graph analytics); integrates with the Python ecosystem (a cuDF sketch follows the table below)
| Library | Purpose | Used By |
|---|---|---|
| cuDNN | DNN primitives (conv, pool, norm) | PyTorch, TensorFlow (auto) |
| cuBLAS | GPU matrix multiply (BLAS) | All DL frameworks |
| NCCL | Multi-GPU collective comms | Distributed training |
| TensorRT | Inference optimization engine | Production serving |
| Triton | Inference serving framework | MLOps teams |
| RAPIDS | GPU data science (cuDF/cuML) | Data scientists |
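For a flavor of RAPIDS, here is a minimal cuDF sketch (assumes the `cudf` package and a GPU; the data is made up): the API mirrors pandas, so existing data-prep code often ports with little more than an import change.

```python
# Minimal sketch: pandas-style preprocessing on the GPU with RAPIDS cuDF
import cudf

df = cudf.DataFrame({
    "tokens": [128, 512, 2048, 96],
    "label":  [0, 1, 1, 0],
})
df["long_seq"] = df["tokens"] > 1024          # elementwise op runs on the GPU
print(df.groupby("label")["tokens"].mean())   # GPU-accelerated groupby
```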
NVIDIA AI Enterprise & NIM
- NVIDIA AI Enterprise: software platform for enterprise AI; includes optimized containers for popular AI frameworks (PyTorch, TensorFlow, JAX); certified on NVIDIA hardware; includes NIM, NeMo, RAPIDS, Triton; supported with enterprise SLA
- NIM (NVIDIA Inference Microservices): pre-packaged, optimized containers for deploying AI models; each NIM = model + TensorRT optimization + Triton server + API endpoint; exposes an OpenAI-compatible API (example after this list); deploy on-prem or in the cloud in minutes
- NVIDIA NeMo: end-to-end platform for LLM training and customization; NeMo Curator (data curation), NeMo Trainer (distributed training), NeMo Customizer (fine-tuning/LoRA), NeMo Guardrails (safety)
- NGC (NVIDIA GPU Cloud): catalog of GPU-optimized containers, pre-trained models, SDKs, Helm charts; free to access; users pull optimized containers directly; includes models from NVIDIA and partners
- NVIDIA AI Workbench: local development environment for AI/ML; GPU-accelerated; project-based; sync to cloud or cluster; simplifies environment management for data scientists
- Run:ai: Kubernetes-based GPU orchestration platform (NVIDIA acquired 2024); workload scheduling, GPU pooling, quota management; manages GPU clusters for ML teams; integrates with DGX systems
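Because each NIM exposes an OpenAI-compatible endpoint, any OpenAI client can call it. A minimal sketch (the URL, port, and model name are placeholders for whatever NIM you actually deploy):

```python
# Minimal sketch: calling a locally deployed NIM through its OpenAI-compatible API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # placeholder: your NIM's endpoint
    api_key="not-needed-for-local-nim",    # local NIMs typically ignore the key
)
resp = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",    # placeholder: the model your NIM serves
    messages=[{"role": "user", "content": "What is NVLink?"}],
)
print(resp.choices[0].message.content)
```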
AI Development and Deployment Lifecycle
- Data Collection & Curation: gather raw data; label or generate annotations; quality filtering; NVIDIA NeMo Curator; storage in data lakes (S3, HDFS, Lustre, GPFS)
- Data Preprocessing & Feature Engineering: clean, normalize, tokenize; GPU-accelerated with RAPIDS; store in efficient formats (Parquet, TFRecord, WebDataset)
- Model Training: define model architecture; configure distributed training (data parallel, model parallel, pipeline parallel); run on GPU cluster; monitor with Weights & Biases, MLflow, DCGM
- Evaluation & Validation: evaluate on held-out test set; benchmark against baselines; human evaluation for generative models; NVIDIA provides evaluation frameworks
- Optimization for Deployment: quantization (TensorRT), pruning, distillation; optimize for target hardware (GPU type, batch size, latency SLA)
- Deployment & Inference Serving: containerize with NIM or a custom Triton config; deploy via Kubernetes; auto-scale based on load; monitor latency, throughput, GPU utilization, and accuracy drift (a Triton client sketch follows the table below)
| Lifecycle Stage | Key Tools | Output |
|---|---|---|
| Data Curation | NeMo Curator, object storage | Clean dataset |
| Preprocessing | RAPIDS, cuDF, tokenizers | Training-ready tensors |
| Training | NeMo Trainer, NCCL, DGX cluster | Checkpoint / model weights |
| Evaluation | Eval frameworks, DCGM | Accuracy metrics |
| Optimization | TensorRT, quantization tools | Optimized inference engine |
| Deployment | NIM, Triton, Kubernetes | Production API endpoint |
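To close the loop on serving, here is a minimal Triton HTTP client sketch (assumes the `tritonclient` package, a running Triton server, and a loaded model; the model and tensor names are placeholders):

```python
# Minimal sketch: querying a Triton Inference Server over HTTP
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 3, 224, 224).astype(np.float32)          # fake image batch
inp = httpclient.InferInput("input__0", list(data.shape), "FP32") # placeholder tensor name
inp.set_data_from_numpy(data)

result = client.infer(model_name="resnet50", inputs=[inp])        # placeholder model name
print(result.as_numpy("output__0").shape)                         # placeholder output name
```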
Study Advisor
Study guidance tailored to your experience level.
Beginners
- Start with the GPU vs CPU comparison — understand why parallel cores matter for matrix math
- Watch NVIDIA's "What is a GPU?" explainer before diving into CUDA
- Explore the NVIDIA NGC catalog — browse available containers and models
- Learn the three system tiers: DGX (complete) → HGX (baseboard) → MGX (modular)
- Focus on the overview material first — nail the key products list before the detailed concepts
Resources
Official NVIDIA documentation and reference pages for deeper study.
- NVIDIA H100 Datasheet — official GPU specifications, memory bandwidth, Tensor Core performance figures (nvidia.com/en-us/data-center/h100/)
- NVIDIA DGX H100 Product Page — complete AI server specifications, DGX POD and SuperPOD architecture (nvidia.com/en-us/data-center/dgx-h100/)
- NVIDIA NIM Microservices — NIM overview, supported models, deployment guides (nvidia.com/en-us/ai-data-science/products/nim-microservices/)
- NGC Catalog — browse GPU-optimized containers, models, and SDK installers (catalog.ngc.nvidia.com)
- CUDA Documentation — programming guide, best practices, API reference for the CUDA toolkit (docs.nvidia.com/cuda/)
- NCA-AIIO Certification Page — official exam objectives, study resources, and registration (NVIDIA AI Infrastructure & Operations Associate)