
NVIDIA Solutions & GPU Architecture

This topic covers the hardware and software that powers modern AI — from the silicon (CUDA cores, Tensor Cores, HBM) to complete systems (DGX, HGX) and the software stack that makes them productive (CUDA-X, NIM, NGC). Together these form roughly 38% of the NCA-AIIO exam.

What You'll Master

GPU vs CPU Architecture

  • Why thousands of small GPU cores beat a few powerful CPU cores for AI
  • Memory bandwidth: HBM3 at 3.35 TB/s vs DDR5 at ~100 GB/s
  • SIMD execution model — neural networks are embarrassingly parallel
  • Complementary roles: CPU orchestrates, GPU computes

CUDA & Compute Primitives

  • CUDA: NVIDIA's parallel computing platform (C/C++, Python, Fortran)
  • CUDA Cores: general-purpose FP/INT parallel execution units
  • Tensor Cores: specialized matrix-multiply-accumulate hardware for DL
  • Mixed precision: FP16, BF16, TF32, INT8, FP8, FP4 support
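
In practice, these precisions are usually enabled from the training framework rather than written by hand. A minimal PyTorch mixed-precision sketch (illustrative only; assumes a CUDA-capable GPU and a toy linear model):

```python
# Mixed-precision training step with PyTorch autocast + GradScaler (illustrative sketch).
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()   # rescales FP16 gradients to avoid underflow

x = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    # matmuls inside this block run in FP16 on Tensor Cores; reductions stay in FP32
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()   # backward pass on the scaled loss
scaler.step(optimizer)          # unscales gradients, skips the step if any are inf/NaN
scaler.update()
```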

NVIDIA GPU Families

  • A100 (Ampere, 2020): 80GB HBM2e, 312 TFLOPS TF32, 3rd Gen Tensor Cores
  • H100 (Hopper, 2022): 80GB HBM3, 989 TFLOPS FP16, Transformer Engine
  • H200 (2024): 141GB HBM3e — memory-optimized refresh of H100
  • B200 (Blackwell, 2024): FP4, 20 PetaFLOPS, NVLink 5th Gen (1.8 TB/s)
  • Jetson: edge AI platform for embedded/robotics deployments

NVIDIA Systems

  • DGX: complete, turnkey NVIDIA AI server (8× H100 SXM5, 640 GB GPU RAM)
  • HGX: GPU baseboard for OEM server builders (Dell, HPE, Lenovo)
  • MGX: modular platform — mix Grace CPU, BlueField DPU, any NVIDIA GPU
  • DGX POD (20 DGX H100s) → DGX SuperPOD (scalable units of 32 DGX systems)

NVIDIA Software Stack

  • CUDA-X libraries: cuDNN, cuBLAS, NCCL, TensorRT, Triton, RAPIDS
  • NVIDIA AI Enterprise: enterprise-grade optimized containers + SLA
  • NIM (Inference Microservices): pre-packaged, API-compatible model containers
  • NGC catalog: GPU-optimized containers, models, Helm charts
  • NeMo: end-to-end LLM training and customization platform
  • Run:ai: Kubernetes GPU orchestration and scheduling

AI Development Lifecycle

  • Data Collection & Curation → NeMo Curator, object storage
  • Preprocessing & Feature Engineering → RAPIDS (GPU-accelerated)
  • Distributed Training → NeMo Trainer, data/model/pipeline parallel
  • Evaluation & Validation → benchmark suites, NVIDIA eval frameworks
  • Optimization → TensorRT quantization (INT8/FP8), pruning, distillation
  • Deployment & Serving → NIM, Triton, Kubernetes auto-scaling

Key Products & Services

CUDA · Tensor Cores · HBM3e · NVLink · NVSwitch · H100 · B200 · A100 · DGX H100 · HGX H100 · MGX · Jetson · NVIDIA AI Enterprise · NIM · NGC · cuDNN · TensorRT · RAPIDS · Triton Inference Server · NeMo · Run:ai

GPU vs CPU — At a Glance

Attribute | CPU | GPU (NVIDIA H100)
Core count | 4–128 powerful cores | 16,896 CUDA cores
Optimization target | Low latency, complex control flow | High throughput, data-parallel compute
Memory bandwidth | ~100 GB/s (DDR5) | 3.35 TB/s (HBM3) — 33× faster
FP32 peak | ~1–2 TFLOPS | 67 TFLOPS (CUDA cores)
FP16 Tensor Core peak | N/A | 989 TFLOPS
Best for | OS, databases, orchestration, control flow | Matrix multiply, convolutions, AI training

Core Concepts

Eight detailed concept cards covering every major area tested on the NCA-AIIO GPU Architecture topic.

Concept 1 of 8

GPU vs CPU Architecture

  • CPU (Central Processing Unit): designed for sequential, low-latency tasks; few powerful cores (4–128); large cache hierarchy; optimized for complex control flow (branch prediction, out-of-order execution); best for OS tasks, databases, web servers
  • GPU (Graphics Processing Unit): designed for parallel, high-throughput computation; thousands of smaller cores (H100: 16,896 CUDA cores); minimal cache per core; optimized for data-parallel tasks — same operation on many data points simultaneously
  • Why GPUs dominate AI: neural network operations (matrix multiplications, convolutions) are embarrassingly parallel — perfect for GPU's SIMD (Single Instruction Multiple Data) architecture
  • Memory bandwidth: GPUs have far higher memory bandwidth than CPUs — H100 SXM: 3.35 TB/s HBM3 vs CPU DDR5: ~100 GB/s; critical because many AI workloads are memory-bandwidth bound
  • Complementarity: CPUs handle control flow, data loading, orchestration; GPUs handle the heavy compute; a typical AI server has both connected via PCIe or NVLink
  • FLOPS comparison: modern CPUs peak at ~1–2 TFLOPS FP32; NVIDIA H100 SXM reaches 989 TFLOPS FP16 (Tensor Core), 3,958 TFLOPS FP8 — orders of magnitude difference
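
The throughput gap is easy to see from a framework. A rough CPU-vs-GPU matrix-multiply timing in PyTorch (an illustrative sketch, not a rigorous benchmark; assumes a CUDA-capable GPU):

```python
# Rough CPU-vs-GPU matrix-multiply timing with PyTorch (illustrative, not a benchmark).
import time
import torch

N = 4096
a = torch.randn(N, N)
b = torch.randn(N, N)

t0 = time.perf_counter()
c_cpu = a @ b                           # executes on a handful of CPU cores
cpu_s = time.perf_counter() - t0

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()            # make sure the copies finished before timing
    t0 = time.perf_counter()
    c_gpu = a_gpu @ b_gpu               # dispatched to thousands of CUDA/Tensor Cores via cuBLAS
    torch.cuda.synchronize()            # GPU kernels launch asynchronously; wait for completion
    gpu_s = time.perf_counter() - t0
    print(f"CPU: {cpu_s:.3f}s  GPU: {gpu_s:.4f}s  (~{cpu_s / gpu_s:.0f}x faster)")
```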
Concept 2 of 8

CUDA: The Foundation of NVIDIA's Ecosystem

  • CUDA (Compute Unified Device Architecture): NVIDIA's parallel computing platform and programming model (launched 2006); C/C++ extensions to write GPU kernels; now supports Python, Fortran, Java via libraries
  • CUDA Cores: general-purpose parallel processing units; execute floating-point and integer operations; H100 has 16,896 CUDA cores; the workhorses for any non-matrix-math computation
  • Tensor Cores: specialized matrix-math accelerators introduced in Volta (V100); dramatically accelerate matrix multiply-accumulate (MMA) operations central to deep learning; support mixed precision (FP16, BF16, TF32, INT8, FP8, FP4 in Blackwell)
  • CUDA ecosystem lock-in: CUDA libraries (cuDNN, cuBLAS, cuFFT, NCCL) are NVIDIA-exclusive; most AI frameworks (PyTorch, TensorFlow) are CUDA-optimized; competitor GPUs (AMD ROCm, Intel oneAPI) have compatibility layers but lag in performance
  • CUDA toolkit: compiler (nvcc), debugger (cuda-gdb), profiler (Nsight, nvprof), libraries; developers write kernels that run on GPU; frameworks abstract this away for most data scientists
  • Importance for exam: understanding that CUDA enabled the AI revolution by giving developers a productive GPU programming model is key NCA-AIIO knowledge
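
For a feel of the CUDA programming model itself, here is a minimal kernel written in Python with Numba's CUDA support (a sketch; assumes numba and an NVIDIA driver are installed; most practitioners never write kernels because frameworks call cuDNN/cuBLAS for them):

```python
# A minimal CUDA kernel written in Python with Numba (illustrative sketch).
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)          # this thread's global index across all blocks
    if i < out.size:          # guard: the last block may have extra threads
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a, b, out)   # Numba handles host-to-device copies here
assert np.allclose(out, a + b)
```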
Concept 3 of 8

NVIDIA GPU Families: Data Center

  • A100 (Ampere, 2020): 80GB HBM2e; 312 TFLOPS TF32; NVLink 3rd gen (600 GB/s); workhorse of the cloud GPU market; still widely deployed; 3rd Gen Tensor Cores
  • H100 (Hopper, 2022): 80GB HBM3; 989 TFLOPS FP16 (Tensor); NVLink 4th gen (900 GB/s/GPU); 4th Gen Tensor Cores with FP8; Transformer Engine (dynamic FP8/BF16 switching); confidential computing; SXM5 (server) and PCIe form factors
  • H200 (Hopper refresh, 2024): 141GB HBM3e (75% more memory); higher bandwidth; same compute as H100; ideal for large model inference where memory is the bottleneck
  • B200 (Blackwell, 2024): 5th Gen Tensor Cores; FP4 precision; 20 PetaFLOPS in FP4; NVLink 5th gen (1.8 TB/s); designed for trillion-parameter model training; GB200 = one Grace CPU + two B200 GPUs connected via NVLink-C2C
  • L4 / L40S (Ada Lovelace): PCIe; designed for inference and content creation; L4 = 24GB GDDR6, good for video AI; L40S = 48GB, excellent for LLM inference at scale
  • A10G: PCIe, 24GB; popular for cloud inference (e.g., AWS G5 instances); good price/performance for medium-scale deployment
GPU | Architecture | Memory | Peak Tensor throughput | NVLink Gen
A100 | Ampere (2020) | 80GB HBM2e | 312 TFLOPS (TF32) | 3rd — 600 GB/s
H100 | Hopper (2022) | 80GB HBM3 | 989 TFLOPS (FP16) | 4th — 900 GB/s
H200 | Hopper (2024) | 141GB HBM3e | 989 TFLOPS (FP16) | 4th — 900 GB/s
B200 | Blackwell (2024) | 192GB HBM3e | 20 PFLOPS (FP4) | 5th — 1.8 TB/s
Concept 4 of 8

NVIDIA Systems: DGX, HGX, MGX

  • DGX: NVIDIA's complete, turnkey AI server; fully validated hardware + software (DGX OS, NVIDIA AI Enterprise); DGX H100 = 8× H100 SXM5 connected via NVLink Switch System; 640GB total GPU memory; 3.2 TB/s NVLink bandwidth; DGX B200 uses 8× B200
  • DGX POD: 20 DGX H100 servers connected via Quantum-2 InfiniBand; 160 GPUs; full software stack; NVIDIA-validated networking for AI training at scale
  • DGX SuperPOD: one or more scalable units of 32 DGX systems; hundreds to thousands of GPUs; full fat-tree InfiniBand fabric; enterprise-scale AI factory
  • HGX (Hyperscale GPU): building block for OEM server vendors (Dell, HPE, Lenovo); H100 HGX baseboard contains 4 or 8 H100 SXMs + NVSwitch; OEMs add CPUs, cooling, chassis; more flexible than DGX for custom deployments
  • MGX (Modular GPU eXtensible): modular server architecture; mix-and-match NVIDIA GPUs, CPUs (Grace, x86), DPUs (BlueField); designed for diverse AI and HPC workloads with standardized form factor
  • Key distinction: DGX = complete product, NVIDIA-sold; HGX = OEM component/baseboard; MGX = modular platform for partners
Concept 5 of 8

NVLink, NVSwitch, and GPU Interconnects

  • NVLink: high-bandwidth, low-latency GPU-to-GPU interconnect; bypasses PCIe for GPU communication; NVLink 4 (H100): 900 GB/s bidirectional per GPU; enables GPUs to share memory and communicate directly
  • NVSwitch: non-blocking switch chip for all-to-all NVLink connectivity; allows any GPU to communicate with any other at full NVLink speed; H100 DGX uses 4× NVSwitch chips providing 3.2 TB/s total
  • Why this matters for AI: large model training requires frequent gradient exchange between GPUs (all-reduce); NVLink/NVSwitch bandwidth directly determines how well training scales
  • PCIe vs SXM: PCIe = standard slot, lower power/bandwidth, more flexible; SXM = mezzanine form factor, higher TDP, direct NVLink to NVSwitch, maximum performance; DGX and HGX use SXM
  • NVLink-C2C (Chip-to-Chip): connects the Grace CPU directly to a Hopper or Blackwell GPU with a 900 GB/s coherent link; CPU and GPU share a memory address space; used in the GH200 and GB200 superchips
  • Multi-node scaling: within a node = NVLink; across nodes = InfiniBand or Ethernet; NVLink (intra-node) + InfiniBand (inter-node) enables clusters of thousands of GPUs
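
The gradient exchange described above is typically an NCCL all-reduce issued through a framework. A minimal sketch with torch.distributed (illustrative; assumes PyTorch with the NCCL backend, launched on one node via torchrun; the file name is a placeholder):

```python
# NCCL all-reduce via torch.distributed; launch with: torchrun --nproc_per_node=8 allreduce_demo.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")         # NCCL uses NVLink/NVSwitch inside the node
rank = dist.get_rank()
world = dist.get_world_size()
torch.cuda.set_device(rank)

grad = torch.ones(1024, device="cuda") * rank   # stand-in for this GPU's local gradients
dist.all_reduce(grad, op=dist.ReduceOp.SUM)     # every rank ends up with the element-wise sum
grad /= world                                   # average, as in data-parallel training

if rank == 0:
    print(f"averaged value: {grad[0].item():.2f}")   # (0+1+...+7)/8 = 3.5 on an 8-GPU node
dist.destroy_process_group()
```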
Concept 6 of 8

NVIDIA Software Stack: CUDA-X Libraries

  • cuDNN: deep neural network library; optimized primitives (convolutions, pooling, normalization, activation); used by all major DL frameworks (PyTorch, TensorFlow) under the hood
  • cuBLAS: optimized BLAS (Basic Linear Algebra Subprograms) for GPU; matrix multiply at the core of transformer attention layers; uses Tensor Cores automatically
  • NCCL (NVIDIA Collective Communications Library): optimized all-reduce, broadcast, gather for multi-GPU and multi-node training; critical for distributed training performance; uses NVLink intra-node, RDMA InfiniBand inter-node
  • TensorRT: inference optimization SDK; takes trained model, applies layer fusion, precision calibration (INT8/FP8), kernel auto-tuning; produces optimized inference engine; integrates with Triton
  • Triton Inference Server: open-source, production inference serving; multi-framework (TensorRT, PyTorch, ONNX, TensorFlow); dynamic batching; multi-GPU and multi-node; REST/gRPC API
  • RAPIDS: GPU-accelerated data science library suite; cuDF (pandas-like DataFrames on GPU), cuML (ML on GPU), cuGraph (graph analytics); integrates with Python ecosystem
Library | Purpose | Used By
cuDNN | DNN primitives (conv, pool, norm) | PyTorch, TensorFlow (auto)
cuBLAS | GPU matrix multiply (BLAS) | All DL frameworks
NCCL | Multi-GPU collective comms | Distributed training
TensorRT | Inference optimization engine | Production serving
Triton | Inference serving framework | MLOps teams
RAPIDS | GPU data science (cuDF/cuML) | Data scientists
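
As an example of the CUDA-X data-science layer, a small RAPIDS cuDF sketch (illustrative; the column names and values are made up; assumes cudf is installed on a CUDA machine):

```python
# Pandas-style GroupBy on the GPU with RAPIDS cuDF (illustrative sketch).
import cudf

df = cudf.DataFrame({
    "user": ["a", "b", "a", "c", "b"],
    "latency_ms": [12.0, 48.5, 7.2, 30.1, 22.4],
})

# The whole pipeline executes on the GPU; the API mirrors pandas.
summary = df.groupby("user").agg({"latency_ms": ["mean", "max"]})
print(summary)
```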
Concept 7 of 8

NVIDIA AI Enterprise & NIM

  • NVIDIA AI Enterprise: software platform for enterprise AI; includes optimized containers for popular AI frameworks (PyTorch, TensorFlow, JAX); certified on NVIDIA hardware; includes NIM, NeMo, RAPIDS, Triton; supported with enterprise SLA
  • NIM (NVIDIA Inference Microservices): pre-packaged, optimized containers for deploying AI models; each NIM = model + TensorRT optimization + Triton server + API endpoint; API-compatible with OpenAI standards; deploy on-prem or cloud in minutes
  • NVIDIA NeMo: end-to-end platform for LLM training and customization; NeMo Curator (data curation), NeMo Trainer (distributed training), NeMo Customizer (fine-tuning/LoRA), NeMo Guardrails (safety)
  • NGC (NVIDIA GPU Cloud): catalog of GPU-optimized containers, pre-trained models, SDKs, Helm charts; free to access; containers can be pulled directly; includes models from NVIDIA and partners
  • NVIDIA AI Workbench: local development environment for AI/ML; GPU-accelerated; project-based; sync to cloud or cluster; simplifies environment management for data scientists
  • Run:ai: Kubernetes-based GPU orchestration platform (NVIDIA acquired 2024); workload scheduling, GPU pooling, quota management; manages GPU clusters for ML teams; integrates with DGX systems
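
Because each NIM exposes an OpenAI-compatible endpoint, it can be called with standard client libraries. A minimal sketch (the base URL and model identifier are placeholders for whatever NIM you deploy):

```python
# Calling a deployed NIM through its OpenAI-compatible API (URL and model name are placeholders).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # wherever the NIM container is exposed
    api_key="not-used-for-local-nims",     # the client requires a value even if auth is off
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",    # placeholder model ID; use the one your NIM serves
    messages=[{"role": "user", "content": "Explain NVLink in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```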
Concept 8 of 8

AI Development and Deployment Lifecycle

  • Data Collection & Curation: gather raw data; label or generate annotations; quality filtering; NVIDIA NeMo Curator; storage in data lakes (S3, HDFS, Lustre, GPFS)
  • Data Preprocessing & Feature Engineering: clean, normalize, tokenize; GPU-accelerated with RAPIDS; store in efficient formats (Parquet, TFRecord, WebDataset)
  • Model Training: define model architecture; configure distributed training (data parallel, model parallel, pipeline parallel); run on GPU cluster; monitor with Weights & Biases, MLflow, DCGM
  • Evaluation & Validation: evaluate on held-out test set; benchmark against baselines; human evaluation for generative models; NVIDIA provides evaluation frameworks
  • Optimization for Deployment: quantization (TensorRT), pruning, distillation; optimize for target hardware (GPU type, batch size, latency SLA)
  • Deployment & Inference Serving: containerize with NIM or custom Triton config; deploy via Kubernetes; auto-scale based on load; monitor latency, throughput, GPU utilization, accuracy drift
Lifecycle Stage | Key Tools | Output
Data Curation | NeMo Curator, object storage | Clean dataset
Preprocessing | RAPIDS, cuDF, tokenizers | Training-ready tensors
Training | NeMo Trainer, NCCL, DGX cluster | Checkpoint / model weights
Evaluation | Eval frameworks, DCGM | Accuracy metrics
Optimization | TensorRT, quantization tools | Optimized inference engine
Deployment | NIM, Triton, Kubernetes | Production API endpoint
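
At the serving stage, clients then query Triton (or a NIM) over HTTP or gRPC. A sketch using Triton's Python HTTP client (the model name, tensor names, and shapes are placeholders that must match the deployed model's configuration):

```python
# Querying a model on Triton Inference Server over HTTP (illustrative sketch).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

batch = np.random.rand(1, 3, 224, 224).astype(np.float32)      # example image-shaped input
inputs = [httpclient.InferInput("input__0", batch.shape, "FP32")]
inputs[0].set_data_from_numpy(batch)

result = client.infer(model_name="resnet50_trt", inputs=inputs)
print(result.as_numpy("output__0").shape)                       # e.g. (1, 1000) class scores
```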

Memory Hooks

Six cognitive anchors to lock in the most exam-critical distinctions — ready when the pressure is on.

🧮
Hook 1
Tensor Cores = Matrix Magic
Regular CUDA Cores: general compute. Tensor Cores: purpose-built matrix multiply. Every transformer attention = matrix multiply. Tensor Cores make LLMs 10–30× faster than CUDA cores alone. Know both exist; know Tensor Cores dominate DL workloads.
📦
Hook 2
DGX vs HGX — Done vs Half-Done
DGX = Done (complete, turnkey NVIDIA server). HGX = Half-done (GPU baseboard for OEMs to complete). Same GPUs, different packaging. DGX ships with DGX OS + NVIDIA AI Enterprise. HGX is a component Dell/HPE builds around.
Hook 3
NVLink vs PCIe — 7× Faster
NVLink 4 (H100): 900 GB/s bidirectional. PCIe Gen5 ×16: ~128 GB/s bidirectional. NVLink ≈ 7× faster GPU-GPU. This is why DGX always uses the SXM (NVLink-connected) form factor, not PCIe. When you see SXM on the exam → think NVLink → think maximum training performance.
🏗️
Hook 4
CUDA Stack: Framework → Library → CUDA → GPU
Framework (PyTorch/TF) → cuDNN/cuBLAS → CUDA → GPU hardware. Every deep learning operation ultimately becomes CUDA kernel calls. You don't need to write CUDA to benefit — frameworks abstract it. But the CUDA ecosystem lock-in is why NVIDIA dominates.
🚀
Hook 5
NIM in 5 Words
Pre-built. Optimized. Containerized. API-compatible. Deploy-anywhere. → NVIDIA NIM turns any foundation model into a production microservice. Each NIM bundles model + TensorRT engine + Triton server + OpenAI-compatible REST/gRPC endpoint. Minutes, not days.
🔄
Hook 6
AI Lifecycle: DPDOD
Data collection → Preprocessing → Distributed training → Optimization → Deployment. Each step has NVIDIA tools: NeMo Curator → RAPIDS → NeMo Trainer → TensorRT → Triton/NIM. Know the tool for each stage.


Flashcards

Eight flashcards covering the most testable concepts.

CUDA Core vs Tensor Core
What is the difference?
CUDA Core: general-purpose FP/INT compute unit. Tensor Core: specialized matrix-multiply-accumulate (MMA) hardware. Tensor Cores are 10–30× faster for DL operations. H100 has both: 16,896 CUDA Cores + 4th Gen Tensor Cores.
H100 vs B200
What is the key upgrade in Blackwell?
B200 (Blackwell): adds FP4 precision + 5th Gen Tensor Cores → 20 PetaFLOPS FP4. NVLink 5 (1.8 TB/s). Designed for trillion-parameter models. H100 = Hopper (2022); B200 = Blackwell (2024). GB200 pairs two B200 GPUs with a Grace CPU via NVLink-C2C.
NVSwitch
What is its purpose inside a DGX node?
All-to-all NVLink switch chip inside DGX nodes. 4× NVSwitch in DGX H100 = 3.2 TB/s total GPU-to-GPU bandwidth. Ensures any GPU can talk to any other GPU at full NVLink speed — non-blocking fabric.
TensorRT vs Triton
What is the role of each?
TensorRT: optimizes a trained model (quantization, layer fusion, kernel tuning) → produces an engine file. Triton: serves the model (REST/gRPC API, dynamic batching, multi-model). TensorRT output is often served via Triton in production.
NCCL
What does it do in distributed training?
All-reduce, broadcast, all-gather across multiple GPUs/nodes during distributed training. Gradient averaging across 8 GPUs requires all-reduce. NCCL uses NVLink (intra-node) and RDMA InfiniBand (inter-node) for maximum throughput.
NGC Catalog
What is inside it?
GPU-optimized containers (PyTorch, TF, RAPIDS), pre-trained models (LLMs, vision models), Helm charts, SDK installers. Free to pull. Foundation for NVIDIA AI Enterprise. NIM models are also published to NGC.
NIM in One Sentence
What is NVIDIA NIM?
A pre-packaged container that bundles a foundation model + TensorRT optimization + Triton serving + OpenAI-compatible API endpoint — deploy in minutes, not days. Compatible with any NVIDIA GPU on-prem or cloud.
SXM vs PCIe GPU
When do you choose each form factor?
SXM: maximum performance, NVLink-connected, inside DGX/HGX; use for training and large-scale inference. PCIe: standard slot, lower cost, commodity servers; use for edge inference and smaller deployments.

Study Advisor

Tailored study recommendations for this topic by experience level.

Beginners

  • Start with the GPU vs CPU comparison — understand why parallel cores matter for matrix math
  • Watch NVIDIA's "What is a GPU?" explainer before diving into CUDA
  • Explore the NVIDIA NGC catalog — browse available containers and models
  • Learn the three system tiers: DGX (complete) → HGX (baseboard) → MGX (modular)
  • Focus on the overview material first — nail the key products list before the detailed concepts

