NVIDIA Solutions & GPU Architecture
This topic covers the hardware and software that powers modern AI — from the silicon (CUDA cores, Tensor Cores, HBM) to complete systems (DGX, HGX) and the software stack that makes them productive (CUDA-X, NIM, NGC). Together these form roughly 38% of the NCA-AIIO exam.
What You'll Master
GPU vs CPU Architecture
- Why thousands of small GPU cores beat a few powerful CPU cores for AI
- Memory bandwidth: HBM3 at 3.35 TB/s vs DDR5 at ~100 GB/s
- SIMT execution model (SIMD-style parallelism) — neural networks are embarrassingly parallel
- Complementary roles: CPU orchestrates, GPU computes
CUDA & Compute Primitives
- CUDA: NVIDIA's parallel computing platform (C/C++, Python, Fortran)
- CUDA Cores: general-purpose FP/INT parallel execution units
- Tensor Cores: specialized matrix-multiply-accumulate hardware for DL
- Mixed precision: FP16, BF16, TF32, INT8, FP8, FP4 support
NVIDIA GPU Families
- A100 (Ampere, 2020): 80GB HBM2e, 312 TFLOPS TF32, 3rd Gen Tensor Cores
- H100 (Hopper, 2022): 80GB HBM3, 989 TFLOPS FP16, Transformer Engine
- H200 (2024): 141GB HBM3e — memory-optimized refresh of H100
- B200 (Blackwell, 2024): FP4, 20 PetaFLOPS, NVLink 5th Gen (1.8 TB/s)
- Jetson: edge AI platform for embedded/robotics deployments
NVIDIA Systems
- DGX: complete, turnkey NVIDIA AI server (8× H100 SXM5, 640 GB GPU RAM)
- HGX: GPU baseboard for OEM server builders (Dell, HPE, Lenovo)
- MGX: modular platform — mix Grace CPU, BlueField DPU, any NVIDIA GPU
- DGX POD (20 DGX H100s) → DGX SuperPOD (32+ PODs)
NVIDIA Software Stack
- CUDA-X libraries: cuDNN, cuBLAS, NCCL, TensorRT, Triton, RAPIDS
- NVIDIA AI Enterprise: enterprise-grade optimized containers + SLA
- NIM (Inference Microservices): pre-packaged, API-compatible model containers
- NGC catalog: GPU-optimized containers, models, Helm charts
- NeMo: end-to-end LLM training and customization platform
- Run:ai: Kubernetes GPU orchestration and scheduling
AI Development Lifecycle
- Data Collection & Curation → NeMo Curator, object storage
- Preprocessing & Feature Engineering → RAPIDS (GPU-accelerated)
- Distributed Training → NeMo Trainer, data/model/pipeline parallel
- Evaluation & Validation → benchmark suites, NVIDIA eval frameworks
- Optimization → TensorRT quantization (INT8/FP8), pruning, distillation
- Deployment & Serving → NIM, Triton, Kubernetes auto-scaling
Key Products & Services
GPU vs CPU — At a Glance
| Attribute | CPU | GPU (NVIDIA H100) |
|---|---|---|
| Core count | 4–128 powerful cores | 16,896 CUDA cores |
| Optimization target | Low latency, complex control flow | High throughput, data-parallel compute |
| Memory bandwidth | ~100 GB/s (DDR5) | 3.35 TB/s (HBM3) — 33× faster |
| FP32 peak | ~1–2 TFLOPS | 67 TFLOPS (CUDA cores) |
| FP16 Tensor Core peak | N/A | 989 TFLOPS |
| Best for | OS, databases, orchestration, control flow | Matrix multiply, convolutions, AI training |
Core Concepts
Eight detailed concept cards covering every major area tested on the NCA-AIIO GPU Architecture topic.
GPU vs CPU Architecture
- CPU (Central Processing Unit): designed for sequential, low-latency tasks; few powerful cores (4–128); large cache hierarchy; optimized for complex control flow (branch prediction, out-of-order execution); best for OS tasks, databases, web servers
- GPU (Graphics Processing Unit): designed for parallel, high-throughput computation; thousands of smaller cores (H100: 16,896 CUDA cores); minimal cache per core; optimized for data-parallel tasks — same operation on many data points simultaneously
- Why GPUs dominate AI: neural network operations (matrix multiplications, convolutions) are embarrassingly parallel — a perfect fit for the GPU's SIMT (Single Instruction, Multiple Threads) execution model, NVIDIA's SIMD-style architecture
- Memory bandwidth: GPUs have far higher memory bandwidth than CPUs — H100 SXM: 3.35 TB/s HBM3 vs CPU DDR5: ~100 GB/s; critical because many AI workloads, large-model inference in particular, are memory-bandwidth bound
- Complementarity: CPUs handle control flow, data loading, orchestration; GPUs handle the heavy compute; a typical AI server has both connected via PCIe or NVLink
- FLOPS comparison: modern CPUs peak at ~1–2 TFLOPS FP32; NVIDIA H100 SXM reaches 989 TFLOPS FP16 (Tensor Core), 3,958 TFLOPS FP8 — orders of magnitude difference; the timing sketch below makes the gap concrete
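To make the throughput gap concrete, here is a minimal timing sketch (assuming PyTorch and a CUDA-capable GPU; exact numbers vary by hardware) that runs the same matrix multiply on CPU cores, on GPU CUDA cores in FP32, and on Tensor Cores in FP16:

```python
# Minimal sketch: time the same matmul on CPU vs GPU (assumes PyTorch + CUDA GPU)
import time
import torch

def bench(device, dtype=torch.float32, n=4096, iters=3):
    a = torch.randn(n, n, device=device, dtype=dtype)
    b = torch.randn(n, n, device=device, dtype=dtype)
    if device == "cuda":
        torch.cuda.synchronize()      # finish setup before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        a @ b                         # the data-parallel workload
    if device == "cuda":
        torch.cuda.synchronize()      # GPU launches are async; drain the queue
    return (time.perf_counter() - start) / iters

print(f"CPU FP32: {bench('cpu'):.4f} s/matmul")
print(f"GPU FP32: {bench('cuda'):.4f} s/matmul (CUDA cores)")
print(f"GPU FP16: {bench('cuda', torch.float16):.4f} s/matmul (Tensor Cores)")
```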
CUDA: The Foundation of NVIDIA's Ecosystem
- CUDA (Compute Unified Device Architecture): NVIDIA's parallel computing platform and programming model (launched 2006); C/C++ extensions to write GPU kernels; now supports Python, Fortran, Java via libraries
- CUDA Cores: general-purpose parallel processing units; execute floating-point and integer operations; H100 has 16,896 CUDA cores; the workhorses for any non-matrix-math computation
- Tensor Cores: specialized matrix-math accelerators introduced in Volta (V100); dramatically accelerate matrix multiply-accumulate (MMA) operations central to deep learning; support mixed precision (FP16, BF16, TF32, INT8, FP8, FP4 in Blackwell)
- CUDA ecosystem lock-in: CUDA libraries (cuDNN, cuBLAS, cuFFT, NCCL) are NVIDIA-exclusive; most AI frameworks (PyTorch, TensorFlow) are CUDA-optimized; competing platforms (AMD ROCm, Intel oneAPI) offer compatibility layers but lag in performance
- CUDA toolkit: compiler (nvcc), debugger (cuda-gdb), profilers (Nsight Systems/Compute; legacy nvprof), libraries; developers write kernels that run on the GPU (a minimal kernel example follows this list); frameworks abstract this away for most data scientists
- Importance for exam: understanding that CUDA enabled the AI revolution by giving developers a productive GPU programming model is key NCA-AIIO knowledge
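Most practitioners never write raw CUDA C++, but seeing a kernel demystifies the model. Below is a minimal sketch of a CUDA kernel written from Python via Numba's CUDA bindings (an assumption of this example; in CUDA C++ the same idea uses `__global__` functions and `<<<blocks, threads>>>` launches):

```python
# Minimal sketch of the CUDA programming model via Numba (assumes numba + CUDA GPU)
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)              # this thread's global index in the 1-D grid
    if i < out.size:              # guard: the grid may be larger than the data
        out[i] = a[i] + b[i]      # one thread handles one element (SIMT)

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)

d_a = cuda.to_device(a)                  # explicit host -> device copies
d_b = cuda.to_device(b)
d_out = cuda.device_array_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](d_a, d_b, d_out)   # launch a grid of blocks
out = d_out.copy_to_host()               # device -> host copy of the result
```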
NVIDIA GPU Families: Data Center
- A100 (Ampere, 2020): 80GB HBM2e; 312 TFLOPS TF32; NVLink 3rd gen (600 GB/s); workhorse of the cloud GPU market; still widely deployed; 3rd Gen Tensor Cores
- H100 (Hopper, 2022): 80GB HBM3; 989 TFLOPS FP16 (Tensor); NVLink 4th gen (900 GB/s/GPU); 4th Gen Tensor Cores with FP8; Transformer Engine (dynamic FP8/BF16 switching); confidential computing; SXM5 (server) and PCIe form factors
- H200 (Hopper refresh, 2024): 141GB HBM3e (75% more memory); higher bandwidth; same compute as H100; ideal for large model inference where memory is the bottleneck (see the sizing arithmetic after the spec table below)
- B200 (Blackwell, 2024): 5th Gen Tensor Cores; FP4 precision; 20 PetaFLOPS in FP4; NVLink 5th gen (1.8 TB/s); designed for trillion-parameter model training; GB200 = Grace CPU + B200 GPU on NVLink-C2C
- L4 / L40S (Ada Lovelace): PCIe; designed for inference and content creation; L4 = 24GB GDDR6, good for video AI; L40S = 48GB, excellent for LLM inference at scale
- A10G: PCIe, 24GB; AWS's variant of the A10, popular for cloud inference (G5 instances); good price/performance for medium-scale deployments
| GPU | Architecture | Memory | Peak Tensor Throughput | NVLink Gen |
|---|---|---|---|---|
| A100 | Ampere (2020) | 80GB HBM2e | 312 TFLOPS (TF32) | 3rd — 600 GB/s |
| H100 | Hopper (2022) | 80GB HBM3 | 989 TFLOPS (FP16) | 4th — 900 GB/s |
| H200 | Hopper (2024) | 141GB HBM3e | 989 TFLOPS (FP16) | 4th — 900 GB/s |
| B200 | Blackwell (2024) | 192GB HBM3e | 20 PFLOPS (FP4) | 5th — 1.8 TB/s |
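A quick back-of-the-envelope calculation shows why H200's 141GB matters. Model weights alone need parameter count × bytes per parameter; the sketch below is illustrative only (real deployments also need memory for the KV cache and activations):

```python
# Rough sizing sketch: GPU memory needed just to hold model weights
def weights_gb(params_billion, bytes_per_param):
    # 1e9 params * bytes, expressed in decimal GB
    return params_billion * bytes_per_param

print(weights_gb(70, 2))    # 70B model in FP16 -> 140 GB: > one H100, fits one H200
print(weights_gb(70, 1))    # 70B model in INT8/FP8 -> 70 GB: fits one 80GB H100
print(weights_gb(70, 0.5))  # 70B model in FP4 -> 35 GB: fits an L40S-class GPU
```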
NVIDIA Systems: DGX, HGX, MGX
- DGX: NVIDIA's complete, turnkey AI server; fully validated hardware + software (DGX OS, NVIDIA AI Enterprise); DGX H100 = 8× H100 SXM5 connected via NVLink Switch System; 640GB total GPU memory; 3.2 TB/s NVLink bandwidth; DGX B200 uses 8× B200
- DGX POD: 20 DGX H100 servers connected via Quantum-2 InfiniBand; 160 GPUs; full software stack; NVIDIA-validated networking for AI training at scale
- DGX SuperPOD: 32+ DGX PODs; thousands of GPUs; full fat-tree InfiniBand fabric; enterprise-scale AI factory
- HGX (Hyperscale GPU): building block for OEM server vendors (Dell, HPE, Lenovo); H100 HGX baseboard contains 4 or 8 H100 SXMs + NVSwitch; OEMs add CPUs, cooling, chassis; more flexible than DGX for custom deployments
- MGX (Modular GPU eXtensible): modular server architecture; mix-and-match NVIDIA GPUs, CPUs (Grace, x86), DPUs (BlueField); designed for diverse AI and HPC workloads with standardized form factor
- Key distinction: DGX = complete product, NVIDIA-sold; HGX = OEM component/baseboard; MGX = modular platform for partners
NVLink, NVSwitch, and GPU Interconnects
- NVLink: high-bandwidth, low-latency GPU-to-GPU interconnect; bypasses PCIe for GPU communication; NVLink 4 (H100): 900 GB/s bidirectional per GPU; enables GPUs to share memory and communicate directly
- NVSwitch: non-blocking switch chip for all-to-all NVLink connectivity; allows any GPU to communicate with any other at full NVLink speed; H100 DGX uses 4× NVSwitch chips providing 3.2 TB/s total
- Why this matters for AI: large model training requires frequent gradient exchange between GPUs (all-reduce); NVLink/NVSwitch bandwidth directly determines how well training scales (a rough cost model follows this list)
- PCIe vs SXM: PCIe = standard slot, lower power/bandwidth, more flexible; SXM = mezzanine form factor, higher TDP, direct NVLink to NVSwitch, maximum performance; DGX and HGX use SXM
- NVLink-C2C (Chip-to-Chip): connects a Grace CPU directly to a Hopper or Blackwell GPU with a 900 GB/s coherent link; CPU and GPU share a memory address space; used in the GH200 and GB200 superchips
- Multi-node scaling: within a node = NVLink; across nodes = InfiniBand or Ethernet; NVLink (intra-node) + InfiniBand (inter-node) enables clusters of thousands of GPUs
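Here is a rough cost model for the gradient all-reduce mentioned above, using the standard ring all-reduce communication volume of 2(N−1)/N × the buffer size (a simplification that ignores latency and overlap with compute):

```python
# Rough sketch: ring all-reduce time for exchanging gradients across N GPUs
def allreduce_seconds(grad_bytes, n_gpus, link_bytes_per_s):
    # each GPU sends and receives 2*(N-1)/N of the buffer in a ring all-reduce
    volume = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    return volume / link_bytes_per_s

grads = 7e9 * 2      # e.g. 7B parameters of FP16 gradients = 14 GB
nvlink4 = 900e9      # H100 NVLink 4: 900 GB/s per GPU
print(f"{allreduce_seconds(grads, 8, nvlink4) * 1e3:.1f} ms per step over NVLink")
```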
NVIDIA Software Stack: CUDA-X Libraries
- cuDNN: deep neural network library; optimized primitives (convolutions, pooling, normalization, activation); used by all major DL frameworks (PyTorch, TensorFlow) under the hood
- cuBLAS: optimized BLAS (Basic Linear Algebra Subprograms) for GPU; matrix multiply at the core of transformer attention layers; uses Tensor Cores automatically
- NCCL (NVIDIA Collective Communications Library): optimized all-reduce, broadcast, gather for multi-GPU and multi-node training; critical for distributed training performance; uses NVLink intra-node, RDMA InfiniBand inter-node
- TensorRT: inference optimization SDK; takes trained model, applies layer fusion, precision calibration (INT8/FP8), kernel auto-tuning; produces optimized inference engine; integrates with Triton
- Triton Inference Server: open-source, production inference serving; multi-framework (TensorRT, PyTorch, ONNX, TensorFlow); dynamic batching; multi-GPU and multi-node; REST/gRPC API
- RAPIDS: GPU-accelerated data science library suite; cuDF (pandas-like DataFrames on GPU), cuML (ML on GPU), cuGraph (graph analytics); integrates with the Python ecosystem (a cuDF sketch follows the table below)
| Library | Purpose | Used By |
|---|---|---|
| cuDNN | DNN primitives (conv, pool, norm) | PyTorch, TensorFlow (auto) |
| cuBLAS | GPU matrix multiply (BLAS) | All DL frameworks |
| NCCL | Multi-GPU collective comms | Distributed training |
| TensorRT | Inference optimization engine | Production serving |
| Triton | Inference serving framework | MLOps teams |
| RAPIDS | GPU data science (cuDF/cuML) | Data scientists |
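For a flavor of RAPIDS, here is a minimal cuDF sketch (assumes the `cudf` package and a GPU; the data is made up): the API mirrors pandas, so existing data-prep code often ports with little more than an import change.

```python
# Minimal sketch: pandas-style preprocessing on the GPU with RAPIDS cuDF
import cudf

df = cudf.DataFrame({
    "tokens": [128, 512, 2048, 96],
    "label":  [0, 1, 1, 0],
})
df["long_seq"] = df["tokens"] > 1024          # elementwise op runs on the GPU
print(df.groupby("label")["tokens"].mean())   # GPU-accelerated groupby
```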
NVIDIA AI Enterprise & NIM
- NVIDIA AI Enterprise: software platform for enterprise AI; includes optimized containers for popular AI frameworks (PyTorch, TensorFlow, JAX); certified on NVIDIA hardware; includes NIM, NeMo, RAPIDS, Triton; supported with enterprise SLA
- NIM (NVIDIA Inference Microservices): pre-packaged, optimized containers for deploying AI models; each NIM = model + TensorRT optimization + Triton server + API endpoint; exposes an OpenAI-compatible API (example after this list); deploy on-prem or in the cloud in minutes
- NVIDIA NeMo: end-to-end platform for LLM training and customization; NeMo Curator (data curation), NeMo Trainer (distributed training), NeMo Customizer (fine-tuning/LoRA), NeMo Guardrails (safety)
- NGC (NVIDIA GPU Cloud): catalog of GPU-optimized containers, pre-trained models, SDKs, Helm charts; free to access; users pull optimized containers directly; includes models from NVIDIA and partners
- NVIDIA AI Workbench: local development environment for AI/ML; GPU-accelerated; project-based; sync to cloud or cluster; simplifies environment management for data scientists
- Run:ai: Kubernetes-based GPU orchestration platform (NVIDIA acquired 2024); workload scheduling, GPU pooling, quota management; manages GPU clusters for ML teams; integrates with DGX systems
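Because each NIM exposes an OpenAI-compatible endpoint, any OpenAI client can call it. A minimal sketch (the URL, port, and model name are placeholders for whatever NIM you actually deploy):

```python
# Minimal sketch: calling a locally deployed NIM through its OpenAI-compatible API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # placeholder: your NIM's endpoint
    api_key="not-needed-for-local-nim",    # local NIMs typically ignore the key
)
resp = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",    # placeholder: the model your NIM serves
    messages=[{"role": "user", "content": "What is NVLink?"}],
)
print(resp.choices[0].message.content)
```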
AI Development and Deployment Lifecycle
- Data Collection & Curation: gather raw data; label or generate annotations; quality filtering; NVIDIA NeMo Curator; storage in data lakes (S3, HDFS, Lustre, GPFS)
- Data Preprocessing & Feature Engineering: clean, normalize, tokenize; GPU-accelerated with RAPIDS; store in efficient formats (Parquet, TFRecord, WebDataset)
- Model Training: define model architecture; configure distributed training (data parallel, model parallel, pipeline parallel); run on GPU cluster; monitor with Weights & Biases, MLflow, DCGM
- Evaluation & Validation: evaluate on held-out test set; benchmark against baselines; human evaluation for generative models; NVIDIA provides evaluation frameworks
- Optimization for Deployment: quantization (TensorRT), pruning, distillation; optimize for target hardware (GPU type, batch size, latency SLA)
- Deployment & Inference Serving: containerize with NIM or a custom Triton config; deploy via Kubernetes; auto-scale based on load; monitor latency, throughput, GPU utilization, and accuracy drift (a Triton client sketch follows the table below)
| Lifecycle Stage | Key Tools | Output |
|---|---|---|
| Data Curation | NeMo Curator, object storage | Clean dataset |
| Preprocessing | RAPIDS, cuDF, tokenizers | Training-ready tensors |
| Training | NeMo Trainer, NCCL, DGX cluster | Checkpoint / model weights |
| Evaluation | Eval frameworks, DCGM | Accuracy metrics |
| Optimization | TensorRT, quantization tools | Optimized inference engine |
| Deployment | NIM, Triton, Kubernetes | Production API endpoint |
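To close the loop on serving, here is a minimal Triton HTTP client sketch (assumes the `tritonclient` package, a running Triton server, and a loaded model; the model and tensor names are placeholders):

```python
# Minimal sketch: querying a Triton Inference Server over HTTP
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 3, 224, 224).astype(np.float32)          # fake image batch
inp = httpclient.InferInput("input__0", list(data.shape), "FP32") # placeholder tensor name
inp.set_data_from_numpy(data)

result = client.infer(model_name="resnet50", inputs=[inp])        # placeholder model name
print(result.as_numpy("output__0").shape)                         # placeholder output name
```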
Study Advisor
Study guidance tailored to your experience level.
Beginners
- Start with the GPU vs CPU comparison — understand why parallel cores matter for matrix math
- Watch NVIDIA's "What is a GPU?" explainer before diving into CUDA
- Explore the NVIDIA NGC catalog — browse available containers and models
- Learn the three system tiers: DGX (complete) → HGX (baseboard) → MGX (modular)
- Focus on the overview material first — nail the key products list before the detailed concepts
Resources
Official NVIDIA documentation and reference pages for deeper study.
- NVIDIA H100 Datasheet — official GPU specifications, memory bandwidth, Tensor Core performance figures (nvidia.com/en-us/data-center/h100/)
- NVIDIA DGX H100 Product Page — complete AI server specifications, DGX POD and SuperPOD architecture (nvidia.com/en-us/data-center/dgx-h100/)
- NVIDIA NIM Microservices — NIM overview, supported models, deployment guides (nvidia.com/en-us/ai-data-science/products/nim-microservices/)
- NGC Catalog — browse GPU-optimized containers, models, and SDK installers (catalog.ngc.nvidia.com)
- CUDA Documentation — programming guide, best practices, API reference for the CUDA toolkit (docs.nvidia.com/cuda/)
- NCA-AIIO Certification Page — official exam objectives, study resources, and registration (NVIDIA AI Infrastructure & Operations Associate)