The hardware foundation — SMs, Tensor Cores, HBM memory hierarchy, GPU generations, NVLink, and MIG partitioning. Everything you need to understand how NVIDIA GPUs power modern AI.
A GPU is a massively parallel processor designed to execute thousands of threads simultaneously. Where a CPU optimizes for low-latency serial execution (few cores, deep caches, branch prediction), a GPU optimizes for high-throughput parallel execution — trading single-thread performance for aggregate parallelism across thousands of simpler cores.
This architecture maps directly onto AI workloads. Training a neural network requires billions of multiply-accumulate operations on large matrices. A GPU spreads these across thousands of cores in parallel, achieving throughput far beyond what any CPU can deliver.
| Metric | Value | Context |
|---|---|---|
| Streaming Multiprocessors | 132 SMs | Each SM runs up to 64 warps (2048 threads) |
| CUDA Cores (FP32) | 16,896 | 128 CUDA cores × 132 SMs |
| Tensor Cores | 528 | 4 per SM × 132 SMs (4th generation) |
| HBM3 Capacity | 80 GB | H200 upgrades to HBM3e 141 GB |
| HBM3 Bandwidth | 3.35 TB/s | ~1.7× vs A100 HBM2e (2.0 TB/s) |
| TF32 TFLOPS | 989 | With structured sparsity; dense ~494 |
| FP8 TFLOPS | 3,958 | With sparsity, via Transformer Engine (introduced with Hopper) |
| NVLink 4 Bandwidth | 900 GB/s | Bidirectional; 18 links × 50 GB/s |
| MIG Instances (max) | 7 | Full hardware isolation per instance |
| TDP (SXM form factor) | 700 W | PCIe variant: 350 W |
| Characteristic | CPU (Xeon/EPYC) | GPU (H100) |
|---|---|---|
| Core count | 8–64 cores | 16,896 CUDA + 528 Tensor Cores |
| Optimization target | Low latency, serial tasks | High throughput, parallel tasks |
| Memory bandwidth | ~300–500 GB/s (DDR5) | 3.35 TB/s (HBM3) |
| Cache hierarchy | Large L1/L2/L3 caches | Smaller L2 (50 MB), relies on SM shared mem |
| Peak FP16 TFLOPS | ~2–5 TFLOPS | 1,979 TFLOPS (BF16, with sparsity) |
| AI training use | Pre/post-processing, data loading | Forward pass, backward pass, optimizer |
| Programming model | OpenMP, MPI, threads | CUDA, SIMT, warps, blocks, grids |
The Streaming Multiprocessor is the fundamental compute unit of every NVIDIA GPU. All CUDA kernels execute on SMs. A GPU's throughput scales with SM count — the H100 SXM has 132 SMs, compared to 108 in the A100 and 192 in the B200. Understanding what's inside an SM is critical for the NCP-AII exam.
A warp is a group of 32 threads that execute together. The warp is the hardware scheduling unit — the warp scheduler selects which warp to issue an instruction to each clock cycle. Because warps have independent program counters, the GPU can interleave warps to hide memory latency: while one warp waits for a memory operation to complete, the scheduler switches to another warp with ready operands.
The H100 SM has 4 warp schedulers, each capable of issuing one instruction per clock cycle. With up to 64 active warps per SM, the GPU can tolerate long memory latencies (hundreds of clock cycles) by keeping the schedulers busy with other warps. This is why maximizing occupancy (the ratio of active warps to the theoretical maximum) is crucial for performance.
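To make occupancy concrete, the CUDA runtime can report how many blocks of a given kernel fit on one SM based on its register and shared-memory usage. A minimal sketch, assuming a simple elementwise kernel (myKernel and the block size are illustrative, not part of the exam material):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel whose resource usage we want to inspect.
__global__ void myKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    const int blockSize = 256;   // threads per block (8 warps)
    int blocksPerSM = 0;

    // How many blocks of this kernel fit on one SM, given its register
    // count and shared-memory footprint?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, myKernel, blockSize, /*dynamicSMemSize=*/0);

    // H100 (compute capability 9.0) supports up to 64 resident warps per SM.
    int activeWarps = blocksPerSM * blockSize / 32;
    printf("Occupancy: %d of 64 warps per SM (%.0f%%)\n",
           activeWarps, 100.0 * activeWarps / 64);
    return 0;
}
```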
Because all 32 threads in a warp execute the same instruction in lockstep (SIMT), warp divergence occurs when threads within a warp take different branches. The hardware serializes the paths, executing the threads that take the "true" branch while masking those that take the "false" branch, then vice versa. A two-way branch can roughly halve throughput; if every lane follows a different path, the penalty can approach 32×. NCP-AII candidates should understand why data-dependent branching in GPU kernels is expensive.
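A minimal sketch of the difference, using two hypothetical kernels: in the first, even and odd lanes of the same warp branch differently and both paths execute serially; in the second, the condition is uniform across each warp, so no serialization occurs.

```cpp
// Divergent: lanes of one warp take different paths, so the hardware runs
// the "true" path with odd lanes masked, then the "false" path with even
// lanes masked.
__global__ void divergentBranch(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        data[i] = data[i] * 2.0f;   // even lanes active
    else
        data[i] = data[i] + 1.0f;   // odd lanes active
}

// Warp-uniform: the condition depends only on the warp index, so all 32
// lanes of a warp agree and the branch costs nothing extra.
__global__ void uniformBranch(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0)
        data[i] = data[i] * 2.0f;
    else
        data[i] = data[i] + 1.0f;
}
```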
| Core Type | Operation | Precision | AI Relevance | H100 Count |
|---|---|---|---|---|
| CUDA Cores | Scalar FP32 / INT32 arithmetic (add, mul, FMA) | FP32, FP64, INT32 | Activation functions, elementwise ops, normalization | 16,896 |
| Tensor Cores (4th Gen) | Matrix multiply-accumulate: D = A×B + C (GEMM) | FP8, BF16, FP16, TF32, FP64 | Dense matmul in forward/backward pass — primary AI workload | 528 |
| RT Cores (3rd Gen) | Hardware ray-triangle intersection, BVH traversal | Fixed-function | Not used for AI training/inference — graphics only | N/A (H100) |
The H100's Tensor Cores are the engine behind its massive throughput advantage over A100. Key improvements over 3rd-gen (A100):
- Support for FP8 precision, used by the Transformer Engine for roughly double the FP16/BF16 matrix throughput
- The wgmma (Warpgroup Matrix Multiply-Accumulate) instruction, issued asynchronously by a warpgroup of four cooperating warps

GPU memory is organized in a hierarchy that trades capacity for latency and bandwidth. Understanding this hierarchy is essential for both system design (NCP-AII) and performance optimization. The H100's HBM3 delivers 3.35 TB/s, which is the dominant reason GPUs outperform CPUs for AI.
| Level | Capacity / Reach | Bandwidth | Scope | Key Use |
|---|---|---|---|---|
| Registers | 256 KB/SM | ~10+ TB/s (est.) | Per-thread | Active computation values |
| Shared Memory | Up to 228 KB/SM | ~10–20 TB/s (est.) | Per-SM, per-threadblock | Tile caching, reductions |
| L2 Cache | 50 MB | ~3.5–7 TB/s (internal) | All SMs | Repeated data reuse across SMs |
| HBM3 | 80 GB | 3.35 TB/s | Full GPU | Model weights, activations |
| NVLink 4 | Peer GPU | 900 GB/s | Intra-node GPUs | AllReduce, P2P transfers |
| PCIe 5.0 x16 | Host RAM | 128 GB/s | Host-GPU | Data loading, checkpointing |
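The on-chip levels of this table can be read back from the device at runtime. A minimal sketch using the standard CUDA device-property query (the field selection is illustrative):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // properties of GPU 0

    printf("%s\n", prop.name);
    printf("SMs:                   %d\n",     prop.multiProcessorCount);
    printf("Registers per SM:      %d KB\n",  prop.regsPerMultiprocessor * 4 / 1024);
    printf("Shared memory per SM:  %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    printf("L2 cache:              %d MB\n",  prop.l2CacheSize / (1024 * 1024));
    printf("Global memory (HBM):   %.0f GB\n", prop.totalGlobalMem / 1e9);
    return 0;
}
```

On an H100 these values line up with the Registers, Shared Memory, L2, and HBM3 rows above.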
The roofline model determines whether a kernel is compute-bound or memory-bound by comparing its arithmetic intensity against the machine's peak performance / bandwidth ratio (the "ridge point").
| Operation | Arithmetic Intensity | Bound |
|---|---|---|
| Large GEMM (4096×4096) | >500 FLOPs/B | Compute-bound (Tensor Core) |
| Batched GEMM (small matrices) | 50–200 FLOPs/B | Often memory-bound |
| Elementwise (add, ReLU, scale) | ~1 FLOPs/B | Memory-bound |
| Layer Normalization | ~3–5 FLOPs/B | Memory-bound |
| Softmax | ~5–10 FLOPs/B | Memory-bound |
| Attention (FlashAttention) | ~10–50 FLOPs/B | Mixed (optimized via tiling) |
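As a worked example of the ridge point, a small sketch using the dense BF16 Tensor Core rate (roughly half the with-sparsity figure quoted elsewhere in this section) and the HBM3 bandwidth:

```cpp
#include <cstdio>

int main() {
    // Assumed H100 SXM figures: ~989 dense BF16 Tensor-Core TFLOPS and
    // 3.35 TB/s of HBM3 bandwidth.
    const double peak_tflops = 989.0;                    // 10^12 FLOP/s
    const double hbm_tbps    = 3.35;                     // 10^12 B/s
    const double ridge       = peak_tflops / hbm_tbps;   // FLOPs per byte

    // Kernels with arithmetic intensity below the ridge point are limited
    // by HBM bandwidth; above it, Tensor Core throughput is the limit.
    printf("H100 ridge point: ~%.0f FLOPs/byte\n", ridge);   // ~295
    printf("Elementwise ops (~1 FLOP/B): memory-bound\n");
    printf("Large GEMM (>500 FLOP/B):    compute-bound\n");
    return 0;
}
```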
Shared memory is a programmer-managed cache (at L1 speed) within each SM. High-performance GPU kernels like cuBLAS, FlashAttention, and CUTLASS use shared memory to cache tiles from HBM, perform computation on them, then write results back, dramatically reducing the number of HBM accesses. The H100 provides up to 228 KB of shared memory per SM; a kernel must opt in to use more than the default 48 KB of dynamic shared memory per block:
// Opt in to large dynamic shared memory for an H100 kernel (sm_90).
// The SM has 228 KB of shared memory, but a single thread block can address at most 227 KB.
cudaFuncSetAttribute(myKernel,
    cudaFuncAttributeMaxDynamicSharedMemorySize, 227 * 1024);
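As a concrete illustration of the load, compute, write-back tiling pattern, here is a minimal block-level reduction sketch (kernel name and launch configuration are illustrative, not taken from any library):

```cpp
// Each block stages its slice of the input in shared memory, reduces it
// there, and writes a single partial sum back to HBM.
__global__ void blockSum(const float* __restrict__ in,
                         float* __restrict__ partial, int n) {
    extern __shared__ float tile[];            // dynamic shared memory
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;        // 1) load tile: HBM -> shared
    __syncthreads();

    // 2) tree reduction entirely in shared memory (no HBM traffic)
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    if (tid == 0) partial[blockIdx.x] = tile[0];   // 3) one value back to HBM
}

// Launch example (power-of-two block size):
// blockSum<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_partial, n);
```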
Each GPU generation from NVIDIA represents a major architectural leap for AI. The NCP-AII exam tests your knowledge of what changed between generations and the quantitative improvements. Know these numbers cold.
| Feature | Ampere A100 SXM | Hopper H100 SXM5 | Blackwell B200 SXM |
|---|---|---|---|
| Process Node | TSMC 7nm (GA100) | TSMC 4N (GH100) | TSMC 4NP (Blackwell) |
| Transistors | 54.2 billion | 80 billion | 208 billion |
| SM Count | 108 | 132 | 192 |
| CUDA Cores | 6,912 | 16,896 | ~23,040 |
| Tensor Core Gen | 3rd Gen | 4th Gen | 5th Gen |
| HBM Type | HBM2e | HBM3 | HBM3e |
| HBM Capacity | 80 GB | 80 GB | 192 GB |
| HBM Bandwidth | 2.0 TB/s | 3.35 TB/s | 8.0 TB/s |
| TF32 TFLOPS (with sparsity) | 312 | 989 | ~2,200 |
| FP16/BF16 TFLOPS (with sparsity) | 624 | 1,979 | ~4,500 |
| FP8 TFLOPS (with sparsity) | — | 3,958 | ~9,000 |
| FP4 TFLOPS (with sparsity) | — | — | ~18,000 |
| NVLink Generation | NVLink 3 | NVLink 4 | NVLink 5 |
| NVLink BW/GPU | 600 GB/s | 900 GB/s | 1.8 TB/s |
| TDP (SXM) | 400 W | 700 W | 1,000 W |
| MIG (max instances) | 7 | 7 | 7 |
The Transformer Engine is one of the most exam-critical features of H100. Understanding how it works distinguishes expert candidates from those who just memorize numbers.
Traditional mixed-precision training (AMP) uses FP16/BF16 for the forward/backward passes with a single global loss-scaling factor. The Transformer Engine goes further by using FP8 (8-bit) computation with per-tensor dynamic scaling, automatically selecting the right scale factor for each tensor to prevent overflow/underflow, without programmer intervention.
| Approach | Precision | Scaling | TFLOPS |
|---|---|---|---|
| FP32 baseline | FP32 | None | 67 |
| AMP (automatic mixed precision) | FP16/BF16 | Global loss scaling | 1,979 |
| Transformer Engine (H100) | FP8 + BF16 | Per-tensor dynamic scaling | 3,958 |
The Transformer Engine is programmed through the transformer_engine Python library. Wrapping standard layers with TE equivalents (e.g., te.Linear, te.LayerNorm) enables FP8 compute automatically, with no manual scaling code.
NVLink is NVIDIA's proprietary high-speed GPU-to-GPU interconnect, providing far more bandwidth than PCIe (900 GB/s per H100 versus 128 GB/s for a PCIe 5.0 x16 link). It is the critical enabler of multi-GPU AI training within a single server, allowing model parallelism and fast AllReduce communication without leaving the server chassis.
| NVLink Gen | GPU | Links | BW per GPU (BiDir) | NVSwitch Gen | Aggregate Switching BW |
|---|---|---|---|---|---|
| NVLink 3 | A100 | 12 | 600 GB/s | 2nd Gen | 7.2 TB/s |
| NVLink 4 | H100 | 18 | 900 GB/s | 3rd Gen | 57.6 TB/s |
| NVLink 5 | B200 | 18 | 1.8 TB/s | 4th Gen | — |
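To see what the per-GPU numbers mean in practice, here is a minimal sketch of enabling peer-to-peer access between two GPUs in the same node and copying a buffer directly over NVLink (device indices and buffer size are illustrative):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    // Can GPU 0 reach GPU 1 directly (NVLink or PCIe P2P)?
    cudaDeviceCanAccessPeer(&canAccess, /*device=*/0, /*peerDevice=*/1);
    if (!canAccess) { printf("No P2P path between GPU 0 and GPU 1\n"); return 1; }

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(/*peerDevice=*/1, 0);   // map GPU 1's memory into GPU 0

    size_t bytes = 64 << 20;                           // 64 MiB example buffer
    void *src = nullptr, *dst = nullptr;
    cudaSetDevice(1); cudaMalloc(&src, bytes);
    cudaSetDevice(0); cudaMalloc(&dst, bytes);

    // Copy GPU 1 -> GPU 0 directly over NVLink, never staging through host memory.
    cudaMemcpyPeerAsync(dst, /*dstDevice=*/0, src, /*srcDevice=*/1, bytes, 0);
    cudaDeviceSynchronize();
    return 0;
}
```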
NVIDIA's NVLink Switch System (introduced with Hopper) extends NVLink connectivity across multiple servers using NVLink Switch chips in a separate fabric. Up to 256 H100 GPUs can be joined in a single NVLink domain (576 GPUs with Blackwell's NVLink 5), enabling scale-up configurations with GPU-speed bandwidth across nodes and blurring the line between scale-up and scale-out.
MIG (Multi-Instance GPU), introduced with Ampere, enables a single physical GPU to be partitioned into up to seven independent, hardware-isolated instances. Unlike software-based sharing (MPS or time-slicing), MIG creates actual hardware boundaries: each instance gets its own slice of every physical resource.
| Resource | MIG Behavior | Why It Matters |
|---|---|---|
| Streaming Multiprocessors (SMs) | Dedicated SM partition (GPC slice) | No compute interference between instances |
| HBM Memory | Dedicated memory slice | Private address space, no cross-instance reads |
| L2 Cache | Dedicated L2 slice | Prevents cache side-channel attacks |
| Memory Controllers | Dedicated memory controllers | Guaranteed memory bandwidth per instance |
| NVLink / P2P | Not available to MIG instances | GPU-to-GPU P2P (NVLink or PCIe) is not supported while MIG is enabled |
| Feature | MIG | MPS (Multi-Process Service) | Time-Slicing |
|---|---|---|---|
| Isolation level | Hardware (strongest) | Process-level (software) | None |
| Memory protection | Full (dedicated HBM) | Partial (isolated address spaces, no hardware partitioning) | None |
| Fault containment | Yes — per instance | Partial — MPS server crash kills all | No |
| QoS guarantee | Yes — bandwidth guaranteed | Best-effort | Best-effort |
| GPU utilization | Fixed slices (may waste) | High — shares idle resources | High |
| Use case | Multi-tenant inference, CI/CD | Co-running ML jobs, research clusters | Dev/test, low-SLA inference |
| GPU support | A100, H100, B200 | Volta+ | All |
Key commands and API calls to know:

- Occupancy is limited by per-thread register usage (cap it with --maxrregcount), shared memory allocation, and block size.
- Set the cuBLAS math mode (cublasMath_t) to CUBLAS_TF32_TENSOR_OP_MATH to force Tensor Core usage for FP32 GEMMs.
- Use nvcc --ptxas-options=-v to check per-thread register count.
- Shared memory is declared with __shared__. Tiling strategy: load HBM block → process in shared mem → write back. Used by FlashAttention, cuBLAS, CUTLASS.
- Opt in to large dynamic shared memory with cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, 227*1024) — the per-block maximum on an SM with 228 KB.
- Reserve part of L2 for persisting accesses with cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size).
- Profile memory and FP traffic with ncu --metrics l1tex__t_bytes,smsp__sass_thread_inst_executed_op_fadd_pred_on.
- P2P copies use cudaMemcpyPeerAsync and are far faster than staging through the CPU.
- Inspect topology with nvidia-smi topo -m. Look for NV18 (18 NVLinks) in the GPU-to-GPU matrix, vs SYS (PCIe cross-socket) which has far lower bandwidth.
- Enable P2P with cudaDeviceEnablePeerAccess(peerDevice, 0); check support with cudaDeviceCanAccessPeer(&canAccess, dev0, dev1). Over NVLink, P2P is always supported.
- Set NCCL_P2P_LEVEL=NVL to ensure NCCL uses NVLink paths.
- Enable MIG mode with sudo nvidia-smi -i 0 -mig 1. Requires a GPU reset or reboot. Confirm with nvidia-smi -L.
- Create GPU instances with sudo nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb. Each gets 1/7 of the SMs, 10 GB of HBM, and dedicated L2 and memory controllers.
- Create compute instances with sudo nvidia-smi mig -cci 1c.1g.10gb. The CI defines the number of compute (Tensor Core) slices visible to the application.
- List instances with nvidia-smi mig -lgi and nvidia-smi mig -lci. Each CI gets a unique MIG device UUID, visible as a separate GPU to CUDA applications.
- In Kubernetes, request the nvidia.com/mig-1g.10gb resource type in the pod spec; the device plugin handles assignment. Set MIG_PARTED_DEPLOYMENT_TYPE in the NVIDIA GPU Operator.
- A cudaErrorIllegalAddress in one MIG instance raises Xid 74 only for that instance. Other instances continue running. This is the key MIG vs MPS isolation difference.

| Concept | Mnemonic / Hook |
|---|---|
| Warp size = 32 | "32 soldiers march in SIMT lockstep — always 32, every generation" |
| H100 SMs = 132 | "A100 had 108 SMs; Hopper adds 24 more → 132" |
| H100 HBM3 = 3.35 TB/s | "Three Point Three Five — H-100 needs speed to feed 132 SMs" |
| NVLink 4 = 900 GB/s | "18 links × 50 GB/s = 900 = the magic number for H100 fabric" |
| MIG max = 7 | "7 MIG slices on H100 — like 7 days in a week, full isolation guaranteed" |
| Memory hierarchy order | "Really Short L-two HBMs": Registers, Shared, L2, HBM, then out to the host over PCIe |
| Tensor Core precision order | "Four, Eight, Sixteen/BF, TF32, FP64 — go up in bits, lose throughput" |
| Transformer Engine = FP8 auto | "TE = auto-FP8 with dynamic scale — no code, pure speed" |