🔲 NCP-AII  ·  Topic 1  ·  GPU Architecture

GPU Architecture & Compute Fundamentals

The hardware foundation — SMs, Tensor Cores, HBM memory hierarchy, GPU generations, NVLink, and MIG partitioning. Everything you need to understand how NVIDIA GPUs power modern AI.

7 Core Topics · 10 Quiz Questions · 132 H100 SMs · 3.35 TB/s HBM3

🏗️ The GPU Compute Model

A GPU is a massively parallel processor designed to execute thousands of threads simultaneously. Where a CPU optimizes for low-latency serial execution (few cores, deep caches, branch prediction), a GPU optimizes for high-throughput parallel execution — trading single-thread performance for aggregate parallelism across thousands of simpler cores.

This architecture maps perfectly onto AI workloads. Training a neural network requires billions of multiply-accumulate operations on large matrices. A GPU can pipeline these across thousands of cores simultaneously, achieving computational throughput that would be impossible on a CPU.

🔑 The SIMT Execution Model: NVIDIA GPUs execute threads using SIMT — Single Instruction, Multiple Threads. Groups of 32 threads called warps execute the same instruction in lockstep on different data. This is the fundamental unit of GPU parallelism.
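
To make the SIMT model concrete, here is a minimal CUDA sketch (illustrative only; the kernel name and sizes are placeholders): every thread runs the same kernel body on a different element, and the hardware schedules each block's threads as warps of 32.

#include <cuda_runtime.h>

// Every thread executes the same instruction stream on different data (SIMT).
// A block of 256 threads is scheduled by the SM as 8 warps of 32 threads each.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        data[i] *= factor;                           // same instruction, different element
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);     // grid of blocks, 8 warps per block
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}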
  • ⚙️ SM Architecture: Streaming Multiprocessors, warps, CUDA & Tensor Cores, occupancy
  • 🧠 Memory Hierarchy: Registers → Shared Mem → L2 → HBM, bandwidth & roofline model
  • 📈 GPU Generations: Ampere (A100) → Hopper (H100) → Blackwell (B200) key innovations
  • 🔗 NVLink & NVSwitch: Intra-node GPU interconnect, 900 GB/s, DGX H100 topology
  • 🔀 MIG Partitioning: Hardware-isolated GPU slices (dedicated SMs, HBM, L2, memory controllers)

📊 H100 SXM5 Key Metrics — Quick Reference

| Metric | Value | Context |
|---|---|---|
| Streaming Multiprocessors | 132 SMs | Each SM runs up to 64 warps (2,048 threads) |
| CUDA Cores (FP32) | 16,896 | 128 CUDA cores × 132 SMs |
| Tensor Cores | 528 | 4 per SM × 132 SMs (4th generation) |
| HBM3 Capacity | 80 GB | H200 upgrades to 141 GB HBM3e |
| HBM3 Bandwidth | 3.35 TB/s | ~1.7× vs A100 HBM2e (2.0 TB/s) |
| TF32 TFLOPS | 989 | Dense; sparse 1,979 TFLOPS |
| FP8 TFLOPS | 3,958 | Via Transformer Engine (H100 only) |
| NVLink 4 Bandwidth | 900 GB/s | Bidirectional; 18 links × 50 GB/s |
| MIG Instances (max) | 7 | Full hardware isolation per instance |
| TDP (SXM form factor) | 700 W | PCIe variant: 350 W |

🆚 GPU vs CPU: Why GPUs Win at AI

| Characteristic | CPU (Xeon/EPYC) | GPU (H100) |
|---|---|---|
| Core count | 8–64 cores | 16,896 CUDA + 528 Tensor Cores |
| Optimization target | Low latency, serial tasks | High throughput, parallel tasks |
| Memory bandwidth | ~300–500 GB/s (DDR5) | 3.35 TB/s (HBM3) |
| Cache hierarchy | Large L1/L2/L3 caches | Smaller L2 (50 MB), relies on SM shared mem |
| Peak FP16 TFLOPS | ~2–5 TFLOPS | 1,979 TFLOPS (BF16) |
| AI training use | Pre/post-processing, data loading | Forward pass, backward pass, optimizer |
| Programming model | OpenMP, MPI, threads | CUDA, SIMT, warps, blocks, grids |

⚙️ The Streaming Multiprocessor (SM)

The Streaming Multiprocessor is the fundamental compute unit of every NVIDIA GPU. All CUDA kernels execute on SMs. A GPU's throughput scales with SM count — the H100 SXM has 132 SMs, compared to 108 in the A100 and 192 in the B200. Understanding what's inside an SM is critical for the NCP-AII exam.
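
These per-GPU numbers can be confirmed at runtime with the standard CUDA device-properties API. A minimal sketch follows; the reference values in the comments are the H100 SXM5 figures quoted in this section.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("SMs:                   %d\n", p.multiProcessorCount);          // 132 on H100 SXM5
    printf("Warp size:             %d\n", p.warpSize);                     // always 32
    printf("Max threads per SM:    %d\n", p.maxThreadsPerMultiProcessor);  // 2,048 -> 64 warps
    printf("Max threads per block: %d\n", p.maxThreadsPerBlock);           // 1,024
    printf("Registers per SM:      %d\n", p.regsPerMultiprocessor);        // 65,536
    printf("Shared mem per SM:     %zu KB\n", p.sharedMemPerMultiprocessor / 1024); // 228 KB
    printf("L2 cache:              %d MB\n", p.l2CacheSize / (1024 * 1024));        // 50 MB
    return 0;
}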

H100 Streaming Multiprocessor (SM) — Internal Structure
| Unit | Per SM | Role |
|---|---|---|
| CUDA Cores (FP32) | 128 | FP32 multiply-accumulate |
| INT32 Cores | 64 | Integer arithmetic |
| Tensor Cores (4th Gen) | 4 | FP8/BF16/FP16/TF32 GEMM |
| Warp Schedulers | 4 | Select ready warps each cycle |
| Register File | 256 KB | 65,536 × 32-bit registers |
| L1 / Shared Memory | 256 KB | Up to 228 KB configurable as shared memory |

Per-SM limits: 64 active warps (2,048 threads) · max 32 blocks per SM · 1,024 threads per block

🔄 Warps and SIMT Execution

A warp is a group of 32 threads that execute together. The warp is the hardware scheduling unit — the warp scheduler selects which warp to issue an instruction to each clock cycle. Because warps have independent program counters, the GPU can interleave warps to hide memory latency: while one warp waits for a memory operation to complete, the scheduler switches to another warp with ready operands.

Warp Scheduling and Latency Hiding

The H100 SM has 4 warp schedulers, each capable of issuing one instruction per clock cycle. With up to 64 active warps per SM, the GPU can tolerate long memory latencies (hundreds of clock cycles) by keeping the schedulers busy with other warps. This is why maximizing occupancy (the ratio of active warps to the theoretical maximum) is crucial for performance.

📐 Occupancy Formula: Occupancy = Active Warps ÷ Max Warps per SM. H100 maximum = 64 warps per SM. Occupancy is limited by: (1) register usage per thread, (2) shared memory usage per block, (3) block size. Low occupancy = poor latency hiding = poor throughput.
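
The same calculation can be done programmatically with the CUDA occupancy API. Below is a minimal sketch (myKernel and the block size are placeholders) that reports how many of the 64 warps per SM a given launch configuration keeps resident.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float* x) { x[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }

int main() {
    const int blockSize = 256;            // 8 warps per block
    const size_t dynamicSmem = 0;         // dynamic shared memory per block, in bytes

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, dynamicSmem);

    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    int maxWarpsPerSM    = p.maxThreadsPerMultiProcessor / p.warpSize;  // 64 on H100
    int activeWarpsPerSM = blocksPerSM * blockSize / p.warpSize;

    printf("Active blocks per SM: %d\n", blocksPerSM);
    printf("Occupancy: %.0f%% (%d of %d warps)\n",
           100.0 * activeWarpsPerSM / maxWarpsPerSM, activeWarpsPerSM, maxWarpsPerSM);
    return 0;
}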

Warp Divergence

Because all 32 threads in a warp execute the same instruction in lockstep (SIMT), warp divergence occurs when threads take different branches. The hardware serializes both paths, executing threads that take the "true" branch while masking threads that take the "false" branch, then vice versa. This can halve throughput in the worst case. NCP-AII candidates should understand why data-dependent branching in GPU kernels is expensive.

⚠️ Exam tip: Warp divergence does NOT occur between different warps — only within a warp. Divergence between warps is free since warps execute independently.
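
A hypothetical pair of kernels illustrates the difference. In the first, the branch depends on per-element data, so threads within one warp disagree and both paths execute under masks; in the second, the branch depends only on the warp index, so every warp takes a single path. (For a branch body this short the compiler may simply predicate it; the cost becomes visible as the divergent regions grow.)

// Divergent: threads in the same warp take different branches depending on
// their data, so the warp executes both paths serially under active masks.
__global__ void divergent(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0.0f)              // data-dependent: mixes true/false within a warp
        out[i] = in[i] * 2.0f;
    else
        out[i] = 0.0f;
}

// Uniform: the branch depends only on the warp index, so all 32 threads of a
// given warp agree and no intra-warp serialization occurs.
__global__ void warpUniform(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int warpId = i / 32;
    if (warpId % 2 == 0)           // whole warp takes the same path
        out[i] = in[i] * 2.0f;
    else
        out[i] = 0.0f;
}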

🎯 Core Types: CUDA vs Tensor vs RT Cores

| Core Type | Operation | Precision | AI Relevance | H100 Count |
|---|---|---|---|---|
| CUDA Cores | Scalar FP32 / INT32 arithmetic (add, mul, FMA) | FP32, FP64, INT32 | Activation functions, elementwise ops, normalization | 16,896 |
| Tensor Cores (4th Gen) | Matrix multiply-accumulate: D = A×B + C (GEMM) | FP8, BF16, FP16, TF32, FP64 | Dense matmul in forward/backward pass — primary AI workload | 528 |
| RT Cores | Hardware ray-triangle intersection, BVH traversal | Fixed-function | Not used for AI training/inference — graphics only | N/A (none on H100) |

4th-Generation Tensor Core (H100)

The H100's Tensor Cores are the engine behind its massive throughput advantage over A100. Key improvements over 3rd-gen (A100):

  • FP8 support — native 8-bit floating-point (E4M3 and E5M2 formats) for training and inference, enabling the Transformer Engine to cut model size and double throughput vs FP16
  • Asynchronous execution — Tensor Core operations can overlap with other SM operations using the wgmma (Warpgroup Matrix Multiply-Accumulate) instruction
  • Larger tile sizes — H100 operates on 16×16×16 or 16×8×16 tiles in a single instruction vs 8×8×4 on A100
  • TF32 speedup — 989 TF32 TFLOPS on H100 vs 312 on A100 (3.2×)
🔑 Tensor Core Precision Hierarchy: FP4 (B200 only) > FP8 (3,958 TFLOPS) > BF16/FP16 (1,979 TFLOPS) > TF32 (989 TFLOPS) > FP64 (67 TFLOPS). Higher numerical precision means lower throughput but higher accuracy.

🧠 The GPU Memory Hierarchy

GPU memory is organized in a hierarchy that trades capacity for latency and bandwidth. Understanding this hierarchy is essential for both system design (NCP-AII) and performance optimization. The H100's HBM3 delivers 3.35 TB/s — the dominant reason GPUs outperform CPUs for AI.

  • 🔴 Register File: 256 KB per SM · 65,536 registers/SM · max 255 regs/thread · ~1 cycle (fastest)
  • 🟠 Shared Memory / L1 Cache: up to 228 KB shared mem per SM · configurable split with L1 · ~5–32 cycles · per SM
  • 🟡 L2 Cache: 50 MB total · shared across all 132 SMs · ~100–300 cycles
  • 🟢 HBM3 (Device Memory): 80 GB capacity · 3.35 TB/s bandwidth · 5 HBM3 stacks · ~400–700 cycles · on-package
  • 🔵 PCIe / NVLink (Host/Peer): PCIe 5.0: 128 GB/s · NVLink 4: 900 GB/s · host DRAM: ~200 GB/s · microsecond latency · off-GPU

📏 Bandwidth Numbers — Exam-Critical

| Level | Capacity | Bandwidth | Scope | Key Use |
|---|---|---|---|---|
| Registers | 256 KB/SM | ~10+ TB/s (est.) | Per-thread | Active computation values |
| Shared Memory | Up to 228 KB/SM | ~10–20 TB/s (est.) | Per-SM, per-thread block | Tile caching, reductions |
| L2 Cache | 50 MB | ~3.5–7 TB/s (internal) | All SMs | Repeated data reuse across SMs |
| HBM3 | 80 GB | 3.35 TB/s | Full GPU | Model weights, activations |
| NVLink 4 | Peer GPU | 900 GB/s | Intra-node GPUs | AllReduce, P2P transfers |
| PCIe 5.0 x16 | Host RAM | 128 GB/s | Host–GPU | Data loading, checkpointing |
📌 Key ratio: NVLink 4 (900 GB/s) is ~7× faster than PCIe 5.0 x16 (128 GB/s). This is why multi-GPU AI clusters use NVSwitch topologies rather than PCIe for GPU-to-GPU communication.
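
A rough way to see the HBM figure on real hardware is to time a large device-to-device copy (a simplistic sketch; a STREAM-style benchmark or Nsight Compute gives more rigorous numbers). Each copied byte is read once and written once, so effective bandwidth is 2 × bytes ÷ time.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = size_t(1) << 30;                    // 1 GiB per buffer
    float *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);   // warm-up
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double tbps = 2.0 * bytes / (ms / 1000.0) / 1e12;        // read + write traffic
    printf("Effective device-to-device bandwidth: %.2f TB/s\n", tbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaFree(dst);
    return 0;
}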

📈 The Roofline Model & Arithmetic Intensity

The roofline model determines whether a kernel is compute-bound or memory-bound by comparing its arithmetic intensity against the machine's peak performance / bandwidth ratio (the "ridge point").

📐 Arithmetic Intensity (AI) = FLOPs ÷ Bytes Accessed. High AI (e.g., large GEMM) → compute-bound → limited by Tensor Core throughput. Low AI (e.g., elementwise add, layer norm) → memory-bound → limited by HBM bandwidth.
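
The ridge-point arithmetic is worth doing once by hand. A small host-side sketch, assuming BF16 (2-byte) operands and that each matrix of a square GEMM is touched exactly once (so FLOPs = 2N³ and bytes ≈ 3N² × 2), using the H100 figures listed in the landmarks below:

#include <cstdio>

int main() {
    const double peak_tflops = 1979.0;   // BF16 Tensor Core peak (from the text)
    const double hbm_tbps    = 3.35;     // HBM3 bandwidth
    const double ridge = peak_tflops * 1e12 / (hbm_tbps * 1e12);   // ~591 FLOPs/B

    // Square GEMM C = A*B with BF16 operands, each matrix read or written once.
    for (double n : {512.0, 4096.0}) {
        double flops = 2.0 * n * n * n;
        double bytes = 3.0 * n * n * 2.0;
        double ai    = flops / bytes;
        printf("N=%5.0f  AI=%7.1f FLOPs/B  ->  %s\n", n, ai,
               ai > ridge ? "compute-bound" : "memory-bound");
    }
    printf("Ridge point: %.0f FLOPs/B\n", ridge);
    return 0;
}

With these assumptions a 512×512 GEMM lands near ~170 FLOPs/B (memory-bound), while a 4096×4096 GEMM lands near ~1,365 FLOPs/B (compute-bound), consistent with the per-operation table below.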

H100 Roofline Landmarks

  • Peak BF16 (dense, Tensor Core): 1,979 TFLOPS
  • Peak TF32 (dense, Tensor Core): 989 TFLOPS
  • Peak FP32 (CUDA Cores): 67 TFLOPS
  • HBM3 memory bandwidth: 3.35 TB/s
  • Ridge point (BF16): 1,979 TFLOPS ÷ 3.35 TB/s ≈ 591 FLOPs/Byte
  • Ridge point (FP32): 67 TFLOPS ÷ 3.35 TB/s ≈ 20 FLOPs/Byte

Practical Arithmetic Intensity by Operation

| Operation | Arithmetic Intensity | Bound |
|---|---|---|
| Large GEMM (4096×4096) | >500 FLOPs/B | Compute-bound (Tensor Core) |
| Batched GEMM (small matrices) | 50–200 FLOPs/B | Often memory-bound |
| Elementwise (add, ReLU, scale) | ~1 FLOP/B | Memory-bound |
| Layer Normalization | ~3–5 FLOPs/B | Memory-bound |
| Softmax | ~5–10 FLOPs/B | Memory-bound |
| Attention (FlashAttention) | ~10–50 FLOPs/B | Mixed (optimized via tiling) |

Shared Memory Optimization

Shared memory is a programmer-managed cache (L1-speed) within each SM. High-performance GPU kernels like cuBLAS, FlashAttention, and CUTLASS use shared memory to cache tile blocks from HBM, perform computation, then write results back — dramatically reducing the number of HBM accesses. The H100 provides up to 228 KB of shared memory per SM, of which up to 227 KB can be claimed by a single thread block via the opt-in attribute:

// Opt in to the maximum dynamic shared memory per block on the H100.
// The SM has 228 KB of combined L1/shared storage, but 1 KB is reserved,
// so the per-block opt-in limit is 227 KB on compute capability 9.0.
cudaFuncSetAttribute(myKernel,
  cudaFuncAttributeMaxDynamicSharedMemorySize, 227 * 1024);
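
A minimal tiled matrix-multiply sketch shows the pattern (illustrative only; production kernels such as cuBLAS or CUTLASS add double-buffering, vectorized loads, and Tensor Core MMA instructions). Each block stages a TILE×TILE sub-matrix of A and B in shared memory, so every element is fetched from HBM once per tile rather than once per multiply. Assumes square row-major N×N matrices with N divisible by TILE.

#define TILE 32

// Launch as: dim3 grid(N/TILE, N/TILE), block(TILE, TILE);
__global__ void tiledMatMul(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];     // tile of A staged in shared memory
    __shared__ float Bs[TILE][TILE];     // tile of B staged in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                              // tile fully loaded

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                              // done with this tile
    }
    C[row * N + col] = acc;
}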

📈 GPU Generations: Ampere → Hopper → Blackwell

Each GPU generation from NVIDIA represents a major architectural leap for AI. The NCP-AII exam tests your knowledge of what changed between generations and the quantitative improvements. Know these numbers cold.

| Feature | Ampere A100 SXM | Hopper H100 SXM5 | Blackwell B200 SXM |
|---|---|---|---|
| Process Node | TSMC 7nm | TSMC 4N (GH100) | TSMC 4NP |
| Transistors | 54.2 billion | 80 billion | 208 billion |
| SM Count | 108 | 132 | 192 |
| CUDA Cores | 6,912 | 16,896 | ~23,040 |
| Tensor Core Gen | 3rd Gen | 4th Gen | 5th Gen |
| HBM Type | HBM2e | HBM3 | HBM3e |
| HBM Capacity | 80 GB | 80 GB | 192 GB |
| HBM Bandwidth | 2.0 TB/s | 3.35 TB/s | 8.0 TB/s |
| TF32 TFLOPS | 312 | 989 | ~2,200 |
| FP16/BF16 TFLOPS | 312 | 1,979 | ~4,500 |
| FP8 TFLOPS | — | 3,958 | ~9,000 |
| FP4 TFLOPS | — | — | ~18,000 |
| NVLink Generation | NVLink 3 | NVLink 4 | NVLink 5 |
| NVLink BW/GPU | 600 GB/s | 900 GB/s | 1.8 TB/s |
| TDP (SXM) | 400 W | 700 W | 1,000 W |
| MIG (max instances) | 7 | 7 | 7 |

Key Innovations Per Generation

Ampere — A100

  • Multi-Instance GPU (MIG) — first GPU with full hardware partitioning. Up to 7 isolated MIG instances with guaranteed QoS. Each instance has its own SMs, HBM, L2 cache, and memory controllers.
  • 3rd-gen Tensor Cores with sparsity — 2:4 structured sparsity support doubles effective throughput (312 → 624 TFLOPS). Works by pruning 50% of weights in a structured pattern.
  • TF32 precision — new "TensorFloat-32" format: 10-bit mantissa (like FP16) but 8-bit exponent (like FP32). Drops into FP32 code paths automatically for ~10× speedup vs FP32 CUDA Cores.
  • NVLink 3 — 600 GB/s bidirectional per GPU; 12 NVLink lanes.
  • PCIe Gen 4 — connects to host; DGX A100 uses NVSwitch 2nd gen.

Hopper — H100 ★ Most tested on NCP-AII

  • Transformer Engine — dedicated hardware that automatically selects FP8 or BF16 precision on a per-layer, per-tensor basis during training. No accuracy loss through dynamic per-tensor scaling. This is the mechanism behind H100's 4× throughput advantage vs A100 for LLM training.
  • FP8 Tensor Cores (4th gen) — native 8-bit floating-point in E4M3 (higher precision) and E5M2 (higher range) formats. 3,958 TFLOPS dense FP8 throughput.
  • HBM3 — 80 GB, 3.35 TB/s (vs 2.0 TB/s on A100). Critical for large model weight storage.
  • NVLink 4 — 900 GB/s bidirectional; 18 NVLink lanes. DGX H100 uses NVSwitch 3rd gen (4 switches connecting 8 GPUs).
  • Confidential Computing — Trusted Execution Environment (TEE) for secure multi-tenant AI workloads.
  • PCIe Gen 5 — 128 GB/s host-device bandwidth.

Blackwell — B200

  • FP4 Tensor Cores (5th gen) — 4-bit floating-point support for inference, enabling up to 18,000 TFLOPS. Requires careful calibration but 2× inference throughput vs FP8.
  • 2nd-gen Transformer Engine — extends dynamic precision management to FP4, with per-block scaling for higher accuracy at ultra-low precision.
  • HBM3e — 192 GB capacity (2.4× vs H100), 8.0 TB/s bandwidth (2.4× vs H100). Enables fitting 405B+ parameter models on a single node.
  • NVLink 5 — 1.8 TB/s bidirectional per GPU (2× vs H100).
  • RAS Engine — dedicated Reliability, Availability, Serviceability engine for datacenter-grade fault detection and recovery without halting workloads.
📊 Generation-over-Generation FP8 Throughput: A100: no FP8 → H100: 3,958 TFLOPS FP8 → B200: ~9,000 TFLOPS FP8, plus FP4 at ~18,000 TFLOPS. Each generation roughly doubles AI training throughput, driven primarily by Tensor Core and memory advances.

🔄 Transformer Engine — Deep Dive

The Transformer Engine is one of the most exam-critical features of H100. Understanding how it works distinguishes expert candidates from those who just memorize numbers.

Traditional mixed-precision training (AMP) uses FP16 for forward/backward passes with a static loss scaler. The Transformer Engine goes further by using FP8 (8-bit) computation with per-tensor dynamic scaling — automatically selecting the right scale factor for each tensor to prevent overflow/underflow, without programmer intervention.

| Approach | Precision | Scaling | TFLOPS |
|---|---|---|---|
| FP32 baseline | FP32 | None | 67 |
| AMP (automatic mixed precision) | FP16/BF16 | Static loss scaling | 1,979 |
| Transformer Engine (H100) | FP8 + BF16 | Per-tensor dynamic scaling | 3,958 |

The Transformer Engine is exposed to developers through the transformer_engine Python library. Wrapping standard layers with TE equivalents (e.g., te.Linear, te.LayerNorm) enables FP8 compute automatically, with no manual scaling code.
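
The per-tensor dynamic scaling idea itself is simple enough to sketch in plain host code (a conceptual illustration, not the Transformer Engine implementation; the real library tracks an amax history on the GPU and applies scales inside fused FP8 kernels): find the tensor's absolute maximum, choose a scale that maps it near the top of the FP8 E4M3 range (~448), and carry that scale alongside the quantized tensor.

#include <cmath>
#include <cstdio>
#include <vector>

// Conceptual per-tensor scaling for FP8 E4M3 (max representable value ~448).
// Rounding to the nearest integer stands in for FP8's coarse mantissa grid.
int main() {
    std::vector<float> tensor = {0.002f, -0.75f, 3.1f, -12.0f, 0.25f};

    float amax = 0.0f;                              // absolute maximum of the tensor
    for (float v : tensor)
        if (std::fabs(v) > amax) amax = std::fabs(v);

    const float fp8_e4m3_max = 448.0f;
    float scale = fp8_e4m3_max / amax;              // map amax onto the FP8 range

    for (float v : tensor) {
        float scaled   = v * scale;                 // value stored (conceptually) as FP8
        float restored = std::round(scaled) / scale; // dequantized for downstream use
        printf("%8.3f -> scaled %9.2f -> restored %8.3f\n", v, scaled, restored);
    }
    printf("per-tensor scale = %.3f (amax = %.3f)\n", scale, amax);
    return 0;
}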

🔗 NVLink: Intra-Node GPU Interconnect

NVLink is NVIDIA's proprietary high-speed GPU-to-GPU interconnect, providing bandwidth that is an order of magnitude higher than PCIe. It is the critical enabler of multi-GPU AI training within a single server — allowing model parallelism and fast AllReduce communication without leaving the server chassis.

| NVLink Gen | GPU | Links | BW per GPU (Bidirectional) | NVSwitch Gen | Aggregate Switching BW |
|---|---|---|---|---|---|
| NVLink 3 | A100 | 12 | 600 GB/s | 2nd Gen | 7.2 TB/s |
| NVLink 4 | H100 | 18 | 900 GB/s | 3rd Gen | 57.6 TB/s |
| NVLink 5 | B200 | 18 | 1.8 TB/s | 4th Gen | — |
🔑 NVLink vs PCIe Bandwidth Comparison NVLink 4: 900 GB/s  |  PCIe 5.0 x16: 128 GB/s  |  Advantage: ~7×. For AllReduce in data-parallel training across 8 GPUs, NVLink makes communication nearly free compared to the compute. This is why DGX servers are the standard training platform for large models.
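
In CUDA, peer-to-peer transfers over NVLink are exposed through the standard P2P APIs. A minimal two-GPU sketch (device IDs and sizes are placeholders): if a direct peer path exists, the copy moves GPU-to-GPU over NVLink/NVSwitch without staging through host memory.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can GPU 0 reach GPU 1 directly?
    if (!canAccess) { printf("No P2P path between GPU 0 and GPU 1\n"); return 1; }

    const size_t bytes = size_t(1) << 28;        // 256 MiB
    float *buf0, *buf1;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);            // allow GPU 0 to access GPU 1
    cudaMalloc(&buf0, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);            // allow GPU 1 to access GPU 0
    cudaMalloc(&buf1, bytes);

    cudaSetDevice(0);
    cudaMemcpyPeerAsync(buf1, 1, buf0, 0, bytes); // GPU0 -> GPU1, no host staging
    cudaDeviceSynchronize();

    printf("Copied %zu MiB GPU0 -> GPU1 over the peer path\n", bytes >> 20);
    cudaFree(buf0);
    cudaSetDevice(1);
    cudaFree(buf1);
    return 0;
}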

NVLink Switch System — Scale Beyond a Single Node

NVIDIA's NVLink Switch System (introduced with Hopper) extends NVLink connectivity across multiple servers using external NVLink Switch chips in a dedicated fabric. Up to 256 H100 GPUs can be joined in a single NVLink domain (and up to 576 GPUs with NVLink 5 on Blackwell), enabling scale-up configurations with GPU-speed bandwidth across nodes — blurring the line between scale-up and scale-out.

🔀 MIG: Multi-Instance GPU

MIG (Multi-Instance GPU), introduced with Ampere, enables a single physical GPU to be partitioned into up to seven independent, hardware-isolated instances. Unlike CPU virtualization which is software-based, MIG creates actual hardware boundaries — each instance gets its own slice of every physical resource.

What MIG Partitions

| Resource | MIG Behavior | Why It Matters |
|---|---|---|
| Streaming Multiprocessors (SMs) | Dedicated SM partition (GPC slice) | No compute interference between instances |
| HBM Memory | Dedicated memory slice | Private address space, no cross-instance reads |
| L2 Cache | Dedicated L2 slice | Prevents cache side-channel attacks |
| Memory Controllers | Dedicated memory controllers | Guaranteed memory bandwidth per instance |
| NVLink | Dedicated NVLink ports | MIG instances can use NVLink for peer access |

H100 MIG Profiles

| Profile | GPCs | HBM | Max Instances per GPU |
|---|---|---|---|
| 7g.80gb | 7 | 80 GB | 1 |
| 4g.40gb | 4 | 40 GB | 1 |
| 3g.40gb | 3 | 40 GB | 2 |
| 2g.20gb | 2 | 20 GB | 3 |
| 1g.10gb | 1 | 10 GB | 7 |

MIG vs MPS vs Time-Slicing

| Feature | MIG | MPS (Multi-Process Service) | Time-Slicing |
|---|---|---|---|
| Isolation level | Hardware (strongest) | Process-level (software) | None |
| Memory protection | Full (dedicated HBM) | Partial (shared address space) | None |
| Fault containment | Yes — per instance | Partial — MPS server crash kills all | No |
| QoS guarantee | Yes — bandwidth guaranteed | Best-effort | Best-effort |
| GPU utilization | Fixed slices (may waste) | High — shares idle resources | High |
| Use case | Multi-tenant inference, CI/CD | Co-running ML jobs, research clusters | Dev/test, low-SLA inference |
| GPU support | A100, H100, B200 | Volta+ | All |
🔑 MIG Isolation Guarantees (Exam-Critical): A fault in one MIG instance (ECC error, illegal memory access) does NOT affect other MIG instances. This is the defining advantage of MIG over MPS. For multi-tenant cloud deployments where multiple customers share a GPU, MIG provides security-grade isolation equivalent to separate physical GPUs.
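
From the application's point of view, each MIG compute instance simply enumerates as its own CUDA device (typically selected by setting CUDA_VISIBLE_DEVICES to the instance's MIG UUID, as noted in the configuration guide later in this topic). A minimal sketch that lists what a process actually sees:

#include <cstdio>
#include <cuda_runtime.h>

// On a MIG-enabled GPU, each visible compute instance appears as its own CUDA
// device, so the SM count and memory reported here reflect that instance's
// slice rather than the full physical GPU.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        size_t freeB = 0, totalB = 0;
        cudaSetDevice(d);
        cudaMemGetInfo(&freeB, &totalB);
        printf("Device %d: %s | %d SMs | %zu GB visible memory\n",
               d, p.name, p.multiProcessorCount, totalB >> 30);
    }
    return 0;
}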

🧪 Practice Quiz — GPU Architecture

Question 1 of 10
How many Streaming Multiprocessors (SMs) does the H100 SXM5 have?
A) 80
B) 108
C) 132
D) 160
Question 2 of 10
What is the warp size used in all NVIDIA GPU architectures?
A) 16 threads
B) 32 threads
C) 64 threads
D) 128 threads
Question 3 of 10
What is the approximate HBM3 memory bandwidth of the H100 SXM5?
A) 2.0 TB/s
B) 2.4 TB/s
C) 3.35 TB/s
D) 4.8 TB/s
Question 4 of 10
Which core type handles dense matrix multiply-accumulate (GEMM) operations — the primary compute operation in AI training?
A) CUDA Cores (FP32)
B) RT Cores
C) Tensor Cores
D) INT32 Cores
Question 5 of 10
What is the total bidirectional NVLink 4 bandwidth per H100 GPU?
A) 600 GB/s
B) 900 GB/s
C) 1.2 TB/s
D) 1.8 TB/s
Question 6 of 10
What is the maximum number of MIG instances supported on an H100 GPU?
A) 4
B) 5
C) 7
D) 8
Question 7 of 10
The Transformer Engine in H100 primarily enables which capability during AI training?
A) Faster PCIe host-to-device data transfers
B) Automatic per-tensor FP8 ↔ BF16 precision switching
C) Hardware ray tracing for visualization
D) Lossless NVLink compression
Question 8 of 10
In the GPU memory hierarchy, which level offers the absolute lowest latency (~1 clock cycle) but also the smallest capacity?
A) HBM3 Device Memory
B) L2 Cache
C) L1 / Shared Memory
D) Register File
Question 9 of 10
Which GPU generation introduced the Transformer Engine and native FP8 training support?
A) Ampere (A100)
B) Hopper (H100)
C) Volta (V100)
D) Blackwell (B200)
Question 10 of 10
In MIG partitioning, each instance receives which set of dedicated hardware resources?
A) Dedicated SMs only, shared HBM
B) Dedicated SMs and HBM slice, shared L2 cache
C) Dedicated SMs, HBM slice, L2 cache, and memory controllers
D) Shared GPU resources with guaranteed minimum bandwidth

🃏 Flashcards

SM Architecture
How many SMs, CUDA cores, and Tensor Cores in the H100 SXM5?
132 SMs · 128 CUDA cores/SM = 16,896 CUDA cores · 4 Tensor Cores/SM = 528 Tensor Cores
Formula: SMs × 128 = CUDA cores
SIMT Execution
What is a warp and why does it matter?
32 threads executing the same instruction in lockstep (SIMT). Divergent branches serialize — this is warp divergence.
Warp = 32. Always 32. Every gen.
Memory Bandwidth
H100 HBM3 capacity and bandwidth?
80 GB capacity · 3.35 TB/s bandwidth. Compare: A100 = 2.0 TB/s · H200 (HBM3e) = 141 GB @ 4.8 TB/s
H100 = 3.35 TB/s HBM3
NVLink 4
H100 NVLink 4 — how many lanes and what bandwidth?
18 NVLink lanes · 900 GB/s bidirectional per GPU · 4 NVSwitch chips in DGX H100 for full all-to-all mesh
18 lanes × 50 GB/s = 900 GB/s
MIG Partitioning
What 4 hardware resources does each MIG instance get exclusively?
Dedicated: (1) SM partition (GPCs), (2) HBM memory slice, (3) L2 cache slice, (4) memory controllers
SM + HBM + L2 + mem-ctrl = full isolation
Transformer Engine
What does the H100 Transformer Engine do?
Automatically switches between FP8 and BF16 on a per-tensor, per-layer basis using dynamic scaling — no accuracy loss, 2× throughput vs BF16
FP8 + dynamic scale = Transformer Engine
Roofline Model
What is arithmetic intensity and when is a kernel memory-bound?
AI = FLOPs ÷ Bytes accessed. Memory-bound when AI is below the ridge point (~591 FLOPs/B for BF16 on H100). Elementwise ops (~1 FLOPs/B) are always memory-bound.
Low AI = memory wall
GPU Generations
A100 → H100 → B200: What precision did each generation add?
A100: TF32 + FP16 sparsity. H100: FP8 (Transformer Engine). B200: FP4 (2nd-gen Transformer Engine). Each ~2× throughput over prior gen.
TF32 → FP8 → FP4

🤖 GPU Architecture Advisor

Exam-focused recommendations, organized by topic:

⚙️ SM Architecture — Key Exam Points

  • H100 has 132 SMs. Each SM has 128 CUDA Cores, 4 Tensor Cores (4th gen), 4 warp schedulers, 256 KB register file, and up to 228 KB shared memory.
  • Warp = 32 threads executing the same instruction (SIMT). Warp size is fixed at 32 across all NVIDIA architectures.
  • Occupancy = active warps / max warps. H100 max = 64 warps/SM. Limited by register usage (--maxrregcount), shared memory allocation, and block size.
  • Warp divergence (branches within a warp) serializes both code paths, reducing effective throughput. Design kernels with uniform control flow where possible.
  • Tensor Cores (4th gen on H100) perform matrix multiply-accumulate D = A×B + C on small matrix tiles, operating in FP8, BF16, FP16, or TF32. In cuBLAS, set the math mode to CUBLAS_TF32_TENSOR_OP_MATH (via cublasSetMathMode) to allow TF32 Tensor Core paths.
  • RT Cores are absent on compute GPUs (H100 is HPC/AI, not graphics). Do not confuse RT Cores with Tensor Cores on the exam.
  • H100's 4 warp schedulers each issue one instruction per clock; instruction-level parallelism within a warp (back-to-back independent instructions) helps keep the pipelines full alongside warp-level latency hiding.

🧠 Memory Hierarchy — Optimization Guide

  • Registers: fastest, ~1 cycle. If a kernel exceeds 255 registers/thread, excess spills to local memory (HBM latency). Profile with nvcc --ptxas-options=-v to check register count.
  • Shared memory: programmer-managed, L1-speed. Allocate with __shared__. Tiling strategy: load HBM block → process in shared mem → write back. Used by FlashAttention, cuBLAS, CUTLASS.
  • Maximize shared memory: H100 allows up to 228 KB/SM via cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, 228*1024).
  • L2 cache (50 MB on H100) is shared. For multi-kernel pipelines, the cache set-aside API can reserve a portion: cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size).
  • HBM3 bandwidth (3.35 TB/s) is the bottleneck for memory-bound kernels. Use coalesced memory access — consecutive threads should access consecutive addresses for full HBM burst efficiency.
  • Arithmetic intensity determines the bound. Large GEMMs (>500 FLOPs/B) are compute-bound. Elementwise ops (~1 FLOPs/B) are memory-bound. Profile with ncu --metrics l1tex__t_bytes,smsp__sass_thread_inst_executed_op_fadd_pred_on.
  • NVLink 4 (900 GB/s) is ~7× PCIe 5 (128 GB/s). Peer-to-peer GPU copies over NVLink use cudaMemcpyPeerAsync and are far faster than going through CPU.

📈 GPU Generation Selection — Use Case Guide

  • A100 (Ampere): Best for existing codebases without FP8 support, large MIG deployments (7 × 10 GB for multi-tenant inference), or where budget limits H100 deployment. 312 TF32 TFLOPS, 2.0 TB/s HBM2e.
  • H100 (Hopper): Current training standard. Choose H100 when: training LLMs/transformers (Transformer Engine FP8 gives ~2× over BF16 on the same GPU, and up to ~4× vs A100 overall), running FP8 inference, needing Confidential Computing TEE, or maximizing throughput per watt at 700 W TDP. 989 TF32 / 3,958 FP8 TFLOPS.
  • H200: Drop-in H100 upgrade with HBM3e 141 GB at 4.8 TB/s. Choose when model size exceeds 80 GB or when inference is memory-bandwidth bound (larger KV cache, longer sequences).
  • B200 (Blackwell): Next-gen for inference at scale. FP4 (18,000 TFLOPS) and 192 GB HBM3e (8 TB/s) make it the choice for large-scale deployment of 405B+ models. 1,000W TDP requires updated power/cooling design.
  • Key exam comparison: H100 vs A100 — H100 wins on every metric: 1.22× the SMs (108 → 132), 1.68× the HBM bandwidth (2.0 → 3.35 TB/s), ~3.2× the TF32 TFLOPS, and native FP8 Tensor Cores (A100 has none).
  • Form factors matter: SXM (highest performance, requires NVLink baseboard) vs PCIe (lower BW, standard server, 350W). DGX H100 uses SXM5.

🔀 MIG Configuration — Partitioning Guide

  • Enable MIG mode: sudo nvidia-smi -i 0 -mig 1. Requires a reboot or GPU reset. Confirm with nvidia-smi -L.
  • Create 7 equal instances (1g.10gb): sudo nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb. Each gets 1/7 SMs, 10 GB HBM, dedicated L2 and memory controllers.
  • Create Compute Instances within each GPU Instance: sudo nvidia-smi mig -cci 1c.1g.10gb. CI defines the number of Tensor Core slices visible to the application.
  • List all MIG instances: nvidia-smi mig -lgi and nvidia-smi mig -lci. Each CI gets a unique MIG device UUID, visible as a separate GPU to CUDA applications.
  • Kubernetes with MIG: use the nvidia.com/mig-1g.10gb resource type in the pod spec. The device plugin handles assignment. Configure the MIG strategy (e.g., mig.strategy=single or mixed) in the NVIDIA GPU Operator; MIG Manager (mig-parted) applies the desired layout.
  • Fault isolation test: a cudaErrorIllegalAddress in one MIG instance raises an Xid error (e.g., Xid 31, GPU memory page fault) only for that instance. Other instances continue running. This is the key MIG vs MPS isolation difference.
  • MIG profile selection strategy: 7×1g.10gb for inference serving many small models; 2×3g.40gb for two medium training jobs; 7g.80gb for a single job needing full GPU bandwidth.

🧩 Memory Mnemonics

| Concept | Mnemonic / Hook |
|---|---|
| Warp size = 32 | "32 soldiers march in SIMT lockstep — always 32, every generation" |
| H100 SMs = 132 | "Hopper adds 24 SMs to Ampere's 108: 108 + 24 = 132" |
| H100 HBM3 = 3.35 TB/s | "Three Point Three Five — H100 needs speed to feed 132 SMs" |
| NVLink 4 = 900 GB/s | "18 links × 50 GB/s = 900 = the magic number for H100 fabric" |
| MIG max = 7 | "7 MIG slices on H100 — like 7 days in a week, full isolation guaranteed" |
| Memory hierarchy order | "Really Short L-two HBMs — Registers, Shared, L2, HBM, (host via) bandwidth" |
| Tensor Core precision order | "Four, Eight, Sixteen/BF, TF32, FP64 — go up in bits, lose throughput" |
| Transformer Engine = FP8 auto | "TE = auto-FP8 with dynamic scale — no code, pure speed" |
NCP-AII · Topic 1 Complete

Ready to ace the NCP-AII?

Build your GPU architecture mastery with FlashGenius — study guides, quiz banks, and adaptive flashcards for every NVIDIA certification.
