🔲 NCP-AII  ·  Topic 1  ·  GPU Architecture

GPU Architecture & Compute Fundamentals

The hardware foundation — SMs, Tensor Cores, HBM memory hierarchy, GPU generations, NVLink, and MIG partitioning. Everything you need to understand how NVIDIA GPUs power modern AI.

7 Core Topics · 10 Quiz Questions · 132 H100 SMs · 3.35 TB/s HBM3

🏗️ The GPU Compute Model

A GPU is a massively parallel processor designed to execute thousands of threads simultaneously. Where a CPU optimizes for low-latency serial execution (few cores, deep caches, branch prediction), a GPU optimizes for high-throughput parallel execution — trading single-thread performance for aggregate parallelism across thousands of simpler cores.

This architecture maps perfectly onto AI workloads. Training a neural network requires billions of multiply-accumulate operations on large matrices. A GPU can pipeline these across thousands of cores simultaneously, achieving computational throughput that would be impossible on a CPU.

🔑 The SIMT Execution Model: NVIDIA GPUs execute threads using SIMT — Single Instruction, Multiple Threads. Groups of 32 threads called warps execute the same instruction in lockstep on different data. This is the fundamental unit of GPU parallelism.
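
To make the SIMT model concrete, here is a minimal CUDA sketch (illustrative only; the kernel name and sizes are placeholders): every thread runs the same kernel body on a different element, and the hardware schedules each block's threads as warps of 32.

#include <cuda_runtime.h>

// Every thread executes the same instruction stream on different data (SIMT).
// A block of 256 threads is scheduled by the SM as 8 warps of 32 threads each.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        data[i] *= factor;                           // same instruction, different element
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);     // grid of blocks, 8 warps per block
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}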
  • ⚙️ SM Architecture: Streaming Multiprocessors, warps, CUDA & Tensor Cores, occupancy
  • 🧠 Memory Hierarchy: Registers → Shared Mem → L2 → HBM, bandwidth & roofline model
  • 📈 GPU Generations: Ampere (A100) → Hopper (H100) → Blackwell (B200) key innovations
  • 🔗 NVLink & NVSwitch: Intra-node GPU interconnect, 900 GB/s, DGX H100 topology
  • 🔀 MIG Partitioning: Hardware-isolated GPU slices (dedicated SMs, HBM, L2, memory controllers)

📊 H100 SXM5 Key Metrics — Quick Reference

| Metric | Value | Context |
|---|---|---|
| Streaming Multiprocessors | 132 SMs | Each SM runs up to 64 warps (2,048 threads) |
| CUDA Cores (FP32) | 16,896 | 128 CUDA cores × 132 SMs |
| Tensor Cores | 528 | 4 per SM × 132 SMs (4th generation) |
| HBM3 Capacity | 80 GB | H200 upgrades to 141 GB HBM3e |
| HBM3 Bandwidth | 3.35 TB/s | ~1.7× vs A100 HBM2e (2.0 TB/s) |
| TF32 TFLOPS | 989 | Dense; sparse 1,979 TFLOPS |
| FP8 TFLOPS | 3,958 | Via Transformer Engine (H100 only) |
| NVLink 4 Bandwidth | 900 GB/s | Bidirectional; 18 links × 50 GB/s |
| MIG Instances (max) | 7 | Full hardware isolation per instance |
| TDP (SXM form factor) | 700 W | PCIe variant: 350 W |

🆚 GPU vs CPU: Why GPUs Win at AI

| Characteristic | CPU (Xeon/EPYC) | GPU (H100) |
|---|---|---|
| Core count | 8–64 cores | 16,896 CUDA + 528 Tensor Cores |
| Optimization target | Low latency, serial tasks | High throughput, parallel tasks |
| Memory bandwidth | ~300–500 GB/s (DDR5) | 3.35 TB/s (HBM3) |
| Cache hierarchy | Large L1/L2/L3 caches | Smaller L2 (50 MB), relies on SM shared mem |
| Peak FP16 TFLOPS | ~2–5 TFLOPS | 1,979 TFLOPS (BF16) |
| AI training use | Pre/post-processing, data loading | Forward pass, backward pass, optimizer |
| Programming model | OpenMP, MPI, threads | CUDA, SIMT, warps, blocks, grids |

⚙️ The Streaming Multiprocessor (SM)

The Streaming Multiprocessor is the fundamental compute unit of every NVIDIA GPU. All CUDA kernels execute on SMs. A GPU's throughput scales with SM count — the H100 SXM has 132 SMs, compared to 108 in the A100 and 192 in the B200. Understanding what's inside an SM is critical for the NCP-AII exam.
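
These per-GPU numbers can be confirmed at runtime with the standard CUDA device-properties API. A minimal sketch follows; the reference values in the comments are the H100 SXM5 figures quoted in this section.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("SMs:                   %d\n", p.multiProcessorCount);          // 132 on H100 SXM5
    printf("Warp size:             %d\n", p.warpSize);                     // always 32
    printf("Max threads per SM:    %d\n", p.maxThreadsPerMultiProcessor);  // 2,048 -> 64 warps
    printf("Max threads per block: %d\n", p.maxThreadsPerBlock);           // 1,024
    printf("Registers per SM:      %d\n", p.regsPerMultiprocessor);        // 65,536
    printf("Shared mem per SM:     %zu KB\n", p.sharedMemPerMultiprocessor / 1024); // 228 KB
    printf("L2 cache:              %d MB\n", p.l2CacheSize / (1024 * 1024));        // 50 MB
    return 0;
}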

H100 Streaming Multiprocessor (SM) — Internal Structure
| Unit | Per SM | Role |
|---|---|---|
| CUDA Cores (FP32) | 128 | FP32 multiply-accumulate |
| INT32 Cores | 64 | Integer arithmetic |
| Tensor Cores (4th Gen) | 4 | FP8/BF16/FP16/TF32 GEMM |
| Warp Schedulers | 4 | Select ready warps each cycle |
| Register File | 256 KB | 65,536 × 32-bit registers |
| L1 / Shared Memory | 256 KB | Up to 228 KB configurable as shared memory |

Per-SM limits: 64 active warps (2,048 threads) · max 32 blocks per SM · 1,024 threads per block

🔄 Warps and SIMT Execution

A warp is a group of 32 threads that execute together. The warp is the hardware scheduling unit — the warp scheduler selects which warp to issue an instruction to each clock cycle. Because warps have independent program counters, the GPU can interleave warps to hide memory latency: while one warp waits for a memory operation to complete, the scheduler switches to another warp with ready operands.

Warp Scheduling and Latency Hiding

The H100 SM has 4 warp schedulers, each capable of issuing one instruction per clock cycle. With up to 64 active warps per SM, the GPU can tolerate long memory latencies (hundreds of clock cycles) by keeping the schedulers busy with other warps. This is why maximizing occupancy (the ratio of active warps to the theoretical maximum) is crucial for performance.

📐 Occupancy Formula: Occupancy = Active Warps ÷ Max Warps per SM. H100 maximum = 64 warps per SM. Occupancy is limited by: (1) register usage per thread, (2) shared memory usage per block, (3) block size. Low occupancy = poor latency hiding = poor throughput.
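
The same calculation can be done programmatically with the CUDA occupancy API. Below is a minimal sketch (myKernel and the block size are placeholders) that reports how many of the 64 warps per SM a given launch configuration keeps resident.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float* x) { x[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }

int main() {
    const int blockSize = 256;            // 8 warps per block
    const size_t dynamicSmem = 0;         // dynamic shared memory per block, in bytes

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, dynamicSmem);

    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    int maxWarpsPerSM    = p.maxThreadsPerMultiProcessor / p.warpSize;  // 64 on H100
    int activeWarpsPerSM = blocksPerSM * blockSize / p.warpSize;

    printf("Active blocks per SM: %d\n", blocksPerSM);
    printf("Occupancy: %.0f%% (%d of %d warps)\n",
           100.0 * activeWarpsPerSM / maxWarpsPerSM, activeWarpsPerSM, maxWarpsPerSM);
    return 0;
}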

Warp Divergence

Because all 32 threads in a warp execute the same instruction in lockstep (SIMT), warp divergence occurs when threads take different branches. The hardware serializes both paths, executing threads that take the "true" branch while masking threads that take the "false" branch, then vice versa. This can halve throughput in the worst case. NCP-AII candidates should understand why data-dependent branching in GPU kernels is expensive.

⚠️ Exam tip: Warp divergence does NOT occur between different warps — only within a warp. Divergence between warps is free since warps execute independently.
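
A hypothetical pair of kernels illustrates the difference. In the first, the branch depends on per-element data, so threads within one warp disagree and both paths execute under masks; in the second, the branch depends only on the warp index, so every warp takes a single path. (For a branch body this short the compiler may simply predicate it; the cost becomes visible as the divergent regions grow.)

// Divergent: threads in the same warp take different branches depending on
// their data, so the warp executes both paths serially under active masks.
__global__ void divergent(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0.0f)              // data-dependent: mixes true/false within a warp
        out[i] = in[i] * 2.0f;
    else
        out[i] = 0.0f;
}

// Uniform: the branch depends only on the warp index, so all 32 threads of a
// given warp agree and no intra-warp serialization occurs.
__global__ void warpUniform(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int warpId = i / 32;
    if (warpId % 2 == 0)           // whole warp takes the same path
        out[i] = in[i] * 2.0f;
    else
        out[i] = 0.0f;
}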

🎯 Core Types: CUDA vs Tensor vs RT Cores

| Core Type | Operation | Precision | AI Relevance | H100 Count |
|---|---|---|---|---|
| CUDA Cores | Scalar FP32 / INT32 arithmetic (add, mul, FMA) | FP32, FP64, INT32 | Activation functions, elementwise ops, normalization | 16,896 |
| Tensor Cores (4th Gen) | Matrix multiply-accumulate: D = A×B + C (GEMM) | FP8, BF16, FP16, TF32, FP64 | Dense matmul in forward/backward pass — primary AI workload | 528 |
| RT Cores | Hardware ray-triangle intersection, BVH traversal | Fixed-function | Not used for AI training/inference — graphics only | N/A (none on H100) |

4th-Generation Tensor Core (H100)

The H100's Tensor Cores are the engine behind its massive throughput advantage over A100. Key improvements over 3rd-gen (A100):

  • FP8 support — native 8-bit floating-point (E4M3 and E5M2 formats) for training and inference, enabling the Transformer Engine to cut model size and double throughput vs FP16
  • Asynchronous execution — Tensor Core operations can overlap with other SM operations using the wgmma (Warpgroup Matrix Multiply-Accumulate) instruction
  • Larger tile sizes — H100 operates on 16×16×16 or 16×8×16 tiles in a single instruction vs 8×8×4 on A100
  • TF32 speedup — 989 TF32 TFLOPS on H100 vs 312 on A100 (3.2×)
🔑 Tensor Core Precision Hierarchy: FP4 (B200 only) > FP8 (3,958 TFLOPS) > BF16/FP16 (1,979 TFLOPS) > TF32 (989 TFLOPS) > FP64 (67 TFLOPS). Higher numerical precision means lower throughput but higher accuracy.

🧠 The GPU Memory Hierarchy

GPU memory is organized in a hierarchy that trades capacity for latency and bandwidth. Understanding this hierarchy is essential for both system design (NCP-AII) and performance optimization. The H100's HBM3 delivers 3.35 TB/s — the dominant reason GPUs outperform CPUs for AI.

  • 🔴 Register File: 256 KB per SM · 65,536 registers/SM · max 255 regs/thread · ~1 cycle (fastest)
  • 🟠 Shared Memory / L1 Cache: up to 228 KB shared mem per SM · configurable split with L1 · ~5–32 cycles · per SM
  • 🟡 L2 Cache: 50 MB total · shared across all 132 SMs · ~100–300 cycles
  • 🟢 HBM3 (Device Memory): 80 GB capacity · 3.35 TB/s bandwidth · 5 HBM3 stacks · ~400–700 cycles · on-package
  • 🔵 PCIe / NVLink (Host/Peer): PCIe 5.0: 128 GB/s · NVLink 4: 900 GB/s · host DRAM: ~200 GB/s · microsecond latency · off-GPU

📏 Bandwidth Numbers — Exam-Critical

| Level | Capacity | Bandwidth | Scope | Key Use |
|---|---|---|---|---|
| Registers | 256 KB/SM | ~10+ TB/s (est.) | Per-thread | Active computation values |
| Shared Memory | Up to 228 KB/SM | ~10–20 TB/s (est.) | Per-SM, per-thread block | Tile caching, reductions |
| L2 Cache | 50 MB | ~3.5–7 TB/s (internal) | All SMs | Repeated data reuse across SMs |
| HBM3 | 80 GB | 3.35 TB/s | Full GPU | Model weights, activations |
| NVLink 4 | Peer GPU | 900 GB/s | Intra-node GPUs | AllReduce, P2P transfers |
| PCIe 5.0 x16 | Host RAM | 128 GB/s | Host–GPU | Data loading, checkpointing |
📌 Key ratio: NVLink 4 (900 GB/s) is ~7× faster than PCIe 5.0 x16 (128 GB/s). This is why multi-GPU AI clusters use NVSwitch topologies rather than PCIe for GPU-to-GPU communication.
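
A rough way to see the HBM figure on real hardware is to time a large device-to-device copy (a simplistic sketch; a STREAM-style benchmark or Nsight Compute gives more rigorous numbers). Each copied byte is read once and written once, so effective bandwidth is 2 × bytes ÷ time.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = size_t(1) << 30;                    // 1 GiB per buffer
    float *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);   // warm-up
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double tbps = 2.0 * bytes / (ms / 1000.0) / 1e12;        // read + write traffic
    printf("Effective device-to-device bandwidth: %.2f TB/s\n", tbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaFree(dst);
    return 0;
}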

📈 The Roofline Model & Arithmetic Intensity

The roofline model determines whether a kernel is compute-bound or memory-bound by comparing its arithmetic intensity against the machine's peak performance / bandwidth ratio (the "ridge point").

📐 Arithmetic Intensity (AI) = FLOPs ÷ Bytes Accessed. High AI (e.g., large GEMM) → compute-bound → limited by Tensor Core throughput. Low AI (e.g., elementwise add, layer norm) → memory-bound → limited by HBM bandwidth.
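
The ridge-point arithmetic is worth doing once by hand. A small host-side sketch, assuming BF16 (2-byte) operands and that each matrix of a square GEMM is touched exactly once (so FLOPs = 2N³ and bytes ≈ 3N² × 2), using the H100 figures listed in the landmarks below:

#include <cstdio>

int main() {
    const double peak_tflops = 1979.0;   // BF16 Tensor Core peak (from the text)
    const double hbm_tbps    = 3.35;     // HBM3 bandwidth
    const double ridge = peak_tflops * 1e12 / (hbm_tbps * 1e12);   // ~591 FLOPs/B

    // Square GEMM C = A*B with BF16 operands, each matrix read or written once.
    for (double n : {512.0, 4096.0}) {
        double flops = 2.0 * n * n * n;
        double bytes = 3.0 * n * n * 2.0;
        double ai    = flops / bytes;
        printf("N=%5.0f  AI=%7.1f FLOPs/B  ->  %s\n", n, ai,
               ai > ridge ? "compute-bound" : "memory-bound");
    }
    printf("Ridge point: %.0f FLOPs/B\n", ridge);
    return 0;
}

With these assumptions a 512×512 GEMM lands near ~170 FLOPs/B (memory-bound), while a 4096×4096 GEMM lands near ~1,365 FLOPs/B (compute-bound), consistent with the per-operation table below.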

H100 Roofline Landmarks

  • Peak BF16 (dense, Tensor Core): 1,979 TFLOPS
  • Peak TF32 (dense, Tensor Core): 989 TFLOPS
  • Peak FP32 (CUDA Cores): 67 TFLOPS
  • HBM3 memory bandwidth: 3.35 TB/s
  • Ridge point (BF16): 1,979 TFLOPS ÷ 3.35 TB/s ≈ 591 FLOPs/Byte
  • Ridge point (FP32): 67 TFLOPS ÷ 3.35 TB/s ≈ 20 FLOPs/Byte

Practical Arithmetic Intensity by Operation

| Operation | Arithmetic Intensity | Bound |
|---|---|---|
| Large GEMM (4096×4096) | >500 FLOPs/B | Compute-bound (Tensor Core) |
| Batched GEMM (small matrices) | 50–200 FLOPs/B | Often memory-bound |
| Elementwise (add, ReLU, scale) | ~1 FLOP/B | Memory-bound |
| Layer Normalization | ~3–5 FLOPs/B | Memory-bound |
| Softmax | ~5–10 FLOPs/B | Memory-bound |
| Attention (FlashAttention) | ~10–50 FLOPs/B | Mixed (optimized via tiling) |

Shared Memory Optimization

Shared memory is a programmer-managed cache (L1-speed) within each SM. High-performance GPU kernels like cuBLAS, FlashAttention, and CUTLASS use shared memory to cache tile blocks from HBM, perform computation, then write results back — dramatically reducing the number of HBM accesses. The H100 provides up to 228 KB of shared memory per SM, of which up to 227 KB can be claimed by a single thread block via the opt-in attribute:

// Opt in to the maximum dynamic shared memory per block on the H100.
// The SM has 228 KB of combined L1/shared storage, but 1 KB is reserved,
// so the per-block opt-in limit is 227 KB on compute capability 9.0.
cudaFuncSetAttribute(myKernel,
  cudaFuncAttributeMaxDynamicSharedMemorySize, 227 * 1024);
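
A minimal tiled matrix-multiply sketch shows the pattern (illustrative only; production kernels such as cuBLAS or CUTLASS add double-buffering, vectorized loads, and Tensor Core MMA instructions). Each block stages a TILE×TILE sub-matrix of A and B in shared memory, so every element is fetched from HBM once per tile rather than once per multiply. Assumes square row-major N×N matrices with N divisible by TILE.

#define TILE 32

// Launch as: dim3 grid(N/TILE, N/TILE), block(TILE, TILE);
__global__ void tiledMatMul(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];     // tile of A staged in shared memory
    __shared__ float Bs[TILE][TILE];     // tile of B staged in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                              // tile fully loaded

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                              // done with this tile
    }
    C[row * N + col] = acc;
}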

📈 GPU Generations: Ampere → Hopper → Blackwell

Each GPU generation from NVIDIA represents a major architectural leap for AI. The NCP-AII exam tests your knowledge of what changed between generations and the quantitative improvements. Know these numbers cold.

| Feature | Ampere A100 SXM | Hopper H100 SXM5 | Blackwell B200 SXM |
|---|---|---|---|
| Process Node | TSMC 7nm | TSMC 4N (GH100) | TSMC 4NP |
| Transistors | 54.2 billion | 80 billion | 208 billion |
| SM Count | 108 | 132 | 192 |
| CUDA Cores | 6,912 | 16,896 | ~23,040 |
| Tensor Core Gen | 3rd Gen | 4th Gen | 5th Gen |
| HBM Type | HBM2e | HBM3 | HBM3e |
| HBM Capacity | 80 GB | 80 GB | 192 GB |
| HBM Bandwidth | 2.0 TB/s | 3.35 TB/s | 8.0 TB/s |
| TF32 TFLOPS | 312 | 989 | ~2,200 |
| FP16/BF16 TFLOPS | 312 | 1,979 | ~4,500 |
| FP8 TFLOPS | — | 3,958 | ~9,000 |
| FP4 TFLOPS | — | — | ~18,000 |
| NVLink Generation | NVLink 3 | NVLink 4 | NVLink 5 |
| NVLink BW/GPU | 600 GB/s | 900 GB/s | 1.8 TB/s |
| TDP (SXM) | 400 W | 700 W | 1,000 W |
| MIG (max instances) | 7 | 7 | 7 |

Key Innovations Per Generation

Ampere — A100

  • Multi-Instance GPU (MIG) — first GPU with full hardware partitioning. Up to 7 isolated MIG instances with guaranteed QoS. Each instance has its own SMs, HBM, L2 cache, and memory controllers.
  • 3rd-gen Tensor Cores with sparsity — 2:4 structured sparsity support doubles effective throughput (312 → 624 TFLOPS). Works by pruning 50% of weights in a structured pattern.
  • TF32 precision — new "TensorFloat-32" format: 10-bit mantissa (like FP16) but 8-bit exponent (like FP32). Drops into FP32 code paths automatically for ~10× speedup vs FP32 CUDA Cores.
  • NVLink 3 — 600 GB/s bidirectional per GPU; 12 NVLink lanes.
  • PCIe Gen 4 — connects to host; DGX A100 uses NVSwitch 2nd gen.

Hopper — H100 ★ Most tested on NCP-AII

  • Transformer Engine — dedicated hardware that automatically selects FP8 or BF16 precision on a per-layer, per-tensor basis during training. No accuracy loss through dynamic per-tensor scaling. This is the mechanism behind H100's 4× throughput advantage vs A100 for LLM training.
  • FP8 Tensor Cores (4th gen) — native 8-bit floating-point in E4M3 (higher precision) and E5M2 (higher range) formats. 3,958 TFLOPS dense FP8 throughput.
  • HBM3 — 80 GB, 3.35 TB/s (vs 2.0 TB/s on A100). Critical for large model weight storage.
  • NVLink 4 — 900 GB/s bidirectional; 18 NVLink lanes. DGX H100 uses NVSwitch 3rd gen (4 switches connecting 8 GPUs).
  • Confidential Computing — Trusted Execution Environment (TEE) for secure multi-tenant AI workloads.
  • PCIe Gen 5 — 128 GB/s host-device bandwidth.

Blackwell — B200

  • FP4 Tensor Cores (5th gen) — 4-bit floating-point support for inference, enabling up to 18,000 TFLOPS. Requires careful calibration but 2× inference throughput vs FP8.
  • 2nd-gen Transformer Engine — extends dynamic precision management to FP4, with per-block scaling for higher accuracy at ultra-low precision.
  • HBM3e — 192 GB capacity (2.4× vs H100), 8.0 TB/s bandwidth (2.4× vs H100). Enables fitting 405B+ parameter models on a single node.
  • NVLink 5 — 1.8 TB/s bidirectional per GPU (2× vs H100).
  • RAS Engine — dedicated Reliability, Availability, Serviceability engine for datacenter-grade fault detection and recovery without halting workloads.
📊 Generation-over-Generation FP8 Throughput: A100: no FP8 → H100: 3,958 TFLOPS FP8 → B200: ~9,000 TFLOPS FP8, plus FP4 at ~18,000 TFLOPS. Each generation roughly doubles AI training throughput, driven primarily by Tensor Core and memory advances.

🔄 Transformer Engine — Deep Dive

The Transformer Engine is one of the most exam-critical features of H100. Understanding how it works distinguishes expert candidates from those who just memorize numbers.

Traditional mixed-precision training (AMP) uses FP16 for forward/backward passes with a static loss scaler. The Transformer Engine goes further by using FP8 (8-bit) computation with per-tensor dynamic scaling — automatically selecting the right scale factor for each tensor to prevent overflow/underflow, without programmer intervention.

| Approach | Precision | Scaling | TFLOPS |
|---|---|---|---|
| FP32 baseline | FP32 | None | 67 |
| AMP (automatic mixed precision) | FP16/BF16 | Static loss scaling | 1,979 |
| Transformer Engine (H100) | FP8 + BF16 | Per-tensor dynamic scaling | 3,958 |

The Transformer Engine is exposed to developers through the transformer_engine Python library. Wrapping standard layers with TE equivalents (e.g., te.Linear, te.LayerNorm) enables FP8 compute automatically, with no manual scaling code.
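
The per-tensor dynamic scaling idea itself is simple enough to sketch in plain host code (a conceptual illustration, not the Transformer Engine implementation; the real library tracks an amax history on the GPU and applies scales inside fused FP8 kernels): find the tensor's absolute maximum, choose a scale that maps it near the top of the FP8 E4M3 range (~448), and carry that scale alongside the quantized tensor.

#include <cmath>
#include <cstdio>
#include <vector>

// Conceptual per-tensor scaling for FP8 E4M3 (max representable value ~448).
// Rounding to the nearest integer stands in for FP8's coarse mantissa grid.
int main() {
    std::vector<float> tensor = {0.002f, -0.75f, 3.1f, -12.0f, 0.25f};

    float amax = 0.0f;                              // absolute maximum of the tensor
    for (float v : tensor)
        if (std::fabs(v) > amax) amax = std::fabs(v);

    const float fp8_e4m3_max = 448.0f;
    float scale = fp8_e4m3_max / amax;              // map amax onto the FP8 range

    for (float v : tensor) {
        float scaled   = v * scale;                 // value stored (conceptually) as FP8
        float restored = std::round(scaled) / scale; // dequantized for downstream use
        printf("%8.3f -> scaled %9.2f -> restored %8.3f\n", v, scaled, restored);
    }
    printf("per-tensor scale = %.3f (amax = %.3f)\n", scale, amax);
    return 0;
}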

🔗 NVLink: Intra-Node GPU Interconnect

NVLink is NVIDIA's proprietary high-speed GPU-to-GPU interconnect, providing bandwidth that is an order of magnitude higher than PCIe. It is the critical enabler of multi-GPU AI training within a single server — allowing model parallelism and fast AllReduce communication without leaving the server chassis.

| NVLink Gen | GPU | Links | BW per GPU (Bidirectional) | NVSwitch Gen | Aggregate Switching BW |
|---|---|---|---|---|---|
| NVLink 3 | A100 | 12 | 600 GB/s | 2nd Gen | 7.2 TB/s |
| NVLink 4 | H100 | 18 | 900 GB/s | 3rd Gen | 57.6 TB/s |
| NVLink 5 | B200 | 18 | 1.8 TB/s | 4th Gen | — |
🔑 NVLink vs PCIe Bandwidth Comparison NVLink 4: 900 GB/s  |  PCIe 5.0 x16: 128 GB/s  |  Advantage: ~7×. For AllReduce in data-parallel training across 8 GPUs, NVLink makes communication nearly free compared to the compute. This is why DGX servers are the standard training platform for large models.
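
In CUDA, peer-to-peer transfers over NVLink are exposed through the standard P2P APIs. A minimal two-GPU sketch (device IDs and sizes are placeholders): if a direct peer path exists, the copy moves GPU-to-GPU over NVLink/NVSwitch without staging through host memory.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can GPU 0 reach GPU 1 directly?
    if (!canAccess) { printf("No P2P path between GPU 0 and GPU 1\n"); return 1; }

    const size_t bytes = size_t(1) << 28;        // 256 MiB
    float *buf0, *buf1;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);            // allow GPU 0 to access GPU 1
    cudaMalloc(&buf0, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);            // allow GPU 1 to access GPU 0
    cudaMalloc(&buf1, bytes);

    cudaSetDevice(0);
    cudaMemcpyPeerAsync(buf1, 1, buf0, 0, bytes); // GPU0 -> GPU1, no host staging
    cudaDeviceSynchronize();

    printf("Copied %zu MiB GPU0 -> GPU1 over the peer path\n", bytes >> 20);
    cudaFree(buf0);
    cudaSetDevice(1);
    cudaFree(buf1);
    return 0;
}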

NVLink Switch System — Scale Beyond a Single Node

NVIDIA's NVLink Switch System (introduced with Hopper) extends NVLink connectivity across multiple servers using external NVLink Switch chips in a dedicated fabric. Up to 256 H100 GPUs can be joined in a single NVLink domain (and up to 576 GPUs with NVLink 5 on Blackwell), enabling scale-up configurations with GPU-speed bandwidth across nodes — blurring the line between scale-up and scale-out.

🔀 MIG: Multi-Instance GPU

MIG (Multi-Instance GPU), introduced with Ampere, enables a single physical GPU to be partitioned into up to seven independent, hardware-isolated instances. Unlike CPU virtualization which is software-based, MIG creates actual hardware boundaries — each instance gets its own slice of every physical resource.

What MIG Partitions

| Resource | MIG Behavior | Why It Matters |
|---|---|---|
| Streaming Multiprocessors (SMs) | Dedicated SM partition (GPC slice) | No compute interference between instances |
| HBM Memory | Dedicated memory slice | Private address space, no cross-instance reads |
| L2 Cache | Dedicated L2 slice | Prevents cache side-channel attacks |
| Memory Controllers | Dedicated memory controllers | Guaranteed memory bandwidth per instance |
| NVLink | Dedicated NVLink ports | MIG instances can use NVLink for peer access |

H100 MIG Profiles

| Profile | GPCs | HBM | Max Instances per GPU |
|---|---|---|---|
| 7g.80gb | 7 | 80 GB | 1 |
| 4g.40gb | 4 | 40 GB | 1 |
| 3g.40gb | 3 | 40 GB | 2 |
| 2g.20gb | 2 | 20 GB | 3 |
| 1g.10gb | 1 | 10 GB | 7 |

MIG vs MPS vs Time-Slicing

| Feature | MIG | MPS (Multi-Process Service) | Time-Slicing |
|---|---|---|---|
| Isolation level | Hardware (strongest) | Process-level (software) | None |
| Memory protection | Full (dedicated HBM) | Partial (shared address space) | None |
| Fault containment | Yes — per instance | Partial — MPS server crash kills all | No |
| QoS guarantee | Yes — bandwidth guaranteed | Best-effort | Best-effort |
| GPU utilization | Fixed slices (may waste) | High — shares idle resources | High |
| Use case | Multi-tenant inference, CI/CD | Co-running ML jobs, research clusters | Dev/test, low-SLA inference |
| GPU support | A100, H100, B200 | Volta+ | All |
🔑 MIG Isolation Guarantees (Exam-Critical): A fault in one MIG instance (ECC error, illegal memory access) does NOT affect other MIG instances. This is the defining advantage of MIG over MPS. For multi-tenant cloud deployments where multiple customers share a GPU, MIG provides security-grade isolation equivalent to separate physical GPUs.
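
From the application's point of view, each MIG compute instance simply enumerates as its own CUDA device (typically selected by setting CUDA_VISIBLE_DEVICES to the instance's MIG UUID, as noted in the configuration guide later in this topic). A minimal sketch that lists what a process actually sees:

#include <cstdio>
#include <cuda_runtime.h>

// On a MIG-enabled GPU, each visible compute instance appears as its own CUDA
// device, so the SM count and memory reported here reflect that instance's
// slice rather than the full physical GPU.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        size_t freeB = 0, totalB = 0;
        cudaSetDevice(d);
        cudaMemGetInfo(&freeB, &totalB);
        printf("Device %d: %s | %d SMs | %zu GB visible memory\n",
               d, p.name, p.multiProcessorCount, totalB >> 30);
    }
    return 0;
}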

🧪 Practice Quiz — GPU Architecture

Question 1 of 10
How many Streaming Multiprocessors (SMs) does the H100 SXM5 have?
A) 80
B) 108
C) 132
D) 160
Question 2 of 10
What is the warp size used in all NVIDIA GPU architectures?
A) 16 threads
B) 32 threads
C) 64 threads
D) 128 threads
Question 3 of 10
What is the approximate HBM3 memory bandwidth of the H100 SXM5?
A) 2.0 TB/s
B) 2.4 TB/s
C) 3.35 TB/s
D) 4.8 TB/s
Question 4 of 10
Which core type handles dense matrix multiply-accumulate (GEMM) operations — the primary compute operation in AI training?
A) CUDA Cores (FP32)
B) RT Cores
C) Tensor Cores
D) INT32 Cores
Question 5 of 10
What is the total bidirectional NVLink 4 bandwidth per H100 GPU?
A) 600 GB/s
B) 900 GB/s
C) 1.2 TB/s
D) 1.8 TB/s
Question 6 of 10
What is the maximum number of MIG instances supported on an H100 GPU?
A) 4
B) 5
C) 7
D) 8
Question 7 of 10
The Transformer Engine in H100 primarily enables which capability during AI training?
A) Faster PCIe host-to-device data transfers
B) Automatic per-tensor FP8 ↔ BF16 precision switching
C) Hardware ray tracing for visualization
D) Lossless NVLink compression
Question 8 of 10
In the GPU memory hierarchy, which level offers the absolute lowest latency (~1 clock cycle) but also the smallest capacity?
A) HBM3 Device Memory
B) L2 Cache
C) L1 / Shared Memory
D) Register File
Question 9 of 10
Which GPU generation introduced the Transformer Engine and native FP8 training support?
A) Ampere (A100)
B) Hopper (H100)
C) Volta (V100)
D) Blackwell (B200)
Question 10 of 10
In MIG partitioning, each instance receives which set of dedicated hardware resources?
A) Dedicated SMs only, shared HBM
B) Dedicated SMs and HBM slice, shared L2 cache
C) Dedicated SMs, HBM slice, L2 cache, and memory controllers
D) Shared GPU resources with guaranteed minimum bandwidth

🃏 Flashcards

SM Architecture
How many SMs, CUDA cores, and Tensor Cores in the H100 SXM5?
132 SMs · 128 CUDA cores/SM = 16,896 CUDA cores · 4 Tensor Cores/SM = 528 Tensor Cores
Formula: SMs × 128 = CUDA cores
SIMT Execution
What is a warp and why does it matter?
32 threads executing the same instruction in lockstep (SIMT). Divergent branches serialize — this is warp divergence.
Warp = 32. Always 32. Every gen.
Memory Bandwidth
H100 HBM3 capacity and bandwidth?
80 GB capacity · 3.35 TB/s bandwidth. Compare: A100 = 2.0 TB/s · H200 (HBM3e) = 141 GB @ 4.8 TB/s
H100 = 3.35 TB/s HBM3
NVLink 4
H100 NVLink 4 — how many lanes and what bandwidth?
18 NVLink lanes · 900 GB/s bidirectional per GPU · 4 NVSwitch chips in DGX H100 for full all-to-all mesh
18 lanes × 50 GB/s = 900 GB/s
MIG Partitioning
What 4 hardware resources does each MIG instance get exclusively?
Dedicated: (1) SM partition (GPCs), (2) HBM memory slice, (3) L2 cache slice, (4) memory controllers
SM + HBM + L2 + mem-ctrl = full isolation
Transformer Engine
What does the H100 Transformer Engine do?
Automatically switches between FP8 and BF16 on a per-tensor, per-layer basis using dynamic scaling — no accuracy loss, 2× throughput vs BF16
FP8 + dynamic scale = Transformer Engine
Roofline Model
What is arithmetic intensity and when is a kernel memory-bound?
AI = FLOPs ÷ Bytes accessed. Memory-bound when AI is below the ridge point (~591 FLOPs/B for BF16 on H100). Elementwise ops (~1 FLOPs/B) are always memory-bound.
Low AI = memory wall
GPU Generations
A100 → H100 → B200: What precision did each generation add?
A100: TF32 + FP16 sparsity. H100: FP8 (Transformer Engine). B200: FP4 (2nd-gen Transformer Engine). Each ~2× throughput over prior gen.
TF32 → FP8 → FP4

🤖 GPU Architecture Advisor

Exam-focused recommendations, organized by topic:

⚙️ SM Architecture — Key Exam Points

  • H100 has 132 SMs. Each SM has 128 CUDA Cores, 4 Tensor Cores (4th gen), 4 warp schedulers, 256 KB register file, and up to 228 KB shared memory.
  • Warp = 32 threads executing the same instruction (SIMT). Warp size is fixed at 32 across all NVIDIA architectures.
  • Occupancy = active warps / max warps. H100 max = 64 warps/SM. Limited by register usage (--maxrregcount), shared memory allocation, and block size.
  • Warp divergence (branches within a warp) serializes both code paths, reducing effective throughput. Design kernels with uniform control flow where possible.
  • Tensor Cores (4th gen on H100) perform matrix multiply-accumulate D = A×B + C on small matrix tiles, operating in FP8, BF16, FP16, or TF32. In cuBLAS, set the math mode to CUBLAS_TF32_TENSOR_OP_MATH (via cublasSetMathMode) to allow TF32 Tensor Core paths.
  • RT Cores are absent on compute GPUs (H100 is HPC/AI, not graphics). Do not confuse RT Cores with Tensor Cores on the exam.
  • H100's 4 warp schedulers each issue one instruction per clock; instruction-level parallelism within a warp (back-to-back independent instructions) helps keep the pipelines full alongside warp-level latency hiding.

🧠 Memory Hierarchy — Optimization Guide

  • Registers: fastest, ~1 cycle. If a kernel exceeds 255 registers/thread, excess spills to local memory (HBM latency). Profile with nvcc --ptxas-options=-v to check register count.
  • Shared memory: programmer-managed, L1-speed. Allocate with __shared__. Tiling strategy: load HBM block → process in shared mem → write back. Used by FlashAttention, cuBLAS, CUTLASS.
  • Maximize shared memory: H100 allows up to 228 KB/SM via cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, 228*1024).
  • L2 cache (50 MB on H100) is shared. For multi-kernel pipelines, the cache set-aside API can reserve a portion: cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size).
  • HBM3 bandwidth (3.35 TB/s) is the bottleneck for memory-bound kernels. Use coalesced memory access — consecutive threads should access consecutive addresses for full HBM burst efficiency.
  • Arithmetic intensity determines the bound. Large GEMMs (>500 FLOPs/B) are compute-bound. Elementwise ops (~1 FLOPs/B) are memory-bound. Profile with ncu --metrics l1tex__t_bytes,smsp__sass_thread_inst_executed_op_fadd_pred_on.
  • NVLink 4 (900 GB/s) is ~7× PCIe 5 (128 GB/s). Peer-to-peer GPU copies over NVLink use cudaMemcpyPeerAsync and are far faster than going through CPU.

📈 GPU Generation Selection — Use Case Guide

  • A100 (Ampere): Best for existing codebases without FP8 support, large MIG deployments (7 × 10 GB for multi-tenant inference), or where budget limits H100 deployment. 312 TF32 TFLOPS, 2.0 TB/s HBM2e.
  • H100 (Hopper): Current training standard. Choose H100 when: training LLMs/transformers (Transformer Engine FP8 gives ~2× over BF16 on the same GPU, and up to ~4× vs A100 overall), running FP8 inference, needing Confidential Computing TEE, or maximizing throughput per watt at 700 W TDP. 989 TF32 / 3,958 FP8 TFLOPS.
  • H200: Drop-in H100 upgrade with HBM3e 141 GB at 4.8 TB/s. Choose when model size exceeds 80 GB or when inference is memory-bandwidth bound (larger KV cache, longer sequences).
  • B200 (Blackwell): Next-gen for inference at scale. FP4 (18,000 TFLOPS) and 192 GB HBM3e (8 TB/s) make it the choice for large-scale deployment of 405B+ models. 1,000W TDP requires updated power/cooling design.
  • Key exam comparison: H100 vs A100 — H100 wins on every metric: 1.22× the SMs (108 → 132), 1.68× the HBM bandwidth (2.0 → 3.35 TB/s), ~3.2× the TF32 TFLOPS, and native FP8 Tensor Cores (A100 has none).
  • Form factors matter: SXM (highest performance, requires NVLink baseboard) vs PCIe (lower BW, standard server, 350W). DGX H100 uses SXM5.

🔀 MIG Configuration — Partitioning Guide

  • Enable MIG mode: sudo nvidia-smi -i 0 -mig 1. Requires a reboot or GPU reset. Confirm with nvidia-smi -L.
  • Create 7 equal instances (1g.10gb): sudo nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb. Each gets 1/7 SMs, 10 GB HBM, dedicated L2 and memory controllers.
  • Create Compute Instances within each GPU Instance: sudo nvidia-smi mig -cci 1c.1g.10gb. CI defines the number of Tensor Core slices visible to the application.
  • List all MIG instances: nvidia-smi mig -lgi and nvidia-smi mig -lci. Each CI gets a unique MIG device UUID, visible as a separate GPU to CUDA applications.
  • Kubernetes with MIG: use the nvidia.com/mig-1g.10gb resource type in the pod spec. The device plugin handles assignment. Configure the MIG strategy (e.g., mig.strategy=single or mixed) in the NVIDIA GPU Operator; MIG Manager (mig-parted) applies the desired layout.
  • Fault isolation test: a cudaErrorIllegalAddress in one MIG instance raises an Xid error (e.g., Xid 31, GPU memory page fault) only for that instance. Other instances continue running. This is the key MIG vs MPS isolation difference.
  • MIG profile selection strategy: 7×1g.10gb for inference serving many small models; 2×3g.40gb for two medium training jobs; 7g.80gb for a single job needing full GPU bandwidth.

🧩 Memory Mnemonics

| Concept | Mnemonic / Hook |
|---|---|
| Warp size = 32 | "32 soldiers march in SIMT lockstep — always 32, every generation" |
| H100 SMs = 132 | "Hopper adds 24 SMs to Ampere's 108: 108 + 24 = 132" |
| H100 HBM3 = 3.35 TB/s | "Three Point Three Five — H100 needs speed to feed 132 SMs" |
| NVLink 4 = 900 GB/s | "18 links × 50 GB/s = 900 = the magic number for H100 fabric" |
| MIG max = 7 | "7 MIG slices on H100 — like 7 days in a week, full isolation guaranteed" |
| Memory hierarchy order | "Really Short L-two HBMs — Registers, Shared, L2, HBM, (host via) bandwidth" |
| Tensor Core precision order | "Four, Eight, Sixteen/BF, TF32, FP64 — go up in bits, lose throughput" |
| Transformer Engine = FP8 auto | "TE = auto-FP8 with dynamic scale — no code, pure speed" |
NCP-AII · Topic 1 Complete

Ready to ace the NCP-AII?

Build your GPU architecture mastery with FlashGenius — study guides, quiz banks, and adaptive flashcards for every NVIDIA certification.
