What You'll Master
nvidia-smi — GPU Monitoring CLI
The primary tool for GPU status, utilization, temperature, power draw, running processes, and driver/CUDA version. nvidia-smi -l 1 for continuous monitoring; --query-gpu for custom metric export.
DCGM — Enterprise GPU Health
Data Center GPU Manager provides enterprise-grade monitoring, diagnostics, and health checks across entire GPU fleets. Integrates with Prometheus/Grafana via DCGM-Exporter for Kubernetes-native observability.
Critical GPU Metrics
GPU Utilization % (SM activity), Memory Utilization %, Temperature (°C), Power Draw vs. TDP, SM Clock speed, and NVLink/PCIe bandwidth — understanding each and normal vs. alarm thresholds.
Cluster Orchestration Frameworks
Three key platforms: Slurm (HPC-native, sbatch/srun), Kubernetes + GPU Operator (containerized workloads), and Run:ai (advanced GPU quota management and topology-aware scheduling).
GPU Virtualization: MIG
Multi-Instance GPU provides hardware-level isolation — A100 supports up to 7 MIG instances with dedicated VRAM and compute. Ideal for inference serving and multi-tenant environments.
GPU Virtualization: vGPU
NVIDIA Virtual GPU uses software time-slicing to share one physical GPU across multiple VMs. Requires hypervisor support (VMware, Citrix). Profiles: vCompute, vWS, vApps for different use cases.
Domain Weight & Exam Focus
| Sub-Topic Area | Exam Emphasis | Key Concepts to Know |
|---|---|---|
| GPU Monitoring (nvidia-smi & DCGM) | High | Commands, metrics, thresholds, DCGM-Exporter |
| Cluster Orchestration (Slurm, K8s, Run:ai) | High | Job submission, GPU allocation, scheduling policies |
| GPU Virtualization (MIG, vGPU) | High | Hardware vs. software isolation, profiles, use cases |
| Data Center Management | Medium | PUE, cooling, power density, rack design |
| Job Scheduling Concepts | Medium | Gang scheduling, preemption, backfill, queues |
nvidia-smi: GPU Monitoring CLI
Core nvidia-smi Commands
- nvidia-smi — snapshot: driver, CUDA, GPU name, temp, power, utilization
- nvidia-smi -l 1 — live refresh every 1 second
- nvidia-smi --query-gpu=name,utilization.gpu,memory.used --format=csv — custom CSV export
- nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv — running GPU processes
- nvidia-smi topo -m — GPU-to-GPU and GPU-to-CPU topology (NVLink, PCIe)
- nvidia-smi -pm 1 — enable persistence mode (keeps the driver loaded, reducing initialization latency)
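The --query-gpu output is designed for scripting. A minimal Python sketch of consuming it, assuming the default CSV format with a header row and units embedded in values (the sample rows below are invented for illustration):

```python
import csv
import io

def parse_gpu_query(csv_text):
    """Parse `nvidia-smi --query-gpu=... --format=csv` output into dicts.

    Assumes the default CSV layout: one header row, then one row per GPU,
    with units embedded in the values (e.g. "40960 MiB").
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = [h.strip() for h in rows[0]]
    return [dict(zip(header, (v.strip() for v in row))) for row in rows[1:]]

# Sample output from a hypothetical 2-GPU node:
sample = """name, utilization.gpu [%], memory.used [MiB]
NVIDIA A100-SXM4-80GB, 87 %, 40960 MiB
NVIDIA A100-SXM4-80GB, 12 %, 2048 MiB"""

gpus = parse_gpu_query(sample)
print(gpus[0]["utilization.gpu [%]"])  # 87 %
```

Pairing this with `--format=csv,noheader,nounits` simplifies parsing further at the cost of self-describing output.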
Key nvidia-smi Output Fields
- GPU Util %: percentage of time SMs are busy (compute activity)
- Mem-Usage / Total: VRAM used vs. total capacity in MiB
- Temp: GPU die temperature in °C
- Power/Limit: current draw vs. TDP (Thermal Design Power)
- Perf: performance state — P0 (max) to P12 (min)
- SM Clock / Mem Clock: operating frequencies in MHz
GPU Temperature Thresholds
- 30-60°C: idle or light workload — normal
- 60-75°C: moderate workload — normal
- 75-80°C: heavy workload — acceptable, but monitor
- 80°C+: thermal throttling begins — clocks reduce to protect hardware
- 85°C+: critical — investigate cooling, airflow, ambient temperature
- Shutdown threshold: typically 90-95°C to prevent hardware damage
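A tiny Python helper encoding these bands — this guide's rules of thumb, not a vendor specification:

```python
def classify_gpu_temp(temp_c):
    """Map a GPU die temperature (°C) to the alert bands above.

    Thresholds follow this guide's rules of thumb, not an NVIDIA spec;
    exact throttle points vary by GPU model.
    """
    if temp_c < 75:
        return "normal"
    if temp_c < 80:
        return "heavy-load: acceptable, monitor"
    if temp_c < 85:
        return "throttling: clocks reducing"
    return "critical: investigate cooling"

print(classify_gpu_temp(62))  # normal
print(classify_gpu_temp(82))  # throttling: clocks reducing
```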
DCGM: Data Center GPU Manager
DCGM Overview
DCGM is NVIDIA's enterprise tool for managing GPU health and performance at scale across entire data center fleets. It provides: continuous health monitoring, active diagnostics, field watches, and REST API access. Available as open-source on GitHub.
dcgmi CLI Commands
- dcgmi discovery -l — list all discovered GPUs
- dcgmi health -g 0 -c — check health of GPU group 0
- dcgmi diag -r 1 — run quick diagnostics (r1=quick, r3=extended)
- dcgmi stats -e -g 0 — enable stats collection for group
- dcgmi dmon -e 203,204 — monitor specific field IDs continuously
DCGM-Exporter for Kubernetes
DCGM-Exporter runs as a DaemonSet on every GPU node, exposing GPU metrics in Prometheus format at /metrics. Enables GPU observability in Grafana dashboards. Metrics include GPU utilization, memory, temperature, power, and ECC errors — all labeled per-GPU and per-pod.
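The /metrics endpoint serves plain Prometheus exposition text. A minimal sketch of parsing one such line in Python — the metric name DCGM_FI_DEV_GPU_UTIL is real DCGM-Exporter output, but the label set and value here are illustrative, and this parser ignores edge cases like commas inside quoted label values:

```python
import re

# Matches a simple exposition-format line: name{labels} value
METRIC_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+([\d.eE+-]+)$')

def parse_metric_line(line):
    """Return (metric_name, labels_dict, value) or None if no match."""
    m = METRIC_RE.match(line.strip())
    if not m:
        return None
    name, label_str, value = m.groups()
    labels = dict(kv.split("=", 1) for kv in label_str.split(","))
    labels = {k: v.strip('"') for k, v in labels.items()}
    return name, labels, float(value)

# Illustrative line in the shape DCGM-Exporter emits:
line = 'DCGM_FI_DEV_GPU_UTIL{gpu="0",pod="train-worker-0"} 87'
name, labels, value = parse_metric_line(line)
print(name, labels["pod"], value)  # DCGM_FI_DEV_GPU_UTIL train-worker-0 87.0
```

In practice Prometheus does this scraping and parsing for you; the point is that per-GPU and per-pod labels are what make Grafana breakdowns possible.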
DCGM Health Checks
- PCIe health: bandwidth and error rate on the PCIe bus
- NVLink health: inter-GPU link status and bandwidth
- Memory health: ECC error counts (SBE = correctable, DBE = uncorrectable)
- SM stress test: sustained compute stress to surface thermal/power issues
- XID errors: NVIDIA driver-level errors logged to dmesg (e.g., XID 79 = GPU has fallen off the bus)
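A sketch of pulling XID codes out of kernel log text. The "NVRM: Xid (PCI:...): code, ..." shape follows the common NVIDIA driver message format, but exact formatting can vary by driver version, so treat the regex as an assumption to adapt:

```python
import re

# Common shape of an NVIDIA XID line in dmesg; may vary by driver version.
XID_RE = re.compile(r'NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)')

def find_xid_errors(dmesg_text):
    """Return (pci_bus, xid_code) pairs found in kernel log text."""
    return [(bus, int(code)) for bus, code in XID_RE.findall(dmesg_text)]

log = (
    "[1234.5] NVRM: Xid (PCI:0000:3b:00): 79, pid=4021, "
    "GPU has fallen off the bus.\n"
)
print(find_xid_errors(log))  # [('PCI:0000:3b:00', 79)]
```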
AI Cluster Orchestration
Slurm: HPC Workload Manager
- sbatch job.sh — submit a batch job (returns Job ID)
- srun python train.py — run interactive/parallel job
- squeue — view job queue and status
- scancel <jobid> — cancel a job
- sinfo — view node/partition status
- GRES (Generic RESources): #SBATCH --gres=gpu:a100:4 — request 4 A100 GPUs
- Native to HPC; used in supercomputers and on-prem clusters
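The directives above combine into a batch script. A hypothetical example — the job name, partition, resource counts, and training script are placeholders for your site's configuration:

```shell
#!/bin/bash
# Hypothetical Slurm batch script; partition and script names are placeholders.
#SBATCH --job-name=llm-train
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:a100:4        # request 4 A100 GPUs via GRES
#SBATCH --cpus-per-task=32
#SBATCH --time=24:00:00          # walltime limit (prevents queue starvation)
#SBATCH --output=%x-%j.out       # %x = job name, %j = job ID

srun python train.py
```

Submit with `sbatch job.sh`; Slurm returns the job ID, and `squeue` shows it pending until the full GPU request can be satisfied.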
Kubernetes + GPU Operator
- GPU Operator: automates driver, CUDA toolkit, DCGM, and device plugin installation on GPU nodes
- Device Plugin: exposes nvidia.com/gpu as a schedulable Kubernetes resource
- GPU request in pod spec: resources.limits: nvidia.com/gpu: 2
- Nodes labeled with GPU type, memory, and MIG config for affinity scheduling
- Used for containerized, cloud-native AI workloads and inference serving
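A sketch of the pod-spec pattern described above. The image tag and node-selector label are illustrative placeholders (the label shape follows what NVIDIA's GPU Feature Discovery typically applies):

```yaml
# Illustrative pod requesting 2 GPUs via the device plugin resource.
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB   # example node label
  containers:
    - name: triton
      image: nvcr.io/nvidia/tritonserver:24.01-py3   # placeholder tag
      resources:
        limits:
          nvidia.com/gpu: 2   # GPUs are requested via limits, not requests
```

Note that `nvidia.com/gpu` is always set under `limits`; GPUs cannot be overcommitted like CPU or memory.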
Run:ai — Advanced GPU Scheduling
- Kubernetes-native GPU orchestration with fairness and quota enforcement
- Fractional GPU: allocate 0.5 GPU to a workload (time-sharing)
- Quota management: teams/projects get GPU quota; over-quota preemption
- Topology-aware scheduling: places jobs on nodes with NVLink connectivity
- Gang scheduling: all pods start simultaneously or none do
- Dashboard shows GPU utilization per team, job, and node
Job Scheduling Concepts
- Gang scheduling: all GPUs allocated simultaneously (critical for distributed training — partial start = deadlock)
- Preemption: higher-priority job displaces lower-priority (checkpointing required)
- Backfill scheduling: small jobs fill gaps while large job waits for full allocation
- Priority queues: production > interactive > batch > best-effort
- Time limits: max walltime per job prevents queue starvation
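A toy Python model of the all-or-nothing allocation rule, which also illustrates backfill: small jobs launch in the gap while the large job holds its place. Job names and counts are invented; this is a sketch of the policy, not a real scheduler:

```python
def try_gang_schedule(job_gpu_needs, free_gpus):
    """All-or-nothing allocation: a job launches only if its FULL GPU
    request fits; otherwise nothing is allocated (no partial starts).

    Toy model of the gang-scheduling rule, not a production scheduler.
    Returns (started_jobs, remaining_free_gpus).
    """
    started = []
    for job, need in job_gpu_needs:
        if need <= free_gpus:      # full allocation available -> launch
            free_gpus -= need
            started.append(job)
        # else: hold the job; never start a subset of its workers
    return started, free_gpus

# 16 free GPUs: the 64-GPU job waits, smaller jobs backfill behind it.
jobs = [("train-64gpu", 64), ("finetune-8gpu", 8), ("eval-2gpu", 2)]
started, free = try_gang_schedule(jobs, free_gpus=16)
print(started, free)  # ['finetune-8gpu', 'eval-2gpu'] 6
```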
GPU Virtualization: MIG vs. vGPU
| Feature | MIG (Multi-Instance GPU) | vGPU (Virtual GPU) |
|---|---|---|
| Isolation type | Hardware-level (dedicated SM partitions + VRAM) | Software time-slicing |
| Supported GPUs | A100, H100, A30 | Many NVIDIA GPUs (T4, A10, A100, etc.) |
| Max instances | Up to 7 per GPU (A100, H100) | Many VMs (limited by time-slice) |
| Memory isolation | Yes — each instance has dedicated VRAM | No — shared memory pool |
| Use cases | Multi-tenant inference, test/dev, small training | VDI, vCompute, remote workstations, mixed VM workloads |
| Hypervisor required | No (bare metal) | Yes (VMware vSphere, Citrix, KVM) |
| License required | No (hardware feature) | Yes (NVIDIA vGPU Software license) |
MIG Profiles (A100 40GB)
- 7g.40gb: the full GPU as a single MIG instance — all 7 compute slices, 40GB VRAM
- 4g.20gb: 4 compute slices, 20GB VRAM — largest partial slice
- 3g.20gb: 3 compute slices, 20GB VRAM
- 2g.10gb: 2 compute slices, 10GB VRAM
- 1g.5gb: smallest — 1 compute slice, 5GB VRAM; up to 7 per A100
- On the A100 80GB the same slicing applies with double the memory (1g.10gb through 7g.80gb)
- Enable MIG: nvidia-smi -i 0 -mig 1
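A sketch of the full setup flow on GPU 0 (run as root with the GPU idle). The profile ID 19 shown is what 1g.5gb maps to on an A100, but IDs vary by GPU model and driver, so always confirm with the listing command first:

```shell
# Hedged MIG setup sketch; confirm profile IDs on your system before use.
nvidia-smi -i 0 -mig 1     # enable MIG mode (may require a GPU reset)
nvidia-smi mig -lgip       # list available GPU instance profiles and their IDs
# Create two 1g.5gb GPU instances plus default compute instances (-C).
# ID 19 = 1g.5gb on A100 in current drivers — verify via -lgip above.
nvidia-smi mig -cgi 19,19 -C
nvidia-smi -L              # MIG devices now appear as separate entries
```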
vGPU Profile Types
- vCompute (C-series): CUDA compute workloads on VMs, no display output
- vWS (Q-series): workstation graphics — CAD, creative tools
- vPC (B-series): virtual desktops (VDI) — lighter GPU acceleration
- vApps (A-series): published application streaming (RDSH)
- Profile name format: A10-4C = A10 GPU, 4GB vCompute profile
- Time-sliced: each VM gets GPU access in round-robin bursts
Data Center Management Essentials
Power Usage Effectiveness (PUE)
PUE = Total Facility Power ÷ IT Equipment Power
- PUE = 1.0: ideal — 100% power goes to IT, zero overhead
- PUE = 1.2: excellent (hyperscale data centers)
- PUE = 1.5: industry average
- PUE = 2.0: poor — half of all power is overhead (cooling, lighting, etc.)
- AI data centers target PUE < 1.3
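The formula as a one-line Python function, with an invented facility as the worked example:

```python
def pue(total_facility_kw, it_equipment_kw):
    """PUE = total facility power / IT equipment power (same units)."""
    return total_facility_kw / it_equipment_kw

# Hypothetical facility: 6.0 MW total draw powering 5.0 MW of IT load.
print(round(pue(6000, 5000), 2))  # 1.2 -> "excellent" on the scale above
```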
Cooling Strategies for GPU Servers
- Air cooling: traditional CRAC/CRAH units — limited to ~30kW/rack for GPU servers
- Rear-door heat exchangers: capture hot exhaust at rack rear, up to 40kW/rack
- Liquid cooling (direct-to-chip): coolant directly to GPU cold plates — enables 100kW+ racks (DGX GB200 NVL72)
- Immersion cooling: servers submerged in dielectric fluid — most efficient
- H100 DGX: up to 10.2kW per server; GB200 NVL72: ~120kW per rack
Power & Rack Design
- Power density: AI GPU servers require 5-30kW per server vs. 1-3kW for CPU servers
- PDU (Power Distribution Unit): per-rack power metering, remote switching
- UPS: uninterruptible power supply — protects from outages during training runs
- Phase balancing: distribute 3-phase power evenly across PDUs
- Stranded power: available electrical capacity unused due to thermal limits
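A small Python illustration of how thermal limits strand power, using the ~30kW air-cooled rack budget and ~10.2kW DGX H100-class server figures from above:

```python
def servers_per_rack(server_kw, rack_limit_kw):
    """Servers that fit a rack's thermal/power budget (floor division),
    plus the stranded kW left because one more server would exceed it."""
    count = int(rack_limit_kw // server_kw)
    stranded_kw = rack_limit_kw - count * server_kw
    return count, stranded_kw

# Air-cooled rack (~30 kW budget) with 10.2 kW servers:
count, stranded = servers_per_rack(10.2, 30.0)
print(count, round(stranded, 1))  # 2 9.6  -> ~9.6 kW stranded per rack
```

This is why dense GPU deployments move to rear-door heat exchangers or liquid cooling: raising the rack's thermal budget recovers the stranded electrical capacity.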
"80 is the Warning Number"
GPU temperature at or above 80°C = thermal throttling kicks in. Clock speeds drop to protect the hardware. In the exam, 80°C is the threshold between "acceptable heavy load" and "investigate immediately."
MIG = Hardware Knife, vGPU = Time-Share
MIG physically slices the GPU into isolated pieces with dedicated VRAM and SMs — hardware-level isolation. vGPU time-shares one GPU across multiple VMs using software — no memory isolation. MIG = knife cutting. vGPU = rotation schedule.
"Gang Scheduling = All or Nothing"
Distributed training needs all GPUs to start simultaneously. Gang scheduling holds resources until the full allocation is available, then launches all at once. Partial starts cause deadlock — workers wait forever for each other.
DCGM-Exporter = GPU Metrics → Prometheus
Remember the pipeline: DCGM-Exporter (DaemonSet) → Prometheus → Grafana. DCGM-Exporter scrapes GPU metrics on every node and exposes them on /metrics endpoint. This is how Kubernetes clusters get per-GPU, per-pod observability at scale.
PUE: Lower = Better, 1.0 = Perfect
PUE = Total Facility Power ÷ IT Power. A PUE of 1.0 is physically impossible (some overhead always exists) but represents 100% efficiency. Hyperscalers achieve ~1.2. Traditional data centers average ~1.5. For AI: always target <1.3.
Slurm = HPC Tracks, K8s = Cloud Containers
Slurm is the native HPC job scheduler — sbatch for batch, srun for interactive, GRES for GPU requests. It's the dominant choice for supercomputers and on-prem HPC. Kubernetes is for containerized, cloud-native AI workloads. Run:ai extends Kubernetes with GPU-specific fairness and quota controls.
What command continuously shows GPU stats refreshed every 2 seconds, and what flag enables persistent mode?
Loop mode: nvidia-smi -l 2 (refresh every 2 seconds)
Persistence mode: nvidia-smi -pm 1 — keeps the driver loaded to eliminate initialization latency for subsequent GPU queries or workload launches.
Bonus: nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used --format=csv exports custom metrics in CSV.
How many MIG instances can an A100 support, and what is the smallest profile?
Up to 7 MIG instances per A100.
Smallest profile: 1g.5gb (A100 40GB) — 1 compute slice, 5GB VRAM; the A100 80GB equivalent is 1g.10gb.
Enable MIG:
nvidia-smi -i 0 -mig 1
Key advantage: hardware-level isolation — each instance has dedicated VRAM and compute, preventing noisy-neighbor interference.
What is the PUE formula, and what values represent excellent vs. poor efficiency?
Scale:
1.0 = theoretical perfect (impossible in practice)
1.2 = excellent (hyperscale data centers)
1.5 = industry average
2.0 = poor (50% of power is overhead)
AI target: <1.3 for GPU-dense facilities.
What is DCGM-Exporter and how does it enable GPU observability in Kubernetes?
Pipeline: DCGM-Exporter → Prometheus scrapes /metrics → Grafana visualizes dashboards
Metrics exposed: GPU utilization, memory usage, temperature, power, ECC errors, all labeled by GPU index and Kubernetes pod name.
What Slurm commands submit, view, and cancel jobs? How do you request 4 A100 GPUs?
Submit batch job: sbatch train.sh
Interactive run: srun python train.py
View queue: squeue
Cancel: scancel <jobid>
Node status: sinfo
Request 4 A100s: #SBATCH --gres=gpu:a100:4 (GRES = Generic RESources)
What is the key architectural difference between NVIDIA vGPU and MIG isolation?
vGPU = Software time-slicing: Multiple VMs take turns accessing the full GPU in time slots. Memory is pooled (not isolated). Requires a hypervisor and NVIDIA vGPU license.
Rule: Need isolation → MIG. Need multi-VM on any GPU → vGPU.
Why is gang scheduling essential for distributed AI training, and what happens without it?
Without gang scheduling: Some workers start early and wait for others that haven't been allocated yet. All workers block on collective communication operations (AllReduce) → deadlock or massive idle time.
Solution: Scheduler holds the full GPU set until available, then releases all at once.
At what temperature does GPU thermal throttling begin, and what does the GPU do to compensate?
What happens: The GPU automatically reduces its SM clock frequency and memory clock to decrease power consumption and generate less heat.
Effect on workload: Training throughput drops — fewer FLOPS/second until temperature stabilizes.
Root causes: Inadequate cooling, high ambient temperature, blocked airflow, dust accumulation, or sustained 100% GPU utilization in a dense rack.
🖥️ GPU Monitoring Strategy
- Know the difference: nvidia-smi = per-node CLI; DCGM = fleet-scale enterprise monitoring. Exam will test which tool fits which scale scenario.
- Memorize thermal thresholds: 75-80°C = acceptable heavy load, 80°C+ = throttling begins, 85°C+ = critical. Any question about "alarm" or "investigate" = 80°C+.
- DCGM-Exporter → Prometheus → Grafana is the canonical Kubernetes GPU observability stack. Know each component's role.
- GPU Utilization % (SM activity) ≠ GPU Memory Utilization %. They measure different things. Low GPU util with high memory util = memory-bound workload.
- XID errors in dmesg are NVIDIA driver-level errors. XID 79 = GPU has fallen off the bus. These surface in DCGM health checks, not nvidia-smi output.
Official & Recommended Resources
- NCA-AIIO Certification Page (NVIDIA) — official exam page: objectives, registration, study resources, and exam guide download.
- DCGM User Guide (NVIDIA Docs) — complete reference for the dcgmi CLI, health checks, diagnostics, and DCGM-Exporter deployment.
- MIG User Guide (NVIDIA Docs) — definitive guide to enabling MIG, creating instances, and the available profiles on A100 and H100.
- NVIDIA Virtual GPU (vGPU) User Guide — vGPU software architecture, profile types (vCompute, vWS, vPC, vApps), hypervisor configuration, and licensing.
- Slurm Workload Manager Documentation — official docs for sbatch, srun, squeue, GRES configuration, and scheduling policies.
- NVIDIA GPU Operator for Kubernetes — automates GPU driver, CUDA toolkit, and DCGM-Exporter deployment across Kubernetes GPU nodes.