What You'll Master
nvidia-smi — GPU Monitoring CLI
The primary tool for GPU status, utilization, temperature, power draw, running processes, and driver/CUDA version. nvidia-smi -l 1 for continuous monitoring; --query-gpu for custom metric export.
DCGM — Enterprise GPU Health
Data Center GPU Manager provides enterprise-grade monitoring, diagnostics, and health checks across entire GPU fleets. Integrates with Prometheus/Grafana via DCGM-Exporter for Kubernetes-native observability.
Critical GPU Metrics
GPU Utilization % (SM activity), Memory Utilization %, Temperature (°C), Power Draw vs. TDP, SM Clock speed, and NVLink/PCIe bandwidth — understanding each and normal vs. alarm thresholds.
Cluster Orchestration Frameworks
Three key platforms: Slurm (HPC-native, sbatch/srun), Kubernetes + GPU Operator (containerized workloads), and Run:ai (advanced GPU quota management and topology-aware scheduling).
GPU Virtualization: MIG
Multi-Instance GPU provides hardware-level isolation — A100 supports up to 7 MIG instances with dedicated VRAM and compute. Ideal for inference serving and multi-tenant environments.
GPU Virtualization: vGPU
NVIDIA Virtual GPU uses software time-slicing to share one physical GPU across multiple VMs. Requires hypervisor support (VMware, Citrix). Profiles: vCompute, vWS, vApps for different use cases.
Domain Weight & Exam Focus
| Sub-Topic Area | Exam Emphasis | Key Concepts to Know |
|---|---|---|
| GPU Monitoring (nvidia-smi & DCGM) | High | Commands, metrics, thresholds, DCGM-Exporter |
| Cluster Orchestration (Slurm, K8s, Run:ai) | High | Job submission, GPU allocation, scheduling policies |
| GPU Virtualization (MIG, vGPU) | High | Hardware vs. software isolation, profiles, use cases |
| Data Center Management | Medium | PUE, cooling, power density, rack design |
| Job Scheduling Concepts | Medium | Gang scheduling, preemption, backfill, queues |
nvidia-smi: GPU Monitoring CLI
Core nvidia-smi Commands
- nvidia-smi — snapshot: driver, CUDA, GPU name, temp, power, utilization
- nvidia-smi -l 1 — live refresh every 1 second
- nvidia-smi --query-gpu=name,utilization.gpu,memory.used --format=csv — custom CSV export
- nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv — running GPU processes
- nvidia-smi topo -m — GPU-to-GPU and GPU-to-CPU topology (NVLink, PCIe)
- nvidia-smi -pm 1 — enable persistence mode (keeps the driver loaded, reducing initialization latency)
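The --query-gpu output is designed for scripting. A minimal Python sketch of consuming it, assuming the default CSV format with a header row and units embedded in values (the sample rows below are invented for illustration):

```python
import csv
import io

def parse_gpu_query(csv_text):
    """Parse `nvidia-smi --query-gpu=... --format=csv` output into dicts.

    Assumes the default CSV layout: one header row, then one row per GPU,
    with units embedded in the values (e.g. "40960 MiB").
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = [h.strip() for h in rows[0]]
    return [dict(zip(header, (v.strip() for v in row))) for row in rows[1:]]

# Sample output from a hypothetical 2-GPU node:
sample = """name, utilization.gpu [%], memory.used [MiB]
NVIDIA A100-SXM4-80GB, 87 %, 40960 MiB
NVIDIA A100-SXM4-80GB, 12 %, 2048 MiB"""

gpus = parse_gpu_query(sample)
print(gpus[0]["utilization.gpu [%]"])  # 87 %
```

Pairing this with `--format=csv,noheader,nounits` simplifies parsing further at the cost of self-describing output.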
Key nvidia-smi Output Fields
- GPU Util %: percentage of time SMs are busy (compute activity)
- Mem-Usage / Total: VRAM used vs. total capacity in MiB
- Temp: GPU die temperature in °C
- Power/Limit: current draw vs. TDP (Thermal Design Power)
- Perf: performance state — P0 (max) to P12 (min)
- SM Clock / Mem Clock: operating frequencies in MHz
GPU Temperature Thresholds
- 30-60°C: idle or light workload — normal
- 60-75°C: moderate workload — normal
- 75-80°C: heavy workload — acceptable, but monitor
- 80°C+: thermal throttling begins — clocks reduce to protect hardware
- 85°C+: critical — investigate cooling, airflow, ambient temperature
- Shutdown threshold: typically 90-95°C to prevent hardware damage
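A tiny Python helper encoding these bands — this guide's rules of thumb, not a vendor specification:

```python
def classify_gpu_temp(temp_c):
    """Map a GPU die temperature (°C) to the alert bands above.

    Thresholds follow this guide's rules of thumb, not an NVIDIA spec;
    exact throttle points vary by GPU model.
    """
    if temp_c < 75:
        return "normal"
    if temp_c < 80:
        return "heavy-load: acceptable, monitor"
    if temp_c < 85:
        return "throttling: clocks reducing"
    return "critical: investigate cooling"

print(classify_gpu_temp(62))  # normal
print(classify_gpu_temp(82))  # throttling: clocks reducing
```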
DCGM: Data Center GPU Manager
DCGM Overview
DCGM is NVIDIA's enterprise tool for managing GPU health and performance at scale across entire data center fleets. It provides: continuous health monitoring, active diagnostics, field watches, and REST API access. Available as open-source on GitHub.
dcgmi CLI Commands
- dcgmi discovery -l — list all discovered GPUs
- dcgmi health -g 0 -c — check health of GPU group 0
- dcgmi diag -r 1 — run quick diagnostics (r1=quick, r3=extended)
- dcgmi stats -e -g 0 — enable stats collection for group
- dcgmi dmon -e 203,204 — monitor specific field IDs continuously
DCGM-Exporter for Kubernetes
DCGM-Exporter runs as a DaemonSet on every GPU node, exposing GPU metrics in Prometheus format at /metrics. Enables GPU observability in Grafana dashboards. Metrics include GPU utilization, memory, temperature, power, and ECC errors — all labeled per-GPU and per-pod.
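The /metrics endpoint serves plain Prometheus exposition text. A minimal sketch of parsing one such line in Python — the metric name DCGM_FI_DEV_GPU_UTIL is real DCGM-Exporter output, but the label set and value here are illustrative, and this parser ignores edge cases like commas inside quoted label values:

```python
import re

# Matches a simple exposition-format line: name{labels} value
METRIC_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+([\d.eE+-]+)$')

def parse_metric_line(line):
    """Return (metric_name, labels_dict, value) or None if no match."""
    m = METRIC_RE.match(line.strip())
    if not m:
        return None
    name, label_str, value = m.groups()
    labels = dict(kv.split("=", 1) for kv in label_str.split(","))
    labels = {k: v.strip('"') for k, v in labels.items()}
    return name, labels, float(value)

# Illustrative line in the shape DCGM-Exporter emits:
line = 'DCGM_FI_DEV_GPU_UTIL{gpu="0",pod="train-worker-0"} 87'
name, labels, value = parse_metric_line(line)
print(name, labels["pod"], value)  # DCGM_FI_DEV_GPU_UTIL train-worker-0 87.0
```

In practice Prometheus does this scraping and parsing for you; the point is that per-GPU and per-pod labels are what make Grafana breakdowns possible.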
DCGM Health Checks
- PCIe health: bandwidth and error rate on the PCIe bus
- NVLink health: inter-GPU link status and bandwidth
- Memory health: ECC error counts (SBE = correctable, DBE = uncorrectable)
- SM stress test: sustained compute stress to surface thermal/power issues
- XID errors: NVIDIA driver-level errors logged to dmesg (e.g., XID 79 = GPU has fallen off the bus)
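A sketch of pulling XID codes out of kernel log text. The "NVRM: Xid (PCI:...): code, ..." shape follows the common NVIDIA driver message format, but exact formatting can vary by driver version, so treat the regex as an assumption to adapt:

```python
import re

# Common shape of an NVIDIA XID line in dmesg; may vary by driver version.
XID_RE = re.compile(r'NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)')

def find_xid_errors(dmesg_text):
    """Return (pci_bus, xid_code) pairs found in kernel log text."""
    return [(bus, int(code)) for bus, code in XID_RE.findall(dmesg_text)]

log = (
    "[1234.5] NVRM: Xid (PCI:0000:3b:00): 79, pid=4021, "
    "GPU has fallen off the bus.\n"
)
print(find_xid_errors(log))  # [('PCI:0000:3b:00', 79)]
```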
AI Cluster Orchestration
Slurm: HPC Workload Manager
- sbatch job.sh — submit a batch job (returns Job ID)
- srun python train.py — run interactive/parallel job
- squeue — view job queue and status
- scancel <jobid> — cancel a job
- sinfo — view node/partition status
- GRES (Generic RESources): #SBATCH --gres=gpu:a100:4 — request 4 A100 GPUs
- Native to HPC; used in supercomputers and on-prem clusters
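The directives above combine into a batch script. A hypothetical example — the job name, partition, resource counts, and training script are placeholders for your site's configuration:

```shell
#!/bin/bash
# Hypothetical Slurm batch script; partition and script names are placeholders.
#SBATCH --job-name=llm-train
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:a100:4        # request 4 A100 GPUs via GRES
#SBATCH --cpus-per-task=32
#SBATCH --time=24:00:00          # walltime limit (prevents queue starvation)
#SBATCH --output=%x-%j.out       # %x = job name, %j = job ID

srun python train.py
```

Submit with `sbatch job.sh`; Slurm returns the job ID, and `squeue` shows it pending until the full GPU request can be satisfied.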
Kubernetes + GPU Operator
- GPU Operator: automates driver, CUDA toolkit, DCGM, and device plugin installation on GPU nodes
- Device Plugin: exposes nvidia.com/gpu as a schedulable Kubernetes resource
- GPU request in pod spec: resources.limits: nvidia.com/gpu: 2
- Nodes labeled with GPU type, memory, and MIG config for affinity scheduling
- Used for containerized, cloud-native AI workloads and inference serving
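A sketch of the pod-spec pattern described above. The image tag and node-selector label are illustrative placeholders (the label shape follows what NVIDIA's GPU Feature Discovery typically applies):

```yaml
# Illustrative pod requesting 2 GPUs via the device plugin resource.
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB   # example node label
  containers:
    - name: triton
      image: nvcr.io/nvidia/tritonserver:24.01-py3   # placeholder tag
      resources:
        limits:
          nvidia.com/gpu: 2   # GPUs are requested via limits, not requests
```

Note that `nvidia.com/gpu` is always set under `limits`; GPUs cannot be overcommitted like CPU or memory.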
Run:ai — Advanced GPU Scheduling
- Kubernetes-native GPU orchestration with fairness and quota enforcement
- Fractional GPU: allocate 0.5 GPU to a workload (time-sharing)
- Quota management: teams/projects get GPU quota; over-quota preemption
- Topology-aware scheduling: places jobs on nodes with NVLink connectivity
- Gang scheduling: all pods start simultaneously or none do
- Dashboard shows GPU utilization per team, job, and node
Job Scheduling Concepts
- Gang scheduling: all GPUs allocated simultaneously (critical for distributed training — partial start = deadlock)
- Preemption: higher-priority job displaces lower-priority (checkpointing required)
- Backfill scheduling: small jobs fill gaps while large job waits for full allocation
- Priority queues: production > interactive > batch > best-effort
- Time limits: max walltime per job prevents queue starvation
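A toy Python model of the all-or-nothing allocation rule, which also illustrates backfill: small jobs launch in the gap while the large job holds its place. Job names and counts are invented; this is a sketch of the policy, not a real scheduler:

```python
def try_gang_schedule(job_gpu_needs, free_gpus):
    """All-or-nothing allocation: a job launches only if its FULL GPU
    request fits; otherwise nothing is allocated (no partial starts).

    Toy model of the gang-scheduling rule, not a production scheduler.
    Returns (started_jobs, remaining_free_gpus).
    """
    started = []
    for job, need in job_gpu_needs:
        if need <= free_gpus:      # full allocation available -> launch
            free_gpus -= need
            started.append(job)
        # else: hold the job; never start a subset of its workers
    return started, free_gpus

# 16 free GPUs: the 64-GPU job waits, smaller jobs backfill behind it.
jobs = [("train-64gpu", 64), ("finetune-8gpu", 8), ("eval-2gpu", 2)]
started, free = try_gang_schedule(jobs, free_gpus=16)
print(started, free)  # ['finetune-8gpu', 'eval-2gpu'] 6
```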
GPU Virtualization: MIG vs. vGPU
| Feature | MIG (Multi-Instance GPU) | vGPU (Virtual GPU) |
|---|---|---|
| Isolation type | Hardware-level (dedicated SM partitions + VRAM) | Software time-slicing |
| Supported GPUs | A100, H100, A30 | Many NVIDIA GPUs (T4, A10, A100, etc.) |
| Max instances | Up to 7 per GPU (A100, H100) | Many VMs (limited by time-slice) |
| Memory isolation | Yes — each instance has dedicated VRAM | No — shared memory pool |
| Use cases | Multi-tenant inference, test/dev, small training | VDI, vCompute, remote workstations, mixed VM workloads |
| Hypervisor required | No (bare metal) | Yes (VMware vSphere, Citrix, KVM) |
| License required | No (hardware feature) | Yes (NVIDIA vGPU Software license) |
MIG Profiles (A100 40GB)
- 7g.40gb: the full GPU as a single MIG instance — all 7 compute slices, 40GB VRAM
- 4g.20gb: 4 compute slices, 20GB VRAM — largest partial slice
- 3g.20gb: 3 compute slices, 20GB VRAM
- 2g.10gb: 2 compute slices, 10GB VRAM
- 1g.5gb: smallest — 1 compute slice, 5GB VRAM; up to 7 per A100
- On the A100 80GB the same slicing applies with double the memory (1g.10gb through 7g.80gb)
- Enable MIG: nvidia-smi -i 0 -mig 1
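A sketch of the full setup flow on GPU 0 (run as root with the GPU idle). The profile ID 19 shown is what 1g.5gb maps to on an A100, but IDs vary by GPU model and driver, so always confirm with the listing command first:

```shell
# Hedged MIG setup sketch; confirm profile IDs on your system before use.
nvidia-smi -i 0 -mig 1     # enable MIG mode (may require a GPU reset)
nvidia-smi mig -lgip       # list available GPU instance profiles and their IDs
# Create two 1g.5gb GPU instances plus default compute instances (-C).
# ID 19 = 1g.5gb on A100 in current drivers — verify via -lgip above.
nvidia-smi mig -cgi 19,19 -C
nvidia-smi -L              # MIG devices now appear as separate entries
```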
vGPU Profile Types
- vCompute (C-series): CUDA compute workloads on VMs, no display output
- vWS (Q-series): workstation graphics — CAD, creative tools
- vPC (B-series): virtual desktops (VDI) — lighter GPU acceleration
- vApps (A-series): published application streaming (RDSH)
- Profile name format: A10-4C = A10 GPU, 4GB vCompute profile
- Time-sliced: each VM gets GPU access in round-robin bursts
Data Center Management Essentials
Power Usage Effectiveness (PUE)
PUE = Total Facility Power ÷ IT Equipment Power
- PUE = 1.0: ideal — 100% power goes to IT, zero overhead
- PUE = 1.2: excellent (hyperscale data centers)
- PUE = 1.5: industry average
- PUE = 2.0: poor — half of all power is overhead (cooling, lighting, etc.)
- AI data centers target PUE < 1.3
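The formula as a one-line Python function, with an invented facility as the worked example:

```python
def pue(total_facility_kw, it_equipment_kw):
    """PUE = total facility power / IT equipment power (same units)."""
    return total_facility_kw / it_equipment_kw

# Hypothetical facility: 6.0 MW total draw powering 5.0 MW of IT load.
print(round(pue(6000, 5000), 2))  # 1.2 -> "excellent" on the scale above
```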
Cooling Strategies for GPU Servers
- Air cooling: traditional CRAC/CRAH units — limited to ~30kW/rack for GPU servers
- Rear-door heat exchangers: capture hot exhaust at rack rear, up to 40kW/rack
- Liquid cooling (direct-to-chip): coolant directly to GPU cold plates — enables 100kW+ racks (DGX GB200 NVL72)
- Immersion cooling: servers submerged in dielectric fluid — most efficient
- H100 DGX: up to 10.2kW per server; GB200 NVL72: ~120kW per rack
Power & Rack Design
- Power density: AI GPU servers require 5-30kW per server vs. 1-3kW for CPU servers
- PDU (Power Distribution Unit): per-rack power metering, remote switching
- UPS: uninterruptible power supply — protects from outages during training runs
- Phase balancing: distribute 3-phase power evenly across PDUs
- Stranded power: available electrical capacity unused due to thermal limits
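A small Python illustration of how thermal limits strand power, using the ~30kW air-cooled rack budget and ~10.2kW DGX H100-class server figures from above:

```python
def servers_per_rack(server_kw, rack_limit_kw):
    """Servers that fit a rack's thermal/power budget (floor division),
    plus the stranded kW left because one more server would exceed it."""
    count = int(rack_limit_kw // server_kw)
    stranded_kw = rack_limit_kw - count * server_kw
    return count, stranded_kw

# Air-cooled rack (~30 kW budget) with 10.2 kW servers:
count, stranded = servers_per_rack(10.2, 30.0)
print(count, round(stranded, 1))  # 2 9.6  -> ~9.6 kW stranded per rack
```

This is why dense GPU deployments move to rear-door heat exchangers or liquid cooling: raising the rack's thermal budget recovers the stranded electrical capacity.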
"80 is the Warning Number"
GPU temperature at or above 80°C = thermal throttling kicks in. Clock speeds drop to protect the hardware. In the exam, 80°C is the threshold between "acceptable heavy load" and "investigate immediately."
MIG = Hardware Knife, vGPU = Time-Share
MIG physically slices the GPU into isolated pieces with dedicated VRAM and SMs — hardware-level isolation. vGPU time-shares one GPU across multiple VMs using software — no memory isolation. MIG = knife cutting. vGPU = rotation schedule.
"Gang Scheduling = All or Nothing"
Distributed training needs all GPUs to start simultaneously. Gang scheduling holds resources until the full allocation is available, then launches all at once. Partial starts cause deadlock — workers wait forever for each other.
DCGM-Exporter = GPU Metrics → Prometheus
Remember the pipeline: DCGM-Exporter (DaemonSet) → Prometheus → Grafana. DCGM-Exporter scrapes GPU metrics on every node and exposes them on /metrics endpoint. This is how Kubernetes clusters get per-GPU, per-pod observability at scale.
PUE: Lower = Better, 1.0 = Perfect
PUE = Total Facility Power ÷ IT Power. A PUE of 1.0 is physically impossible (some overhead always exists) but represents 100% efficiency. Hyperscalers achieve ~1.2. Traditional data centers average ~1.5. For AI: always target <1.3.
Slurm = HPC Tracks, K8s = Cloud Containers
Slurm is the native HPC job scheduler — sbatch for batch, srun for interactive, GRES for GPU requests. It's the dominant choice for supercomputers and on-prem HPC. Kubernetes is for containerized, cloud-native AI workloads. Run:ai extends Kubernetes with GPU-specific fairness and quota controls.
What command continuously shows GPU stats refreshed every 2 seconds, and what flag enables persistent mode?
Loop mode: nvidia-smi -l 2 (refresh every 2 seconds)
Persistence mode: nvidia-smi -pm 1 — keeps the driver loaded to eliminate initialization latency for subsequent GPU queries or workload launches.
Bonus: nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used --format=csv exports custom metrics in CSV.
How many MIG instances can an A100 support, and what is the smallest profile?
Up to 7 MIG instances per A100.
Smallest profile: 1g.5gb (A100 40GB) — 1 compute slice, 5GB VRAM; the A100 80GB equivalent is 1g.10gb.
Enable MIG:
nvidia-smi -i 0 -mig 1
Key advantage: hardware-level isolation — each instance has dedicated VRAM and compute, preventing noisy-neighbor interference.
What is the PUE formula, and what values represent excellent vs. poor efficiency?
Scale:
1.0 = theoretical perfect (impossible in practice)
1.2 = excellent (hyperscale data centers)
1.5 = industry average
2.0 = poor (50% of power is overhead)
AI target: <1.3 for GPU-dense facilities.
What is DCGM-Exporter and how does it enable GPU observability in Kubernetes?
Pipeline: DCGM-Exporter → Prometheus scrapes /metrics → Grafana visualizes dashboards
Metrics exposed: GPU utilization, memory usage, temperature, power, ECC errors, all labeled by GPU index and Kubernetes pod name.
What Slurm commands submit, view, and cancel jobs? How do you request 4 A100 GPUs?
Submit batch job: sbatch train.sh
Interactive run: srun python train.py
View queue: squeue
Cancel: scancel <jobid>
Node status: sinfo
Request 4 A100s: #SBATCH --gres=gpu:a100:4 (GRES = Generic RESources)
What is the key architectural difference between NVIDIA vGPU and MIG isolation?
vGPU = Software time-slicing: Multiple VMs take turns accessing the full GPU in time slots. Memory is pooled (not isolated). Requires a hypervisor and NVIDIA vGPU license.
Rule: Need isolation → MIG. Need multi-VM on any GPU → vGPU.
Why is gang scheduling essential for distributed AI training, and what happens without it?
Without gang scheduling: Some workers start early and wait for others that haven't been allocated yet. All workers block on collective communication operations (AllReduce) → deadlock or massive idle time.
Solution: Scheduler holds the full GPU set until available, then releases all at once.
At what temperature does GPU thermal throttling begin, and what does the GPU do to compensate?
What happens: The GPU automatically reduces its SM clock frequency and memory clock to decrease power consumption and generate less heat.
Effect on workload: Training throughput drops — fewer FLOPS/second until temperature stabilizes.
Root causes: Inadequate cooling, high ambient temperature, blocked airflow, dust accumulation, or sustained 100% GPU utilization in a dense rack.
🖥️ GPU Monitoring Strategy
- Know the difference: nvidia-smi = per-node CLI; DCGM = fleet-scale enterprise monitoring. Exam will test which tool fits which scale scenario.
- Memorize thermal thresholds: 75-80°C = acceptable heavy load, 80°C+ = throttling begins, 85°C+ = critical. Any question about "alarm" or "investigate" = 80°C+.
- DCGM-Exporter → Prometheus → Grafana is the canonical Kubernetes GPU observability stack. Know each component's role.
- GPU Utilization % (SM activity) ≠ GPU Memory Utilization %. They measure different things. Low GPU util with high memory util = memory-bound workload.
- XID errors in dmesg are NVIDIA driver-level errors. XID 79 = GPU has fallen off the bus. These surface in DCGM health checks, not nvidia-smi output.
Official & Recommended Resources
- NCA-AIIO Certification Page (NVIDIA) — official exam page: objectives, registration, study resources, and exam guide download.
- DCGM User Guide (NVIDIA Docs) — complete reference for the dcgmi CLI, health checks, diagnostics, and DCGM-Exporter deployment.
- MIG User Guide (NVIDIA Docs) — definitive guide to enabling MIG, creating instances, and the available profiles on A100 and H100.
- NVIDIA Virtual GPU (vGPU) User Guide — vGPU software architecture, profile types (vCompute, vWS, vPC, vApps), hypervisor configuration, and licensing.
- Slurm Workload Manager Documentation — official docs for sbatch, srun, squeue, GRES configuration, and scheduling policies.
- NVIDIA GPU Operator for Kubernetes — automates GPU driver, CUDA toolkit, and DCGM-Exporter deployment across Kubernetes GPU nodes.