
AI Operations: Monitoring, Orchestration & Virtualization

NCA-AIIO · AI Operations Domain · 22% of Exam

nvidia-smi · DCGM · GPU Metrics · Slurm · Kubernetes · Run:ai · MIG · vGPU · Data Center Management

AI Operations is the operational domain of the NCA-AIIO exam, covering 22% of all 50 questions. It tests your ability to monitor GPU health with nvidia-smi and DCGM, orchestrate AI workloads using Slurm and Kubernetes, and leverage GPU virtualization technologies like MIG and vGPU — all essential skills for running AI infrastructure at scale.

What You'll Master

nvidia-smi — GPU Monitoring CLI

The primary tool for GPU status, utilization, temperature, power draw, running processes, and driver/CUDA version. nvidia-smi -l 1 for continuous monitoring; --query-gpu for custom metric export.

DCGM — Enterprise GPU Health

Data Center GPU Manager provides enterprise-grade monitoring, diagnostics, and health checks across entire GPU fleets. Integrates with Prometheus/Grafana via DCGM-Exporter for Kubernetes-native observability.

Critical GPU Metrics

GPU Utilization % (SM activity), Memory Utilization %, Temperature (°C), Power Draw vs. TDP, SM clock speed, and NVLink/PCIe bandwidth — know what each metric measures and its normal vs. alarm thresholds.

Cluster Orchestration Frameworks

Three key platforms: Slurm (HPC-native, sbatch/srun), Kubernetes + GPU Operator (containerized workloads), and Run:ai (advanced GPU quota management and topology-aware scheduling).

GPU Virtualization: MIG

Multi-Instance GPU provides hardware-level isolation — A100 supports up to 7 MIG instances with dedicated VRAM and compute. Ideal for inference serving and multi-tenant environments.

GPU Virtualization: vGPU

NVIDIA Virtual GPU uses software time-slicing to share one physical GPU across multiple VMs. Requires hypervisor support (VMware, Citrix). Profiles: vCompute, vWS, vApps for different use cases.

Domain Weight & Exam Focus

Sub-Topic Area | Exam Emphasis | Key Concepts to Know
GPU Monitoring (nvidia-smi & DCGM) | High | Commands, metrics, thresholds, DCGM-Exporter
Cluster Orchestration (Slurm, K8s, Run:ai) | High | Job submission, GPU allocation, scheduling policies
GPU Virtualization (MIG, vGPU) | High | Hardware vs. software isolation, profiles, use cases
Data Center Management | Medium | PUE, cooling, power density, rack design
Job Scheduling Concepts | Medium | Gang scheduling, preemption, backfill, queues

nvidia-smi: GPU Monitoring CLI

Core nvidia-smi Commands

  • nvidia-smi — snapshot: driver, CUDA, GPU name, temp, power, utilization
  • nvidia-smi -l 1 — live refresh every 1 second
  • nvidia-smi --query-gpu=name,utilization.gpu,memory.used --format=csv — custom CSV export (see the logging sketch after this list)
  • nvidia-smi --query-compute-apps=pid,name,used_memory --format=csv — running GPU processes
  • nvidia-smi topo -m — GPU-to-GPU and GPU-to-CPU topology (NVLink, PCIe)
  • nvidia-smi -pm 1 — enable persistent mode (reduces driver load latency)
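
A minimal logging sketch built from the flags above (the 5-second interval and output path are arbitrary choices; all queryable fields can be listed with nvidia-smi --help-query-gpu):

```bash
# Minimal sketch: sample a custom metric set every 5 seconds and write it to a CSV file.
# Interval and path are examples only; see `nvidia-smi --help-query-gpu` for available fields.
nvidia-smi \
  --query-gpu=timestamp,index,name,utilization.gpu,memory.used,temperature.gpu,power.draw \
  --format=csv -l 5 -f /var/log/gpu_metrics.csv
```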

Key nvidia-smi Output Fields

  • GPU Util %: percentage of time SMs are busy (compute activity)
  • Mem-Usage / Total: VRAM used vs. total capacity in MiB
  • Temp: GPU die temperature in °C
  • Power/Limit: current draw vs. TDP (Thermal Design Power)
  • Perf: performance state — P0 (max) to P12 (min)
  • SM Clock / Mem Clock: operating frequencies in MHz

GPU Temperature Thresholds

  • 30-60°C: idle or light workload — normal
  • 60-75°C: moderate workload — normal
  • 75-83°C: heavy workload — acceptable but monitor
  • 80°C+: thermal throttling begins — clocks reduce to protect hardware (a watcher sketch follows this list)
  • 85°C+: critical — investigate cooling, airflow, ambient temp
  • Shutdown threshold: typically 90-95°C to prevent hardware damage
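
A minimal watcher sketch that applies the 80°C threshold above (the threshold value and message are illustrative, not an official tool):

```bash
#!/usr/bin/env bash
# Minimal sketch: flag any GPU at or above the 80°C throttle threshold.
# Queryable fields can be listed with `nvidia-smi --help-query-gpu`.
THRESHOLD=80
nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits |
while IFS=', ' read -r idx temp; do
  if [ "$temp" -ge "$THRESHOLD" ]; then
    echo "WARNING: GPU $idx at ${temp}°C (throttling likely; check cooling, airflow, ambient temperature)"
  fi
done
```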

DCGM: Data Center GPU Manager

DCGM Overview

DCGM is NVIDIA's enterprise tool for managing GPU health and performance at scale across entire data center fleets. It provides: continuous health monitoring, active diagnostics, field watches, and REST API access. Available as open-source on GitHub.

dcgmi CLI Commands

  • dcgmi discovery -l — list all discovered GPUs
  • dcgmi health -g 0 -c — check health of GPU group 0
  • dcgmi diag -r 1 — run active diagnostics at level 1 (quick); levels 2-4 run progressively longer, more thorough tests (see the sketch after this list)
  • dcgmi stats -e -g 0 — enable stats collection for group
  • dcgmi dmon -e 203,204 — monitor specific field IDs continuously
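
A quick health-pass sketch chaining these commands (group 0 is assumed to exist; list groups with dcgmi group -l, and confirm flag spellings with dcgmi health --help since they vary slightly across DCGM versions):

```bash
# Sketch of a quick health pass with dcgmi on an assumed GPU group 0.
dcgmi discovery -l        # list GPUs visible to the DCGM host engine
dcgmi health -g 0 -s a    # enable all background health watches on group 0
dcgmi health -g 0 -c      # evaluate the watches and report pass/warn/fail
dcgmi diag -r 1           # short active diagnostic (level 1); higher levels run deeper tests
```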

DCGM-Exporter for Kubernetes

DCGM-Exporter runs as a DaemonSet on every GPU node, exposing GPU metrics in Prometheus format at /metrics. Enables GPU observability in Grafana dashboards. Metrics include GPU utilization, memory, temperature, power, and ECC errors — all labeled per-GPU and per-pod.
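
A quick verification sketch, assuming the exporter was installed by the GPU Operator (the namespace, label, and port follow common GPU Operator defaults and may differ in your cluster):

```bash
# Sketch: confirm DCGM-Exporter is serving Prometheus metrics on a GPU node.
kubectl -n gpu-operator get pods -l app=nvidia-dcgm-exporter
POD=$(kubectl -n gpu-operator get pods -l app=nvidia-dcgm-exporter -o name | head -n 1)
kubectl -n gpu-operator port-forward "$POD" 9400:9400 &
curl -s localhost:9400/metrics | grep -m 5 DCGM_FI_DEV_GPU_UTIL
# The same series can then be queried in Prometheus/Grafana, e.g. avg by (gpu) (DCGM_FI_DEV_GPU_UTIL).
```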

DCGM Health Checks

  • PCIe health: bandwidth and error rate on the PCIe bus
  • NVLink health: inter-GPU link status and bandwidth
  • Memory health: ECC error counts (SBE = correctable, DBE = uncorrectable)
  • SM stress test: sustained compute stress to surface thermal/power issues
  • XID errors: NVIDIA driver-level errors logged to dmesg (e.g., XID 48 = double-bit ECC error, XID 79 = GPU has fallen off the bus)

AI Cluster Orchestration

Slurm: HPC Workload Manager

  • sbatch job.sh — submit a batch job (returns Job ID)
  • srun python train.py — run interactive/parallel job
  • squeue — view job queue and status
  • scancel <jobid> — cancel a job
  • sinfo — view node/partition status
  • GRES (Generic Resource Scheduling): #SBATCH --gres=gpu:a100:4 — request 4 A100 GPUs (see the batch script sketch after this list)
  • Native to HPC; used in supercomputers and on-prem clusters
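
Putting these pieces together, a minimal batch script sketch might look like this (job name, partition, and training script are placeholders; adapt to your cluster's configuration):

```bash
#!/bin/bash
# Minimal Slurm batch script sketch; submit with `sbatch job.sh`, monitor with `squeue -u $USER`.
#SBATCH --job-name=llm-train
#SBATCH --partition=gpu              # partition name varies by cluster
#SBATCH --nodes=1
#SBATCH --gres=gpu:a100:4            # request 4 A100 GPUs via GRES
#SBATCH --cpus-per-task=16
#SBATCH --time=04:00:00              # walltime limit (helps prevent queue starvation)
#SBATCH --output=%x-%j.out           # <job-name>-<job-id>.out

srun python train.py
```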

Kubernetes + GPU Operator

  • GPU Operator: automates driver, CUDA toolkit, DCGM, and device plugin installation on GPU nodes
  • Device Plugin: exposes nvidia.com/gpu as a schedulable Kubernetes resource
  • GPU request in a pod spec: set resources.limits with nvidia.com/gpu: 2 (see the manifest sketch after this list)
  • Nodes labeled with GPU type, memory, and MIG config for affinity scheduling
  • Used for containerized, cloud-native AI workloads and inference serving
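
A minimal manifest sketch for the request above (pod name and container image are placeholders; nvidia.com/gpu is the resource the device plugin advertises):

```bash
# Sketch: run nvidia-smi in a pod that requests 2 GPUs from the device plugin.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder image tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 2        # whole GPUs; fractional sharing needs MIG or Run:ai
EOF
```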

Run:ai — Advanced GPU Scheduling

  • Kubernetes-native GPU orchestration with fairness and quota enforcement
  • Fractional GPU: allocate 0.5 GPU to a workload (time-sharing)
  • Quota management: teams/projects get GPU quota; over-quota preemption
  • Topology-aware scheduling: places jobs on nodes with NVLink connectivity
  • Gang scheduling: all pods start simultaneously or none do
  • Dashboard shows GPU utilization per team, job, and node

Job Scheduling Concepts

  • Gang scheduling: all GPUs allocated simultaneously (critical for distributed training — partial start = deadlock)
  • Preemption: higher-priority job displaces lower-priority (checkpointing required)
  • Backfill scheduling: small jobs fill gaps while large job waits for full allocation
  • Priority queues: production > interactive > batch > best-effort
  • Time limits: max walltime per job prevents queue starvation

GPU Virtualization: MIG vs. vGPU

Feature | MIG (Multi-Instance GPU) | vGPU (Virtual GPU)
Isolation type | Hardware-level (dedicated SM partitions + VRAM) | Software time-slicing
Supported GPUs | A100, H100, A30 | Many NVIDIA GPUs (T4, A10, A100, etc.)
Max instances | Up to 7 on A100 80GB | Many VMs (limited by time-slice)
Memory isolation | Yes — dedicated, hardware-partitioned VRAM per instance | Per-VM frame buffer assigned in software; no hardware partitioning
Use cases | Multi-tenant inference, test/dev, small training | VDI, vCompute, remote workstations, mixed VM workloads
Hypervisor required | No (bare metal) | Yes (VMware vSphere, Citrix, KVM)
License required | No (hardware feature) | Yes (NVIDIA vGPU software license)

MIG Profiles (A100)

  • 7g.40gb: one instance spanning the full GPU (all 7 SM partitions, 40GB VRAM on the A100 40GB; no partitioning benefit)
  • 4g.20gb: 4 SM partitions, 20GB VRAM
  • 3g.20gb: 3 SM partitions, 20GB VRAM
  • 2g.10gb: 2 SM partitions, 10GB VRAM
  • 1g.5gb: smallest — 1 SM partition, 5GB VRAM; up to 7 on one A100
  • On the A100 80GB the same profile shapes carry double the memory (1g.10gb, 2g.20gb, 3g.40gb, 7g.80gb, etc.); see the provisioning sketch after this list
  • Enable MIG: nvidia-smi -i 0 -mig 1
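
A minimal provisioning sketch, assuming an idle A100 (profile IDs differ by GPU model, so always list them first):

```bash
# Sketch: enable MIG on GPU 0 and create instances (run as root; the GPU must be idle,
# and a reset may be required for the MIG mode change to take effect).
nvidia-smi -i 0 -mig 1            # enable MIG mode on GPU 0
nvidia-smi mig -lgip              # list supported GPU instance profiles and their numeric IDs
nvidia-smi mig -i 0 -cgi 9,9 -C   # example: create two 3g.20gb instances (ID 9 on an A100 40GB);
                                  # -C also creates the default compute instance inside each
nvidia-smi mig -lgi               # verify the GPU instances now exist
```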

vGPU Profile Types

  • vCompute (C-series): CUDA compute workloads on VMs, no display output
  • vWS (Quadro vWS): workstation graphics, CAD, creative tools
  • vPC (B-series) / vApps (A-series): virtual desktops and published applications (VDI) — lighter GPU acceleration
  • Profile name format: A10-4C = A10 GPU, 4GB frame buffer, C-series (compute) profile
  • Time-sliced: each VM gets GPU access in round-robin bursts

Data Center Management Essentials

Power Usage Effectiveness (PUE)

PUE = Total Facility Power ÷ IT Equipment Power

  • PUE = 1.0: ideal — 100% power goes to IT, zero overhead
  • PUE = 1.2: excellent (hyperscale data centers)
  • PUE = 1.5: industry average
  • PUE = 2.0: poor — half of all power is overhead (cooling, lighting, etc.)
  • AI data centers target PUE < 1.3
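
As a worked example with hypothetical numbers: a facility drawing 1,320 kW in total while its IT equipment draws 1,100 kW has PUE = 1,320 ÷ 1,100 = 1.2, meaning the facility spends an extra 0.2 W on cooling, power conversion, and other overhead for every watt delivered to IT equipment.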

Cooling Strategies for GPU Servers

  • Air cooling: traditional CRAC/CRAH units — limited to ~30kW/rack for GPU servers
  • Rear-door heat exchangers: capture hot exhaust at rack rear, up to 40kW/rack
  • Liquid cooling (direct-to-chip): coolant directly to GPU cold plates — enables 100kW+ racks (DGX GB200 NVL72)
  • Immersion cooling: servers submerged in dielectric fluid — most efficient
  • H100 DGX: up to 10.2kW per server; GB200 NVL72: ~120kW per rack

Power & Rack Design

  • Power density: AI GPU servers require 5-30kW per server vs. 1-3kW for CPU servers
  • PDU (Power Distribution Unit): per-rack power metering, remote switching
  • UPS: uninterruptible power supply — protects from outages during training runs
  • Phase balancing: distribute 3-phase power evenly across PDUs
  • Stranded power: available electrical capacity unused due to thermal limits

Six memory hooks to help you remember the critical AI Operations distinctions for the NCA-AIIO exam.
🌡️

"80 is the Warning Number"

GPU temperature at or above 80°C = thermal throttling kicks in. Clock speeds drop to protect the hardware. In the exam, 80°C is the threshold between "acceptable heavy load" and "investigate immediately."

🔪

MIG = Hardware Knife, vGPU = Time-Share

MIG physically slices the GPU into isolated pieces with dedicated VRAM and SMs — hardware-level isolation. vGPU time-shares one GPU's compute across multiple VMs in software — no hardware-level partitioning. MIG = knife cutting. vGPU = rotation schedule.

🎲

"Gang Scheduling = All or Nothing"

Distributed training needs all GPUs to start simultaneously. Gang scheduling holds resources until the full allocation is available, then launches all at once. Partial starts cause deadlock — workers wait forever for each other.

📡

DCGM-Exporter = GPU Metrics → Prometheus

Remember the pipeline: DCGM-Exporter (DaemonSet) → Prometheus → Grafana. DCGM-Exporter scrapes GPU metrics on every node and exposes them on /metrics endpoint. This is how Kubernetes clusters get per-GPU, per-pod observability at scale.

PUE: Lower = Better, 1.0 = Perfect

PUE = Total Facility Power ÷ IT Power. A PUE of 1.0 is physically impossible (some overhead always exists) but represents 100% efficiency. Hyperscalers achieve ~1.2. Traditional data centers average ~1.5. For AI: always target <1.3.

🛤️

Slurm = HPC Tracks, K8s = Cloud Containers

Slurm is the native HPC job scheduler — sbatch for batch, srun for interactive, GRES for GPU requests. It's the dominant choice for supercomputers and on-prem HPC. Kubernetes is for containerized, cloud-native AI workloads. Run:ai extends Kubernetes with GPU-specific fairness and quota controls.

10 exam-style questions on AI Operations for self-review.
Question 1 of 10
Which NVIDIA tool provides enterprise-level GPU health monitoring, diagnostics, and Prometheus integration for data center GPU fleets?
Question 2 of 10
What does MIG (Multi-Instance GPU) allow on supported NVIDIA GPUs like the A100?
Question 3 of 10
Which GPU metric indicates how frequently the GPU compute engines (Streaming Multiprocessors) are active over a sample period?
Question 4 of 10
A data center achieves a Power Usage Effectiveness (PUE) of 1.0. What does this mean?
Question 5 of 10
Which job scheduler is most commonly used in traditional HPC clusters and supports GPU allocation via GRES (Generic Resource Scheduling)?
Question 6 of 10
What is "gang scheduling" in the context of distributed AI training cluster management?
Question 7 of 10
How does NVIDIA vGPU differ from MIG in terms of resource isolation?
Question 8 of 10
Which nvidia-smi command would you use to continuously monitor GPU temperature, utilization, and power draw with automatic refresh every 1 second?
Question 9 of 10
The DCGM-Exporter is deployed as a Kubernetes DaemonSet. What is its primary function?
Question 10 of 10
A GPU in your cluster shows a temperature of 83°C under sustained training load. What is the most likely system behavior?


8 flashcards covering core AI Operations concepts, each pairing a prompt with its answer.


nvidia-smi

What command continuously shows GPU stats refreshed every 2 seconds, and what flag enables persistent mode?

Continuous monitoring: nvidia-smi -l 2 (loop every 2 seconds)

Persistent mode: nvidia-smi -pm 1 — keeps driver loaded to eliminate initialization latency for subsequent GPU queries or workload launches.

Bonus: nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used --format=csv exports custom metrics in CSV.
MIG

How many MIG instances can an A100 80GB support, and what is the smallest profile?

Max instances: Up to 7 MIG instances on a single A100 80GB GPU.

Smallest profile: 1g.10gb — 1 SM partition, 10GB VRAM on the A100 80GB (the A100 40GB equivalent is 1g.5gb).

Enable MIG: nvidia-smi -i 0 -mig 1
Key advantage: Hardware-level isolation — each instance has dedicated VRAM and compute, preventing noisy-neighbor interference.
PUE

What is the PUE formula, and what values represent excellent vs. poor efficiency?

Formula: PUE = Total Facility Power ÷ IT Equipment Power

Scale:
1.0 = theoretical perfect (impossible in practice)
1.2 = excellent (hyperscale data centers)
1.5 = industry average
2.0 = poor (50% of power is overhead)

AI target: <1.3 for GPU-dense facilities.
DCGM

What is DCGM-Exporter and how does it enable GPU observability in Kubernetes?

DCGM-Exporter is a containerized daemon (DaemonSet) that runs on every GPU node and exposes GPU metrics at a /metrics HTTP endpoint in Prometheus format.

Pipeline: DCGM-Exporter → Prometheus scrapes /metrics → Grafana visualizes dashboards

Metrics exposed: GPU utilization, memory usage, temperature, power, ECC errors, all labeled by GPU index and Kubernetes pod name.
Slurm

What Slurm commands submit, view, and cancel jobs? How do you request 4 A100 GPUs?

Submit: sbatch train.sh
Interactive run: srun python train.py
View queue: squeue
Cancel: scancel <jobid>
Node status: sinfo

Request 4 A100s:
#SBATCH --gres=gpu:a100:4
(GRES = Generic Resource Scheduling)
vGPU vs MIG

What is the key architectural difference between NVIDIA vGPU and MIG isolation?

MIG = Hardware isolation: The A100/H100 physically partitions SMs and VRAM into separate instances. Each MIG instance has dedicated compute and memory — strict isolation, no sharing.

vGPU = Software time-slicing: Multiple VMs take turns on the GPU's compute engines in time slots. Each VM's frame buffer is assigned by the vGPU software rather than partitioned in hardware. Requires a hypervisor and NVIDIA vGPU license.

Rule: Need isolation → MIG. Need multi-VM on any GPU → vGPU.
Gang Scheduling

Why is gang scheduling essential for distributed AI training, and what happens without it?

Gang scheduling allocates all required GPUs simultaneously, so all distributed training workers start at the same time.

Without gang scheduling: Some workers start early and wait for others that haven't been allocated yet. All workers block on collective communication operations (AllReduce) → deadlock or massive idle time.

Solution: Scheduler holds the full GPU set until available, then releases all at once.
GPU Thermal Throttling

At what temperature does GPU thermal throttling begin, and what does the GPU do to compensate?

Throttling threshold: Approximately 80-83°C (varies by GPU model).

What happens: The GPU automatically reduces its SM clock frequency and memory clock to decrease power consumption and generate less heat.

Effect on workload: Training throughput drops — fewer FLOPS/second until temperature stabilizes.

Root causes: Inadequate cooling, high ambient temperature, blocked airflow, dust accumulation, or sustained 100% GPU utilization in a dense rack.
Targeted exam strategies for AI Operations focus areas.

🖥️ GPU Monitoring Strategy

  • Know the difference: nvidia-smi = per-node CLI; DCGM = fleet-scale enterprise monitoring. Exam will test which tool fits which scale scenario.
  • Memorize thermal thresholds: 75°C = normal heavy load, 80°C+ = throttling begins, 85°C+ = critical. Any question about "alarm" or "investigate" = 80°C+.
  • DCGM-Exporter → Prometheus → Grafana is the canonical Kubernetes GPU observability stack. Know each component's role.
  • GPU Utilization % (SM activity) ≠ GPU Memory Utilization %. They measure different things. Low GPU util with high memory util = memory-bound workload.
  • XID errors in dmesg are NVIDIA driver-level errors (e.g., XID 48 = double-bit ECC error, XID 79 = GPU has fallen off the bus). These surface in DCGM health checks, not in standard nvidia-smi output.

