Kubernetes GPU scheduling, MIG & MPS, DCGM metrics, UFM fabric management, NCCL AllReduce tuning, and fault detection — the complete AI cluster operations stack.
Schedule GPU workloads efficiently, partition GPUs for multi-tenant use, and manage cluster-wide resources.
Collect, store, and visualize GPU and network metrics so problems are caught before jobs fail.
Identify GPU hardware errors, network link failures, and NCCL hangs before they corrupt training runs.
The network is only valuable if the cluster actually runs jobs correctly. NCP-AIN tests that you understand how workloads are scheduled onto the fabric, how GPU and network health are monitored together, and how communication libraries like NCCL interact with the IB and Ethernet infrastructure covered in Topics 1–4.
Kubernetes doesn't natively understand GPUs. The NVIDIA GPU Device Plugin runs as a DaemonSet on every GPU node and advertises GPU resources to the kubelet so pods can request them.
A pod requests a whole GPU with the resource limit nvidia.com/gpu: 1.

| Key Kubernetes GPU Concepts | Details |
|---|---|
| Resource name | nvidia.com/gpu — integer, whole GPU units |
| DaemonSet | GPU device plugin runs on every GPU node |
| Node labels | accelerator=nvidia-tesla-h100 for node selection |
| Topology hints | NUMA-aware scheduling via Topology Manager |
| MIG resources | nvidia.com/mig-1g.10gb per MIG slice profile |
| Time-slicing | ConfigMap overrides — logical GPUs > physical GPUs |
| Health checks | DCGM health check integration via device plugin |
| NFD | Node Feature Discovery — labels nodes with GPU model, driver, CUDA ver |
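As a minimal sketch of how a pod consumes these resources (pod name and image tag are placeholders; the commented MIG limit assumes a MIG-enabled node whose device plugin advertises per-profile resources):

```bash
# Hypothetical pod requesting one whole GPU via the device plugin's extended resource.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                  # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # example image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1               # or nvidia.com/mig-1g.10gb: 1 on MIG nodes
EOF
```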
| Feature | MIG — Multi-Instance GPU | MPS — Multi-Process Service | Time-Slicing |
|---|---|---|---|
| Isolation | Hardware — dedicated SM + VRAM | Software context sharing | Temporal — one at a time |
| Memory isolation | Fully isolated VRAM partitions | Shared (fault = all die) | Shared (no isolation) |
| Fault isolation | One slice crash ≠ others | One crash kills all clients | No fault isolation |
| Concurrency | True simultaneous execution | True simultaneous (shared SMs) | Sequential, not concurrent |
| GPU support | A100, H100, A30 (Ampere+) | All CUDA GPUs | All CUDA GPUs |
| Best for | Multi-tenant inference, CI/CD | Many small cooperative jobs | Dev/test only — oversubscription |
| K8s resource | nvidia.com/mig-* | Enabled via CUDA_MPS_PIPE_DIR | ConfigMap device plugin |
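The time-slicing row above is driven by a device plugin sharing config. A minimal sketch, assuming a GPU Operator deployment (ConfigMap name, namespace, and the replica count of 4 are arbitrary choices for illustration):

```bash
# Hypothetical sharing config: each physical GPU is advertised as 4 logical
# nvidia.com/gpu resources. No memory or fault isolation between tenants.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config             # placeholder name
  namespace: gpu-operator               # assumes GPU Operator namespace
data:
  any: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
EOF
```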
nvidia.com/mig-1g.10gb = 1 GPU instance slice (1/7th of H100), 10 GB VRAM. An H100 80GB can be sliced into 7 × 1g.10gb, or other combinations. MIG must be enabled on the GPU before any workloads run — it requires a GPU reset to toggle.
Use --gres=gpu:h100:8 to request 8 H100s per node.

```bash
# Submit a multi-GPU training job
sbatch --nodes=4 --ntasks-per-node=8 \
  --gres=gpu:h100:8 --partition=gpu \
  train.sh

# Check job queue
squeue -u $USER --format="%.10i %.9P %.20j %.8T %.10M"

# Check GPU node availability
sinfo -p gpu --format="%N %G %t %m"

# Interactive GPU session
srun --gres=gpu:1 --pty bash
```
Intercepts container creation and injects GPU device files, NVIDIA libraries, and CUDA into the container's mount namespace — without the container image needing to bundle drivers. Required for both Docker (--gpus all) and Kubernetes (set runtimeClassName: nvidia). The toolkit maps NVIDIA_VISIBLE_DEVICES env var to specific GPU device files.
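A quick sketch of the toolkit in action (the CUDA image tag is just an example):

```bash
# All GPUs are injected into the container; the toolkit mounts /dev/nvidia*
# device files and the host driver libraries at container start.
docker run --rm --gpus all nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

# Limit the container to specific GPUs (mapped via NVIDIA_VISIBLE_DEVICES).
docker run --rm --gpus '"device=0,1"' nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```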
How does a Kubernetes pod request a GPU?
Resource request: nvidia.com/gpu: 1
The GPU Device Plugin DaemonSet advertises these resources to kubelet. For MIG slices: nvidia.com/mig-1g.10gb
What is the key advantage of MIG over MPS?
Hardware fault isolation
MIG gives each tenant dedicated SMs and VRAM — a fault in one slice cannot affect another. MPS shares the hardware context, so one crash kills all co-tenants.
What does NVIDIA Container Toolkit do?
Injects GPU into containers at runtime
Intercepts container creation, mounts GPU device files + NVIDIA libs into the namespace. Container images don't need GPU drivers — only CUDA runtime.
What is GRES in Slurm?
Generic Resource Scheduling
Slurm's mechanism for tracking non-CPU/RAM resources like GPUs. --gres=gpu:h100:8 requests 8 H100 GPUs. Slurm tracks GRES usage per node and enforces limits.
DCGM is the centralized daemon that collects GPU metrics, runs health diagnostics, and configures policies on all GPUs in a node. It communicates with GPUs via NVML (NVIDIA Management Library).
```bash
# List all GPUs managed by DCGM
dcgmi discovery -l

# Run health check on all GPUs (short = ~30s)
dcgmi health -g 0 -c          # group 0 = all GPUs
dcgmi diag -g 0 -r 1          # r1=quick, r2=medium, r3=long (burn-in)

# Watch live GPU metrics
dcgmi dmon -e 203,204,1002,1003,1004   # SM util, mem util, temp, power, fan

# Query specific fields
dcgmi group -l                # list groups
dcgmi stats -g 0 -e 1001      # ECC double-bit errors

# DCGM Exporter (runs as sidecar container)
docker run -d --gpus all --rm \
  -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:latest

# Prometheus scrapes: http://<node>:9400/metrics
```
| Feature | UFM | OpenSM |
|---|---|---|
| GUI | ✅ Web UI | ❌ CLI only |
| Telemetry | Rich per-port metrics | Basic |
| REST API | ✅ Full API | ❌ |
| Scale | 10,000+ nodes | Smaller fabrics |
| SHARP mgmt | Integrated | Separate |
| License | Commercial | Open source |
The full monitoring stack, from GPU to dashboards:

| Layer | Component & Role |
|---|---|
| GPU Layer | DCGM — GPU metrics, health diagnostics, policy enforcement. Source of truth for GPU health. |
| Export | DCGM Exporter — sidecar container; exposes DCGM metrics as a Prometheus /metrics endpoint on port 9400. |
| Network Layer | UFM — InfiniBand fabric telemetry: port BW, error counters, topology, events. |
| Storage | Prometheus — scrapes DCGM Exporter + node-exporter; stores time-series metrics; fires alerts. |
| Visualization | Dashboards (typically Grafana) querying Prometheus. NVIDIA provides pre-built DCGM dashboard templates. |
| Tracing | Distributed tracing for AI serving pipelines — trace inference request latency end-to-end. |
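A minimal Prometheus config sketch for the Export layer (job name and node addresses are placeholders; a Kubernetes deployment would normally use service discovery instead of static targets):

```bash
# Standalone prometheus.yml scraping DCGM Exporter on two GPU nodes every 15s.
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: dcgm-exporter            # placeholder job name
    static_configs:
      - targets:
          - gpu-node-01:9400           # example GPU node addresses
          - gpu-node-02:9400
EOF
```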
NCCL (NVIDIA Collective Communications Library) provides GPU-optimized implementations of collective operations — AllReduce, AllGather, ReduceScatter, Broadcast — used by every major deep learning framework (PyTorch, TensorFlow, JAX). NCCL automatically detects the network topology (NVLink, IB, RoCEv2) and selects the fastest algorithm.
Ring: GPUs form a logical ring. Each GPU sends a chunk to its neighbor, receiving and accumulating partial results, until all GPUs hold the full sum. Bandwidth-optimal — per-GPU traffic stays nearly constant as N grows, while the step count (latency) scales linearly with N.
Tree: binary tree reduction — GPUs reduce upward to the root, then broadcast downward. Logarithmic latency scaling. Used for small messages or very large GPU counts.
CollNet (SHARP): delegates AllReduce to SHARP-capable switches (Quantum-2 / Spectrum-4). Aggregation happens in the network — GPUs send and receive only the final results.
| Protocol | Description |
|---|---|
| LL (Low Latency) | Inline flag in data — lowest latency, limited BW. Small ops. |
| LL128 | 128-byte granularity LL. Better BW than LL, still low latency. |
| Simple | Bulk transfer — maximum bandwidth. Large ops (>1 MB). Default for IB. |
NCCL automatically detects intra-node topology (NVLink, PCIe) and inter-node fabric (IB, RoCEv2) on startup. Enable verbose output to see what it found:
| Variable | Purpose & Key Values |
|---|---|
| NCCL_DEBUG | Logging level. INFO = topology + algo. TRACE = per-op verbose. Start here when NCCL hangs. |
| NCCL_DEBUG_SUBSYS | Filter by subsystem: ALL, INIT, GRAPH, TUNING, NET, IB. Use NET,IB for network debugging. |
| NCCL_IB_HCA | Comma-list of InfiniBand HCAs to use. E.g. mlx5_0,mlx5_2. Critical on multi-NIC nodes. |
| NCCL_IB_GID_INDEX | GID index for RoCEv2. Always set to 3 for RoCEv2; 0 = RoCEv1/IB native. |
| NCCL_SOCKET_IFNAME | Ethernet interface for non-RDMA transport fallback. E.g. eth0. |
| NCCL_IB_DISABLE | 0 = use IB/RoCEv2 (default). 1 = force TCP fallback — useful for debugging. |
| NCCL_ALGO | Force algorithm: Ring, Tree, CollNet (SHARP). Default = auto-selected. |
| NCCL_PROTO | Force protocol: LL, LL128, Simple. Default = auto-selected per message size. |
| NCCL_TOPO_FILE | Path to XML topology file — override auto-detection. Useful for non-standard topologies. |
| NCCL_CROSS_NIC | 0 = same NIC for send/recv (default). 1 = allow different NICs for ring traversal. |
| SHARP_COLL_ENABLE_SAT | 1 = enable SHARP AllReduce via NCCL CollNet plugin. |
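As a sketch of how these variables are typically applied to a multi-node launch (the torchrun invocation, HCA names, interface name, and train.py are illustrative assumptions):

```bash
# Verbose topology/algorithm logging plus explicit NIC selection for a
# hypothetical 4-node x 8-GPU PyTorch job on a RoCEv2 fabric.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
export NCCL_IB_HCA=mlx5_0,mlx5_2,mlx5_4,mlx5_6   # example HCAs on this node
export NCCL_IB_GID_INDEX=3                        # RoCEv2 GID index
export NCCL_SOCKET_IFNAME=eth0                    # bootstrap / TCP fallback interface

torchrun --nnodes=4 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=head-node:29500 \
  train.py                                        # placeholder training script
```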
When a training job hangs and all GPUs show 0% utilization: (1) Set NCCL_DEBUG=INFO and rerun — check if all ranks initialized successfully; (2) Check IB port errors with perfquery — credit starvation on VL0 causes deadlocks; (3) Verify NCCL_IB_GID_INDEX=3 on RoCEv2 fabrics; (4) Check for firewall rules blocking UDP 4791 (RoCEv2); (5) As a diagnostic, set NCCL_IB_DISABLE=1 — if training resumes over TCP, the RDMA path is the problem.
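A compact sketch of steps (2) and (5) from that checklist (LID, port, and the launch command are placeholders):

```bash
# Step 2: check a suspect switch port for link errors / credit-starvation symptoms.
perfquery -x <lid> <port> | grep -E 'SymbolError|PortRcvErrors|PortXmitDiscards'

# Step 5: rule the RDMA path in or out by forcing the TCP transport.
# If the job now makes (slow) progress, the IB/RoCEv2 path is the problem.
NCCL_IB_DISABLE=1 torchrun --nnodes=4 --nproc_per_node=8 train.py
```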
Xid errors appear in dmesg, nvidia-smi, and DCGM. They are the GPU driver's error reporting mechanism — every Xid has a specific meaning. These are the ones tested on NCP-AIN:
| Xid | Name | Severity | Meaning & Action |
|---|---|---|---|
| Xid 8 | GPU MMU fault | Critical | Memory management unit error — usually bad CUDA code (illegal access). Check application, may need GPU reset. |
| Xid 13 | Graphics engine exception | Warning | Graphics engine hit an exception. Common with compute+graphics mixed workloads. |
| Xid 31 | GPU memory page fault | Critical | GPU accessed unmapped memory — typically application bug. Check CUDA memcheck. |
| Xid 38 | Driver firmware error | Critical | Unrecoverable firmware fault. Requires GPU reset or node reboot. Investigate hardware. |
| Xid 43 | GPU stopped processing | Critical | GPU hung — no progress for watchdog timeout. Reset required. May indicate thermal or HW issue. |
| Xid 48 | Double-bit ECC error | Critical | Uncorrectable memory error (DBE). GPU VRAM defect. Node must be drained and GPU replaced. |
| Xid 63 | Row remapping pending | Warning | GPU remapping a failing memory row (single-bit ECC). GPU still operational. Monitor for Xid 48 progression. |
| Xid 79 | GPU has fallen off the bus | Critical | PCIe link lost. Check seating, PCIe slot, motherboard. Node reboot required. |
| Xid 92 | High single-bit ECC count | Warning | High volume of correctable ECC errors — early warning of memory degradation. Schedule GPU replacement. |
A double-bit ECC (DBE) error is uncorrectable — the GPU cannot recover the corrupted data. Any node showing Xid 48 must be drained immediately: kubectl drain <node> or scontrol update NodeName=<node> State=DRAIN. The GPU should be physically replaced, not just reset.
Level 1 (r1) — checks: GPU init, PCIe bandwidth, memory bandwidth, power, clock. Run before every job. Catches dead GPUs, PCIe issues, power limit problems.
Level 2 (r2) — all Level 1 checks plus: memory stress test, P2P bandwidth, NVLink checks. Run daily or after hardware changes. Good for post-maintenance validation.
Level 3 (r3) — full burn-in: all Level 2 plus extended memory, compute, PCIe, NVLink stress. Run on new hardware commissioning or after a suspected hardware fault.
- kubectl drain <node> --ignore-daemonsets — take the suspect node out of scheduling
- nvidia-smi -r — attempt GPU reset
- dcgmi diag -r 3 — full burn-in diagnostic on the drained node
- NCCL_DEBUG=INFO — check which rank failed to initialize
- ibstat on failed nodes — check port state (must be Active)
- perfquery — look for rising SymbolErrors or PortRcvErrors
- ibdiagnet -r — full fabric health sweep
- nvidia-smi -q -d PERFORMANCE — check P-state throttling
- nvidia-smi -q -d TEMPERATURE — check GPU temperature and throttle thresholds
- nvidia-smi -pl 600 — set the GPU power limit (here 600 W)

Key InfiniBand port error counters:

| Counter | Meaning | Action |
|---|---|---|
| SymbolErrorCounter | Physical layer bit errors — cable/transceiver signal quality | Rising fast → replace cable or transceiver |
| PortRcvErrors | Incoming packets with errors (CRC, frame) | Check cable quality, retrain link |
| PortXmitDiscards | TX packets dropped (HOL blocking, buffer overflow) | Check congestion, PFC config, buffer sizes |
| LocalLinkIntegrityErrors | Link recovery events (link went down & recovered) | Investigate cable/connector; rising = imminent failure |
| ExcessiveBufferOverrunErrors | Receiver buffer overflow — PFC not effective | Review PFC headroom config; check MTU consistency |
| VL15Dropped | SM management messages dropped | SM overloaded or VL15 buffer too small; check sm_priority |
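A sketch of how these counters are usually pulled from the fabric (LID and port values are placeholders):

```bash
# Extended (64-bit) counters for a single port, identified by LID and port number.
perfquery -x <lid> <port>

# Read, record, then reset the counters so the next read shows only new errors.
perfquery -r <lid> <port>

# Sweep the whole fabric and report ports whose error counters exceed thresholds.
ibqueryerrors
```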
DCGM monitors GPU health (temperature, ECC errors, power), runs diagnostics (Level 1/2/3), and exports metrics to Prometheus via DCGM Exporter. It's the single source of truth for GPU health — if DCGM says a GPU is sick, it's sick.
Xid 48 = Double-Bit ECC error (uncorrectable). Any node showing Xid 48 must be drained and the GPU physically replaced. No amount of reset or rebooting fixes VRAM cell failure.
MIG builds hardware walls between GPU partitions — dedicated SMs, VRAM, caches. MPS shares the GPU context in software — faster to set up, but one crash kills all tenants. Use MIG for multi-tenant production.
Ring AllReduce is bandwidth-optimal for large messages (>1 MB). Tree AllReduce is latency-optimal for small messages (<256 KB). CollNet (SHARP) reduces traffic for both by doing aggregation in the network.
Whenever a distributed training job hangs or shows poor performance, the first step is always NCCL_DEBUG=INFO. It shows topology discovery, algorithm selection, and which ranks failed to initialize — usually points to the root cause immediately.
UFM manages the InfiniBand fabric like an air traffic control tower: it discovers all nodes, assigns routes, monitors congestion, and fires alerts on link failures. It builds on OpenSM, adding rich telemetry, a REST API, and a web UI.
DCGM_FI_DEV_GPU_UTIL (SM Utilization) is the primary indicator that a GPU is doing compute work. Target ≥ 80% during training. If SM util is low while a job is running, the GPU is starved — usually a data loading or communication bottleneck.
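A quick spot-check sketch against the Export layer (node name is a placeholder; the metric name matches the DCGM Exporter default field set):

```bash
# Pull current GPU utilization for every GPU on a node straight from the
# DCGM Exporter endpoint that Prometheus scrapes.
curl -s http://gpu-node-01:9400/metrics | grep '^DCGM_FI_DEV_GPU_UTIL'
```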
Kubernetes doesn't know about GPUs natively. The NVIDIA GPU Device Plugin DaemonSet teaches kubelet to advertise nvidia.com/gpu resources. Without it, pods requesting GPUs will pend forever regardless of available hardware.
Pick the scenario that matches your situation for targeted guidance.

Setting up cluster-wide GPU monitoring:
- Deploy the DCGM Exporter DaemonSet (image nvcr.io/nvidia/k8s/dcgm-exporter); it exposes /metrics for Prometheus to scrape.
- Configure Prometheus to scrape 9400/metrics on all GPU nodes. Set the scrape interval to 15s for real-time visibility.
- Run dcgmi diag -r 1 on all nodes before major training runs — prevents wasted GPU-hours on broken hardware.

Training job hangs (NCCL debugging):
- Set NCCL_DEBUG=INFO and rerun — output shows topology, algorithm selection, and the exact rank that failed to initialize.
- Run ibstat on all nodes — every port must be in State: Active, Physical state: LinkUp. Any port down = that rank will hang.
- Run perfquery -x <lid> 1 on suspected ports — PortRcvErrors or SymbolErrorCounter rising confirms a bad link. Replace the cable.
- On RoCEv2, verify NCCL_IB_GID_INDEX=3 — GID index 0 routes as RoCEv1 and will fail on RoCEv2-only fabrics.

Poor AllReduce performance:
- Verify NCCL_IB_HCA is set to all available NIC ports (e.g. mlx5_0,mlx5_2,mlx5_4,mlx5_6 for 4 NICs per node).
- Set NCCL_ALGO=Ring to force ring AllReduce, or increase the per-layer gradient accumulation batch size.
- Check the VL15Dropped counter with perfquery — rising VL15Dropped = SM management traffic being dropped, causing routing stalls.

Partitioning GPUs with MIG or MPS:
- Enable MIG with sudo nvidia-smi -mig 1 — requires a GPU reset (terminates all running workloads on that GPU).
- Create slices with sudo nvidia-smi mig -cgi 1g.10gb -C (repeated per instance) for 7 equal slices on an H100 80GB. Check available profiles with nvidia-smi mig -lgip.
- In Kubernetes, use the MIG Manager (nvidia-mig-manager DaemonSet) — it reads a ConfigMap specifying the desired MIG strategy (single or mixed) and applies it automatically.
- Pods request slices via resources.limits: nvidia.com/mig-1g.10gb: 1 — the GPU device plugin advertises each slice as a separate schedulable resource.
- For MPS, set CUDA_MPS_PIPE_DIR and start the MPS server with nvidia-cuda-mps-control -d. All CUDA processes launched after this share the MPS context — good for many small inference servers on one GPU.
- To disable MIG, destroy the compute and GPU instances (nvidia-smi mig -dci, nvidia-smi mig -dgi), then run sudo nvidia-smi -mig 0.

GPU hardware fault (Xid errors):
- Check dmesg | grep -i xid. Xid 48 (DBE ECC) = drain the node immediately. Xid 79 (GPU off bus) = check PCIe seating, reboot.
- Run kubectl drain <node> --ignore-daemonsets --delete-emptydir-data — prevents new pods from scheduling on the faulty node.
- Run nvidia-smi -q -d ECC — look at "Volatile" and "Aggregate" DBE counts. Any DBE under "Volatile" means the error occurred this session.
- Attempt a reset with sudo nvidia-smi -r -i <gpu-id>. If the reset fails, power cycle the node.
- Run dcgmi diag -g <gpu-group> -r 2 — must pass before re-adding the node to the cluster.
- Use nvidia-smi --query-gpu=serial --format=csv to get the GPU serial number for the RMA ticket.
- Return the node to service with kubectl uncordon <node>.