Kubernetes GPU scheduling, MIG & MPS, DCGM metrics, UFM fabric management, NCCL AllReduce tuning, and fault detection — the complete AI cluster operations stack.
Schedule GPU workloads efficiently, partition GPUs for multi-tenant use, and manage cluster-wide resources.
Collect, store, and visualize GPU and network metrics so problems are caught before jobs fail.
Identify GPU hardware errors, network link failures, and NCCL hangs before they corrupt training runs.
The network is only valuable if the cluster actually runs jobs correctly. NCP-AIN tests that you understand how workloads are scheduled onto the fabric, how GPU and network health are monitored together, and how communication libraries like NCCL interact with the IB and Ethernet infrastructure covered in Topics 1–4.
Kubernetes doesn't natively understand GPUs. The NVIDIA GPU Device Plugin runs as a DaemonSet on every GPU node and advertises GPU resources to the kubelet so pods can request them.
A pod requests a whole GPU with the resource limit nvidia.com/gpu: 1.

| Key Kubernetes GPU Concepts | Details |
|---|---|
| Resource name | nvidia.com/gpu — integer, whole GPU units |
| DaemonSet | GPU device plugin runs on every GPU node |
| Node labels | accelerator=nvidia-tesla-h100 for node selection |
| Topology hints | NUMA-aware scheduling via Topology Manager |
| MIG resources | nvidia.com/mig-1g.10gb per MIG slice profile |
| Time-slicing | ConfigMap overrides — logical GPUs > physical GPUs |
| Health checks | DCGM health check integration via device plugin |
| NFD | Node Feature Discovery — labels nodes with GPU model, driver, CUDA ver |
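As a minimal sketch of how a pod consumes these resources (pod name and image tag are placeholders; the commented MIG limit assumes a MIG-enabled node whose device plugin advertises per-profile resources):

```bash
# Hypothetical pod requesting one whole GPU via the device plugin's extended resource.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                  # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # example image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1               # or nvidia.com/mig-1g.10gb: 1 on MIG nodes
EOF
```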
| Feature | MIG — Multi-Instance GPU | MPS — Multi-Process Service | Time-Slicing |
|---|---|---|---|
| Isolation | Hardware — dedicated SM + VRAM | Software context sharing | Temporal — one at a time |
| Memory isolation | Fully isolated VRAM partitions | Shared (fault = all die) | Shared (no isolation) |
| Fault isolation | One slice crash ≠ others | One crash kills all clients | No fault isolation |
| Concurrency | True simultaneous execution | True simultaneous (shared SMs) | Sequential, not concurrent |
| GPU support | A100, H100, A30 (Ampere+) | All CUDA GPUs | All CUDA GPUs |
| Best for | Multi-tenant inference, CI/CD | Many small cooperative jobs | Dev/test only — oversubscription |
| K8s resource | nvidia.com/mig-* | Enabled via CUDA_MPS_PIPE_DIR | ConfigMap device plugin |
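The time-slicing row above is driven by a device plugin sharing config. A minimal sketch, assuming a GPU Operator deployment (ConfigMap name, namespace, and the replica count of 4 are arbitrary choices for illustration):

```bash
# Hypothetical sharing config: each physical GPU is advertised as 4 logical
# nvidia.com/gpu resources. No memory or fault isolation between tenants.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config             # placeholder name
  namespace: gpu-operator               # assumes GPU Operator namespace
data:
  any: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
EOF
```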
nvidia.com/mig-1g.10gb = 1 GPU instance slice (1/7th of H100), 10 GB VRAM. An H100 80GB can be sliced into 7 × 1g.10gb, or other combinations. MIG must be enabled on the GPU before any workloads run — it requires a GPU reset to toggle.
Use --gres=gpu:h100:8 to request 8 H100s per node.

```bash
# Submit a multi-GPU training job
sbatch --nodes=4 --ntasks-per-node=8 \
  --gres=gpu:h100:8 --partition=gpu \
  train.sh

# Check job queue
squeue -u $USER --format="%.10i %.9P %.20j %.8T %.10M"

# Check GPU node availability
sinfo -p gpu --format="%N %G %t %m"

# Interactive GPU session
srun --gres=gpu:1 --pty bash
```
Intercepts container creation and injects GPU device files, NVIDIA libraries, and CUDA into the container's mount namespace — without the container image needing to bundle drivers. Required for both Docker (--gpus all) and Kubernetes (set runtimeClassName: nvidia). The toolkit maps NVIDIA_VISIBLE_DEVICES env var to specific GPU device files.
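A quick sketch of the toolkit in action (the CUDA image tag is just an example):

```bash
# All GPUs are injected into the container; the toolkit mounts /dev/nvidia*
# device files and the host driver libraries at container start.
docker run --rm --gpus all nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

# Limit the container to specific GPUs (mapped via NVIDIA_VISIBLE_DEVICES).
docker run --rm --gpus '"device=0,1"' nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```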
How does a Kubernetes pod request a GPU?
Resource request: nvidia.com/gpu: 1
The GPU Device Plugin DaemonSet advertises these resources to kubelet. For MIG slices: nvidia.com/mig-1g.10gb
What is the key advantage of MIG over MPS?
Hardware fault isolation
MIG gives each tenant dedicated SMs and VRAM — a fault in one slice cannot affect another. MPS shares the hardware context, so one crash kills all co-tenants.
What does NVIDIA Container Toolkit do?
Injects GPU into containers at runtime
Intercepts container creation, mounts GPU device files + NVIDIA libs into the namespace. Container images don't need GPU drivers — only CUDA runtime.
What is GRES in Slurm?
Generic Resource Scheduling
Slurm's mechanism for tracking non-CPU/RAM resources like GPUs. --gres=gpu:h100:8 requests 8 H100 GPUs. Slurm tracks GRES usage per node and enforces limits.
DCGM is the centralized daemon that collects GPU metrics, runs health diagnostics, and configures policies on all GPUs in a node. It communicates with GPUs via NVML (NVIDIA Management Library).
```bash
# List all GPUs managed by DCGM
dcgmi discovery -l

# Run health check on all GPUs (short = ~30s)
dcgmi health -g 0 -c          # group 0 = all GPUs
dcgmi diag -g 0 -r 1          # r1=quick, r2=medium, r3=long (burn-in)

# Watch live GPU metrics
dcgmi dmon -e 203,204,1002,1003,1004   # SM util, mem util, temp, power, fan

# Query specific fields
dcgmi group -l                # list groups
dcgmi stats -g 0 -e 1001      # ECC double-bit errors

# DCGM Exporter (runs as sidecar container)
docker run -d --gpus all --rm \
  -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:latest

# Prometheus scrapes: http://<node>:9400/metrics
```
| Feature | UFM | OpenSM |
|---|---|---|
| GUI | ✅ Web UI | ❌ CLI only |
| Telemetry | Rich per-port metrics | Basic |
| REST API | ✅ Full API | ❌ |
| Scale | 10,000+ nodes | Smaller fabrics |
| SHARP mgmt | Integrated | Separate |
| License | Commercial | Open source |
The full monitoring stack, from GPU to dashboards:

| Layer | Component & Role |
|---|---|
| GPU Layer | DCGM — GPU metrics, health diagnostics, policy enforcement. Source of truth for GPU health. |
| Export | DCGM Exporter — sidecar container; exposes DCGM metrics as a Prometheus /metrics endpoint on port 9400. |
| Network Layer | UFM — InfiniBand fabric telemetry: port BW, error counters, topology, events. |
| Storage | Prometheus — scrapes DCGM Exporter + node-exporter; stores time-series metrics; fires alerts. |
| Visualization | Dashboards (typically Grafana) querying Prometheus. NVIDIA provides pre-built DCGM dashboard templates. |
| Tracing | Distributed tracing for AI serving pipelines — trace inference request latency end-to-end. |
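A minimal Prometheus config sketch for the Export layer (job name and node addresses are placeholders; a Kubernetes deployment would normally use service discovery instead of static targets):

```bash
# Standalone prometheus.yml scraping DCGM Exporter on two GPU nodes every 15s.
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: dcgm-exporter            # placeholder job name
    static_configs:
      - targets:
          - gpu-node-01:9400           # example GPU node addresses
          - gpu-node-02:9400
EOF
```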
NCCL (NVIDIA Collective Communications Library) provides GPU-optimized implementations of collective operations — AllReduce, AllGather, ReduceScatter, Broadcast — used by every major deep learning framework (PyTorch, TensorFlow, JAX). NCCL automatically detects the network topology (NVLink, IB, RoCEv2) and selects the fastest algorithm.
Ring: GPUs form a logical ring. Each GPU sends a chunk to its neighbor, receiving and accumulating partial results, until all GPUs hold the full sum. Bandwidth-optimal — per-GPU traffic stays nearly constant as N grows, while the step count (latency) scales linearly with N.
Tree: binary tree reduction — GPUs reduce upward to the root, then broadcast downward. Logarithmic latency scaling. Used for small messages or very large GPU counts.
CollNet (SHARP): delegates AllReduce to SHARP-capable switches (Quantum-2 / Spectrum-4). Aggregation happens in the network — GPUs send and receive only the final results.
| Protocol | Description |
|---|---|
| LL (Low Latency) | Inline flag in data — lowest latency, limited BW. Small ops. |
| LL128 | 128-byte granularity LL. Better BW than LL, still low latency. |
| Simple | Bulk transfer — maximum bandwidth. Large ops (>1 MB). Default for IB. |
NCCL automatically detects intra-node topology (NVLink, PCIe) and inter-node fabric (IB, RoCEv2) on startup. Enable verbose output to see what it found:
| Variable | Purpose & Key Values |
|---|---|
| NCCL_DEBUG | Logging level. INFO = topology + algo. TRACE = per-op verbose. Start here when NCCL hangs. |
| NCCL_DEBUG_SUBSYS | Filter by subsystem: ALL, INIT, GRAPH, TUNING, NET, IB. Use NET,IB for network debugging. |
| NCCL_IB_HCA | Comma-list of InfiniBand HCAs to use. E.g. mlx5_0,mlx5_2. Critical on multi-NIC nodes. |
| NCCL_IB_GID_INDEX | GID index for RoCEv2. Always set to 3 for RoCEv2; 0 = RoCEv1/IB native. |
| NCCL_SOCKET_IFNAME | Ethernet interface for non-RDMA transport fallback. E.g. eth0. |
| NCCL_IB_DISABLE | 0 = use IB/RoCEv2 (default). 1 = force TCP fallback — useful for debugging. |
| NCCL_ALGO | Force algorithm: Ring, Tree, CollNet (SHARP). Default = auto-selected. |
| NCCL_PROTO | Force protocol: LL, LL128, Simple. Default = auto-selected per message size. |
| NCCL_TOPO_FILE | Path to XML topology file — override auto-detection. Useful for non-standard topologies. |
| NCCL_CROSS_NIC | 0 = same NIC for send/recv (default). 1 = allow different NICs for ring traversal. |
| SHARP_COLL_ENABLE_SAT | 1 = enable SHARP AllReduce via NCCL CollNet plugin. |
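As a sketch of how these variables are typically applied to a multi-node launch (the torchrun invocation, HCA names, interface name, and train.py are illustrative assumptions):

```bash
# Verbose topology/algorithm logging plus explicit NIC selection for a
# hypothetical 4-node x 8-GPU PyTorch job on a RoCEv2 fabric.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
export NCCL_IB_HCA=mlx5_0,mlx5_2,mlx5_4,mlx5_6   # example HCAs on this node
export NCCL_IB_GID_INDEX=3                        # RoCEv2 GID index
export NCCL_SOCKET_IFNAME=eth0                    # bootstrap / TCP fallback interface

torchrun --nnodes=4 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=head-node:29500 \
  train.py                                        # placeholder training script
```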
When a training job hangs and all GPUs show 0% utilization: (1) Set NCCL_DEBUG=INFO and rerun — check if all ranks initialized successfully; (2) Check IB port errors with perfquery — credit starvation on VL0 causes deadlocks; (3) Verify NCCL_IB_GID_INDEX=3 on RoCEv2 fabrics; (4) Check for firewall rules blocking UDP 4791 (RoCEv2); (5) As a diagnostic, set NCCL_IB_DISABLE=1 — if training resumes over TCP, the RDMA path is the problem.
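A compact sketch of steps (2) and (5) from that checklist (LID, port, and the launch command are placeholders):

```bash
# Step 2: check a suspect switch port for link errors / credit-starvation symptoms.
perfquery -x <lid> <port> | grep -E 'SymbolError|PortRcvErrors|PortXmitDiscards'

# Step 5: rule the RDMA path in or out by forcing the TCP transport.
# If the job now makes (slow) progress, the IB/RoCEv2 path is the problem.
NCCL_IB_DISABLE=1 torchrun --nnodes=4 --nproc_per_node=8 train.py
```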
Xid errors appear in dmesg, nvidia-smi, and DCGM. They are the GPU driver's error reporting mechanism — every Xid has a specific meaning. These are the ones tested on NCP-AIN:
| Xid | Name | Severity | Meaning & Action |
|---|---|---|---|
| Xid 8 | GPU MMU fault | Critical | Memory management unit error — usually bad CUDA code (illegal access). Check application, may need GPU reset. |
| Xid 13 | Graphics engine exception | Warning | Graphics engine hit an exception. Common with compute+graphics mixed workloads. |
| Xid 31 | GPU memory page fault | Critical | GPU accessed unmapped memory — typically application bug. Check CUDA memcheck. |
| Xid 38 | Driver firmware error | Critical | Unrecoverable firmware fault. Requires GPU reset or node reboot. Investigate hardware. |
| Xid 43 | GPU stopped processing | Critical | GPU hung — no progress for watchdog timeout. Reset required. May indicate thermal or HW issue. |
| Xid 48 | Double-bit ECC error | Critical | Uncorrectable memory error (DBE). GPU VRAM defect. Node must be drained and GPU replaced. |
| Xid 63 | Row remapping pending | Warning | GPU remapping a failing memory row (single-bit ECC). GPU still operational. Monitor for Xid 48 progression. |
| Xid 79 | GPU has fallen off the bus | Critical | PCIe link lost. Check seating, PCIe slot, motherboard. Node reboot required. |
| Xid 92 | High single-bit ECC count | Warning | High volume of correctable ECC errors — early warning of memory degradation. Schedule GPU replacement. |
A double-bit ECC (DBE) error is uncorrectable — the GPU cannot recover the corrupted data. Any node showing Xid 48 must be drained immediately: kubectl drain <node> or scontrol update NodeName=<node> State=DRAIN. The GPU should be physically replaced, not just reset.
Level 1 (r1) — checks: GPU init, PCIe bandwidth, memory bandwidth, power, clock. Run before every job. Catches dead GPUs, PCIe issues, power limit problems.
Level 2 (r2) — all Level 1 checks plus: memory stress test, P2P bandwidth, NVLink checks. Run daily or after hardware changes. Good for post-maintenance validation.
Level 3 (r3) — full burn-in: all Level 2 plus extended memory, compute, PCIe, NVLink stress. Run on new hardware commissioning or after a suspected hardware fault.
- kubectl drain <node> --ignore-daemonsets — take the suspect node out of scheduling
- nvidia-smi -r — attempt GPU reset
- dcgmi diag -r 3 — full burn-in diagnostic on the drained node
- NCCL_DEBUG=INFO — check which rank failed to initialize
- ibstat on failed nodes — check port state (must be Active)
- perfquery — look for rising SymbolErrors or PortRcvErrors
- ibdiagnet -r — full fabric health sweep
- nvidia-smi -q -d PERFORMANCE — check P-state throttling
- nvidia-smi -q -d TEMPERATURE — check GPU temperature and throttle thresholds
- nvidia-smi -pl 600 — set the GPU power limit (here 600 W)

Key InfiniBand port error counters:

| Counter | Meaning | Action |
|---|---|---|
| SymbolErrorCounter | Physical layer bit errors — cable/transceiver signal quality | Rising fast → replace cable or transceiver |
| PortRcvErrors | Incoming packets with errors (CRC, frame) | Check cable quality, retrain link |
| PortXmitDiscards | TX packets dropped (HOL blocking, buffer overflow) | Check congestion, PFC config, buffer sizes |
| LocalLinkIntegrityErrors | Link recovery events (link went down & recovered) | Investigate cable/connector; rising = imminent failure |
| ExcessiveBufferOverrunErrors | Receiver buffer overflow — PFC not effective | Review PFC headroom config; check MTU consistency |
| VL15Dropped | SM management messages dropped | SM overloaded or VL15 buffer too small; check sm_priority |
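A sketch of how these counters are usually pulled from the fabric (LID and port values are placeholders):

```bash
# Extended (64-bit) counters for a single port, identified by LID and port number.
perfquery -x <lid> <port>

# Read, record, then reset the counters so the next read shows only new errors.
perfquery -r <lid> <port>

# Sweep the whole fabric and report ports whose error counters exceed thresholds.
ibqueryerrors
```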
DCGM monitors GPU health (temperature, ECC errors, power), runs diagnostics (Level 1/2/3), and exports metrics to Prometheus via DCGM Exporter. It's the single source of truth for GPU health — if DCGM says a GPU is sick, it's sick.
Xid 48 = Double-Bit ECC error (uncorrectable). Any node showing Xid 48 must be drained and the GPU physically replaced. No amount of reset or rebooting fixes VRAM cell failure.
MIG builds hardware walls between GPU partitions — dedicated SMs, VRAM, caches. MPS shares the GPU context in software — faster to set up, but one crash kills all tenants. Use MIG for multi-tenant production.
Ring AllReduce is bandwidth-optimal for large messages (>1 MB). Tree AllReduce is latency-optimal for small messages (<256 KB). CollNet (SHARP) reduces traffic for both by doing aggregation in the network.
Whenever a distributed training job hangs or shows poor performance, the first step is always NCCL_DEBUG=INFO. It shows topology discovery, algorithm selection, and which ranks failed to initialize — usually points to the root cause immediately.
UFM manages the InfiniBand fabric like an air traffic control tower: it discovers all nodes, assigns routes, monitors congestion, and fires alerts on link failures. It builds on OpenSM, adding rich telemetry, a REST API, and a web UI.
DCGM_FI_DEV_GPU_UTIL (SM Utilization) is the primary indicator that a GPU is doing compute work. Target ≥ 80% during training. If SM util is low while a job is running, the GPU is starved — usually a data loading or communication bottleneck.
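A quick spot-check sketch against the Export layer (node name is a placeholder; the metric name matches the DCGM Exporter default field set):

```bash
# Pull current GPU utilization for every GPU on a node straight from the
# DCGM Exporter endpoint that Prometheus scrapes.
curl -s http://gpu-node-01:9400/metrics | grep '^DCGM_FI_DEV_GPU_UTIL'
```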
Kubernetes doesn't know about GPUs natively. The NVIDIA GPU Device Plugin DaemonSet teaches kubelet to advertise nvidia.com/gpu resources. Without it, pods requesting GPUs will pend forever regardless of available hardware.
Pick the scenario that matches your situation for targeted guidance.

Setting up cluster-wide GPU monitoring:
- Deploy the DCGM Exporter DaemonSet (image nvcr.io/nvidia/k8s/dcgm-exporter); it exposes /metrics for Prometheus to scrape.
- Configure Prometheus to scrape 9400/metrics on all GPU nodes. Set the scrape interval to 15s for real-time visibility.
- Run dcgmi diag -r 1 on all nodes before major training runs — prevents wasted GPU-hours on broken hardware.

Training job hangs (NCCL debugging):
- Set NCCL_DEBUG=INFO and rerun — output shows topology, algorithm selection, and the exact rank that failed to initialize.
- Run ibstat on all nodes — every port must be in State: Active, Physical state: LinkUp. Any port down = that rank will hang.
- Run perfquery -x <lid> 1 on suspected ports — PortRcvErrors or SymbolErrorCounter rising confirms a bad link. Replace the cable.
- On RoCEv2, verify NCCL_IB_GID_INDEX=3 — GID index 0 routes as RoCEv1 and will fail on RoCEv2-only fabrics.

Poor AllReduce performance:
- Verify NCCL_IB_HCA is set to all available NIC ports (e.g. mlx5_0,mlx5_2,mlx5_4,mlx5_6 for 4 NICs per node).
- Set NCCL_ALGO=Ring to force ring AllReduce, or increase the per-layer gradient accumulation batch size.
- Check the VL15Dropped counter with perfquery — rising VL15Dropped = SM management traffic being dropped, causing routing stalls.

Partitioning GPUs with MIG or MPS:
- Enable MIG with sudo nvidia-smi -mig 1 — requires a GPU reset (terminates all running workloads on that GPU).
- Create slices with sudo nvidia-smi mig -cgi 1g.10gb -C (repeated per instance) for 7 equal slices on an H100 80GB. Check available profiles with nvidia-smi mig -lgip.
- In Kubernetes, use the MIG Manager (nvidia-mig-manager DaemonSet) — it reads a ConfigMap specifying the desired MIG strategy (single or mixed) and applies it automatically.
- Pods request slices via resources.limits: nvidia.com/mig-1g.10gb: 1 — the GPU device plugin advertises each slice as a separate schedulable resource.
- For MPS, set CUDA_MPS_PIPE_DIR and start the MPS server with nvidia-cuda-mps-control -d. All CUDA processes launched after this share the MPS context — good for many small inference servers on one GPU.
- To disable MIG, destroy the compute and GPU instances (nvidia-smi mig -dci, nvidia-smi mig -dgi), then run sudo nvidia-smi -mig 0.

GPU hardware fault (Xid errors):
- Check dmesg | grep -i xid. Xid 48 (DBE ECC) = drain the node immediately. Xid 79 (GPU off bus) = check PCIe seating, reboot.
- Run kubectl drain <node> --ignore-daemonsets --delete-emptydir-data — prevents new pods from scheduling on the faulty node.
- Run nvidia-smi -q -d ECC — look at "Volatile" and "Aggregate" DBE counts. Any DBE under "Volatile" means the error occurred this session.
- Attempt a reset with sudo nvidia-smi -r -i <gpu-id>. If the reset fails, power cycle the node.
- Run dcgmi diag -g <gpu-group> -r 2 — must pass before re-adding the node to the cluster.
- Use nvidia-smi --query-gpu=serial --format=csv to get the GPU serial number for the RMA ticket.
- Return the node to service with kubectl uncordon <node>.