NCP-AIN · Topic 5 · Cluster Ops & Observability

AI Cluster Orchestration
& Observability

Kubernetes GPU scheduling, MIG & MPS, DCGM metrics, UFM fabric management, NCCL AllReduce tuning, and fault detection — the complete AI cluster operations stack.

5
Orchestration Layers
15+
DCGM Metrics
3
NCCL Algorithms
10
Practice Questions

The AI Cluster Operations Stack
Five layers sit between bare hardware and a running training job — every layer has its own tools, metrics, and failure modes.

🏗️ AI Cluster Stack — Layer by Layer

Layer 5 — Workload Scheduling
Kubernetes
GPU device plugin
Slurm
HPC job scheduler
Run:ai
GPU quota mgmt
NVIDIA BCM
Base Command Mgr
Layer 4 — Container Runtime & GPU Partitioning
NVIDIA Container Toolkit
MIG
HW partitioning
MPS
Context sharing
Time-slicing
SW sharing
Layer 3 — Observability & Management
DCGM
GPU metrics
UFM
IB fabric mgr
Prometheus
Metrics DB
Grafana
Dashboards
DCGM Exporter
Layer 2 — Communication Libraries
NCCL
AllReduce/Gather
MPI
HPC comms
SHARP
In-network agg
RDMA
IB / RoCEv2
Layer 1 — Hardware Fabric
InfiniBand NDR/HDR
Spectrum-X Ethernet
NVLink / NVSwitch
PCIe Gen5
🎛️

Orchestration

Schedule GPU workloads efficiently, partition GPUs for multi-tenant use, and manage cluster-wide resources.

  • Kubernetes + GPU device plugin
  • Slurm for HPC-style batch jobs
  • MIG partitioning per tenant
  • MPS for multi-process GPU sharing
  • NVIDIA Container Toolkit runtime
📊

Observability

Collect, store, and visualize GPU and network metrics so problems are caught before jobs fail.

  • DCGM — GPU health & performance
  • DCGM Exporter → Prometheus
  • UFM — InfiniBand telemetry
  • Grafana dashboards
  • Per-flow network counters
🚨

Fault Detection

Identify GPU hardware errors, network link failures, and NCCL hangs before they corrupt training runs.

  • Xid errors from NVIDIA driver
  • DCGM health check levels
  • IB port error counters
  • NCCL timeout detection
  • Job checkpoint & recovery
💡

Why This Topic Matters for NCP-AIN

The network is only valuable if the cluster actually runs jobs correctly. NCP-AIN tests that you understand how workloads are scheduled onto the fabric, how GPU and network health are monitored together, and how communication libraries like NCCL interact with the IB and Ethernet infrastructure covered in Topics 1–4.


Workload Orchestration
How GPU jobs reach the hardware — from Kubernetes scheduling and the device plugin to MIG partitioning and MPS context sharing.

☸️ Kubernetes GPU Scheduling

NVIDIA GPU Device Plugin

Kubernetes doesn't natively understand GPUs. The NVIDIA GPU Device Plugin runs as a DaemonSet on every GPU node and advertises GPU resources to the kubelet so pods can request them.

Scheduling Flow

Pod requests
nvidia.com/gpu: 1
kube-scheduler
selects node
Device plugin
allocates GPU
Container gets
GPU env vars
Key Kubernetes GPU Concepts
Resource name: nvidia.com/gpu — integer, whole GPU units
DaemonSet: GPU device plugin runs on every GPU node
Node labels: accelerator=nvidia-tesla-h100 for node selection
Topology hints: NUMA-aware scheduling via Topology Manager
MIG resources: nvidia.com/mig-1g.10gb per MIG slice profile
Time-slicing: ConfigMap overrides — logical GPUs > physical GPUs
Health checks: DCGM health check integration via device plugin
NFD: Node Feature Discovery — labels nodes with GPU model, driver, CUDA version
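To make the resource request concrete, here is a minimal pod manifest applied from the shell — a sketch only: the pod name and container image/tag are illustrative, and the one line that matters to the device plugin is the nvidia.com/gpu limit.

# Minimal GPU pod sketch (image and names are placeholders)
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image/tag
    command: ["nvidia-smi", "-L"]                # list the GPU the pod was given
    resources:
      limits:
        nvidia.com/gpu: 1        # or nvidia.com/mig-1g.10gb for a MIG slice
EOF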

GPU Sharing: MIG vs MPS vs Time-Slicing

Feature | MIG — Multi-Instance GPU | MPS — Multi-Process Service | Time-Slicing
Isolation | Hardware — dedicated SM + VRAM | Software context sharing | Temporal — one at a time
Memory isolation | Fully isolated VRAM partitions | Shared (fault = all die) | Shared (no isolation)
Fault isolation | One slice crash ≠ others | One crash kills all clients | No fault isolation
Concurrency | True simultaneous execution | True simultaneous (shared SMs) | Sequential, not concurrent
GPU support | A100, H100, A30 (Ampere+) | All CUDA GPUs | All CUDA GPUs
Best for | Multi-tenant inference, CI/CD | Many small cooperative jobs | Dev/test only — oversubscription
K8s resource | nvidia.com/mig-* | Enabled via CUDA_MPS_PIPE_DIR | ConfigMap device plugin
⚠️

MIG Profile Naming Convention

nvidia.com/mig-1g.10gb = 1 GPU instance slice (1/7th of H100), 10 GB VRAM. An H100 80GB can be sliced into 7 × 1g.10gb, or other combinations. MIG must be enabled on the GPU before any workloads run — it requires a GPU reset to toggle.

📋 Slurm Workload Manager

Key Slurm Concepts

Partition: Named group of nodes with shared policy (GPU partition, CPU partition, priority queues)
GRES (Generic Resources): GPU resource tracking — --gres=gpu:h100:8 to request 8 H100s
srun: Launch job steps interactively or in parallel within an allocation
sbatch: Submit batch script; queued until resources available
squeue: View pending & running jobs
sinfo: View node/partition state
MPS integration: Slurm can enable/disable CUDA MPS per-job automatically

Common Slurm GPU Commands

# Submit a multi-GPU training job
sbatch --nodes=4 --ntasks-per-node=8 \
  --gres=gpu:h100:8 --partition=gpu \
  train.sh

# Check job queue
squeue -u $USER --format="%.10i %.9P %.20j %.8T %.10M"

# Check GPU node availability
sinfo -p gpu --format="%N %G %t %m"

# Interactive GPU session
srun --gres=gpu:1 --pty bash
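
The sbatch example above submits train.sh — a sketch of what that script might contain if the directives are baked into the script instead of passed on the command line. The job name, HCA list, and Python entrypoint are placeholders.

#!/bin/bash
#SBATCH --job-name=llm-train
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8        # one task per GPU
#SBATCH --gres=gpu:h100:8
#SBATCH --partition=gpu

# NCCL settings for the job (HCA list is node-specific)
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5_0,mlx5_1

# srun launches one training process per allocated task/GPU
srun python train.py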
🐳

NVIDIA Container Toolkit (nvidia-container-runtime)

Intercepts container creation and injects GPU device files, NVIDIA libraries, and CUDA into the container's mount namespace — without the container image needing to bundle drivers. Required for both Docker (--gpus all) and Kubernetes (set runtimeClassName: nvidia). The toolkit maps NVIDIA_VISIBLE_DEVICES env var to specific GPU device files.
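
Two quick sanity checks that the toolkit is injecting GPUs correctly — the CUDA image tag is illustrative:

# All GPUs visible inside the container
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L

# Only GPUs 0 and 2 (becomes NVIDIA_VISIBLE_DEVICES inside the container)
docker run --rm --gpus '"device=0,2"' nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L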


Orchestration Flash Cards — Click to Flip

☸️

How does a Kubernetes pod request a GPU?

Resource request: nvidia.com/gpu: 1

The GPU Device Plugin DaemonSet advertises these resources to kubelet. For MIG slices: nvidia.com/mig-1g.10gb

🔪

What is the key advantage of MIG over MPS?

Hardware fault isolation

MIG gives each tenant dedicated SMs and VRAM — a fault in one slice cannot affect another. MPS shares the hardware context, so one crash kills all co-tenants.

📦

What does NVIDIA Container Toolkit do?

Injects GPU into containers at runtime

Intercepts container creation, mounts GPU device files + NVIDIA libs into the namespace. Container images don't need GPU drivers — only CUDA runtime.

📋

What is GRES in Slurm?

Generic Resource Scheduling

Slurm's mechanism for tracking non-CPU/RAM resources like GPUs. --gres=gpu:h100:8 requests 8 H100 GPUs. Slurm tracks GRES usage per node and enforces limits.


GPU & Network Observability
DCGM for GPU health, DCGM Exporter for Prometheus, UFM for fabric telemetry — the full monitoring pipeline.

📊 NVIDIA DCGM — Data Center GPU Manager

What DCGM Does

DCGM is the centralized daemon that collects GPU metrics, runs health diagnostics, and configures policies on all GPUs in a node. It communicates with GPUs via NVML (NVIDIA Management Library).

dcgmi: CLI interface to DCGM daemon
DCGM Exporter: Sidecar that exposes DCGM metrics to Prometheus
Field groups: Named sets of metrics polled together
Health checks: 3 levels — short/medium/long diagnostic
Policy manager: Trigger actions on threshold violations (power, temperature)

Observability Pipeline

GPU hardware (NVML interface)
DCGM daemon — collects metrics + health
DCGM Exporter — exposes /metrics endpoint
Prometheus — scrapes & stores time-series
Grafana — dashboards, alerts, on-call
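
A minimal scrape-config sketch for this pipeline — the node names are placeholders, and in practice you would merge this into an existing prometheus.yml rather than overwrite it. Port 9400 and /metrics match the DCGM Exporter defaults.

# Minimal standalone Prometheus config scraping DCGM Exporter on each GPU node
cat > /etc/prometheus/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: dcgm                 # DCGM Exporter on every GPU node
    static_configs:
      - targets: ['gpu-node-01:9400', 'gpu-node-02:9400']   # placeholder hostnames
EOF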

Key DCGM Metrics — Know These for the Exam

SM Utilization
DCGM_FI_DEV_GPU_UTIL
% of SMs active. Primary measure of GPU compute activity.
Target: ≥ 80%
💾
Memory Utilization
DCGM_FI_DEV_MEM_COPY_UTIL
% of memory bandwidth used. Bottleneck for transformer models.
Watch: > 90%
🌡️
GPU Temperature
DCGM_FI_DEV_GPU_TEMP
Die temperature in °C. Throttling begins at thermal limit.
Throttle: ≥ 83°C (H100)
🔋
Power Draw
DCGM_FI_DEV_POWER_USAGE
Instantaneous power in watts. H100 SXM5 TDP = 700W.
Cap: TDP (700W for H100)
🔗
NVLink Bandwidth
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL
Total NVLink TX+RX bandwidth. Key for NVSwitch-connected GPUs.
H100: up to 900 GB/s
ECC Errors
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL
Double-bit ECC errors (uncorrectable). Non-zero = GPU needs replacement.
Alert: any DBE > 0
🕐
PCIe Replay
DCGM_FI_DEV_PCIE_REPLAY_COUNTER
PCIe transaction retries. Rising count = link signal integrity problem.
Alert: any rapid increase
⏱️
Tensor Core Util
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
Fraction of cycles the Tensor Cores are active. Key AI training health metric.
Target: ≥ 60% for LLM
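
These field names are also what the DCGM Exporter publishes to Prometheus, so alert rules can reference them directly. A sketch, assuming the fields are enabled in the exporter's field list — rule names, file path, and thresholds (taken from the cards above) are illustrative.

# Example Prometheus alerting rules built on DCGM metric names
cat > /etc/prometheus/rules/dcgm-alerts.yml <<'EOF'
groups:
  - name: dcgm
    rules:
      - alert: GpuDoubleBitEcc
        expr: DCGM_FI_DEV_ECC_DBE_VOL_TOTAL > 0     # any DBE -> drain the node
        labels:
          severity: critical
      - alert: GpuThermalThrottle
        expr: DCGM_FI_DEV_GPU_TEMP >= 83            # H100 throttle threshold
        for: 5m
        labels:
          severity: warning
      - alert: GpuStarved
        expr: DCGM_FI_DEV_GPU_UTIL < 10             # idle GPU during an expected training window
        for: 10m
        labels:
          severity: warning
EOF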

Essential DCGM Commands

# List all GPUs managed by DCGM
dcgmi discovery -l

# Run health check on all GPUs (short = ~30s)
dcgmi health -g 0 -c          # group 0 = all GPUs
dcgmi diag -g 0 -r 1          # r1=quick, r2=medium, r3=long (burn-in)

# Watch live GPU metrics
dcgmi dmon -e 203,204,1002,1003,1004  # GPU util, mem-copy util, SM active, SM occupancy, tensor pipe active

# Query specific fields (311 = DCGM_FI_DEV_ECC_DBE_VOL_TOTAL)
dcgmi group -l                # list GPU groups
dcgmi dmon -e 311 -c 1        # one-shot read of double-bit ECC errors

# DCGM Exporter (runs as sidecar container)
docker run -d --gpus all --rm \
  -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:latest
# Prometheus scrapes: http://<node>:9400/metrics

🔗 UFM — Unified Fabric Manager (InfiniBand)

UFM Capabilities

Topology discovery: Auto-discovers all IB switches, HCAs, and cables — renders full fabric graph
Subnet management: Replaces OpenSM — manages LID assignment, routing, partitions
Congestion mgmt: Monitors FECN/BECN counters, adjusts QoS policies
Port telemetry: Real-time bandwidth, error, and latency per port
Event alerts: Link down, symbol errors, port errors → webhook/email
REST API: Programmable — query topology, push config, stream events
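
A hedged sketch of pulling fabric data over that REST API — the hostname and credentials are placeholders, and the endpoint paths should be checked against your UFM version's API reference.

# Query UFM over REST (self-signed cert -> -k; paths per the UFM REST API guide)
curl -k -u admin:PASSWORD https://ufm.example.com/ufmRest/resources/systems   # switches & hosts
curl -k -u admin:PASSWORD https://ufm.example.com/ufmRest/resources/ports     # per-port state & counters
curl -k -u admin:PASSWORD https://ufm.example.com/ufmRest/app/events          # recent fabric events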

UFM vs OpenSM

Feature | UFM | OpenSM
GUI | ✅ Web UI | ❌ CLI only
Telemetry | Rich per-port metrics | Basic
REST API | ✅ Full API | ❌ None
Scale | 10,000+ nodes | Smaller fabrics
SHARP mgmt | Integrated | Separate
License | Commercial | Open source

Full Observability Stack

🖥️

NVIDIA DCGM

GPU metrics, health diagnostics, policy enforcement. Source of truth for GPU health.

GPU Layer
📤

DCGM Exporter

Sidecar container — exposes DCGM metrics as Prometheus /metrics endpoint on port 9400.

Export
🔗

NVIDIA UFM

InfiniBand fabric telemetry — port BW, error counters, topology, events.

Network Layer
📊

Prometheus

Scrapes DCGM Exporter + node-exporter. Stores time-series metrics. Fires alerts.

Storage
📈

Grafana

Dashboards querying Prometheus. NVIDIA provides pre-built DCGM dashboard templates.

Visualization
🔭

OpenTelemetry

Distributed tracing for AI serving pipelines — trace inference request latency end-to-end.

Tracing

NCCL & Fabric Communication
How NCCL discovers topology, selects AllReduce algorithms, and maps onto InfiniBand and Ethernet fabrics.
📡

What is NCCL?

NCCL (NVIDIA Collective Communications Library) provides GPU-optimized implementations of collective operations — AllReduce, AllGather, ReduceScatter, Broadcast — used by every major deep learning framework (PyTorch, TensorFlow, JAX). NCCL automatically detects the network topology (NVLink, IB, RoCEv2) and selects the fastest algorithm.

NCCL AllReduce Algorithms

Ring AllReduce

Bandwidth Optimal

GPUs form a logical ring. Each GPU sends to the next, receiving and accumulating partial results, until all GPUs hold the full sum. Scales linearly with N GPUs.

  • Traffic: 2 × (N-1)/N × data size
  • Best for: large messages (>1 MB)
  • Default for multi-node IB/RoCEv2
  • Latency scales with ring size

Tree AllReduce

Latency Optimal

Binary tree reduction — GPUs reduce upward to root, then broadcast downward. Logarithmic latency scaling. Used for small messages or very large GPU counts.

  • Latency: O(log N) steps
  • Best for: small messages (<256 KB)
  • NCCL uses "double binary tree"
  • Lower BW efficiency than ring

CollNet (SHARP)

In-Network Compute

Delegates AllReduce to SHARP-capable switches (Quantum-2 / Spectrum-4). Aggregation happens in the network — GPUs send/receive only final results.

  • Enabled: NCCL_ALGO=CollNet
  • Requires SHARP throughout path
  • Roughly halves fabric traffic vs Ring — data crosses the network once
  • Best for large-scale (1000+ GPUs)
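
To see which algorithm and protocol NCCL actually picks at each message size, the nccl-tests benchmark is the usual tool. A sketch, assuming nccl-tests is already built and the node has 8 GPUs:

# Sweep AllReduce from 8 B to 8 GB, doubling each step, on 8 local GPUs
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=TUNING \
  ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 8

# Force a specific algorithm to compare against the auto-selected one
NCCL_ALGO=Tree ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 8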

NCCL Protocols & Key Environment Variables

Protocol Selection

Protocol | Description
LL (Low Latency) | Inline flag in data — lowest latency, limited BW. Small ops.
LL128 | 128-byte granularity LL. Better BW than LL, still low latency.
Simple | Bulk transfer — maximum bandwidth. Large ops (>1 MB). Default for IB.

Topology Detection

NCCL automatically detects intra-node topology (NVLink, PCIe) and inter-node fabric (IB, RoCEv2) on startup. Enable verbose output to see what it found:

# Show topology + algorithm choice
NCCL_DEBUG=INFO torchrun ...

# Force specific IB HCA
NCCL_IB_HCA=mlx5_0,mlx5_1

# Force RoCEv2 GID index
NCCL_IB_GID_INDEX=3

# Enable SHARP/CollNet
NCCL_ALGO=CollNet

NCCL Environment Variable Reference

Variable — Purpose & Key Values
NCCL_DEBUG — Logging level. INFO = topology + algo. TRACE = per-op verbose. Start here when NCCL hangs.
NCCL_DEBUG_SUBSYS — Filter by subsystem: ALL, INIT, GRAPH, TUNING, NET. Use INIT,NET for network debugging.
NCCL_IB_HCA — Comma-list of InfiniBand HCAs to use. E.g. mlx5_0,mlx5_2. Critical on multi-NIC nodes.
NCCL_IB_GID_INDEX — GID index for RoCEv2. Typically 3 for RoCEv2; 0 = RoCEv1/IB native.
NCCL_SOCKET_IFNAME — Ethernet interface for non-RDMA transport fallback. E.g. eth0.
NCCL_IB_DISABLE — 0 = use IB/RoCEv2 (default). 1 = force TCP fallback — useful for debugging.
NCCL_ALGO — Force algorithm: Ring, Tree, CollNet (SHARP). Default = auto-selected.
NCCL_PROTO — Force protocol: LL, LL128, Simple. Default = auto-selected per message size.
NCCL_TOPO_FILE — Path to XML topology file — override auto-detection. Useful for non-standard topologies.
NCCL_CROSS_NIC — 0 = same NIC for send/recv (default). 1 = allow different NICs for ring traversal.
SHARP_COLL_ENABLE_SAT — 1 = enable SHARP AllReduce via NCCL CollNet plugin.
🔧

NCCL Hang Diagnosis Flow

When a training job hangs and all GPUs show 0% utilization: (1) Set NCCL_DEBUG=INFO and rerun — check if all ranks initialized successfully; (2) Check IB port errors with perfquery — credit starvation on VL0 causes deadlocks; (3) Verify NCCL_IB_GID_INDEX=3 on RoCEv2 fabrics; (4) Check for firewall rules blocking UDP 4791 (RoCEv2); (5) As a diagnostic, set NCCL_IB_DISABLE=1 — if training resumes over TCP, the RDMA path is the problem.
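
The same flow as shell commands — a sketch only; the torchrun arguments, interface names, and the firewall check are placeholders to adapt to your launcher and fabric.

# (1) Rerun with NCCL logging to see which rank stalls during init
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET torchrun ...

# (2) Port state and error counters on the suspect node
ibstat                          # every port should be Active / LinkUp
perfquery -x                    # SymbolErrorCounter, PortRcvErrors, PortXmitDiscards

# (3) RoCEv2 checks: GID index and the UDP 4791 path
NCCL_IB_GID_INDEX=3 torchrun ...
iptables -L -n | grep 4791      # confirm RoCEv2 UDP traffic is not filtered

# (4) Isolate the RDMA path: if TCP works, the RDMA/fabric config is the problem
NCCL_IB_DISABLE=1 torchrun ...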


Fault Detection & Recovery
GPU hardware errors, network link failures, and job recovery — the runbook every AI cluster operator needs.

🚨 Xid Errors — GPU Driver Error Codes

Xid errors appear in dmesg, nvidia-smi, and DCGM. They are the GPU driver's error reporting mechanism — every Xid has a specific meaning. These are the ones tested on NCP-AIN:

Xid | Name | Severity | Meaning & Action
Xid 8 | GPU MMU fault | Critical | Memory management unit error — usually bad CUDA code (illegal access). Check application, may need GPU reset.
Xid 13 | Graphics engine exception | Warning | Graphics engine hit an exception. Common with compute+graphics mixed workloads.
Xid 31 | GPU memory page fault | Critical | GPU accessed unmapped memory — typically application bug. Check CUDA memcheck.
Xid 38 | Driver firmware error | Critical | Unrecoverable firmware fault. Requires GPU reset or node reboot. Investigate hardware.
Xid 43 | GPU stopped processing | Critical | GPU hung — no progress for watchdog timeout. Reset required. May indicate thermal or HW issue.
Xid 48 | Double-bit ECC error | Critical | Uncorrectable memory error (DBE). GPU VRAM defect. Node must be drained and GPU replaced.
Xid 63 | Row remapping pending | Warning | GPU remapping a failing memory row (single-bit ECC). GPU still operational. Monitor for Xid 48 progression.
Xid 79 | GPU has fallen off the bus | Critical | PCIe link lost. Check seating, PCIe slot, motherboard. Node reboot required.
Xid 92 | High single-bit ECC count | Warning | High volume of correctable ECC errors — early warning of memory degradation. Schedule GPU replacement.
🔴

Xid 48 (Double-Bit ECC) = Replace the GPU

A double-bit ECC (DBE) error is uncorrectable — the GPU cannot recover the corrupted data. Any node showing Xid 48 must be drained immediately: kubectl drain <node> or scontrol update NodeName=<node> State=DRAIN Reason="DBE ECC". The GPU should be physically replaced, not just reset.

🩺 DCGM Diagnostic Levels

Level 1 — Quick (~30s)

Checks: GPU init, PCIe bandwidth, memory bandwidth, power, clock. Run before every job. Catches dead GPUs, PCIe issues, power limit problems.

dcgmi diag -g 0 -r 1

Level 2 — Medium (~2 min)

All Level 1 checks plus: memory stress test, P2P bandwidth, NVLink checks. Run daily or after hardware changes. Good for post-maintenance validation.

dcgmi diag -g 0 -r 2

Level 3 — Long (~10 min)

Full burn-in: all Level 2 plus extended memory, compute, PCIe, NVLink stress. Run on new hardware commissioning or after suspected hardware fault.

dcgmi diag -g 0 -r 3

🔄 Fault Response Runbook

🔴 GPU Hardware Fault

1. DCGM alert or Xid 48/79 in dmesg
2. Drain node: kubectl drain <node> --ignore-daemonsets
3. Kill any remaining jobs on node
4. nvidia-smi -r — attempt GPU reset
5. If reset fails → power cycle node
6. Re-run dcgmi diag -r 3
7. If DBE ECC → raise hardware replacement ticket
8. Uncordon node after clean diag pass

🟠 IB Link / NCCL Hang

1. Job hangs — all ranks at 0% GPU util
2. NCCL_DEBUG=INFO — check which rank failed to initialize
3. ibstat on failed nodes — check port state (must be Active)
4. perfquery — look for rising SymbolErrors or PortRcvErrors
5. ibdiagnet -r — full fabric health sweep
6. If link down: reseat cable, check transceiver
7. Restart UFM SM sweep if routing stale
8. Re-submit job with healthy nodes

🟢 Thermal Throttling

1. DCGM alert: GPU temp ≥ 83°C
2. nvidia-smi -q -d PERFORMANCE — check P-state throttling
3. Check fan speed: nvidia-smi -q -d TEMPERATURE
4. Check datacenter ambient temp and airflow
5. Check for GPU dust blockage
6. Lower power limit temporarily: nvidia-smi -pl 600
7. Investigate cooling infrastructure
8. Do not run full training until resolved

📡 IB Port Error Counter Quick Reference

Counter | Meaning | Action
SymbolErrorCounter | Physical layer bit errors — cable/transceiver signal quality | Rising fast → replace cable or transceiver
PortRcvErrors | Incoming packets with errors (CRC, frame) | Check cable quality, retrain link
PortXmitDiscards | TX packets dropped (HOL blocking, buffer overflow) | Check congestion, PFC config, buffer sizes
LocalLinkIntegrityErrors | Local physical-error count exceeded the link-integrity threshold | Investigate cable/connector; rising = imminent failure
ExcessiveBufferOverrunErrors | Receiver buffer overflow — flow control not effective | Review PFC/credit headroom config; check MTU consistency
VL15Dropped | SM management packets dropped | SM overloaded or VL15 buffer too small; check sm_priority
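
A sketch for reading these counters in practice — ibqueryerrors sweeps the whole fabric and flags ports whose error counters exceed thresholds, while perfquery reads a single port. The LID and port number below are placeholders.

# Fabric-wide sweep: report ports with error counters above threshold
ibqueryerrors

# Read extended counters on one port (LID 42, port 1)
perfquery -x 42 1
perfquery -x -R 42 1            # -R resets the counters after reading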

Practice Quiz
10 exam-style questions covering orchestration, DCGM, NCCL, UFM, and fault detection.


Memory Hooks & Cluster Advisor
Exam mnemonics for every key concept, plus a guided advisor for common cluster operations questions.
DCGM = GPU's Doctor
What DCGM Does

DCGM monitors GPU health (temperature, ECC errors, power), runs diagnostics (Level 1/2/3), and exports metrics to Prometheus via DCGM Exporter. It's the single source of truth for GPU health — if DCGM says a GPU is sick, it's sick.

Xid 48 = Replace It
The Critical Xid Error

Xid 48 = Double-Bit ECC error (uncorrectable). Any node showing Xid 48 must be drained and the GPU physically replaced. No amount of reset or rebooting fixes VRAM cell failure.

MIG = Hard Wall
MIG vs MPS

MIG builds hardware walls between GPU partitions — dedicated SMs, VRAM, caches. MPS shares the GPU context in software — faster to set up, but one crash kills all tenants. Use MIG for multi-tenant production.

Ring = Big, Tree = Small
NCCL Algorithm Selection

Ring AllReduce is bandwidth-optimal for large messages (>1 MB). Tree AllReduce is latency-optimal for small messages (<256 KB). CollNet (SHARP) reduces traffic for both by doing aggregation in the network.

NCCL_DEBUG=INFO First
Debugging NCCL Issues

Whenever a distributed training job hangs or shows poor performance, the first step is always NCCL_DEBUG=INFO. It shows topology discovery, algorithm selection, and which ranks failed to initialize — usually points to the root cause immediately.

UFM = IB's Control Tower
UFM Role

UFM manages InfiniBand fabric just like an air traffic control tower: discovers all nodes, assigns routes, monitors congestion, fires alerts on link failures. It replaces and extends OpenSM with a REST API and web UI.

SM Util = GPU Working
Key DCGM Metric

DCGM_FI_DEV_GPU_UTIL (SM Utilization) is the primary indicator that a GPU is doing compute work. Target ≥ 80% during training. If SM util is low while a job is running, the GPU is starved — usually a data loading or communication bottleneck.

K8s GPU = device plugin
Kubernetes GPU Scheduling

Kubernetes doesn't know about GPUs natively. The NVIDIA GPU Device Plugin DaemonSet teaches kubelet to advertise nvidia.com/gpu resources. Without it, pods requesting GPUs will pend forever regardless of available hardware.

🤖 Cluster Ops Advisor

Select your situation for targeted guidance.


📊 DCGM & Prometheus Setup

  • Install DCGM on every GPU node: use the datacenter-gpu-manager package from the NVIDIA CUDA repositories, or the NGC container nvcr.io/nvidia/k8s/dcgm-exporter.
  • Deploy DCGM Exporter as a Kubernetes DaemonSet — it exposes GPU metrics on port 9400 at /metrics for Prometheus to scrape.
  • Add a Prometheus scrape job targeting 9400/metrics on all GPU nodes. Set scrape interval to 15s for real-time visibility.
  • Import the NVIDIA-provided DCGM Grafana dashboard (dashboard ID 12239 on grafana.com) — gives SM util, memory BW, temperature, power, and ECC error panels out of the box.
  • Set up alerting rules: SM util < 10% for >5 min during expected training = data bottleneck; GPU temp > 82°C = thermal alert; any DBE ECC = immediate page.
  • Run dcgmi diag -r 1 on all nodes before major training runs — prevents wasted GPU-hours on broken hardware.
  • For NVLink health, add the NVLink bandwidth field (DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL) to your DCGM field group and track it against expected bandwidth per GPU topology.

🔗 NCCL Hang & Performance Diagnosis

  • First step always: set NCCL_DEBUG=INFO and rerun — output shows topology, algorithm selection, and the exact rank that failed to initialize.
  • For IB fabric: check ibstat on all nodes — every port must be in State: Active, Physical state: LinkUp. Any port down = that rank will hang.
  • Run perfquery -x <lid> 1 on suspected ports — PortRcvErrors or SymbolErrorCounter rising confirms a bad link. Replace cable.
  • For RoCEv2/Spectrum-X: verify NCCL_IB_GID_INDEX=3 — GID index 0 routes as RoCEv1 and will fail on RoCEv2-only fabrics.
  • Poor AllReduce bandwidth: check NCCL_IB_HCA is set to all available NIC ports (e.g. mlx5_0,mlx5_2,mlx5_4,mlx5_6 for 4 NICs per node).
  • NCCL using Tree instead of Ring unexpectedly: the message size may be below the Ring threshold — force it with NCCL_ALGO=Ring, or increase the gradient bucket size so collectives carry larger messages.
  • For credit starvation (IB): check VL15Dropped counter with perfquery — rising VL15Dropped = SM management traffic being dropped, causing routing stalls.

🔪 MIG & MPS Configuration

  • Enable MIG mode on H100/A100: sudo nvidia-smi -mig 1 — requires GPU reset (terminates all running workloads on that GPU).
  • Create MIG instances: sudo nvidia-smi mig -cgi 1g.10gb -C creates one 1g.10gb slice plus its compute instance; repeat the profile comma-separated (up to 7×) to fill an H100 80GB. Check available profiles with nvidia-smi mig -lgip.
  • In Kubernetes: deploy the NVIDIA MIG Manager (nvidia-mig-manager DaemonSet) — it reads a ConfigMap specifying the desired MIG strategy (single or mixed) and applies it automatically.
  • Request a MIG slice in a pod: resources.limits: nvidia.com/mig-1g.10gb: 1 — the GPU device plugin advertises each slice as a separate schedulable resource.
  • For MPS: set CUDA_MPS_PIPE_DIR and start the MPS server with nvidia-cuda-mps-control -d. All CUDA processes launched after this share the MPS context — good for many small inference servers on one GPU.
  • MIG strategy "mixed" allows different slice sizes on the same GPU (e.g. 3g.40gb + 2g.20gb + 1g.10gb). Strategy "single" enforces uniform slices — simpler for scheduler integration.
  • Disable MIG to return to full-GPU mode: destroy all instances first (nvidia-smi mig -dci, nvidia-smi mig -dgi), then sudo nvidia-smi -mig 0.

🚨 GPU Fault Response

  • Check dmesg for Xid errors: dmesg | grep -i xid. Xid 48 (DBE ECC) = drain node immediately. Xid 79 (GPU off bus) = check PCIe seating, reboot.
  • Drain the node before doing anything else: kubectl drain <node> --ignore-daemonsets --delete-emptydir-data — prevents new pods from scheduling on a faulty node.
  • Verify ECC error counts: nvidia-smi -q -d ECC — look at "Volatile" and "Aggregate" DBE counts. Any DBE under "Volatile" means the error occurred this session.
  • Attempt GPU reset if no DBE: sudo nvidia-smi -r -i <gpu-id>. If reset fails, power cycle the node.
  • Re-validate with DCGM after reset: dcgmi diag -g <gpu-group> -r 2 — must pass before re-adding node to the cluster.
  • If the GPU keeps showing ECC errors after reset: schedule physical replacement. Run nvidia-smi --query-gpu=serial --format=csv to get the GPU serial number for the RMA ticket.
  • Uncordon the node only after a clean DCGM diagnostic pass: kubectl uncordon <node>.

NCP-AIN Series Complete 🎉

All 5 topics covered — you're ready to sit the exam. Review any topic or start a full practice exam.

Start Free on FlashGenius