NCP-AII · Topic 5 of 5

AI Software Stack & Cluster Deployment

Master CUDA, NGC containers, cluster orchestration, distributed training, and inference serving for NVIDIA AI infrastructure.

CUDA & Driver Stack · NGC Containers · SLURM & Kubernetes · Distributed Training · NCCL Collectives · Triton Inference Server
Key facts at a glance:
  • H100 compute capability: 9.0
  • MIG instances per H100: up to 7
  • Parallelism strategies covered: 4 (DDP, FSDP, tensor, pipeline)
  • Collective communications: NCCL
  • SLURM container plugin: Pyxis

AI Software Stack & Cluster Deployment

The NVIDIA AI software stack is a layered architecture from GPU silicon to containerized applications. Every NCP-AII certified professional must understand how CUDA, libraries, frameworks, containers, schedulers, and distributed training fit together to deploy AI at cluster scale.

🏗️ NVIDIA AI Software Stack — Full Layer View (top to bottom)
  • Applications / Models: LLMs, Vision, Speech, Recommender systems, Custom models
  • AI Frameworks: PyTorch · TensorFlow · JAX · PaddlePaddle
  • NVIDIA Libraries: cuDNN · NCCL · cuBLAS · cuSPARSE · RAPIDS · TensorRT · Triton
  • CUDA Toolkit: CUDA Runtime · CUDA Driver API · NVCC compiler · Profiler (Nsight)
  • NVIDIA Driver: Kernel module (nvidia.ko) · User-space libraries · DCGM agent
  • GPU Hardware: H100 / H200 / B200 · NVLink · HBM3 · MIG partitions

Each layer depends on the one below. CUDA toolkit version must be compatible with the installed driver. Frameworks call cuDNN/NCCL through CUDA. NGC containers package a consistent, validated snapshot of all layers above the driver.

⚡ CUDA Compute Capabilities
GPU   | Compute Capability
V100  | 7.0
A100  | 8.0
H100  | 9.0
B200  | 10.0
📦 NGC Container Benefits

Pre-built, GPU-optimized containers for PyTorch, TensorFlow, TensorRT, and more. Updated monthly. Validated on NVIDIA hardware. Eliminates dependency management — pull and run.

🔀 MIG on H100

Multi-Instance GPU: partition one H100 into up to 7 independent GPU instances, each with dedicated SMs, HBM slice, L2, and memory controllers. Enables multi-tenant inference workloads.

Core Theme: The NCP-AII software stack questions center on three areas: (1) CUDA/driver compatibility and MIG, (2) container + scheduler deployment patterns (SLURM with Pyxis, Kubernetes with GPU device plugin), and (3) distributed training collectives (NCCL all-reduce, ring vs tree algorithms) and parallelism strategies (DDP, FSDP, tensor, pipeline, 3D).

CUDA & the NVIDIA Software Stack

CUDA is the programming model and platform that exposes GPU parallelism to developers. Understanding CUDA's architecture, compute capabilities, driver/toolkit compatibility, and profiling tools is foundational.

🔧 CUDA Architecture Fundamentals

CUDA Toolkit vs CUDA Driver: Two distinct components. The driver (installed with the GPU driver package) is the low-level kernel interface. The toolkit (nvcc, runtime libraries) compiles and links CUDA code. The installed driver must support a CUDA version at least as new as the toolkit; nvidia-smi reports the maximum CUDA version the driver supports.

Compute Capability: A version number (major.minor) identifying GPU architecture features. Native code built for compute capability 9.0 (sm_90) runs only on 9.0 GPUs such as H100/H200; older GPUs cannot run it, and newer architectures need embedded PTX for JIT recompilation. Always compile for the target GPU's compute capability.

CUDA Runtime vs Driver API: Runtime API is higher-level (implicit context management). Driver API is lower-level (explicit control). Most frameworks use Runtime API.

# Check driver and CUDA toolkit versions
nvidia-smi
# Shows: Driver Version, CUDA Version (max supported)

nvcc --version
# Shows: installed CUDA toolkit version

# Check GPU compute capability
nvidia-smi --query-gpu=compute_cap \
  --format=csv,noheader

# Compile for H100 (sm_90)
nvcc -arch=sm_90 kernel.cu -o kernel
🔀 MIG — Multi-Instance GPU

MIG partitions a single physical GPU into multiple isolated GPU Instances (GIs), each with:

  • Dedicated SM partition (no sharing between instances)
  • Dedicated HBM memory slice
  • Dedicated L2 cache partition
  • Dedicated memory controllers and bandwidth
  • Full fault isolation — one instance's errors don't affect others

H100 MIG profiles: 1g.10gb (7×), 2g.20gb (3×), 3g.40gb (2×), 4g.40gb (1×), 7g.80gb (1× — full GPU). The "g" = GPC slices; "gb" = HBM allocation.

Use case: Running multiple independent inference workloads on one GPU; multi-tenant environments where isolation is required.

# Enable MIG mode on GPU 0
sudo nvidia-smi -i 0 \
  -mig 1

# List available MIG profiles
sudo nvidia-smi mig \
  -lgip

# Create 7 × 1g.10gb instances on GPU 0
sudo nvidia-smi mig \
  -cgi 19,19,19,19,19,19,19 -C

# List all MIG instances
sudo nvidia-smi mig -lgi

# Disable MIG mode
sudo nvidia-smi -i 0 -mig 0

Note: MIG mode requires a reboot on some systems. MIG instances cannot share NVLink — inter-instance communication goes through the host CPU/PCIe path.
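
Containers consume a specific MIG instance by pinning its device identifier. A minimal sketch, assuming the NVIDIA Container Toolkit is installed; the MIG UUID shown is a placeholder taken from nvidia-smi -L output:

# List physical GPUs and their MIG instance UUIDs
nvidia-smi -L

# Pin a container to one MIG instance via NVIDIA_VISIBLE_DEVICES (placeholder UUID)
docker run --rm --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
  nvcr.io/nvidia/pytorch:24.01-py3 nvidia-smi -L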

🔍 NVIDIA Profiling & Monitoring Tools
Tool               | Purpose                      | Key Use
nvidia-smi         | GPU management & monitoring  | Temperature, power, utilization, driver version, MIG management
DCGM               | Cluster-scale GPU health     | Health checks, power capping, telemetry export to Prometheus
Nsight Systems     | System-level profiling       | CPU/GPU timeline, CUDA API calls, NVLink/PCIe traffic
Nsight Compute     | Kernel-level profiling       | SM utilization, memory bandwidth, roofline analysis per kernel
Nsight DL Designer | DL model profiling           | Layer-by-layer performance in neural network graphs
NVTX               | Annotation SDK               | Mark application regions for Nsight Systems timeline correlation
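
A sketch of how these tools are typically invoked; train.py and the report names are placeholders:

# System-level timeline with Nsight Systems, then summarize it
nsys profile -o train_timeline python train.py
nsys stats train_timeline.nsys-rep

# Kernel-level profiling with Nsight Compute (full metric set)
ncu --set full -o kernel_report python train.py

# DCGM: list GPUs and run a quick health diagnostic
dcgmi discovery -l
dcgmi diag -r 1
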
📚 Key NVIDIA Libraries
Library            | Domain                 | Key Function
cuDNN              | Deep Learning          | GPU-accelerated convolutions, attention, normalization; used by PyTorch/TF backends
NCCL               | Collective Comms       | All-reduce, broadcast, scatter/gather across GPUs; core of distributed training
cuBLAS             | Linear Algebra         | Dense matrix multiply (GEMM); the inner loop of transformer attention and FFN layers
cuSPARSE           | Sparse Linear Algebra  | SpMM, SpMV for sparse model acceleration
TensorRT           | Inference Optimization | Layer fusion, precision calibration (FP8/INT8), engine serialization for deployment
RAPIDS cuDF / cuML | Data Science           | GPU-accelerated DataFrames and ML pipelines (ETL for training data)
cuFile (GDS)       | Storage I/O            | Direct NVMe→GPU DMA bypassing CPU; eliminates storage I/O bottleneck
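
One quick way to confirm which of these library versions ship inside an NGC container is PyTorch's own version reporting. A minimal sketch, reusing the container tag shown elsewhere on this page:

# Report the CUDA, cuDNN, and NCCL versions bundled in the container
docker run --rm --gpus all nvcr.io/nvidia/pytorch:24.01-py3 \
  python -c "import torch; print(torch.version.cuda, torch.backends.cudnn.version(), torch.cuda.nccl.version())"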

Containers & Cluster Orchestration

NGC containers package the full validated software stack. SLURM and Kubernetes are the two dominant schedulers for GPU cluster workloads — each with distinct GPU-aware tooling.

📦 NGC — NVIDIA GPU Cloud Containers

NGC provides pre-built, optimized containers from nvcr.io (NVIDIA Container Registry). Each container includes framework + CUDA toolkit + cuDNN + libraries, validated against specific GPU generations.

Key container families: PyTorch, TensorFlow, JAX, TensorRT, Triton Inference Server, NeMo (LLM training), RAPIDS, CUDA base images.

Versioning: Tagged as YY.MM (e.g., 24.01) indicating release month. Monthly releases track upstream framework + NVIDIA library updates.

# Pull NGC PyTorch container
docker pull \
  nvcr.io/nvidia/pytorch:24.01-py3

# Run with GPU access
docker run --gpus all \
  --rm -it \
  nvcr.io/nvidia/pytorch:24.01-py3

# Run with specific GPU + NVMe mount
docker run --gpus '"device=0,1"' \
  -v /lustre/data:/data \
  -v /lustre/checkpoints:/ckpt \
  nvcr.io/nvidia/pytorch:24.01-py3 \
  python train.py
⚙️ SLURM — GPU Cluster Job Scheduling

SLURM (Simple Linux Utility for Resource Management) is the dominant HPC/AI cluster scheduler. GPU resources are managed through GRES (Generic Resource Scheduling).

Pyxis: SLURM plugin (from NVIDIA) that integrates container execution natively into srun/sbatch. Replaces the need for docker run wrappers. Uses Enroot as the container runtime underneath.

Enroot: Lightweight container runtime that converts OCI images (Docker/NGC) to unprivileged sandboxes — optimized for HPC shared filesystems.

#!/bin/bash
#SBATCH --job-name=llm-train
#SBATCH --nodes=4
# One launcher task per node; torchrun spawns the 8 per-GPU processes
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --time=24:00:00
#SBATCH --partition=dgx-h100

# Rendezvous address = first node of the allocation
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_PORT=29500

# Launch with Pyxis (--container-image)
srun \
  --container-image=nvcr.io/nvidia/pytorch:24.01-py3 \
  --container-mounts=/lustre/data:/data \
  python -m torch.distributed.run \
    --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=$SLURM_NODEID \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train.py
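
Pyxis drives Enroot automatically, but the same steps can be run by hand. A sketch of the standalone Enroot workflow, assuming Enroot is installed; the image tag and mount paths are illustrative:

# Convert the NGC image to an unprivileged squashfs (no Docker daemon required)
enroot import docker://nvcr.io#nvidia/pytorch:24.01-py3

# Create a sandbox from the resulting .sqsh and run a command with a data mount
enroot create --name pytorch nvidia+pytorch+24.01-py3.sqsh
enroot start --mount /lustre/data:/data pytorch python train.py
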
☸️ Kubernetes — GPU-Aware Container Orchestration

NVIDIA GPU Device Plugin: Kubernetes DaemonSet that advertises GPUs as allocatable resources (nvidia.com/gpu). Required on every GPU node. Enables GPU requests in Pod specs.

MIG-aware Scheduling: The GPU device plugin supports MIG device advertisement. Pods can request specific MIG profiles (e.g., nvidia.com/mig-1g.10gb).

NVIDIA GPU Operator: Kubernetes operator that automates deployment of the GPU driver, device plugin, DCGM exporter, and MIG manager; the recommended production deployment path for GPU clusters on Kubernetes (a Helm install sketch follows the Pod example below).

Kubeflow: ML workflow orchestration on Kubernetes — manages training jobs (PyTorchJob, TFJob), hyperparameter tuning, and pipelines.

# Pod spec requesting 2 H100 GPUs
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3
    resources:
      limits:
        nvidia.com/gpu: 2
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all

---
# Request MIG instance instead:
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1
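
The GPU Operator referenced above is normally installed from NVIDIA's Helm repository. A minimal sketch; the release name, namespace, and node name are arbitrary:

# Add NVIDIA's Helm repo and install the GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# Confirm GPUs are advertised as allocatable resources
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu
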
📊 SLURM vs Kubernetes for GPU Workloads
Dimension         | SLURM                              | Kubernetes
Primary Use       | HPC / large-scale training         | Microservices / inference serving
Container Runtime | Enroot + Pyxis plugin              | containerd / Docker + GPU device plugin
Job Types         | Batch (sbatch), interactive (srun) | Pods, Jobs, CronJobs, custom CRDs
GPU Scheduling    | GRES (--gres=gpu:N)                | Resource requests (nvidia.com/gpu: N)
MIG Support       | Via GRES naming convention         | MIG device plugin profiles
Multi-node Jobs   | Native (--nodes, --ntasks)         | Requires PyTorchJob / MPI Operator
Adoption          | Dominant in HPC / supercomputers   | Dominant in cloud-native / MLOps

Distributed Training

Training large AI models requires distributing computation across multiple GPUs and nodes. NCCL handles the collective communications; DDP, FSDP, tensor, and pipeline parallelism handle different ways to split the model and data.

📡 NCCL — NVIDIA Collective Communications Library

NCCL implements optimized GPU-to-GPU collective operations — the backbone of all distributed deep learning. It automatically selects the best transport: NVLink (intra-node), InfiniBand RDMA (inter-node), or Ethernet.

Key collectives:

  • All-Reduce: Sum (or mean) a tensor across all GPUs, distribute result back to all. Used in DDP gradient synchronization.
  • All-Gather: Each GPU contributes a shard; all GPUs receive the concatenated result. Used in FSDP parameter reconstruction.
  • Reduce-Scatter: Sum across GPUs, then scatter one shard to each. Used in FSDP gradient reduction.
  • Broadcast: One GPU sends a tensor to all others.

Ring algorithm: NCCL's default all-reduce topology. Each GPU exchanges data only with its ring neighbors; per-GPU traffic is 2(N-1)/N times the message size, so it stays near twice the message regardless of GPU count, making the ring bandwidth-optimal. NCCL can switch to a tree algorithm for small messages where latency dominates. Well suited to NVLink rings.

# NCCL environment variables for tuning

# Use InfiniBand interface hca0
export NCCL_IB_HCA=mlx5_0

# Specify network interface for socket fallback
export NCCL_SOCKET_IFNAME=ib0

# Enable NCCL debug logging
export NCCL_DEBUG=INFO

# Disable IB (force Ethernet)
export NCCL_IB_DISABLE=1

# Restrict peer-to-peer transfers to NVLink paths
export NCCL_P2P_LEVEL=NVL

# Verify NCCL is using NVLink
NCCL_DEBUG=INFO python train.py \
  2>&1 | grep -i nvlink
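
To verify that collectives actually reach NVLink/InfiniBand line rate, the nccl-tests benchmarks (github.com/NVIDIA/nccl-tests) are commonly used. A sketch, assuming the benchmarks have been built; host names are illustrative:

# Single node, 8 GPUs: sweep all-reduce message sizes from 8 B to 4 GB
./build/all_reduce_perf -b 8 -e 4G -f 2 -g 8

# Multi-node via MPI: 4 nodes x 8 GPUs, one rank per GPU
mpirun -np 32 -H node1:8,node2:8,node3:8,node4:8 \
  ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
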
🔀 Parallelism Strategies
Data Parallelism (DDP)
torch.nn.parallel.DistributedDataParallel
Full model replica on each GPU. Each GPU processes a different mini-batch. Gradients all-reduced after backward pass. Simplest approach — works when model fits in one GPU's HBM.
Fully Sharded (FSDP)
torch.distributed.fsdp.FullyShardedDataParallel
Shards model parameters, gradients, and optimizer states across GPUs. All-gather before forward, reduce-scatter after backward. Enables training models far larger than single-GPU HBM. LLaMA 70B: ~700 GB total → 87.5 GB/GPU across 8.
Tensor Parallelism (TP)
Megatron-LM, NeMo
Splits individual layer weight matrices across GPUs. Column-parallel and row-parallel linear layers. Requires all-reduce within each layer — needs fast NVLink intra-node bandwidth. Typically used within a node.
Pipeline Parallelism (PP)
GPipe, PipeDream, Megatron-LM
Assigns consecutive transformer layers to different GPUs (stages). Micro-batches flow through stages. Reduces inter-GPU communication to activations only (no gradient all-reduce across stages). Works well across nodes over IB.

3D Parallelism combines all three: Tensor Parallelism (intra-node, NVLink), Pipeline Parallelism (inter-node, InfiniBand), and Data Parallelism (across TP×PP groups). Used for frontier-scale LLM training (GPT-4 class models) on thousands of GPUs.
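
As a concrete sizing example, 512 GPUs could be laid out as TP=8 (within each NVLink node) × PP=8 × DP=8. A hedged sketch using Megatron-LM style flags; the script name is Megatron-LM's and the remaining model/data arguments are omitted:

# 64 nodes x 8 GPUs = 512 ranks; DP size = 512 / (TP 8 x PP 8) = 8
# (rendezvous settings and model/data arguments omitted)
torchrun --nnodes=64 --nproc_per_node=8 pretrain_gpt.py \
  --tensor-model-parallel-size 8 \
  --pipeline-model-parallel-size 8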

🚀 Launching Distributed Training
# Single-node, 8-GPU DDP with torchrun
torchrun \
  --nproc_per_node=8 \
  train_ddp.py \
  --batch-size 256

# Multi-node (4 nodes, 8 GPUs each = 32 total) via SLURM+Pyxis
# Assumes the job was allocated with --nodes=4 --ntasks-per-node=1 and that
# MASTER_ADDR is exported (e.g. the first host in $SLURM_JOB_NODELIST)
srun \
  --container-image=nvcr.io/nvidia/pytorch:24.01-py3 \
  --container-mounts=/lustre:/data \
  torchrun \
    --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=$SLURM_NODEID \
    --master_addr=$MASTER_ADDR \
    --master_port=29500 \
  train_ddp.py

# FSDP wrapping in Python
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# transformer_auto_wrap_policy needs the transformer block class(es) to wrap
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerBlock},  # your model's block class
)

model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
    device_id=torch.cuda.current_device(),
)
📊 Parallelism Strategy Selection Guide
Model Size             | GPU Count   | Recommended Strategy      | Key Constraint
<80 GB (fits 1× H100)  | 1–8         | DDP                       | Batch size / throughput
80–640 GB              | 8–64        | FSDP                      | HBM per GPU
640 GB – few TB        | 64–256      | FSDP + Tensor Parallelism | NVLink bandwidth
Multi-TB (GPT-4 class) | 256–10,000+ | 3D Parallelism (TP+PP+DP) | IB bandwidth, pipeline bubbles

Inference Serving & MLOps

Getting trained models into production requires optimized inference engines, serving infrastructure, and monitoring. NVIDIA's inference stack centers on TensorRT, Triton Inference Server, and NIM.

⚡ TensorRT — Inference Optimization

TensorRT takes a trained model (ONNX, PyTorch, TensorFlow) and produces an optimized serialized engine for a specific GPU target. Optimizations include:

  • Layer fusion: Combines sequential ops (e.g., Conv + BN + ReLU → single kernel)
  • Precision calibration: FP32 → FP16 → INT8 → FP8 with calibration dataset
  • Kernel auto-selection: Benchmarks alternative CUDA kernels, picks fastest for target GPU
  • Memory optimization: Reuses buffers across layers to minimize HBM allocation

Engine files are GPU-specific: An engine built for H100 (SM 9.0) cannot run on A100 (SM 8.0). Always build on the deployment target GPU.

# Export PyTorch model to ONNX
torch.onnx.export(model, dummy_input,
  "model.onnx",
  opset_version=17,
  input_names=["input"],
  output_names=["output"],
  dynamic_axes={"input": {0: "batch"}})

# Build TensorRT engine (FP16)
trtexec \
  --onnx=model.onnx \
  --fp16 \
  --saveEngine=model_fp16.engine \
  --minShapes=input:1x3x224x224 \
  --optShapes=input:32x3x224x224 \
  --maxShapes=input:64x3x224x224

# Run benchmark (explicit-batch ONNX engines take --shapes rather than --batch)
trtexec \
  --loadEngine=model_fp16.engine \
  --shapes=input:32x3x224x224 \
  --iterations=100
🚀 Triton Inference Server

Triton (not to be confused with the Triton kernel language) is NVIDIA's production inference serving system. It serves models from multiple frameworks simultaneously and exposes HTTP and gRPC endpoints.

Key features:

  • Dynamic batching: Automatically groups requests from multiple clients into batches for higher GPU utilization
  • Concurrent model execution: Multiple models run simultaneously on different GPU contexts
  • Model ensembles: Chain multiple models in a DAG (e.g., pre-process → model → post-process) as a single endpoint
  • Backends: TensorRT, ONNX Runtime, PyTorch (LibTorch), TensorFlow, Python, FIL (forests)
  • Metrics: Prometheus-compatible endpoint for latency, throughput, queue depth per model
# Model repository structure
models/
  resnet50/
    config.pbtxt        # model config
    1/
      model.plan        # TensorRT engine
  bert-large/
    config.pbtxt
    1/
      model.onnx

# Launch Triton server
# Ports: 8000 = HTTP, 8001 = gRPC, 8002 = Prometheus metrics
docker run --gpus all \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  -v $(pwd)/models:/models \
  nvcr.io/nvidia/tritonserver:24.01-py3 \
  tritonserver \
    --model-repository=/models

# Health check
curl localhost:8000/v2/health/ready
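
Dynamic batching is enabled per model in its config.pbtxt. A minimal sketch for the resnet50 TensorRT model shown above; the preferred batch sizes and queue delay are illustrative values:

# Enable dynamic batching for the resnet50 model
cat > models/resnet50/config.pbtxt <<'EOF'
name: "resnet50"
platform: "tensorrt_plan"
max_batch_size: 64
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 100
}
EOF
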
🧩 NIM — NVIDIA Inference Microservices

NIM packages optimized models (LLMs, vision, speech) as containerized microservices with pre-built TensorRT engines. A single docker run deploys a production-ready inference endpoint — no manual TensorRT engine building required.

What's inside a NIM container: Triton Inference Server + pre-built TensorRT-LLM engines + model weights + API server (OpenAI-compatible REST API). Supports continuous batching for LLMs (token-by-token generation with dynamic slot allocation).

TensorRT-LLM: NVIDIA's LLM-specific inference library. Implements paged KV cache, in-flight batching (continuous batching), speculative decoding, and FP8 quantization for H100. Substantially higher token throughput vs vanilla PyTorch inference.

# Deploy Llama 3 70B NIM
docker run --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -p 8000:8000 \
  -v /local/nim/cache:/opt/nim/.cache \
  nvcr.io/nim/meta/llama3-70b-instruct:latest

# Inference via OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama3-70b-instruct",
    "messages": [{"role":"user","content":"Hello"}],
    "max_tokens": 200
  }'
📊 MLOps — Model Lifecycle on GPU Clusters
Phase                              | Tool / Service                        | NVIDIA Integration
Experiment tracking                | MLflow, Weights & Biases (W&B)        | Auto-logs GPU metrics, loss curves, hyperparams
Distributed training orchestration | SLURM + Pyxis, Kubeflow, Ray Train    | Native GPU scheduling; NCCL for collectives
Model registry                     | MLflow Registry, NGC Private Registry | Store TensorRT engines and NIM-ready checkpoints
Hyperparameter tuning              | Optuna, Ray Tune, NNI                 | Runs on GPU cluster; early stopping via DCGM metrics
Inference optimization             | TensorRT, TensorRT-LLM                | Target GPU SM architecture; FP8 calibration on H100
Serving                            | Triton Inference Server, NIM          | Dynamic batching; Prometheus/Grafana monitoring
Cluster monitoring                 | DCGM + Prometheus + Grafana           | GPU health, power, thermal, utilization dashboards
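
The DCGM-to-Prometheus path in the table above is typically the dcgm-exporter. A sketch of running it standalone; the image tag is illustrative and 9400 is its default metrics port:

# Run dcgm-exporter and scrape its Prometheus endpoint
docker run -d --rm --gpus all -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:latest

curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL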

Practice Quiz

10 questions covering CUDA, MIG, containers, SLURM, NCCL, distributed training, and inference serving. Each set of answer choices is followed by an explanation of the correct answer.

Question 1 of 10
Which NVIDIA library provides GPU-accelerated collective communications (all-reduce, all-gather) for distributed training?
A. cuBLAS
B. NCCL
C. cuDNN
D. TensorRT
NCCL (NVIDIA Collective Communications Library) implements all-reduce, all-gather, reduce-scatter, broadcast, and scatter/gather across GPUs — the backbone of all distributed deep learning. cuBLAS handles dense matrix multiply (GEMM). cuDNN handles DL primitives like convolutions. TensorRT is an inference optimizer.
Question 2 of 10
What is the CUDA compute capability of the NVIDIA H100 GPU?
A. 9.0
B. 8.0
C. 7.0
D. 10.0
H100 = compute capability 9.0. The progression: V100 = 7.0, A100 = 8.0, H100 = 9.0, B200 = 10.0. Compute capability determines which CUDA features and Tensor Core generations are available. Code must be compiled for the correct SM architecture (e.g., -arch=sm_90 for H100).
Question 3 of 10
Which SLURM plugin allows running NGC containers natively in sbatch/srun jobs without requiring Docker daemon access?
A. DCGM
B. MIG Manager
C. Pyxis
D. Kubeflow
Pyxis is the NVIDIA SLURM plugin that integrates container execution via --container-image in sbatch/srun. It uses Enroot as the underlying container runtime — converting Docker/OCI images to unprivileged sandboxes optimized for HPC shared filesystems. DCGM is for GPU monitoring. Kubeflow runs on Kubernetes, not SLURM.
Question 4 of 10
Which distributed training strategy shards model parameters, gradients, AND optimizer states across GPUs to reduce per-GPU memory?
A. DDP (DistributedDataParallel)
B. FSDP (FullyShardedDataParallel)
C. Pipeline Parallelism
D. Model Replication
FSDP shards parameters + gradients + optimizer states. Each GPU holds only 1/N of each — enabling training of models far larger than a single GPU's HBM. DDP keeps a full model copy on every GPU. Pipeline Parallelism assigns different layers to different GPUs but doesn't shard within a layer. For LLaMA 70B: ~700 GB total / 8 GPUs = ~87.5 GB/rank with FSDP.
Question 5 of 10
What is the maximum number of MIG instances that can be created on a single H100 GPU using the smallest profile?
A. 7
B. 4
C. 8
D. 16
H100 supports up to 7 MIG instances using the smallest profile (1g.10gb — 1 GPC slice, 10 GB HBM). Each instance gets dedicated SMs, HBM slice, L2, and memory controllers — full hardware isolation. The H100 has 7 GPC slices available for MIG. A100 also supports 7 MIG instances. MIG instances cannot communicate via NVLink; they must go through the host.
Question 6 of 10
Triton Inference Server's dynamic batching feature is primarily used to:
A. Compress model weights to INT8 precision automatically
B. Distribute a single model across multiple GPU nodes
C. Group concurrent client requests into batches to improve GPU utilization
D. Automatically select the optimal parallelism strategy for training
Triton's dynamic batching collects inference requests arriving from multiple clients within a configurable time window and groups them into a single batch before sending to the GPU — dramatically improving GPU utilization for low-to-medium traffic scenarios where individual request batch size is 1. This converts many small, GPU-inefficient requests into a single large, efficient batch execution.
Question 7 of 10
Which NVIDIA tool provides cluster-scale GPU health monitoring, power capping, and Prometheus-compatible telemetry export?
A. Nsight Systems
B. DCGM
C. nvidia-smi
D. TensorBoard
DCGM (Data Center GPU Manager) is the cluster-scale GPU management solution — health checks, power capping, diagnostic tests, policy enforcement, and Prometheus metrics export. nvidia-smi is per-node; useful for manual inspection but not cluster-scale. Nsight Systems is a developer profiling tool, not a production monitoring system. TensorBoard is for training metrics, not GPU infrastructure.
Question 8 of 10
NGC containers are tagged with a YY.MM versioning scheme (e.g., 24.01). What does this tag represent?
A. The year and month of the NGC container release
B. The CUDA toolkit version packaged inside
C. The minimum GPU driver version required
D. The compute capability of the target GPU
NGC container tags follow YY.MM format — year and month of the release (e.g., 24.01 = January 2024). Each monthly release includes updated CUDA toolkit, cuDNN, framework versions, and NVIDIA library updates, validated together on NVIDIA hardware. The CUDA version inside is specified separately in the tag suffix or documentation, not the YY.MM number itself.
Question 9 of 10
In 3D parallelism, which dimension splits individual layer weight matrices across GPUs within a single node?
A. Pipeline Parallelism
B. Data Parallelism
C. Tensor Parallelism
D. Gradient Checkpointing
Tensor Parallelism (TP) splits individual weight matrices (e.g., the Q, K, V projections in attention) column-wise and row-wise across GPUs within a node. It requires an all-reduce after each matrix multiply — needs fast NVLink intra-node bandwidth. Pipeline Parallelism assigns whole layers to different GPUs. Data Parallelism replicates the model with different data shards. Gradient checkpointing is a memory technique, not a parallelism strategy.
Question 10 of 10
Which environment variable configures NCCL to use a specific InfiniBand HCA for inter-node collective communications?
A. NCCL_SOCKET_IFNAME
B. NCCL_IB_HCA
C. NCCL_P2P_LEVEL
D. NCCL_IB_DISABLE
NCCL_IB_HCA specifies which InfiniBand HCA (Host Channel Adapter) to use — e.g., export NCCL_IB_HCA=mlx5_0. NCCL_SOCKET_IFNAME specifies the Ethernet interface for TCP/socket fallback. NCCL_P2P_LEVEL controls the peer-to-peer transport hierarchy (NVL, PXB, SYS). NCCL_IB_DISABLE=1 forces Ethernet, disabling InfiniBand entirely.

Memory Hooks & Advisor

Mnemonics, patterns, and quick-reference guidance for the most exam-critical software stack and distributed training concepts.

🏗️
Stack Order (Bottom Up)
GPU → Driver → CUDA → Libraries (cuDNN/NCCL/cuBLAS) → Framework (PyTorch) → NGC Container → App. Each layer depends on the one below. The driver must support a CUDA version at least as new as the toolkit.
"Hardware Drives CUDA's Libraries, Frameworks Navigate Containers to Apps"
🔢
Compute Capabilities
V100=7.0, A100=8.0, H100=9.0, B200=10.0. Each major version = new architecture. H100's SM 9.0 unlocks FP8 Tensor Cores and Transformer Engine. Compile with -arch=sm_90 for H100.
"7-8-9-10: Volta, Ampere, Hopper, Blackwell"
🔀
MIG = 7 Max on H100
One H100 → up to 7 isolated GPU instances (1g.10gb each). Each gets dedicated SMs, HBM, L2, memory controllers. No NVLink between instances. Used for multi-tenant inference.
"7 slices of the GPU pie — each fully isolated"
📡
NCCL All-Reduce = Ring
All-reduce sums gradients across GPUs. Ring algorithm: each GPU exchanges data with its ring neighbors; per-GPU traffic stays near twice the message size regardless of GPU count, which is bandwidth-optimal. Preferred for NVLink rings. All-gather + Reduce-Scatter = the FSDP pattern.
"Ring-around the GPUs, gradients all sync up"
🔀
DDP vs FSDP
DDP: full model copy per GPU, all-reduce gradients. Works when model fits in one GPU's HBM. FSDP: shards params+grads+optimizer states — model can be 8× larger than one GPU's HBM. LLaMA 70B requires FSDP (or bigger).
"DDP = copies; FSDP = shards — pick by model size"
📦
Pyxis = SLURM + Containers
Pyxis is the NVIDIA plugin that adds --container-image to sbatch/srun. Uses Enroot runtime underneath. Enables GPU cluster users to run NGC containers without Docker daemon access. Essential for HPC+AI.
"Pyxis = SLURM's container bridge to NGC"
🚀
Triton Dynamic Batching
Triton groups concurrent single-request inference calls into batches before GPU execution. Turns N × (batch=1) requests into one (batch=N) GPU call. Critical for maximizing GPU utilization in production inference.
"Triton batches stragglers — no GPU sits idle"
🧩
3D Parallelism Mapping
TP (within node, NVLink) + PP (across nodes, IB) + DP (across TP×PP groups). Tensor splits within layers (width), Pipeline splits depth (layer stages), Data replicates the whole TP×PP group. All three combined for trillion-parameter models.
"TP=width, PP=depth, DP=replicas"
🃏 Flashcards

Concept: H100 CUDA Compute Capability
Answer: 9.0 (V100=7.0 · A100=8.0 · H100=9.0 · B200=10.0)

Concept: Max MIG Instances on H100
Answer: 7 instances (1g.10gb profile), each with dedicated SMs, HBM, L2, and memory controllers

Library: What does NCCL provide?
Answer: GPU collective communications (all-reduce · all-gather · reduce-scatter · broadcast)

Tool: SLURM plugin for NGC containers
Answer: Pyxis (uses the Enroot runtime); adds --container-image to sbatch/srun

Strategy: DDP vs FSDP, key difference
Answer: DDP keeps a full model copy per GPU; FSDP shards params + grads + optimizer states, so it fits much larger models

Tool: Triton Inference Server, dynamic batching
Answer: Groups concurrent client requests into batches before GPU execution, maximizing GPU utilization at inference time

Variable: NCCL env var for InfiniBand HCA
Answer: NCCL_IB_HCA=mlx5_0 (vs NCCL_SOCKET_IFNAME for Ethernet fallback)

Concept: 3D Parallelism dimensions
Answer: Tensor (intra-node, NVLink) + Pipeline (inter-node, IB) + Data (across TP×PP groups)
🤖 Expert Advisor: Quick Reference by Category

⚡ CUDA & MIG

  • Compute capability identifies GPU architecture features: V100=7.0, A100=8.0, H100=9.0, B200=10.0. Always compile for the deployment GPU's SM version (-arch=sm_90 for H100).
  • CUDA driver version must be ≥ CUDA toolkit version. The nvidia-smi "CUDA Version" shows the max driver-supported toolkit version — not what's installed.
  • MIG partitions one physical GPU into up to 7 isolated instances (H100: 1g.10gb × 7). Each instance has dedicated SMs, HBM, L2, and memory controllers — full fault isolation.
  • MIG instances cannot communicate via NVLink. Inter-instance traffic routes through host CPU/PCIe. Use MIG for multi-tenant inference, not distributed training.
  • Enable MIG with sudo nvidia-smi -i 0 -mig 1. Create instances with sudo nvidia-smi mig -cgi <profile_id> -C. List profiles with sudo nvidia-smi mig -lgip.
  • DCGM supports MIG-aware health monitoring — each MIG instance reported independently. Kubernetes GPU device plugin also exposes MIG instances as schedulable resources.

📦 Containers & Schedulers

  • NGC containers at nvcr.io package framework + CUDA + cuDNN + libraries validated together. Tagged as YY.MM (year.month). Pull exact tag matching your GPU driver version support.
  • SLURM + Pyxis: add --container-image=nvcr.io/nvidia/pytorch:24.01-py3 and --container-mounts=/lustre:/data to sbatch/srun. Pyxis handles image conversion via Enroot at job start.
  • SLURM GPU allocation: #SBATCH --gres=gpu:8 (8 GPUs per node). Use --gpus-per-task=1 for fine-grained per-rank GPU assignment.
  • Kubernetes GPU device plugin: DaemonSet that advertises nvidia.com/gpu as resource. Request GPUs with resources.limits["nvidia.com/gpu"]: 2 in Pod spec.
  • NVIDIA GPU Operator: single Helm chart that deploys driver, device plugin, DCGM exporter, MIG manager on Kubernetes GPU nodes; the preferred production deployment method.
  • Enroot: converts Docker/OCI images to unprivileged squashfs containers. Works without root daemon — compatible with HPC multi-user systems. Pyxis = SLURM integration layer on top of Enroot.

🔀 Distributed Training

  • NCCL all-reduce uses ring algorithm by default: O(N) data, O(1) bandwidth per GPU regardless of GPU count. Alternative: tree algorithm (lower latency for small messages).
  • DDP (torch.nn.parallel.DistributedDataParallel): full model copy per GPU, data sharded, gradients all-reduced after backward. Simplest — use when model fits in one GPU HBM.
  • FSDP (FullyShardedDataParallel): shards params+grads+optimizer states. Uses all-gather before forward (reconstruct params), reduce-scatter after backward (accumulate grad shards). LLaMA 70B: 700 GB ÷ 8 = ~87.5 GB/rank.
  • Tensor Parallelism: splits weight matrices within layers across GPUs (Megatron-style). Needs NVLink bandwidth; typically intra-node only. All-reduce required after each matmul.
  • Pipeline Parallelism: assigns consecutive transformer layers to different GPU stages. Micro-batches flow through the pipeline. Communication = activations only. Suitable for inter-node (IB). Introduces pipeline "bubble" idle time.
  • 3D Parallelism: TP (within node, NVLink) × PP (between nodes, IB) × DP (across pipeline+tensor groups). Used for trillion-parameter frontier model training on thousands of GPUs.

🚀 Inference & Triton

  • TensorRT engine files are GPU-specific — an engine built for H100 (sm_90) will not run on A100 (sm_80). Always build TensorRT engines on the production target hardware.
  • TensorRT workflow: export PyTorch/TF model → ONNX → trtexec (or Python API) → .engine file. Specify precision (--fp16, --int8) and dynamic shape ranges at build time.
  • Triton backends: TensorRT (.plan), ONNX Runtime (.onnx), LibTorch (.pt), TensorFlow SavedModel, Python (custom logic), FIL (forest models).
  • Triton dynamic batching: groups requests arriving within a time window into a single batch. Configure with max_queue_delay_microseconds and preferred_batch_size in config.pbtxt.
  • NIM (NVIDIA Inference Microservices): pre-built TensorRT-LLM engines + Triton + OpenAI-compatible API in a single container. Supports continuous batching for LLMs. Deploy with docker run, connect to NGC API key for model weights.
  • TensorRT-LLM features for H100: paged KV cache (handles variable-length sequences efficiently), in-flight batching (continuous batching), speculative decoding, FP8 quantization via Transformer Engine.

📊 Cluster MLOps

  • DCGM (Data Center GPU Manager): cluster-scale GPU health monitoring. Exports metrics to Prometheus via the dcgm-exporter sidecar. Supports health checks, power capping (dcgmi config), and field-group telemetry.
  • Experiment tracking: W&B and MLflow auto-log GPU metrics (utilization, temperature, memory) alongside training loss, learning rate, and custom metrics. Essential for reproducibility.
  • Checkpoint strategy: save to local NVMe (fast) via GDS, then async copy to Lustre/WEKA (shared), then optionally to S3 (archival). Use FSDP state_dict_type=SHARDED_STATE_DICT for parallel save.
  • Model registry: NGC Private Registry stores team-internal containers and model artifacts. TensorRT engines stored with GPU-version metadata — prevents deploying wrong-arch engine.
  • Power capping at cluster scale: dcgmi config -g <group> --set -P <power_limit_W>. Useful for shared cluster fairness and preventing PSU overload during simultaneous GPU boost.
  • SLURM accounting: sacct -j <jobid> --format=JobID,GPUUtil,MaxRSS,Elapsed. Track GPU utilization per job for cluster efficiency auditing — identify underutilized GPU allocations.