NCP-AII · Topic 5 of 5

AI Software Stack & Cluster Deployment

Master CUDA, NGC containers, cluster orchestration, distributed training, and inference serving for NVIDIA AI infrastructure.

CUDA & Driver Stack · NGC Containers · SLURM & Kubernetes · Distributed Training · NCCL Collectives · Triton Inference Server
Key facts at a glance:
  • H100 compute capability: 9.0
  • MIG instances per H100: up to 7
  • Parallelism strategies covered: 4 (DDP, FSDP, tensor, pipeline)
  • Collective communications: NCCL
  • SLURM container plugin: Pyxis

AI Software Stack & Cluster Deployment

The NVIDIA AI software stack is a layered architecture from GPU silicon to containerized applications. Every NCP-AII certified professional must understand how CUDA, libraries, frameworks, containers, schedulers, and distributed training fit together to deploy AI at cluster scale.

🏗️ NVIDIA AI Software Stack — Full Layer View (top to bottom)
  • Applications / Models: LLMs, Vision, Speech, Recommender systems, Custom models
  • AI Frameworks: PyTorch · TensorFlow · JAX · PaddlePaddle
  • NVIDIA Libraries: cuDNN · NCCL · cuBLAS · cuSPARSE · RAPIDS · TensorRT · Triton
  • CUDA Toolkit: CUDA Runtime · CUDA Driver API · NVCC compiler · Profiler (Nsight)
  • NVIDIA Driver: Kernel module (nvidia.ko) · User-space libraries · DCGM agent
  • GPU Hardware: H100 / H200 / B200 · NVLink · HBM3 · MIG partitions

Each layer depends on the one below. CUDA toolkit version must be compatible with the installed driver. Frameworks call cuDNN/NCCL through CUDA. NGC containers package a consistent, validated snapshot of all layers above the driver.

⚡ CUDA Compute Capabilities
GPU   | Compute Capability
V100  | 7.0
A100  | 8.0
H100  | 9.0
B200  | 10.0
📦 NGC Container Benefits

Pre-built, GPU-optimized containers for PyTorch, TensorFlow, TensorRT, and more. Updated monthly. Validated on NVIDIA hardware. Eliminates dependency management — pull and run.

🔀 MIG on H100

Multi-Instance GPU: partition one H100 into up to 7 independent GPU instances, each with dedicated SMs, HBM slice, L2, and memory controllers. Enables multi-tenant inference workloads.

Core Theme: The NCP-AII software stack questions center on three areas: (1) CUDA/driver compatibility and MIG, (2) container + scheduler deployment patterns (SLURM with Pyxis, Kubernetes with GPU device plugin), and (3) distributed training collectives (NCCL all-reduce, ring vs tree algorithms) and parallelism strategies (DDP, FSDP, tensor, pipeline, 3D).

CUDA & the NVIDIA Software Stack

CUDA is the programming model and platform that exposes GPU parallelism to developers. Understanding CUDA's architecture, compute capabilities, driver/toolkit compatibility, and profiling tools is foundational.

🔧 CUDA Architecture Fundamentals

CUDA Toolkit vs CUDA Driver: Two distinct components. The driver (installed with the GPU driver package) is the low-level kernel interface. The toolkit (nvcc, runtime libraries) compiles and links CUDA code. The installed driver must support a CUDA version at least as new as the toolkit; nvidia-smi reports the maximum CUDA version the driver supports.

Compute Capability: A version number (major.minor) identifying GPU architecture features. Native code built for compute capability 9.0 (sm_90) runs only on 9.0 GPUs such as H100/H200; older GPUs cannot run it, and newer architectures need embedded PTX for JIT recompilation. Always compile for the target GPU's compute capability.

CUDA Runtime vs Driver API: Runtime API is higher-level (implicit context management). Driver API is lower-level (explicit control). Most frameworks use Runtime API.

# Check driver and CUDA toolkit versions
nvidia-smi
# Shows: Driver Version, CUDA Version (max supported)

nvcc --version
# Shows: installed CUDA toolkit version

# Check GPU compute capability
nvidia-smi --query-gpu=compute_cap \
  --format=csv,noheader

# Compile for H100 (sm_90)
nvcc -arch=sm_90 kernel.cu -o kernel
🔀 MIG — Multi-Instance GPU

MIG partitions a single physical GPU into multiple isolated GPU Instances (GIs), each with:

  • Dedicated SM partition (no sharing between instances)
  • Dedicated HBM memory slice
  • Dedicated L2 cache partition
  • Dedicated memory controllers and bandwidth
  • Full fault isolation — one instance's errors don't affect others

H100 MIG profiles: 1g.10gb (7×), 2g.20gb (3×), 3g.40gb (2×), 4g.40gb (1×), 7g.80gb (1× — full GPU). The "g" = GPC slices; "gb" = HBM allocation.

Use case: Running multiple independent inference workloads on one GPU; multi-tenant environments where isolation is required.

# Enable MIG mode on GPU 0
sudo nvidia-smi -i 0 \
  -mig 1

# List available MIG profiles
sudo nvidia-smi mig \
  -lgip

# Create 7 × 1g.10gb instances on GPU 0
sudo nvidia-smi mig \
  -cgi 19,19,19,19,19,19,19 -C

# List all MIG instances
sudo nvidia-smi mig -lgi

# Disable MIG mode
sudo nvidia-smi -i 0 -mig 0

Note: MIG mode requires a reboot on some systems. MIG instances cannot share NVLink — inter-instance communication goes through the host CPU/PCIe path.
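
Containers consume a specific MIG instance by pinning its device identifier. A minimal sketch, assuming the NVIDIA Container Toolkit is installed; the MIG UUID shown is a placeholder taken from nvidia-smi -L output:

# List physical GPUs and their MIG instance UUIDs
nvidia-smi -L

# Pin a container to one MIG instance via NVIDIA_VISIBLE_DEVICES (placeholder UUID)
docker run --rm --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
  nvcr.io/nvidia/pytorch:24.01-py3 nvidia-smi -L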

🔍 NVIDIA Profiling & Monitoring Tools
Tool               | Purpose                      | Key Use
nvidia-smi         | GPU management & monitoring  | Temperature, power, utilization, driver version, MIG management
DCGM               | Cluster-scale GPU health     | Health checks, power capping, telemetry export to Prometheus
Nsight Systems     | System-level profiling       | CPU/GPU timeline, CUDA API calls, NVLink/PCIe traffic
Nsight Compute     | Kernel-level profiling       | SM utilization, memory bandwidth, roofline analysis per kernel
Nsight DL Designer | DL model profiling           | Layer-by-layer performance in neural network graphs
NVTX               | Annotation SDK               | Mark application regions for Nsight Systems timeline correlation
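
A sketch of how these tools are typically invoked; train.py and the report names are placeholders:

# System-level timeline with Nsight Systems, then summarize it
nsys profile -o train_timeline python train.py
nsys stats train_timeline.nsys-rep

# Kernel-level profiling with Nsight Compute (full metric set)
ncu --set full -o kernel_report python train.py

# DCGM: list GPUs and run a quick health diagnostic
dcgmi discovery -l
dcgmi diag -r 1
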
📚 Key NVIDIA Libraries
Library            | Domain                 | Key Function
cuDNN              | Deep Learning          | GPU-accelerated convolutions, attention, normalization; used by PyTorch/TF backends
NCCL               | Collective Comms       | All-reduce, broadcast, scatter/gather across GPUs; core of distributed training
cuBLAS             | Linear Algebra         | Dense matrix multiply (GEMM); the inner loop of transformer attention and FFN layers
cuSPARSE           | Sparse Linear Algebra  | SpMM, SpMV for sparse model acceleration
TensorRT           | Inference Optimization | Layer fusion, precision calibration (FP8/INT8), engine serialization for deployment
RAPIDS cuDF / cuML | Data Science           | GPU-accelerated DataFrames and ML pipelines (ETL for training data)
cuFile (GDS)       | Storage I/O            | Direct NVMe→GPU DMA bypassing CPU; eliminates storage I/O bottleneck
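
One quick way to confirm which of these library versions ship inside an NGC container is PyTorch's own version reporting. A minimal sketch, reusing the container tag shown elsewhere on this page:

# Report the CUDA, cuDNN, and NCCL versions bundled in the container
docker run --rm --gpus all nvcr.io/nvidia/pytorch:24.01-py3 \
  python -c "import torch; print(torch.version.cuda, torch.backends.cudnn.version(), torch.cuda.nccl.version())"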

Containers & Cluster Orchestration

NGC containers package the full validated software stack. SLURM and Kubernetes are the two dominant schedulers for GPU cluster workloads — each with distinct GPU-aware tooling.

📦 NGC — NVIDIA GPU Cloud Containers

NGC provides pre-built, optimized containers from nvcr.io (NVIDIA Container Registry). Each container includes framework + CUDA toolkit + cuDNN + libraries, validated against specific GPU generations.

Key container families: PyTorch, TensorFlow, JAX, TensorRT, Triton Inference Server, NeMo (LLM training), RAPIDS, CUDA base images.

Versioning: Tagged as YY.MM (e.g., 24.01) indicating release month. Monthly releases track upstream framework + NVIDIA library updates.

# Pull NGC PyTorch container
docker pull \
  nvcr.io/nvidia/pytorch:24.01-py3

# Run with GPU access
docker run --gpus all \
  --rm -it \
  nvcr.io/nvidia/pytorch:24.01-py3

# Run with specific GPU + NVMe mount
docker run --gpus '"device=0,1"' \
  -v /lustre/data:/data \
  -v /lustre/checkpoints:/ckpt \
  nvcr.io/nvidia/pytorch:24.01-py3 \
  python train.py
⚙️ SLURM — GPU Cluster Job Scheduling

SLURM (Simple Linux Utility for Resource Management) is the dominant HPC/AI cluster scheduler. GPU resources are managed through GRES (Generic Resource Scheduling).

Pyxis: SLURM plugin (from NVIDIA) that integrates container execution natively into srun/sbatch. Replaces the need for docker run wrappers. Uses Enroot as the container runtime underneath.

Enroot: Lightweight container runtime that converts OCI images (Docker/NGC) to unprivileged sandboxes — optimized for HPC shared filesystems.

#!/bin/bash
#SBATCH --job-name=llm-train
#SBATCH --nodes=4
# One launcher task per node; torchrun spawns the 8 per-GPU processes
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --time=24:00:00
#SBATCH --partition=dgx-h100

# Rendezvous address = first node of the allocation
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_PORT=29500

# Launch with Pyxis (--container-image)
srun \
  --container-image=nvcr.io/nvidia/pytorch:24.01-py3 \
  --container-mounts=/lustre/data:/data \
  python -m torch.distributed.run \
    --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=$SLURM_NODEID \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train.py
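
Pyxis drives Enroot automatically, but the same steps can be run by hand. A sketch of the standalone Enroot workflow, assuming Enroot is installed; the image tag and mount paths are illustrative:

# Convert the NGC image to an unprivileged squashfs (no Docker daemon required)
enroot import docker://nvcr.io#nvidia/pytorch:24.01-py3

# Create a sandbox from the resulting .sqsh and run a command with a data mount
enroot create --name pytorch nvidia+pytorch+24.01-py3.sqsh
enroot start --mount /lustre/data:/data pytorch python train.py
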
☸️ Kubernetes — GPU-Aware Container Orchestration

NVIDIA GPU Device Plugin: Kubernetes DaemonSet that advertises GPUs as allocatable resources (nvidia.com/gpu). Required on every GPU node. Enables GPU requests in Pod specs.

MIG-aware Scheduling: The GPU device plugin supports MIG device advertisement. Pods can request specific MIG profiles (e.g., nvidia.com/mig-1g.10gb).

NVIDIA GPU Operator: Kubernetes operator that automates deployment of the GPU driver, device plugin, DCGM exporter, and MIG manager; the recommended production deployment path for GPU clusters on Kubernetes (a Helm install sketch follows the Pod example below).

Kubeflow: ML workflow orchestration on Kubernetes — manages training jobs (PyTorchJob, TFJob), hyperparameter tuning, and pipelines.

# Pod spec requesting 2 H100 GPUs
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3
    resources:
      limits:
        nvidia.com/gpu: 2
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all

---
# Request MIG instance instead:
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1
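
The GPU Operator referenced above is normally installed from NVIDIA's Helm repository. A minimal sketch; the release name, namespace, and node name are arbitrary:

# Add NVIDIA's Helm repo and install the GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# Confirm GPUs are advertised as allocatable resources
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu
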
📊 SLURM vs Kubernetes for GPU Workloads
Dimension         | SLURM                              | Kubernetes
Primary Use       | HPC / large-scale training         | Microservices / inference serving
Container Runtime | Enroot + Pyxis plugin              | containerd / Docker + GPU device plugin
Job Types         | Batch (sbatch), interactive (srun) | Pods, Jobs, CronJobs, custom CRDs
GPU Scheduling    | GRES (--gres=gpu:N)                | Resource requests (nvidia.com/gpu: N)
MIG Support       | Via GRES naming convention         | MIG device plugin profiles
Multi-node Jobs   | Native (--nodes, --ntasks)         | Requires PyTorchJob / MPI Operator
Adoption          | Dominant in HPC / supercomputers   | Dominant in cloud-native / MLOps

Distributed Training

Training large AI models requires distributing computation across multiple GPUs and nodes. NCCL handles the collective communications; DDP, FSDP, tensor, and pipeline parallelism handle different ways to split the model and data.

📡 NCCL — NVIDIA Collective Communications Library

NCCL implements optimized GPU-to-GPU collective operations — the backbone of all distributed deep learning. It automatically selects the best transport: NVLink (intra-node), InfiniBand RDMA (inter-node), or Ethernet.

Key collectives:

  • All-Reduce: Sum (or mean) a tensor across all GPUs, distribute result back to all. Used in DDP gradient synchronization.
  • All-Gather: Each GPU contributes a shard; all GPUs receive the concatenated result. Used in FSDP parameter reconstruction.
  • Reduce-Scatter: Sum across GPUs, then scatter one shard to each. Used in FSDP gradient reduction.
  • Broadcast: One GPU sends a tensor to all others.

Ring algorithm: NCCL's default all-reduce topology. Each GPU exchanges data only with its ring neighbors; per-GPU traffic is 2(N-1)/N times the message size, so it stays near twice the message regardless of GPU count, making the ring bandwidth-optimal. NCCL can switch to a tree algorithm for small messages where latency dominates. Well suited to NVLink rings.

# NCCL environment variables for tuning

# Use InfiniBand interface hca0
export NCCL_IB_HCA=mlx5_0

# Specify network interface for socket fallback
export NCCL_SOCKET_IFNAME=ib0

# Enable NCCL debug logging
export NCCL_DEBUG=INFO

# Disable IB (force Ethernet)
export NCCL_IB_DISABLE=1

# Restrict peer-to-peer transfers to NVLink paths
export NCCL_P2P_LEVEL=NVL

# Verify NCCL is using NVLink
NCCL_DEBUG=INFO python train.py \
  2>&1 | grep -i nvlink
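
To verify that collectives actually reach NVLink/InfiniBand line rate, the nccl-tests benchmarks (github.com/NVIDIA/nccl-tests) are commonly used. A sketch, assuming the benchmarks have been built; host names are illustrative:

# Single node, 8 GPUs: sweep all-reduce message sizes from 8 B to 4 GB
./build/all_reduce_perf -b 8 -e 4G -f 2 -g 8

# Multi-node via MPI: 4 nodes x 8 GPUs, one rank per GPU
mpirun -np 32 -H node1:8,node2:8,node3:8,node4:8 \
  ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
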
🔀 Parallelism Strategies
Data Parallelism (DDP)
torch.nn.parallel.DistributedDataParallel
Full model replica on each GPU. Each GPU processes a different mini-batch. Gradients all-reduced after backward pass. Simplest approach — works when model fits in one GPU's HBM.
Fully Sharded (FSDP)
torch.distributed.fsdp.FullyShardedDataParallel
Shards model parameters, gradients, and optimizer states across GPUs. All-gather before forward, reduce-scatter after backward. Enables training models far larger than single-GPU HBM. LLaMA 70B: ~700 GB total → 87.5 GB/GPU across 8.
Tensor Parallelism (TP)
Megatron-LM, NeMo
Splits individual layer weight matrices across GPUs. Column-parallel and row-parallel linear layers. Requires all-reduce within each layer — needs fast NVLink intra-node bandwidth. Typically used within a node.
Pipeline Parallelism (PP)
GPipe, PipeDream, Megatron-LM
Assigns consecutive transformer layers to different GPUs (stages). Micro-batches flow through stages. Reduces inter-GPU communication to activations only (no gradient all-reduce across stages). Works well across nodes over IB.

3D Parallelism combines all three: Tensor Parallelism (intra-node, NVLink), Pipeline Parallelism (inter-node, InfiniBand), and Data Parallelism (across TP×PP groups). Used for frontier-scale LLM training (GPT-4 class models) on thousands of GPUs.
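
As a concrete sizing example, 512 GPUs could be laid out as TP=8 (within each NVLink node) × PP=8 × DP=8. A hedged sketch using Megatron-LM style flags; the script name is Megatron-LM's and the remaining model/data arguments are omitted:

# 64 nodes x 8 GPUs = 512 ranks; DP size = 512 / (TP 8 x PP 8) = 8
# (rendezvous settings and model/data arguments omitted)
torchrun --nnodes=64 --nproc_per_node=8 pretrain_gpt.py \
  --tensor-model-parallel-size 8 \
  --pipeline-model-parallel-size 8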

🚀 Launching Distributed Training
# Single-node, 8-GPU DDP with torchrun
torchrun \
  --nproc_per_node=8 \
  train_ddp.py \
  --batch-size 256

# Multi-node (4 nodes, 8 GPUs each = 32 total) via SLURM+Pyxis
# Assumes the job was allocated with --nodes=4 --ntasks-per-node=1 and that
# MASTER_ADDR is exported (e.g. the first host in $SLURM_JOB_NODELIST)
srun \
  --container-image=nvcr.io/nvidia/pytorch:24.01-py3 \
  --container-mounts=/lustre:/data \
  torchrun \
    --nproc_per_node=8 \
    --nnodes=4 \
    --node_rank=$SLURM_NODEID \
    --master_addr=$MASTER_ADDR \
    --master_port=29500 \
  train_ddp.py

# FSDP wrapping in Python
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# transformer_auto_wrap_policy needs the transformer block class(es) to wrap
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerBlock},  # your model's block class
)

model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
    device_id=torch.cuda.current_device(),
)
📊 Parallelism Strategy Selection Guide
Model Size             | GPU Count   | Recommended Strategy      | Key Constraint
<80 GB (fits 1× H100)  | 1–8         | DDP                       | Batch size / throughput
80–640 GB              | 8–64        | FSDP                      | HBM per GPU
640 GB – few TB        | 64–256      | FSDP + Tensor Parallelism | NVLink bandwidth
Multi-TB (GPT-4 class) | 256–10,000+ | 3D Parallelism (TP+PP+DP) | IB bandwidth, pipeline bubbles

Inference Serving & MLOps

Getting trained models into production requires optimized inference engines, serving infrastructure, and monitoring. NVIDIA's inference stack centers on TensorRT, Triton Inference Server, and NIM.

⚡ TensorRT — Inference Optimization

TensorRT takes a trained model (ONNX, PyTorch, TensorFlow) and produces an optimized serialized engine for a specific GPU target. Optimizations include:

  • Layer fusion: Combines sequential ops (e.g., Conv + BN + ReLU → single kernel)
  • Precision calibration: FP32 → FP16 → INT8 → FP8 with calibration dataset
  • Kernel auto-selection: Benchmarks alternative CUDA kernels, picks fastest for target GPU
  • Memory optimization: Reuses buffers across layers to minimize HBM allocation

Engine files are GPU-specific: An engine built for H100 (SM 9.0) cannot run on A100 (SM 8.0). Always build on the deployment target GPU.

# Export PyTorch model to ONNX
torch.onnx.export(model, dummy_input,
  "model.onnx",
  opset_version=17,
  input_names=["input"],
  output_names=["output"],
  dynamic_axes={"input": {0: "batch"}})

# Build TensorRT engine (FP16)
trtexec \
  --onnx=model.onnx \
  --fp16 \
  --saveEngine=model_fp16.engine \
  --minShapes=input:1x3x224x224 \
  --optShapes=input:32x3x224x224 \
  --maxShapes=input:64x3x224x224

# Run benchmark (explicit-batch ONNX engines take --shapes rather than --batch)
trtexec \
  --loadEngine=model_fp16.engine \
  --shapes=input:32x3x224x224 \
  --iterations=100
🚀 Triton Inference Server

Triton (not to be confused with the Triton kernel language) is NVIDIA's production inference serving system. It serves models from multiple frameworks simultaneously and exposes HTTP and gRPC endpoints.

Key features:

  • Dynamic batching: Automatically groups requests from multiple clients into batches for higher GPU utilization
  • Concurrent model execution: Multiple models run simultaneously on different GPU contexts
  • Model ensembles: Chain multiple models in a DAG (e.g., pre-process → model → post-process) as a single endpoint
  • Backends: TensorRT, ONNX Runtime, PyTorch (LibTorch), TensorFlow, Python, FIL (forests)
  • Metrics: Prometheus-compatible endpoint for latency, throughput, queue depth per model
# Model repository structure
models/
  resnet50/
    config.pbtxt        # model config
    1/
      model.plan        # TensorRT engine
  bert-large/
    config.pbtxt
    1/
      model.onnx

# Launch Triton server
# Ports: 8000 = HTTP, 8001 = gRPC, 8002 = Prometheus metrics
docker run --gpus all \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  -v $(pwd)/models:/models \
  nvcr.io/nvidia/tritonserver:24.01-py3 \
  tritonserver \
    --model-repository=/models

# Health check
curl localhost:8000/v2/health/ready
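
Dynamic batching is enabled per model in its config.pbtxt. A minimal sketch for the resnet50 TensorRT model shown above; the preferred batch sizes and queue delay are illustrative values:

# Enable dynamic batching for the resnet50 model
cat > models/resnet50/config.pbtxt <<'EOF'
name: "resnet50"
platform: "tensorrt_plan"
max_batch_size: 64
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 100
}
EOF
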
🧩 NIM — NVIDIA Inference Microservices

NIM packages optimized models (LLMs, vision, speech) as containerized microservices with pre-built TensorRT engines. A single docker run deploys a production-ready inference endpoint — no manual TensorRT engine building required.

What's inside a NIM container: Triton Inference Server + pre-built TensorRT-LLM engines + model weights + API server (OpenAI-compatible REST API). Supports continuous batching for LLMs (token-by-token generation with dynamic slot allocation).

TensorRT-LLM: NVIDIA's LLM-specific inference library. Implements paged KV cache, in-flight batching (continuous batching), speculative decoding, and FP8 quantization for H100. Substantially higher token throughput vs vanilla PyTorch inference.

# Deploy Llama 3 70B NIM
docker run --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -p 8000:8000 \
  -v /local/nim/cache:/opt/nim/.cache \
  nvcr.io/nim/meta/llama3-70b-instruct:latest

# Inference via OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama3-70b-instruct",
    "messages": [{"role":"user","content":"Hello"}],
    "max_tokens": 200
  }'
📊 MLOps — Model Lifecycle on GPU Clusters
Phase                              | Tool / Service                        | NVIDIA Integration
Experiment tracking                | MLflow, Weights & Biases (W&B)        | Auto-logs GPU metrics, loss curves, hyperparams
Distributed training orchestration | SLURM + Pyxis, Kubeflow, Ray Train    | Native GPU scheduling; NCCL for collectives
Model registry                     | MLflow Registry, NGC Private Registry | Store TensorRT engines and NIM-ready checkpoints
Hyperparameter tuning              | Optuna, Ray Tune, NNI                 | Runs on GPU cluster; early stopping via DCGM metrics
Inference optimization             | TensorRT, TensorRT-LLM                | Target GPU SM architecture; FP8 calibration on H100
Serving                            | Triton Inference Server, NIM          | Dynamic batching; Prometheus/Grafana monitoring
Cluster monitoring                 | DCGM + Prometheus + Grafana           | GPU health, power, thermal, utilization dashboards
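
The DCGM-to-Prometheus path in the table above is typically the dcgm-exporter. A sketch of running it standalone; the image tag is illustrative and 9400 is its default metrics port:

# Run dcgm-exporter and scrape its Prometheus endpoint
docker run -d --rm --gpus all -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:latest

curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL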

Practice Quiz

10 questions covering CUDA, MIG, containers, SLURM, NCCL, distributed training, and inference serving. Each set of answer choices is followed by an explanation of the correct answer.

Question 1 of 10
Which NVIDIA library provides GPU-accelerated collective communications (all-reduce, all-gather) for distributed training?
A. cuBLAS
B. NCCL
C. cuDNN
D. TensorRT
NCCL (NVIDIA Collective Communications Library) implements all-reduce, all-gather, reduce-scatter, broadcast, and scatter/gather across GPUs — the backbone of all distributed deep learning. cuBLAS handles dense matrix multiply (GEMM). cuDNN handles DL primitives like convolutions. TensorRT is an inference optimizer.
Question 2 of 10
What is the CUDA compute capability of the NVIDIA H100 GPU?
A. 9.0
B. 8.0
C. 7.0
D. 10.0
H100 = compute capability 9.0. The progression: V100 = 7.0, A100 = 8.0, H100 = 9.0, B200 = 10.0. Compute capability determines which CUDA features and Tensor Core generations are available. Code must be compiled for the correct SM architecture (e.g., -arch=sm_90 for H100).
Question 3 of 10
Which SLURM plugin allows running NGC containers natively in sbatch/srun jobs without requiring Docker daemon access?
A. DCGM
B. MIG Manager
C. Pyxis
D. Kubeflow
Pyxis is the NVIDIA SLURM plugin that integrates container execution via --container-image in sbatch/srun. It uses Enroot as the underlying container runtime — converting Docker/OCI images to unprivileged sandboxes optimized for HPC shared filesystems. DCGM is for GPU monitoring. Kubeflow runs on Kubernetes, not SLURM.
Question 4 of 10
Which distributed training strategy shards model parameters, gradients, AND optimizer states across GPUs to reduce per-GPU memory?
A. DDP (DistributedDataParallel)
B. FSDP (FullyShardedDataParallel)
C. Pipeline Parallelism
D. Model Replication
FSDP shards parameters + gradients + optimizer states. Each GPU holds only 1/N of each — enabling training of models far larger than a single GPU's HBM. DDP keeps a full model copy on every GPU. Pipeline Parallelism assigns different layers to different GPUs but doesn't shard within a layer. For LLaMA 70B: ~700 GB total / 8 GPUs = ~87.5 GB/rank with FSDP.
Question 5 of 10
What is the maximum number of MIG instances that can be created on a single H100 GPU using the smallest profile?
A. 7
B. 4
C. 8
D. 16
H100 supports up to 7 MIG instances using the smallest profile (1g.10gb — 1 GPC slice, 10 GB HBM). Each instance gets dedicated SMs, HBM slice, L2, and memory controllers — full hardware isolation. The H100 has 7 GPC slices available for MIG. A100 also supports 7 MIG instances. MIG instances cannot communicate via NVLink; they must go through the host.
Question 6 of 10
Triton Inference Server's dynamic batching feature is primarily used to:
A. Compress model weights to INT8 precision automatically
B. Distribute a single model across multiple GPU nodes
C. Group concurrent client requests into batches to improve GPU utilization
D. Automatically select the optimal parallelism strategy for training
Triton's dynamic batching collects inference requests arriving from multiple clients within a configurable time window and groups them into a single batch before sending to the GPU — dramatically improving GPU utilization for low-to-medium traffic scenarios where individual request batch size is 1. This converts many small, GPU-inefficient requests into a single large, efficient batch execution.
Question 7 of 10
Which NVIDIA tool provides cluster-scale GPU health monitoring, power capping, and Prometheus-compatible telemetry export?
A. Nsight Systems
B. DCGM
C. nvidia-smi
D. TensorBoard
DCGM (Data Center GPU Manager) is the cluster-scale GPU management solution — health checks, power capping, diagnostic tests, policy enforcement, and Prometheus metrics export. nvidia-smi is per-node; useful for manual inspection but not cluster-scale. Nsight Systems is a developer profiling tool, not a production monitoring system. TensorBoard is for training metrics, not GPU infrastructure.
Question 8 of 10
NGC containers are tagged with a YY.MM versioning scheme (e.g., 24.01). What does this tag represent?
A. The year and month of the NGC container release
B. The CUDA toolkit version packaged inside
C. The minimum GPU driver version required
D. The compute capability of the target GPU
NGC container tags follow YY.MM format — year and month of the release (e.g., 24.01 = January 2024). Each monthly release includes updated CUDA toolkit, cuDNN, framework versions, and NVIDIA library updates, validated together on NVIDIA hardware. The CUDA version inside is specified separately in the tag suffix or documentation, not the YY.MM number itself.
Question 9 of 10
In 3D parallelism, which dimension splits individual layer weight matrices across GPUs within a single node?
A. Pipeline Parallelism
B. Data Parallelism
C. Tensor Parallelism
D. Gradient Checkpointing
Tensor Parallelism (TP) splits individual weight matrices (e.g., the Q, K, V projections in attention) column-wise and row-wise across GPUs within a node. It requires an all-reduce after each matrix multiply — needs fast NVLink intra-node bandwidth. Pipeline Parallelism assigns whole layers to different GPUs. Data Parallelism replicates the model with different data shards. Gradient checkpointing is a memory technique, not a parallelism strategy.
Question 10 of 10
Which environment variable configures NCCL to use a specific InfiniBand HCA for inter-node collective communications?
A. NCCL_SOCKET_IFNAME
B. NCCL_IB_HCA
C. NCCL_P2P_LEVEL
D. NCCL_IB_DISABLE
NCCL_IB_HCA specifies which InfiniBand HCA (Host Channel Adapter) to use — e.g., export NCCL_IB_HCA=mlx5_0. NCCL_SOCKET_IFNAME specifies the Ethernet interface for TCP/socket fallback. NCCL_P2P_LEVEL controls the peer-to-peer transport hierarchy (NVL, PXB, SYS). NCCL_IB_DISABLE=1 forces Ethernet, disabling InfiniBand entirely.

Memory Hooks & Advisor

Mnemonics, patterns, and quick-reference guidance for the most exam-critical software stack and distributed training concepts.

🏗️
Stack Order (Bottom Up)
GPU → Driver → CUDA → Libraries (cuDNN/NCCL/cuBLAS) → Framework (PyTorch) → NGC Container → App. Each layer depends on the one below. The driver must support a CUDA version at least as new as the toolkit.
"Hardware Drives CUDA's Libraries, Frameworks Navigate Containers to Apps"
🔢
Compute Capabilities
V100=7.0, A100=8.0, H100=9.0, B200=10.0. Each major version = new architecture. H100's SM 9.0 unlocks FP8 Tensor Cores and Transformer Engine. Compile with -arch=sm_90 for H100.
"7-8-9-10: Volta, Ampere, Hopper, Blackwell"
🔀
MIG = 7 Max on H100
One H100 → up to 7 isolated GPU instances (1g.10gb each). Each gets dedicated SMs, HBM, L2, memory controllers. No NVLink between instances. Used for multi-tenant inference.
"7 slices of the GPU pie — each fully isolated"
📡
NCCL All-Reduce = Ring
All-reduce sums gradients across GPUs. Ring algorithm: each GPU exchanges data with its ring neighbors; per-GPU traffic stays near twice the message size regardless of GPU count, which is bandwidth-optimal. Preferred for NVLink rings. All-gather + Reduce-Scatter = the FSDP pattern.
"Ring-around the GPUs, gradients all sync up"
🔀
DDP vs FSDP
DDP: full model copy per GPU, all-reduce gradients. Works when model fits in one GPU's HBM. FSDP: shards params+grads+optimizer states — model can be 8× larger than one GPU's HBM. LLaMA 70B requires FSDP (or bigger).
"DDP = copies; FSDP = shards — pick by model size"
📦
Pyxis = SLURM + Containers
Pyxis is the NVIDIA plugin that adds --container-image to sbatch/srun. Uses Enroot runtime underneath. Enables GPU cluster users to run NGC containers without Docker daemon access. Essential for HPC+AI.
"Pyxis = SLURM's container bridge to NGC"
🚀
Triton Dynamic Batching
Triton groups concurrent single-request inference calls into batches before GPU execution. Turns N × (batch=1) requests into one (batch=N) GPU call. Critical for maximizing GPU utilization in production inference.
"Triton batches stragglers — no GPU sits idle"
🧩
3D Parallelism Mapping
TP (within node, NVLink) + PP (across nodes, IB) + DP (across TP×PP groups). Tensor splits within layers (width), Pipeline splits depth (layer stages), Data replicates the whole TP×PP group. All three combined for trillion-parameter models.
"TP=width, PP=depth, DP=replicas"
🃏 Flashcards

Concept: H100 CUDA Compute Capability
Answer: 9.0 (V100=7.0 · A100=8.0 · H100=9.0 · B200=10.0)

Concept: Max MIG Instances on H100
Answer: 7 instances (1g.10gb profile), each with dedicated SMs, HBM, L2, and memory controllers

Library: What does NCCL provide?
Answer: GPU collective communications (all-reduce · all-gather · reduce-scatter · broadcast)

Tool: SLURM plugin for NGC containers
Answer: Pyxis (uses the Enroot runtime); adds --container-image to sbatch/srun

Strategy: DDP vs FSDP, key difference
Answer: DDP keeps a full model copy per GPU; FSDP shards params + grads + optimizer states, so it fits much larger models

Tool: Triton Inference Server, dynamic batching
Answer: Groups concurrent client requests into batches before GPU execution, maximizing GPU utilization at inference time

Variable: NCCL env var for InfiniBand HCA
Answer: NCCL_IB_HCA=mlx5_0 (vs NCCL_SOCKET_IFNAME for Ethernet fallback)

Concept: 3D Parallelism dimensions
Answer: Tensor (intra-node, NVLink) + Pipeline (inter-node, IB) + Data (across TP×PP groups)
🤖 Expert Advisor: Quick Reference by Category

⚡ CUDA & MIG

  • Compute capability identifies GPU architecture features: V100=7.0, A100=8.0, H100=9.0, B200=10.0. Always compile for the deployment GPU's SM version (-arch=sm_90 for H100).
  • CUDA driver version must be ≥ CUDA toolkit version. The nvidia-smi "CUDA Version" shows the max driver-supported toolkit version — not what's installed.
  • MIG partitions one physical GPU into up to 7 isolated instances (H100: 1g.10gb × 7). Each instance has dedicated SMs, HBM, L2, and memory controllers — full fault isolation.
  • MIG instances cannot communicate via NVLink. Inter-instance traffic routes through host CPU/PCIe. Use MIG for multi-tenant inference, not distributed training.
  • Enable MIG with sudo nvidia-smi -i 0 -mig 1. Create instances with sudo nvidia-smi mig -cgi <profile_id> -C. List profiles with sudo nvidia-smi mig -lgip.
  • DCGM supports MIG-aware health monitoring — each MIG instance reported independently. Kubernetes GPU device plugin also exposes MIG instances as schedulable resources.

📦 Containers & Schedulers

  • NGC containers at nvcr.io package framework + CUDA + cuDNN + libraries validated together. Tagged as YY.MM (year.month). Pull exact tag matching your GPU driver version support.
  • SLURM + Pyxis: add --container-image=nvcr.io/nvidia/pytorch:24.01-py3 and --container-mounts=/lustre:/data to sbatch/srun. Pyxis handles image conversion via Enroot at job start.
  • SLURM GPU allocation: #SBATCH --gres=gpu:8 (8 GPUs per node). Use --gpus-per-task=1 for fine-grained per-rank GPU assignment.
  • Kubernetes GPU device plugin: DaemonSet that advertises nvidia.com/gpu as resource. Request GPUs with resources.limits["nvidia.com/gpu"]: 2 in Pod spec.
  • NVIDIA GPU Operator: single Helm chart that deploys driver, device plugin, DCGM exporter, MIG manager on Kubernetes GPU nodes; the preferred production deployment method.
  • Enroot: converts Docker/OCI images to unprivileged squashfs containers. Works without root daemon — compatible with HPC multi-user systems. Pyxis = SLURM integration layer on top of Enroot.

🔀 Distributed Training

  • NCCL all-reduce uses ring algorithm by default: O(N) data, O(1) bandwidth per GPU regardless of GPU count. Alternative: tree algorithm (lower latency for small messages).
  • DDP (torch.nn.parallel.DistributedDataParallel): full model copy per GPU, data sharded, gradients all-reduced after backward. Simplest — use when model fits in one GPU HBM.
  • FSDP (FullyShardedDataParallel): shards params+grads+optimizer states. Uses all-gather before forward (reconstruct params), reduce-scatter after backward (accumulate grad shards). LLaMA 70B: 700 GB ÷ 8 = ~87.5 GB/rank.
  • Tensor Parallelism: splits weight matrices within layers across GPUs (Megatron-style). Needs NVLink bandwidth; typically intra-node only. All-reduce required after each matmul.
  • Pipeline Parallelism: assigns consecutive transformer layers to different GPU stages. Micro-batches flow through the pipeline. Communication = activations only. Suitable for inter-node (IB). Introduces pipeline "bubble" idle time.
  • 3D Parallelism: TP (within node, NVLink) × PP (between nodes, IB) × DP (across pipeline+tensor groups). Used for trillion-parameter frontier model training on thousands of GPUs.

🚀 Inference & Triton

  • TensorRT engine files are GPU-specific — an engine built for H100 (sm_90) will not run on A100 (sm_80). Always build TensorRT engines on the production target hardware.
  • TensorRT workflow: export PyTorch/TF model → ONNX → trtexec (or Python API) → .engine file. Specify precision (--fp16, --int8) and dynamic shape ranges at build time.
  • Triton backends: TensorRT (.plan), ONNX Runtime (.onnx), LibTorch (.pt), TensorFlow SavedModel, Python (custom logic), FIL (forest models).
  • Triton dynamic batching: groups requests arriving within a time window into a single batch. Configure with max_queue_delay_microseconds and preferred_batch_size in config.pbtxt.
  • NIM (NVIDIA Inference Microservices): pre-built TensorRT-LLM engines + Triton + OpenAI-compatible API in a single container. Supports continuous batching for LLMs. Deploy with docker run, connect to NGC API key for model weights.
  • TensorRT-LLM features for H100: paged KV cache (handles variable-length sequences efficiently), in-flight batching (continuous batching), speculative decoding, FP8 quantization via Transformer Engine.

📊 Cluster MLOps

  • DCGM (Data Center GPU Manager): cluster-scale GPU health monitoring. Exports metrics to Prometheus via the dcgm-exporter sidecar. Supports health checks, power capping (dcgmi config), and field-group telemetry.
  • Experiment tracking: W&B and MLflow auto-log GPU metrics (utilization, temperature, memory) alongside training loss, learning rate, and custom metrics. Essential for reproducibility.
  • Checkpoint strategy: save to local NVMe (fast) via GDS, then async copy to Lustre/WEKA (shared), then optionally to S3 (archival). Use FSDP state_dict_type=SHARDED_STATE_DICT for parallel save.
  • Model registry: NGC Private Registry stores team-internal containers and model artifacts. TensorRT engines stored with GPU-version metadata — prevents deploying wrong-arch engine.
  • Power capping at cluster scale: dcgmi config -g <group> --set -P <power_limit_W>. Useful for shared cluster fairness and preventing PSU overload during simultaneous GPU boost.
  • SLURM accounting: sacct -j <jobid> --format=JobID,GPUUtil,MaxRSS,Elapsed. Track GPU utilization per job for cluster efficiency auditing — identify underutilized GPU allocations.