AI Software Stack & Cluster Deployment
The NVIDIA AI software stack is a layered architecture from GPU silicon to containerized applications. Every NCP-AII certified professional must understand how CUDA, libraries, frameworks, containers, schedulers, and distributed training fit together to deploy AI at cluster scale.
Each layer depends on the one below. CUDA toolkit version must be compatible with the installed driver. Frameworks call cuDNN/NCCL through CUDA. NGC containers package a consistent, validated snapshot of all layers above the driver.
| GPU | Compute Cap. |
|---|---|
| V100 | 7.0 |
| A100 | 8.0 |
| H100 | 9.0 |
| B200 | 10.0 |
Pre-built, GPU-optimized containers for PyTorch, TensorFlow, TensorRT, and more. Updated monthly. Validated on NVIDIA hardware. Eliminates dependency management — pull and run.
Multi-Instance GPU: partition one H100 into up to 7 independent GPU instances, each with dedicated SMs, HBM slice, L2, and memory controllers. Enables multi-tenant inference workloads.
Core Theme: The NCP-AII software stack questions center on three areas: (1) CUDA/driver compatibility and MIG, (2) container + scheduler deployment patterns (SLURM with Pyxis, Kubernetes with GPU device plugin), and (3) distributed training collectives (NCCL all-reduce, ring vs tree algorithms) and parallelism strategies (DDP, FSDP, tensor, pipeline, 3D).
CUDA & the NVIDIA Software Stack
CUDA is the programming model and platform that exposes GPU parallelism to developers. Understanding CUDA's architecture, compute capabilities, driver/toolkit compatibility, and profiling tools is foundational.
CUDA Toolkit vs CUDA Driver: Two distinct components. The driver (installed with the GPU driver package) is the low-level kernel interface. The toolkit (nvcc, runtime libraries) compiles and links CUDA code. The toolkit version must be ≤ the maximum CUDA version the driver supports (the "CUDA Version" reported by nvidia-smi).
Compute Capability: A version number (major.minor) identifying GPU architecture features. Code compiled natively for compute capability 9.0 (sm_90) will not run on older GPUs such as A100 (8.0). Always compile for the target GPU's compute capability.
CUDA Runtime vs Driver API: Runtime API is higher-level (implicit context management). Driver API is lower-level (explicit control). Most frameworks use Runtime API.
# Check driver and CUDA toolkit versions
nvidia-smi
# Shows: Driver Version, CUDA Version (max supported)
nvcc --version
# Shows: installed CUDA toolkit version
# Check GPU compute capability
nvidia-smi --query-gpu=compute_cap \
--format=csv,noheader
# Compile for H100 (sm_90)
nvcc -arch=sm_90 kernel.cu -o kernel
MIG partitions a single physical GPU into multiple isolated GPU Instances (GIs), each with:
- Dedicated SM partition (no sharing between instances)
- Dedicated HBM memory slice
- Dedicated L2 cache partition
- Dedicated memory controllers and bandwidth
- Full fault isolation — one instance's errors don't affect others
H100 MIG profiles: 1g.10gb (7×), 2g.20gb (3×), 3g.40gb (2×), 4g.40gb (1×), 7g.80gb (1× — full GPU). The "g" = GPC slices; "gb" = HBM allocation.
Use case: Running multiple independent inference workloads on one GPU; multi-tenant environments where isolation is required.
# Enable MIG mode on GPU 0
sudo nvidia-smi -i 0 \
-mig 1
# List available MIG profiles
sudo nvidia-smi mig \
-lgip
# Create 7 × 1g.10gb instances on GPU 0
sudo nvidia-smi mig \
-cgi 19,19,19,19,19,19,19 -C
# List all MIG instances
sudo nvidia-smi mig -lgi
# Disable MIG mode
sudo nvidia-smi -i 0 -mig 0
Note: Enabling MIG mode requires the GPU to be idle and reset (a reboot on some systems). MIG instances cannot use NVLink or peer-to-peer between each other — inter-instance communication goes through the host CPU/PCIe path.
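Schedulers and launchers often need to discover MIG instances programmatically. A minimal sketch using the NVML Python bindings, assuming the pynvml package is installed and MIG instances were already created as above (the exact exception handling is an assumption about pynvml's error classes):
# Sketch: enumerate MIG devices with pynvml (pip install nvidia-ml-py)
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
current, pending = pynvml.nvmlDeviceGetMigMode(gpu)   # 1 = MIG enabled
print("MIG enabled:", bool(current))

for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
    try:
        mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, i)
    except pynvml.NVMLError:
        break   # no MIG device at this index
    # The MIG UUID can be passed to CUDA_VISIBLE_DEVICES to pin a process to one instance
    print(i, pynvml.nvmlDeviceGetUUID(mig))

pynvml.nvmlShutdown()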
| Tool | Purpose | Key Use |
|---|---|---|
| nvidia-smi | GPU management & monitoring | Temperature, power, utilization, driver version, MIG management |
| DCGM | Cluster-scale GPU health | Health checks, power capping, telemetry export to Prometheus |
| Nsight Systems | System-level profiling | CPU/GPU timeline, CUDA API calls, NVLink/PCIe traffic |
| Nsight Compute | Kernel-level profiling | SM utilization, memory bandwidth, roofline analysis per kernel |
| Nsight DL Designer | DL model profiling | Layer-by-layer performance in neural network graphs |
| NVTX | Annotation SDK | Mark application regions for Nsight Systems timeline correlation |
| Library | Domain | Key Function |
|---|---|---|
| cuDNN | Deep Learning | GPU-accelerated convolutions, attention, normalization — used by PyTorch/TF backends |
| NCCL | Collective Comms | All-reduce, broadcast, scatter/gather across GPUs — core of distributed training |
| cuBLAS | Linear Algebra | Dense matrix multiply (GEMM) — the inner loop of transformer attention and FFN layers |
| cuSPARSE | Sparse Linear Algebra | SpMM, SpMV for sparse model acceleration |
| TensorRT | Inference Optimization | Layer fusion, precision calibration (FP8/INT8), engine serialization for deployment |
| RAPIDS cuDF / cuML | Data Science | GPU-accelerated DataFrames and ML pipelines (ETL for training data) |
| cuFile (GDS) | Storage I/O | Direct NVMe→GPU DMA bypassing CPU — eliminates storage I/O bottleneck |
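These libraries are rarely called directly; frameworks dispatch to them under the hood. A quick PyTorch sketch (assuming a CUDA-enabled PyTorch build) that reports the bundled library versions and exercises the cuBLAS and cuDNN paths:
# Sketch: PyTorch calls dispatch to the NVIDIA libraries above
import torch

print("cuDNN:", torch.backends.cudnn.version())    # cuDNN build linked into PyTorch
print("NCCL:", torch.cuda.nccl.version())          # NCCL version bundled with PyTorch

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
c = a @ b                                          # GEMM -> cuBLAS kernel

x = torch.randn(8, 64, 56, 56, device="cuda")
w = torch.randn(128, 64, 3, 3, device="cuda")
y = torch.nn.functional.conv2d(x, w, padding=1)    # convolution -> cuDNN kernel
torch.cuda.synchronize()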
Containers & Cluster Orchestration
NGC containers package the full validated software stack. SLURM and Kubernetes are the two dominant schedulers for GPU cluster workloads — each with distinct GPU-aware tooling.
NGC provides pre-built, optimized containers from nvcr.io (NVIDIA Container Registry). Each container includes framework + CUDA toolkit + cuDNN + libraries, validated against specific GPU generations.
Key container families: PyTorch, TensorFlow, JAX, TensorRT, Triton Inference Server, NeMo (LLM training), RAPIDS, CUDA base images.
Versioning: Tagged as YY.MM (e.g., 24.01) indicating release month. Monthly releases track upstream framework + NVIDIA library updates.
# Pull NGC PyTorch container
docker pull \
nvcr.io/nvidia/pytorch:24.01-py3
# Run with GPU access
docker run --gpus all \
--rm -it \
nvcr.io/nvidia/pytorch:24.01-py3
# Run with specific GPU + NVMe mount
docker run --gpus '"device=0,1"' \
-v /lustre/data:/data \
-v /lustre/checkpoints:/ckpt \
nvcr.io/nvidia/pytorch:24.01-py3 \
python train.py
SLURM (Simple Linux Utility for Resource Management) is the dominant HPC/AI cluster scheduler. GPU resources are managed through GRES (Generic Resource Scheduling).
Pyxis: SLURM plugin (from NVIDIA) that integrates container execution natively into srun/sbatch. Replaces the need for docker run wrappers. Uses Enroot as the container runtime underneath.
Enroot: Lightweight container runtime that converts OCI images (Docker/NGC) to unprivileged sandboxes — optimized for HPC shared filesystems.
#!/bin/bash
#SBATCH --job-name=llm-train
#SBATCH --nodes=4
# One launcher task per node; torchrun spawns the 8 per-node ranks
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --time=24:00:00
#SBATCH --partition=dgx-h100
# Rendezvous address = first node in the allocation
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
# Launch with Pyxis (--container-image)
srun \
--container-image=nvcr.io/nvidia/pytorch:24.01-py3 \
--container-mounts=/lustre/data:/data \
python -m torch.distributed.run \
--nproc_per_node=8 \
--nnodes=4 \
--rdzv_backend=c10d \
--rdzv_endpoint=$MASTER_ADDR:29500 \
train.py
NVIDIA GPU Device Plugin: Kubernetes DaemonSet that advertises GPUs as allocatable resources (nvidia.com/gpu). Required on every GPU node. Enables GPU requests in Pod specs.
MIG-aware Scheduling: The GPU device plugin supports MIG device advertisement. Pods can request specific MIG profiles (e.g., nvidia.com/mig-1g.10gb).
NVIDIA GPU Operator: Kubernetes operator that automates deployment of GPU driver, device plugin, DCGM exporter, and MIG manager — the recommended production deployment path for GPU clusters on Kubernetes.
Kubeflow: ML workflow orchestration on Kubernetes — manages training jobs (PyTorchJob, TFJob), hyperparameter tuning, and pipelines.
# Pod spec requesting 2 H100 GPUs
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:24.01-py3
    resources:
      limits:
        nvidia.com/gpu: 2
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
---
# Request a MIG instance instead:
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1
| Dimension | SLURM | Kubernetes |
|---|---|---|
| Primary Use | HPC / large-scale training | Microservices / inference serving |
| Container Runtime | Enroot + Pyxis plugin | containerd / Docker + GPU device plugin |
| Job Types | Batch (sbatch), interactive (srun) | Pods, Jobs, CronJobs, custom CRDs |
| GPU Scheduling | GRES (--gres=gpu:N) | resource requests (nvidia.com/gpu: N) |
| MIG Support | Via GRES naming convention | MIG device plugin profiles |
| Multi-node Jobs | Native (--nodes, --ntasks) | Requires PyTorchJob/MPI Operator |
| Adoption | Dominant in HPC/supercomputers | Dominant in cloud-native/MLOps |
Distributed Training
Training large AI models requires distributing computation across multiple GPUs and nodes. NCCL handles the collective communications; DDP, FSDP, tensor, and pipeline parallelism handle different ways to split the model and data.
NCCL implements optimized GPU-to-GPU collective operations — the backbone of all distributed deep learning. It automatically selects the best transport: NVLink (intra-node), InfiniBand RDMA (inter-node), or Ethernet.
Key collectives:
- All-Reduce: Sum (or mean) a tensor across all GPUs, distribute result back to all. Used in DDP gradient synchronization.
- All-Gather: Each GPU contributes a shard; all GPUs receive the concatenated result. Used in FSDP parameter reconstruction.
- Reduce-Scatter: Sum across GPUs, then scatter one shard to each. Used in FSDP gradient reduction.
- Broadcast: One GPU sends a tensor to all others.
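A minimal torch.distributed sketch of these collectives (assuming a CUDA build of PyTorch; launch with torchrun so NCCL is the backend):
# Minimal sketch: NCCL collectives via torch.distributed
# (launch with: torchrun --nproc_per_node=<num_gpus> collectives.py)
import os
import torch
import torch.distributed as dist

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))   # one GPU per rank
dist.init_process_group(backend="nccl")                 # NCCL backend for GPU collectives
rank, world = dist.get_rank(), dist.get_world_size()

x = torch.ones(4, device="cuda") * (rank + 1)

# All-reduce: every rank ends up with the element-wise sum (DDP gradient sync)
dist.all_reduce(x, op=dist.ReduceOp.SUM)

# All-gather: every rank receives every rank's shard (FSDP parameter reconstruction)
shards = [torch.empty_like(x) for _ in range(world)]
dist.all_gather(shards, x)

# Reduce-scatter: sum across ranks, each rank keeps one shard (FSDP gradient reduction)
out = torch.empty_like(x)
dist.reduce_scatter(out, shards, op=dist.ReduceOp.SUM)

dist.destroy_process_group()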
Ring algorithm: Default all-reduce topology. Each GPU exchanges data with its ring neighbors: per-GPU traffic stays roughly 2× the tensor size regardless of GPU count (bandwidth-optimal), while latency grows linearly with the number of GPUs. Well suited to NVLink rings; NCCL switches to a tree algorithm for latency-sensitive small messages.
# NCCL environment variables for tuning
# Use InfiniBand HCA mlx5_0
export NCCL_IB_HCA=mlx5_0
# Specify network interface for socket fallback
export NCCL_SOCKET_IFNAME=ib0
# Enable NCCL debug logging
export NCCL_DEBUG=INFO
# Disable IB (force Ethernet)
export NCCL_IB_DISABLE=1
# Allow P2P only over NVLink-connected GPU pairs
export NCCL_P2P_LEVEL=NVL
# Verify NCCL is using NVLink
NCCL_DEBUG=INFO python train.py \
2>&1 | grep -i nvlink
3D Parallelism combines all three: Tensor Parallelism (intra-node, NVLink), Pipeline Parallelism (inter-node, InfiniBand), and Data Parallelism (across TP×PP groups). Used for frontier-scale LLM training (GPT-4 class models) on thousands of GPUs.
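A rough sketch of how the three dimensions must factor the total GPU count (the numbers are illustrative, not prescriptive):
# Sketch: TP × PP × DP must factor the total GPU count in 3D parallelism.
# Sizes are illustrative; real choices depend on model size, NVLink domain, and IB fabric.
world_size = 1024            # total GPUs
tp = 8                       # tensor-parallel width (kept inside one NVLink domain/node)
pp = 16                      # pipeline stages (span nodes over InfiniBand)
dp = world_size // (tp * pp) # data-parallel replicas across the TP×PP groups
assert tp * pp * dp == world_size
print(f"TP={tp} x PP={pp} x DP={dp} = {tp * pp * dp} GPUs")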
# Single-node, 8-GPU DDP with torchrun
torchrun \
--nproc_per_node=8 \
train_ddp.py \
--batch-size 256
# Multi-node (4 nodes, 8 GPUs each = 32 total) via SLURM+Pyxis
# MASTER_ADDR must be exported (first node of the allocation);
# single quotes defer $SLURM_NODEID expansion to each node's task
srun \
--container-image=nvcr.io/nvidia/pytorch:24.01-py3 \
--container-mounts=/lustre:/data \
bash -c 'torchrun \
--nproc_per_node=8 \
--nnodes=4 \
--node_rank=$SLURM_NODEID \
--master_addr=$MASTER_ADDR \
--master_port=29500 \
train_ddp.py'
# FSDP wrapping in Python
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
# Wrap each transformer block as its own FSDP unit
# (replace TransformerEncoderLayer with your model's block class)
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={torch.nn.TransformerEncoderLayer},
)
model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
    device_id=torch.cuda.current_device(),
)
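The train_ddp.py referenced by the torchrun commands above is not shown in this guide; a minimal DDP setup it might contain looks like the following sketch (the Linear layer stands in for a real model):
# Minimal DDP setup that a torchrun-launched train_ddp.py might contain (illustrative sketch)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")        # torchrun provides rank/world-size env vars
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # stand-in for the real model
model = DDP(model, device_ids=[local_rank])    # gradients all-reduced via NCCL after backward
# ...build a DataLoader with DistributedSampler, then run the standard training loop...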
| Model Size | GPU Count | Recommended Strategy | Key Constraint |
|---|---|---|---|
| <80 GB (fits 1× H100) | 1–8 | DDP | Batch size / throughput |
| 80–640 GB | 8–64 | FSDP | HBM per GPU |
| 640 GB – few TB | 64–256 | FSDP + Tensor Parallelism | NVLink bandwidth |
| Multi-TB (GPT-4 class) | 256–10,000+ | 3D Parallelism (TP+PP+DP) | IB bandwidth, pipeline bubbles |
Inference Serving & MLOps
Getting trained models into production requires optimized inference engines, serving infrastructure, and monitoring. NVIDIA's inference stack centers on TensorRT, Triton Inference Server, and NIM.
TensorRT takes a trained model (ONNX, PyTorch, TensorFlow) and produces an optimized serialized engine for a specific GPU target. Optimizations include:
- Layer fusion: Combines sequential ops (e.g., Conv + BN + ReLU → single kernel)
- Precision calibration: FP32 → FP16 → INT8 → FP8 with calibration dataset
- Kernel auto-selection: Benchmarks alternative CUDA kernels, picks fastest for target GPU
- Memory optimization: Reuses buffers across layers to minimize HBM allocation
Engine files are GPU-specific: An engine built for H100 (SM 9.0) cannot run on A100 (SM 8.0). Always build on the deployment target GPU.
# Export PyTorch model to ONNX
torch.onnx.export(model, dummy_input,
"model.onnx",
opset_version=17,
input_names=["input"],
output_names=["output"],
dynamic_axes={"input": {0: "batch"}})
# Build TensorRT engine (FP16)
trtexec \
--onnx=model.onnx \
--fp16 \
--saveEngine=model_fp16.engine \
--minShapes=input:1x3x224x224 \
--optShapes=input:32x3x224x224 \
--maxShapes=input:64x3x224x224
# Run benchmark
trtexec \
--loadEngine=model_fp16.engine \
--shapes=input:32x3x224x224 --iterations=100
Triton (not to be confused with the Triton kernel language) is NVIDIA's production inference serving system. It serves models from multiple frameworks simultaneously and exposes HTTP and gRPC endpoints.
Key features:
- Dynamic batching: Automatically groups requests from multiple clients into batches for higher GPU utilization
- Concurrent model execution: Multiple models run simultaneously on different GPU contexts
- Model ensembles: Chain multiple models in a DAG (e.g., pre-process → model → post-process) as a single endpoint
- Backends: TensorRT, ONNX Runtime, PyTorch (LibTorch), TensorFlow, Python, FIL (forests)
- Metrics: Prometheus-compatible endpoint for latency, throughput, queue depth per model
# Model repository structure
models/
resnet50/
config.pbtxt # model config
1/
model.plan # TensorRT engine
bert-large/
config.pbtxt
1/
model.onnx
# Launch Triton server (ports: 8000 HTTP, 8001 gRPC, 8002 metrics)
docker run --gpus all \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v $(pwd)/models:/models \
nvcr.io/nvidia/tritonserver:24.01-py3 \
tritonserver \
--model-repository=/models
# Health check
curl localhost:8000/v2/health/ready
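With the server running, clients send requests over HTTP and dynamic batching groups them server-side. A minimal sketch using the tritonclient package against the resnet50 model from the repository above; the tensor names "input"/"output" are assumptions that must match the model's config.pbtxt:
# Sketch: query Triton over HTTP (pip install tritonclient[http])
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = [httpclient.InferInput("input", list(batch.shape), "FP32")]
inputs[0].set_data_from_numpy(batch)

# Concurrent requests from many clients are grouped into batches by the server
result = client.infer(model_name="resnet50", inputs=inputs)
print(result.as_numpy("output").shape)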
NIM packages optimized models (LLMs, vision, speech) as containerized microservices with pre-built TensorRT engines. A single docker run deploys a production-ready inference endpoint — no manual TensorRT engine building required.
What's inside a NIM container: Triton Inference Server + pre-built TensorRT-LLM engines + model weights + API server (OpenAI-compatible REST API). Supports continuous batching for LLMs (token-by-token generation with dynamic slot allocation).
TensorRT-LLM: NVIDIA's LLM-specific inference library. Implements paged KV cache, in-flight batching (continuous batching), speculative decoding, and FP8 quantization for H100. Substantially higher token throughput vs vanilla PyTorch inference.
# Deploy Llama 3 70B NIM
docker run --gpus all \
-e NGC_API_KEY=$NGC_API_KEY \
-p 8000:8000 \
-v /local/nim/cache:/opt/nim/.cache \
nvcr.io/nim/meta/llama3-70b-instruct:latest
# Inference via OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta/llama3-70b-instruct",
"messages": [{"role":"user","content":"Hello"}],
"max_tokens": 200
}'
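Because the NIM endpoint is OpenAI-compatible, the same request works from Python with the openai client; a short sketch (the api_key value is a placeholder for a local deployment):
# Sketch: call the NIM endpoint with the openai client (pip install openai)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

resp = client.chat.completions.create(
    model="meta/llama3-70b-instruct",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=200,
)
print(resp.choices[0].message.content)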
| Phase | Tool / Service | NVIDIA Integration |
|---|---|---|
| Experiment tracking | MLflow, Weights & Biases (W&B) | Auto-logs GPU metrics, loss curves, hyperparams |
| Distributed training orchestration | SLURM + Pyxis, Kubeflow, Ray Train | Native GPU scheduling; NCCL for collectives |
| Model registry | MLflow Registry, NGC Private Registry | Store TensorRT engines and NIM-ready checkpoints |
| Hyperparameter tuning | Optuna, Ray Tune, NNI | Runs on GPU cluster; early stopping via DCGM metrics |
| Inference optimization | TensorRT, TensorRT-LLM | Target GPU SM architecture; FP8 calibration on H100 |
| Serving | Triton Inference Server, NIM | Dynamic batching; Prometheus/Grafana monitoring |
| Cluster monitoring | DCGM + Prometheus + Grafana | GPU health, power, thermal, utilization dashboards |
Practice Quiz
10 questions covering CUDA, MIG, containers, SLURM, NCCL, distributed training, and inference serving.
Memory Hooks & Advisor
Mnemonics, patterns, and quick-reference guidance for the most exam-critical software stack and distributed training concepts.
- H100 CUDA compute capability → 9.0 (V100=7.0 · A100=8.0 · H100=9.0 · B200=10.0)
- Max MIG instances on H100 → 7 (1g.10gb profile); each has dedicated SMs, HBM, L2, and memory controllers
- What does NCCL provide? → GPU collective communications: all-reduce · all-gather · reduce-scatter · broadcast
- SLURM plugin for NGC containers → Pyxis (uses the Enroot runtime); adds `--container-image` to sbatch/srun
- DDP vs FSDP, key difference → DDP keeps a full model copy per GPU; FSDP shards params + grads + optimizer states → fits much larger models
- Triton Inference Server dynamic batching → groups concurrent client requests into batches before GPU execution, maximizing GPU utilization at inference time
- NCCL env var for InfiniBand HCA → `NCCL_IB_HCA=mlx5_0` (vs `NCCL_SOCKET_IFNAME` for Ethernet/socket fallback)
- 3D Parallelism dimensions → Tensor (intra-node, NVLink) + Pipeline (inter-node, IB) + Data (across TP×PP groups)
⚡ CUDA & MIG
- Compute capability identifies GPU architecture features: V100=7.0, A100=8.0, H100=9.0, B200=10.0. Always compile for the deployment GPU's SM version (`-arch=sm_90` for H100).
- The driver's supported CUDA version must be ≥ the installed CUDA toolkit version. The nvidia-smi "CUDA Version" field shows the max toolkit version the driver supports — not what's installed.
- MIG partitions one physical GPU into up to 7 isolated instances (H100: 1g.10gb × 7). Each instance has dedicated SMs, HBM, L2, and memory controllers — full fault isolation.
- MIG instances cannot communicate via NVLink. Inter-instance traffic routes through host CPU/PCIe. Use MIG for multi-tenant inference, not distributed training.
- Enable MIG with `sudo nvidia-smi -i 0 -mig 1`. Create instances with `sudo nvidia-smi mig -cgi <profile_id> -C`. List profiles with `sudo nvidia-smi mig -lgip`.
- DCGM supports MIG-aware health monitoring — each MIG instance reported independently. The Kubernetes GPU device plugin also exposes MIG instances as schedulable resources.
📦 Containers & Schedulers
- NGC containers at `nvcr.io` package framework + CUDA + cuDNN + libraries validated together. Tagged as YY.MM (year.month). Pull the exact tag whose CUDA version your GPU driver supports.
- SLURM + Pyxis: add `--container-image=nvcr.io/nvidia/pytorch:24.01-py3` and `--container-mounts=/lustre:/data` to sbatch/srun. Pyxis handles image conversion via Enroot at job start.
- SLURM GPU allocation: `#SBATCH --gres=gpu:8` (8 GPUs per node). Use `--gpus-per-task=1` for fine-grained per-rank GPU assignment.
- Kubernetes GPU device plugin: DaemonSet that advertises `nvidia.com/gpu` as a resource. Request GPUs with `resources.limits["nvidia.com/gpu"]: 2` in the Pod spec.
- NVIDIA GPU Operator: single Helm chart that deploys driver, device plugin, DCGM exporter, and MIG manager on Kubernetes GPU nodes — preferred production deployment method.
- Enroot: converts Docker/OCI images to unprivileged squashfs containers. Works without a root daemon — compatible with HPC multi-user systems. Pyxis = SLURM integration layer on top of Enroot.
🔀 Distributed Training
- NCCL all-reduce uses the ring algorithm by default: per-GPU traffic stays roughly constant (≈2× tensor size) regardless of GPU count, but latency grows with ring length. Alternative: tree algorithm (lower latency for small messages).
- DDP (`torch.nn.parallel.DistributedDataParallel`): full model copy per GPU, data sharded, gradients all-reduced after backward. Simplest — use when the model fits in one GPU's HBM.
- FSDP (`FullyShardedDataParallel`): shards params + grads + optimizer states. Uses all-gather before forward (reconstruct params), reduce-scatter after backward (accumulate grad shards). LLaMA 70B example: ~700 GB of training state ÷ 8 ranks ≈ 87.5 GB/rank.
- Tensor Parallelism: splits weight matrices within layers across GPUs (Megatron-style). Needs NVLink bandwidth; typically intra-node only. All-reduce required after each matmul.
- Pipeline Parallelism: assigns consecutive transformer layers to different GPU stages. Micro-batches flow through the pipeline. Communication = activations only. Suitable for inter-node (IB). Introduces pipeline "bubble" idle time.
- 3D Parallelism: TP (within node, NVLink) × PP (between nodes, IB) × DP (across pipeline+tensor groups). Used for trillion-parameter frontier model training on thousands of GPUs.
🚀 Inference & Triton
- TensorRT engine files are GPU-specific — an engine built for H100 (sm_90) will not run on A100 (sm_80). Always build TensorRT engines on the production target hardware.
- TensorRT workflow: export PyTorch/TF model → ONNX → `trtexec` (or Python API) → `.engine` file. Specify precision (`--fp16`, `--int8`) and dynamic shape ranges at build time.
- Triton backends: TensorRT (`.plan`), ONNX Runtime (`.onnx`), LibTorch (`.pt`), TensorFlow SavedModel, Python (custom logic), FIL (forest models).
- Triton dynamic batching: groups requests arriving within a time window into a single batch. Configure with `max_queue_delay_microseconds` and `preferred_batch_size` in `config.pbtxt`.
- NIM (NVIDIA Inference Microservices): pre-built TensorRT-LLM engines + Triton + OpenAI-compatible API in a single container. Supports continuous batching for LLMs. Deploy with `docker run`; provide an NGC API key to pull model weights.
- TensorRT-LLM features for H100: paged KV cache (handles variable-length sequences efficiently), in-flight batching (continuous batching), speculative decoding, FP8 quantization via Transformer Engine.
📊 Cluster MLOps
- DCGM (Data Center GPU Manager): cluster-scale GPU health monitoring. Exports metrics to Prometheus via the `dcgm-exporter` sidecar. Supports health checks, power capping, and field-group telemetry.
- Experiment tracking: W&B and MLflow auto-log GPU metrics (utilization, temperature, memory) alongside training loss, learning rate, and custom metrics. Essential for reproducibility.
- Checkpoint strategy: save to local NVMe (fast) via GDS, then async copy to Lustre/WEKA (shared), then optionally to S3 (archival). Use FSDP `state_dict_type=SHARDED_STATE_DICT` for parallel save.
- Model registry: NGC Private Registry stores team-internal containers and model artifacts. Store TensorRT engines with GPU-architecture metadata — prevents deploying a wrong-arch engine.
- Power capping at cluster scale: set a group-level power limit through DCGM (`dcgmi config`) or per GPU with `nvidia-smi -pl <watts>`. Useful for shared-cluster fairness and preventing PSU overload during simultaneous GPU boost.
- SLURM accounting: `sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,TRESUsageInAve` surfaces GPU usage via TRES fields when accounting gathering is configured. Track GPU utilization per job for cluster efficiency auditing — identify underutilized GPU allocations.