NVMe SSDs, Lustre & WEKA parallel filesystems, GPUDirect Storage, DALI pipelines, and checkpoint strategies. The I/O layer that keeps your GPUs fed and your training runs alive.
GPU compute has scaled faster than storage bandwidth over the past decade. A DGX H100 delivers ~32 PFLOPS of FP8 compute but can read only ~50 GB/s from its local NVMe. This creates the I/O wall: without a high-throughput storage architecture, expensive GPUs sit idle waiting for data. Solving the storage problem is what separates a well-designed AI cluster from an expensive bottleneck.
| Storage Technology | Bandwidth | Capacity | Latency | Best AI Use Case |
|---|---|---|---|---|
| NVMe Gen 4 SSD (per drive) | ~7 GB/s seq read | 3–8 TB/drive | ~100 µs | Local checkpoint staging, scratch |
| NVMe Gen 5 SSD (per drive) | ~14 GB/s seq read | 4–8 TB/drive | ~50 µs | Ultra-fast local scratch, GDS-enabled |
| Lustre (cluster) | 100s GB/s–1 TB/s | Petabytes | 1–5 ms | Training datasets, shared checkpoints |
| WEKA (cluster) | Up to 1+ TB/s | Petabytes | <1 ms | All-flash AI, native S3, GDS-optimized |
| NFS v4 | 1–5 GB/s | Varies | 1–10 ms | Not recommended — MDS bottleneck |
| AWS S3 / Object | 1–10 GB/s | Unlimited | 10–100 ms | Raw dataset storage, model archives |
NVMe (Non-Volatile Memory Express) is the protocol standard for connecting SSDs over PCIe. It replaced older SATA and SAS protocols by taking advantage of PCIe lanes for massive parallelism. Understanding NVMe is foundational to AI storage design — it's the fastest per-node storage available.
| Parameter | Value |
|---|---|
| Drive count | 8× NVMe SSDs |
| Capacity per drive | 3.84 TB |
| Total capacity | ~30 TB |
| NVMe generation | PCIe Gen 4 NVMe |
| Sequential read (per drive) | ~7 GB/s |
| Aggregate sequential read | ~50–56 GB/s (all 8 drives parallel) |
| Aggregate sequential write | ~40–48 GB/s |
| Random 4K IOPS (per drive) | ~1M IOPS |
| Access latency | ~100 µs |
| Primary use | Checkpoint staging, dataset cache, scratch |
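On DGX systems the data drives are typically aggregated into a single RAID-0 scratch volume. A minimal sketch with mdadm, assuming DGX-style device names and the conventional /raid mount point:

# Stripe all 8 NVMe drives into one RAID-0 block device
mdadm --create /dev/md0 --level=0 --raid-devices=8 /dev/nvme[0-7]n1
# Format and mount as local scratch
mkfs.ext4 /dev/md0
mount /dev/md0 /raid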
NVMe supports up to 65,535 I/O queues with up to 65,535 commands per queue. Compare this to AHCI (SATA protocol): only 1 queue with 32 commands. This parallelism is why NVMe can hit 1M+ IOPS while SATA tops out at ~100K IOPS.
An NVMe namespace is a logical partition of an NVMe drive — analogous to a disk partition but more flexible. A single device can expose multiple namespaces. In multi-tenant environments (MIG or virtualized), different namespaces can be assigned to different workloads for isolation.
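These details are easy to inspect with the nvme-cli tool; a short sketch, assuming a drive at /dev/nvme0:

# List all NVMe controllers and namespaces visible to this host
nvme list
# Dump controller capabilities, including queue and namespace limits
nvme id-ctrl /dev/nvme0
# List namespace IDs on the controller
nvme list-ns /dev/nvme0
# Show capacity and block format for namespace 1
nvme id-ns /dev/nvme0n1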
NVMe-oF extends the NVMe protocol over a network fabric (RDMA/RoCEv2 or Fibre Channel), allowing remote NVMe drives to appear as local devices with near-local latency. Used in high-performance shared storage arrays for AI clusters. The WEKA filesystem uses NVMe-oF internally between nodes.
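A hedged sketch of bringing up an NVMe-oF connection with nvme-cli (the transport, address, and NQN below are placeholder values):

# Discover subsystems exported by a remote target over RDMA
nvme discover -t rdma -a 192.168.10.5 -s 4420
# Connect; the remote namespace then appears as a local /dev/nvmeXnY device
nvme connect -t rdma -n nqn.2024-01.io.example:subsys1 -a 192.168.10.5 -s 4420
nvme list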
A single DGX H100 has ~50 GB/s of local NVMe bandwidth. A DGX SuperPOD with 32 nodes therefore has 32 × 50 GB/s = 1.6 TB/s of aggregate local bandwidth, but spread across 32 separate filesystems. Training a large model across all 256 GPUs requires shared access to training data and checkpoint directories. Parallel filesystems solve this by distributing storage across many nodes while presenting a unified namespace to all clients.
Lustre is the most widely deployed HPC and AI filesystem. It separates metadata (file names, permissions, directory structure) from data (file contents), scaling each independently. Clients access files by first talking to the MDS for metadata, then reading/writing data directly from OSS nodes in parallel.
| Component | Role | Exam Key Fact |
|---|---|---|
| MDS (Metadata Server) | Manages file metadata: names, permissions, inode info, which OSTs hold file data | Single point of failure if no HA pair |
| MDT (Metadata Target) | Storage device on MDS — holds metadata. Usually a fast SSD array | MDT IOPS = metadata performance ceiling |
| OSS (Object Storage Server) | Serves data I/O for one or more OSTs to clients | Add more OSS nodes = more bandwidth |
| OST (Object Storage Target) | Disk/array on OSS — stores actual file data objects | Stripe across OSTs for parallel I/O |
| Lustre Client | Kernel module on compute node that mounts /lustre and intercepts I/O calls | Loaded via: modprobe lustre |
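On a compute node, the pieces come together at mount time; a minimal sketch where the MGS NID and filesystem name are assumptions:

# Load the Lustre client kernel module
modprobe lustre
# Mount by pointing at the MGS NID and the filesystem name
mount -t lustre 10.0.0.1@o2ib:/lfs01 /lustre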
Lustre striping splits a file across multiple OSTs so reads/writes can happen in parallel. A large dataset file striped across 8 OSTs can be read at 8× the speed of a single-OST file. For AI training, always stripe large files.
# Set stripe count (8 OSTs) and size (4MB chunks) for a directory
lfs setstripe -c 8 -S 4m /lustre/training_data
# Check stripe settings on a file
lfs getstripe /lustre/training_data/dataset.tar
# Check filesystem usage
lfs df -h /lustre
# Find files with suboptimal stripe count
lfs find /lustre/checkpoints --stripe-count 1
| lfs Parameter | Meaning | AI Recommendation |
|---|---|---|
| -c N (stripe count) | Number of OSTs to stripe across | Set to number of OSTs or -1 for all |
| -S size (stripe size) | Chunk size per OST before moving to next | 1–4 MB for sequential, smaller for random |
| -i N (start OST) | Which OST to start striping from | Leave default (-1 = round-robin) |
Lustre performance scales linearly with OSS/OST nodes. A well-configured Lustre system for a DGX SuperPOD targets 1 TB/s+ aggregate read bandwidth for dataset loading, achieved by deploying enough OSS nodes with NVMe-backed OSTs. The MDS must also be sized to handle the metadata load from 256 clients opening millions of small files simultaneously.
WEKA is a commercial all-flash parallel filesystem built from scratch for NVMe and RDMA. Unlike Lustre, whose architecture originated in the HDD era and was later adapted for flash, WEKA was designed entirely around flash storage characteristics. It delivers sub-millisecond latency and a native S3 API, and it is an NVIDIA-validated storage platform for DGX.
| Feature | WEKA | Lustre |
|---|---|---|
| Architecture origin | Flash-native (built for NVMe) | HDD-era, adapted for flash |
| Peak bandwidth | 1+ TB/s aggregate | 100s GB/s – 1 TB/s (OSS-limited) |
| Latency | <1 ms (sub-ms for local) | 1–5 ms |
| S3 API | Native (no gateway) | Requires S3 gateway add-on |
| POSIX compliance | Full POSIX | Full POSIX |
| GPUDirect Storage | Certified & optimized | Supported (requires config) |
| Management | GUI + CLI | Command-line / manual |
| License | Commercial | Open source (GPL) |
| Typical use | All-flash AI clusters, DGX POD | Academic HPC, budget-sensitive clusters |
| Filesystem | Type | Bandwidth | Strengths | Weaknesses |
|---|---|---|---|---|
| IBM Spectrum Scale (GPFS) | Commercial parallel FS | 100s GB/s | Enterprise features, HSM tiering to tape, strong support | Cost, complexity |
| BeeGFS | Open source parallel FS | 10–100 GB/s | Easy setup, commodity hardware, academic HPC standard | Lower performance ceiling |
| DDN EXAScaler (Lustre) | Commercial Lustre distrib. | 1+ TB/s | Lustre compatibility, enterprise support, high scale | Cost |
| NFS v4 | Network FS | 1–5 GB/s | Simple, universal | Single MDS bottleneck, not parallel |
GPUDirect is NVIDIA's family of technologies that enable data to move directly to and from GPU HBM memory without passing through the CPU and system DRAM. Each variant eliminates a different copy on the data path, dramatically reducing latency and freeing CPU resources for other work.
| Technology | Data Path | What It Eliminates | Use Case |
|---|---|---|---|
| GPUDirect Storage (GDS) | NVMe ↔ GPU HBM (DMA) | CPU + system DRAM copy | Checkpoint save/load, dataset reading |
| GPUDirect RDMA | Remote NIC ↔ GPU HBM (DMA) | CPU + system DRAM copy on network path | Multi-node NCCL AllReduce, P2P transfers |
| GPUDirect P2P | GPU A ↔ GPU B (NVLink or PCIe) | CPU + system DRAM bounce copy | Tensor/pipeline parallel intra-node |
| GPUDirect Video | Video capture ↔ GPU HBM | CPU frame decode overhead | Video AI inference pipelines |
GPUDirect Storage (GDS) is the most exam-critical GPUDirect technology for NCP-AII storage topics. It enables a DMA engine to transfer data directly between NVMe storage (local or NVMe-oF) and GPU HBM — the CPU is completely uninvolved in the data transfer.
| Requirement | Component | Notes |
|---|---|---|
| CUDA API | cuFile API | Replaces standard POSIX file I/O for GPU-direct reads/writes |
| Kernel driver | nvidia-fs module | Loaded with modprobe nvidia-fs, part of CUDA toolkit |
| Storage compatibility | NVMe local or NVMe-oF | Must be NVMe — SATA and HDD are not supported |
| Filesystem | ext4, xfs, WEKA, Lustre | NFS/CIFS are NOT supported by GDS |
| Driver version | CUDA 11.4+ / NVIDIA 470+ | GDS generally available from CUDA 11.4 |
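One quick way to confirm a node is GDS-ready, assuming a default CUDA toolkit install path for the gdscheck tool:

# Report GDS support: driver status, filesystem, and PCIe topology checks
/usr/local/cuda/gds/tools/gdscheck -p
# Confirm the nvidia-fs kernel module is loaded
lsmod | grep nvidia_fs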
// Minimal GPUDirect Storage read example (error checks omitted for brevity)
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>
CUfileDescr_t cf_desc = {0};
CUfileHandle_t cf_handle;
void *gpu_ptr; // already cudaMalloc'd, at least file_size bytes
// Open file with O_DIRECT — required so the page cache is bypassed
int fd = open("/lustre/model_weights.bin", O_RDONLY | O_DIRECT);
cf_desc.handle.fd = fd;
cf_desc.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
cuFileDriverOpen(); // initialize the nvidia-fs driver session
cuFileHandleRegister(&cf_handle, &cf_desc);
// DMA: NVMe → GPU HBM (no CPU copy); file_size assumed known (e.g., from fstat)
cuFileRead(cf_handle, gpu_ptr, file_size, 0, 0);
cuFileHandleDeregister(cf_handle);
cuFileDriverClose();
close(fd);
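To build a standalone test from this sketch, link against the cuFile library (for example, nvcc gds_read.cu -lcufile); a real program would obtain file_size via fstat on the descriptor before issuing the read.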
GPUDirect RDMA enables a ConnectX-7 NIC to DMA data directly from a remote server's GPU HBM into the local GPU HBM, bypassing both CPUs and system DRAMs in the transfer path. This is critical for multi-node AllReduce performance — without RDMA, every inter-node tensor transfer passes through two CPUs and two system memory copies.
# Load the kernel module that exposes GPU memory to the RDMA stack
modprobe nvidia-peermem
# Verify it is loaded
lsmod | grep nvidia_peermem
# NCCL uses GPUDirect RDMA automatically when available; raising the GDR level
# allows GPU-direct NIC DMA across more of the PCIe topology
export NCCL_NET_GDR_LEVEL=2
# Confirm the chosen paths in the NCCL log
export NCCL_DEBUG=INFO
GPUDirect P2P enables GPU A to read from or write to GPU B's HBM directly over NVLink or PCIe — without staging through system DRAM. In DGX H100 with NVLink 4, P2P transfers at 900 GB/s make tensor parallelism within a node nearly as fast as accessing local HBM. Enable with cudaDeviceEnablePeerAccess(peerDevice, 0) and verify with cudaDeviceCanAccessPeer(&canAccess, devA, devB).
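Before relying on P2P, it is worth checking which GPU pairs actually have an NVLink path; a short sketch (the bandwidth test ships with the CUDA samples, and its build path varies by install):

# Interconnect matrix: NV# entries = NVLink hops, PIX/PXB/PHB = PCIe paths
nvidia-smi topo -m
# Measure achieved P2P bandwidth and latency between all GPU pairs
./p2pBandwidthLatencyTest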
Even with fast NVMe and parallel filesystems, data loading can bottleneck training if the software pipeline is inefficient. The goal is to keep GPU compute 100% utilized by ensuring the next mini-batch of data is always ready before the GPU needs it — this requires pipelining storage reads, preprocessing, and GPU transfer in parallel.
NVIDIA DALI (Data Loading Library) replaces the CPU-side data preprocessing with GPU execution, dramatically reducing data loading overhead for vision and NLP workloads. DALI performs decoding, augmentation, normalization, and format conversion directly on the GPU.
| Feature | PyTorch DataLoader | NVIDIA DALI |
|---|---|---|
| Decode (JPEG) | CPU libjpeg | GPU (nvJPEG) — up to 10× faster |
| Augmentation | CPU (torchvision) | GPU — eliminates CPU bottleneck |
| Pipeline overlap | Limited (Python GIL) | Full async prefetch with CUDA streams |
| Multi-GPU | Replicate per worker | Native multi-GPU shard support |
| Video support | Limited | Yes — GPU video decode |
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali.pipeline import pipeline_def
@pipeline_def(batch_size=256, num_threads=4, device_id=0)
def training_pipeline():
jpegs, labels = fn.readers.file(file_root="/lustre/imagenet/train",
random_shuffle=True)
# Decode directly to GPU
images = fn.decoders.image(jpegs, device="mixed",
output_type=types.RGB)
# Augment on GPU
images = fn.random_resized_crop(images, size=224, device="gpu")
images = fn.crop_mirror_normalize(images, device="gpu",
mean=[0.485*255, 0.456*255, 0.406*255],
std=[0.229*255, 0.224*255, 0.225*255])
return images, labels
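To instantiate and run the decorated pipeline above (a minimal sketch; in PyTorch training you would normally wrap it with the DALIGenericIterator plugin rather than calling run() directly):

pipe = training_pipeline()  # the decorator turns the function into a Pipeline factory
pipe.build()
images, labels = pipe.run()  # one batch: images on GPU, labels still on CPU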
Pinned (page-locked) memory on the CPU side enables direct DMA from CPU DRAM to GPU HBM over PCIe — no intermediate copy is needed. Pageable memory requires the OS to first copy data into a pinned staging buffer before PCIe DMA can occur.
# PyTorch: enable pinned memory in DataLoader
loader = torch.utils.data.DataLoader(
dataset, batch_size=256,
pin_memory=True, # ← enables direct DMA to GPU
num_workers=8, # ← parallel CPU data loading
prefetch_factor=2 # ← prefetch 2 batches ahead
)
Set num_workers to roughly 2–4 per GPU: too few creates an I/O bottleneck, too many causes CPU contention. Always enable pin_memory=True when training on GPU.

Checkpointing saves model state (weights + optimizer state + training metadata) periodically during training. For large LLMs, checkpoints are hundreds of GB and must be saved quickly to minimize GPU idle time.
| Strategy | How | Speed | Best For |
|---|---|---|---|
| Full checkpoint | Rank 0 gathers all shards → saves to single file | Slow (single writer, bottlenecked) | Small models, simple restore |
| Sharded checkpoint (FSDP) | Each rank saves its own shard independently in parallel | Fast (N writers in parallel) | LLM training with FSDP or DeepSpeed |
| Async checkpoint | Copy weights to CPU RAM in background while GPU continues training | Near-zero GPU pause | Frequent checkpointing without overhead |
| GDS-accelerated | GPU → NVMe DMA via cuFile (no CPU) | 40–50% faster than CPU-routed | Large checkpoints to local NVMe |
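A rough sketch of the sharded pattern, assuming an initialized torch.distributed process group (the function name and checkpoint directory are hypothetical; FSDP and DeepSpeed ship their own savers):

import os
import torch
import torch.distributed as dist

def save_sharded_checkpoint(shard_state, step, ckpt_dir="/raid/ckpt"):
    # Each rank writes only its own shard: N parallel writers, no rank-0 gather
    path = os.path.join(ckpt_dir, f"step_{step}_rank_{dist.get_rank()}.pt")
    torch.save(shard_state, path)
    # All shards must land on disk before the checkpoint is considered complete
    dist.barrier()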
How you store training data is as important as what storage system you use. Millions of small files destroy filesystem metadata performance (Lustre MDS IOPS) and cause random I/O patterns on HDDs/SSDs. Packaging samples into large sequential archives enables high-throughput streaming reads.
| Format | Structure | Read Pattern | Throughput | AI Usage |
|---|---|---|---|---|
| WebDataset | TAR archives (key-value pairs inside) | Sequential (fast) | Excellent | CV training, ImageNet-scale |
| MosaicML Streaming | Indexed binary shards | Sequential + indexed | Excellent | LLM pre-training |
| TFRecord | Protocol buffer records | Sequential | Good | TensorFlow workloads |
| HDF5 | Hierarchical binary format | Random or sequential | Good | Scientific/medical AI |
| Raw files (JPEG/PNG) | Individual files per sample | Random (kills MDS) | Poor at scale | Avoid for large datasets |
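As a small illustration of the packaged-archive approach, a WebDataset read sketch (the shard pattern and sample keys are assumptions about how the archives were packed):

import webdataset as wds
# Stream samples sequentially from sharded TAR archives; no per-file metadata lookups
dataset = (
    wds.WebDataset("/lustre/imagenet/shards/train-{000000..000146}.tar")
    .decode("rgb")             # decode image bytes to arrays
    .to_tuple("jpg", "cls")    # (image, label) pairs keyed by extension
)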
Exam-focused quick reference:

| Task | Command / Key Fact |
|---|---|
| Benchmark raw NVMe | fio --name=seq_read --rw=read --bs=1m --ioengine=libaio --iodepth=32 --filename=/dev/nvme0n1 --direct=1 --size=10g; for GDS-enabled benchmarks use NVIDIA's gdsio tool |
| Stripe large files for parallel I/O | lfs setstripe -c 8 -S 4m /path; verify with lfs getstripe /path/file |
| Stripe checkpoints across all OSTs | lfs setstripe -c -1 -S 4m /lustre/checkpoints (-c -1 = all OSTs); set at directory level so new files inherit it |
| Inspect stripe settings and usage | lfs getstripe -r /lustre/training_data (-r = recursive); disk usage: lfs df -h /lustre; MDT usage: lfs df --mdt /lustre |
| GDS on Lustre | Load the nvidia-fs module, then use cuFile with standard Lustre paths; no special Lustre client config beyond the GDS driver |
| Verify GDS setup | modprobe nvidia-fs; verify: lsmod \| grep nvidia_fs; config: cat /proc/driver/nvidia-fs/params; test: gdsio -f /dev/nvme0n1 -d 0 -w 0 -s 1g |
| Enable GPUDirect RDMA | modprobe nvidia-peermem; NCCL uses it automatically when available; raise GDR scope with NCCL_NET_GDR_LEVEL=2 |
| Enable GPU P2P | cudaDeviceEnablePeerAccess(peerDev, 0); over NVLink, always available with near-HBM bandwidth for inter-GPU tensor copies |
| GDS sharded checkpointing | Each rank issues cuFileWrite to its own shard file on local NVMe: fully parallel, no rank 0 bottleneck |
| DALI GPU decode | import nvidia.dali.fn as fn; key op: fn.decoders.image(..., device="mixed") decodes JPEG on the GPU |
| Pinned memory | pin_memory=True in DataLoader (or cudaHostAlloc) enables direct DMA from CPU DRAM to GPU HBM, avoiding the OS copy from pageable to a pinned staging buffer |
| Diagnose data-loading bottlenecks | Run nvidia-smi dmon and check GPU utilization (should be >90%); overlap transfers with compute via cudaMemcpyAsync |
| Dataset packaging | WebDataset (pip install webdataset) for large CV datasets: sequential TAR reads, no small-file IOPS; MosaicML Streaming for LLM pre-training with random access across shuffled shards |
| Async checkpointing | Copy state to CPU RAM in a background thread (e.g., AsyncCheckpointingManager or custom threading) while the GPU keeps training |
| Checkpoint tiering | Local NVMe → rsync/cp to Lustre/WEKA → optional offload to S3; each stage uses the fastest available path |
| Checkpoint validation | After restore, compare parameters with torch.allclose(original_param, loaded_param) to catch silent corruption; a single corrupt shard can corrupt a whole multi-rank restore |

| Fact | Mnemonic / Hook |
|---|---|
| NVMe Gen 4 = 7 GB/s | "7 GB/s — 7 days in a week, one week of HDD performance per second" |
| DGX H100 NVMe: 8 × 3.84 TB = 30 TB | "8 drives × 4 TB ≈ 30 TB — like 30 TB 'cloud' living inside your server" |
| GDS = cuFile + nvidia-fs | "GPU Direct Storage = C-u-File + nvidia-FS kernel — two parts, one path" |
| Lustre stripe: -c count, -S size | "lfs setstripe: -c for Count, -S for Size. C comes before S alphabetically." |
| LLaMA 70B full checkpoint = ~700 GB | "70B params × 10 bytes avg (weights + optimizer) ≈ 700 GB = 10 H100s of HBM" |
| Small files = MDS bottleneck | "Millions of tiny files = millions of MDS knocks. Use TAR/WebDataset to knock once." |
| DALI = GPU decode | "DALI (Data Loading Library) moves decoding to the GPU: no CPU in the JPEG path" |