💾 NCP-AII  ·  Topic 3  ·  Storage Systems

Storage Systems for AI Workloads

NVMe SSDs, Lustre & WEKA parallel filesystems, GPUDirect Storage, DALI pipelines, and checkpoint strategies. The I/O layer that keeps your GPUs fed and your training runs alive.

7 GB/s per NVMe  ·  1+ TB/s WEKA peak  ·  ~700 GB full 70B checkpoint  ·  0 CPU copies with GDS

🗺️ The AI Storage Problem

GPU compute has scaled faster than storage bandwidth over the past decade. A DGX H100 can process data at ~32 PFLOPS FP8, but can only read from disk at ~50 GB/s from local NVMe. This creates the I/O wall: without a high-throughput storage architecture, expensive GPUs sit idle waiting for data. Solving the storage problem is what separates a well-designed AI cluster from an expensive bottleneck.

🔑 Three Storage Demands of AI Training (1) Dataset reads: streaming training data to GPUs fast enough that data loading never becomes the bottleneck. (2) Checkpointing: saving model state every N steps — large models mean hundreds of GB per checkpoint. (3) Scratch I/O: temporary activation storage, gradient accumulation, intermediate tensors.
💿
Local NVMe
Per-node fast scratch: ~7 GB/s per drive, 30 TB in DGX H100
🌐
Parallel Filesystems
Lustre, WEKA — cluster-wide 100+ GB/s shared storage
⚡
GPUDirect Storage
NVMe → GPU DMA bypassing CPU — cuFile API, nvidia-fs
🔄
Data Pipelines
DALI, pinned memory, prefetch, WebDataset streaming
💼
Checkpointing
Full vs sharded, async staging, GDS-accelerated saves

📊 AI Storage Hierarchy — Full Picture

🔴 GPU HBM3 (H100)
80 GB per GPU · 3,350 GB/s bandwidth · active model weights & activations
Fastest · 3,350 GB/s
🟠 Local NVMe SSD (DGX H100)
30 TB total (8× 3.84 TB) · ~50 GB/s aggregate · checkpoint staging, scratch
~50 GB/s
🟡 Shared Parallel FS (Lustre / WEKA)
Petabytes · 100–1,000+ GB/s aggregate · shared datasets, checkpoints
100–1,000 GB/s
🟢 All-Flash Array (NVMe-oF)
High-performance shared block storage over NVMe over Fabrics (RDMA)
10–100 GB/s
🔵 Object Storage (S3-compatible)
Petabytes · 1–10 GB/s · raw datasets, model archives, long-term retention
1–10 GB/s
⚫ Tape / Cold Archive
Exabytes · <1 GB/s · compliance, raw data retention, disaster recovery
Slowest · <1 GB/s

⚖️ Technology Comparison at a Glance

| Storage Technology | Bandwidth | Capacity | Latency | Best AI Use Case |
|---|---|---|---|---|
| NVMe Gen 4 SSD (per drive) | ~7 GB/s seq read | 3–8 TB/drive | ~100 µs | Local checkpoint staging, scratch |
| NVMe Gen 5 SSD (per drive) | ~14 GB/s seq read | 4–8 TB/drive | ~50 µs | Ultra-fast local scratch, GDS-enabled |
| Lustre (cluster) | 100s GB/s – 1 TB/s | Petabytes | 1–5 ms | Training datasets, shared checkpoints |
| WEKA (cluster) | Up to 1+ TB/s | Petabytes | <1 ms | All-flash AI, native S3, GDS-optimized |
| NFS v4 | 1–5 GB/s | Varies | 1–10 ms | Not recommended — single-server bottleneck |
| AWS S3 / Object | 1–10 GB/s | Unlimited | 10–100 ms | Raw dataset storage, model archives |

💿 NVMe SSD Architecture

NVMe (Non-Volatile Memory Express) is the protocol standard for connecting SSDs over PCIe. It replaced older SATA and SAS protocols by taking advantage of PCIe lanes for massive parallelism. Understanding NVMe is foundational to AI storage design — it's the fastest per-node storage available.

🔑 NVMe vs SATA vs HDD — Sequential Read Bandwidth NVMe Gen 4: ~7 GB/s  |  NVMe Gen 5: ~14 GB/s  |  SATA SSD: ~0.55 GB/s  |  HDD: ~0.15–0.2 GB/s. NVMe Gen 4 is 13× faster than SATA SSD, and 35–50× faster than HDD. For checkpoint and data loading workloads, this gap is enormous.

Sequential Bandwidth Comparison

  • NVMe Gen 5 (per drive): 14,000 MB/s (14 GB/s)
  • NVMe Gen 4 (per drive, DGX H100): 7,000 MB/s (7 GB/s)
  • NVMe Gen 3 (per drive): 3,500 MB/s (3.5 GB/s)
  • SATA SSD: 550 MB/s
  • HDD (7200 RPM): 180 MB/s

DGX H100 Local NVMe Configuration

| Parameter | Value |
|---|---|
| Drive count | 8× NVMe SSDs |
| Capacity per drive | 3.84 TB |
| Total capacity | ~30 TB |
| NVMe generation | PCIe Gen 4 NVMe |
| Sequential read (per drive) | ~7 GB/s |
| Aggregate sequential read | ~50–56 GB/s (all 8 drives in parallel) |
| Aggregate sequential write | ~40–48 GB/s |
| Random 4K IOPS (per drive) | ~1M IOPS |
| Access latency | ~100 µs |
| Primary use | Checkpoint staging, dataset cache, scratch |

⚙️ NVMe Key Concepts

NVMe Queues

NVMe supports up to 65,535 I/O queues with up to 65,535 commands per queue. Compare this to AHCI (SATA protocol): only 1 queue with 32 commands. This parallelism is why NVMe can hit 1M+ IOPS while SATA tops out at ~100K IOPS.

NVMe Namespaces

An NVMe namespace is a logical partition of an NVMe drive — analogous to a disk partition but more flexible, and a single device can expose multiple namespaces. In multi-tenant environments (MIG or virtualized), different namespaces can be assigned to different workloads for isolation.

NVMe-oF (NVMe over Fabrics)

NVMe-oF extends the NVMe protocol over a network fabric (RDMA/RoCEv2 or Fibre Channel), allowing remote NVMe drives to appear as local devices with near-local latency. Used in high-performance shared storage arrays for AI clusters. The WEKA filesystem uses NVMe-oF internally between nodes.

RAID Considerations for AI

  • RAID 0 (striping): doubles bandwidth with two drives — used for maximizing checkpoint write speed from local NVMe. No redundancy.
  • RAID 5/6: adds parity for redundancy but write penalty reduces performance — rarely used for AI scratch NVMe.
  • JBOD / no RAID: use all drives independently in parallel via application-level striping (filesystem or application handles distribution). Common in DGX deployments for maximum bandwidth.
📌 Exam tip: For AI checkpointing, parallel writes to all 8 NVMe drives simultaneously (via Lustre/WEKA striping or GPUDirect Storage) gives ~50 GB/s. This can save a 140 GB LLaMA 70B weight checkpoint in under 3 seconds.
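The exam-tip arithmetic is worth making concrete. A minimal back-of-envelope helper in Python (the function name is illustrative; the defaults assume the DGX H100 figures from the table above — 8 Gen 4 drives at ~7 GB/s each):

def checkpoint_save_seconds(size_gb: float, drives: int = 8, gbps_per_drive: float = 7.0) -> float:
    # Aggregate bandwidth assumes all drives written in parallel
    # (RAID 0 or application-level striping, as described above).
    return size_gb / (drives * gbps_per_drive)

print(checkpoint_save_seconds(140))   # 140 GB LLaMA 70B weights -> ~2.5 s
print(checkpoint_save_seconds(700))   # full checkpoint incl. optimizer -> ~12.5 s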

🌐 Parallel Filesystems — Why They Exist

A single DGX H100 has ~50 GB/s of local NVMe bandwidth. A DGX SuperPOD with 32 nodes has 32 × 50 GB/s = 1.6 TB/s of local storage — but those are 32 separate filesystems. Training a large model across all 256 GPUs requires shared access to training data and checkpoint directories. Parallel filesystems solve this by distributing storage across many nodes while presenting a unified namespace to all clients.

Lustre Architecture

Clients: DGX Node 0 · DGX Node 1 · DGX Node 2 · … · DGX Node 31
    ↕ Network (InfiniBand or Ethernet)
Metadata: MDS/MDT (active) + MDS/MDT (HA backup)
Data: OSS 0 / OST 0 · OSS 1 / OST 1 · OSS 2 / OST 2 · … OST N

🗂️ Lustre — Deep Dive

Lustre is the most widely deployed HPC and AI filesystem. It separates metadata (file names, permissions, directory structure) from data (file contents), scaling each independently. Clients access files by first talking to the MDS for metadata, then reading/writing data directly from OSS nodes in parallel.

Key Lustre Concepts

| Component | Role | Exam Key Fact |
|---|---|---|
| MDS (Metadata Server) | Manages file metadata: names, permissions, inode info, which OSTs hold file data | Single point of failure if no HA pair |
| MDT (Metadata Target) | Storage device on the MDS that holds metadata — usually a fast SSD array | MDT IOPS = metadata performance ceiling |
| OSS (Object Storage Server) | Serves data I/O for one or more OSTs to clients | Add more OSS nodes = more bandwidth |
| OST (Object Storage Target) | Disk/array on an OSS — stores the actual file data objects | Stripe across OSTs for parallel I/O |
| Lustre Client | Kernel module on the compute node that mounts /lustre and intercepts I/O calls | Loaded via: modprobe lustre |

Lustre Striping — Critical for AI Performance

Lustre striping splits a file across multiple OSTs so reads/writes can happen in parallel. A large dataset file striped across 8 OSTs can be read at 8× the speed of a single-OST file. For AI training, always stripe large files.

# Set stripe count (8 OSTs) and size (4MB chunks) for a directory
lfs setstripe -c 8 -S 4m /lustre/training_data

# Check stripe settings on a file
lfs getstripe /lustre/training_data/dataset.tar

# Check filesystem usage
lfs df -h /lustre

# Find files with suboptimal stripe count
lfs find /lustre/checkpoints --stripe-count 1

| lfs Parameter | Meaning | AI Recommendation |
|---|---|---|
| -c N (stripe count) | Number of OSTs to stripe across | Set to the number of OSTs, or -1 for all |
| -S size (stripe size) | Chunk size written per OST before moving to the next | 1–4 MB for sequential, smaller for random |
| -i N (start OST) | Which OST to start striping from | Leave default (-1 = round-robin) |
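Striping is typically applied once per directory before a training run, since new files inherit the directory layout. A minimal Python sketch wrapping the lfs CLI shown above (the function name is illustrative):

import subprocess

def set_lustre_striping(path: str, stripe_count: int = -1, stripe_size: str = "4m") -> None:
    # -c -1 stripes across all OSTs; files created under `path` inherit this layout
    subprocess.run(
        ["lfs", "setstripe", "-c", str(stripe_count), "-S", stripe_size, path],
        check=True,
    )

set_lustre_striping("/lustre/checkpoints")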

Lustre Performance Scale

Lustre performance scales linearly with OSS/OST nodes. A well-configured Lustre system for a DGX SuperPOD targets 1 TB/s+ aggregate read bandwidth for dataset loading, achieved by deploying enough OSS nodes with NVMe-backed OSTs. The MDS must also be sized to handle the metadata load from 256 clients opening millions of small files simultaneously.

⚠️ The Small Files Problem Training with millions of individual small files (e.g., 1 JPEG = 1 file) hammers the Lustre MDS with a metadata request per file. This doesn't scale. Solution: package files into TAR archives (WebDataset format) or large binary blobs. One 10 GB TAR = one metadata operation for thousands of samples.
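Packing shards needs nothing beyond the standard library. A minimal sketch of building a WebDataset-style TAR shard (paths and naming are illustrative; the WebDataset convention groups the files of one sample by a shared basename, with extensions like .jpg and .cls distinguishing fields):

import tarfile
from pathlib import Path

def pack_shard(image_dir: str, shard_path: str) -> None:
    # One TAR = one metadata open for thousands of samples
    with tarfile.open(shard_path, "w") as tar:
        for i, jpg in enumerate(sorted(Path(image_dir).glob("*.jpg"))):
            tar.add(jpg, arcname=f"{i:06d}.jpg")

pack_shard("/data/train_images", "/lustre/shards/train-000000.tar")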

WEKA — Flash-Native AI Filesystem

WEKA is a commercial all-flash parallel filesystem built from scratch for NVMe and RDMA. Unlike Lustre (ported from HDD-era architecture), WEKA was designed entirely around flash storage characteristics. It delivers sub-millisecond latency, native S3 API, and is an official NVIDIA-validated storage platform for DGX.

| Feature | WEKA | Lustre |
|---|---|---|
| Architecture origin | Flash-native (built for NVMe) | HDD-era, adapted for flash |
| Peak bandwidth | 1+ TB/s aggregate | 100s GB/s – 1 TB/s (OSS-limited) |
| Latency | <1 ms (sub-ms for local) | 1–5 ms |
| S3 API | Native (no gateway) | Requires S3 gateway add-on |
| POSIX compliance | Full POSIX | Full POSIX |
| GPUDirect Storage | Certified & optimized | Supported (requires config) |
| Management | GUI + CLI | Command-line / manual |
| License | Commercial | Open source (GPL) |
| Typical use | All-flash AI clusters, DGX POD | Academic HPC, budget-sensitive clusters |

📋 Other Shared Filesystems

| Filesystem | Type | Bandwidth | Strengths | Weaknesses |
|---|---|---|---|---|
| IBM Spectrum Scale (GPFS) | Commercial parallel FS | 100s GB/s | Enterprise features, HSM tiering to tape, strong support | Cost, complexity |
| BeeGFS | Open-source parallel FS | 10–100 GB/s | Easy setup, commodity hardware, academic HPC standard | Lower performance ceiling |
| DDN EXAScaler (Lustre) | Commercial Lustre distribution | 1+ TB/s | Lustre compatibility, enterprise support, high scale | Cost |
| NFS v4 | Network FS | 1–5 GB/s | Simple, universal | Single-server bottleneck, not parallel |

GPUDirect Technology Suite

GPUDirect is NVIDIA's family of technologies that enable data to move directly to and from GPU HBM memory without passing through the CPU and system DRAM. Each variant eliminates a different copy on the data path, dramatically reducing latency and freeing CPU resources for other work.

| Technology | Data Path | What It Eliminates | Use Case |
|---|---|---|---|
| GPUDirect Storage (GDS) | NVMe ↔ GPU HBM (DMA) | CPU + system DRAM copy | Checkpoint save/load, dataset reading |
| GPUDirect RDMA | Remote NIC ↔ GPU HBM (DMA) | CPU + system DRAM copy on network path | Multi-node NCCL AllReduce, P2P transfers |
| GPUDirect P2P | GPU A ↔ GPU B (NVLink or PCIe) | CPU + system DRAM bounce copy | Tensor/pipeline parallel intra-node |
| GPUDirect Video | Video capture ↔ GPU HBM | CPU frame decode overhead | Video AI inference pipelines |

💾 GPUDirect Storage — Deep Dive

GPUDirect Storage (GDS) is the most exam-critical GPUDirect technology for NCP-AII storage topics. It enables a DMA engine to transfer data directly between NVMe storage (local or NVMe-oF) and GPU HBM — the CPU is completely uninvolved in the data transfer.

❌ Without GPUDirect Storage: NVMe SSD → PCIe → system DRAM (pageable) → CPU copy to pinned memory → PCIe → GPU HBM. Two PCIe crossings, CPU occupied throughout.

✅ With GPUDirect Storage (GDS): NVMe SSD → PCIe DMA → GPU HBM directly. One PCIe crossing, no CPU involvement, 40–50% faster.

GDS Requirements & Setup

| Requirement | Component | Notes |
|---|---|---|
| CUDA API | cuFile API | Replaces standard POSIX file I/O for GPU-direct reads/writes |
| Kernel driver | nvidia-fs module | Loaded with modprobe nvidia-fs; shipped with the CUDA toolkit |
| Storage compatibility | NVMe local or NVMe-oF | Must be NVMe — SATA and HDD are not supported |
| Filesystem | ext4, xfs, WEKA, Lustre | NFS/CIFS are NOT supported by GDS |
| Driver version | CUDA 11.4+ / NVIDIA 470+ | GDS generally available from CUDA 11.4 |
// Minimal GPUDirect Storage read example
#include <fcntl.h>     // open(), O_DIRECT (compile with -D_GNU_SOURCE)
#include <unistd.h>    // close()
#include <cufile.h>

CUfileDescr_t cf_desc = {0};
CUfileHandle_t cf_handle;
void *gpu_ptr;         // already cudaMalloc'd
size_t file_size;      // set to the file's size in bytes

// Open file with O_DIRECT | O_RDONLY — O_DIRECT is required for GDS
int fd = open("/lustre/model_weights.bin", O_RDONLY | O_DIRECT);
cf_desc.handle.fd = fd;
cf_desc.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

cuFileDriverOpen();
cuFileHandleRegister(&cf_handle, &cf_desc);

// DMA: NVMe → GPU HBM (no CPU copy)
cuFileRead(cf_handle, gpu_ptr, file_size, 0, 0);

cuFileHandleDeregister(cf_handle);
cuFileDriverClose();
close(fd);
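In Python, the same cuFile path is exposed through NVIDIA's kvikio library. A minimal sketch, assuming the kvikio and cupy packages are installed and the nvidia-fs module is loaded (the file path and buffer size are illustrative):

import cupy
import kvikio

buf = cupy.empty(1024**3, dtype=cupy.uint8)       # 1 GiB destination in GPU HBM
with kvikio.CuFile("/lustre/model_weights.bin", "r") as f:
    n = f.read(buf)                               # cuFile-backed read: NVMe -> GPU, no CPU copy
print(f"read {n} bytes directly into GPU memory")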

🔗 GPUDirect RDMA — Network Path

GPUDirect RDMA enables a ConnectX-7 NIC to DMA data directly from a remote server's GPU HBM into the local GPU HBM, bypassing both CPUs and system DRAMs in the transfer path. This is critical for multi-node AllReduce performance — without RDMA, every inter-node tensor transfer passes through two CPUs and two system memory copies.

🔑 GPUDirect RDMA Setup Commands Load module: modprobe nvidia-peermem · Verify: lsmod | grep nvidia_peermem · NCCL uses it automatically when available — set NCCL_NET_GDR_LEVEL=2 to force GPU DMA for all remote transfers. Check NCCL log: NCCL_DEBUG=INFO.

GPUDirect P2P — Intra-Node

GPUDirect P2P enables GPU A to read from or write to GPU B's HBM directly over NVLink or PCIe — without staging through system DRAM. In DGX H100 with NVLink 4, P2P transfers at 900 GB/s make tensor parallelism within a node nearly as fast as accessing local HBM. Enable with cudaDeviceEnablePeerAccess(peerDevice, 0) and verify with cudaDeviceCanAccessPeer(&canAccess, devA, devB).
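P2P is easy to observe from PyTorch. A quick sketch: torch.cuda.can_device_access_peer reports whether the driver allows direct peer access, and a cross-device copy then takes the NVLink/PCIe peer path instead of bouncing through host memory:

import torch

if torch.cuda.device_count() >= 2:
    print(torch.cuda.can_device_access_peer(0, 1))   # True if P2P is possible
    x = torch.randn(1 << 20, device="cuda:0")
    y = x.to("cuda:1", non_blocking=True)            # GPU-to-GPU copy, no host staging when P2P is on
    torch.cuda.synchronize()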

🔄 The Training Data Pipeline

Even with fast NVMe and parallel filesystems, data loading can bottleneck training if the software pipeline is inefficient. The goal is to keep GPU compute 100% utilized by ensuring the next mini-batch of data is always ready before the GPU needs it — this requires pipelining storage reads, preprocessing, and GPU transfer in parallel.

Standard PyTorch Pipeline

📁 Storage (NVMe / Lustre) → 🔄 CPU Decode (JPEG / tokenize) → 📌 Pinned Memory (cudaHostAlloc) → 🔥 GPU HBM (async cudaMemcpy) → ⚙️ GPU Compute (forward/backward)

NVIDIA DALI — GPU-Accelerated Pipeline

NVIDIA DALI (Data Loading Library) replaces the CPU-side data preprocessing with GPU execution, dramatically reducing data loading overhead for vision and NLP workloads. DALI performs decoding, augmentation, normalization, and format conversion directly on the GPU.

📁 Storage (NVMe / Lustre) → 🔄 CPU Read (fast I/O only) → 🔥 GPU Decode (DALI nvJPEG) → ⚙️ GPU Augment (crop/normalize) → ✅ GPU Tensor (ready for model)
| Feature | PyTorch DataLoader | NVIDIA DALI |
|---|---|---|
| Decode (JPEG) | CPU libjpeg | GPU (nvJPEG) — up to 10× faster |
| Augmentation | CPU (torchvision) | GPU — eliminates CPU bottleneck |
| Pipeline overlap | Limited (Python GIL) | Full async prefetch with CUDA streams |
| Multi-GPU | Replicate per worker | Native multi-GPU shard support |
| Video support | Limited | Yes — GPU video decode |
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali.pipeline import pipeline_def

@pipeline_def(batch_size=256, num_threads=4, device_id=0)
def training_pipeline():
    jpegs, labels = fn.readers.file(file_root="/lustre/imagenet/train",
                                     random_shuffle=True)
    # Decode directly to GPU
    images = fn.decoders.image(jpegs, device="mixed",
                                output_type=types.RGB)
    # Augment on GPU
    images = fn.random_resized_crop(images, size=224, device="gpu")
    images = fn.crop_mirror_normalize(images, device="gpu",
                                       mean=[0.485*255, 0.456*255, 0.406*255],
                                       std=[0.229*255, 0.224*255, 0.225*255])
    return images, labels
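To feed this pipeline into a PyTorch training loop, DALI provides a framework iterator. A minimal usage sketch, assuming the file reader above is additionally given name="Reader" (i.e., fn.readers.file(..., name="Reader")) so the iterator can size the epoch:

from nvidia.dali.plugin.pytorch import DALIGenericIterator

pipe = training_pipeline()         # the @pipeline_def function above
pipe.build()
loader = DALIGenericIterator(pipe, ["images", "labels"], reader_name="Reader")

for batch in loader:
    images = batch[0]["images"]    # GPU tensor, already decoded and augmented
    labels = batch[0]["labels"]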

Pinned Memory & Prefetching

Pinned (page-locked) memory on the CPU side enables direct DMA from CPU DRAM to GPU HBM over PCIe — no intermediate copy is needed. Paged memory requires the OS to first copy to a pinned staging buffer before PCIe DMA can occur.

# PyTorch: enable pinned memory in DataLoader
loader = torch.utils.data.DataLoader(
    dataset, batch_size=256,
    pin_memory=True,    # ← enables direct DMA to GPU
    num_workers=8,      # ← parallel CPU data loading
    prefetch_factor=2   # ← prefetch 2 batches ahead
)
📌 Rule of thumb: Set num_workers to roughly 2–4 per GPU. Too few = I/O bottleneck. Too many = CPU contention. Always enable pin_memory=True when training on GPU.
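Beyond prefetch_factor, the PCIe transfer itself can be hidden behind compute with a second CUDA stream. A minimal double-buffered prefetcher sketch (the class name is illustrative; it assumes the loader yields tuples of pinned-memory tensors, i.e., pin_memory=True as above):

import torch

class CUDAPrefetcher:
    """Copy batch N+1 to the GPU on a side stream while batch N computes (sketch)."""

    def __init__(self, loader, device="cuda"):
        self.loader, self.device = loader, device
        self.copy_stream = torch.cuda.Stream()

    def __iter__(self):
        ready = None
        for batch in self.loader:
            # Enqueue the host-to-device copy on the side stream (async with pinned memory)
            with torch.cuda.stream(self.copy_stream):
                batch = [t.to(self.device, non_blocking=True) for t in batch]
            if ready is not None:
                yield ready                     # consumer computes while the next copy proceeds
            # Compute stream must not touch `batch` until its copy completes
            torch.cuda.current_stream().wait_stream(self.copy_stream)
            for t in batch:
                t.record_stream(torch.cuda.current_stream())  # keep the allocator honest
            ready = batch
        if ready is not None:
            yield ready

for images, labels in CUDAPrefetcher(loader):   # `loader` from the snippet above
    pass  # forward/backward here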

💼 Checkpointing Strategy

Checkpointing saves model state (weights + optimizer state + training metadata) periodically during training. For large LLMs, checkpoints are hundreds of GB and must be saved quickly to minimize GPU idle time.

Checkpoint Size Estimation

| Quantity | Value |
|---|---|
| Model: LLaMA 70B parameter count | 70 × 10⁹ params |
| BF16 weights (2 bytes/param) | 70B × 2 = 140 GB |
| FP32 optimizer state (8 bytes/param, Adam m+v) | 70B × 8 = 560 GB |
| Mixed precision (BF16 weights + FP32 optimizer) | ≈ 700 GB total |
| FSDP sharded across 8 GPUs (each saves 1/8) | ≈ 87.5 GB per GPU |
| Save time at 50 GB/s (local NVMe, sharded) | ~1.75 seconds |

Checkpoint Strategies Comparison

| Strategy | How | Speed | Best For |
|---|---|---|---|
| Full checkpoint | Rank 0 gathers all shards → saves to a single file | Slow (single writer, bottlenecked) | Small models, simple restore |
| Sharded checkpoint (FSDP) | Each rank saves its own shard independently in parallel | Fast (N writers in parallel) | LLM training with FSDP or DeepSpeed |
| Async checkpoint | Copy weights to CPU RAM in the background while the GPU continues training | Near-zero GPU pause | Frequent checkpointing without overhead |
| GDS-accelerated | GPU → NVMe DMA via cuFile (no CPU) | 40–50% faster than the CPU-routed path | Large checkpoints to local NVMe |
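A sharded save is only a few lines with PyTorch's distributed checkpoint API. A minimal sketch, assuming PyTorch 2.2+ (where torch.distributed.checkpoint.save accepts checkpoint_id), an initialized process group, and an FSDP-wrapped model (the path is illustrative):

import torch.distributed.checkpoint as dcp

# Every rank writes only its own shard, in parallel — no rank-0 bottleneck
state_dict = {"model": model.state_dict()}
dcp.save(state_dict, checkpoint_id="/raid/ckpt/step_1000")

# Restore: shards are loaded back into the sharded state dict in place
dcp.load(state_dict, checkpoint_id="/raid/ckpt/step_1000")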

Checkpoint Staging Pattern

  • Step 1 (fast, ~seconds): Save checkpoint to local NVMe via GDS or async CPU copy. GPU resumes training immediately.
  • Step 2 (background): A background process (or async task) copies the checkpoint from local NVMe to the shared parallel filesystem (Lustre/WEKA) for durability across nodes.
  • Step 3 (optional): From the shared filesystem, a separate offload job copies to object storage (S3) for long-term/cloud backup.
✅ Optimal Checkpoint Pipeline GPU HBM → async copy to CPU pinned memory → GDS write to local NVMe (~seconds) → background rsync/copy to Lustre/WEKA → optional S3 offload. Training continues after step 1. No GPU idle time waiting for slow network storage.
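Step 2 of the staging pattern above is easy to express with the standard library. A minimal sketch of the background copy (paths and function name illustrative; production setups often use rsync or a dedicated offload daemon instead):

import shutil
import threading

def stage_to_shared(local_ckpt: str, shared_ckpt: str) -> threading.Thread:
    # Fire-and-forget copy from local NVMe to Lustre/WEKA; training never blocks on it
    t = threading.Thread(
        target=shutil.copy2, args=(local_ckpt, shared_ckpt), daemon=True
    )
    t.start()
    return t

stage_to_shared("/raid/ckpt/step_1000.pt", "/lustre/checkpoints/step_1000.pt")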

📦 Dataset Formats — Avoiding the Small-Files Problem

How you store training data is as important as what storage system you use. Millions of small files destroy filesystem metadata performance (Lustre MDS IOPS) and cause random I/O patterns on HDDs/SSDs. Packaging samples into large sequential archives enables high-throughput streaming reads.

| Format | Structure | Read Pattern | Throughput | AI Usage |
|---|---|---|---|---|
| WebDataset | TAR archives (key-value pairs inside) | Sequential (fast) | Excellent | CV training, ImageNet-scale |
| MosaicML Streaming | Indexed binary shards | Sequential + indexed | Excellent | LLM pre-training |
| TFRecord | Protocol buffer records | Sequential | Good | TensorFlow workloads |
| HDF5 | Hierarchical binary format | Random or sequential | Good | Scientific/medical AI |
| Raw files (JPEG/PNG) | Individual files per sample | Random (kills MDS) | Poor at scale | Avoid for large datasets |
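Consuming packed shards is equally simple. A minimal read-side sketch, assuming the webdataset package and the shard naming used in the packing example earlier (the brace notation expands to 256 shard files):

import webdataset as wds

dataset = (
    wds.WebDataset("/lustre/shards/train-{000000..000255}.tar")
    .decode("rgb")                 # decode images to float RGB arrays
    .to_tuple("jpg", "cls")        # yield (image, label) pairs per sample
)

for image, label in dataset:
    break  # feed into a DataLoader or training loop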

🧪 Practice Quiz — Storage Systems

Question 1 of 10
Which protocol does NVMe use to connect SSDs to the system, enabling high parallelism and low latency?
A) SATA (Serial ATA)
B) PCIe (Peripheral Component Interconnect Express)
C) SAS (Serial Attached SCSI)
D) USB 3.2
Question 2 of 10
What is the approximate sequential read bandwidth of a single NVMe PCIe Gen 4 SSD (as used in DGX H100)?
A) 0.55 GB/s (SATA speed)
B) 2 GB/s
C) 7 GB/s
D) 14 GB/s
Question 3 of 10
Which lfs command correctly sets Lustre stripe count to 8 OSTs with a 4 MB stripe size on a directory?
A) lfs stripe --count=8 --size=4m /path
B) lustre setstripe -n 8 -b 4m /path
C) lfs setstripe -c 8 -S 4m /path
D) lctl set stripe -c8 -s 4m /path
Question 4 of 10
What two components are required for GPUDirect Storage (GDS) to function?
A) NVLink bridge + ConnectX-7 NIC
B) cuFile API + nvidia-fs kernel module
C) BlueField DPU + DOCA SDK
D) NCCL + GPUDirect RDMA driver
Question 5 of 10
Which dataset format is recommended for high-throughput AI training to avoid the Lustre MDS metadata bottleneck caused by millions of small files?
A) Individual JPEG/PNG files in nested directories
B) WebDataset (TAR archives containing key-value sample pairs)
C) SQLite databases with binary BLOBs
D) ZIP archives of image batches
Question 6 of 10
A LLaMA 70B model checkpoint includes BF16 weights plus FP32 Adam optimizer state (m + v). What is the approximate full checkpoint size?
A) 70 GB (weights only, FP16)
B) 140 GB (weights only, BF16)
C) ~700 GB (BF16 weights + FP32 optimizer state)
D) 1.4 TB (FP32 weights + FP32 optimizer)
Question 7 of 10
In Lustre, what is the role of the MDS (Metadata Server)?
A) Store and serve all file data objects to clients
B) Manage file metadata: names, permissions, inodes, and which OSTs hold each file's data
C) Manage the NVMe RAID across all storage nodes
D) Handle network routing between OSS nodes and clients
Question 8 of 10
Which NVIDIA library replaces PyTorch DataLoader's CPU-side JPEG decoding and augmentation with GPU-accelerated equivalents?
A) cuBLAS
B) TensorRT
C) DALI (Data Loading Library)
D) NCCL
Question 9 of 10
What is the recommended checkpoint staging strategy to minimize GPU idle time during large model training?
A) Write directly to S3 object storage during the training step
B) Synchronously write to shared Lustre before resuming training
C) Save to local NVMe first (fast, via GDS or async), then copy to shared storage in background
D) Only checkpoint at the end of each full training epoch
Question 10 of 10
What does setting pin_memory=True in a PyTorch DataLoader enable?
A) Encrypts the CPU-GPU data transfer for security
B) Bypasses the PCIe bus entirely using NVLink
C) Uses page-locked (pinned) CPU memory, enabling direct DMA to GPU HBM without an intermediate copy from pageable memory
D) Pins specific model layers to GPU memory to prevent swapping

🃏 Flashcards

NVMe Bandwidth
NVMe Gen 4 vs Gen 5 vs SATA SSD sequential read speeds?
Gen 4: ~7 GB/s · Gen 5: ~14 GB/s · SATA SSD: ~0.55 GB/s. Gen 4 is 13× faster than SATA.
7 → 14 → 0.55 (Gen4, Gen5, SATA)
DGX H100 Storage
Local NVMe count, per-drive capacity, total, and aggregate read BW?
8× NVMe drives · 3.84 TB each = 30 TB total · ~50 GB/s aggregate sequential read bandwidth
8 drives × 7 GB/s = ~50 GB/s
Lustre Striping
Command to set stripe count=8 and stripe size=4MB on a Lustre path?
lfs setstripe -c 8 -S 4m /path. Check with: lfs getstripe /path/file
-c = count, -S = size
GPUDirect Storage
What are the two required components for GDS, and what path does it enable?
cuFile API + nvidia-fs kernel module. Path: NVMe → PCIe DMA → GPU HBM directly (no CPU, no system DRAM).
cuFile + nvidia-fs = GDS
LLM Checkpoint Size
LLaMA 70B checkpoint size with BF16 weights + FP32 Adam optimizer?
Weights: 70B × 2B = 140 GB. Optimizer (m+v): 70B × 8B = 560 GB. Total: ~700 GB. FSDP sharded /8 = ~87.5 GB per rank.
700 GB full, ~88 GB per GPU (FSDP)
DALI vs DataLoader
What does NVIDIA DALI replace and what is its key advantage?
Replaces PyTorch DataLoader's CPU decode + augmentation with GPU execution (nvJPEG, GPU crop/normalize). Eliminates CPU bottleneck for vision training.
DALI = GPU-accelerated data loading
Pinned Memory
What does cudaHostAlloc / pin_memory=True enable?
Page-locked CPU memory that enables direct PCIe DMA to GPU HBM without an intermediate copy. Avoids CPU overhead of paging during transfer.
Pinned = page-locked = direct DMA
WEKA vs Lustre
Key differences: architecture origin, S3 support, latency?
WEKA: flash-native, native S3 API, <1ms latency, up to 1+ TB/s. Lustre: HDD-era open-source, S3 via gateway, 1–5ms, 100s GB/s–1 TB/s.
WEKA = flash-native, native S3

🤖 Storage Systems Advisor


💿 NVMe & Local Storage — Key Exam Points

  • NVMe Gen 4 = ~7 GB/s sequential read per drive. Gen 5 = ~14 GB/s. SATA SSD = 0.55 GB/s. NVMe uses PCIe lanes, not SATA protocol — up to 65,535 I/O queues (AHCI/SATA: just 1 queue with 32 commands).
  • DGX H100 local storage: 8× NVMe PCIe Gen 4, 3.84 TB each = 30 TB total, ~50 GB/s aggregate sequential read when accessing all 8 drives in parallel.
  • NVMe access latency: ~100 µs. Compare: HDD = ~10 ms (100× slower), DRAM = ~100 ns (1,000× faster than NVMe). NVMe is the right choice for checkpoint staging — fast enough to not block GPU, cheap enough for large capacity.
  • NVMe-oF extends NVMe over RDMA fabric (RoCEv2 or IB). Used in shared all-flash arrays and WEKA internals. Same NVMe protocol, network-attached instead of local PCIe.
  • NVMe Gen 5 support in newer servers: 14 GB/s per drive doubles checkpoint bandwidth. A system with 8× Gen 5 NVMe drives can save at ~100 GB/s — a 700 GB checkpoint in ~7 seconds.
  • To benchmark local NVMe: fio --name=seq_read --rw=read --bs=1m --ioengine=libaio --iodepth=32 --filename=/dev/nvme0n1 --direct=1 --size=10g. For GDS-enabled benchmarks use gdsio tool from NVIDIA.

🌐 Lustre Configuration — Exam Points

  • Lustre components: MDS (Metadata Server) + MDT (Metadata Target) for namespace; OSS (Object Storage Server) + OST (Object Storage Target) for data. Clients are compute nodes with lustre kernel module.
  • Performance scales with OSS/OST count — more OST nodes = more aggregate bandwidth. The MDS is a potential bottleneck for metadata-heavy workloads (many small files).
  • Set optimal striping before writing large files: lfs setstripe -c -1 -S 4m /lustre/checkpoints (-c -1 = stripe across all OSTs). Apply at directory level — inherited by new files.
  • Small files problem: millions of files → MDS IOPS saturation. Solution: package data into TAR/WebDataset format. One 10 GB WebDataset TAR file = 1 metadata open, not 10,000 file opens.
  • Check stripe settings: lfs getstripe -r /lustre/training_data (-r = recursive). Check disk usage and per-OST balance: lfs df -h /lustre. Check MDT usage: lfs df --mdt /lustre.
  • Lustre GPUDirect Storage: supported via the Lustre POSIX layer — load nvidia-fs module, then use cuFile with standard Lustre paths. No special Lustre client config needed beyond the GDS driver.

⚡ GPUDirect Technologies — Exam Points

  • GDS = cuFile API + nvidia-fs kernel module. Path: NVMe → PCIe DMA → GPU HBM (no CPU, no system DRAM). Supported filesystems: ext4, xfs, Lustre, WEKA, BeeGFS. NOT supported: NFS, CIFS/SMB.
  • Load GDS driver: modprobe nvidia-fs. Verify: lsmod | grep nvidia_fs. Check GDS config: cat /proc/driver/nvidia-fs/params. GDS tool test: gdsio -f /dev/nvme0n1 -d 0 -w 0 -s 1g.
  • GPUDirect RDMA: ConnectX-7 NIC → GPU HBM DMA (no CPU). Driver: modprobe nvidia-peermem. NCCL uses automatically when available. Force GDR: NCCL_NET_GDR_LEVEL=2.
  • GPUDirect P2P: GPU-to-GPU direct transfer over NVLink (900 GB/s) or PCIe. Enable: cudaDeviceEnablePeerAccess(peerDev, 0). Over NVLink, this is always available and gives near-HBM bandwidth for inter-GPU tensor copies.
  • GDS performance gain: 40–50% faster checkpoint save compared to CPU-routed path. Especially significant for large LLMs where checkpoint size is 100s of GB and I/O is the bottleneck.
  • GDS + FSDP pattern: each rank independently calls cuFileWrite to its own shard file on local NVMe — fully parallel, no rank 0 bottleneck, near-line-rate NVMe bandwidth utilized by all GPUs simultaneously.

🔄 Data Pipeline Optimization — Exam Points

  • DALI (Data Loading Library): GPU-accelerated decode (nvJPEG), augmentation, and normalization. Use instead of PyTorch DataLoader for vision at scale. Import: import nvidia.dali.fn as fn. Key op: fn.decoders.image(..., device="mixed") decodes JPEG on GPU.
  • Pinned memory: pin_memory=True in DataLoader or cudaHostAlloc directly. Enables direct DMA from CPU DRAM to GPU HBM — avoids an OS-level copy from pageable to pinned before DMA. Critical for high-throughput data loading.
  • num_workers: set to 2–4× GPU count. Each worker is a separate process (avoids Python GIL). Too few = I/O bottleneck. Too many = CPU memory pressure. Monitor with nvidia-smi dmon and check GPU utilization — should be >90%.
  • prefetch_factor=N: each DataLoader worker pre-loads N batches ahead. Default is 2. Increase for slower storage (network filesystem) to hide latency.
  • CUDA streams for overlap: use two CUDA streams — stream A computes on batch N while stream B transfers batch N+1 to GPU. This hides PCIe transfer latency behind GPU compute with cudaMemcpyAsync.
  • Dataset format: use WebDataset (pip install webdataset) for large CV datasets. Reads sequentially from TAR archives — no small-file IOPS. MosaicML Streaming dataset format for LLM training with random-access capability across shuffled shards.

💼 Checkpoint Strategy — Exam Points

  • Checkpoint size formula: weights = params × bytes_per_param. BF16 = 2B/param, FP32 = 4B/param. Adam optimizer state = 2× FP32 = 8B/param. LLaMA 70B full: 140 GB (BF16) + 560 GB (FP32 opt) = 700 GB.
  • FSDP sharded checkpointing: each rank saves its own shard independently. 700 GB ÷ 8 GPUs = ~87.5 GB per rank. Saving in parallel: 87.5 GB at 50 GB/s = ~1.75s. Much faster than rank 0 saving all 700 GB (14s at 50 GB/s).
  • Async checkpointing: copy weights to CPU pinned memory in a background thread while the GPU continues training. The GPU resumes within milliseconds; the actual NVMe write happens asynchronously. PyTorch: use torch.distributed.checkpoint.async_save (PyTorch 2.3+) or custom threading.
  • Optimal staging: GPU HBM → async copy to CPU pinned RAM (fast) → GDS write to local NVMe → background rsync/cp to Lustre/WEKA → optional offload to S3. Each stage uses the fastest available path.
  • Checkpoint frequency trade-off: checkpoint every 100 steps = small amount of work lost on crash. Checkpoint every 10 steps = significant storage I/O overhead. Use async checkpointing to checkpoint more frequently without GPU pauses.
  • Verify checkpoint integrity: after save, reload and compare a few parameter values. Use torch.allclose(original_param, loaded_param) to catch silent corruption. Especially important for multi-rank sharded checkpoints where a single corrupt shard can corrupt the whole restore.
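The integrity check from the last bullet takes only a few lines. A minimal sketch (the function name is illustrative; it assumes the checkpoint was saved with torch.save(model.state_dict(), path) — for sharded checkpoints, run the equivalent per rank on its own shard):

import torch

def verify_checkpoint(model: torch.nn.Module, path: str) -> None:
    # Reload and compare against the live parameters to catch silent corruption
    loaded = torch.load(path, map_location="cpu")
    for name, param in model.state_dict().items():
        if not torch.allclose(param.cpu(), loaded[name]):
            raise RuntimeError(f"checkpoint mismatch in {name}")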

🧩 Storage Mnemonics

| Fact | Mnemonic / Hook |
|---|---|
| NVMe Gen 4 = 7 GB/s | "7 GB/s — 7 days in a week, one week of HDD performance per second" |
| DGX H100 NVMe: 8 × 3.84 TB = 30 TB | "8 drives × 4 TB ≈ 30 TB — like a 30 TB 'cloud' living inside your server" |
| GDS = cuFile + nvidia-fs | "GPU Direct Storage = C-u-File + nvidia-FS kernel — two parts, one path" |
| Lustre stripe: -c count, -S size | "lfs setstripe: -c for Count, -S for Size. C comes before S alphabetically." |
| LLaMA 70B full checkpoint = ~700 GB | "70B params × 10 bytes avg (weights + optimizer) ≈ 700 GB" |
| Small files = MDS bottleneck | "Millions of tiny files = millions of MDS knocks. Use TAR/WebDataset to knock once." |
| DALI = GPU decode | "DALI sends the decoding to the GPU" |

Keep your GPUs fed — master AI storage
