NVMe SSDs, Lustre & WEKA parallel filesystems, GPUDirect Storage, DALI pipelines, and checkpoint strategies. The I/O layer that keeps your GPUs fed and your training runs alive.
GPU compute has scaled faster than storage bandwidth over the past decade. A DGX H100 delivers ~32 PFLOPS of FP8 compute but can read only ~50 GB/s from its local NVMe. This creates the I/O wall: without a high-throughput storage architecture, expensive GPUs sit idle waiting for data. Solving the storage problem is what separates a well-designed AI cluster from an expensive bottleneck.
| Storage Technology | Bandwidth | Capacity | Latency | Best AI Use Case |
|---|---|---|---|---|
| NVMe Gen 4 SSD (per drive) | ~7 GB/s seq read | 3–8 TB/drive | ~100 µs | Local checkpoint staging, scratch |
| NVMe Gen 5 SSD (per drive) | ~14 GB/s seq read | 4–8 TB/drive | ~50 µs | Ultra-fast local scratch, GDS-enabled |
| Lustre (cluster) | 100s GB/s–1 TB/s | Petabytes | 1–5 ms | Training datasets, shared checkpoints |
| WEKA (cluster) | Up to 1+ TB/s | Petabytes | <1 ms | All-flash AI, native S3, GDS-optimized |
| NFS v4 | 1–5 GB/s | Varies | 1–10 ms | Not recommended — MDS bottleneck |
| AWS S3 / Object | 1–10 GB/s | Unlimited | 10–100 ms | Raw dataset storage, model archives |
NVMe (Non-Volatile Memory Express) is the protocol standard for connecting SSDs over PCIe. It replaced older SATA and SAS protocols by taking advantage of PCIe lanes for massive parallelism. Understanding NVMe is foundational to AI storage design — it's the fastest per-node storage available.
| Parameter | Value |
|---|---|
| Drive count | 8× NVMe SSDs |
| Capacity per drive | 3.84 TB |
| Total capacity | ~30 TB |
| NVMe generation | PCIe Gen 4 NVMe |
| Sequential read (per drive) | ~7 GB/s |
| Aggregate sequential read | ~50–56 GB/s (all 8 drives parallel) |
| Aggregate sequential write | ~40–48 GB/s |
| Random 4K IOPS (per drive) | ~1M IOPS |
| Access latency | ~100 µs |
| Primary use | Checkpoint staging, dataset cache, scratch |
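On DGX systems the data drives are typically aggregated into a single RAID-0 scratch volume. A minimal sketch with mdadm, assuming DGX-style device names and the conventional /raid mount point:

# Stripe all 8 NVMe drives into one RAID-0 block device
mdadm --create /dev/md0 --level=0 --raid-devices=8 /dev/nvme[0-7]n1
# Format and mount as local scratch
mkfs.ext4 /dev/md0
mount /dev/md0 /raid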
NVMe supports up to 65,535 I/O queues with up to 65,535 commands per queue. Compare this to AHCI (SATA protocol): only 1 queue with 32 commands. This parallelism is why NVMe can hit 1M+ IOPS while SATA tops out at ~100K IOPS.
An NVMe namespace is a logical partition of an NVMe drive — analogous to a disk partition but more flexible. A single device can expose multiple namespaces. In multi-tenant environments (MIG or virtualized), different namespaces can be assigned to different workloads for isolation.
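These details are easy to inspect with the nvme-cli tool; a short sketch, assuming a drive at /dev/nvme0:

# List all NVMe controllers and namespaces visible to this host
nvme list
# Dump controller capabilities, including queue and namespace limits
nvme id-ctrl /dev/nvme0
# List namespace IDs on the controller
nvme list-ns /dev/nvme0
# Show capacity and block format for namespace 1
nvme id-ns /dev/nvme0n1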
NVMe-oF extends the NVMe protocol over a network fabric (RDMA/RoCEv2 or Fibre Channel), allowing remote NVMe drives to appear as local devices with near-local latency. Used in high-performance shared storage arrays for AI clusters. The WEKA filesystem uses NVMe-oF internally between nodes.
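A hedged sketch of bringing up an NVMe-oF connection with nvme-cli (the transport, address, and NQN below are placeholder values):

# Discover subsystems exported by a remote target over RDMA
nvme discover -t rdma -a 192.168.10.5 -s 4420
# Connect; the remote namespace then appears as a local /dev/nvmeXnY device
nvme connect -t rdma -n nqn.2024-01.io.example:subsys1 -a 192.168.10.5 -s 4420
nvme list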
A single DGX H100 has ~50 GB/s of local NVMe bandwidth. A DGX SuperPOD with 32 nodes therefore has 32 × 50 GB/s = 1.6 TB/s of aggregate local bandwidth, but spread across 32 separate filesystems. Training a large model across all 256 GPUs requires shared access to training data and checkpoint directories. Parallel filesystems solve this by distributing storage across many nodes while presenting a unified namespace to all clients.
Lustre is the most widely deployed HPC and AI filesystem. It separates metadata (file names, permissions, directory structure) from data (file contents), scaling each independently. Clients access files by first talking to the MDS for metadata, then reading/writing data directly from OSS nodes in parallel.
| Component | Role | Exam Key Fact |
|---|---|---|
| MDS (Metadata Server) | Manages file metadata: names, permissions, inode info, which OSTs hold file data | Single point of failure if no HA pair |
| MDT (Metadata Target) | Storage device on MDS — holds metadata. Usually a fast SSD array | MDT IOPS = metadata performance ceiling |
| OSS (Object Storage Server) | Serves data I/O for one or more OSTs to clients | Add more OSS nodes = more bandwidth |
| OST (Object Storage Target) | Disk/array on OSS — stores actual file data objects | Stripe across OSTs for parallel I/O |
| Lustre Client | Kernel module on compute node that mounts /lustre and intercepts I/O calls | Loaded via: modprobe lustre |
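On a compute node, the pieces come together at mount time; a minimal sketch where the MGS NID and filesystem name are assumptions:

# Load the Lustre client kernel module
modprobe lustre
# Mount by pointing at the MGS NID and the filesystem name
mount -t lustre 10.0.0.1@o2ib:/lfs01 /lustre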
Lustre striping splits a file across multiple OSTs so reads/writes can happen in parallel. A large dataset file striped across 8 OSTs can be read at 8× the speed of a single-OST file. For AI training, always stripe large files.
# Set stripe count (8 OSTs) and size (4MB chunks) for a directory
lfs setstripe -c 8 -S 4m /lustre/training_data
# Check stripe settings on a file
lfs getstripe /lustre/training_data/dataset.tar
# Check filesystem usage
lfs df -h /lustre
# Find files with suboptimal stripe count
lfs find /lustre/checkpoints --stripe-count 1
| lfs Parameter | Meaning | AI Recommendation |
|---|---|---|
| -c N (stripe count) | Number of OSTs to stripe across | Set to number of OSTs or -1 for all |
| -S size (stripe size) | Chunk size per OST before moving to next | 1–4 MB for sequential, smaller for random |
| -i N (start OST) | Which OST to start striping from | Leave default (-1 = round-robin) |
Lustre performance scales linearly with OSS/OST nodes. A well-configured Lustre system for a DGX SuperPOD targets 1 TB/s+ aggregate read bandwidth for dataset loading, achieved by deploying enough OSS nodes with NVMe-backed OSTs. The MDS must also be sized to handle the metadata load from 256 clients opening millions of small files simultaneously.
WEKA is a commercial all-flash parallel filesystem built from scratch for NVMe and RDMA. Unlike Lustre, whose architecture originated in the HDD era and was later adapted for flash, WEKA was designed entirely around flash storage characteristics. It delivers sub-millisecond latency and a native S3 API, and it is an NVIDIA-validated storage platform for DGX.
| Feature | WEKA | Lustre |
|---|---|---|
| Architecture origin | Flash-native (built for NVMe) | HDD-era, adapted for flash |
| Peak bandwidth | 1+ TB/s aggregate | 100s GB/s – 1 TB/s (OSS-limited) |
| Latency | <1 ms (sub-ms for local) | 1–5 ms |
| S3 API | Native (no gateway) | Requires S3 gateway add-on |
| POSIX compliance | Full POSIX | Full POSIX |
| GPUDirect Storage | Certified & optimized | Supported (requires config) |
| Management | GUI + CLI | Command-line / manual |
| License | Commercial | Open source (GPL) |
| Typical use | All-flash AI clusters, DGX POD | Academic HPC, budget-sensitive clusters |
| Filesystem | Type | Bandwidth | Strengths | Weaknesses |
|---|---|---|---|---|
| IBM Spectrum Scale (GPFS) | Commercial parallel FS | 100s GB/s | Enterprise features, HSM tiering to tape, strong support | Cost, complexity |
| BeeGFS | Open source parallel FS | 10–100 GB/s | Easy setup, commodity hardware, academic HPC standard | Lower performance ceiling |
| DDN EXAScaler (Lustre) | Commercial Lustre distrib. | 1+ TB/s | Lustre compatibility, enterprise support, high scale | Cost |
| NFS v4 | Network FS | 1–5 GB/s | Simple, universal | Single MDS bottleneck, not parallel |
GPUDirect is NVIDIA's family of technologies that enable data to move directly to and from GPU HBM memory without passing through the CPU and system DRAM. Each variant eliminates a different copy on the data path, dramatically reducing latency and freeing CPU resources for other work.
| Technology | Data Path | What It Eliminates | Use Case |
|---|---|---|---|
| GPUDirect Storage (GDS) | NVMe ↔ GPU HBM (DMA) | CPU + system DRAM copy | Checkpoint save/load, dataset reading |
| GPUDirect RDMA | Remote NIC ↔ GPU HBM (DMA) | CPU + system DRAM copy on network path | Multi-node NCCL AllReduce, P2P transfers |
| GPUDirect P2P | GPU A ↔ GPU B (NVLink or PCIe) | CPU + system DRAM bounce copy | Tensor/pipeline parallel intra-node |
| GPUDirect Video | Video capture ↔ GPU HBM | CPU frame decode overhead | Video AI inference pipelines |
GPUDirect Storage (GDS) is the most exam-critical GPUDirect technology for NCP-AII storage topics. It enables a DMA engine to transfer data directly between NVMe storage (local or NVMe-oF) and GPU HBM — the CPU is completely uninvolved in the data transfer.
| Requirement | Component | Notes |
|---|---|---|
| CUDA API | cuFile API | Replaces standard POSIX file I/O for GPU-direct reads/writes |
| Kernel driver | nvidia-fs module | Loaded with modprobe nvidia-fs, part of CUDA toolkit |
| Storage compatibility | NVMe local or NVMe-oF | Must be NVMe — SATA and HDD are not supported |
| Filesystem | ext4, xfs, WEKA, Lustre | NFS/CIFS are NOT supported by GDS |
| Driver version | CUDA 11.4+ / NVIDIA 470+ | GDS generally available from CUDA 11.4 |
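One quick way to confirm a node is GDS-ready, assuming a default CUDA toolkit install path for the gdscheck tool:

# Report GDS support: driver status, filesystem, and PCIe topology checks
/usr/local/cuda/gds/tools/gdscheck -p
# Confirm the nvidia-fs kernel module is loaded
lsmod | grep nvidia_fs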
// Minimal GPUDirect Storage read example (error checks omitted for brevity)
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>
CUfileDescr_t cf_desc = {0};
CUfileHandle_t cf_handle;
void *gpu_ptr; // already cudaMalloc'd, at least file_size bytes
// Open file with O_DIRECT — required so the page cache is bypassed
int fd = open("/lustre/model_weights.bin", O_RDONLY | O_DIRECT);
cf_desc.handle.fd = fd;
cf_desc.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
cuFileDriverOpen(); // initialize the nvidia-fs driver session
cuFileHandleRegister(&cf_handle, &cf_desc);
// DMA: NVMe → GPU HBM (no CPU copy); file_size assumed known (e.g., from fstat)
cuFileRead(cf_handle, gpu_ptr, file_size, 0, 0);
cuFileHandleDeregister(cf_handle);
cuFileDriverClose();
close(fd);
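To build a standalone test from this sketch, link against the cuFile library (for example, nvcc gds_read.cu -lcufile); a real program would obtain file_size via fstat on the descriptor before issuing the read.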
GPUDirect RDMA enables a ConnectX-7 NIC to DMA data directly from a remote server's GPU HBM into the local GPU HBM, bypassing both CPUs and system DRAMs in the transfer path. This is critical for multi-node AllReduce performance — without RDMA, every inter-node tensor transfer passes through two CPUs and two system memory copies.
# Load the kernel module that exposes GPU memory to the RDMA stack
modprobe nvidia-peermem
# Verify it is loaded
lsmod | grep nvidia_peermem
# NCCL uses GPUDirect RDMA automatically when available; raising the GDR level
# allows GPU-direct NIC DMA across more of the PCIe topology
export NCCL_NET_GDR_LEVEL=2
# Confirm the chosen paths in the NCCL log
export NCCL_DEBUG=INFO
GPUDirect P2P enables GPU A to read from or write to GPU B's HBM directly over NVLink or PCIe — without staging through system DRAM. In DGX H100 with NVLink 4, P2P transfers at 900 GB/s make tensor parallelism within a node nearly as fast as accessing local HBM. Enable with cudaDeviceEnablePeerAccess(peerDevice, 0) and verify with cudaDeviceCanAccessPeer(&canAccess, devA, devB).
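Before relying on P2P, it is worth checking which GPU pairs actually have an NVLink path; a short sketch (the bandwidth test ships with the CUDA samples, and its build path varies by install):

# Interconnect matrix: NV# entries = NVLink hops, PIX/PXB/PHB = PCIe paths
nvidia-smi topo -m
# Measure achieved P2P bandwidth and latency between all GPU pairs
./p2pBandwidthLatencyTest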
Even with fast NVMe and parallel filesystems, data loading can bottleneck training if the software pipeline is inefficient. The goal is to keep GPU compute 100% utilized by ensuring the next mini-batch of data is always ready before the GPU needs it — this requires pipelining storage reads, preprocessing, and GPU transfer in parallel.
NVIDIA DALI (Data Loading Library) replaces the CPU-side data preprocessing with GPU execution, dramatically reducing data loading overhead for vision and NLP workloads. DALI performs decoding, augmentation, normalization, and format conversion directly on the GPU.
| Feature | PyTorch DataLoader | NVIDIA DALI |
|---|---|---|
| Decode (JPEG) | CPU libjpeg | GPU (nvJPEG) — up to 10× faster |
| Augmentation | CPU (torchvision) | GPU — eliminates CPU bottleneck |
| Pipeline overlap | Limited (Python GIL) | Full async prefetch with CUDA streams |
| Multi-GPU | Replicate per worker | Native multi-GPU shard support |
| Video support | Limited | Yes — GPU video decode |
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali.pipeline import pipeline_def
@pipeline_def(batch_size=256, num_threads=4, device_id=0)
def training_pipeline():
jpegs, labels = fn.readers.file(file_root="/lustre/imagenet/train",
random_shuffle=True)
# Decode directly to GPU
images = fn.decoders.image(jpegs, device="mixed",
output_type=types.RGB)
# Augment on GPU
images = fn.random_resized_crop(images, size=224, device="gpu")
images = fn.crop_mirror_normalize(images, device="gpu",
mean=[0.485*255, 0.456*255, 0.406*255],
std=[0.229*255, 0.224*255, 0.225*255])
return images, labels
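To instantiate and run the decorated pipeline above (a minimal sketch; in PyTorch training you would normally wrap it with the DALIGenericIterator plugin rather than calling run() directly):

pipe = training_pipeline()  # the decorator turns the function into a Pipeline factory
pipe.build()
images, labels = pipe.run()  # one batch: images on GPU, labels still on CPU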
Pinned (page-locked) memory on the CPU side enables direct DMA from CPU DRAM to GPU HBM over PCIe — no intermediate copy is needed. Pageable memory requires the OS to first copy data into a pinned staging buffer before PCIe DMA can occur.
# PyTorch: enable pinned memory in DataLoader
loader = torch.utils.data.DataLoader(
dataset, batch_size=256,
pin_memory=True, # ← enables direct DMA to GPU
num_workers=8, # ← parallel CPU data loading
prefetch_factor=2 # ← prefetch 2 batches ahead
)
Set num_workers to roughly 2–4 per GPU: too few creates an I/O bottleneck, too many causes CPU contention. Always enable pin_memory=True when training on GPU.

Checkpointing saves model state (weights + optimizer state + training metadata) periodically during training. For large LLMs, checkpoints are hundreds of GB and must be saved quickly to minimize GPU idle time.
| Strategy | How | Speed | Best For |
|---|---|---|---|
| Full checkpoint | Rank 0 gathers all shards → saves to single file | Slow (single writer, bottlenecked) | Small models, simple restore |
| Sharded checkpoint (FSDP) | Each rank saves its own shard independently in parallel | Fast (N writers in parallel) | LLM training with FSDP or DeepSpeed |
| Async checkpoint | Copy weights to CPU RAM in background while GPU continues training | Near-zero GPU pause | Frequent checkpointing without overhead |
| GDS-accelerated | GPU → NVMe DMA via cuFile (no CPU) | 40–50% faster than CPU-routed | Large checkpoints to local NVMe |
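A rough sketch of the sharded pattern, assuming an initialized torch.distributed process group (the function name and checkpoint directory are hypothetical; FSDP and DeepSpeed ship their own savers):

import os
import torch
import torch.distributed as dist

def save_sharded_checkpoint(shard_state, step, ckpt_dir="/raid/ckpt"):
    # Each rank writes only its own shard: N parallel writers, no rank-0 gather
    path = os.path.join(ckpt_dir, f"step_{step}_rank_{dist.get_rank()}.pt")
    torch.save(shard_state, path)
    # All shards must land on disk before the checkpoint is considered complete
    dist.barrier()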
How you store training data is as important as what storage system you use. Millions of small files destroy filesystem metadata performance (Lustre MDS IOPS) and cause random I/O patterns on HDDs/SSDs. Packaging samples into large sequential archives enables high-throughput streaming reads.
| Format | Structure | Read Pattern | Throughput | AI Usage |
|---|---|---|---|---|
| WebDataset | TAR archives (key-value pairs inside) | Sequential (fast) | Excellent | CV training, ImageNet-scale |
| MosaicML Streaming | Indexed binary shards | Sequential + indexed | Excellent | LLM pre-training |
| TFRecord | Protocol buffer records | Sequential | Good | TensorFlow workloads |
| HDF5 | Hierarchical binary format | Random or sequential | Good | Scientific/medical AI |
| Raw files (JPEG/PNG) | Individual files per sample | Random (kills MDS) | Poor at scale | Avoid for large datasets |
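As a small illustration of the packaged-archive approach, a WebDataset read sketch (the shard pattern and sample keys are assumptions about how the archives were packed):

import webdataset as wds
# Stream samples sequentially from sharded TAR archives; no per-file metadata lookups
dataset = (
    wds.WebDataset("/lustre/imagenet/shards/train-{000000..000146}.tar")
    .decode("rgb")             # decode image bytes to arrays
    .to_tuple("jpg", "cls")    # (image, label) pairs keyed by extension
)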
Exam-focused quick reference:

| Task | Command / Key Fact |
|---|---|
| Benchmark raw NVMe | fio --name=seq_read --rw=read --bs=1m --ioengine=libaio --iodepth=32 --filename=/dev/nvme0n1 --direct=1 --size=10g; for GDS-enabled benchmarks use NVIDIA's gdsio tool |
| Stripe large files for parallel I/O | lfs setstripe -c 8 -S 4m /path; verify with lfs getstripe /path/file |
| Stripe checkpoints across all OSTs | lfs setstripe -c -1 -S 4m /lustre/checkpoints (-c -1 = all OSTs); set at directory level so new files inherit it |
| Inspect stripe settings and usage | lfs getstripe -r /lustre/training_data (-r = recursive); disk usage: lfs df -h /lustre; MDT usage: lfs df --mdt /lustre |
| GDS on Lustre | Load the nvidia-fs module, then use cuFile with standard Lustre paths; no special Lustre client config beyond the GDS driver |
| Verify GDS setup | modprobe nvidia-fs; verify: lsmod \| grep nvidia_fs; config: cat /proc/driver/nvidia-fs/params; test: gdsio -f /dev/nvme0n1 -d 0 -w 0 -s 1g |
| Enable GPUDirect RDMA | modprobe nvidia-peermem; NCCL uses it automatically when available; raise GDR scope with NCCL_NET_GDR_LEVEL=2 |
| Enable GPU P2P | cudaDeviceEnablePeerAccess(peerDev, 0); over NVLink, always available with near-HBM bandwidth for inter-GPU tensor copies |
| GDS sharded checkpointing | Each rank issues cuFileWrite to its own shard file on local NVMe: fully parallel, no rank 0 bottleneck |
| DALI GPU decode | import nvidia.dali.fn as fn; key op: fn.decoders.image(..., device="mixed") decodes JPEG on the GPU |
| Pinned memory | pin_memory=True in DataLoader (or cudaHostAlloc) enables direct DMA from CPU DRAM to GPU HBM, avoiding the OS copy from pageable to a pinned staging buffer |
| Diagnose data-loading bottlenecks | Run nvidia-smi dmon and check GPU utilization (should be >90%); overlap transfers with compute via cudaMemcpyAsync |
| Dataset packaging | WebDataset (pip install webdataset) for large CV datasets: sequential TAR reads, no small-file IOPS; MosaicML Streaming for LLM pre-training with random access across shuffled shards |
| Async checkpointing | Copy state to CPU RAM in a background thread (e.g., AsyncCheckpointingManager or custom threading) while the GPU keeps training |
| Checkpoint tiering | Local NVMe → rsync/cp to Lustre/WEKA → optional offload to S3; each stage uses the fastest available path |
| Checkpoint validation | After restore, compare parameters with torch.allclose(original_param, loaded_param) to catch silent corruption; a single corrupt shard can corrupt a whole multi-rank restore |

| Fact | Mnemonic / Hook |
|---|---|
| NVMe Gen 4 = 7 GB/s | "7 GB/s — 7 days in a week, one week of HDD performance per second" |
| DGX H100 NVMe: 8 × 3.84 TB = 30 TB | "8 drives × 4 TB ≈ 30 TB — like 30 TB 'cloud' living inside your server" |
| GDS = cuFile + nvidia-fs | "GPU Direct Storage = C-u-File + nvidia-FS kernel — two parts, one path" |
| Lustre stripe: -c count, -S size | "lfs setstripe: -c for Count, -S for Size. C comes before S alphabetically." |
| LLaMA 70B full checkpoint = ~700 GB | "70B params × 10 bytes avg (weights + optimizer) ≈ 700 GB = 10 H100s of HBM" |
| Small files = MDS bottleneck | "Millions of tiny files = millions of MDS knocks. Use TAR/WebDataset to knock once." |
| DALI = GPU decode | "DALI (Data Loading Library) moves decoding to the GPU: no CPU in the JPEG path" |