NCP-ADS: GPU Foundations & RAPIDS Ecosystem
Every RAPIDS operation runs on GPU hardware — understanding how GPUs are structured, how memory moves, and what each RAPIDS library does is the bedrock of the entire NCP-ADS exam. Master this first.
Why GPUs for Data Science?
- Massive parallelism: thousands of CUDA cores vs. tens of CPU cores
- High memory bandwidth: HBM2e on A100 = 2 TB/s vs. ~50 GB/s CPU RAM
- SIMT execution: one instruction runs across thousands of data elements simultaneously (SIMD-style parallelism, organized into warps)
- Bottleneck shift: data science workloads are memory-bandwidth-bound — GPUs excel here
- RAPIDS benefit: 10–100× speedup over CPU pandas/scikit-learn on large datasets
RAPIDS at a Glance
- cuDF: GPU DataFrame library — pandas-compatible API, runs in GPU memory
- cuML: GPU machine learning — scikit-learn-compatible algorithms on GPU
- cuGraph: GPU graph analytics — NetworkX-compatible algorithms
- DALI: GPU data loading and augmentation for deep learning pipelines
- cuSpatial: GPU-accelerated geospatial analytics
- Dask-cuDF: distributed multi-GPU DataFrames using Dask
RAPIDS Library Quick Reference
| Library | CPU Equivalent | Primary Use | Key Benefit |
|---|---|---|---|
| cuDF | pandas | DataFrame ETL and manipulation | GPU-accelerated data wrangling |
| cuML | scikit-learn | ML algorithms on GPU | 10–50× faster training on large data |
| cuGraph | NetworkX | Graph analytics | Scales to billion-edge graphs |
| DALI | torchvision / tf.data | Data loading and augmentation | Moves preprocessing to GPU |
| cuSpatial | GeoPandas | Geospatial analytics | GPU-accelerated spatial joins |
| Dask-cuDF | Dask + pandas | Multi-GPU distributed DataFrames | Scales cuDF across GPUs/nodes |
GPU Architecture for Data Science
Understanding the GPU hardware model explains why RAPIDS achieves such dramatic speedups — and when GPU acceleration helps vs. hurts.
CUDA Hierarchy
- Thread: smallest unit of execution; runs a single instance of a kernel function
- Warp: 32 threads that execute in lockstep (SIMT — Single Instruction, Multiple Threads)
- Thread Block: group of warps sharing Shared Memory; up to 1024 threads per block
- Grid: collection of thread blocks executing the same kernel; mapped to the full GPU
- Streaming Multiprocessor (SM): physical GPU unit that schedules and executes warps; A100 has 108 SMs
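To make the hierarchy concrete, here is a minimal Numba CUDA kernel (the same JIT path that cuDF's `apply` uses, as noted later in this guide). The array size, block size, and the `scale` kernel itself are illustrative assumptions, not exam-specific code:

```python
from numba import cuda
import numpy as np

@cuda.jit
def scale(x, out, factor):
    # Global thread index: block offset plus thread offset within the block.
    # (cuda.grid(1) is the idiomatic shorthand for this expression.)
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if i < x.size:  # guard: the grid may overshoot the array
        out[i] = x[i] * factor

x = np.arange(1_000_000, dtype=np.float32)
out = np.empty_like(x)

threads_per_block = 256  # 8 warps of 32 threads each
blocks = (x.size + threads_per_block - 1) // threads_per_block
scale[blocks, threads_per_block](x, out, 2.0)  # launch a grid of blocks
```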
GPU Memory Hierarchy
- Registers: per-thread, fastest, very small; spilling to local memory is slow
- Shared Memory (L1): per-SM, explicitly managed; tens of KB per block (up to ~164 KB per SM on A100); key for cooperative algorithms
- L2 Cache: shared across all SMs; hardware-managed; tens of MB on data-center GPUs (40 MB on A100)
- Device Memory (VRAM/HBM): large (40–80 GB on A100); high bandwidth but higher latency than shared memory
- Host Memory (CPU RAM): connected via PCIe; data transfer is a major bottleneck — minimize H2D/D2H copies
When GPU Wins
- Large datasets (>1M rows) where parallelism pays off
- Embarrassingly parallel operations: column transforms, aggregations, joins on large tables
- Iterative and ensemble ML algorithms: gradient descent, k-means, and random forests (which parallelize across trees) on big data
- Matrix operations: linear algebra, dot products — native GPU strength
- Data augmentation pipelines: image transforms, per-batch input normalization
When GPU May NOT Help
- Small datasets (<100K rows) — overhead of GPU launch exceeds benefit
- Sequential, branchy logic with data dependencies (CPU branch predictor excels here)
- Data transfer bottleneck: if H2D/D2H copies dominate, no net speedup
- Algorithms with poor GPU implementations in RAPIDS
- Available GPU VRAM insufficient to hold dataset in memory
Key NVIDIA GPU Generations for ADS
- V100: Volta; 32/16 GB HBM2; first Tensor Core GPU; 900 GB/s bandwidth
- A100: Ampere; 40/80 GB HBM2e; MIG support; 2 TB/s bandwidth; primary for data science
- H100: Hopper; 80 GB HBM3; 3.35 TB/s bandwidth; FP8 support; NVLink 4.0
- NVLink: high-speed GPU-to-GPU interconnect; enables GPU memory pooling across devices
Host-Device Transfer Optimization
- PCIe bandwidth: PCIe 4.0 ×16 = ~32 GB/s; far slower than GPU HBM (~2 TB/s)
- Pinned (page-locked) memory: CPU memory that cannot be swapped; enables faster DMA transfers to GPU
- Zero-copy memory: GPU accesses CPU RAM directly over PCIe — avoids explicit transfer but slower than HBM
- Best practice: load data to GPU once, keep all transformations on GPU, minimize round-trips
- RMM (RAPIDS Memory Manager): pool-based GPU memory allocator that reduces allocation overhead
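Putting these rules together, here is a minimal sketch of a transfer-efficient cuDF workflow; the file name and column names are placeholders for illustration:

```python
import cudf

# One host-to-device transfer: cuIO parses the file straight into GPU memory.
df = cudf.read_csv("transactions.csv")

# All transformations stay on the GPU; no intermediate host copies.
df["amount_usd"] = df["amount"] * df["fx_rate"]
summary = df.groupby("customer_id")["amount_usd"].sum()

# A single device-to-host copy at the very end, only if a CPU library needs it.
result = summary.to_pandas()
```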
Tensor Cores & Specialized Hardware
- CUDA Cores: general-purpose floating point and integer compute
- Tensor Cores: hardware-accelerated matrix multiply-accumulate (GEMM); critical for ML training and inference
- Tensor Core precisions: TF32, FP16, BF16, INT8, and FP8 (H100); accumulation is typically done in FP32
- cuBLAS: CUDA-optimized BLAS routines used by cuML and DL frameworks under the hood
- cuSPARSE: sparse matrix operations — used in graph analytics and NLP
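As an illustration of cuBLAS doing the heavy lifting, here is a small CuPy sketch (CuPy is not covered in this section, so treat the tooling choice as an assumption; the matrix sizes are arbitrary):

```python
import cupy as cp

# Allocate two matrices directly in GPU memory.
a = cp.random.random((4096, 4096)).astype(cp.float32)
b = cp.random.random((4096, 4096)).astype(cp.float32)

# The @ operator dispatches to a cuBLAS GEMM kernel on the GPU.
c = a @ b

# Kernels launch asynchronously; synchronize before timing or reading results.
cp.cuda.Stream.null.synchronize()
```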
RAPIDS Ecosystem Deep Dive
RAPIDS provides a GPU-native data science stack where data stays in GPU memory across the entire pipeline — eliminating costly CPU↔GPU round-trips between steps.
cuDF Core Concepts
- pandas-compatible API: most pandas operations have direct cuDF equivalents; drop-in replacement for many workflows
- Apache Arrow columnar format: cuDF stores data in GPU memory using Arrow; enables zero-copy interoperability with other Arrow-compatible libraries
- Eager execution: cuDF runs operations immediately, like pandas; lazy, deferred execution comes from Dask-cuDF at the distributed level
- String operations: the pandas-style .str accessor handles text data on GPU; regex, split, replace all GPU-accelerated
- Null handling: cuDF uses a separate bitmask for nulls (Apache Arrow style)
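A minimal sketch of these concepts in code; the example strings are arbitrary:

```python
import cudf

# A series with a null: cuDF tracks validity in an Arrow-style bitmask.
s = cudf.Series(["new york", None, "san francisco"])

# String operations run on the GPU through the pandas-style .str accessor.
upper = s.str.upper()
parts = s.str.split(" ")

# Interop: export to a pyarrow Table in the Arrow columnar layout.
table = cudf.DataFrame({"city": s}).to_arrow()
```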
cuDF Key Operations
- groupby/agg: multi-column groupby with sum, mean, count, min, max — all GPU-parallel
- merge/join: inner, left, right, outer joins; hash-based GPU implementation
- apply/map: user-defined functions via Numba CUDA JIT — write Python, runs on GPU
- read_csv / read_parquet: GPU-accelerated file I/O (cuIO)
- to_pandas() / from_pandas(): explicit conversion between cuDF and pandas (triggers H2D/D2H)
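A short sketch combining a hash join with a groupby aggregation; the tables and column names are invented for illustration:

```python
import cudf

sales = cudf.DataFrame({
    "store": ["a", "a", "b", "b"],
    "amount": [10.0, 20.0, 5.0, 7.5],
})
stores = cudf.DataFrame({"store": ["a", "b"], "region": ["east", "west"]})

# Hash-based GPU join, then GPU-parallel groupby aggregation.
joined = sales.merge(stores, on="store", how="left")
totals = joined.groupby("region").agg({"amount": ["sum", "mean"]})
```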
cuML — GPU Machine Learning
- scikit-learn-compatible estimator API: `.fit()`, `.predict()`, `.transform()` (see the sketch after this list)
- Algorithms: Linear/Logistic Regression, KNN, SVM, k-Means, DBSCAN, PCA, UMAP, Random Forest; XGBoost has native GPU support and pairs with RAPIDS but ships as a separate library
- Accepts cuDF DataFrames directly — no CPU round-trip
- Multi-GPU training via `cuml.dask` — distribute across a GPU cluster
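A minimal cuML sketch using k-Means, one of the algorithms listed above; the toy coordinates are illustrative:

```python
import cudf
from cuml.cluster import KMeans

# Synthetic feature frame; stays in GPU memory end to end.
X = cudf.DataFrame({
    "x": [0.0, 0.1, 5.0, 5.1, 9.0, 9.2],
    "y": [0.0, 0.2, 5.0, 4.9, 9.1, 9.0],
})

km = KMeans(n_clusters=3, random_state=42)  # scikit-learn-style estimator
km.fit(X)                                   # accepts cuDF directly, no host copy
labels = km.predict(X)                      # results come back as GPU data
```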
cuGraph — GPU Graph Analytics
- NetworkX-compatible API for graph algorithms on GPU
- Algorithms: PageRank, BFS/DFS, Louvain community detection, Betweenness Centrality, Jaccard similarity
- Scales to billions of edges — CPU NetworkX would time out on the same graph
- Input: edge lists as cuDF DataFrames
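A minimal cuGraph sketch running PageRank over a toy edge list; vertex IDs and column names are illustrative, and constructor details can vary slightly across RAPIDS releases:

```python
import cudf
import cugraph

# Edge list as a cuDF DataFrame — cuGraph's native input format.
edges = cudf.DataFrame({
    "src": [0, 0, 1, 2, 2, 3],
    "dst": [1, 2, 2, 0, 3, 0],
})

G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(edges, source="src", destination="dst")

scores = cugraph.pagerank(G)  # returns a cuDF DataFrame: vertex, pagerank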
DALI — Data Loading Library
- GPU-accelerated data loading, decoding, and augmentation for DL pipelines
- Moves image/video decode and augmentation from CPU to GPU
- Integrates with PyTorch DataLoader and TensorFlow tf.data
- Eliminates CPU preprocessing bottleneck in training pipelines
- Supports image, video, audio data types
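A minimal DALI pipeline sketch; the file_root path, batch size, and target resolution are placeholder assumptions:

```python
from nvidia.dali import pipeline_def, fn

@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def train_pipeline():
    # The reader lists files on the CPU; "mixed" decode starts on the CPU
    # and finishes on the GPU, where the augmentations then run.
    jpegs, labels = fn.readers.file(file_root="/data/images", random_shuffle=True)
    images = fn.decoders.image(jpegs, device="mixed")
    images = fn.resize(images, resize_x=224, resize_y=224)  # GPU-side resize
    return images, labels

pipe = train_pipeline()
pipe.build()
images, labels = pipe.run()
```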
Environment Setup & Configuration
Knowing how to install, configure, and manage the RAPIDS environment is tested directly on the NCP-ADS exam.
Conda Installation
- Recommended for local development: `conda install -c rapidsai -c conda-forge -c nvidia rapids=24.06 python=3.11 cuda-version=12.0`
- Creates an isolated environment with all RAPIDS libraries and CUDA dependencies
- Must match the CUDA toolkit version to the installed GPU driver
- Use `nvidia-smi` to check the driver version and its maximum supported CUDA version
- RAPIDS release cadence: approximately every 6–8 weeks (e.g., 24.04, 24.06, 24.08)
Docker Containers
- NVIDIA provides pre-built RAPIDS containers on NGC and Docker Hub
- Guarantees correct CUDA/driver/library version alignment
- Requires: NVIDIA Container Toolkit (nvidia-docker2) installed on host
- Run: `docker run --gpus all -it rapidsai/base:24.06-cuda12.0-py3.11`
- NGC containers are enterprise-grade, validated, and production-ready
RMM Memory Allocation Strategies
- CudaMemoryResource (default): calls `cudaMalloc`/`cudaFree` directly; safe but slow for frequent small allocations
- PoolMemoryResource: pre-allocates a large pool and sub-allocates from it; dramatically faster for many small allocations (see the sketch after this list)
- ManagedMemoryResource: CUDA Unified Memory — data can spill from GPU to CPU RAM automatically
- Best practice: use PoolMemoryResource in production workflows for lowest allocation overhead
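A minimal RMM sketch switching the default allocator to a pool; the 2 GB initial size is an illustrative assumption:

```python
import rmm

# Carve out an initial 2 GB pool; sub-allocations come from the pool
# instead of hitting cudaMalloc/cudaFree on every request.
pool = rmm.mr.PoolMemoryResource(
    rmm.mr.CudaMemoryResource(),
    initial_pool_size=2 * 1024**3,
)
rmm.mr.set_current_device_resource(pool)

import cudf  # cuDF allocations now draw from the pool
```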
Version Compatibility
- RAPIDS requires: NVIDIA GPU with CUDA Compute Capability ≥ 7.0 (Volta+)
- Check: `nvidia-smi --query-gpu=compute_cap --format=csv`
- CUDA toolkit version must be ≤ the driver's maximum supported CUDA version
- Python version: RAPIDS supports Python 3.10 and 3.11
- cuDF, cuML, cuGraph must all be the same RAPIDS release version
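A quick compatibility check, sketched with Numba and assuming a single-GPU machine:

```python
from numba import cuda

# Compute capability as a (major, minor) tuple, e.g. (8, 0) for A100.
cc = cuda.get_current_device().compute_capability
assert cc >= (7, 0), f"RAPIDS needs Volta (7.0) or newer, found {cc}"

# All RAPIDS libraries must come from the same release.
import cudf, cuml
print(cudf.__version__, cuml.__version__)
```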
When to Use RAPIDS vs CPU
- Dataset fits in GPU VRAM → use cuDF/cuML directly
- Dataset exceeds single GPU VRAM → use Dask-cuDF for multi-GPU
- Dataset is small (<100K rows) → pandas is fine; GPU overhead not worth it
- Need unsupported operations → use `.to_pandas()` temporarily, then convert back to cuDF
- Production serving → use NIM or Triton Inference Server for GPU inference
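One way to sketch this decision in code, using pynvml to read free VRAM; the 10 GB dataset size and the 3× headroom factor are rough illustrative assumptions:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

# Illustrative heuristic only: joins and groupbys allocate intermediates,
# so leave roughly 2-3x headroom beyond the raw input size.
dataset_bytes = 10 * 1024**3  # assumed 10 GB input
if dataset_bytes * 3 < mem.free:
    print("fits: use cuDF on a single GPU")
else:
    print("too big: use Dask-cuDF across GPUs (or pandas on CPU)")
```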
Memory Hooks
Memorable anchors for the core GPU and RAPIDS concepts.
- `to_pandas()` and `from_pandas()` trigger CPU↔GPU transfers over PCIe; stay in cuDF for the whole pipeline.
- PoolMemoryResource avoids per-allocation `cudaMalloc`/`cudaFree` calls: it pre-allocates a large GPU memory pool at startup and sub-allocates from it, dramatically reducing allocation overhead in workloads with many small allocations.
- RAPIDS requires CUDA Compute Capability ≥ 7.0; check with `nvidia-smi --query-gpu=compute_cap --format=csv`. V100, T4, A100, and H100 are all supported.
CUDA Programming Model
- Thread = smallest execution unit; Warp = 32 threads executing in lockstep (SIMT)
- Thread Block = group of warps sharing Shared Memory; up to 1024 threads per block
- Grid = all blocks for a single kernel launch across the GPU
- SM (Streaming Multiprocessor) = physical compute unit scheduling warps; A100 has 108 SMs
- Warp divergence: branches within a warp cause serialization — avoid for peak efficiency
- Occupancy: ratio of active warps to maximum possible warps per SM; higher = better GPU utilization