NCP-ADS: GPU Foundations & RAPIDS Ecosystem
Every RAPIDS operation runs on GPU hardware — understanding how GPUs are structured, how memory moves, and what each RAPIDS library does is the bedrock of the entire NCP-ADS exam. Master this first.
Why GPUs for Data Science?
- Massive parallelism: thousands of CUDA cores vs. tens of CPU cores
- High memory bandwidth: HBM2e on A100 = 2 TB/s vs. ~50 GB/s CPU RAM
- SIMT execution: one instruction runs across thousands of data elements simultaneously (SIMD-style parallelism, organized into warps)
- Bottleneck shift: data science workloads are memory-bandwidth-bound — GPUs excel here
- RAPIDS benefit: 10–100× speedup over CPU pandas/scikit-learn on large datasets
RAPIDS at a Glance
- cuDF: GPU DataFrame library — pandas-compatible API, runs in GPU memory
- cuML: GPU machine learning — scikit-learn-compatible algorithms on GPU
- cuGraph: GPU graph analytics — NetworkX-compatible algorithms
- DALI: GPU data loading and augmentation for deep learning pipelines
- cuSpatial: GPU-accelerated geospatial analytics
- Dask-cuDF: distributed multi-GPU DataFrames using Dask
RAPIDS Library Quick Reference
| Library | CPU Equivalent | Primary Use | Key Benefit |
|---|---|---|---|
| cuDF | pandas | DataFrame ETL and manipulation | GPU-accelerated data wrangling |
| cuML | scikit-learn | ML algorithms on GPU | 10–50× faster training on large data |
| cuGraph | NetworkX | Graph analytics | Scales to billion-edge graphs |
| DALI | torchvision / tf.data | Data loading and augmentation | Moves preprocessing to GPU |
| cuSpatial | GeoPandas | Geospatial analytics | GPU-accelerated spatial joins |
| Dask-cuDF | Dask + pandas | Multi-GPU distributed DataFrames | Scales cuDF across GPUs/nodes |
GPU Architecture for Data Science
Understanding the GPU hardware model explains why RAPIDS achieves such dramatic speedups — and when GPU acceleration helps vs. hurts.
CUDA Hierarchy
- Thread: smallest unit of execution; runs a single instance of a kernel function
- Warp: 32 threads that execute in lockstep (SIMT — Single Instruction, Multiple Threads)
- Thread Block: group of warps sharing Shared Memory; up to 1024 threads per block
- Grid: collection of thread blocks executing the same kernel; mapped to the full GPU
- Streaming Multiprocessor (SM): physical GPU unit that schedules and executes warps; A100 has 108 SMs
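To make the hierarchy concrete, here is a minimal Numba CUDA kernel (the same JIT path that cuDF's `apply` uses, as noted later in this guide). The array size, block size, and the `scale` kernel itself are illustrative assumptions, not exam-specific code:

```python
from numba import cuda
import numpy as np

@cuda.jit
def scale(x, out, factor):
    # Global thread index: block offset plus thread offset within the block.
    # (cuda.grid(1) is the idiomatic shorthand for this expression.)
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if i < x.size:  # guard: the grid may overshoot the array
        out[i] = x[i] * factor

x = np.arange(1_000_000, dtype=np.float32)
out = np.empty_like(x)

threads_per_block = 256  # 8 warps of 32 threads each
blocks = (x.size + threads_per_block - 1) // threads_per_block
scale[blocks, threads_per_block](x, out, 2.0)  # launch a grid of blocks
```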
GPU Memory Hierarchy
- Registers: per-thread, fastest, very small; spilling to local memory is slow
- Shared Memory (L1): per-SM, explicitly managed; tens of KB per block (up to ~164 KB per SM on A100); key for cooperative algorithms
- L2 Cache: shared across all SMs; hardware-managed; tens of MB on data-center GPUs (40 MB on A100)
- Device Memory (VRAM/HBM): large (40–80 GB on A100); high bandwidth but higher latency than shared memory
- Host Memory (CPU RAM): connected via PCIe; data transfer is a major bottleneck — minimize H2D/D2H copies
When GPU Wins
- Large datasets (>1M rows) where parallelism pays off
- Embarrassingly parallel operations: column transforms, aggregations, joins on large tables
- Iterative and ensemble ML algorithms: gradient descent, k-means, and random forests (which parallelize across trees) on big data
- Matrix operations: linear algebra, dot products — native GPU strength
- Data augmentation pipelines: image transforms, per-batch input normalization
When GPU May NOT Help
- Small datasets (<100K rows) — overhead of GPU launch exceeds benefit
- Sequential, branchy logic with data dependencies (CPU branch predictor excels here)
- Data transfer bottleneck: if H2D/D2H copies dominate, no net speedup
- Algorithms with poor GPU implementations in RAPIDS
- Available GPU VRAM insufficient to hold dataset in memory
Key NVIDIA GPU Generations for ADS
- V100: Volta; 32/16 GB HBM2; first Tensor Core GPU; 900 GB/s bandwidth
- A100: Ampere; 40/80 GB HBM2e; MIG support; 2 TB/s bandwidth; primary for data science
- H100: Hopper; 80 GB HBM3; 3.35 TB/s bandwidth; FP8 support; NVLink 4.0
- NVLink: high-speed GPU-to-GPU interconnect; enables GPU memory pooling across devices
Host-Device Transfer Optimization
- PCIe bandwidth: PCIe 4.0 ×16 = ~32 GB/s; far slower than GPU HBM (~2 TB/s)
- Pinned (page-locked) memory: CPU memory that cannot be swapped; enables faster DMA transfers to GPU
- Zero-copy memory: GPU accesses CPU RAM directly over PCIe — avoids explicit transfer but slower than HBM
- Best practice: load data to GPU once, keep all transformations on GPU, minimize round-trips
- RMM (RAPIDS Memory Manager): pool-based GPU memory allocator that reduces allocation overhead
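Putting these rules together, here is a minimal sketch of a transfer-efficient cuDF workflow; the file name and column names are placeholders for illustration:

```python
import cudf

# One host-to-device transfer: cuIO parses the file straight into GPU memory.
df = cudf.read_csv("transactions.csv")

# All transformations stay on the GPU; no intermediate host copies.
df["amount_usd"] = df["amount"] * df["fx_rate"]
summary = df.groupby("customer_id")["amount_usd"].sum()

# A single device-to-host copy at the very end, only if a CPU library needs it.
result = summary.to_pandas()
```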
Tensor Cores & Specialized Hardware
- CUDA Cores: general-purpose floating point and integer compute
- Tensor Cores: hardware-accelerated matrix multiply-accumulate (GEMM); critical for ML training and inference
- Tensor Core precisions: TF32, FP16, BF16, INT8, and FP8 (H100); accumulation is typically done in FP32
- cuBLAS: CUDA-optimized BLAS routines used by cuML and DL frameworks under the hood
- cuSPARSE: sparse matrix operations — used in graph analytics and NLP
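As an illustration of cuBLAS doing the heavy lifting, here is a small CuPy sketch (CuPy is not covered in this section, so treat the tooling choice as an assumption; the matrix sizes are arbitrary):

```python
import cupy as cp

# Allocate two matrices directly in GPU memory.
a = cp.random.random((4096, 4096)).astype(cp.float32)
b = cp.random.random((4096, 4096)).astype(cp.float32)

# The @ operator dispatches to a cuBLAS GEMM kernel on the GPU.
c = a @ b

# Kernels launch asynchronously; synchronize before timing or reading results.
cp.cuda.Stream.null.synchronize()
```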
RAPIDS Ecosystem Deep Dive
RAPIDS provides a GPU-native data science stack where data stays in GPU memory across the entire pipeline — eliminating costly CPU↔GPU round-trips between steps.
cuDF Core Concepts
- pandas-compatible API: most pandas operations have direct cuDF equivalents; drop-in replacement for many workflows
- Apache Arrow columnar format: cuDF stores data in GPU memory using Arrow; enables zero-copy interoperability with other Arrow-compatible libraries
- Eager execution: cuDF runs operations immediately, like pandas; lazy, deferred execution comes from Dask-cuDF at the distributed level
- String operations: the pandas-style .str accessor handles text data on GPU; regex, split, replace all GPU-accelerated
- Null handling: cuDF uses a separate bitmask for nulls (Apache Arrow style)
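A minimal sketch of these concepts in code; the example strings are arbitrary:

```python
import cudf

# A series with a null: cuDF tracks validity in an Arrow-style bitmask.
s = cudf.Series(["new york", None, "san francisco"])

# String operations run on the GPU through the pandas-style .str accessor.
upper = s.str.upper()
parts = s.str.split(" ")

# Interop: export to a pyarrow Table in the Arrow columnar layout.
table = cudf.DataFrame({"city": s}).to_arrow()
```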
cuDF Key Operations
- groupby/agg: multi-column groupby with sum, mean, count, min, max — all GPU-parallel
- merge/join: inner, left, right, outer joins; hash-based GPU implementation
- apply/map: user-defined functions via Numba CUDA JIT — write Python, runs on GPU
- read_csv / read_parquet: GPU-accelerated file I/O (cuIO)
- to_pandas() / from_pandas(): explicit conversion between cuDF and pandas (triggers H2D/D2H)
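A short sketch combining a hash join with a groupby aggregation; the tables and column names are invented for illustration:

```python
import cudf

sales = cudf.DataFrame({
    "store": ["a", "a", "b", "b"],
    "amount": [10.0, 20.0, 5.0, 7.5],
})
stores = cudf.DataFrame({"store": ["a", "b"], "region": ["east", "west"]})

# Hash-based GPU join, then GPU-parallel groupby aggregation.
joined = sales.merge(stores, on="store", how="left")
totals = joined.groupby("region").agg({"amount": ["sum", "mean"]})
```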
cuML — GPU Machine Learning
- scikit-learn-compatible estimator API: `.fit()`, `.predict()`, `.transform()` (see the sketch after this list)
- Algorithms: Linear/Logistic Regression, KNN, SVM, k-Means, DBSCAN, PCA, UMAP, Random Forest; XGBoost has native GPU support and pairs with RAPIDS but ships as a separate library
- Accepts cuDF DataFrames directly — no CPU round-trip
- Multi-GPU training via `cuml.dask` — distribute across a GPU cluster
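A minimal cuML sketch using k-Means, one of the algorithms listed above; the toy coordinates are illustrative:

```python
import cudf
from cuml.cluster import KMeans

# Synthetic feature frame; stays in GPU memory end to end.
X = cudf.DataFrame({
    "x": [0.0, 0.1, 5.0, 5.1, 9.0, 9.2],
    "y": [0.0, 0.2, 5.0, 4.9, 9.1, 9.0],
})

km = KMeans(n_clusters=3, random_state=42)  # scikit-learn-style estimator
km.fit(X)                                   # accepts cuDF directly, no host copy
labels = km.predict(X)                      # results come back as GPU data
```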
cuGraph — GPU Graph Analytics
- NetworkX-compatible API for graph algorithms on GPU
- Algorithms: PageRank, BFS/DFS, Louvain community detection, Betweenness Centrality, Jaccard similarity
- Scales to billions of edges — CPU NetworkX would time out on the same graph
- Input: edge lists as cuDF DataFrames
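A minimal cuGraph sketch running PageRank over a toy edge list; vertex IDs and column names are illustrative, and constructor details can vary slightly across RAPIDS releases:

```python
import cudf
import cugraph

# Edge list as a cuDF DataFrame — cuGraph's native input format.
edges = cudf.DataFrame({
    "src": [0, 0, 1, 2, 2, 3],
    "dst": [1, 2, 2, 0, 3, 0],
})

G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(edges, source="src", destination="dst")

scores = cugraph.pagerank(G)  # returns a cuDF DataFrame: vertex, pagerank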
DALI — Data Loading Library
- GPU-accelerated data loading, decoding, and augmentation for DL pipelines
- Moves image/video decode and augmentation from CPU to GPU
- Integrates with PyTorch DataLoader and TensorFlow tf.data
- Eliminates CPU preprocessing bottleneck in training pipelines
- Supports image, video, audio data types
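A minimal DALI pipeline sketch; the file_root path, batch size, and target resolution are placeholder assumptions:

```python
from nvidia.dali import pipeline_def, fn

@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def train_pipeline():
    # The reader lists files on the CPU; "mixed" decode starts on the CPU
    # and finishes on the GPU, where the augmentations then run.
    jpegs, labels = fn.readers.file(file_root="/data/images", random_shuffle=True)
    images = fn.decoders.image(jpegs, device="mixed")
    images = fn.resize(images, resize_x=224, resize_y=224)  # GPU-side resize
    return images, labels

pipe = train_pipeline()
pipe.build()
images, labels = pipe.run()
```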
Environment Setup & Configuration
Knowing how to install, configure, and manage the RAPIDS environment is tested directly on the NCP-ADS exam.
Conda Installation
- Recommended for local development: `conda install -c rapidsai -c conda-forge -c nvidia rapids=24.06 python=3.11 cuda-version=12.0`
- Creates an isolated environment with all RAPIDS libraries and CUDA dependencies
- Must match the CUDA toolkit version to the installed GPU driver
- Use `nvidia-smi` to check the driver version and its maximum supported CUDA version
- RAPIDS release cadence: approximately every 6–8 weeks (e.g., 24.04, 24.06, 24.08)
Docker Containers
- NVIDIA provides pre-built RAPIDS containers on NGC and Docker Hub
- Guarantees correct CUDA/driver/library version alignment
- Requires: NVIDIA Container Toolkit (nvidia-docker2) installed on host
- Run: `docker run --gpus all -it rapidsai/base:24.06-cuda12.0-py3.11`
- NGC containers are enterprise-grade, validated, and production-ready
RMM Memory Allocation Strategies
- CudaMemoryResource (default): calls `cudaMalloc`/`cudaFree` directly; safe but slow for frequent small allocations
- PoolMemoryResource: pre-allocates a large pool and sub-allocates from it; dramatically faster for many small allocations (see the sketch after this list)
- ManagedMemoryResource: CUDA Unified Memory — data can spill from GPU to CPU RAM automatically
- Best practice: use PoolMemoryResource in production workflows for lowest allocation overhead
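A minimal RMM sketch switching the default allocator to a pool; the 2 GB initial size is an illustrative assumption:

```python
import rmm

# Carve out an initial 2 GB pool; sub-allocations come from the pool
# instead of hitting cudaMalloc/cudaFree on every request.
pool = rmm.mr.PoolMemoryResource(
    rmm.mr.CudaMemoryResource(),
    initial_pool_size=2 * 1024**3,
)
rmm.mr.set_current_device_resource(pool)

import cudf  # cuDF allocations now draw from the pool
```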
Version Compatibility
- RAPIDS requires: NVIDIA GPU with CUDA Compute Capability ≥ 7.0 (Volta+)
- Check: `nvidia-smi --query-gpu=compute_cap --format=csv`
- CUDA toolkit version must be ≤ the driver's maximum supported CUDA version
- Python version: RAPIDS supports Python 3.10 and 3.11
- cuDF, cuML, cuGraph must all be the same RAPIDS release version
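A quick compatibility check, sketched with Numba and assuming a single-GPU machine:

```python
from numba import cuda

# Compute capability as a (major, minor) tuple, e.g. (8, 0) for A100.
cc = cuda.get_current_device().compute_capability
assert cc >= (7, 0), f"RAPIDS needs Volta (7.0) or newer, found {cc}"

# All RAPIDS libraries must come from the same release.
import cudf, cuml
print(cudf.__version__, cuml.__version__)
```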
When to Use RAPIDS vs CPU
- Dataset fits in GPU VRAM → use cuDF/cuML directly
- Dataset exceeds single GPU VRAM → use Dask-cuDF for multi-GPU
- Dataset is small (<100K rows) → pandas is fine; GPU overhead not worth it
- Need unsupported operations → use `.to_pandas()` temporarily, then convert back to cuDF
- Production serving → use NIM or Triton Inference Server for GPU inference
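One way to sketch this decision in code, using pynvml to read free VRAM; the 10 GB dataset size and the 3× headroom factor are rough illustrative assumptions:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

# Illustrative heuristic only: joins and groupbys allocate intermediates,
# so leave roughly 2-3x headroom beyond the raw input size.
dataset_bytes = 10 * 1024**3  # assumed 10 GB input
if dataset_bytes * 3 < mem.free:
    print("fits: use cuDF on a single GPU")
else:
    print("too big: use Dask-cuDF across GPUs (or pandas on CPU)")
```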
Memory Hooks
Memorable anchors for the core GPU and RAPIDS concepts.
- `to_pandas()` and `from_pandas()` trigger CPU↔GPU transfers over PCIe; stay in cuDF for the whole pipeline.
- PoolMemoryResource avoids per-allocation `cudaMalloc`/`cudaFree` calls: it pre-allocates a large GPU memory pool at startup and sub-allocates from it, dramatically reducing allocation overhead in workloads with many small allocations.
- RAPIDS requires CUDA Compute Capability ≥ 7.0; check with `nvidia-smi --query-gpu=compute_cap --format=csv`. V100, T4, A100, and H100 are all supported.
CUDA Programming Model
- Thread = smallest execution unit; Warp = 32 threads executing in lockstep (SIMT)
- Thread Block = group of warps sharing Shared Memory; up to 1024 threads per block
- Grid = all blocks for a single kernel launch across the GPU
- SM (Streaming Multiprocessor) = physical compute unit scheduling warps; A100 has 108 SMs
- Warp divergence: branches within a warp cause serialization — avoid for peak efficiency
- Occupancy: ratio of active warps to maximum possible warps per SM; higher = better GPU utilization