FlashGenius
NCP-ADS Exam Prep · Topic 1

GPU Foundations & RAPIDS Ecosystem

CUDA Architecture · GPU Memory · RAPIDS Library Suite · Environment Setup

Core foundation for the NCP-ADS — Accelerated Data Science exam


NCP-ADS: GPU Foundations & RAPIDS Ecosystem

Every RAPIDS operation runs on GPU hardware — understanding how GPUs are structured, how memory moves, and what each RAPIDS library does is the bedrock of the entire NCP-ADS exam. Master this first.

Why GPUs for Data Science?

  • Massive parallelism: thousands of CUDA cores vs. tens of CPU cores
  • High memory bandwidth: HBM2e on A100 = 2 TB/s vs. ~50 GB/s CPU RAM
  • SIMT execution: the same instruction runs across thousands of data elements simultaneously (SIMD-style parallelism)
  • Bottleneck shift: data science workloads are memory-bandwidth-bound — GPUs excel here
  • RAPIDS benefit: 10–100× speedup over CPU pandas/scikit-learn on large datasets

RAPIDS at a Glance

  • cuDF: GPU DataFrame library — pandas-compatible API, runs in GPU memory
  • cuML: GPU machine learning — scikit-learn-compatible algorithms on GPU
  • cuGraph: GPU graph analytics — NetworkX-compatible algorithms
  • DALI: GPU data loading and augmentation for deep learning pipelines
  • cuSpatial: GPU-accelerated geospatial analytics
  • Dask-cuDF: distributed multi-GPU DataFrames using Dask

RAPIDS Library Quick Reference

Library | CPU Equivalent | Primary Use | Key Benefit
cuDF | pandas | DataFrame ETL and manipulation | GPU-accelerated data wrangling
cuML | scikit-learn | ML algorithms on GPU | 10–50× faster training on large data
cuGraph | NetworkX | Graph analytics | Scales to billion-edge graphs
DALI | torchvision / tf.data | Data loading and augmentation | Moves preprocessing to GPU
cuSpatial | GeoPandas | Geospatial analytics | GPU-accelerated spatial joins
Dask-cuDF | Dask + pandas | Multi-GPU distributed DataFrames | Scales cuDF across GPUs/nodes

GPU Architecture for Data Science

Understanding the GPU hardware model explains why RAPIDS achieves such dramatic speedups — and when GPU acceleration helps vs. hurts.

CUDA Programming Model

CUDA Hierarchy

  • Thread: smallest unit of execution; runs a single instance of a kernel function
  • Warp: 32 threads that execute in lockstep (SIMT — Single Instruction, Multiple Threads)
  • Thread Block: group of warps sharing Shared Memory; up to 1024 threads per block
  • Grid: collection of thread blocks executing the same kernel; mapped to the full GPU (a kernel-launch sketch follows this list)
  • Streaming Multiprocessor (SM): physical GPU unit that schedules and executes warps; A100 has 108 SMs
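
To make the hierarchy concrete, here is a minimal Numba CUDA kernel sketch (Numba ships in RAPIDS environments); the array size, block size, and function name are arbitrary illustrative choices.

    from numba import cuda
    import numpy as np

    @cuda.jit
    def scale(out, x, factor):
        # cuda.grid(1) = blockIdx.x * blockDim.x + threadIdx.x, this thread's global index
        i = cuda.grid(1)
        if i < x.size:                       # guard threads that fall past the end of the data
            out[i] = x[i] * factor

    x = np.arange(1_000_000, dtype=np.float32)
    out = np.zeros_like(x)

    threads_per_block = 256                  # 8 warps of 32 threads per block
    blocks_per_grid = (x.size + threads_per_block - 1) // threads_per_block

    # Launch configuration: kernel[grid size, block size](arguments)
    scale[blocks_per_grid, threads_per_block](out, x, 2.0)

In RAPIDS workflows you rarely write kernels by hand; cuDF's apply() compiles Python UDFs through this same Numba machinery.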

GPU Memory Hierarchy

  • Registers: per-thread, fastest, very small; spilling to local memory is slow
  • Shared Memory (L1): per-SM, configurable up to ~96 KB (V100) or ~164 KB (A100); explicitly managed; key for cooperative algorithms
  • L2 Cache: shared across all SMs; hardware-managed; tens of MB on data center GPUs (40 MB on A100)
  • Device Memory (VRAM/HBM): large (40–80 GB on A100); high bandwidth but higher latency than shared memory
  • Host Memory (CPU RAM): connected via PCIe; data transfer is a major bottleneck — minimize H2D/D2H copies

GPU vs CPU for Data Science

When GPU Wins

  • Large datasets (>1M rows) where parallelism pays off
  • Embarrassingly parallel operations: column transforms, aggregations, joins on large tables
  • Iterative ML algorithms: gradient descent, k-means, random forests on big data
  • Matrix operations: linear algebra, dot products — native GPU strength
  • Data augmentation pipelines: image transforms, batch normalization

When GPU May NOT Help

  • Small datasets (<100K rows) — overhead of GPU launch exceeds benefit
  • Sequential, branchy logic with data dependencies (CPU branch predictor excels here)
  • Data transfer bottleneck: if H2D/D2H copies dominate, no net speedup
  • Algorithms with poor GPU implementations in RAPIDS
  • Dataset too large to fit in available GPU VRAM

Key NVIDIA GPU Generations for ADS

  • V100: Volta; 32/16 GB HBM2; first Tensor Core GPU; 900 GB/s bandwidth
  • A100: Ampere; 40/80 GB HBM2e; MIG support; 2 TB/s bandwidth; primary for data science
  • H100: Hopper; 80 GB HBM3; 3.35 TB/s bandwidth; FP8 support; NVLink 4.0
  • NVLink: high-speed GPU-to-GPU interconnect; enables GPU memory pooling across devices

Data Transfer & PCIe

Host-Device Transfer Optimization

  • PCIe bandwidth: PCIe 4.0 ×16 = ~32 GB/s; far slower than GPU HBM (~2 TB/s)
  • Pinned (page-locked) memory: CPU memory that cannot be swapped; enables faster DMA transfers to GPU
  • Zero-copy memory: GPU accesses CPU RAM directly over PCIe — avoids explicit transfer but slower than HBM
  • Best practice: load data to GPU once, keep all transformations on GPU, minimize round-trips (sketched below)
  • RMM (RAPIDS Memory Manager): pool-based GPU memory allocator that reduces allocation overhead
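
A small CuPy sketch of the keep-data-on-GPU principle; the array size is arbitrary, and the only PCIe transfers are the explicit boundary calls.

    import numpy as np
    import cupy as cp

    x_cpu = np.random.rand(10_000_000).astype(np.float32)

    x_gpu = cp.asarray(x_cpu)                # one host-to-device copy over PCIe
    y_gpu = cp.sqrt(x_gpu) * 2.0 + 1.0       # intermediates stay in GPU memory
    mean_val = float(y_gpu.mean())           # only a scalar crosses back to the host
    # y_cpu = cp.asnumpy(y_gpu)              # full device-to-host copy, only if truly needed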

Tensor Cores & Specialized Hardware

  • CUDA Cores: general-purpose floating point and integer compute
  • Tensor Cores: hardware-accelerated matrix multiply-accumulate (GEMM); critical for ML training and inference
  • Supported precisions: TF32 (for FP32 inputs), FP16, BF16, INT8, FP8 (H100 only); FP64 Tensor Core ops on A100/H100
  • cuBLAS: CUDA-optimized BLAS routines used by cuML and DL frameworks under the hood (see the sketch after this list)
  • cuSPARSE: sparse matrix operations — used in graph analytics and NLP
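
As a quick illustration of the libraries above, a CuPy matrix multiply dispatches to cuBLAS GEMM on the GPU; the matrix sizes and dtype here are arbitrary.

    import cupy as cp

    a = cp.random.random((4096, 4096), dtype=cp.float32)
    b = cp.random.random((4096, 4096), dtype=cp.float32)

    c = a @ b                                # dense GEMM executed by cuBLAS on the GPU
    cp.cuda.Stream.null.synchronize()        # wait for the kernel before timing or reading c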

RAPIDS Ecosystem Deep Dive

RAPIDS provides a GPU-native data science stack where data stays in GPU memory across the entire pipeline — eliminating costly CPU↔GPU round-trips between steps.

cuDF — GPU DataFrames

cuDF Core Concepts

  • pandas-compatible API: most pandas operations have direct cuDF equivalents; drop-in replacement for many workflows
  • Apache Arrow columnar format: cuDF stores data in GPU memory using Arrow; enables zero-copy interoperability with other Arrow-compatible libraries
  • Eager execution: cuDF runs operations immediately, like pandas; deferred, lazy execution comes from Dask-cuDF rather than cuDF itself
  • String operations: text data is handled on the GPU via the .str accessor (as in pandas); regex, split, replace are all GPU-accelerated
  • Null handling: cuDF uses a separate validity bitmask for nulls (Apache Arrow style; see the sketch after this list)
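
A small sketch of these concepts in practice; the column names and values are invented for illustration.

    import cudf

    gdf = cudf.DataFrame({
        "city": ["berlin", "paris", None, "tokyo"],   # nulls tracked in an Arrow-style bitmask
        "temp": [21.5, 24.0, 19.0, 27.5],
    })

    print(gdf["city"].isna().sum())          # null count, computed on the GPU
    print(gdf["city"].str.upper())           # string ops via the .str accessor
    arrow_table = gdf.to_arrow()             # pyarrow.Table for Arrow-based interoperability
    pdf = gdf.to_pandas()                    # explicit device-to-host copy into pandas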

cuDF Key Operations

  • groupby/agg: multi-column groupby with sum, mean, count, min, max — all GPU-parallel
  • merge/join: inner, left, right, outer joins; hash-based GPU implementation
  • apply/map: user-defined functions via Numba CUDA JIT — write Python, runs on GPU
  • read_csv / read_parquet: GPU-accelerated file I/O (cuIO)
  • to_pandas() / from_pandas(): explicit conversion between cuDF and pandas (triggers D2H/H2D transfers); the sketch below ties these operations together
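
A hedged end-to-end sketch tying these operations together; the file names and columns (sales.parquet, store_id, region, revenue, units) are hypothetical.

    import cudf

    sales = cudf.read_parquet("sales.parquet")      # GPU-accelerated I/O via cuIO
    stores = cudf.read_csv("stores.csv")

    merged = sales.merge(stores, on="store_id", how="left")   # hash join on the GPU
    summary = merged.groupby("region").agg({"revenue": "sum", "units": "mean"})

    print(summary.head())                    # result is still resident in GPU memory
    result = summary.to_pandas()             # convert only at the very end (D2H transfer)
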
cuML, cuGraph & DALI

cuML — GPU Machine Learning

  • scikit-learn-compatible estimator API: .fit(), .predict(), .transform()
  • Algorithms: Linear/Logistic Regression, KNN, SVM, k-Means, DBSCAN, PCA, UMAP, Random Forest; XGBoost is a separate library that integrates closely with RAPIDS
  • Accepts cuDF DataFrames directly — no CPU round-trip
  • Multi-GPU training via cuml.dask — distribute across a GPU cluster (a minimal single-GPU example follows this list)
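
A minimal cuML sketch following the scikit-learn estimator pattern; the synthetic data shape and n_clusters value are arbitrary.

    import cudf
    import cupy as cp
    from cuml.cluster import KMeans

    # Synthetic feature matrix created directly in GPU memory
    X = cudf.DataFrame(cp.random.random((100_000, 8), dtype=cp.float32))

    km = KMeans(n_clusters=5, random_state=0)    # same API shape as sklearn.cluster.KMeans
    km.fit(X)                                    # training runs on the GPU, no CPU round-trip
    labels = km.predict(X)                       # predictions stay GPU-resident
    print(labels[:5])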

cuGraph — GPU Graph Analytics

  • NetworkX-compatible API for graph algorithms on GPU
  • Algorithms: PageRank, BFS/DFS, Louvain community detection, Betweenness Centrality, Jaccard similarity
  • Scales to billions of edges — graphs far beyond what CPU NetworkX can handle in practical time or memory
  • Input: edge lists as cuDF DataFrames (see the sketch after this list)
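
A short cuGraph sketch; the tiny edge list and the column names src and dst are illustrative.

    import cudf
    import cugraph

    edges = cudf.DataFrame({
        "src": [0, 0, 1, 2, 3],
        "dst": [1, 2, 2, 3, 0],
    })

    G = cugraph.Graph(directed=True)
    G.from_cudf_edgelist(edges, source="src", destination="dst")

    pr = cugraph.pagerank(G)                 # cuDF DataFrame with vertex and pagerank columns
    print(pr.sort_values("pagerank", ascending=False).head())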

DALI — Data Loading Library

  • GPU-accelerated data loading, decoding, and augmentation for DL pipelines
  • Moves image/video decode and augmentation from CPU to GPU
  • Integrates with PyTorch DataLoader and TensorFlow tf.data
  • Eliminates CPU preprocessing bottleneck in training pipelines
  • Supports image, video, and audio data types (a pipeline sketch follows this list)
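
A hedged sketch of a minimal DALI image pipeline; the image directory, batch size, and target size are placeholder assumptions.

    from nvidia.dali import pipeline_def, fn

    @pipeline_def(batch_size=64, num_threads=4, device_id=0)
    def image_pipeline():
        # "mixed" decoding reads JPEGs on the CPU and decodes them on the GPU
        jpegs, labels = fn.readers.file(file_root="images/")   # hypothetical directory
        images = fn.decoders.image(jpegs, device="mixed")
        images = fn.resize(images, resize_x=224, resize_y=224)
        return images, labels

    pipe = image_pipeline()
    pipe.build()
    images, labels = pipe.run()              # one preprocessed batch, produced on the GPU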

Environment Setup & Configuration

Knowing how to install, configure, and manage the RAPIDS environment is tested directly on the NCP-ADS exam.

Installation Methods

Conda Installation

  • Recommended for local development: conda install -c rapidsai -c conda-forge -c nvidia rapids=24.06 python=3.11 cuda-version=12.0
  • Creates isolated environment with all RAPIDS libraries and CUDA dependencies
  • Must match CUDA toolkit version to installed GPU driver
  • Use nvidia-smi to check driver version and max supported CUDA
  • RAPIDS release cadence: approximately every 6–8 weeks (e.g., 24.04, 24.06, 24.08)

Docker Containers

  • NVIDIA provides pre-built RAPIDS containers on NGC and Docker Hub
  • Guarantees correct CUDA/driver/library version alignment
  • Requires: NVIDIA Container Toolkit installed on the host (successor to the older nvidia-docker2 package)
  • Run: docker run --gpus all -it rapidsai/base:24.06-cuda12.0-py3.11
  • NGC containers are enterprise-grade, validated, and production-ready

RMM — RAPIDS Memory Manager

RMM Memory Allocation Strategies

  • CudaMemoryResource (default): calls cudaMalloc/cudaFree directly; safe but slow for frequent small allocations
  • PoolMemoryResource: pre-allocates a large pool, sub-allocates from it; dramatically faster for many small allocations
  • ManagedMemoryResource: CUDA Unified Memory — data can spill from GPU to CPU RAM automatically
  • Best practice: use PoolMemoryResource in production workflows for lowest allocation overhead (configuration sketched below)
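
A minimal RMM configuration sketch; the 2 GiB pool size is an example value, and reinitialize must run before the first GPU allocation.

    import rmm
    import cudf

    rmm.reinitialize(
        pool_allocator=True,                 # PoolMemoryResource instead of raw cudaMalloc/cudaFree
        initial_pool_size=2 * 1024**3,       # pre-allocate 2 GiB up front (example value)
    )

    gdf = cudf.DataFrame({"x": range(1_000_000)})   # allocations now come from the pool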

Version Compatibility

  • RAPIDS requires: NVIDIA GPU with CUDA Compute Capability ≥ 7.0 (Volta+)
  • Check: nvidia-smi --query-gpu=compute_cap --format=csv
  • CUDA toolkit version must be ≤ driver's max supported CUDA version
  • Python version: RAPIDS supports Python 3.10 and 3.11
  • cuDF, cuML, and cuGraph must all be the same RAPIDS release version (the snippet below checks versions and compute capability)
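
A quick sanity-check snippet for these requirements; it only prints library versions and the device's compute capability.

    import cudf, cuml, cugraph
    from numba import cuda

    # All three should report the same RAPIDS release (e.g. 24.06.x)
    print(cudf.__version__, cuml.__version__, cugraph.__version__)

    # Compute capability as a (major, minor) tuple; RAPIDS needs (7, 0) or higher
    print(cuda.get_current_device().compute_capability)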

When to Use RAPIDS vs CPU

  • Dataset fits in GPU VRAM → use cuDF/cuML directly
  • Dataset exceeds single GPU VRAM → use Dask-cuDF for multi-GPU (sketched after this list)
  • Dataset is small (<100K rows) → pandas is fine; GPU overhead not worth it
  • Need unsupported operations → use .to_pandas() temporarily, then back to cuDF
  • Production serving → use NIM or Triton for GPU inference
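
When the dataset exceeds a single GPU's VRAM, a hedged Dask-cuDF sketch looks like this; the Parquet glob and column names are hypothetical.

    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client
    import dask_cudf

    cluster = LocalCUDACluster()             # one Dask worker per visible GPU
    client = Client(cluster)

    ddf = dask_cudf.read_parquet("data/*.parquet")                  # partitioned across workers
    result = ddf.groupby("customer_id")["amount"].sum().compute()   # lazy until .compute()
    print(result.head())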

Practice Quiz — GPU Foundations & RAPIDS

10 questions on GPU architecture, RAPIDS library roles, and environment setup.

Memory Hooks

Memorable anchors for the core GPU and RAPIDS concepts.

RAPIDS Stack
"cuDF Cleans, cuML Learns, cuGraph Connects"
Three core libraries: cuDF = DataFrames (data prep), cuML = machine learning, cuGraph = graph analytics. DALI loads images, Dask scales out. Each replaces its CPU equivalent but stays in GPU memory throughout.
🏗️
CUDA Hierarchy
"Thread → Warp (32) → Block → Grid"
Threads group into warps of 32 that execute in lockstep (SIMT). Warps group into blocks sharing Shared Memory. Blocks form a Grid across the whole GPU. Know this hierarchy — it explains parallelism and memory scope.
💾
Memory Hierarchy Speed
"Registers → Shared → L2 → HBM → PCIe (slowest)"
Speed drops dramatically as you go further from the compute core. Shared Memory is ~100× faster than HBM. PCIe (CPU↔GPU) is the biggest bottleneck — keep data on GPU, minimize transfers.
🧩
cuDF vs pandas
"Same API, Different Address Space"
cuDF is pandas in GPU memory. Most method names are identical. The key cost is crossing the boundary: to_pandas() and from_pandas() trigger CPU↔GPU transfers. Stay in cuDF for the whole pipeline.
🏊
RMM Pool
"Pool Once, Allocate Fast"
Default cudaMalloc is slow for repeated small allocations. RMM's PoolMemoryResource pre-allocates a large chunk and sub-allocates from it — orders of magnitude faster for real workloads. Always use pools in production.
🎯
GPU Sweet Spot
"Big Data + Parallel Ops = GPU Wins"
GPU wins when: data is large (>1M rows), operations are parallelizable, and data stays in VRAM. GPU loses when: data is small, operations are sequential, or H2D/D2H transfers dominate the runtime.

Flashcards & Advisor


CUDA Warp
How many threads, and what is SIMT?
32 threads that execute the same instruction simultaneously (Single Instruction, Multiple Threads). If threads diverge (different branches), warps serialize — this is called warp divergence and reduces efficiency.
cuDF vs pandas
Key similarity, key difference, and conversion cost?
Same pandas-compatible API. cuDF data lives in GPU memory (VRAM); pandas lives in CPU RAM. Conversion via to_pandas() / from_pandas() triggers a PCIe transfer — keep data in cuDF throughout the pipeline.
RMM PoolMemoryResource
What problem does it solve?
Solves slow repeated cudaMalloc/cudaFree calls. Pre-allocates a large GPU memory pool at startup and sub-allocates from it — dramatically reduces allocation overhead in workloads with many small allocations.
RAPIDS Minimum GPU Requirement
What Compute Capability is needed?
RAPIDS requires CUDA Compute Capability ≥ 7.0 (Volta architecture and later). Check with: nvidia-smi --query-gpu=compute_cap --format=csv. This means V100, T4, A100, H100 are all supported.
DALI
What does it do and what bottleneck does it solve?
NVIDIA Data Loading Library — moves image/video decoding and augmentation from CPU to GPU. Eliminates the CPU preprocessing bottleneck that causes GPU starvation during deep learning training. Integrates with PyTorch and TensorFlow.
GPU Memory Bandwidth — A100
How does it compare to CPU RAM bandwidth?
A100 HBM2e: ~2 TB/s memory bandwidth. Typical CPU DDR4/5: ~50–100 GB/s. This ~20–40× bandwidth advantage is a primary reason RAPIDS operations on large DataFrames dramatically outperform CPU equivalents.
cuGraph
What is it and what does it replace?
GPU-accelerated graph analytics library. Replaces NetworkX with a compatible API. Runs PageRank, BFS/DFS, Louvain community detection, betweenness centrality on billion-edge graphs. Input: edge lists as cuDF DataFrames.
Pinned Memory
What is it and why does it speed up transfers?
CPU memory marked as page-locked — the OS cannot swap it to disk. This enables direct DMA transfers between CPU RAM and GPU VRAM over PCIe, achieving higher bandwidth than pageable memory transfers.

Study Advisor

CUDA Programming Model

  • Thread = smallest execution unit; Warp = 32 threads executing in lockstep (SIMT)
  • Thread Block = group of warps sharing Shared Memory; up to 1024 threads per block
  • Grid = all blocks for a single kernel launch across the GPU
  • SM (Streaming Multiprocessor) = physical compute unit scheduling warps; A100 has 108 SMs
  • Warp divergence: branches within a warp cause serialization — avoid for peak efficiency
  • Occupancy: ratio of active warps to maximum possible warps per SM; higher = better GPU utilization
