Distributed Pipelines, Dask & Profiling
When data exceeds a single GPU's VRAM, Dask distributes the workload across multiple GPUs. CRISP-DM provides the project methodology, while Nsight and RMM tools help identify and fix pipeline bottlenecks.
Dask + Multi-GPU
- dask_cudf: distributed multi-GPU DataFrames; same cuDF API; automatic partitioning
- LocalCUDACluster: single-node multi-GPU Dask cluster; one worker per GPU
- Partitions: each cuDF partition lives in one GPU's VRAM; workers process partitions independently
- Scale-out: multi-node via RAPIDS UCX + NVLink for GPU-to-GPU communication
- When to use: dataset exceeds single GPU VRAM or want parallelism across GPUs/nodes
CRISP-DM Phases
- 1. Business Understanding: define objectives, success criteria, project plan
- 2. Data Understanding: collect, explore, identify quality issues
- 3. Data Preparation: clean, transform, engineer — most time-consuming (60–80%)
- 4. Modeling: select algorithms, build models, tune hyperparameters
- 5. Evaluation: assess against business goals; readiness for deployment
- 6. Deployment: productionize; monitor performance in production
Profiling Tools
- Nsight Systems: system-level timeline; identifies GPU idle time, PCIe stalls
- Nsight Compute: kernel-level metrics; occupancy, memory throughput, arithmetic intensity
- nvidia-smi: CLI monitor; GPU util%, VRAM usage, temperature, power
- RMM PoolMemoryResource: reduces GPU memory fragmentation from many small allocations
- cuDF spill: `cudf.set_option('spill', True)` — spill GPU data to CPU RAM when full
Tool & Technique Quick Reference
| Tool/Technique | Purpose | Key Output |
|---|---|---|
| dask_cudf | Multi-GPU distributed DataFrames | Partitioned cuDF DataFrames across workers |
| LocalCUDACluster | Single-node multi-GPU cluster setup | One Dask worker per GPU |
| Nsight Systems | System-level GPU/CPU timeline | Identifies bottlenecks: idle GPU, PCIe stalls |
| Nsight Compute | Per-kernel CUDA profiling | Occupancy, memory throughput, arithmetic intensity |
| nvidia-smi | CLI GPU monitor | Utilization, VRAM, temperature, power draw |
| RMM PoolMemoryResource | GPU memory pool | Reduces allocation overhead and fragmentation |
Dask & Multi-GPU Distributed Computing
Dask enables parallel, deferred execution across multiple GPUs. Combined with cuDF via dask_cudf, it extends the RAPIDS ecosystem beyond single-GPU memory limits to multi-GPU and multi-node clusters.
Dask Fundamentals
- Deferred execution: Dask builds a task graph of operations but does not execute until `.compute()` is called (see the sketch after this list)
- Task graph: DAG of operations that the Dask scheduler executes in optimal order across workers
- Lazy evaluation: transformations chain without running — enables optimization before execution
- Partitions: dataset split into chunks; each partition processed independently by one worker
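A minimal sketch of deferred execution and task graphs using plain `dask.delayed` (no GPU required; `load` and `total` are hypothetical stand-ins for pipeline stages):

```python
import dask

@dask.delayed
def load(i):
    # Stand-in for reading one partition of data
    return list(range(i * 3, i * 3 + 3))

@dask.delayed
def total(chunk):
    return sum(chunk)

# Building the graph is instant; nothing has executed yet (lazy evaluation)
parts = [total(load(i)) for i in range(4)]
grand_total = dask.delayed(sum)(parts)

# Only .compute() makes the scheduler walk the DAG and run the tasks
print(grand_total.compute())   # 66
```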
dask_cudf
- Dask + cuDF = distributed multi-GPU DataFrames
- Same cuDF API — groupby, merge, apply all work the same way
- Each partition is a cuDF DataFrame held in one GPU's VRAM
- `dask_cudf.read_parquet('data/*.parquet')` — reads and distributes partitions across GPUs
- Operations that touch one partition: embarrassingly parallel — no inter-GPU communication
- Aggregations spanning partitions: require shuffle (inter-GPU communication)
LocalCUDACluster
- Spins up a multi-GPU Dask cluster on a single node
- Creates one worker per GPU — worker is a separate process with its own GPU context
- Usage: `from dask_cuda import LocalCUDACluster; cluster = LocalCUDACluster()`
- Workers coordinate via the Dask scheduler running on the CPU
- Ideal for single-server multi-GPU setups (e.g., DGX A100 with 8 GPUs)
Dask Client
- `from dask.distributed import Client; client = Client(cluster)` — the Client connects to the Dask cluster and submits tasks
- Provides a dashboard at `client.dashboard_link` — view task progress and worker utilization
- Submit work: call any dask_cudf operation; it queues in the task graph
- Execute: `result = ddf.compute()` — triggers the actual GPU computation (see the end-to-end sketch below)
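Putting the pieces together, a minimal end-to-end sketch (the Parquet glob and column names are placeholders; assumes `dask_cuda`, `dask_cudf`, and at least one CUDA GPU are available):

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

# One Dask worker per visible GPU on this node
cluster = LocalCUDACluster()
client = Client(cluster)
print(client.dashboard_link)        # live view of tasks and worker utilization

# Lazily read Parquet files as cuDF partitions spread across the GPUs
ddf = dask_cudf.read_parquet("data/*.parquet")

# A cross-partition groupby requires a shuffle between GPUs
result = ddf.groupby("store_id")["sales"].mean()

print(result.compute())             # triggers execution of the task graph
client.close(); cluster.close()
```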
Partition Strategy
- Embarrassingly parallel: groupby, map, filter — each partition processed independently on its GPU
- Shuffle-based aggregation: cross-partition groupby requires shuffling data between GPUs — communication overhead
- Partition size: target roughly 1 GB per partition — large enough to amortize per-task scheduling overhead, small enough to leave VRAM headroom for intermediates
- `ddf.repartition(npartitions=N)` — adjust the number of partitions (example after this list)
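A small sketch of adjusting the partition count (the path and the 50 GB figure are hypothetical):

```python
import dask_cudf

ddf = dask_cudf.read_parquet("data/*.parquet")   # hypothetical dataset, roughly 50 GB
# Aim for about 1 GB per partition regardless of how the source files were sized
ddf = ddf.repartition(npartitions=50)
print(ddf.npartitions)
```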
Multi-Node Scaling
- Multi-node via RAPIDS UCX (Unified Communication X framework)
- GPU-to-GPU communication over NVLink (intra-node) or InfiniBand (inter-node)
- UCX bypasses CPU for GPU data transfers — dramatically reduces latency
- Dask + UCX enables petabyte-scale GPU data science across GPU clusters
When to Use Dask vs Single GPU
- Use Dask when dataset exceeds single GPU VRAM (e.g., 500 GB data, 80 GB GPU)
- Use Dask to parallelize across multiple GPUs for faster wall-clock time
- Single GPU cuDF is faster for data that fits — less scheduling overhead
- GPU spilling (managed memory) is an alternative to Dask for modest overflows
CRISP-DM & ETL Design
CRISP-DM is the standard 6-phase data mining methodology. For NCP-ADS, the focus is on how each phase maps to GPU-accelerated RAPIDS tools and ETL design principles for high-throughput pipelines.
Business Understanding
Define project objectives, success criteria, and project plan. Translate business goals into data science problem formulation (classification, regression, clustering). Identify constraints: data availability, latency requirements, deployment environment.
Data Understanding
Collect initial data; explore with EDA (cuDF describe, corr, value_counts). Identify data quality issues: missing values, outliers, inconsistent formats. Document data sources, schemas, and volume — guides GPU infrastructure sizing.
Data Preparation (60–80% of project time)
Most time-consuming phase. Tasks: cleaning nulls, encoding categoricals, normalizing features, engineering new columns, joining datasets. All done in cuDF on GPU. This phase is often repeated iteratively as modeling reveals data issues.
Modeling
Select algorithms (cuML, XGBoost), build models, tune hyperparameters. CRISP-DM is iterative — often cycle back to Data Preparation when modeling reveals feature gaps or quality issues requiring additional cleaning.
Evaluation
Assess model performance against business objectives — not just metric scores. Determine if the model meets success criteria defined in Phase 1. Decision gate: ready for deployment or return to earlier phases?
Deployment
Productionize model: Triton Inference Server, REST API, or batch scoring pipeline. Monitor for data drift and performance degradation in production. CRISP-DM is cyclic — new business questions trigger a new iteration.
Extract
- Use cuIO (cuDF's I/O module) — reads Parquet, CSV, JSON, ORC directly to GPU memory
- Parquet preferred: columnar format; cuDF reads only needed columns; compressed; GPU-accelerated decode
- GDS (GPU Direct Storage): NVMe → GPU without CPU copy; eliminates PCIe bottleneck for large reads
- `cudf.read_parquet('data.parquet', columns=['col1','col2'])` — column pruning at read time (sketch below)
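For example, a sketch of a GPU-side extract step with column pruning plus predicate pushdown over Parquet row groups (column names are hypothetical; assumes a recent cuDF where `read_parquet` accepts `filters`):

```python
import cudf

# Read only two columns, and skip row groups where the predicate cannot match
df = cudf.read_parquet(
    "data.parquet",
    columns=["user_id", "amount"],
    filters=[("amount", ">", 0)],
)
print(len(df), df.dtypes)
```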
Transform
- All transforms in cuDF — filtering, joining, encoding, normalization, aggregation
- Stay on GPU: avoid `.to_pandas()` mid-pipeline — it kills performance
- Chain operations: cuDF supports method chaining; Dask defers execution until `.compute()` (see the transform sketch below)
- Pass results via `__cuda_array_interface__` to DL frameworks (zero-copy)
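A minimal on-GPU transform chain, assuming hypothetical `price` and `category` columns:

```python
import cudf

df = cudf.read_parquet("data.parquet", columns=["price", "category"])
df["price"] = df["price"].fillna(df["price"].median())      # impute nulls on GPU
df = cudf.get_dummies(df, columns=["category"])             # one-hot encode on GPU
df["price_z"] = (df["price"] - df["price"].mean()) / df["price"].std()  # normalize
```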
Load & Caching
- Write output: `df.to_parquet('output.parquet')` — GPU-encoded Parquet
- Pass to cuML: cuDF DataFrames feed directly into cuML estimators
- Pass to DL: via `__cuda_array_interface__` or DLPack — zero-copy tensor creation (sketch below)
- Caching: persist intermediate cuDF DataFrames in GPU memory between pipeline stages
- Pipeline parallelism: use Dask to overlap I/O and compute across GPUs
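One way to hand a cuDF column to a deep learning framework without leaving the GPU is DLPack; a sketch, assuming PyTorch with CUDA support is installed:

```python
import cudf
import torch

df = cudf.DataFrame({"x": [1.0, 2.0, 3.0]})

# DLPack hands the column's GPU buffer to PyTorch: no host round-trip
tensor = torch.utils.dlpack.from_dlpack(df["x"].to_dlpack())
print(tensor.device, tensor)
```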
Profiling & Optimization
GPU profiling reveals where time is actually spent. Nsight Systems exposes system-level bottlenecks while Nsight Compute dives into individual CUDA kernels. RMM and memory management techniques prevent VRAM exhaustion.
Nsight Systems
- System-level GPU profiler — timeline view of CPU, GPU, and memory transfer activity
- Identifies bottlenecks: idle GPU time, PCIe transfer stalls, kernel launch overhead
- Shows overlap (or lack thereof) between CPU work and GPU execution
- Use to answer: "Is my GPU sitting idle while CPU loads data?"
- Output: timeline trace file viewable in Nsight Systems GUI
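When profiling a Python pipeline with Nsight Systems, NVTX ranges make each stage show up as a named span on the timeline. A sketch, assuming the `nvtx` Python package is installed and the script is launched under Nsight Systems (the file paths are placeholders):

```python
import nvtx
import cudf

# Each annotate() range appears as a labeled span on the Nsight Systems timeline
with nvtx.annotate("extract", color="green"):
    df = cudf.read_parquet("data.parquet")

with nvtx.annotate("transform", color="blue"):
    df = df.dropna()

with nvtx.annotate("load", color="red"):
    df.to_parquet("output.parquet")
```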
Nsight Compute
- Kernel-level profiler — measures per-CUDA-kernel performance metrics
- Key metrics: occupancy (fraction of max threads active), memory throughput, arithmetic intensity
- Identifies underutilized kernels: low occupancy = wasted GPU capacity
- Use to answer: "Is this specific cuML kernel memory-bound or compute-bound?"
- More detailed than Nsight Systems but produces more data to analyze
DLProf
- Deep learning profiler that wraps Nsight
- Highlights time spent in forward/backward passes, data loading, and optimizer steps
- Useful specifically for DL training pipeline bottleneck identification
- Reports operator-level GPU time breakdown
nvidia-smi & Interactive Monitors
- `nvidia-smi`: CLI; shows GPU utilization %, memory used/free, temperature, power draw per GPU
- `nvidia-smi dmon`: real-time streaming monitor of GPU metrics
- nvtop / nvitop: interactive htop-style GPU process monitors — show per-process GPU/VRAM usage
- Quickest way to check if GPU is being utilized during a training run
| Bottleneck | Symptom | Fix |
|---|---|---|
| Low GPU utilization | nvidia-smi shows <50% GPU util; GPU idle between batches | Use DALI or async prefetching; overlap data loading and GPU compute |
| High PCIe activity | Nsight shows frequent H2D/D2H transfers; slow wall time | Minimize .to_pandas() calls; keep data on GPU; use GDS for I/O |
| Memory fragmentation | OOM errors despite available VRAM; many small allocs | Use RMM PoolMemoryResource to pre-allocate a memory pool |
| Kernel underutilization | Nsight Compute shows low occupancy; arithmetic intensity low | Increase batch size; restructure data layout; check kernel launch configuration |
| Data loading bottleneck | GPU idle while CPU processes data | Use DALI GPU pipeline; prefetch next batch while current batch trains |
RMM (RAPIDS Memory Manager)
- PoolMemoryResource: pre-allocates a large GPU memory pool; sub-allocates from it — avoids repeated cudaMalloc/cudaFree overhead
- Reduces memory fragmentation from many small GPU allocations in cuDF/cuML
- Setup: `rmm.reinitialize(pool_allocator=True, initial_pool_size=2**30)` (sketch below)
- ManagedMemoryResource: enables automatic GPU→CPU spill via CUDA Unified Memory
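A short sketch of both setups (the 4 GiB pool size is an arbitrary example; call `reinitialize` before creating any cuDF objects):

```python
import numpy as np
import rmm
import cudf

# Option 1: pre-allocate a pool and sub-allocate from it (fast, less fragmentation)
rmm.reinitialize(pool_allocator=True, initial_pool_size=4 * 2**30)

# Option 2 (alternative): CUDA Unified Memory, which can oversubscribe VRAM
# rmm.reinitialize(managed_memory=True)

df = cudf.DataFrame({"x": np.arange(1_000_000)})  # device memory now comes from the RMM pool
```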
GPU Spilling
- `cudf.set_option('spill', True)` — enables automatic spilling of cuDF buffers to CPU RAM when GPU memory is full
- Related approach: CUDA managed memory (UVM) via RMM — pages migrate between GPU and CPU automatically
- Performance cost: spilled data accessed at PCIe speed (~32 GB/s vs HBM ~2 TB/s)
- Useful when dataset slightly exceeds VRAM; severe spilling → use Dask instead
Memory Monitoring
- `cupy.get_default_memory_pool().used_bytes()` — current GPU memory pool usage
- `cupy.get_default_memory_pool().free_all_blocks()` — release cached GPU memory
- `del df; import gc; gc.collect()` — release Python references and trigger garbage collection
- `nvidia-smi` memory column: tracks actual VRAM allocation from the OS perspective
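A quick monitoring snippet using the CuPy memory pool (reported sizes will vary by workload):

```python
import gc
import cupy as cp

pool = cp.get_default_memory_pool()
print(f"pool in use: {pool.used_bytes() / 2**20:.1f} MiB")

# Drop Python references first, then return cached blocks to the driver
gc.collect()
pool.free_all_blocks()
print(f"pool in use after cleanup: {pool.used_bytes() / 2**20:.1f} MiB")
```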
Dask Multi-GPU Key Points
- `dask_cudf`: distributed multi-GPU DataFrames; same cuDF API; each partition = one cuDF DataFrame on one GPU
- `LocalCUDACluster()`: single-node multi-GPU cluster; one worker per GPU
- `Client(cluster)`: connects to the cluster; submit tasks via dask_cudf operations
- `dask_cudf.read_parquet('*.parquet')`: distributes Parquet files as partitions across GPUs
- Embarrassingly parallel: filter, map, column transform — no inter-GPU comms
- Shuffle-based: cross-partition groupby — requires inter-GPU communication; more expensive
- When to use: dataset exceeds single GPU VRAM; want multi-GPU parallelism