Distributed Pipelines, Dask & Profiling
When data exceeds a single GPU's VRAM, Dask distributes the workload across multiple GPUs. CRISP-DM provides the project methodology, while Nsight and RMM tools help identify and fix pipeline bottlenecks.
Dask + Multi-GPU
- dask_cudf: distributed multi-GPU DataFrames; same cuDF API; automatic partitioning
- LocalCUDACluster: single-node multi-GPU Dask cluster; one worker per GPU
- Partitions: each cuDF partition lives in one GPU's VRAM; workers process partitions independently
- Scale-out: multi-node via RAPIDS UCX + NVLink for GPU-to-GPU communication
- When to use: dataset exceeds single GPU VRAM or want parallelism across GPUs/nodes
CRISP-DM Phases
- 1. Business Understanding: define objectives, success criteria, project plan
- 2. Data Understanding: collect, explore, identify quality issues
- 3. Data Preparation: clean, transform, engineer — most time-consuming (60–80%)
- 4. Modeling: select algorithms, build models, tune hyperparameters
- 5. Evaluation: assess against business goals; readiness for deployment
- 6. Deployment: productionize; monitor performance in production
Profiling Tools
- Nsight Systems: system-level timeline; identifies GPU idle time, PCIe stalls
- Nsight Compute: kernel-level metrics; occupancy, memory throughput, arithmetic intensity
- nvidia-smi: CLI monitor; GPU util%, VRAM usage, temperature, power
- RMM PoolMemoryResource: reduces GPU memory fragmentation from many small allocations
- cuDF spill: `cudf.set_option('spill', True)` — spill GPU data to CPU RAM when full
Tool & Technique Quick Reference
| Tool/Technique | Purpose | Key Output |
|---|---|---|
| dask_cudf | Multi-GPU distributed DataFrames | Partitioned cuDF DataFrames across workers |
| LocalCUDACluster | Single-node multi-GPU cluster setup | One Dask worker per GPU |
| Nsight Systems | System-level GPU/CPU timeline | Identifies bottlenecks: idle GPU, PCIe stalls |
| Nsight Compute | Per-kernel CUDA profiling | Occupancy, memory throughput, arithmetic intensity |
| nvidia-smi | CLI GPU monitor | Utilization, VRAM, temperature, power draw |
| RMM PoolMemoryResource | GPU memory pool | Reduces allocation overhead and fragmentation |
Dask & Multi-GPU Distributed Computing
Dask enables parallel, deferred execution across multiple GPUs. Combined with cuDF via dask_cudf, it extends the RAPIDS ecosystem beyond single-GPU memory limits to multi-GPU and multi-node clusters.
Dask Fundamentals
- Deferred execution: Dask builds a task graph of operations but does not execute until `.compute()` is called (see the sketch after this list)
- Task graph: DAG of operations that the Dask scheduler executes in optimal order across workers
- Lazy evaluation: transformations chain without running — enables optimization before execution
- Partitions: dataset split into chunks; each partition processed independently by one worker
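A minimal sketch of deferred execution and task graphs using plain `dask.delayed` (no GPU required; `load` and `total` are hypothetical stand-ins for pipeline stages):

```python
import dask

@dask.delayed
def load(i):
    # Stand-in for reading one partition of data
    return list(range(i * 3, i * 3 + 3))

@dask.delayed
def total(chunk):
    return sum(chunk)

# Building the graph is instant; nothing has executed yet (lazy evaluation)
parts = [total(load(i)) for i in range(4)]
grand_total = dask.delayed(sum)(parts)

# Only .compute() makes the scheduler walk the DAG and run the tasks
print(grand_total.compute())   # 66
```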
dask_cudf
- Dask + cuDF = distributed multi-GPU DataFrames
- Same cuDF API — groupby, merge, apply all work the same way
- Each partition is a cuDF DataFrame held in one GPU's VRAM
- `dask_cudf.read_parquet('data/*.parquet')` — reads and distributes partitions across GPUs
- Operations that touch one partition: embarrassingly parallel — no inter-GPU communication
- Aggregations spanning partitions: require shuffle (inter-GPU communication)
LocalCUDACluster
- Spins up a multi-GPU Dask cluster on a single node
- Creates one worker per GPU — worker is a separate process with its own GPU context
- Usage: `from dask_cuda import LocalCUDACluster; cluster = LocalCUDACluster()`
- Workers coordinate via the Dask scheduler running on the CPU
- Ideal for single-server multi-GPU setups (e.g., DGX A100 with 8 GPUs)
Dask Client
- `from dask.distributed import Client; client = Client(cluster)` — the Client connects to the Dask cluster and submits tasks
- Provides a dashboard at `client.dashboard_link` — view task progress and worker utilization
- Submit work: call any dask_cudf operation; it queues in the task graph
- Execute: `result = ddf.compute()` — triggers the actual GPU computation (see the end-to-end sketch below)
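Putting the pieces together, a minimal end-to-end sketch (the Parquet glob and column names are placeholders; assumes `dask_cuda`, `dask_cudf`, and at least one CUDA GPU are available):

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

# One Dask worker per visible GPU on this node
cluster = LocalCUDACluster()
client = Client(cluster)
print(client.dashboard_link)        # live view of tasks and worker utilization

# Lazily read Parquet files as cuDF partitions spread across the GPUs
ddf = dask_cudf.read_parquet("data/*.parquet")

# A cross-partition groupby requires a shuffle between GPUs
result = ddf.groupby("store_id")["sales"].mean()

print(result.compute())             # triggers execution of the task graph
client.close(); cluster.close()
```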
Partition Strategy
- Embarrassingly parallel: groupby, map, filter — each partition processed independently on its GPU
- Shuffle-based aggregation: cross-partition groupby requires shuffling data between GPUs — communication overhead
- Partition size: target roughly 1 GB per partition — large enough to amortize per-task scheduling overhead, small enough to leave VRAM headroom for intermediates
- `ddf.repartition(npartitions=N)` — adjust the number of partitions (example after this list)
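A small sketch of adjusting the partition count (the path and the 50 GB figure are hypothetical):

```python
import dask_cudf

ddf = dask_cudf.read_parquet("data/*.parquet")   # hypothetical dataset, roughly 50 GB
# Aim for about 1 GB per partition regardless of how the source files were sized
ddf = ddf.repartition(npartitions=50)
print(ddf.npartitions)
```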
Multi-Node Scaling
- Multi-node via RAPIDS UCX (Unified Communication X framework)
- GPU-to-GPU communication over NVLink (intra-node) or InfiniBand (inter-node)
- UCX bypasses CPU for GPU data transfers — dramatically reduces latency
- Dask + UCX enables petabyte-scale GPU data science across GPU clusters
When to Use Dask vs Single GPU
- Use Dask when dataset exceeds single GPU VRAM (e.g., 500 GB data, 80 GB GPU)
- Use Dask to parallelize across multiple GPUs for faster wall-clock time
- Single GPU cuDF is faster for data that fits — less scheduling overhead
- GPU spilling (managed memory) is an alternative to Dask for modest overflows
CRISP-DM & ETL Design
CRISP-DM is the standard 6-phase data mining methodology. For NCP-ADS, the focus is on how each phase maps to GPU-accelerated RAPIDS tools and ETL design principles for high-throughput pipelines.
Business Understanding
Define project objectives, success criteria, and project plan. Translate business goals into data science problem formulation (classification, regression, clustering). Identify constraints: data availability, latency requirements, deployment environment.
Data Understanding
Collect initial data; explore with EDA (cuDF describe, corr, value_counts). Identify data quality issues: missing values, outliers, inconsistent formats. Document data sources, schemas, and volume — guides GPU infrastructure sizing.
Data Preparation (60–80% of project time)
Most time-consuming phase. Tasks: cleaning nulls, encoding categoricals, normalizing features, engineering new columns, joining datasets. All done in cuDF on GPU. This phase is often repeated iteratively as modeling reveals data issues.
Modeling
Select algorithms (cuML, XGBoost), build models, tune hyperparameters. CRISP-DM is iterative — often cycle back to Data Preparation when modeling reveals feature gaps or quality issues requiring additional cleaning.
Evaluation
Assess model performance against business objectives — not just metric scores. Determine if the model meets success criteria defined in Phase 1. Decision gate: ready for deployment or return to earlier phases?
Deployment
Productionize model: Triton Inference Server, REST API, or batch scoring pipeline. Monitor for data drift and performance degradation in production. CRISP-DM is cyclic — new business questions trigger a new iteration.
Extract
- Use cuIO (cuDF's I/O module) — reads Parquet, CSV, JSON, ORC directly to GPU memory
- Parquet preferred: columnar format; cuDF reads only needed columns; compressed; GPU-accelerated decode
- GDS (GPU Direct Storage): NVMe → GPU without CPU copy; eliminates PCIe bottleneck for large reads
- `cudf.read_parquet('data.parquet', columns=['col1','col2'])` — column pruning at read time (sketch below)
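For example, a sketch of a GPU-side extract step with column pruning plus predicate pushdown over Parquet row groups (column names are hypothetical; assumes a recent cuDF where `read_parquet` accepts `filters`):

```python
import cudf

# Read only two columns, and skip row groups where the predicate cannot match
df = cudf.read_parquet(
    "data.parquet",
    columns=["user_id", "amount"],
    filters=[("amount", ">", 0)],
)
print(len(df), df.dtypes)
```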
Transform
- All transforms in cuDF — filtering, joining, encoding, normalization, aggregation
- Stay on GPU: avoid `.to_pandas()` mid-pipeline — it kills performance
- Chain operations: cuDF supports method chaining; Dask defers execution until `.compute()` (see the transform sketch below)
- Pass results via `__cuda_array_interface__` to DL frameworks (zero-copy)
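A minimal on-GPU transform chain, assuming hypothetical `price` and `category` columns:

```python
import cudf

df = cudf.read_parquet("data.parquet", columns=["price", "category"])
df["price"] = df["price"].fillna(df["price"].median())      # impute nulls on GPU
df = cudf.get_dummies(df, columns=["category"])             # one-hot encode on GPU
df["price_z"] = (df["price"] - df["price"].mean()) / df["price"].std()  # normalize
```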
Load & Caching
- Write output: `df.to_parquet('output.parquet')` — GPU-encoded Parquet
- Pass to cuML: cuDF DataFrames feed directly into cuML estimators
- Pass to DL: via `__cuda_array_interface__` or DLPack — zero-copy tensor creation (sketch below)
- Caching: persist intermediate cuDF DataFrames in GPU memory between pipeline stages
- Pipeline parallelism: use Dask to overlap I/O and compute across GPUs
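One way to hand a cuDF column to a deep learning framework without leaving the GPU is DLPack; a sketch, assuming PyTorch with CUDA support is installed:

```python
import cudf
import torch

df = cudf.DataFrame({"x": [1.0, 2.0, 3.0]})

# DLPack hands the column's GPU buffer to PyTorch: no host round-trip
tensor = torch.utils.dlpack.from_dlpack(df["x"].to_dlpack())
print(tensor.device, tensor)
```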
Profiling & Optimization
GPU profiling reveals where time is actually spent. Nsight Systems exposes system-level bottlenecks while Nsight Compute dives into individual CUDA kernels. RMM and memory management techniques prevent VRAM exhaustion.
Nsight Systems
- System-level GPU profiler — timeline view of CPU, GPU, and memory transfer activity
- Identifies bottlenecks: idle GPU time, PCIe transfer stalls, kernel launch overhead
- Shows overlap (or lack thereof) between CPU work and GPU execution
- Use to answer: "Is my GPU sitting idle while CPU loads data?"
- Output: timeline trace file viewable in Nsight Systems GUI
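When profiling a Python pipeline with Nsight Systems, NVTX ranges make each stage show up as a named span on the timeline. A sketch, assuming the `nvtx` Python package is installed and the script is launched under Nsight Systems (the file paths are placeholders):

```python
import nvtx
import cudf

# Each annotate() range appears as a labeled span on the Nsight Systems timeline
with nvtx.annotate("extract", color="green"):
    df = cudf.read_parquet("data.parquet")

with nvtx.annotate("transform", color="blue"):
    df = df.dropna()

with nvtx.annotate("load", color="red"):
    df.to_parquet("output.parquet")
```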
Nsight Compute
- Kernel-level profiler — measures per-CUDA-kernel performance metrics
- Key metrics: occupancy (fraction of max threads active), memory throughput, arithmetic intensity
- Identifies underutilized kernels: low occupancy = wasted GPU capacity
- Use to answer: "Is this specific cuML kernel memory-bound or compute-bound?"
- More detailed than Nsight Systems but produces more data to analyze
DLProf
- Deep learning profiler that wraps Nsight
- Highlights time spent in forward/backward passes, data loading, and optimizer steps
- Useful specifically for DL training pipeline bottleneck identification
- Reports operator-level GPU time breakdown
nvidia-smi & Interactive Monitors
- `nvidia-smi`: CLI; shows GPU utilization %, memory used/free, temperature, power draw per GPU
- `nvidia-smi dmon`: real-time streaming monitor of GPU metrics
- nvtop / nvitop: interactive htop-style GPU process monitors — show per-process GPU/VRAM usage
- Quickest way to check if GPU is being utilized during a training run
| Bottleneck | Symptom | Fix |
|---|---|---|
| Low GPU utilization | nvidia-smi shows <50% GPU util; GPU idle between batches | Use DALI or async prefetching; overlap data loading and GPU compute |
| High PCIe activity | Nsight shows frequent H2D/D2H transfers; slow wall time | Minimize .to_pandas() calls; keep data on GPU; use GDS for I/O |
| Memory fragmentation | OOM errors despite available VRAM; many small allocs | Use RMM PoolMemoryResource to pre-allocate a memory pool |
| Kernel underutilization | Nsight Compute shows low occupancy; arithmetic intensity low | Increase batch size; restructure data layout; check kernel launch configuration |
| Data loading bottleneck | GPU idle while CPU processes data | Use DALI GPU pipeline; prefetch next batch while current batch trains |
RMM (RAPIDS Memory Manager)
- PoolMemoryResource: pre-allocates a large GPU memory pool; sub-allocates from it — avoids repeated cudaMalloc/cudaFree overhead
- Reduces memory fragmentation from many small GPU allocations in cuDF/cuML
- Setup: `rmm.reinitialize(pool_allocator=True, initial_pool_size=2**30)` (sketch below)
- ManagedMemoryResource: enables automatic GPU→CPU spill via CUDA Unified Memory
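A short sketch of both setups (the 4 GiB pool size is an arbitrary example; call `reinitialize` before creating any cuDF objects):

```python
import numpy as np
import rmm
import cudf

# Option 1: pre-allocate a pool and sub-allocate from it (fast, less fragmentation)
rmm.reinitialize(pool_allocator=True, initial_pool_size=4 * 2**30)

# Option 2 (alternative): CUDA Unified Memory, which can oversubscribe VRAM
# rmm.reinitialize(managed_memory=True)

df = cudf.DataFrame({"x": np.arange(1_000_000)})  # device memory now comes from the RMM pool
```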
GPU Spilling
- `cudf.set_option('spill', True)` — enables automatic spilling of cuDF buffers to CPU RAM when GPU memory is full
- Related approach: CUDA managed memory (UVM) via RMM — pages migrate between GPU and CPU automatically
- Performance cost: spilled data accessed at PCIe speed (~32 GB/s vs HBM ~2 TB/s)
- Useful when dataset slightly exceeds VRAM; severe spilling → use Dask instead
Memory Monitoring
- `cupy.get_default_memory_pool().used_bytes()` — current GPU memory pool usage
- `cupy.get_default_memory_pool().free_all_blocks()` — release cached GPU memory
- `del df; import gc; gc.collect()` — release Python references and trigger garbage collection
- `nvidia-smi` memory column: tracks actual VRAM allocation from the OS perspective
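A quick monitoring snippet using the CuPy memory pool (reported sizes will vary by workload):

```python
import gc
import cupy as cp

pool = cp.get_default_memory_pool()
print(f"pool in use: {pool.used_bytes() / 2**20:.1f} MiB")

# Drop Python references first, then return cached blocks to the driver
gc.collect()
pool.free_all_blocks()
print(f"pool in use after cleanup: {pool.used_bytes() / 2**20:.1f} MiB")
```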
Dask Multi-GPU Key Points
- `dask_cudf`: distributed multi-GPU DataFrames; same cuDF API; each partition = one cuDF DataFrame on one GPU
- `LocalCUDACluster()`: single-node multi-GPU cluster; one worker per GPU
- `Client(cluster)`: connects to the cluster; submit tasks via dask_cudf operations
- `dask_cudf.read_parquet('*.parquet')`: distributes Parquet files as partitions across GPUs
- Embarrassingly parallel: filter, map, column transform — no inter-GPU comms
- Shuffle-based: cross-partition groupby — requires inter-GPU communication; more expensive
- When to use: dataset exceeds single GPU VRAM; want multi-GPU parallelism