FlashGenius
NCP-ADS Exam Prep · Topic 5

Distributed Pipelines, Dask & Profiling

Multi-GPU Dask · CRISP-DM · ETL Design · Nsight Profiling · Memory Management


When data exceeds a single GPU's VRAM, Dask distributes the workload across multiple GPUs. CRISP-DM provides the project methodology, while Nsight and RMM tools help identify and fix pipeline bottlenecks.

Dask + Multi-GPU

  • dask_cudf: distributed multi-GPU DataFrames; same cuDF API; automatic partitioning
  • LocalCUDACluster: single-node multi-GPU Dask cluster; one worker per GPU
  • Partitions: each cuDF partition lives in one GPU's VRAM; workers process partitions independently
  • Scale-out: multi-node via RAPIDS UCX + NVLink for GPU-to-GPU communication
  • When to use: dataset exceeds single GPU VRAM or want parallelism across GPUs/nodes

CRISP-DM Phases

  • 1. Business Understanding: define objectives, success criteria, project plan
  • 2. Data Understanding: collect, explore, identify quality issues
  • 3. Data Preparation: clean, transform, engineer — most time-consuming (60–80%)
  • 4. Modeling: select algorithms, build models, tune hyperparameters
  • 5. Evaluation: assess against business goals; readiness for deployment
  • 6. Deployment: productionize; monitor performance in production

Profiling Tools

  • Nsight Systems: system-level timeline; identifies GPU idle time, PCIe stalls
  • Nsight Compute: kernel-level metrics; occupancy, memory throughput, arithmetic intensity
  • nvidia-smi: CLI monitor; GPU util%, VRAM usage, temperature, power
  • RMM PoolMemoryResource: reduces GPU memory fragmentation from many small allocations
  • cuDF spill: cudf.set_option('spill', True) — spills GPU data to CPU RAM when VRAM fills

Tool & Technique Quick Reference

Tool/Technique | Purpose | Key Output
dask_cudf | Multi-GPU distributed DataFrames | Partitioned cuDF DataFrames across workers
LocalCUDACluster | Single-node multi-GPU cluster setup | One Dask worker per GPU
Nsight Systems | System-level GPU/CPU timeline | Identifies bottlenecks: idle GPU, PCIe stalls
Nsight Compute | Per-kernel CUDA profiling | Occupancy, memory throughput, arithmetic intensity
nvidia-smi | CLI GPU monitor | Utilization, VRAM, temperature, power draw
RMM PoolMemoryResource | GPU memory pool | Reduces allocation overhead and fragmentation

Dask & Multi-GPU Distributed Computing

Dask enables parallel, deferred execution across multiple GPUs. Combined with cuDF via dask_cudf, it extends the RAPIDS ecosystem beyond single-GPU memory limits to multi-GPU and multi-node clusters.

Core Concepts

Dask Fundamentals

  • Deferred execution: Dask builds a task graph of operations but does not execute until .compute() is called
  • Task graph: DAG of operations that the Dask scheduler executes in an optimized order across workers
  • Lazy evaluation: transformations chain without running — enables optimization before execution
  • Partitions: dataset split into chunks; each partition processed independently by one worker

dask_cudf

  • Dask + cuDF = distributed multi-GPU DataFrames
  • Same cuDF API — groupby, merge, apply all work the same way
  • Each partition is a cuDF DataFrame held in one GPU's VRAM
  • dask_cudf.read_parquet('data/*.parquet') — reads and distributes partitions across GPUs
  • Operations that touch one partition: embarrassingly parallel — no inter-GPU communication
  • Aggregations spanning partitions: require shuffle (inter-GPU communication)

Cluster Setup & Client

LocalCUDACluster

  • Spins up a multi-GPU Dask cluster on a single node
  • Creates one worker per GPU — worker is a separate process with its own GPU context
  • Usage: from dask_cuda import LocalCUDACluster; cluster = LocalCUDACluster()
  • Workers coordinate via Dask scheduler running on CPU
  • Ideal for single-server multi-GPU setups (e.g., DGX A100 with 8 GPUs)

Dask Client

  • from dask.distributed import Client; client = Client(cluster)
  • Client connects to the Dask cluster and submits tasks
  • Provides dashboard at client.dashboard_link — view task progress, worker utilization
  • Submit work: call any dask_cudf operation; it queues in the task graph
  • Execute: result = ddf.compute() — triggers actual GPU computation
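
Putting the pieces together, a minimal sketch — assuming a multi-GPU node, hypothetical data/*.parquet files, and hypothetical columns key and value:

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

# One Dask worker process per visible GPU on this node
cluster = LocalCUDACluster()
client = Client(cluster)
print(client.dashboard_link)  # dashboard: task progress, worker utilization

# Lazy read: partitions are distributed across the GPU workers
ddf = dask_cudf.read_parquet("data/*.parquet")

# Builds the task graph only — nothing executes yet
agg = ddf.groupby("key")["value"].mean()

# .compute() triggers the actual GPU computation and gathers the result
result = agg.compute()
```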

Partition Strategy

  • Embarrassingly parallel: map, filter, column transforms — each partition processed independently on its GPU
  • Shuffle-based aggregation: cross-partition groupby requires shuffling data between GPUs — communication overhead
  • Partition size: target ~1 GB per partition — large enough to amortize scheduling overhead, small enough to leave VRAM headroom for intermediates
  • ddf.repartition(npartitions=N) — adjust number of partitions
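
A quick sketch of both repartitioning styles (partition_size takes a byte-size string in standard Dask repartitioning and is an approximate target):

```python
# Fix the partition count explicitly
ddf = ddf.repartition(npartitions=16)

# Or target an approximate byte size per partition
ddf = ddf.repartition(partition_size="1GB")
```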

Scale-Out & When to Use Dask

Multi-Node Scaling

  • Multi-node via RAPIDS UCX (Unified Communication X framework)
  • GPU-to-GPU communication over NVLink (intra-node) or InfiniBand (inter-node)
  • UCX bypasses CPU for GPU data transfers — dramatically reduces latency
  • Dask + UCX enables petabyte-scale GPU data science across GPU clusters

When to Use Dask vs Single GPU

  • Use Dask when dataset exceeds single GPU VRAM (e.g., 500 GB data, 80 GB GPU)
  • Use Dask to parallelize across multiple GPUs for faster wall-clock time
  • Single GPU cuDF is faster for data that fits — less scheduling overhead
  • GPU spilling (managed memory) is an alternative to Dask for modest overflows

CRISP-DM & ETL Design

CRISP-DM is the standard 6-phase data mining methodology. For NCP-ADS, the focus is on how each phase maps to GPU-accelerated RAPIDS tools and ETL design principles for high-throughput pipelines.

CRISP-DM Phases

1. Business Understanding

Define project objectives, success criteria, and project plan. Translate business goals into data science problem formulation (classification, regression, clustering). Identify constraints: data availability, latency requirements, deployment environment.

2. Data Understanding

Collect initial data; explore with EDA (cuDF describe, corr, value_counts). Identify data quality issues: missing values, outliers, inconsistent formats. Document data sources, schemas, and volume — guides GPU infrastructure sizing.

3. Data Preparation (60–80% of project time)

Most time-consuming phase. Tasks: cleaning nulls, encoding categoricals, normalizing features, engineering new columns, joining datasets. All done in cuDF on GPU. This phase is often repeated iteratively as modeling reveals data issues.

4. Modeling

Select algorithms (cuML, XGBoost), build models, tune hyperparameters. CRISP-DM is iterative — often cycle back to Data Preparation when modeling reveals feature gaps or quality issues requiring additional cleaning.

5. Evaluation

Assess model performance against business objectives — not just metric scores. Determine if the model meets success criteria defined in Phase 1. Decision gate: ready for deployment or return to earlier phases?

6. Deployment

Productionize model: Triton Inference Server, REST API, or batch scoring pipeline. Monitor for data drift and performance degradation in production. CRISP-DM is cyclic — new business questions trigger a new iteration.

ETL Design Principles for GPU

Extract

  • Use cuIO (cuDF's I/O module) — reads Parquet, CSV, JSON, ORC directly to GPU memory
  • Parquet preferred: columnar format; cuDF reads only needed columns; compressed; GPU-accelerated decode
  • GDS (GPUDirect Storage): DMA directly from NVMe to GPU memory, skipping the CPU bounce buffer — removes the host-memory copy for large reads
  • cudf.read_parquet('data.parquet', columns=['col1','col2']) — column pruning at read time
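
A sketch of a pruned, filtered read — the filters argument takes pyarrow-style predicates for row-group pruning, and the predicate shown is hypothetical:

```python
import cudf

# Column pruning: only col1 and col2 are decoded on the GPU.
# filters skips row groups whose statistics cannot match the predicate.
df = cudf.read_parquet(
    "data.parquet",
    columns=["col1", "col2"],
    filters=[("col1", ">", 0)],
)
```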

Transform

  • All transforms in cuDF — filtering, joining, encoding, normalization, aggregation
  • Stay on GPU: avoid .to_pandas() mid-pipeline — kills performance
  • Chain operations: cuDF supports method chaining; Dask defers until .compute()
  • Pass results via __cuda_array_interface__ to DL frameworks (zero-copy)
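
A small sketch of an all-GPU transform sequence (file and column names are hypothetical):

```python
import cudf

df = cudf.read_parquet("events.parquet")

# Every step below runs on the GPU — no .to_pandas() round-trip
df = df[df["amount"] > 0]                 # filter rows
df["amount_usd"] = df["amount"] * 0.01    # engineer a feature
out = df.groupby("user_id").agg({"amount_usd": "mean"})
```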

Load & Caching

  • Write output: df.to_parquet('output.parquet') — GPU-encoded Parquet
  • Pass to cuML: cuDF DataFrames feed directly into cuML estimators
  • Pass to DL: via __cuda_array_interface__ or DLPack — zero-copy tensor creation
  • Caching: persist intermediate cuDF DataFrames in GPU memory between pipeline stages
  • Pipeline parallelism: use Dask to overlap I/O and compute across GPUs
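
A sketch of the zero-copy handoff for a null-free numeric column — cuDF exposes __cuda_array_interface__, which CuPy consumes directly, and PyTorch (>= 1.10) consumes DLPack:

```python
import cudf
import torch

df = cudf.DataFrame({"x": [1.0, 2.0, 3.0]})

# .values returns a CuPy array viewing the same GPU buffer (no copy)
arr = df["x"].values

# DLPack hands the same memory to PyTorch — still no copy
tensor = torch.from_dlpack(arr)
```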

Profiling & Optimization

GPU profiling reveals where time is actually spent. Nsight Systems exposes system-level bottlenecks while Nsight Compute dives into individual CUDA kernels. RMM and memory management techniques prevent VRAM exhaustion.

Profiling Tools

Nsight Systems

  • System-level GPU profiler — timeline view of CPU, GPU, and memory transfer activity
  • Identifies bottlenecks: idle GPU time, PCIe transfer stalls, kernel launch overhead
  • Shows overlap (or lack thereof) between CPU work and GPU execution
  • Use to answer: "Is my GPU sitting idle while CPU loads data?"
  • Output: timeline trace file viewable in Nsight Systems GUI
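
Annotating pipeline stages with NVTX ranges makes them visible as named spans on the Nsight Systems timeline — a sketch assuming the nvtx Python package, with hypothetical stage functions:

```python
import nvtx

with nvtx.annotate("extract", color="blue"):
    df = load_data()            # hypothetical I/O stage

with nvtx.annotate("transform", color="green"):
    df = clean_and_encode(df)   # hypothetical cuDF transform stage
```

Running the script under nsys profile python script.py then shows each named range alongside CPU, GPU, and transfer activity.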

Nsight Compute

  • Kernel-level profiler — measures per-CUDA-kernel performance metrics
  • Key metrics: occupancy (ratio of active warps to the hardware maximum), memory throughput, arithmetic intensity
  • Identifies underutilized kernels: low occupancy = wasted GPU capacity
  • Use to answer: "Is this specific cuML kernel memory-bound or compute-bound?"
  • More detailed than Nsight Systems but produces more data to analyze

DLProf

  • Deep learning profiler that wraps Nsight
  • Highlights time spent in forward/backward passes, data loading, and optimizer steps
  • Useful specifically for DL training pipeline bottleneck identification
  • Reports operator-level GPU time breakdown

nvidia-smi & Interactive Monitors

  • nvidia-smi: CLI; shows GPU utilization %, memory used/free, temperature, power draw per GPU
  • nvidia-smi dmon: real-time streaming monitor of GPU metrics
  • nvtop / nvitop: interactive htop-style GPU process monitors — shows per-process GPU/VRAM usage
  • Quickest way to check if GPU is being utilized during a training run

Common Bottlenecks & Fixes

Bottleneck | Symptom | Fix
Low GPU utilization | nvidia-smi shows <50% GPU util; GPU idle between batches | Use DALI or async prefetching; overlap data loading and GPU compute
High PCIe activity | Nsight shows frequent H2D/D2H transfers; slow wall time | Minimize .to_pandas() calls; keep data on GPU; use GDS for I/O
Memory fragmentation | OOM errors despite available VRAM; many small allocs | Use RMM PoolMemoryResource to pre-allocate a memory pool
Kernel underutilization | Nsight Compute shows low occupancy; low arithmetic intensity | Increase batch size; restructure data layout; check kernel launch configuration
Data loading bottleneck | GPU idle while CPU processes data | Use DALI GPU pipeline; prefetch next batch while current batch trains

GPU Memory Management

RMM (RAPIDS Memory Manager)

  • PoolMemoryResource: pre-allocates a large GPU memory pool; sub-allocates from it — avoids repeated cudaMalloc/cudaFree overhead
  • Reduces memory fragmentation from many small GPU allocations in cuDF/cuML
  • Setup: rmm.reinitialize(pool_allocator=True, initial_pool_size=2**30)
  • ManagedMemoryResource: enables automatic GPU→CPU spill via CUDA Unified Memory
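
A minimal setup sketch (the 1 GiB pool size mirrors the example above and is arbitrary):

```python
import rmm

# Pre-allocate a 1 GiB pool; cuDF/cuML sub-allocate from it instead of
# issuing a cudaMalloc/cudaFree pair per allocation
rmm.reinitialize(pool_allocator=True, initial_pool_size=2**30)

# Alternative: CUDA managed (unified) memory, letting allocations
# oversubscribe VRAM and page between GPU and CPU
rmm.reinitialize(managed_memory=True)
```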

GPU Spilling

  • cudf.set_option('spill', True) — enables automatic spilling to CPU RAM when GPU memory is full
  • cuDF's spill manager moves whole buffers between GPU and host RAM; this is distinct from CUDA managed memory (UVM), which is enabled separately via RMM and migrates pages automatically
  • Performance cost: spilled data accessed at PCIe speed (~32 GB/s vs HBM ~2 TB/s)
  • Useful when dataset slightly exceeds VRAM; severe spilling → use Dask instead

Memory Monitoring

  • cupy.get_default_memory_pool().used_bytes() — current GPU memory pool usage
  • cupy.get_default_memory_pool().free_all_blocks() — release cached GPU memory
  • del df; import gc; gc.collect() — release Python references and trigger garbage collection
  • nvidia-smi memory column: tracks actual VRAM allocation from OS perspective
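
Combined into a small cleanup routine (assumes df is a cuDF DataFrame that is no longer needed):

```python
import gc
import cupy

pool = cupy.get_default_memory_pool()
print(f"pool in use: {pool.used_bytes() / 1e9:.2f} GB")

del df                   # drop the Python reference (df defined earlier)
gc.collect()             # collect cycles so underlying buffers are freed
pool.free_all_blocks()   # return cached blocks to the CUDA driver
```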

Practice Quiz — Distributed Pipelines & Profiling

10 questions covering Dask multi-GPU, CRISP-DM methodology, ETL design, and GPU profiling tools.

Memory Hooks

Six mnemonic devices to lock in the most exam-critical Dask, CRISP-DM, and profiling concepts.

🔀
Dask Partitions
"One Partition, One GPU, One Worker"
Each dask_cudf partition is a cuDF DataFrame that lives in one GPU's VRAM. LocalCUDACluster creates one worker per GPU. Workers process their partition independently — embarrassingly parallel operations need no inter-GPU communication.
🔄
CRISP-DM Order
"Business Data Prep Model Evaluate Deploy — But Loop Back!"
CRISP-DM's six phases, in order. Data Preparation consumes 60–80% of project time. The process is iterative — Modeling often cycles back to Data Preparation when new feature needs are discovered.
🔭
Nsight Systems vs Compute
"Systems Sees the Forest; Compute Sees the Tree"
Nsight Systems = system-level timeline (CPU, GPU, PCIe — the whole picture). Nsight Compute = per-kernel metrics (occupancy, memory throughput — one kernel at a time). Start with Systems to find the bottleneck, then Compute to diagnose it.
💾
RMM Pool Resource
"Pool Once, Allocate Many — Kill Fragmentation"
RMM PoolMemoryResource pre-allocates a large GPU memory pool and sub-allocates from it. This eliminates the overhead and fragmentation from repeated cudaMalloc/cudaFree calls in cuDF/cuML pipelines.
📦
ETL GPU Principle
"Extract to GPU, Stay on GPU, Write from GPU"
Use cuIO to read directly to GPU (Parquet preferred), perform all transforms in cuDF, write back with to_parquet(). Never call .to_pandas() mid-pipeline — it kills performance with a PCIe round-trip.
🐢
Low GPU Utilization Fix
"Idle GPU = Hungry GPU Waiting for Data"
Low GPU utilization (nvidia-smi shows <50%) means the GPU is waiting for the CPU to deliver data. Fix: use DALI or async prefetching to overlap data loading with GPU compute — keep the GPU fed continuously.

Flashcards & Advisor

Each flashcard below pairs a prompt with its answer; the Study Advisor then recaps key points for each topic area.


LocalCUDACluster
What does it create and how many workers per GPU?
Spins up a multi-GPU Dask cluster on a single node with one worker per GPU. Each worker is a separate process with its own GPU context. Enables dask_cudf operations across all GPUs on the node.
CRISP-DM Phase 3
What is it and why does it take the most time?
Data Preparation — cleaning nulls, encoding, normalizing, joining, engineering features. Consumes 60–80% of project time. All done in cuDF on GPU. Often revisited after modeling reveals data quality gaps.
Nsight Systems vs Nsight Compute
What level does each tool profile?
Nsight Systems: system-level timeline — CPU, GPU, memory transfers, PCIe activity. Finds overall bottlenecks. Nsight Compute: per-kernel metrics — occupancy, throughput, arithmetic intensity. Diagnoses specific CUDA kernel inefficiencies.
dask_cudf Embarrassingly Parallel
What operations qualify and why are they fast?
Operations like groupby-per-partition, map, filter — each partition processed independently on its GPU worker. No inter-GPU communication needed. Examples: column transforms, row filters, per-group statistics.
RMM PoolMemoryResource
What problem does it solve?
Solves GPU memory fragmentation from many small cudaMalloc/cudaFree calls in cuDF/cuML pipelines. Pre-allocates a large pool and sub-allocates from it — reduces overhead and OOM errors caused by fragmentation.
cudf.set_option('spill', True)
What does it enable and when should you use it vs Dask?
Enables automatic spilling of GPU data to CPU RAM when VRAM is full, via cuDF's built-in spill manager. Performance cost: spilled data moves at PCIe speed (~32 GB/s) instead of HBM (~2 TB/s). Use for modest VRAM overflow; switch to Dask for large multi-GPU workloads.
Parquet in GPU ETL
Why is Parquet the preferred format for cuIO reads?
Parquet is columnar — cuDF reads only needed columns (column pruning at read time). Compressed on disk. GPU-accelerated decode via cuIO. Faster than CSV for analytical workloads. Supports predicate pushdown for row filtering.
RAPIDS UCX
What communication does it enable and for what scale?
RAPIDS UCX (Unified Communication X) enables GPU-to-GPU communication in multi-node Dask clusters. Works over NVLink (intra-node) and InfiniBand (inter-node). Bypasses CPU for data transfers — enables petabyte-scale GPU data science.

Study Advisor

Dask Multi-GPU Key Points

  • dask_cudf: distributed multi-GPU DataFrames; same cuDF API; each partition = one cuDF DataFrame on one GPU
  • LocalCUDACluster(): single-node multi-GPU cluster; one worker per GPU
  • Client(cluster): connects to cluster; submit tasks via dask_cudf operations
  • dask_cudf.read_parquet('*.parquet'): distributes Parquet files as partitions across GPUs
  • Embarrassingly parallel: filter, map, column transform — no inter-GPU comms
  • Shuffle-based: cross-partition groupby — requires inter-GPU communication; more expensive
  • When to use: dataset exceeds single GPU VRAM; want multi-GPU parallelism
