FlashGenius
NCP-ADS Exam Prep · Topic 2

Data Preparation with cuDF

GPU ETL · DataFrame Operations · Cleaning · Joins · cuIO File Reading

Core data wrangling skills for the NCP-ADS exam


Topic 2: Data Preparation with cuDF

Data preparation typically consumes 60–80% of any data science project. cuDF brings this step to the GPU, enabling massive datasets to be cleaned, transformed, and joined orders of magnitude faster than with CPU pandas. This topic covers everything from basic DataFrame ops to GPU-native file I/O.

cuDF Design Philosophy

  • pandas API compatibility: intentional — existing pandas code often runs with just import cudf as pd
  • Apache Arrow columnar layout: data stored column-by-column in GPU memory; ideal for analytical (OLAP) queries
  • Null representation: separate bitmask for null values (Arrow standard); not NaN float tricks
  • Zero-copy interop: cuDF ↔ cuPy ↔ PyTorch tensors via __cuda_array_interface__ — no data copy needed
  • Numba UDFs: write Python functions decorated with @cuda.jit or use apply_rows for custom GPU operations
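
The drop-in compatibility is easy to sketch with a guarded import: a minimal pattern that assumes pandas as the CPU fallback when cuDF is not installed, with everything after the import unchanged:

```python
# Guarded import: use cuDF when it is installed, otherwise plain pandas.
# Everything below the import runs unchanged on GPU or CPU.
try:
    import cudf as pd  # GPU DataFrame library (RAPIDS)
except ImportError:
    import pandas as pd  # CPU fallback with a matching API

df = pd.DataFrame({"x": [1, 2, 3], "y": [10.0, 20.0, 30.0]})
mean_y = df["y"].mean()  # same call either way
```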

Key Performance Principles

  • Keep all operations in cuDF — avoid to_pandas() mid-pipeline (triggers PCIe transfer)
  • Prefer built-in cuDF methods over Python loops — loops are not GPU-parallelized
  • Use RMM PoolMemoryResource for workloads with many small allocations
  • Read files directly with cuDF's cuIO (GPU-accelerated) rather than pandas then converting
  • Use categorical dtypes for low-cardinality string columns — reduces memory and speeds groupby

cuDF vs pandas — API Comparison

| Operation | pandas | cuDF equivalent | Notes |
|---|---|---|---|
| Read CSV | pd.read_csv() | cudf.read_csv() | GPU-accelerated cuIO; much faster on large files |
| GroupBy | df.groupby().agg() | Same API | Hash-based parallel aggregation on GPU |
| Merge/Join | pd.merge() | cudf.merge() | GPU hash join; faster on large tables |
| Apply UDF | df.apply(func) | df.apply_rows(kernel) | cuDF UDF uses Numba CUDA JIT |
| Sort | df.sort_values() | Same API | GPU radix/merge sort |
| String ops | df.str.contains() | Same API | GPU-accelerated via cuDF strings module |
| Convert to CPU | N/A | df.to_pandas() | Triggers PCIe D2H transfer; use sparingly |

Core cuDF DataFrame Operations

These are the workhorses of GPU ETL — groupby aggregations, joins, and sorting are where cuDF delivers the most dramatic speedups over pandas.

GroupBy & Aggregation

GPU GroupBy Mechanics

  • Hash-based implementation: cuDF builds a GPU hash table mapping group keys to row indices — massively parallel across all CUDA cores
  • Supported aggregations: sum, mean, count, min, max, std, var, first, last, nunique
  • Multi-column groupby: df.groupby(['col_a','col_b']).agg({'col_c':'sum','col_d':'mean'})
  • Named aggregations: df.groupby('key').agg(total=('value','sum'), avg=('value','mean'))
  • Speedup: 10–100× over pandas groupby on datasets >10M rows
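
Since the groupby API matches pandas, the named-aggregation pattern above can be sketched with plain pandas (swap the import for cuDF to run it on GPU; the data is invented for illustration):

```python
import pandas as pd  # stand-in: cuDF exposes the same groupby/agg API

df = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "sales": [10, 20, 5, 15, 30],
})

# Named aggregations: one pass over the data computes both columns per group.
# On GPU, cuDF does this with a parallel hash-based aggregation.
out = df.groupby("store").agg(
    total=("sales", "sum"),
    avg=("sales", "mean"),
).reset_index()
```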

Joins & Merges

  • Inner join: returns rows with matching keys in both DataFrames — most common
  • Left join: all rows from left; matching rows from right; NaN for non-matches
  • Outer join: all rows from both; NaN where no match
  • GPU implementation: hash join — build hash table on smaller table, probe with larger table; embarrassingly parallel
  • Multi-key joins: cudf.merge(df1, df2, on=['key1','key2'], how='inner')
  • Broadcast join: when one table is very small — replicate to all GPU threads
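
Join behavior is easiest to see on a tiny example. A pandas sketch (cudf.merge takes the same arguments; the tables are invented):

```python
import pandas as pd  # stand-in: cudf.merge has the same signature

customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Ann", "Bo", "Cy"]})
orders = pd.DataFrame({"cust_id": [1, 1, 3, 4], "amount": [5, 7, 9, 2]})

# Inner join: only keys present in both frames survive (cust_id 1 and 3)
inner = customers.merge(orders, on="cust_id", how="inner")

# Left join: every customer kept; Bo has no order, so amount is null
left = customers.merge(orders, on="cust_id", how="left")
```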
Sorting, Filtering & Window Functions

Sorting

  • df.sort_values('col', ascending=False) — GPU radix sort on numeric data
  • Multi-column sort: df.sort_values(['col_a','col_b'], ascending=[True,False])
  • Stable sort by default; GPU radix sort is O(n) for fixed-width types
  • df.nlargest(k, 'col') / df.nsmallest(k, 'col') — top-K without full sort
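
A short pandas sketch of the sorting calls above (identical API in cuDF; toy data):

```python
import pandas as pd  # stand-in: same sort API in cuDF

df = pd.DataFrame({"id": list("abcde"), "score": [7, 3, 9, 1, 5]})

ordered = df.sort_values("score", ascending=False)  # full descending sort
top2 = df.nlargest(2, "score")                      # top-K without sorting everything
```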

Filtering & Selection

  • Boolean mask: df[df['age'] > 30] — creates boolean array on GPU, applies mask
  • Multiple conditions: df[(df['a']>0) & (df['b']<100)] — use & not and
  • df.query("age > 30 and salary < 100000") — string-based query (cuDF subset supported)
  • df.isin(['val1','val2']) — membership test on GPU
  • Column selection: df[['col_a','col_b']] — zero-copy view of Arrow buffer
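
The mask-combination rule (& rather than the Python and keyword) can be sketched like this; query() gives the same result in string form (toy data):

```python
import pandas as pd  # stand-in: same boolean-mask semantics in cuDF

df = pd.DataFrame({
    "age": [25, 35, 45, 28],
    "salary": [50_000, 90_000, 120_000, 70_000],
})

# Element-wise & combines two boolean masks; `and` would raise an error
mask = (df["age"] > 30) & (df["salary"] < 100_000)
subset = df[mask]

# Equivalent string-based form
same = df.query("age > 30 and salary < 100000")
```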

Window & Rolling Functions

  • df['col'].rolling(window=7).mean() — sliding window aggregation on GPU
  • df.groupby('key')['val'].cumsum() — cumulative operations within groups
  • Ranking: df['col'].rank(method='average')
  • Shift: df['col'].shift(1) — lag/lead for time series feature engineering
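
A quick sketch of rolling means and lag features (pandas stand-in; the series is arbitrary):

```python
import pandas as pd  # stand-in: rolling/shift behave the same in cuDF

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

ma3 = s.rolling(window=3).mean()  # first two entries are null (incomplete window)
lag1 = s.shift(1)                 # previous value: a lag feature for time series
```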

Data Cleaning & Transformations

Data rarely arrives clean. cuDF provides GPU-accelerated tools for handling missing values, type conversions, normalization, encoding, and custom transforms — all without leaving GPU memory.

Missing Value Handling

Null Detection & Filling

  • cuDF nulls: stored as a separate Arrow bitmask — distinct from NaN (which is a float value)
  • df.isnull() / df.notnull() — GPU-parallel null detection
  • df['col'].fillna(value) — fill with scalar; df['col'].ffill() — forward fill (the fillna(method='ffill') form is deprecated in recent releases)
  • df.dropna(subset=['col_a','col_b']) — drop rows with nulls in specified columns
  • Imputation: df['col'].fillna(df['col'].mean()) — mean imputation on GPU
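
Mean imputation and selective row drops, sketched with pandas (same API in cuDF; note pandas displays nulls as NaN while cuDF keeps a bitmask):

```python
import numpy as np
import pandas as pd  # stand-in: fillna/dropna have the same API in cuDF

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 2.0]})

df["a"] = df["a"].fillna(df["a"].mean())  # mean of [1, 3] is 2 -> fills the gap
clean = df.dropna(subset=["b"])           # drop rows still null in column b
```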

Type Conversions & Casting

  • df['col'].astype('float32') — cast column to a new dtype; returns a new GPU column (astype is not in-place)
  • Common casts: object → category for low-cardinality strings; int64 → int32 to halve memory
  • Categorical dtype: stores unique string values once; each row stores an integer code — ideal for groupby
  • pd.to_datetime(df['ts']) → cudf.to_datetime(df['ts']) — GPU datetime parsing
  • Datetime extraction: df['ts'].dt.year, .dt.month, .dt.dayofweek, .dt.hour
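
Casting and datetime extraction in one pandas sketch (identical calls in cuDF apart from the cudf.to_datetime spelling; the sample rows are invented):

```python
import pandas as pd  # stand-in: use cudf.to_datetime on GPU

df = pd.DataFrame({
    "city": ["NY", "LA", "NY"],
    "ts": ["2024-01-15", "2024-02-20", "2024-03-05"],
})

df["city"] = df["city"].astype("category")  # integer codes under the hood
df["ts"] = pd.to_datetime(df["ts"])         # parse strings to datetimes
df["month"] = df["ts"].dt.month             # datetime component extraction
```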
Normalization, Encoding & Feature Engineering

Normalization & Scaling

  • Min-max: (df['col'] - df['col'].min()) / (df['col'].max() - df['col'].min())
  • Z-score: (df['col'] - df['col'].mean()) / df['col'].std()
  • Or use cuml.preprocessing.MinMaxScaler / StandardScaler — cuML's scikit-learn-style wrappers
  • Log transform: cupy.log1p(df['col'].values) — use cuPy for element-wise math
  • Clip outliers: df['col'].clip(lower=0, upper=99)
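
The scaling formulas above in runnable form (pandas stand-in; cuDF evaluates the same column arithmetic on GPU):

```python
import pandas as pd  # stand-in: identical column arithmetic in cuDF

s = pd.Series([10.0, 20.0, 30.0, 40.0])

minmax = (s - s.min()) / (s.max() - s.min())  # rescaled to [0, 1]
zscore = (s - s.mean()) / s.std()             # mean 0, sample std 1
clipped = s.clip(lower=15, upper=35)          # cap values at the bounds
```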

Categorical Encoding

  • Label encoding: df['cat'].astype('category').cat.codes — assign integer code per category
  • One-hot encoding: cudf.get_dummies(df, columns=['cat_col']) — creates binary columns per category
  • Target encoding: merge target mean per category using groupby + merge
  • Ordinal encoding: map categories to integers using df['col'].map({'Low':0,'Med':1,'High':2})
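
The three encodings side by side, sketched in pandas (cudf.get_dummies and .cat.codes work the same; the tiny frame is made up):

```python
import pandas as pd  # stand-in: cuDF mirrors get_dummies and .cat.codes

df = pd.DataFrame({"size": ["Low", "High", "Med", "Low"], "y": [1, 0, 1, 1]})

# Label encoding: one integer code per category (ordered alphabetically)
codes = df["size"].astype("category").cat.codes

# One-hot encoding: one binary column per category level
onehot = pd.get_dummies(df, columns=["size"])

# Target encoding: per-category mean of the target, merged back on the key
means = df.groupby("size")["y"].mean().rename("size_te").reset_index()
df = df.merge(means, on="size")
```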

Custom Transforms with Numba

  • apply_rows(): applies a Numba-JIT function row-wise across the DataFrame on GPU
  • Use @numba.cuda.jit for custom GPU kernels when cuDF built-ins aren't sufficient
  • CuPy integration: df['col'].values returns a CuPy array — apply any CuPy math operation
  • Avoid Python-level loops over DataFrame rows — not parallelized
Data Deduplication & Outlier Handling

Deduplication

  • df.drop_duplicates(subset=['id','timestamp']) — GPU-parallel duplicate detection and removal
  • df.duplicated(subset=['col'], keep='first') — returns boolean mask of duplicate rows
  • For approximate deduplication on large datasets: hash-based fingerprinting + groupby
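
Duplicate masking and removal in a pandas sketch (same calls in cuDF; the rows are invented):

```python
import pandas as pd  # stand-in: drop_duplicates matches cuDF

df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "ts": ["09:00", "09:00", "09:05", "09:10"],
})

dupes = df.duplicated(subset=["id", "ts"], keep="first")  # True for repeats
deduped = df.drop_duplicates(subset=["id", "ts"])         # keeps first of each pair
```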

Outlier Detection & Removal

  • Z-score method: compute (x - mean)/std; flag rows where |z| > 3
  • IQR method: Q1 = 25th percentile, Q3 = 75th; outlier if x < Q1 - 1.5×IQR or x > Q3 + 1.5×IQR
  • df['col'].quantile([0.25, 0.75]) — GPU quantile computation
  • Statistical outlier methods scale effortlessly to 100M+ rows on GPU
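
The IQR recipe end to end, sketched with pandas (quantile and masking are the same in cuDF; the series is a toy with one planted outlier):

```python
import pandas as pd  # stand-in: GPU quantile + boolean mask in cuDF

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # 100 is the planted outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)
inliers = s[mask]  # drops 100, keeps the rest
```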

File I/O & String Operations

cuIO brings GPU acceleration to file reading and writing. The cuDF strings module handles text data — regex, splitting, tokenizing — entirely on the GPU.

cuIO — GPU-Accelerated File I/O

Supported File Formats

  • CSV: cudf.read_csv('file.csv') — multi-threaded GPU parsing; specify dtype dict to avoid inference overhead
  • Parquet: cudf.read_parquet('file.parquet') — columnar format; ideal for cuDF (column-oriented); supports predicate pushdown
  • ORC: cudf.read_orc() — compressed columnar; common in Hive/Spark ecosystems
  • JSON: cudf.read_json() — line-delimited JSON (JSONL); GPU-accelerated parsing
  • Avro: cudf.read_avro() — row-oriented; less common but supported

I/O Performance Tips

  • Parquet > CSV for repeated reads — compressed, columnar, supports column pruning
  • Predicate pushdown: cudf.read_parquet('file.parquet', filters=[('col','>',100)]) — skip row groups at read time
  • Column pruning: cudf.read_csv('file.csv', usecols=['col_a','col_b']) — read only needed columns
  • GDS (GPUDirect Storage): reads data directly from NVMe SSD to GPU memory — bypasses CPU RAM entirely; requires NVIDIA Magnum IO
  • Write: df.to_parquet('out.parquet'), df.to_csv('out.csv')
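
Column pruning plus an explicit dtype dict, sketched with pandas read_csv (cudf.read_csv accepts the same usecols/dtype arguments; the CSV is inline for the example):

```python
import io
import pandas as pd  # stand-in: cudf.read_csv takes the same arguments

csv_data = "user_id,age,city,notes\n1,34,NY,hello\n2,29,LA,world\n"

# Read only the needed columns and skip dtype inference
df = pd.read_csv(
    io.StringIO(csv_data),
    usecols=["user_id", "age"],
    dtype={"user_id": "int32", "age": "int32"},
)
```
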
String Operations

cuDF Strings Module

  • Access via df['text_col'].str accessor — same as pandas
  • .str.contains('pattern', regex=True) — GPU regex matching
  • .str.replace('old','new') — GPU string replacement
  • .str.split(delimiter) — returns list-type column
  • .str.lower() / .str.upper() / .str.strip()
  • .str.len() — character count per string
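
Chained string ops in a pandas sketch (the .str accessor is identical in cuDF; the sample strings are invented):

```python
import pandas as pd  # stand-in: cuDF's .str accessor matches

s = pd.Series(["  Apple Pie ", "banana", "Cherry tart"])

cleaned = s.str.strip().str.lower()              # normalize whitespace and case
has_rr = cleaned.str.contains("rr", regex=True)  # regex match per row
lengths = cleaned.str.len()                      # character count per string
```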

String Encoding & Tokenization

  • cuDF strings stored as variable-length UTF-8 with separate offsets array
  • .str.token_count(delimiter=' ') — word count on GPU
  • .str.ngrams(n=2) — character n-grams generation
  • For NLP preprocessing at scale: use cuDF strings → pass to cuML or RAPIDS NLP tools
  • Convert to category: df['str_col'].astype('category') — efficient for repeated values
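
A word-count sketch: cuDF's token_count is shown here via the equivalent split-then-length idiom so the example also runs on CPU pandas (token_count itself is cuDF-only):

```python
import pandas as pd  # stand-in; on GPU: s.str.token_count(delimiter=' ')

s = pd.Series(["the quick brown fox", "hello world", "one"])

# Split on spaces, then count tokens per row
word_counts = s.str.split(" ").str.len()
```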

Synthetic Data Generation

  • Use cupy.random for GPU-generated random numerical data
  • cupy.random.randn(n) — standard normal distribution on GPU
  • cupy.random.randint(low, high, size=n) — uniform integers
  • Wrap in cuDF: cudf.Series(cupy.random.randn(1_000_000))
  • Useful for testing pipeline performance before real data arrives
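
A synthetic-data sketch using NumPy and pandas as stand-ins for CuPy and cuDF (cupy.random mirrors numpy.random's interface, so the same calls run on GPU; the shapes are arbitrary):

```python
import numpy as np   # stand-in: cupy.random mirrors this interface
import pandas as pd  # stand-in for cudf

np.random.seed(0)                            # reproducible fake data
values = np.random.randn(1_000)              # GPU: cupy.random.randn(1_000)
ids = np.random.randint(0, 10, size=1_000)   # GPU: cupy.random.randint(0, 10, size=1_000)

df = pd.DataFrame({"key": ids, "val": values})
```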

Practice Quiz — Data Preparation with cuDF

10 questions covering cuDF operations, data cleaning, file I/O, and string processing.

Memory Hooks

Lock in cuDF's key behaviors and tradeoffs before exam day.

🔄
Null vs NaN
"cuDF Uses Null Bitmasks, Not NaN Floats"
pandas represents missing values as NaN (a float trick). cuDF uses Apache Arrow's separate bitmask — works for any dtype, not just floats. Always use isnull() to check, and fillna() to fill, in both.
🚫
Avoid Mid-Pipeline Conversions
"to_pandas() Breaks the GPU Pipeline"
Every to_pandas() call fires a PCIe D2H transfer. One call is fine; ten calls in a loop kills performance. Design pipelines to stay in cuDF from file read to model input — convert to pandas only for final output.
🏷️
Categorical Dtype
"Category Codes = Memory Win + GroupBy Speedup"
A string column with 10 unique values across 100M rows stores 100M string copies. Casting to category stores 10 strings once + 100M integer codes. Groupby on integer codes is far faster than on strings.
📄
Parquet > CSV
"Parquet: Columns + Compression + Predicate Pushdown"
Parquet is columnar like cuDF — reads only the columns you need. It's compressed, so smaller files transfer faster. Predicate pushdown skips entire row groups at read time. CSV re-reads everything, every time.
🔑
GPU Hash Join
"Hash Join: Build Small Table, Probe Big Table"
cuDF's merge builds a GPU hash table on the smaller DataFrame and probes it with the larger one — all in parallel. Inner join is fastest; outer join requires more memory. Always put the smaller table on the right side of the merge.
🧵
String Operations
"df.str.* Works — But It's Not Free"
The .str accessor works identically in cuDF and pandas. On GPU, regex and string ops on 100M strings are orders of magnitude faster than pandas. For best performance, cast repeated string values to category before groupby.

Flashcards & Advisor


cuDF Null Representation
How does cuDF store missing values?
Apache Arrow bitmask — a separate per-column bit array where 0 = null, 1 = valid. This works for any dtype (int, float, string). Unlike pandas which uses NaN (float trick), cuDF nulls are dtype-agnostic.
cudf.get_dummies()
What does it do and when do you use it?
One-hot encoding on GPU. Creates one binary column per unique category value. Use for nominal (unordered) categorical features before ML. High-cardinality columns produce many columns — consider embedding or target encoding instead.
Parquet Predicate Pushdown
What is it and how do you use it in cuDF?
Skips entire Parquet row groups that don't match a filter condition — data is never read from disk. In cuDF: cudf.read_parquet('file.parquet', filters=[('age', '>', 30)]). Dramatically reduces I/O on large files.
Categorical dtype in cuDF
What memory advantage does it provide?
Stores unique string values once, then replaces each row value with a small integer code. For a column with 20 unique strings across 100M rows: stores 20 strings + 100M int32 codes vs. 100M full string copies. Also speeds up groupby operations.
apply_rows() vs apply()
Why is cuDF's apply_rows() different from pandas apply()?
pandas apply(func) calls a Python function row-by-row (serial). cuDF's apply_rows(kernel) requires a Numba CUDA JIT function that runs in parallel across all rows on the GPU — orders of magnitude faster for large DataFrames.
GPUDirect Storage (GDS)
What bottleneck does it eliminate?
Reads data directly from NVMe SSD → GPU VRAM, bypassing CPU RAM entirely. Eliminates the SSD→CPU→GPU copy bottleneck. Requires NVIDIA Magnum IO. Enables near-storage-bandwidth data loading without CPU involvement.
IQR Outlier Method
What are the bounds and how do you compute them in cuDF?
IQR = Q3 − Q1. Outlier bounds: lower = Q1 − 1.5×IQR, upper = Q3 + 1.5×IQR. In cuDF: q = df['col'].quantile([0.25, 0.75]) → compute IQR → filter with boolean mask. Fully GPU-parallelized on 100M+ rows.
__cuda_array_interface__
What does this enable between cuDF, cuPy, and PyTorch?
A standard protocol for zero-copy sharing of GPU array data between libraries. cuDF Series/columns, CuPy arrays, and PyTorch CUDA tensors all implement it — data pointer is shared with no GPU-to-GPU copy needed.

Study Advisor

Core DataFrame Operations

  • GroupBy: hash-based parallel aggregation; multi-column groupby with agg dict; 10–100× over pandas
  • Merge: hash join on GPU; build smaller table, probe with larger; inner join fastest
  • Sort: GPU radix sort on numeric; multi-column sort with ascending list; nlargest/nsmallest for top-K
  • Filter: boolean mask with & (not and) for multiple conditions; query() for string-based filtering
  • Rolling: df['col'].rolling(7).mean() — sliding window on GPU; shift() for lag features
  • Zero-copy interop: cuDF ↔ cuPy ↔ PyTorch via __cuda_array_interface__ — no data copy

Ready to Pass the NCP-ADS?

Practice with full-length exams, timed simulations, and detailed answer explanations

Unlock Full Practice Tests on FlashGenius →