Topic 2: Data Preparation with cuDF
Data preparation is typically 60–80% of any data science project. cuDF brings this step to the GPU, enabling massive datasets to be cleaned, transformed, and joined at speeds impossible with CPU pandas. This topic covers everything from basic DataFrame ops to GPU-native file I/O.
cuDF Design Philosophy
- pandas API compatibility: intentional — existing pandas code often runs with just `import cudf as pd`
- Apache Arrow columnar layout: data stored column-by-column in GPU memory; ideal for analytical (OLAP) queries
- Null representation: separate bitmask for null values (Arrow standard); not NaN float tricks
- Zero-copy interop: cuDF ↔ CuPy ↔ PyTorch tensors via `__cuda_array_interface__` — no data copy needed
- Numba UDFs: write Python functions decorated with `@cuda.jit` or use `apply_rows` for custom GPU operations
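The pandas-compatibility point above can be seen in a tiny script. It is written with the pandas import here so it runs anywhere; the commented-out cuDF import (assuming cuDF is installed on a GPU machine) is the one-line swap the notes describe:

```python
# Same script runs on CPU (pandas) or GPU (cuDF) — only the import changes.
import pandas as pd    # CPU baseline
# import cudf as pd    # on a GPU machine: identical API, GPU execution

df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})
totals = df.groupby("key")["val"].sum()
print(totals.to_dict())  # {'a': 4, 'b': 6}
```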
Key Performance Principles
- Keep all operations in cuDF — avoid `to_pandas()` mid-pipeline (triggers PCIe transfer)
- Prefer built-in cuDF methods over Python loops — loops are not GPU-parallelized
- Use RMM PoolMemoryResource for workloads with many small allocations
- Read files directly with cuDF's cuIO (GPU-accelerated) rather than pandas then converting
- Use categorical dtypes for low-cardinality string columns — reduces memory and speeds groupby
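The categorical-dtype principle can be demonstrated with the pandas-compatible API (the same `astype('category')` cast works in cuDF; the column contents here are made up for illustration):

```python
import pandas as pd

# A low-cardinality string column repeated many times
s = pd.Series(["low", "med", "high"] * 100_000)

# Categorical dtype stores the 3 unique strings once plus one integer code per row
cat = s.astype("category")

# The categorical version is far smaller than the raw string column
print(s.memory_usage(deep=True) > cat.memory_usage(deep=True))  # True
```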
cuDF vs pandas — API Comparison
| Operation | pandas | cuDF equivalent | Notes |
|---|---|---|---|
| Read CSV | pd.read_csv() | cudf.read_csv() | GPU-accelerated cuIO; much faster on large files |
| GroupBy | df.groupby().agg() | Same API | Hash-based parallel aggregation on GPU |
| Merge/Join | pd.merge() | cudf.merge() | GPU hash join; faster on large tables |
| Apply UDF | df.apply(func) | df.apply_rows(kernel) | cuDF UDF uses Numba CUDA JIT |
| Sort | df.sort_values() | Same API | GPU radix/merge sort |
| String ops | df.str.contains() | Same API | GPU-accelerated via cuDF strings module |
| Convert to CPU | N/A | df.to_pandas() | Triggers PCIe D2H transfer — use sparingly |
Core cuDF DataFrame Operations
These are the workhorses of GPU ETL — groupby aggregations, joins, and sorting are where cuDF delivers the most dramatic speedups over pandas.
GPU GroupBy Mechanics
- Hash-based implementation: cuDF builds a GPU hash table mapping group keys to row indices — massively parallel across all CUDA cores
- Supported aggregations: `sum`, `mean`, `count`, `min`, `max`, `std`, `var`, `first`, `last`, `nunique`
- Multi-column groupby: `df.groupby(['col_a','col_b']).agg({'col_c':'sum','col_d':'mean'})`
- Named aggregations: `df.groupby('key').agg(total=('value','sum'), avg=('value','mean'))`
- Speedup: 10–100× over pandas groupby on datasets >10M rows
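A minimal sketch of the named-aggregation pattern above, written with pandas so it runs on CPU (the identical call works in cuDF; column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"key": ["x", "x", "y"], "value": [10.0, 20.0, 5.0]})

# Named aggregations: one output column per (input column, agg function) pair
out = df.groupby("key").agg(total=("value", "sum"), avg=("value", "mean"))

print(out.loc["x", "total"], out.loc["x", "avg"])  # 30.0 15.0
```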
Joins & Merges
- Inner join: returns rows with matching keys in both DataFrames — most common
- Left join: all rows from left; matching rows from right; NaN for non-matches
- Outer join: all rows from both; NaN where no match
- GPU implementation: hash join — build hash table on smaller table, probe with larger table; embarrassingly parallel
- Multi-key joins: `cudf.merge(df1, df2, on=['key1','key2'], how='inner')`
- Broadcast join: when one table is very small — replicate it to all GPU threads
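The multi-key inner join can be sketched with the pandas-compatible `merge` call (same signature in cuDF; the toy tables are made up):

```python
import pandas as pd

left = pd.DataFrame({"key1": [1, 1, 2], "key2": ["a", "b", "a"], "lval": [10, 20, 30]})
right = pd.DataFrame({"key1": [1, 2], "key2": ["a", "a"], "rval": [100, 200]})

# Inner join on two keys: only rows whose (key1, key2) pair appears in both survive
merged = pd.merge(left, right, on=["key1", "key2"], how="inner")

print(len(merged))  # 2 — (1,'a') and (2,'a') match; (1,'b') has no partner
```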
Sorting
- `df.sort_values('col', ascending=False)` — GPU radix sort on numeric data
- Multi-column sort: `df.sort_values(['col_a','col_b'], ascending=[True,False])`
- Stable sort by default; GPU radix sort is O(n) for fixed-width types
- `df.nlargest(k, 'col')` / `df.nsmallest(k, 'col')` — top-K without a full sort
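The top-K shortcut works the same way in pandas and cuDF; a small sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c", "d", "e"], "score": [3, 9, 1, 7, 5]})

# Top-K without a full sort — cheaper than sort_values().head(k) on large data
top2 = df.nlargest(2, "score")

print(top2["score"].tolist())  # [9, 7]
```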
Filtering & Selection
- Boolean mask: `df[df['age'] > 30]` — creates a boolean array on GPU, then applies the mask
- Multiple conditions: `df[(df['a']>0) & (df['b']<100)]` — use `&` and `|`, not Python `and`/`or`
- `df.query("age > 30 and salary < 100000")` — string-based query (cuDF supports a subset of the syntax)
- `df.isin(['val1','val2'])` — membership test on GPU
- Column selection: `df[['col_a','col_b']]` — zero-copy view of the Arrow buffer
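Both filtering styles above, sketched with pandas (identical in cuDF; the columns and thresholds are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 40, 35], "salary": [50_000, 120_000, 80_000]})

# Element-wise boolean ops need & and |, with parentheses around each condition
masked = df[(df["age"] > 30) & (df["salary"] < 100_000)]

# query() expresses the same filter as a string
queried = df.query("age > 30 and salary < 100000")

print(masked["age"].tolist())  # [35]
```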
Window & Rolling Functions
- `df['col'].rolling(window=7).mean()` — sliding-window aggregation on GPU
- `df.groupby('key')['val'].cumsum()` — cumulative operations within groups
- Ranking: `df['col'].rank(method='average')`
- Shift: `df['col'].shift(1)` — lag/lead for time-series feature engineering
Data Cleaning & Transformations
Data rarely arrives clean. cuDF provides GPU-accelerated tools for handling missing values, type conversions, normalization, encoding, and custom transforms — all without leaving GPU memory.
Null Detection & Filling
- cuDF nulls: stored as a separate Arrow bitmask — distinct from NaN (which is a float value)
- `df.isnull()` / `df.notnull()` — GPU-parallel null detection
- `df['col'].fillna(value)` — fill with a scalar; `fillna(method='ffill')` — forward fill
- `df.dropna(subset=['col_a','col_b'])` — drop rows with nulls in the specified columns
- Imputation: `df['col'].fillna(df['col'].mean())` — mean imputation on GPU
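Mean imputation as listed above, sketched with pandas (identical in cuDF; the toy series is made up):

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])

# Mean imputation: mean() skips nulls by default, so mean([1, 3]) == 2
filled = s.fillna(s.mean())

print(filled.tolist())  # [1.0, 2.0, 3.0]
```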
Type Conversions & Casting
- `df['col'].astype('float32')` — cast a column to a new dtype; returns a new column, computed on GPU
- Common casts: `object` → `category` for low-cardinality strings; `int64` → `int32` to halve memory
- Categorical dtype: stores each unique string value once; each row stores an integer code — ideal for groupby
- `pd.to_datetime(df['ts'])` → `cudf.to_datetime(df['ts'])` — GPU datetime parsing
- Datetime extraction: `df['ts'].dt.year`, `.dt.month`, `.dt.dayofweek`, `.dt.hour`
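Datetime parsing and component extraction, sketched with pandas (cuDF's `.dt` accessor behaves the same; the timestamps are invented):

```python
import pandas as pd

ts = pd.to_datetime(pd.Series(["2024-01-15 08:30:00", "2024-06-01 17:00:00"]))

# Component extraction via the .dt accessor
print(ts.dt.year.tolist(), ts.dt.hour.tolist())  # [2024, 2024] [8, 17]
```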
Normalization & Scaling
- Min-max: `(df['col'] - df['col'].min()) / (df['col'].max() - df['col'].min())`
- Z-score: `(df['col'] - df['col'].mean()) / df['col'].std()`
- Or use `cuml.preprocessing.MinMaxScaler` / `StandardScaler` — cuML wrappers
- Log transform: `cupy.log1p(df['col'].values)` — use CuPy for element-wise math
- Clip outliers: `df['col'].clip(lower=0, upper=99)`
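The two scaling formulas above, written out with pandas (identical expressions in cuDF; the toy column is made up):

```python
import pandas as pd

col = pd.Series([0.0, 5.0, 10.0])

# Min-max scaling to [0, 1]
minmax = (col - col.min()) / (col.max() - col.min())

# Z-score standardization (.std() defaults to the sample std, ddof=1)
z = (col - col.mean()) / col.std()

print(minmax.tolist())  # [0.0, 0.5, 1.0]
print(z.tolist())       # [-1.0, 0.0, 1.0]
```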
Categorical Encoding
- Label encoding: `df['cat'].astype('category').cat.codes` — assigns an integer code per category
- One-hot encoding: `cudf.get_dummies(df, columns=['cat_col'])` — creates a binary column per category
- Target encoding: merge the target mean per category using groupby + merge
- Ordinal encoding: map categories to integers with `df['col'].map({'Low':0,'Med':1,'High':2})`
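Three of the encodings above side by side, sketched with pandas (`cudf.get_dummies` plays the role of `pd.get_dummies` on GPU; column name and categories are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"size": ["Low", "High", "Med", "Low"]})

codes = df["size"].astype("category").cat.codes            # label encoding (alphabetical codes)
onehot = pd.get_dummies(df, columns=["size"])              # one binary column per category
ordinal = df["size"].map({"Low": 0, "Med": 1, "High": 2})  # order-aware encoding

print(ordinal.tolist())  # [0, 2, 1, 0]
print(onehot.shape)      # (4, 3)
```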
Custom Transforms with Numba
- `apply_rows()`: applies a Numba-JIT function row-wise across the DataFrame on GPU
- Use `@numba.cuda.jit` for custom GPU kernels when cuDF built-ins aren't sufficient
- CuPy integration: `df['col'].values` returns a CuPy array — apply any CuPy math operation
- Avoid Python-level loops over DataFrame rows — they are not parallelized
Deduplication
- `df.drop_duplicates(subset=['id','timestamp'])` — GPU-parallel duplicate detection and removal
- `df.duplicated(subset=['col'], keep='first')` — returns a boolean mask of duplicate rows
- For approximate deduplication on large datasets: hash-based fingerprinting + groupby
Outlier Detection & Removal
- Z-score method: compute `(x - mean) / std`; flag rows where |z| > 3
- IQR method: Q1 = 25th percentile, Q3 = 75th; outlier if x < Q1 − 1.5×IQR or x > Q3 + 1.5×IQR
- `df['col'].quantile([0.25, 0.75])` — GPU quantile computation
- Statistical outlier methods scale easily to 100M+ rows on GPU
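The IQR fences worked through on a toy series with pandas (identical in cuDF; the obvious outlier 100 is planted for illustration):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 100])

# IQR fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
clean = s[(s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)]

print(clean.tolist())  # [1, 2, 3, 4, 5] — 100 falls outside the upper fence
```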
File I/O & String Operations
cuIO brings GPU acceleration to file reading and writing. The cuDF strings module handles text data — regex, splitting, tokenizing — entirely on the GPU.
Supported File Formats
- CSV: `cudf.read_csv('file.csv')` — multi-threaded GPU parsing; specify a `dtype` dict to avoid inference overhead
- Parquet: `cudf.read_parquet('file.parquet')` — columnar format, a natural fit for column-oriented cuDF; supports predicate pushdown
- ORC: `cudf.read_orc()` — compressed columnar; common in Hive/Spark ecosystems
- JSON: `cudf.read_json()` — line-delimited JSON (JSONL); GPU-accelerated parsing
- Avro: `cudf.read_avro()` — row-oriented; less common but supported
I/O Performance Tips
- Parquet > CSV for repeated reads — compressed, columnar, supports column pruning
- Predicate pushdown: `cudf.read_parquet('file.parquet', filters=[('col','>',100)])` — skips row groups at read time
- Column pruning: `cudf.read_csv('file.csv', usecols=['col_a','col_b'])` — read only the needed columns
- GDS (GPUDirect Storage): reads data directly from NVMe SSD into GPU memory — bypasses CPU RAM entirely; requires NVIDIA Magnum IO
- Write: `df.to_parquet('out.parquet')`, `df.to_csv('out.csv')`
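A column-pruning round trip sketched with pandas and CSV so it runs without a GPU or Parquet engine (the `usecols` parameter behaves the same in `cudf.read_csv`; file names and columns are made up):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"col_a": [1, 2, 3], "col_b": ["x", "y", "z"], "col_c": [0.1, 0.2, 0.3]})

path = os.path.join(tempfile.mkdtemp(), "out.csv")
df.to_csv(path, index=False)

# Column pruning: parse only the columns the pipeline needs
back = pd.read_csv(path, usecols=["col_a"])

print(list(back.columns), back["col_a"].tolist())  # ['col_a'] [1, 2, 3]
```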
cuDF Strings Module
- Access via the `df['text_col'].str` accessor — same as pandas
- `.str.contains('pattern', regex=True)` — GPU regex matching
- `.str.replace('old','new')` — GPU string replacement
- `.str.split(delimiter)` — returns a list-type column
- `.str.lower()` / `.str.upper()` / `.str.strip()`
- `.str.len()` — character count per string
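A few of the `.str` operations above, sketched with pandas (the accessor is API-compatible in cuDF; the strings are invented):

```python
import pandas as pd

s = pd.Series(["Hello World", "GPU rocks", "hello cudf"])

hits = s.str.contains("hello", regex=True)  # case-sensitive regex match
lengths = s.str.len()

print(hits.tolist())     # [False, False, True]
print(lengths.tolist())  # [11, 9, 10]
```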
String Encoding & Tokenization
- cuDF strings are stored as variable-length UTF-8 with a separate offsets array
- `.str.token_count(delimiter=' ')` — word count on GPU
- `.str.ngrams(n=2)` — character n-gram generation
- For NLP preprocessing at scale: use cuDF strings, then pass results to cuML or RAPIDS NLP tools
- Convert to category: `df['str_col'].astype('category')` — efficient for repeated values
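`token_count` is cuDF-specific, but the same word count can be sketched portably by combining split and length (the sentences are made up):

```python
import pandas as pd

s = pd.Series(["the quick fox", "hello world"])

# Word count by splitting on spaces; in cuDF, .str.token_count does this in one GPU pass
word_counts = s.str.split(" ").str.len()

print(word_counts.tolist())  # [3, 2]
```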
Synthetic Data Generation
- Use `cupy.random` for GPU-generated random numerical data
- `cupy.random.randn(n)` — standard normal distribution on GPU
- `cupy.random.randint(low, high, size=n)` — uniform integers
- Wrap in cuDF: `cudf.Series(cupy.random.randn(1_000_000))`
- Useful for testing pipeline performance before real data arrives
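A CPU-runnable sketch of the pattern above, with NumPy standing in for CuPy and pandas for cuDF (on a GPU machine the commented swaps apply; the seed is arbitrary):

```python
import numpy as np
import pandas as pd

# numpy stands in for cupy here so this runs on CPU; on a GPU:
#   data = cupy.random.randn(1_000_000); s = cudf.Series(data)
rng = np.random.default_rng(42)
data = rng.standard_normal(1_000_000)
s = pd.Series(data)

print(len(s))  # 1000000 synthetic standard-normal samples
```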
Memory Hooks
Lock in cuDF's key behaviors and tradeoffs before exam day.
- Nulls: cuDF stores nulls in an Arrow bitmask rather than as NaN — but `isnull()` to check and `fillna()` to fill work in both pandas and cuDF.
- Every `to_pandas()` call fires a PCIe D2H transfer. One call is fine; ten calls in a loop kill performance. Design pipelines to stay in cuDF from file read to model input — convert to pandas only for final output.
- `category` stores 10 unique strings once plus 100M integer codes. Groupby on integer codes is far faster than on raw strings.
- The `.str` accessor works identically in cuDF and pandas. On GPU, regex and string ops on 100M strings are orders of magnitude faster than in pandas. For best performance, cast repeated string values to `category` before groupby.
Flashcards & Advisor
- Predicate pushdown at read time: `cudf.read_parquet('file.parquet', filters=[('age', '>', 30)])` — dramatically reduces I/O on large files.
- pandas `apply(func)` calls a Python function row by row (serial). cuDF's `apply_rows(kernel)` takes a Numba CUDA JIT function that runs in parallel across all rows on the GPU — orders of magnitude faster on large DataFrames.
- IQR-based outlier removal: `q = df['col'].quantile([0.25, 0.75])` → compute the IQR → filter with a boolean mask. Fully GPU-parallelized on 100M+ rows.
Core DataFrame Operations
- GroupBy: hash-based parallel aggregation; multi-column groupby with agg dict; 10–100× over pandas
- Merge: hash join on GPU; build smaller table, probe with larger; inner join fastest
- Sort: GPU radix sort on numeric; multi-column sort with ascending list; nlargest/nsmallest for top-K
- Filter: boolean mask with & (not and) for multiple conditions; query() for string-based filtering
- Rolling: df['col'].rolling(7).mean() — sliding window on GPU; shift() for lag features
- Zero-copy interop: cuDF ↔ cuPy ↔ PyTorch via __cuda_array_interface__ — no data copy