Topic 2: Data Preparation with cuDF
Data preparation is typically 60–80% of any data science project. cuDF brings this step to the GPU, enabling massive datasets to be cleaned, transformed, and joined at speeds impossible with CPU pandas. This topic covers everything from basic DataFrame ops to GPU-native file I/O.
cuDF Design Philosophy
- pandas API compatibility: intentional — existing pandas code often runs with just `import cudf as pd`
- Apache Arrow columnar layout: data stored column-by-column in GPU memory; ideal for analytical (OLAP) queries
- Null representation: separate bitmask for null values (Arrow standard); not NaN float tricks
- Zero-copy interop: cuDF ↔ CuPy ↔ PyTorch tensors via `__cuda_array_interface__` — no data copy needed
- Numba UDFs: write Python functions decorated with `@cuda.jit` or use `apply_rows` for custom GPU operations
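The pandas-compatibility point above can be seen in a tiny script. It is written with the pandas import here so it runs anywhere; the commented-out cuDF import (assuming cuDF is installed on a GPU machine) is the one-line swap the notes describe:

```python
# Same script runs on CPU (pandas) or GPU (cuDF) — only the import changes.
import pandas as pd    # CPU baseline
# import cudf as pd    # on a GPU machine: identical API, GPU execution

df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})
totals = df.groupby("key")["val"].sum()
print(totals.to_dict())  # {'a': 4, 'b': 6}
```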
Key Performance Principles
- Keep all operations in cuDF — avoid `to_pandas()` mid-pipeline (triggers PCIe transfer)
- Prefer built-in cuDF methods over Python loops — loops are not GPU-parallelized
- Use RMM PoolMemoryResource for workloads with many small allocations
- Read files directly with cuDF's cuIO (GPU-accelerated) rather than pandas then converting
- Use categorical dtypes for low-cardinality string columns — reduces memory and speeds groupby
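The categorical-dtype principle can be demonstrated with the pandas-compatible API (the same `astype('category')` cast works in cuDF; the column contents here are made up for illustration):

```python
import pandas as pd

# A low-cardinality string column repeated many times
s = pd.Series(["low", "med", "high"] * 100_000)

# Categorical dtype stores the 3 unique strings once plus one integer code per row
cat = s.astype("category")

# The categorical version is far smaller than the raw string column
print(s.memory_usage(deep=True) > cat.memory_usage(deep=True))  # True
```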
cuDF vs pandas — API Comparison
| Operation | pandas | cuDF equivalent | Notes |
|---|---|---|---|
| Read CSV | pd.read_csv() | cudf.read_csv() | GPU-accelerated cuIO; much faster on large files |
| GroupBy | df.groupby().agg() | Same API | Hash-based parallel aggregation on GPU |
| Merge/Join | pd.merge() | cudf.merge() | GPU hash join; faster on large tables |
| Apply UDF | df.apply(func) | df.apply_rows(kernel) | cuDF UDF uses Numba CUDA JIT |
| Sort | df.sort_values() | Same API | GPU radix/merge sort |
| String ops | df.str.contains() | Same API | GPU-accelerated via cuDF strings module |
| Convert to CPU | N/A | df.to_pandas() | Triggers PCIe D2H transfer — use sparingly |
Core cuDF DataFrame Operations
These are the workhorses of GPU ETL — groupby aggregations, joins, and sorting are where cuDF delivers the most dramatic speedups over pandas.
GPU GroupBy Mechanics
- Hash-based implementation: cuDF builds a GPU hash table mapping group keys to row indices — massively parallel across all CUDA cores
- Supported aggregations: `sum`, `mean`, `count`, `min`, `max`, `std`, `var`, `first`, `last`, `nunique`
- Multi-column groupby: `df.groupby(['col_a','col_b']).agg({'col_c':'sum','col_d':'mean'})`
- Named aggregations: `df.groupby('key').agg(total=('value','sum'), avg=('value','mean'))`
- Speedup: 10–100× over pandas groupby on datasets >10M rows
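A minimal sketch of the named-aggregation pattern above, written with pandas so it runs on CPU (the identical call works in cuDF; column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"key": ["x", "x", "y"], "value": [10.0, 20.0, 5.0]})

# Named aggregations: one output column per (input column, agg function) pair
out = df.groupby("key").agg(total=("value", "sum"), avg=("value", "mean"))

print(out.loc["x", "total"], out.loc["x", "avg"])  # 30.0 15.0
```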
Joins & Merges
- Inner join: returns rows with matching keys in both DataFrames — most common
- Left join: all rows from left; matching rows from right; NaN for non-matches
- Outer join: all rows from both; NaN where no match
- GPU implementation: hash join — build hash table on smaller table, probe with larger table; embarrassingly parallel
- Multi-key joins: `cudf.merge(df1, df2, on=['key1','key2'], how='inner')`
- Broadcast join: when one table is very small — replicate it to all GPU threads
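The multi-key inner join can be sketched with the pandas-compatible `merge` call (same signature in cuDF; the toy tables are made up):

```python
import pandas as pd

left = pd.DataFrame({"key1": [1, 1, 2], "key2": ["a", "b", "a"], "lval": [10, 20, 30]})
right = pd.DataFrame({"key1": [1, 2], "key2": ["a", "a"], "rval": [100, 200]})

# Inner join on two keys: only rows whose (key1, key2) pair appears in both survive
merged = pd.merge(left, right, on=["key1", "key2"], how="inner")

print(len(merged))  # 2 — (1,'a') and (2,'a') match; (1,'b') has no partner
```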
Sorting
- `df.sort_values('col', ascending=False)` — GPU radix sort on numeric data
- Multi-column sort: `df.sort_values(['col_a','col_b'], ascending=[True,False])`
- Stable sort by default; GPU radix sort is O(n) for fixed-width types
- `df.nlargest(k, 'col')` / `df.nsmallest(k, 'col')` — top-K without a full sort
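The top-K shortcut works the same way in pandas and cuDF; a small sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c", "d", "e"], "score": [3, 9, 1, 7, 5]})

# Top-K without a full sort — cheaper than sort_values().head(k) on large data
top2 = df.nlargest(2, "score")

print(top2["score"].tolist())  # [9, 7]
```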
Filtering & Selection
- Boolean mask: `df[df['age'] > 30]` — creates a boolean array on GPU, then applies the mask
- Multiple conditions: `df[(df['a']>0) & (df['b']<100)]` — use `&` and `|`, not Python `and`/`or`
- `df.query("age > 30 and salary < 100000")` — string-based query (cuDF supports a subset of the syntax)
- `df.isin(['val1','val2'])` — membership test on GPU
- Column selection: `df[['col_a','col_b']]` — zero-copy view of the Arrow buffer
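Both filtering styles above, sketched with pandas (identical in cuDF; the columns and thresholds are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 40, 35], "salary": [50_000, 120_000, 80_000]})

# Element-wise boolean ops need & and |, with parentheses around each condition
masked = df[(df["age"] > 30) & (df["salary"] < 100_000)]

# query() expresses the same filter as a string
queried = df.query("age > 30 and salary < 100000")

print(masked["age"].tolist())  # [35]
```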
Window & Rolling Functions
- `df['col'].rolling(window=7).mean()` — sliding-window aggregation on GPU
- `df.groupby('key')['val'].cumsum()` — cumulative operations within groups
- Ranking: `df['col'].rank(method='average')`
- Shift: `df['col'].shift(1)` — lag/lead for time-series feature engineering
Data Cleaning & Transformations
Data rarely arrives clean. cuDF provides GPU-accelerated tools for handling missing values, type conversions, normalization, encoding, and custom transforms — all without leaving GPU memory.
Null Detection & Filling
- cuDF nulls: stored as a separate Arrow bitmask — distinct from NaN (which is a float value)
- `df.isnull()` / `df.notnull()` — GPU-parallel null detection
- `df['col'].fillna(value)` — fill with a scalar; `fillna(method='ffill')` — forward fill
- `df.dropna(subset=['col_a','col_b'])` — drop rows with nulls in the specified columns
- Imputation: `df['col'].fillna(df['col'].mean())` — mean imputation on GPU
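Mean imputation as listed above, sketched with pandas (identical in cuDF; the toy series is made up):

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])

# Mean imputation: mean() skips nulls by default, so mean([1, 3]) == 2
filled = s.fillna(s.mean())

print(filled.tolist())  # [1.0, 2.0, 3.0]
```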
Type Conversions & Casting
- `df['col'].astype('float32')` — cast a column to a new dtype; returns a new column, computed on GPU
- Common casts: `object` → `category` for low-cardinality strings; `int64` → `int32` to halve memory
- Categorical dtype: stores each unique string value once; each row stores an integer code — ideal for groupby
- `pd.to_datetime(df['ts'])` → `cudf.to_datetime(df['ts'])` — GPU datetime parsing
- Datetime extraction: `df['ts'].dt.year`, `.dt.month`, `.dt.dayofweek`, `.dt.hour`
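Datetime parsing and component extraction, sketched with pandas (cuDF's `.dt` accessor behaves the same; the timestamps are invented):

```python
import pandas as pd

ts = pd.to_datetime(pd.Series(["2024-01-15 08:30:00", "2024-06-01 17:00:00"]))

# Component extraction via the .dt accessor
print(ts.dt.year.tolist(), ts.dt.hour.tolist())  # [2024, 2024] [8, 17]
```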
Normalization & Scaling
- Min-max: `(df['col'] - df['col'].min()) / (df['col'].max() - df['col'].min())`
- Z-score: `(df['col'] - df['col'].mean()) / df['col'].std()`
- Or use `cuml.preprocessing.MinMaxScaler` / `StandardScaler` — cuML wrappers
- Log transform: `cupy.log1p(df['col'].values)` — use CuPy for element-wise math
- Clip outliers: `df['col'].clip(lower=0, upper=99)`
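The two scaling formulas above, written out with pandas (identical expressions in cuDF; the toy column is made up):

```python
import pandas as pd

col = pd.Series([0.0, 5.0, 10.0])

# Min-max scaling to [0, 1]
minmax = (col - col.min()) / (col.max() - col.min())

# Z-score standardization (.std() defaults to the sample std, ddof=1)
z = (col - col.mean()) / col.std()

print(minmax.tolist())  # [0.0, 0.5, 1.0]
print(z.tolist())       # [-1.0, 0.0, 1.0]
```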
Categorical Encoding
- Label encoding: `df['cat'].astype('category').cat.codes` — assigns an integer code per category
- One-hot encoding: `cudf.get_dummies(df, columns=['cat_col'])` — creates a binary column per category
- Target encoding: merge the target mean per category using groupby + merge
- Ordinal encoding: map categories to integers with `df['col'].map({'Low':0,'Med':1,'High':2})`
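Three of the encodings above side by side, sketched with pandas (`cudf.get_dummies` plays the role of `pd.get_dummies` on GPU; column name and categories are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"size": ["Low", "High", "Med", "Low"]})

codes = df["size"].astype("category").cat.codes            # label encoding (alphabetical codes)
onehot = pd.get_dummies(df, columns=["size"])              # one binary column per category
ordinal = df["size"].map({"Low": 0, "Med": 1, "High": 2})  # order-aware encoding

print(ordinal.tolist())  # [0, 2, 1, 0]
print(onehot.shape)      # (4, 3)
```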
Custom Transforms with Numba
- `apply_rows()`: applies a Numba-JIT function row-wise across the DataFrame on GPU
- Use `@numba.cuda.jit` for custom GPU kernels when cuDF built-ins aren't sufficient
- CuPy integration: `df['col'].values` returns a CuPy array — apply any CuPy math operation
- Avoid Python-level loops over DataFrame rows — they are not parallelized
Deduplication
- `df.drop_duplicates(subset=['id','timestamp'])` — GPU-parallel duplicate detection and removal
- `df.duplicated(subset=['col'], keep='first')` — returns a boolean mask of duplicate rows
- For approximate deduplication on large datasets: hash-based fingerprinting + groupby
Outlier Detection & Removal
- Z-score method: compute `(x - mean) / std`; flag rows where |z| > 3
- IQR method: Q1 = 25th percentile, Q3 = 75th; outlier if x < Q1 − 1.5×IQR or x > Q3 + 1.5×IQR
- `df['col'].quantile([0.25, 0.75])` — GPU quantile computation
- Statistical outlier methods scale easily to 100M+ rows on GPU
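The IQR fences worked through on a toy series with pandas (identical in cuDF; the obvious outlier 100 is planted for illustration):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 100])

# IQR fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
clean = s[(s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)]

print(clean.tolist())  # [1, 2, 3, 4, 5] — 100 falls outside the upper fence
```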
File I/O & String Operations
cuIO brings GPU acceleration to file reading and writing. The cuDF strings module handles text data — regex, splitting, tokenizing — entirely on the GPU.
Supported File Formats
- CSV: `cudf.read_csv('file.csv')` — multi-threaded GPU parsing; specify a `dtype` dict to avoid inference overhead
- Parquet: `cudf.read_parquet('file.parquet')` — columnar format, a natural fit for column-oriented cuDF; supports predicate pushdown
- ORC: `cudf.read_orc()` — compressed columnar; common in Hive/Spark ecosystems
- JSON: `cudf.read_json()` — line-delimited JSON (JSONL); GPU-accelerated parsing
- Avro: `cudf.read_avro()` — row-oriented; less common but supported
I/O Performance Tips
- Parquet > CSV for repeated reads — compressed, columnar, supports column pruning
- Predicate pushdown: `cudf.read_parquet('file.parquet', filters=[('col','>',100)])` — skips row groups at read time
- Column pruning: `cudf.read_csv('file.csv', usecols=['col_a','col_b'])` — read only the needed columns
- GDS (GPUDirect Storage): reads data directly from NVMe SSD into GPU memory — bypasses CPU RAM entirely; requires NVIDIA Magnum IO
- Write: `df.to_parquet('out.parquet')`, `df.to_csv('out.csv')`
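A column-pruning round trip sketched with pandas and CSV so it runs without a GPU or Parquet engine (the `usecols` parameter behaves the same in `cudf.read_csv`; file names and columns are made up):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"col_a": [1, 2, 3], "col_b": ["x", "y", "z"], "col_c": [0.1, 0.2, 0.3]})

path = os.path.join(tempfile.mkdtemp(), "out.csv")
df.to_csv(path, index=False)

# Column pruning: parse only the columns the pipeline needs
back = pd.read_csv(path, usecols=["col_a"])

print(list(back.columns), back["col_a"].tolist())  # ['col_a'] [1, 2, 3]
```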
cuDF Strings Module
- Access via the `df['text_col'].str` accessor — same as pandas
- `.str.contains('pattern', regex=True)` — GPU regex matching
- `.str.replace('old','new')` — GPU string replacement
- `.str.split(delimiter)` — returns a list-type column
- `.str.lower()` / `.str.upper()` / `.str.strip()`
- `.str.len()` — character count per string
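A few of the `.str` operations above, sketched with pandas (the accessor is API-compatible in cuDF; the strings are invented):

```python
import pandas as pd

s = pd.Series(["Hello World", "GPU rocks", "hello cudf"])

hits = s.str.contains("hello", regex=True)  # case-sensitive regex match
lengths = s.str.len()

print(hits.tolist())     # [False, False, True]
print(lengths.tolist())  # [11, 9, 10]
```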
String Encoding & Tokenization
- cuDF strings are stored as variable-length UTF-8 with a separate offsets array
- `.str.token_count(delimiter=' ')` — word count on GPU
- `.str.ngrams(n=2)` — character n-gram generation
- For NLP preprocessing at scale: use cuDF strings, then pass results to cuML or RAPIDS NLP tools
- Convert to category: `df['str_col'].astype('category')` — efficient for repeated values
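`token_count` is cuDF-specific, but the same word count can be sketched portably by combining split and length (the sentences are made up):

```python
import pandas as pd

s = pd.Series(["the quick fox", "hello world"])

# Word count by splitting on spaces; in cuDF, .str.token_count does this in one GPU pass
word_counts = s.str.split(" ").str.len()

print(word_counts.tolist())  # [3, 2]
```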
Synthetic Data Generation
- Use `cupy.random` for GPU-generated random numerical data
- `cupy.random.randn(n)` — standard normal distribution on GPU
- `cupy.random.randint(low, high, size=n)` — uniform integers
- Wrap in cuDF: `cudf.Series(cupy.random.randn(1_000_000))`
- Useful for testing pipeline performance before real data arrives
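A CPU-runnable sketch of the pattern above, with NumPy standing in for CuPy and pandas for cuDF (on a GPU machine the commented swaps apply; the seed is arbitrary):

```python
import numpy as np
import pandas as pd

# numpy stands in for cupy here so this runs on CPU; on a GPU:
#   data = cupy.random.randn(1_000_000); s = cudf.Series(data)
rng = np.random.default_rng(42)
data = rng.standard_normal(1_000_000)
s = pd.Series(data)

print(len(s))  # 1000000 synthetic standard-normal samples
```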
Memory Hooks
Lock in cuDF's key behaviors and tradeoffs before exam day.
- Nulls: cuDF stores nulls in an Arrow bitmask rather than as NaN — but `isnull()` to check and `fillna()` to fill work in both pandas and cuDF.
- Every `to_pandas()` call fires a PCIe D2H transfer. One call is fine; ten calls in a loop kill performance. Design pipelines to stay in cuDF from file read to model input — convert to pandas only for final output.
- `category` stores 10 unique strings once plus 100M integer codes. Groupby on integer codes is far faster than on raw strings.
- The `.str` accessor works identically in cuDF and pandas. On GPU, regex and string ops on 100M strings are orders of magnitude faster than in pandas. For best performance, cast repeated string values to `category` before groupby.
Flashcards & Advisor
- Predicate pushdown at read time: `cudf.read_parquet('file.parquet', filters=[('age', '>', 30)])` — dramatically reduces I/O on large files.
- pandas `apply(func)` calls a Python function row by row (serial). cuDF's `apply_rows(kernel)` takes a Numba CUDA JIT function that runs in parallel across all rows on the GPU — orders of magnitude faster on large DataFrames.
- IQR-based outlier removal: `q = df['col'].quantile([0.25, 0.75])` → compute the IQR → filter with a boolean mask. Fully GPU-parallelized on 100M+ rows.
Core DataFrame Operations
- GroupBy: hash-based parallel aggregation; multi-column groupby with agg dict; 10–100× over pandas
- Merge: hash join on GPU; build smaller table, probe with larger; inner join fastest
- Sort: GPU radix sort on numeric; multi-column sort with ascending list; nlargest/nsmallest for top-K
- Filter: boolean mask with & (not and) for multiple conditions; query() for string-based filtering
- Rolling: df['col'].rolling(7).mean() — sliding window on GPU; shift() for lag features
- Zero-copy interop: cuDF ↔ cuPy ↔ PyTorch via __cuda_array_interface__ — no data copy