NCA-ADS: Data Manipulation & Preparation

Topic Overview: Data Manipulation & Preparation

The heaviest domain on the NCA-ADS exam at 23%, this topic covers the full data preparation lifecycle — from raw data ingestion to model-ready features. Mastery here is essential for passing the exam.

NCA-ADS Exam Weight Distribution

Topic	Weight
Data Manipulation and Preparation	23%
Machine Learning With RAPIDS	16%
Data Science Pipelines & Workflow Automation	13%
Descriptive Analysis and Visualization	13%
Foundations of Accelerated Data Science	12%
Introductory MLOps Practices	10%
Advanced Data Structures	7%
Software and Environment Management	6%

About This Topic

At 23%, Data Manipulation and Preparation is the single largest domain on the NCA-ADS exam. It covers the complete data preparation lifecycle — reading data into GPU memory, cleaning and transforming it, joining multiple sources, engineering features, handling imbalanced targets, and writing optimized output formats. Every other domain in the exam assumes you can perform these operations efficiently with RAPIDS cuDF.

Sub-Topic Coverage on This Page

Sub-Topic 1: cuDF Basics & GPU DataFrames
Sub-Topic 2: Data Integration (Joins & Merges)
Sub-Topic 3: Data Cleaning & Quality
Sub-Topic 4: Feature Engineering (Numerical)
Sub-Topic 5: Feature Engineering (Categorical)
Sub-Topic 6: Class Imbalance Handling
Sub-Topic 7: Dimensionality Reduction & Sampling
Sub-Topic 8: Parquet, Storage & GPU ETL

Core Concepts

Nine detailed concept blocks covering every sub-topic of Data Manipulation and Preparation for the NCA-ADS exam.

1. cuDF Basics — GPU DataFrames

Creating DataFrames:

cudf.DataFrame({'col': [1,2,3]}) — from Python dict
cudf.read_csv('file.csv') — GPU-accelerated CSV ingestion via cuIO
cudf.read_parquet('file.parquet') — columnar format, fastest read

Selecting and Filtering:

df['col'] — select single column (returns Series)
df[['col1','col2']] — select multiple columns
df.iloc[0:5] — integer-location based row selection
df[df['col'] > 5] — boolean mask filtering

Basic Operations:

.head() / .tail() — preview first / last rows
.shape — tuple of (rows, columns)
.dtypes — data type of each column
.describe() — summary statistics (count, mean, std, min, max)

Null Handling:

.isnull() — boolean mask of null positions
.fillna(value) — replace nulls with a value
.dropna() — remove rows containing nulls
Key exam fact: cuDF tracks nulls with an Apache Arrow validity bitmask and displays them as <NA>, not NaN. This is a fundamental difference from pandas, which by default marks missing numeric values with a float NaN.
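To make the NA-versus-NaN distinction concrete, here is a minimal plain-Python sketch of an Arrow-style validity bitmask. It is a toy model rather than cuDF's implementation: the helper names `pack_validity` and `is_valid` are hypothetical, and Python's `None` stands in for a null slot.

```python
import math

def pack_validity(values):
    """Build a validity bitmask: bit i is 1 when slot i holds a value.

    Bits are packed least-significant-bit first, as in the Arrow format.
    """
    mask = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            mask[i // 8] |= 1 << (i % 8)
    return bytes(mask)

def is_valid(mask, i):
    """Return True when bit i of the validity bitmask is set."""
    return bool((mask[i // 8] >> (i % 8)) & 1)

# None models a null (NA); float("nan") is an ordinary, valid float value.
column = [1.5, None, float("nan"), 4.0]
validity = pack_validity(column)

null_count = sum(not is_valid(validity, i) for i in range(len(column)))
nan_count = sum(
    is_valid(validity, i) and math.isnan(column[i]) for i in range(len(column))
)
print(f"bitmask={validity[0]:04b} nulls={null_count} NaNs={nan_count}")
```

The null slot clears its bit, while the NaN slot keeps its bit set, which is why a null count and a NaN count can differ for the same column.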

2. Data Integration — Joins and Merges

Core syntax: cudf.merge(left, right, on='key', how='inner')

Inner join: only rows with matching keys in both DataFrames
Left join: all rows from left DataFrame; null (<NA>) where right has no match
Right join: all rows from right DataFrame; null (<NA>) where left has no match
Outer join: all rows from both DataFrames; null (<NA>) where no match on either side

Join Decision Table:

Scenario	Use
Keep only records with data in both tables	`how='inner'`
Keep all customers, add order data where available	`how='left'`
Keep all orders, add customer data where available	`how='right'`
Keep every record from both tables	`how='outer'`

Concatenation: cudf.concat([df1, df2]) — stacks DataFrames vertically (same schema)
Deduplication after merge: .drop_duplicates(subset=['col']) — removes duplicate rows that can appear when join keys are not unique
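The four join types can be compared on a tiny example. The sketch below is a plain-Python stand-in for the merge semantics, not the cuDF implementation: `toy_merge` and the customer/order rows are hypothetical, and `None` plays the role of a null.

```python
def toy_merge(left, right, on, how="inner"):
    """Toy join over lists of dicts; None stands in for a null."""
    left_cols = [c for c in left[0] if c != on]
    right_cols = [c for c in right[0] if c != on]
    rows = []
    matched_right = set()
    for l_row in left:
        hits = [j for j, r_row in enumerate(right) if r_row[on] == l_row[on]]
        matched_right.update(hits)
        if hits:
            # One output row per matching pair, so duplicate keys multiply rows.
            rows.extend({**l_row, **right[j]} for j in hits)
        elif how in ("left", "outer"):
            rows.append({**l_row, **{c: None for c in right_cols}})
    if how in ("right", "outer"):
        for j, r_row in enumerate(right):
            if j not in matched_right:
                rows.append({on: r_row[on], **{c: None for c in left_cols}, **r_row})
    return rows

customers = [
    {"cust_id": 1, "name": "Ann"},
    {"cust_id": 2, "name": "Bo"},
    {"cust_id": 3, "name": "Cy"},
]
orders = [
    {"cust_id": 1, "amount": 50},
    {"cust_id": 1, "amount": 20},  # second order for the same customer
    {"cust_id": 4, "amount": 75},  # order with no matching customer
]

row_counts = {
    how: len(toy_merge(customers, orders, on="cust_id", how=how))
    for how in ("inner", "left", "right", "outer")
}
print(row_counts)
```

Customer 1 appears twice in every result because the key is not unique on the order side, which is the situation the deduplication step is meant to handle.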

3. Data Cleaning and Quality

Identifying missing data:

df.isnull().sum() — count of nulls per column
df.notnull() — boolean mask of non-null positions

Imputation strategies:

df['col'].fillna(df['col'].mean()) — mean imputation for numerical columns
df['col'].fillna(df['col'].mode()[0]) — mode imputation for categorical columns
df['col'].ffill(): forward fill for time-series data (the older fillna(method='ffill') spelling is deprecated in recent pandas releases, whose API cuDF follows)

Outlier detection:

Z-score method: flag values beyond ±3 standard deviations from the mean
IQR method: flag values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR
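Both rules can be checked by hand on a small sample. The following is a minimal sketch using Python's standard `statistics` module in place of cuDF; the data values are made up for illustration, and the sample standard deviation is used, which is also the pandas-style default for `.std()`.

```python
import statistics

values = [10, 12, 11, 13, 12, 11, 10, 13, 12, 11,
          12, 10, 13, 11, 12, 11, 12, 13, 10, 95]

# Z-score rule: flag points more than 3 sample standard deviations from the mean.
mean = statistics.mean(values)
std = statistics.stdev(values)
z_outliers = [v for v in values if abs(v - mean) / std > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [v for v in values if v < lower or v > upper]

print("z-score outliers:", z_outliers)
print("IQR fences:", (lower, upper), "outliers:", iqr_outliers)
```

Because an extreme value inflates the mean and standard deviation it is measured against, the z-score rule can miss outliers in small samples; the quartile-based fences are less affected.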

Other cleaning operations:

Data type conversion: .astype('float32'), .astype('category')
String cleaning: .str.strip(), .str.lower(), .str.replace()
Removing duplicates: .drop_duplicates()
Data governance: identify and handle PII columns before modeling — mask, anonymize, or drop sensitive data (names, SSNs, email addresses) to comply with regulations

4. Feature Engineering — Numerical Variables

Scaling methods:

Min-Max normalization: scales to range 0–1. Formula: (x − min) / (max − min)
Standard scaling (Z-score): transforms to mean=0, std=1. Formula: (x − mean) / std
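The two formulas above can be written out directly. This is a plain-Python sketch for illustration only; `min_max_scale` and `standard_scale` are hypothetical helper names, and the population standard deviation is assumed for standard scaling (the convention used by scikit-learn's StandardScaler).

```python
import statistics

def min_max_scale(values):
    """(x - min) / (max - min); undefined for a constant column."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def standard_scale(values):
    """(x - mean) / std, using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]

values = [2.0, 4.0, 6.0, 8.0]
scaled = min_max_scale(values)
standardized = standard_scale(values)

print("min-max:", scaled)
print("standard:", standardized)
```

When computing the statistics yourself on a DataFrame column, check which divisor `.std()` uses: pandas-style APIs default to the sample standard deviation (ddof=1), which gives slightly different values from the population version shown here.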

When to scale:

Required for: distance-based algorithms (KNN, SVM), which compare numeric distances between points, and linear models and neural networks, whose regularization and gradient-based training are sensitive to feature ranges
Not required for: tree-based models (RandomForest, XGBoost) — trees split on thresholds, not distances

Other numerical transformations:

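Binning: cudf.cut() (equal-width bins) or quantile-based edges (equal-frequency bins, the pandas qcut() pattern); converts continuous to ordinal categorical. Confirm that your cuDF version provides a qcut() equivalent; if it does not, compute the edges with .quantile() and pass them to cudf.cut()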
Polynomial features: x², x×y — capture non-linear relationships between features
Log transform: cupy.log(df['col']) — normalizes right-skewed distributions (common for income, prices)
Interaction features: multiply or add two columns to capture joint effects between predictors
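The difference between equal-width and equal-frequency binning shows up clearly on skewed data. The sketch below uses only the standard library as a stand-in for the DataFrame binning calls; `bin_counts` is a hypothetical helper and the values are invented.

```python
import bisect
import statistics
from collections import Counter

def bin_counts(values, inner_edges, n_bins):
    """Count values per bin; bins are right-closed: (edge_i, edge_i+1]."""
    counts = Counter(bisect.bisect_left(inner_edges, v) for v in values)
    return [counts[i] for i in range(n_bins)]

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # right-skewed: one large value
n_bins = 4

# Equal-width: split the range [min, max] into bins of identical width.
lo, hi = min(values), max(values)
width = (hi - lo) / n_bins
equal_width_edges = [lo + width * i for i in range(1, n_bins)]

# Equal-frequency: use quantiles as edges so each bin holds a similar count.
equal_freq_edges = statistics.quantiles(values, n=n_bins, method="inclusive")

width_counts = bin_counts(values, equal_width_edges, n_bins)
freq_counts = bin_counts(values, equal_freq_edges, n_bins)
print("equal-width counts:    ", width_counts)
print("equal-frequency counts:", freq_counts)
```

With a single extreme value, equal-width bins leave almost everything in the first bin, while quantile edges keep the bins roughly balanced.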

5. Feature Engineering — Categorical Variables

One-hot encoding: creates one binary column per unique category value

Use for: nominal (unordered) categories; linear models; neural networks
Code: cudf.get_dummies(df, columns=['category_col'])
Caution: high-cardinality columns (hundreds+ unique values) create too many columns — memory and sparsity issues

Label encoding: assigns an integer to each unique category

Use for: ordinal (ordered) categories (Small < Medium < Large); tree-based models
Code: df['col'] = df['col'].astype('category').cat.codes
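Warning: implies ordering, so it is incorrect for nominal categories with linear models (false ordinality)
Caveat: .cat.codes follows the category order, which in pandas-style APIs defaults to sorted (alphabetical) order, so Small/Medium/Large would be coded Large=0, Medium=1, Small=2; declare the category order explicitly for ordinal data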
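A small plain-Python sketch of both encodings follows, including the effect of category order on label codes. It mimics the encoding logic rather than calling cuDF, and the sample values are invented.

```python
colors = ["red", "green", "blue", "green"]     # nominal: no natural order
sizes = ["Small", "Large", "Medium", "Small"]  # ordinal: Small < Medium < Large

# One-hot: one 0/1 column per distinct category.
color_levels = sorted(set(colors))
one_hot = [[int(c == level) for level in color_levels] for c in colors]

# Label encoding with the default sorted (alphabetical) category order.
default_levels = sorted(set(sizes))
default_codes = [default_levels.index(s) for s in sizes]

# Label encoding with an explicitly declared order.
ordered_levels = ["Small", "Medium", "Large"]
ordinal_codes = [ordered_levels.index(s) for s in sizes]

print("one-hot columns:", color_levels, one_hot)
print("default codes:  ", dict(zip(sizes, default_codes)))
print("ordinal codes:  ", dict(zip(sizes, ordinal_codes)))
```

The default codes rank Large below Small because the levels are sorted alphabetically; only the explicitly ordered version preserves the meaning of the sizes.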

cuDF category dtype:

Stores unique string values once in a dictionary + integer codes per row
Benefits: lower memory usage, faster groupby and sort (integer vs string comparison)
Create: df['col'].astype('category')

High-cardinality columns (thousands of unique values): use target encoding (encode by target mean) or hashing instead of one-hot to avoid dimensionality explosion.
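Target encoding replaces each category with the mean of the target for that category, so one column stays one column regardless of cardinality. Here is a minimal sketch in plain Python; `target_encode` is a hypothetical helper and the data is invented. In a real pipeline the means should be computed on training data only, otherwise the target leaks into the feature.

```python
from collections import defaultdict

def target_encode(categories, targets):
    """Replace each category with the mean target value of that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories], means

cities = ["A", "A", "B", "B", "B", "C"]
churned = [1, 0, 1, 1, 0, 1]

encoded, means = target_encode(cities, churned)
print(means)
print(encoded)
```

Categories with very few rows (such as "C" here) get noisy means, which is why practical implementations usually add smoothing toward the global mean.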

6. Class Imbalance — Handling Skewed Target Variables

The problem: when the target variable is highly imbalanced (e.g., 99% non-fraud, 1% fraud), a naive model achieves high accuracy by always predicting the majority class — while completely failing its real purpose.

Solutions:

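Oversampling minority (SMOTE): Synthetic Minority Over-sampling Technique; generates synthetic samples by interpolating between existing minority class examples. Usually generalizes better than random duplication, which only repeats existing points. SMOTE is provided by the imbalanced-learn library rather than by RAPIDS itself.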
Undersampling majority: randomly remove samples from the majority class to balance the dataset
Class weights: pass class_weight='balanced' to cuML models — the model is penalized more for misclassifying the minority class. Available in cuML LogisticRegression and SVC.
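The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration of the core step, not the imbalanced-learn implementation: `smote_sketch` is a hypothetical helper, the points are invented, and details such as neighbor search structures and per-class sampling ratios are omitted.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on segments between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority-class neighbors of x (excluding x itself).
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, neighbor)))
    return synthetic

minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
synthetic = smote_sketch(minority, n_new=5)
for point in synthetic:
    print(point)
```

Each synthetic point lies on the line segment between a real minority sample and one of its nearest minority neighbors, so the new samples stay inside the region the minority class already occupies.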

Evaluation metrics for imbalanced data:

Accuracy is misleading — a model predicting majority class 100% of the time achieves high accuracy
Use: Precision, Recall, F1-score, AUC-ROC to measure performance on the minority class
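A quick calculation shows why accuracy misleads here. The sketch below compares a model that always predicts the majority class with a hypothetical model that catches most fraud at the cost of some false alarms; the labels and counts are invented for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for the positive (minority) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# 1,000 transactions: 10 fraud (label 1), 990 non-fraud (label 0).
y_true = [1] * 10 + [0] * 990

# Naive model: always predicts the majority class.
always_majority = [0] * 1000

# Hypothetical model: finds 8 of 10 fraud cases, raises 12 false alarms.
fraud_model = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978

naive = classification_metrics(y_true, always_majority)
model = classification_metrics(y_true, fraud_model)
print("always-majority:", naive)
print("fraud model:    ", model)
```

The naive model wins on accuracy yet has zero recall, while the second model has slightly lower accuracy and is the only one that detects any fraud.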

7. Dimensionality Reduction and Data Sampling

Why reduce dimensions:

Curse of dimensionality: too many features degrade model performance and increase training time
Remove redundant or correlated features
Speed up downstream training

PCA (Principal Component Analysis) — linear:

Finds directions of maximum variance in the data
Code: cuml.PCA(n_components=50).fit_transform(X)
Choose n_components by explained variance ratio — aim for ~95% cumulative explained variance
Best for: feature compression before modeling, removing correlated features
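Choosing n_components from the explained variance ratio comes down to a cumulative sum. The sketch below assumes you already have the per-component ratios (scikit-learn-style estimators expose them as `explained_variance_ratio_` after fitting); the numbers here are made up, and `components_for_variance` is a hypothetical helper.

```python
from itertools import accumulate

def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Smallest number of leading components whose ratios sum to >= threshold."""
    for k, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative >= threshold:
            return k
    return len(explained_variance_ratio)

# Illustrative ratios, sorted from the first principal component onward.
ratios = [0.55, 0.25, 0.10, 0.06, 0.03, 0.01]

n_components = components_for_variance(ratios, threshold=0.95)
print("components needed for 95% variance:", n_components)
```

The remaining components together explain less than 5% of the variance and are the ones dropped by the compression.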

UMAP (Uniform Manifold Approximation and Projection) — non-linear:

Preserves local neighborhood structure and cluster patterns better than PCA
Code: cuml.UMAP(n_components=2).fit_transform(X)
cuML UMAP is dramatically faster than CPU umap-learn on large datasets
Best for: 2D/3D visualization of clusters and high-dimensional data exploration

Sampling:

Random sampling: df.sample(frac=0.1) — take 10% random sample for quick EDA
Stratified sampling: preserve class distribution in sample — critical for imbalanced datasets to avoid sampling bias
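Stratified sampling can be sketched as "sample the same fraction from every class". The version below is plain Python for illustration; `stratified_sample` is a hypothetical helper and the dataset is synthetic (5% positives).

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, label_of, frac, seed=0):
    """Draw the same fraction from each class so proportions are preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    sample = []
    for members in by_class.values():
        k = round(len(members) * frac)
        sample.extend(rng.sample(members, k))
    return sample

# 1,000 rows of (row_id, label): 50 positives, 950 negatives.
rows = [(i, 1 if i < 50 else 0) for i in range(1000)]

sample = stratified_sample(rows, label_of=lambda r: r[1], frac=0.1)
class_counts = Counter(label for _, label in sample)
print(len(sample), dict(class_counts))
```

A plain random 10% sample would contain 5 positives only on average and could easily draw fewer; sampling per class fixes the count.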

8. Efficient Storage — Parquet and Modern Formats

Parquet — columnar storage format (vs row-based CSV):

Column pruning: read only needed columns — cudf.read_parquet('file.parquet', columns=['col1','col2'])
Predicate pushdown: filter rows at read time without loading the full file, for example cudf.read_parquet('file.parquet', filters=[('col1', '>', 5)])
Compression: Snappy or ZSTD compression reduces file size dramatically
GPU-native: cudf.read_parquet() reads directly into GPU VRAM
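Column pruning and predicate pushdown both rely on how Parquet lays data out: values are stored column by column inside row groups, and each row group carries min/max statistics per column. The toy model below illustrates the idea in plain Python; it is not the Parquet format or a cuDF API, and `make_row_groups` and `read` are hypothetical names.

```python
def make_row_groups(rows, group_size):
    """Store rows column-by-column in fixed-size groups with min/max stats."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups

def read(groups, columns, filter_col, min_value):
    """Return the requested columns for rows where filter_col >= min_value."""
    out = {c: [] for c in columns}
    groups_read = 0
    for group in groups:
        # Predicate pushdown: skip the whole group using its statistics.
        if group["stats"][filter_col][1] < min_value:
            continue
        groups_read += 1
        keep = [i for i, v in enumerate(group["columns"][filter_col]) if v >= min_value]
        # Column pruning: only the requested columns are touched.
        for c in columns:
            out[c].extend(group["columns"][c][i] for i in keep)
    return out, groups_read

rows = [{"id": i, "amount": i * 10, "note": f"row {i}"} for i in range(12)]
groups = make_row_groups(rows, group_size=4)

result, groups_read = read(groups, columns=["id"], filter_col="amount", min_value=80)
print(result, f"row groups read: {groups_read} of {len(groups)}")
```

Two of the three row groups are skipped without inspecting their values, and the "note" column is never read at all; a row-based CSV reader has to parse every line to answer the same query.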

Format	Layout	Compression	Column Pruning	RAPIDS Support
Parquet	Columnar	Yes (Snappy/ZSTD)	Yes	Primary format
CSV	Row-based	No (by default)	No	Supported but slow
ORC	Columnar	Yes	Yes	Supported

Best practice: convert CSV sources to Parquet for repeated processing — dramatically faster reads and much lower I/O cost.

9. GPU-Accelerated ETL Workflows

ETL = Extract, Transform, Load

Extract: cudf.read_parquet() or cudf.read_csv() — loads data into GPU VRAM
Transform: all cuDF operations (filter, merge, fillna, groupby, etc.) execute on GPU
Load: df.to_parquet() writes results back to storage

Keeping the pipeline on GPU: avoid .to_pandas() mid-pipeline — it triggers a slow PCIe transfer from GPU VRAM to CPU RAM, then back again for the next GPU operation.

Scaling beyond a single GPU:

Dask: distributes cuDF operations across multiple GPUs or nodes — dask_cudf.read_parquet() returns a lazy Dask DataFrame that executes across a cluster
GPUDirect Storage: reads data from NVMe storage directly into GPU VRAM, bypassing the intermediate copy through CPU RAM; typically the lowest-overhead ingestion path for GPU workloads

Memory Hooks

Six memorable mnemonics and mental models to lock in the key exam concepts for Data Manipulation and Preparation.

Join Type Decision

"IROL: Inner=both match, Right=all right, Outer=all both, Left=all left"

When you see a join scenario on the exam, run through IROL: Inner keeps only matched rows. Right keeps all right-side rows. Outer keeps everything. Left keeps all left-side rows. The most common is LEFT — "keep all my records, add the related data where available."

When to Scale Features

"Distance needs scale, Trees don't care"

Algorithms that compute distances between points (KNN, SVM) require feature scaling, and gradient-trained or regularized models (logistic regression, neural networks) generally need it too, because unscaled features distort penalties and slow convergence. Tree-based models (RandomForest, XGBoost, decision trees) split on thresholds, so the absolute scale of a feature doesn't affect where the split goes.

Imbalance Fixes — OUW

"O=Oversample minority (SMOTE), U=Undersample majority, W=Weight classes"

Three tools for class imbalance: Oversample the minority with SMOTE (generate synthetic samples). Undersample the majority (remove samples). Weight classes with class_weight='balanced' in cuML. Also remember: never rely on accuracy alone as your metric; use Precision, Recall, F1, and AUC-ROC.

Parquet vs CSV

"Parquet is CAPS: Columnar, All-compressed, Pushdown-filtered, Selective columns"

CSV is row-based, uncompressed by default, and must be parsed in full even when only a few columns are needed. Parquet is Columnar storage, Almost always compressed (Snappy and ZSTD are common codecs; compression is configurable rather than mandatory), supports Predicate pushdown (filter at read time), and Selective column reading (column pruning). Convert CSVs to Parquet for any repeated processing workload.

Categorical Encoding Rule

"Nominal→One-hot, Ordinal→Label, High-cardinality→Target/Hash"

Nominal categories (no order: city, color, brand) get one-hot encoding. Ordinal categories (have order: Small < Medium < Large, Low/Med/High rating) get label encoding. High-cardinality nominals (500+ unique values) use target encoding or feature hashing to avoid dimensionality explosion.

PCA vs UMAP

"PCA = Linear compression (for modeling), UMAP = Non-linear visualization (for plotting)"

PCA is linear — great for compressing 200 features down to 50 for faster model training, and for removing correlated features. UMAP is non-linear — great for compressing to 2D/3D for scatter plot visualization where you want to see clusters. Choose by purpose: modeling vs visualization.

Study Resources

Official NVIDIA and RAPIDS documentation plus practice materials for the NCA-ADS exam.

RAPIDS cuDF API Reference: Complete API documentation for cuDF DataFrame operations, including all methods covered in this domain.
RAPIDS Getting Started: Official RAPIDS.ai getting started guide covering installation, first steps with cuDF, and key workflow patterns.
NCA-ADS Exam Guide: Official NVIDIA NCA-ADS certification page with exam objectives, format, and registration information.
cuML API Reference: Documentation for the GPU-accelerated, scikit-learn-compatible ML library; covers PCA, UMAP, scalers, and classifiers.
RAPIDS Notebooks Gallery: End-to-end example notebooks showing cuDF data manipulation, feature engineering, and model training workflows.
Apache Arrow Documentation: Explains the Arrow columnar format and bitmask null representation, the foundation of cuDF's memory model.
Parquet Format Specification: Deep dive into the Parquet columnar format, compression codecs, and predicate pushdown internals.
imbalanced-learn (SMOTE): Documentation for SMOTE and other oversampling/undersampling techniques for handling class imbalance.
