Topic Overview: Data Manipulation & Preparation
The heaviest domain on the NCA-ADS exam at 23%, this topic covers the full data preparation lifecycle — from raw data ingestion to model-ready features. Mastery here is essential for passing the exam.
NCA-ADS Exam Weight Distribution
| Topic | Weight |
|---|---|
| Data Manipulation and Preparation | 23% |
| Machine Learning With RAPIDS | 16% |
| Data Science Pipelines & Workflow Automation | 13% |
| Descriptive Analysis and Visualization | 13% |
| Foundations of Accelerated Data Science | 12% |
| Introductory MLOps Practices | 10% |
| Advanced Data Structures | 7% |
| Software and Environment Management | 6% |
About This Topic
At 23%, Data Manipulation and Preparation is the single largest domain on the NCA-ADS exam. It covers the complete data preparation lifecycle — reading data into GPU memory, cleaning and transforming it, joining multiple sources, engineering features, handling imbalanced targets, and writing optimized output formats. Every other domain in the exam assumes you can perform these operations efficiently with RAPIDS cuDF.
Core Concepts
Nine detailed concept blocks covering every sub-topic of Data Manipulation and Preparation for the NCA-ADS exam.
1. cuDF Basics — GPU DataFrames
Creating DataFrames:
- cudf.DataFrame({'col': [1,2,3]}): from a Python dict
- cudf.read_csv('file.csv'): GPU-accelerated CSV ingestion via cuIO
- cudf.read_parquet('file.parquet'): columnar format, fastest read
Selecting and Filtering:
- df['col']: select single column (returns Series)
- df[['col1','col2']]: select multiple columns
- df.iloc[0:5]: integer-location based row selection
- df[df['col'] > 5]: boolean mask filtering
Basic Operations:
- .head() / .tail(): preview first / last rows
- .shape: tuple of (rows, columns)
- .dtypes: data type of each column
- .describe(): summary statistics (count, mean, std, min, max)
Null Handling:
- .isnull(): boolean mask of null positions
- .fillna(value): replace nulls with a value
- .dropna(): remove rows containing nulls
- Key exam fact: cuDF nulls use an Apache Arrow bitmask and are represented as NA, not NaN. This is a fundamental difference from pandas float-NaN nulls.
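The operations listed above can be sketched on a toy DataFrame. This is a minimal illustration, not a reference implementation: the data is hypothetical, and the `xdf` alias is an assumption that falls back to pandas when RAPIDS is not installed, so only calls shared by both libraries are used.

```python
# Assumption: fall back to pandas when cuDF is unavailable so the sketch
# runs on a CPU-only machine. Only calls common to both libraries are used.
try:
    import cudf as xdf      # GPU DataFrames
except ImportError:
    import pandas as xdf    # CPU stand-in for the same calls

# Hypothetical toy data; None marks a missing value.
df = xdf.DataFrame({
    "id": [1, 2, 3, 4],
    "score": [10.0, None, 30.0, 40.0],
})

print(df.shape)     # tuple of (rows, columns)
print(df.dtypes)    # data type of each column

high = df[df["score"] > 20]     # boolean mask filtering
first_two = df.iloc[0:2]        # integer-location based row selection

null_counts = df.isnull().sum() # number of nulls per column
filled = df.fillna(0.0)         # replace nulls with a value
dropped = df.dropna()           # remove rows containing nulls
```

When run with cuDF, the missing score is stored as a null tracked by the bitmask; with the pandas fallback it becomes a float NaN, which is the difference the key exam fact describes.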
2. Data Integration — Joins and Merges
Core syntax: cudf.merge(left, right, on='key', how='inner')
- Inner join: only rows with matching keys in both DataFrames
- Left join: all rows from left DataFrame; null (NA) where right has no match
- Right join: all rows from right DataFrame; null (NA) where left has no match
- Outer join: all rows from both DataFrames; null (NA) where no match on either side
Join Decision Table:
| Scenario | Use |
|---|---|
| Keep only records with data in both tables | how='inner' |
| Keep all customers, add order data where available | how='left' |
| Keep all orders, add customer data where available | how='right' |
| Keep every record from both tables | how='outer' |
- Concatenation: cudf.concat([df1, df2]) stacks DataFrames vertically (same schema)
- Deduplication after merge: .drop_duplicates(subset=['col']) removes duplicate rows that can appear when join keys are not unique
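To make the four join types concrete, here is a small sketch using two hypothetical tables (`customers` and `orders`). The `xdf` alias is an assumption that falls back to pandas when cuDF is not installed; the row counts in the comments follow from the toy data, where customer 1 has no orders, customer 3 has two, and order customer 4 is unknown.

```python
# Assumption: pandas stands in for cuDF when RAPIDS is not installed.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

# Hypothetical toy tables sharing the key column "cust_id".
customers = xdf.DataFrame({"cust_id": [1, 2, 3], "name": ["Ann", "Bo", "Cy"]})
orders = xdf.DataFrame({"cust_id": [2, 3, 3, 4],
                        "amount": [20.0, 30.0, 35.0, 40.0]})

inner = xdf.merge(customers, orders, on="cust_id", how="inner")  # 3 rows
left = xdf.merge(customers, orders, on="cust_id", how="left")    # 4 rows
right = xdf.merge(customers, orders, on="cust_id", how="right")  # 4 rows
outer = xdf.merge(customers, orders, on="cust_id", how="outer")  # 5 rows

# Vertical concatenation followed by deduplication on the key.
stacked = xdf.concat([customers, customers])             # 6 rows
deduped = stacked.drop_duplicates(subset=["cust_id"])    # 3 rows
```

Note that the left join returns four rows rather than three because customer 3 matches two orders; non-unique keys multiply rows, which is why deduplication is often needed after a merge.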
3. Data Cleaning and Quality
Identifying missing data:
- df.isnull().sum(): count of nulls per column
- df.notnull(): boolean mask of non-null positions
Imputation strategies:
- df['col'].fillna(df['col'].mean()): mean imputation for numerical columns
- df['col'].fillna(df['col'].mode()[0]): mode imputation for categorical columns
- df['col'].ffill(): forward fill for time-series data (the older fillna(method='ffill') spelling is deprecated in recent pandas and cuDF releases)
Outlier detection:
- Z-score method: flag values beyond ±3 standard deviations from the mean
- IQR method: flag values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR
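Both outlier rules can be sketched in plain Python to show the arithmetic. This is an illustration under stated assumptions: the function names and data are hypothetical, the z-score uses the population standard deviation, and quartiles use the "inclusive" method from the standard library; a DataFrame library may default to slightly different conventions.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population std
    if std == 0:
        return []
    return [v for v in values if abs((v - mean) / std) > threshold]

def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one obvious outlier (95).
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 10, 11, 13, 12, 11, 10, 12, 11, 95]

print(zscore_outliers(data))  # [95]
print(iqr_outliers(data))     # [95]
```

The IQR rule depends only on quartiles, so it is less affected by the outlier itself than the z-score, whose mean and standard deviation are both pulled toward extreme values.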
Other cleaning operations:
- Data type conversion: .astype('float32'), .astype('category')
- String cleaning: .str.strip(), .str.lower(), .str.replace()
- Removing duplicates: .drop_duplicates()
- Data governance: identify and handle PII columns before modeling; mask, anonymize, or drop sensitive data (names, SSNs, email addresses) to comply with regulations
4. Feature Engineering — Numerical Variables
Scaling methods:
- Min-Max normalization: scales to range 0–1. Formula: (x − min) / (max − min)
- Standard scaling (Z-score): transforms to mean=0, std=1. Formula: (x − mean) / std
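The two formulas above can be written out directly. The following is a plain-Python sketch with hypothetical function names and data; it assumes the population standard deviation and maps a constant column to 0.0 to avoid division by zero.

```python
import statistics

def min_max_scale(values):
    """(x - min) / (max - min), giving values in the range 0 to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    # Assumption: a constant column is mapped to 0.0 rather than dividing by zero.
    return [0.0 if span == 0 else (v - lo) / span for v in values]

def standard_scale(values):
    """(x - mean) / std, giving mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population std
    return [0.0 if std == 0 else (v - mean) / std for v in values]

ages = [20, 30, 40, 50, 60]  # hypothetical feature column

print(min_max_scale(ages))   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standard_scale(ages))  # symmetric around 0.0
```

On a DataFrame column the same expressions apply element-wise, for example (df['col'] - df['col'].min()) / (df['col'].max() - df['col'].min()). Be aware that .std() on pandas and cuDF Series defaults to the sample standard deviation, so results can differ slightly from the population version used here.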
When to scale:
- Needed for: distance-based algorithms (KNN, SVM), linear models, neural networks. These either compute distances between points or are trained with gradient-based or regularized optimization, and both are sensitive to feature scale
- Not required for: tree-based models (RandomForest, XGBoost) — trees split on thresholds, not distances
Other numerical transformations:
- Binning: cudf.cut() (equal-width bins) or quantile-based bin edges (equal-frequency bins, the qcut() approach in pandas); converts continuous to ordinal categorical
- Polynomial features: x², x×y; capture non-linear relationships between features
- Log transform: cupy.log(df['col']) compresses right-skewed distributions (common for income, prices); values must be strictly positive
- Interaction features: multiply or add two columns to capture joint effects between predictors
5. Feature Engineering — Categorical Variables
One-hot encoding: creates one binary column per unique category value
- Use for: nominal (unordered) categories; linear models; neural networks
- Code: cudf.get_dummies(df, columns=['category_col'])
- Caution: high-cardinality columns (hundreds+ unique values) create too many columns, causing memory and sparsity issues
Label encoding: assigns an integer to each unique category
- Use for: ordinal (ordered) categories (Small < Medium < Large); tree-based models
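- Code: df['col'] = df['col'].astype('category').cat.codes
- Warning: integer codes imply an ordering, which is incorrect for nominal categories in linear models (false ordinality). By default the codes follow sorted category order, so for truly ordinal data define the order explicitly (Small, Medium, Large) rather than relying on the default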
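The difference between the two encodings, and the ordering pitfall of label encoding, can be shown in a few lines of plain Python. This is a conceptual sketch only: `one_hot` and `ordinal_encode` are hypothetical helpers, not cuDF functions, and the `prefix_value` column naming is an assumption chosen for readability.

```python
def one_hot(values, prefix):
    """One binary column per unique category value."""
    categories = sorted(set(values))
    return {f"{prefix}_{c}": [1 if v == c else 0 for v in values]
            for c in categories}

def ordinal_encode(values, order):
    """Map each category to its position in an explicit order."""
    codes = {category: i for i, category in enumerate(order)}
    return [codes[v] for v in values]

sizes = ["Medium", "Small", "Large", "Small"]  # hypothetical column

print(one_hot(sizes, "size"))
# {'size_Large': [0, 0, 1, 0], 'size_Medium': [1, 0, 0, 0], 'size_Small': [0, 1, 0, 1]}

# Sorted (alphabetical) order: Large=0, Medium=1, Small=2
print(ordinal_encode(sizes, sorted(set(sizes))))            # [1, 2, 0, 2]

# Explicit semantic order: Small=0, Medium=1, Large=2
print(ordinal_encode(sizes, ["Small", "Medium", "Large"]))  # [1, 0, 2, 0]
```

The alphabetical codes place Large below Small, which is the wrong ordering for any model that treats the codes as magnitudes; passing the order explicitly avoids this.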
cuDF category dtype:
- Stores unique string values once in a dictionary + integer codes per row
- Benefits: lower memory usage, faster groupby and sort (integer vs string comparison)
- Create: df['col'].astype('category')
High-cardinality columns (thousands of unique values): use target encoding (encode by target mean) or hashing instead of one-hot to avoid dimensionality explosion.
6. Class Imbalance — Handling Skewed Target Variables
The problem: when the target variable is highly imbalanced (e.g., 99% non-fraud, 1% fraud), a naive model achieves high accuracy by always predicting the majority class — while completely failing its real purpose.
Solutions:
- Oversampling minority (SMOTE): Synthetic Minority Oversampling Technique — generates synthetic samples by interpolating between existing minority class examples. Better than random duplication.
- Undersampling majority: randomly remove samples from the majority class to balance the dataset
- Class weights: pass class_weight='balanced' to cuML models so the model is penalized more for misclassifying the minority class. Available in cuML LogisticRegression and SVC.
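The interpolation idea behind SMOTE can be sketched as follows. This is a deliberately simplified illustration, not the full algorithm: real SMOTE interpolates toward one of the k nearest minority neighbors, whereas this hypothetical helper picks a random pair of minority points. The function name, seed, and data are all assumptions.

```python
import random

def interpolate_minority(minority, n_new, seed=0):
    """Create synthetic points on line segments between minority samples.

    Simplification: pairs are chosen at random instead of from the
    k nearest neighbors, as real SMOTE would do.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)  # two distinct minority points
        t = rng.random()                # position along the segment, in [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

# Hypothetical minority-class (fraud) samples with two features each.
fraud = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0)]
new_points = interpolate_minority(fraud, n_new=5)
print(len(new_points))  # 5
```

Because every synthetic point lies between two real minority samples, the new data stays inside the region the minority class already occupies, which is why this tends to work better than duplicating rows.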
Evaluation metrics for imbalanced data:
- Accuracy is misleading — a model predicting majority class 100% of the time achieves high accuracy
- Use: Precision, Recall, F1-score, AUC-ROC to measure performance on the minority class
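The 99% / 1% scenario described earlier can be worked through numerically. The sketch below uses a hypothetical helper, `binary_metrics`, and defines a metric as 0.0 when its denominator is zero (an assumption; libraries differ on this convention).

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary target."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    # Assumption: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}

y_true = [0] * 99 + [1]      # 99 non-fraud, 1 fraud
always_majority = [0] * 100  # naive model: always predict non-fraud

print(binary_metrics(y_true, always_majority))
# {'accuracy': 0.99, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
```

The naive model scores 99% accuracy while catching none of the fraud cases, which is exactly what recall and F1 expose.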
7. Dimensionality Reduction and Data Sampling
Why reduce dimensions:
- Curse of dimensionality: too many features degrade model performance and increase training time
- Remove redundant or correlated features
- Speed up downstream training
PCA (Principal Component Analysis) — linear:
- Finds directions of maximum variance in the data
- Code: cuml.PCA(n_components=50).fit_transform(X)
- Choose n_components by explained variance ratio: aim for ~95% cumulative explained variance
- Best for: feature compression before modeling, removing correlated features
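Choosing n_components from the cumulative explained variance can be sketched as below. The helper name and the ratio values are hypothetical; in practice the ratios would be assumed to come from a fitted PCA model's explained_variance_ratio_ attribute, as in scikit-learn-style APIs.

```python
from itertools import accumulate

def components_for_variance(explained_variance_ratio, target=0.95):
    """Smallest number of components whose cumulative ratio reaches `target`."""
    cumulative = accumulate(explained_variance_ratio)
    for k, total in enumerate(cumulative, start=1):
        if total >= target:
            return k
    # Target not reachable: keep every component.
    return len(explained_variance_ratio)

# Hypothetical per-component ratios, sorted from largest to smallest.
ratios = [0.50, 0.25, 0.125, 0.0625, 0.03125, 0.03125]

print(components_for_variance(ratios))               # 5
print(components_for_variance(ratios, target=0.75))  # 2
```

Here the first four components explain 93.75% of the variance, so a fifth is needed to cross the 95% threshold.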
UMAP (Uniform Manifold Approximation and Projection) — non-linear:
- Preserves local neighborhood structure and cluster patterns better than PCA
- Code: cuml.UMAP(n_components=2).fit_transform(X)
- cuML UMAP is dramatically faster than CPU umap-learn on large datasets
- Best for: 2D/3D visualization of clusters and high-dimensional data exploration
Sampling:
- Random sampling: df.sample(frac=0.1) takes a 10% random sample for quick EDA
- Stratified sampling: preserve class distribution in the sample; critical for imbalanced datasets to avoid sampling bias
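Stratified sampling amounts to sampling each class separately at the same fraction. The plain-Python sketch below illustrates the idea; `stratified_sample` is a hypothetical helper, the data is synthetic, and keeping at least one row per class is an assumption made so that rare classes never vanish from the sample.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_of, frac, seed=0):
    """Sample `frac` of the rows from each class independently."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    sample = []
    for members in groups.values():
        # Assumption: keep at least one row per class.
        k = max(1, round(len(members) * frac))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical dataset: 1000 rows of (id, label), 1% positive.
rows = [(i, 1 if i < 10 else 0) for i in range(1000)]
sample = stratified_sample(rows, label_of=lambda r: r[1], frac=0.1)

print(len(sample))                # 100
print(sum(r[1] for r in sample))  # 1 positive, so the 1% rate is preserved
```

A plain random 10% sample of the same data would contain zero positives in a substantial share of draws, which is the sampling bias stratification avoids.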
8. Efficient Storage — Parquet and Modern Formats
Parquet — columnar storage format (vs row-based CSV):
- Column pruning: read only needed columns, e.g. cudf.read_parquet('file.parquet', columns=['col1','col2'])
- Predicate pushdown: filter rows at read time without loading the full file
- Compression: Snappy or ZSTD compression reduces file size dramatically
- GPU-native: cudf.read_parquet() reads directly into GPU VRAM
| Format | Layout | Compression | Column Pruning | RAPIDS Support |
|---|---|---|---|---|
| Parquet | Columnar | Yes (Snappy/ZSTD) | Yes | Primary format |
| CSV | Row-based | No (by default) | No | Supported but slow |
| ORC | Columnar | Yes | Yes | Supported |
Best practice: convert CSV sources to Parquet for repeated processing — dramatically faster reads and much lower I/O cost.
9. GPU-Accelerated ETL Workflows
ETL = Extract, Transform, Load
- Extract: cudf.read_parquet() or cudf.read_csv() loads data into GPU VRAM
- Transform: all cuDF operations (filter, merge, fillna, groupby, etc.) execute on GPU
- Load: df.to_parquet() writes results back to storage
Keeping the pipeline on GPU: avoid .to_pandas() mid-pipeline — it triggers a slow PCIe transfer from GPU VRAM to CPU RAM, then back again for the next GPU operation.
Scaling beyond a single GPU:
- Dask: distributes cuDF operations across multiple GPUs or nodes; dask_cudf.read_parquet() returns a lazy Dask DataFrame that executes across a cluster
- GPUDirect Storage: reads NVMe data directly into GPU VRAM, bypassing CPU RAM entirely; the fastest possible data ingestion path for GPU workloads
Study Resources
Official NVIDIA and RAPIDS documentation plus practice materials for the NCA-ADS exam.
- RAPIDS cuDF API Reference: complete API documentation for cuDF DataFrame operations, including all methods covered in this domain.
- RAPIDS Getting Started: official RAPIDS.ai getting started guide covering installation, first steps with cuDF, and key workflow patterns.
- NCA-ADS Exam Guide: official NVIDIA NCA-ADS certification page with exam objectives, format, and registration information.
- cuML API Reference: documentation for the GPU-accelerated, scikit-learn compatible ML library, covering PCA, UMAP, scalers, and classifiers.
- RAPIDS Notebooks Gallery: end-to-end example notebooks showing cuDF data manipulation, feature engineering, and model training workflows.
- Apache Arrow Documentation: the Arrow columnar format and bitmask null representation, the foundation of cuDF's memory model.
- Parquet Format Specification: deep dive into the Parquet columnar format, compression codecs, and predicate pushdown internals.
- imbalanced-learn (SMOTE): documentation for SMOTE and other oversampling/undersampling techniques for handling class imbalance.