FlashGenius
NVIDIA Certified · NCA-ADS · Associate Level

NCA-ADS: Data Manipulation
& Preparation

Topic 2 of 8  |  23% of Exam — Largest Domain  |  Accelerated Data Science Associate

23% Weight
60 min Exam
70% Passing
50–60 Questions
Associate Level

Topic Overview: Data Manipulation & Preparation

The heaviest domain on the NCA-ADS exam at 23%, this topic covers the full data preparation lifecycle — from raw data ingestion to model-ready features. Mastery here is essential for passing the exam.

NCA-ADS Exam Weight Distribution

Topic  |  Weight
Data Manipulation and Preparation  |  23%
Machine Learning With RAPIDS  |  16%
Data Science Pipelines & Workflow Automation  |  13%
Descriptive Analysis and Visualization  |  13%
Foundations of Accelerated Data Science  |  12%
Introductory MLOps Practices  |  10%
Advanced Data Structures  |  7%
Software and Environment Management  |  6%

About This Topic

At 23%, Data Manipulation and Preparation is the single largest domain on the NCA-ADS exam. It covers the complete data preparation lifecycle — reading data into GPU memory, cleaning and transforming it, joining multiple sources, engineering features, handling imbalanced targets, and writing optimized output formats. Every other domain in the exam assumes you can perform these operations efficiently with RAPIDS cuDF.

Sub-Topic Coverage on This Page

Sub-Topic 1
cuDF Basics & GPU DataFrames
Sub-Topic 2
Data Integration — Joins & Merges
Sub-Topic 3
Data Cleaning & Quality
Sub-Topic 4
Feature Engineering — Numerical
Sub-Topic 5
Feature Engineering — Categorical
Sub-Topic 6
Class Imbalance Handling
Sub-Topic 7
Dimensionality Reduction & Sampling
Sub-Topic 8
Parquet, Storage & GPU ETL

Core Concepts

Nine detailed concept blocks covering every sub-topic of Data Manipulation and Preparation for the NCA-ADS exam.

1. cuDF Basics — GPU DataFrames

Creating DataFrames:

  • cudf.DataFrame({'col': [1,2,3]}) — from Python dict
  • cudf.read_csv('file.csv') — GPU-accelerated CSV ingestion via cuIO
  • cudf.read_parquet('file.parquet') — columnar format, fastest read

Selecting and Filtering:

  • df['col'] — select single column (returns Series)
  • df[['col1','col2']] — select multiple columns
  • df.iloc[0:5] — integer-location based row selection
  • df[df['col'] > 5] — boolean mask filtering

Basic Operations:

  • .head() / .tail() — preview first / last rows
  • .shape — tuple of (rows, columns)
  • .dtypes — data type of each column
  • .describe() — summary statistics (count, mean, std, min, max)

Null Handling:

  • .isnull() — boolean mask of null positions
  • .fillna(value) — replace nulls with a value
  • .dropna() — remove rows containing nulls
  • Key exam fact: cuDF nulls use Apache Arrow bitmask — represented as NA, not NaN. This is a fundamental difference from pandas float-NaN nulls.
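
The calls in this block can be tried together in a few lines. This is a minimal sketch: the column names and values are made up, and the pandas fallback import is an assumption added so the snippet also runs on a machine without a GPU. On a RAPIDS install the same lines execute on cuDF.

```python
try:
    import cudf as xdf      # GPU DataFrames when RAPIDS is installed
except ImportError:
    import pandas as xdf    # assumption: CPU fallback with the same API

# Hypothetical data with one missing price
df = xdf.DataFrame({"item": ["a", "b", "c", "d"],
                    "price": [10.0, None, 30.0, 40.0]})

null_counts = df.isnull().sum()        # nulls per column
expensive = df[df["price"] > 15]       # boolean mask; the null row is not selected
filled = df.fillna({"price": 0.0})     # replace nulls in one column
dropped = df.dropna()                  # drop rows that contain any null

print(df.shape, len(expensive), len(dropped))
```

Note that the original df is unchanged by fillna and dropna; both return new DataFrames.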

2. Data Integration — Joins and Merges

Core syntax: cudf.merge(left, right, on='key', how='inner')

  • Inner join: only rows with matching keys in both DataFrames
  • Left join: all rows from left DataFrame; null (<NA>) where right has no match
  • Right join: all rows from right DataFrame; null (<NA>) where left has no match
  • Outer join: all rows from both DataFrames; null (<NA>) where no match on either side

Join Decision Table:

Scenario  |  Use
Keep only records with data in both tables  |  how='inner'
Keep all customers, add order data where available  |  how='left'
Keep all orders, add customer data where available  |  how='right'
Keep every record from both tables  |  how='outer'

  • Concatenation: cudf.concat([df1, df2]) — stacks DataFrames vertically (same schema)
  • Deduplication after merge: .drop_duplicates(subset=['col']) — removes duplicate rows that can appear when join keys are not unique
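
To see how the four join types differ in row count, here is a small sketch with two hypothetical tables (customers and orders are invented for illustration). The pandas fallback import is an assumption so the example also runs without a GPU.

```python
try:
    import cudf as xdf      # GPU DataFrames when RAPIDS is installed
except ImportError:
    import pandas as xdf    # assumption: CPU fallback with the same API

# Customers 2 and 3 have no orders; order key 4 has no customer
customers = xdf.DataFrame({"cust_id": [1, 2, 3], "name": ["Ana", "Bo", "Cy"]})
orders = xdf.DataFrame({"cust_id": [1, 1, 4], "amount": [20.0, 35.0, 50.0]})

row_counts = {how: len(customers.merge(orders, on="cust_id", how=how))
              for how in ("inner", "left", "right", "outer")}
print(row_counts)   # {'inner': 2, 'left': 4, 'right': 3, 'outer': 5}

# Left join: customers without orders get a null amount
left = customers.merge(orders, on="cust_id", how="left")
unmatched = int(left["amount"].isnull().sum())

# Concatenate, then drop the exact duplicate rows this creates
stacked = xdf.concat([orders, orders], ignore_index=True)
deduped = stacked.drop_duplicates()
```

Customer 1 appears twice in the inner and left results because the key is not unique in orders, which is the situation the deduplication bullet warns about.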

3. Data Cleaning and Quality

Identifying missing data:

  • df.isnull().sum() — count of nulls per column
  • df.notnull() — boolean mask of non-null positions

Imputation strategies:

  • df['col'].fillna(df['col'].mean()) — mean imputation for numerical columns
  • df['col'].fillna(df['col'].mode()[0]) — mode imputation for categorical columns
  • df['col'].ffill(): forward fill for time-series data; this replaces the older fillna(method='ffill') spelling, which recent pandas and cuDF releases deprecate
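
The three imputation strategies can be compared side by side. This is a sketch on invented data; it uses the .ffill() method for forward fill, and the pandas fallback import is an assumption so it runs without a GPU.

```python
try:
    import cudf as xdf      # GPU DataFrames when RAPIDS is installed
except ImportError:
    import pandas as xdf    # assumption: CPU fallback with the same API

# Hypothetical readings with gaps in both columns
df = xdf.DataFrame({
    "temp": [20.0, None, None, 26.0],
    "city": ["Oslo", "Oslo", None, "Rome"],
})

# Mean imputation: nulls are skipped when computing the mean (23.0 here)
mean_filled = df["temp"].fillna(df["temp"].mean())

# Mode imputation: the most frequent non-null value ("Oslo")
mode_filled = df["city"].fillna(df["city"].mode()[0])

# Forward fill: carry the last observed value forward (20, 20, 20, 26)
ffilled = df["temp"].ffill()
```

Mean imputation and forward fill give different results on the same gaps, so the choice should follow from whether row order carries meaning, as it does in a time series.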

Outlier detection:

  • Z-score method: flag values beyond ±3 standard deviations from the mean
  • IQR method: flag values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR
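
Both outlier rules reduce to a boolean mask. The sketch below applies them to an invented series with one extreme value; the pandas fallback import is an assumption so it runs without a GPU.

```python
try:
    import cudf as xdf      # GPU DataFrames when RAPIDS is installed
except ImportError:
    import pandas as xdf    # assumption: CPU fallback with the same API

# Twenty ordinary values plus one extreme value (hypothetical data)
s = xdf.Series([10.0, 11.0, 12.0, 13.0] * 5 + [95.0])

# Z-score rule: more than 3 standard deviations from the mean
z = (s - s.mean()) / s.std()
z_outliers = int((z.abs() > 3).sum())

# IQR rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = int(((s < lower) | (s > upper)).sum())

print(z_outliers, iqr_outliers)   # 1 1
```

The extreme value also inflates the mean and standard deviation used by the z-score rule, which is one reason the IQR rule is often preferred on small or skewed samples.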

Other cleaning operations:

  • Data type conversion: .astype('float32'), .astype('category')
  • String cleaning: .str.strip(), .str.lower(), .str.replace()
  • Removing duplicates: .drop_duplicates()
  • Data governance: identify and handle PII columns before modeling — mask, anonymize, or drop sensitive data (names, SSNs, email addresses) to comply with regulations

4. Feature Engineering — Numerical Variables

Scaling methods:

  • Min-Max normalization: scales to range 0–1. Formula: (x − min) / (max − min)
  • Standard scaling (Z-score): transforms to mean=0, std=1. Formula: (x − mean) / std
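
Both formulas are one line of column arithmetic. This sketch applies them by hand to an invented age column rather than using a scaler class; the pandas fallback import is an assumption so it runs without a GPU.

```python
try:
    import cudf as xdf      # GPU DataFrames when RAPIDS is installed
except ImportError:
    import pandas as xdf    # assumption: CPU fallback with the same API

df = xdf.DataFrame({"age": [20.0, 30.0, 40.0, 60.0]})   # hypothetical column
col = df["age"]

# Min-Max: (x - min) / (max - min) gives 0.0, 0.25, 0.5, 1.0
df["age_minmax"] = (col - col.min()) / (col.max() - col.min())

# Z-score: (x - mean) / std; .std() here is the sample standard deviation
df["age_z"] = (col - col.mean()) / col.std()
```

Scaler classes in scikit-learn-style libraries divide by the population standard deviation instead, so their output differs slightly from this hand-rolled version on small samples.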

When to scale:

  • Required for: distance-based algorithms (KNN, SVM), linear models, neural networks — these algorithms use numeric distances between points
  • Not required for: tree-based models (RandomForest, XGBoost) — trees split on thresholds, not distances

Other numerical transformations:

  • Binning: cudf.cut() (equal-width bins) or cudf.qcut() (equal-frequency bins) — converts continuous to ordinal categorical
  • Polynomial features: x², x×y; these capture non-linear relationships between features
  • Log transform: cupy.log(df['col']) — normalizes right-skewed distributions (common for income, prices)
  • Interaction features: multiply or add two columns to capture joint effects between predictors
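
A short sketch of the log, polynomial, and interaction transforms on invented columns. It pairs cuDF with CuPy as in the bullet above and, as an assumption for machines without a GPU, falls back to pandas with NumPy.

```python
try:
    import cudf as xdf
    import cupy as xp        # GPU array library used for the log transform
except ImportError:
    import pandas as xdf     # assumption: CPU fallback with the same API
    import numpy as xp

# Hypothetical housing data; income grows by a factor of 10 per row
df = xdf.DataFrame({"income": [1000.0, 10000.0, 100000.0],
                    "rooms": [2.0, 3.0, 4.0],
                    "area": [50.0, 80.0, 120.0]})

df["log_income"] = xp.log(df["income"])        # compresses the right tail
df["rooms_sq"] = df["rooms"] ** 2              # polynomial feature
df["rooms_x_area"] = df["rooms"] * df["area"]  # interaction feature
```

After the log transform the three incomes are evenly spaced, which shows how a multiplicative spread becomes an additive one.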

5. Feature Engineering — Categorical Variables

One-hot encoding: creates one binary column per unique category value

  • Use for: nominal (unordered) categories; linear models; neural networks
  • Code: cudf.get_dummies(df, columns=['category_col'])
  • Caution: high-cardinality columns (hundreds+ unique values) create too many columns — memory and sparsity issues

Label encoding: assigns an integer to each unique category

  • Use for: ordinal (ordered) categories (Small < Medium < Large); tree-based models
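  • Code: df['col'] = df['col'].astype('category').cat.codes (codes follow the sorted category order, so supply an explicit ordered category list when the real order matters)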
  • Warning: implies ordering — incorrect for nominal categories with linear models (implies false ordinality)

cuDF category dtype:

  • Stores unique string values once in a dictionary + integer codes per row
  • Benefits: lower memory usage, faster groupby and sort (integer vs string comparison)
  • Create: df['col'].astype('category')

High-cardinality columns (thousands of unique values): use target encoding (encode by target mean) or hashing instead of one-hot to avoid dimensionality explosion.
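
The sketch below contrasts one-hot encoding with label encoding on invented columns. It also shows that default category codes follow sorted order, and uses an explicit ordered CategoricalDtype to get Small < Medium < Large. The pandas fallback import is an assumption so it runs without a GPU.

```python
try:
    import cudf as xdf      # GPU DataFrames when RAPIDS is installed
except ImportError:
    import pandas as xdf    # assumption: CPU fallback with the same API

df = xdf.DataFrame({"color": ["red", "blue", "red"],      # nominal
                    "size": ["Small", "Large", "Medium"]})  # ordinal

# One-hot: one indicator column per color value
one_hot = xdf.get_dummies(df, columns=["color"])

# Default codes use sorted categories: Large=0, Medium=1, Small=2
default_codes = df["size"].astype("category").cat.codes

# Explicit order so that Small=0, Medium=1, Large=2
size_dtype = xdf.CategoricalDtype(categories=["Small", "Medium", "Large"],
                                  ordered=True)
ordered_codes = df["size"].astype(size_dtype).cat.codes
```

The default codes would tell a model that Large < Medium < Small, which is why the explicit dtype matters for ordinal columns.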

6. Class Imbalance — Handling Skewed Target Variables

The problem: when the target variable is highly imbalanced (e.g., 99% non-fraud, 1% fraud), a naive model achieves high accuracy by always predicting the majority class — while completely failing its real purpose.

Solutions:

  • Oversampling minority (SMOTE): Synthetic Minority Oversampling Technique — generates synthetic samples by interpolating between existing minority class examples. Better than random duplication.
  • Undersampling majority: randomly remove samples from the majority class to balance the dataset
  • Class weights: pass class_weight='balanced' to cuML models — the model is penalized more for misclassifying the minority class. Available in cuML LogisticRegression and SVC.
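
The interpolation idea behind SMOTE fits in a few lines of plain Python. This is a simplified sketch for intuition only, not the imbalanced-learn implementation; the function name smote_like, its parameters, and the sample points are all hypothetical.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, by squared distance
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        other = rng.choice(neighbours)
        lam = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(base, other)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
print(len(new_points))   # 6
```

Every synthetic point lies on a line segment between two real minority points, so the new samples stay inside the region the minority class already occupies instead of duplicating rows exactly.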

Evaluation metrics for imbalanced data:

  • Accuracy is misleading — a model predicting majority class 100% of the time achieves high accuracy
  • Use: Precision, Recall, F1-score, AUC-ROC to measure performance on the minority class
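
A worked example makes the accuracy trap concrete. The counts are hypothetical: 10 fraud cases among 1,000 transactions, scored against a model that always predicts the majority class.

```python
y_true = [1] * 10 + [0] * 990    # 1 = fraud (minority class)
y_pred = [0] * 1000              # naive model: always predicts non-fraud

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)

print(accuracy, recall, f1)   # 0.99 0.0 0.0
```

The model scores 99% accuracy while catching none of the fraud cases, which is why recall and F1 are the numbers to watch.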

7. Dimensionality Reduction and Data Sampling

Why reduce dimensions:

  • Curse of dimensionality: too many features degrade model performance and increase training time
  • Remove redundant or correlated features
  • Speed up downstream training

PCA (Principal Component Analysis) — linear:

  • Finds directions of maximum variance in the data
  • Code: cuml.PCA(n_components=50).fit_transform(X)
  • Choose n_components by explained variance ratio — aim for ~95% cumulative explained variance
  • Best for: feature compression before modeling, removing correlated features
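
Choosing n_components from cumulative explained variance is simple arithmetic. The ratios below are hypothetical numbers, shaped like the explained_variance_ratio_ array that a fitted PCA model exposes; no GPU or cuML install is needed to follow the calculation.

```python
from itertools import accumulate

# Hypothetical per-component explained variance ratios (sum to 1.0)
ratios = [0.50, 0.25, 0.12, 0.06, 0.04, 0.03]
target = 0.95

# Running total of variance explained by the first k components
cumulative = list(accumulate(ratios))

# Smallest k whose cumulative explained variance reaches the target
n_components = next(i + 1 for i, c in enumerate(cumulative) if c >= target)
print(n_components)   # 5
```

Four components explain about 93% here, so the fifth is needed to cross the 95% line.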

UMAP (Uniform Manifold Approximation and Projection) — non-linear:

  • Preserves local neighborhood structure and cluster patterns better than PCA
  • Code: cuml.UMAP(n_components=2).fit_transform(X)
  • cuML UMAP is dramatically faster than CPU umap-learn on large datasets
  • Best for: 2D/3D visualization of clusters and high-dimensional data exploration

Sampling:

  • Random sampling: df.sample(frac=0.1) — take 10% random sample for quick EDA
  • Stratified sampling: preserve class distribution in sample — critical for imbalanced datasets to avoid sampling bias
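
One simple way to stratify is to sample each class separately and concatenate the pieces. The sketch below uses an invented 90/10 dataset and passes the class values explicitly; the pandas fallback import is an assumption so it runs without a GPU.

```python
try:
    import cudf as xdf      # GPU DataFrames when RAPIDS is installed
except ImportError:
    import pandas as xdf    # assumption: CPU fallback with the same API

# Hypothetical imbalanced data: 900 negatives, 100 positives
df = xdf.DataFrame({"label": [0] * 900 + [1] * 100,
                    "value": list(range(1000))})

# Plain random sample: the minority share can drift from 10%
random_sample = df.sample(frac=0.1, random_state=0)

# Stratified sample: take 10% of each class, then stack the pieces
parts = [df[df["label"] == cls].sample(frac=0.1, random_state=0)
         for cls in (0, 1)]
stratified = xdf.concat(parts)

minority_share = int((stratified["label"] == 1).sum()) / len(stratified)
print(len(stratified), minority_share)   # 100 0.1
```

The stratified sample keeps exactly the original 10% minority share, while the plain random sample only matches it on average.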

8. Efficient Storage — Parquet and Modern Formats

Parquet — columnar storage format (vs row-based CSV):

  • Column pruning: read only needed columns — cudf.read_parquet('file.parquet', columns=['col1','col2'])
  • Predicate pushdown: filter rows at read time without loading the full file
  • Compression: Snappy or ZSTD compression reduces file size dramatically
  • GPU-native: cudf.read_parquet() reads directly into GPU VRAM

Format  |  Layout  |  Compression  |  Column Pruning  |  RAPIDS Support
Parquet  |  Columnar  |  Yes (Snappy/ZSTD)  |  Yes  |  Primary format
CSV  |  Row-based  |  No (by default)  |  No  |  Supported but slow
ORC  |  Columnar  |  Yes  |  Yes  |  Supported

Best practice: convert CSV sources to Parquet for repeated processing — dramatically faster reads and much lower I/O cost.

9. GPU-Accelerated ETL Workflows

ETL = Extract, Transform, Load

  • Extract: cudf.read_parquet() or cudf.read_csv() — loads data into GPU VRAM
  • Transform: all cuDF operations (filter, merge, fillna, groupby, etc.) execute on GPU
  • Load: df.to_parquet() writes results back to storage

Keeping the pipeline on GPU: avoid .to_pandas() mid-pipeline — it triggers a slow PCIe transfer from GPU VRAM to CPU RAM, then back again for the next GPU operation.

Scaling beyond a single GPU:

  • Dask: distributes cuDF operations across multiple GPUs or nodes — dask_cudf.read_parquet() returns a lazy Dask DataFrame that executes across a cluster
  • GPUDirect Storage: reads NVMe data directly into GPU VRAM, bypassing CPU RAM entirely — the fastest possible data ingestion path for GPU workloads

Memory Hooks

Six memorable mnemonics and mental models to lock in the key exam concepts for Data Manipulation and Preparation.

Join Type Decision
"IROL: Inner=both match, Right=all right, Outer=all both, Left=all left"
When you see a join scenario on the exam, run through IROL: Inner keeps only matched rows. Right keeps all right-side rows. Outer keeps everything. Left keeps all left-side rows. The most common is LEFT: "keep all my records, add the related data where available."

When to Scale Features
"Distance needs scale, Trees don't care"
Algorithms that compute distances between points (KNN, SVM) and models fit on weighted sums of the inputs (logistic regression, neural networks) need feature scaling. Tree-based models (RandomForest, XGBoost, decision trees) split on thresholds, so the absolute scale of a feature doesn't affect where the split goes.

Imbalance Fixes: OUW
"O=Oversample minority (SMOTE), U=Undersample majority, W=Weight classes"
Three tools for class imbalance: Oversample the minority with SMOTE (generate synthetic samples). Undersample the majority (remove samples). Weight classes with class_weight='balanced' in cuML. Also remember not to rely on accuracy as your metric; use F1 and AUC-ROC.

Parquet vs CSV
"Parquet is CAPS: Columnar, All-compressed, Pushdown-filtered, Selective columns"
CSV is row-based, uncompressed by default, and must read all columns every time. Parquet is Columnar storage, typically compressed (Snappy/ZSTD), supports Predicate pushdown (filter at read time), and Selective column reading (column pruning). Convert CSVs to Parquet for any repeated processing workload.

Categorical Encoding Rule
"Nominal→One-hot, Ordinal→Label, High-cardinality→Target/Hash"
Nominal categories (no order: city, color, brand) get one-hot encoding. Ordinal categories (have order: Small < Medium < Large, Low/Med/High rating) get label encoding. High-cardinality nominals (500+ unique values) use target encoding or feature hashing to avoid dimensionality explosion.

PCA vs UMAP
"PCA = Linear compression (for modeling), UMAP = Non-linear visualization (for plotting)"
PCA is linear: useful for compressing 200 features down to 50 for faster model training, and for removing correlated features. UMAP is non-linear: useful for compressing to 2D/3D for scatter plot visualization where you want to see clusters. Choose by purpose: modeling vs visualization.


Study Resources

Official NVIDIA and RAPIDS documentation plus practice materials for the NCA-ADS exam.

  • RAPIDS cuDF API Reference: complete API documentation for cuDF DataFrame operations, including all methods covered in this domain.
  • RAPIDS Getting Started: official RAPIDS.ai getting started guide covering installation, first steps with cuDF, and key workflow patterns.
  • NCA-ADS Exam Guide: official NVIDIA NCA-ADS certification page with exam objectives, format, and registration information.
  • cuML API Reference: documentation for the GPU-accelerated, scikit-learn compatible ML library; covers PCA, UMAP, scalers, and classifiers.
  • RAPIDS Notebooks Gallery: end-to-end example notebooks showing cuDF data manipulation, feature engineering, and model training workflows.
  • Apache Arrow Documentation: the Arrow columnar format and bitmask null representation, which are the foundation of cuDF's memory model.
  • Parquet Format Specification: deep dive into the Parquet columnar format, compression codecs, and predicate pushdown internals.
  • imbalanced-learn (SMOTE): documentation for SMOTE and other oversampling/undersampling techniques for handling class imbalance.