Topic 3 of 8 | 13% of Exam | Accelerated Data Science Associate
NCA-ADS exam structure, topic weights, and what you'll master on this page.
| Topic | Weight |
|---|---|
| Data Manipulation and Preparation | 23% |
| Machine Learning With RAPIDS | 16% |
| Data Science Pipelines & Workflow Automation | 13% |
| Descriptive Analysis and Visualization | 13% |
| Foundations of Accelerated Data Science | 12% |
| Introductory MLOps Practices | 10% |
| Advanced Data Structures | 7% |
| Software and Environment Management | 6% |
Topic covered on this page: Data Science Pipelines & Workflow Automation (13% exam weight)
Pipeline questions test your understanding of how steps connect — not just individual APIs. Expect scenario questions about data leakage, reproducibility, Dask scaling decisions, and which augmentation technique fits which problem.
Standard stage order, cuML Pipeline object, and what each step does
Why split-first matters, how Pipeline objects prevent leakage by design
Feature selection methods, importance ranking, transformation rules
L1/L2 regularization, cross-validation, early stopping, and pipeline fixes
SMOTE, noise injection, feature interaction — when and why to augment
Lazy evaluation, multi-GPU distributed processing, .compute() trigger
Seed, version, log, document — the four pillars of reproducible pipelines
Common bugs, type mismatches, memory overflow strategies, optimization tips
Detailed concept blocks covering all pipeline and workflow automation topics for NCA-ADS.
A pipeline chains all data science steps into a reproducible, automated workflow. Each stage's output becomes the next stage's input — no manual hand-offs.
cudf.read_parquet() / cudf.read_csv()
fillna(), drop_duplicates(), type conversion
cuml.model_selection.train_test_split()
cuml or XGBoost .fit()
pickle / model.save_model()
from cuml.pipeline import Pipeline
from cuml.preprocessing import StandardScaler
from cuml.linear_model import LogisticRegression
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
pipe.fit(X_train, y_train)  # scaler fits+transforms, then clf trains
pipe.predict(X_test)  # scaler transforms only, then clf predicts
The Pipeline object chains preprocessing + model into one object. It automatically prevents data leakage by ensuring X_test is only transformed (never fit) at predict time.
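To make the fit-versus-predict behaviour concrete, here is a plain-Python sketch of what a pipeline object does internally. `MiniPipeline`, `MeanCenterer`, and `SignClassifier` are hypothetical stand-ins written for illustration, not cuML classes; the sketch only assumes the scikit-learn-style fit/transform/predict contract that the snippet above relies on.

```python
class MeanCenterer:
    """Toy transformer: learns the mean in fit(), subtracts it in transform()."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        self.fit_calls = getattr(self, "fit_calls", 0) + 1
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class SignClassifier:
    """Toy model: predicts 1 for positive inputs, 0 otherwise."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [1 if x > 0 else 0 for x in X]


class MiniPipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, object); the last one is the model

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)  # transformers: fit, then transform
        self.steps[-1][1].fit(X, y)          # final estimator: fit only
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)            # transformers: transform only
        return self.steps[-1][1].predict(X)


X_train, y_train = [1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]
X_test = [0.0, 10.0]

scaler = MeanCenterer()
pipe = MiniPipeline([("scaler", scaler), ("clf", SignClassifier())])
pipe.fit(X_train, y_train)
preds = pipe.predict(X_test)
```

After `predict`, the transformer has still been fitted exactly once, on the training rows; the test rows only ever pass through `transform`.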
Use model.feature_importances_ (XGBoost) to rank and drop weak features.
Always scaler.fit(X_train), then scaler.transform(X_train) and scaler.transform(X_test).
Never fit the scaler on X_test or the full dataset before splitting — that causes data leakage.
| Leakage Type | Example | Prevention |
|---|---|---|
| Scaler fitted on full dataset | StandardScaler().fit(X_all) before split | Use Pipeline — fit only on X_train |
| Encoder fitted on full dataset | LabelEncoder on full target before split | Split first, encode train only |
| Target leakage | Feature derived from target included in X | Review feature list before training |
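The first row of the table can be illustrated with a small numeric sketch. It uses only the Python standard library and a made-up feature column, so the numbers are illustrative assumptions rather than exam data; `fit_scaler` and `transform` are hypothetical helpers standing in for a real scaler.

```python
import statistics


def fit_scaler(values):
    """Return (mean, std) learned from the given values only."""
    return statistics.fmean(values), statistics.pstdev(values)


def transform(values, mean, std):
    """Standardize values with previously learned statistics."""
    return [(v - mean) / std for v in values]


# Hypothetical feature column; the last two rows are the held-out test set.
X_all = [1.0, 2.0, 3.0, 4.0, 50.0, 60.0]
X_train, X_test = X_all[:4], X_all[4:]

# Leaky: statistics computed before the split include the test rows.
leaky_mean, leaky_std = fit_scaler(X_all)

# Correct: statistics come from the training rows only.
train_mean, train_std = fit_scaler(X_train)
X_train_scaled = transform(X_train, train_mean, train_std)
X_test_scaled = transform(X_test, train_mean, train_std)

print(f"leaky mean={leaky_mean:.2f}, train-only mean={train_mean:.2f}")
```

The leaky mean is pulled toward the test rows, which is exactly the information a model should not see before evaluation.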
| Technique | What it Does | Best For |
|---|---|---|
| SMOTE | Interpolates between existing minority class samples to create synthetic new ones | Class imbalance (fraud, rare disease) |
| Noise Injection | Adds small random noise to numerical features | Making model robust to measurement noise |
| Feature Interaction | Creates new features as products of existing ones (A x B) | Capturing non-linear relationships |
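The three techniques in the table can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: real SMOTE picks the second point from the k nearest minority-class neighbours, whereas here the neighbour is passed in directly, and all function names are hypothetical.

```python
import random


def smote_like_sample(a, b, rng):
    """Synthetic point on the line segment between minority samples a and b."""
    gap = rng.random()  # interpolation factor in [0, 1)
    return [ai + gap * (bi - ai) for ai, bi in zip(a, b)]


def inject_noise(row, rng, scale=0.01):
    """Add small Gaussian noise to every numerical feature."""
    return [x + rng.gauss(0.0, scale) for x in row]


def add_interaction(row, i, j):
    """Append the product of features i and j as a new feature."""
    return row + [row[i] * row[j]]


rng = random.Random(42)  # seeded so the sketch is repeatable
minority_a, minority_b = [1.0, 10.0], [3.0, 20.0]

synthetic = smote_like_sample(minority_a, minority_b, rng)
noisy = inject_noise(minority_a, rng)
with_interaction = add_interaction(minority_a, 0, 1)
```

Whatever the interpolation factor, the synthetic point stays between the two original minority samples, which is why SMOTE adds plausible rows rather than arbitrary ones.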
import cudf
# Horizontal: more rows (same columns, different time periods)
combined = cudf.concat([df_jan, df_feb, df_mar])
# Vertical: more columns (join related tables by key)
enriched = cudf.merge(customers_df, transactions_df, on='customer_id')
After integration: always check for new nulls, duplicate keys, and schema mismatches before proceeding to model training.
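The three post-integration checks can be sketched on plain Python dictionaries so the logic is visible without a GPU. The helper names and the `merged` sample are hypothetical, invented for illustration; in a real pipeline the same questions would be asked of the cuDF DataFrame.

```python
from collections import Counter


def null_counts(rows):
    """Count None values per column across a list of row dicts."""
    counts = Counter()
    for row in rows:
        for col, val in row.items():
            if val is None:
                counts[col] += 1
    return dict(counts)


def duplicate_keys(rows, key):
    """Return key values that appear on more than one row."""
    seen = Counter(row[key] for row in rows)
    return sorted(k for k, n in seen.items() if n > 1)


def schema_mismatches(rows, expected_columns):
    """Return indices of rows whose columns differ from the expected set."""
    expected = set(expected_columns)
    return [i for i, row in enumerate(rows) if set(row) != expected]


# Hypothetical merged output: customer 2 matched two transactions,
# customer 3 had no match, and one row carries an unexpected extra column.
merged = [
    {"customer_id": 1, "amount": 9.5},
    {"customer_id": 2, "amount": 3.0},
    {"customer_id": 2, "amount": 7.25},
    {"customer_id": 3, "amount": None},
    {"customer_id": 4, "amount": 1.0, "region": "EU"},
]

report = {
    "nulls": null_counts(merged),
    "duplicate_keys": duplicate_keys(merged, "customer_id"),
    "schema_mismatch_rows": schema_mismatches(merged, ["customer_id", "amount"]),
}
```

A non-empty entry in `report` is a signal to stop and fix the join before any model training starts.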
When a dataset exceeds a single GPU's VRAM, it cannot be processed in one piece on that GPU. Dask enables multi-GPU and distributed processing by partitioning the data across multiple GPUs.
import dask_cudf
# Reads partitioned Parquet files lazily across multiple GPUs
ddf = dask_cudf.read_parquet('data/*.parquet')
# Operations build a task graph (nothing executes yet)
result = ddf.groupby('customer_id').agg({'amount': 'sum'})
# .compute() triggers actual computation on all GPUs
final = result.compute()
Dask builds a task graph when you write transformations — nothing executes until you call .compute(). This enables Dask to optimize the entire computation before executing it across GPUs.
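Lazy evaluation is easier to remember after seeing a toy version of it. The `Lazy` class below is a hypothetical, single-process sketch of deferred execution only; it is not how Dask is implemented, and it leaves out the partitioning, scheduling, and graph optimization that Dask adds.

```python
class Lazy:
    """Toy deferred computation: records work now, runs it on compute()."""

    def __init__(self, func, *deps):
        self.func = func
        self.deps = deps

    def map(self, func):
        # Adds a node to the graph; nothing is executed here.
        return Lazy(func, self)

    def compute(self):
        # Walk the graph: resolve dependencies first, then run this node.
        args = [dep.compute() for dep in self.deps]
        return self.func(*args)


executed = []  # records which steps have really run


def load():
    executed.append("load")
    return [3, 1, 2]


def double(xs):
    executed.append("double")
    return [x * 2 for x in xs]


def total(xs):
    executed.append("sum")
    return sum(xs)


graph = Lazy(load).map(double).map(total)
steps_before_compute = list(executed)  # still empty: only a graph exists
result = graph.compute()               # now every step actually runs
```

Before `compute()` the `executed` log is empty; afterwards it lists all three steps in dependency order.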
| Pillar | Action | Tool |
|---|---|---|
| Seed | Set all random seeds before any stochastic operation | np.random.seed(42), cupy.random.seed(42), random_state=42 on cuML estimators |
| Version | Pin all library versions | conda environment.yml or requirements.txt |
| Log | Record all parameters and metrics for every run | MLflow, Weights & Biases |
| Document | Version control code AND data references together | git + DVC |
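The Seed and Log pillars can be sketched with the standard library alone. `seeded_split` and `run_record` are hypothetical helpers, not part of any RAPIDS or MLflow API, and the metric value is left empty on purpose because no real model is trained here.

```python
import json
import random
import sys


def seeded_split(n_rows, test_fraction, seed):
    """Shuffle row indices with a dedicated seeded generator, then split."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


def run_record(params, metrics):
    """Bundle what is needed to repeat a run into one JSON string."""
    return json.dumps(
        {"params": params, "metrics": metrics, "python": sys.version.split()[0]},
        sort_keys=True,
    )


SEED = 42
train_a, test_a = seeded_split(10, 0.2, SEED)
train_b, test_b = seeded_split(10, 0.2, SEED)  # same seed, same split

record = run_record({"seed": SEED, "test_fraction": 0.2}, {"accuracy": None})
```

Two calls with the same seed give identical splits, and the JSON record is the kind of payload an experiment tracker would store for each run.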
| Bug | Symptom | Fix |
|---|---|---|
| Data leakage | Test accuracy much higher than expected | Use Pipeline object; split before fitting transformers |
| Type mismatch | Runtime error — float32 vs float64 conflict | Explicitly cast: df['col'].astype('float32') |
| Memory overflow | CUDA out-of-memory error | Switch to Dask for multi-GPU; use batch processing |
| NaN propagation | Model outputs NaN predictions | Add imputation step before model in pipeline |
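The NaN-propagation fix in the last row can be sketched as a mean imputer that learns its fill value from the training column only, consistent with the split-first rule. The helper names and sample values are hypothetical, and mean imputation is just one possible strategy.

```python
import math


def fit_mean_imputer(column):
    """Learn the fill value from the non-missing training values only."""
    present = [v for v in column if not math.isnan(v)]
    return sum(present) / len(present)


def impute(column, fill_value):
    """Replace every NaN with the previously learned fill value."""
    return [fill_value if math.isnan(v) else v for v in column]


def has_nan(column):
    return any(math.isnan(v) for v in column)


nan = float("nan")
train_col = [1.0, nan, 3.0, 5.0]
test_col = [nan, 4.0]

fill = fit_mean_imputer(train_col)    # (1 + 3 + 5) / 3 = 3.0
train_clean = impute(train_col, fill)
test_clean = impute(test_col, fill)   # reuse the training statistic
```

Placing a step like this before the model means no NaN reaches training or prediction.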
Monitor nvidia-smi during the run: low GPU utilization means the bottleneck is elsewhere (I/O or CPU).
Avoid .to_pandas() calls mid-pipeline.
import time
start = time.time()
# ... run pipeline ...
elapsed = time.time() - start
print(f"Pipeline completed in {elapsed:.1f}s")
Also watch nvidia-smi during the run for GPU utilization % and VRAM used/total. Low GPU% with high CPU% points to a CPU-side or data-loading (I/O) bottleneck. VRAM usage close to the total means the pipeline is near its memory limit.
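Timing the whole pipeline shows that it is slow; timing each stage shows where. The sketch below wraps the same timer idea in a context manager. The `timed` helper and the placeholder stage bodies are hypothetical and should be replaced with real pipeline steps.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(stage):
    """Record wall-clock seconds spent inside the with-block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


# Placeholder stages; replace the bodies with real pipeline steps.
with timed("ingest"):
    rows = list(range(100_000))
with timed("transform"):
    squared = [r * r for r in rows]

slowest = max(timings, key=timings.get)
for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.4f}s")
print(f"slowest stage: {slowest}")
```

The slowest stage is the first candidate to check against the nvidia-smi readings.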
| Factor | Why It Matters | Example |
|---|---|---|
| GPU VRAM | Determines max dataset size for single-GPU processing | T4=16GB, A100=40/80GB, H100=80GB |
| GPU Count | Multiple GPUs enable Dask multi-GPU scaling | 4x A100 = up to ~4x throughput with Dask, less communication overhead |
| NVLink | High-bandwidth GPU-GPU interconnect for multi-GPU Dask | A100 NVLink: 600 GB/s vs PCIe: 64 GB/s |
| Storage | NVMe SSD enables fast data loading; GPUDirect for direct NVMe to GPU transfer | GPUDirect Storage bypasses CPU RAM entirely |
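The VRAM row turns into simple arithmetic when deciding between single-GPU cuDF and Dask. The sketch below is a rough estimate under stated assumptions: the table is dense and numeric, the listed GB capacities are treated as GiB, and the 50% headroom for intermediate results is an assumed rule of thumb, not an NVIDIA figure.

```python
def estimate_gib(n_rows, n_cols, bytes_per_value):
    """Raw in-memory size of a dense numeric table, in GiB."""
    return n_rows * n_cols * bytes_per_value / 2**30


def fits_single_gpu(dataset_gib, vram_gib, headroom=0.5):
    """True if the data fits within the assumed usable share of VRAM.

    headroom=0.5 is an assumption that leaves half of VRAM for
    intermediate results; tune it for the actual workload.
    """
    return dataset_gib <= vram_gib * headroom


# Hypothetical table: 500 million rows, 20 float32 columns (4 bytes each).
size_gib = estimate_gib(500_000_000, 20, 4)

on_t4 = fits_single_gpu(size_gib, 16)    # T4 with 16 GB
on_a100 = fits_single_gpu(size_gib, 80)  # A100 with 80 GB

print(f"dataset is about {size_gib:.1f} GiB; T4: {on_t4}, A100 80GB: {on_a100}")
```

With these assumed numbers the table is roughly 37 GiB, too large for a T4 but within the assumed budget of an 80 GB A100; anything beyond that budget is a case for partitioning with Dask.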
Six high-retention mnemonics for NCA-ADS pipeline exam questions.
Ingest → Clean → Engineer → Feature-select → Train → Evaluate → Model-save. Never skip Clean before Engineer — dirty data produces garbage features.
The golden rule of pipeline design. Split your data into train/test before fitting any transformer. The Pipeline object supports this by design: its transformers are fitted only on the data passed to .fit(), so pass it the training split only.
L1 (Lasso) drives irrelevant feature weights to exactly 0 — it does feature selection. L2 (Ridge) shrinks all weights but rarely to zero. Use L1 when you suspect many irrelevant features.
Operations on a Dask DataFrame build a task graph — nothing actually runs. Call .compute() to trigger real execution across all GPUs. Forgetting .compute() is the #1 Dask beginner mistake.
Seed the random, Version the libs, Log the params, Document the data. All four are required — missing any one means results cannot be exactly reproduced by another researcher.
Augmentation creates more training rows or synthetic samples (SMOTE, noise injection). Integration combines data from external sources: concat for more rows, merge for more columns.
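The L1-versus-L2 hook above can be checked numerically. In the special case of an orthonormal design matrix, the lasso solution is a soft-thresholded copy of the unregularized weight and the ridge solution is a proportionally shrunk copy. The sketch assumes that special case and uses made-up weights; it is not how cuML solves regularized models in general.

```python
import math


def l1_shrink(w, lam):
    """Soft-thresholding: the lasso update for an orthonormal design."""
    return math.copysign(max(abs(w) - lam, 0.0), w)


def l2_shrink(w, lam):
    """Proportional shrinkage: the ridge update for an orthonormal design."""
    return w / (1.0 + lam)


# Hypothetical unregularized weights: two strong features, two weak ones.
weights = [2.0, -1.5, 0.3, -0.1]
lam = 0.5

l1_weights = [l1_shrink(w, lam) for w in weights]
l2_weights = [l2_shrink(w, lam) for w in weights]

l1_zeroed = sum(1 for w in l1_weights if w == 0.0)
l2_zeroed = sum(1 for w in l2_weights if w == 0.0)
```

The two weak weights fall below the threshold and become exactly zero under L1, while L2 makes every weight smaller but leaves all four non-zero.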
10 associate-level scenario questions on pipelines, workflow automation, and scalability.
12 cards covering all key pipeline and workflow automation concepts.
Personalized study plans based on your background.
You understand pipelines and orchestration well. Focus on ML-specific pipeline concepts that differ from ETL engineering — especially data leakage, feature transformers, and Dask for GPU workloads.
Unlike ETL pipelines, cuML Pipeline chains transformers + model into one object that enforces fit-on-train, transform-on-test discipline. Read Concept Block 1. Practice the Pipeline([('scaler', ...), ('clf', ...)]) syntax.
In ETL, you're used to processing all data together. In ML pipelines, fitting transformers on the full dataset before splitting corrupts test evaluation. Drill the Split-First rule until it's automatic. Quiz Q1 tests this.
Dask is familiar territory (distributed processing) but with GPU-specific quirks. Focus on: dask_cudf.read_parquet(), lazy evaluation task graph, and the .compute() trigger. Concept Block 5 + Flashcard 3 cover this.
cudf.concat (horizontal, more rows) and cudf.merge (vertical, more columns) will feel familiar. Focus on the post-integration quality checks: new nulls, duplicate keys, schema mismatches — these appear as exam scenarios.
Your orchestration background helps here. The exam tests whether pipelines should have error handling and alerting. Quiz Q5 shows a silent failure scenario — know that pipelines must log failures and send alerts.
Seed, Version, Log, Document. environment.yml + git + MLflow is the expected answer for reproducibility questions. Concept Block 6 covers this. Flashcard 8 is the quickest review.
After studying Concepts and Flashcards, take the Quiz. Flag any question about regularization (L1 vs L2) or SMOTE — those are ML-specialist topics that Data Engineers should review before the exam.
You know ML concepts well. Focus on the RAPIDS/GPU-specific implementation details — especially the cuML Pipeline API, dask_cudf scaling patterns, and reproducibility tooling.
The cuML Pipeline object mirrors scikit-learn's API. Review that Pipeline([('scaler', StandardScaler()), ('clf', ...)]).fit() chains all steps. The exam tests whether you know what each call does — don't assume it's identical in every detail.
Focus on the lazy evaluation model: operations = task graph, .compute() = execution. Quiz Q9 tests this directly. Know: dask_cudf.read_parquet(), dask groupby, and when to use Dask vs single-GPU cuDF.
L1 (Lasso) = coefficient sparsity = feature selection. L2 (Ridge) = weight shrinkage = no sparsity. For cuML: LogisticRegression(penalty='l1') vs penalty='l2'. Quiz Q10 + Memory Hook 3 + Flashcard 6 cover this.
XGBoost .feature_importances_ for embedded feature selection. Quiz Q4 presents a scenario where 3 of 50 features explain 95% of decisions — know the right next step (retrain with top 3, then validate).
SMOTE = Synthetic Minority Over-sampling Technique: it interpolates between minority class samples. Key rule: apply SMOTE ONLY to the training set, never the test set. Flashcard 7 + Quiz Q6 cover this pattern.
Know VRAM capacities (T4=16GB, A100=40/80GB, H100=80GB), NVLink for multi-GPU Dask, GPUDirect Storage for fast data loading. Concept Block 8 covers these; they appear in hardware recommendation scenarios.
Take the full Quiz. If you miss any question, read that concept block immediately. Pay special attention to pipeline scenario questions (Q2 Dask, Q3 Pipeline internals, Q7 reproducibility) which test GPU-specific reasoning.
You know the statistical concepts well. Focus on GPU-specific tools — especially where RAPIDS + Dask differ from pandas + scikit-learn workflows you use daily.
Ingest, Clean, Engineer, Feature-select, Train, Evaluate, Model-save. The NCA-ADS exam presents scenarios that deliberately scramble these stages. Drilling the correct order prevents confusing cleaning and engineering steps.
In manual workflows, it's easy to accidentally fit the scaler on full data. The Pipeline object enforces correct behavior automatically. Know why Pipeline prevents leakage — not just that it does. Quiz Q3 tests Pipeline internals.
As a DS you may be used to loading all data at once. When datasets exceed single GPU VRAM, dask_cudf distributes across multiple GPUs. Understand the lazy evaluation model and .compute() trigger. Concept Block 5 is essential.
Random seeds for cuML, NumPy, and XGBoost separately. An environment.yml for version pinning. MLflow for logging. DVC for data versioning. Quiz Q7 tests the fixed-seed + time-based split pattern for reproducible model comparison.
Image augmentation may be familiar, but NCA-ADS covers tabular-specific methods: SMOTE for class imbalance, noise injection for robustness, feature interactions for non-linearity. Know which technique fits which problem. Quiz Q6 tests SMOTE.
Know what to look for: low GPU utilization = bottleneck is I/O or CPU, not computation. High VRAM usage = memory pressure, switch to Dask. This diagnostic reasoning appears in pipeline optimization scenarios (Concept Block 7).
Take the Quiz twice. Second pass: for any question you got right, still verify you understand the explanation — some choices are tricky distractors. Run through all 12 Flashcards on exam day morning for fresh recall.
Official documentation and references for NCA-ADS pipeline and workflow automation topics.
Official exam information, objectives, registration, and Certiverse proctoring details for the Accelerated Data Science Associate credential.
nvidia.com/en-us/learn/certification/accelerated-data-science-associate/
Official API reference for dask_cudf: distributed GPU DataFrames built on Dask and cuDF. Covers read_parquet, groupby, merge, and .compute() patterns for multi-GPU workflows.
docs.rapids.ai/api/dask-cudf/stable/
Full API reference for the cuML Pipeline object: how to chain preprocessing steps and models, fit/predict behavior, and integration with cuDF DataFrames.
docs.rapids.ai/api/cuml/stable/
Open-source tool for versioning datasets and ML experiments alongside code. Enables reproducible pipelines by tracking exact data versions used for each model training run.
dvc.org/
All pages in this series. Study each topic before exam day.
Foundations of Accelerated Data Science & Environment Setup — ~18% of exam
Data Manipulation and Preparation — 23% of exam
Data Science Pipelines & Workflow Automation — 13% of exam
Machine Learning With RAPIDS — 16% of exam
Introductory MLOps & Descriptive Analysis — 23% of exam
Advanced Data Structures — 7% of exam