NCA-ADS: Machine Learning with RAPIDS & Introductory MLOps

Exam Overview — NCA-ADS

The NCA-ADS (Accelerated Data Science Associate) certification validates GPU-accelerated data science skills using RAPIDS. This page covers Topics 2 and 6 — Machine Learning with RAPIDS and Introductory MLOps — together worth approximately 26% of the exam.

Full Exam Topic Weight Table

#	Topic	Weight	Covered Here
1	Data Science Foundations & GPU Ecosystem	~12%	No
2	Machine Learning with RAPIDS (cuML, XGBoost)	~16%	Yes
3	Data Preparation with cuDF	~14%	No
4	Exploratory Data Analysis & Visualization	~12%	No
5	Feature Engineering	~14%	No
6	Introductory MLOps (MLflow, Drift, Serving)	~10%	Yes
7	Performance Optimization & Profiling	~10%	No
8	End-to-End Pipelines & Deployment	~12%	No

Page Summary

Topic 2 Focus

cuML, XGBoost GPU, Model Evaluation, Tuning

Topic 6 Focus

MLflow, Drift Detection, Model Serving

Key Algorithms

RandomForest, KMeans, DBSCAN, PCA, UMAP, XGBoost

Evaluation Metrics

Accuracy, Precision, Recall, F1, AUC-ROC, MAE, RMSE, R²

MLOps Tools

MLflow (log_param, log_metric, log_artifact)

Drift Types

Data drift (inputs change) vs Concept drift (relationship changes)

Tuning Methods

Grid Search, Random Search, Bayesian Optimization, Early Stopping

Exam Tip: At Associate level, the exam tests conceptual understanding — you need to know what each algorithm does, when to use it, what each metric measures, and how MLflow tracks experiments. Code memorization is less important than understanding the purpose of each tool.

Core Concepts

Ten essential knowledge areas spanning GPU-accelerated machine learning and introductory MLOps for the NCA-ADS exam.

Topic 2 — Machine Learning with RAPIDS

1. cuML — GPU-Accelerated Machine Learning

What is cuML?

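cuML is the RAPIDS GPU-accelerated counterpart to scikit-learn: many of the same algorithms, with an API that closely mirrors scikit-learn's
Same .fit() / .predict() / .transform() interface as scikit-learn
Accepts cuDF DataFrames directly, so no .to_pandas() conversion is needed
Near drop-in migration: for supported estimators, changing the import from sklearn to cuml is often enough to run on GPU
Speedups of roughly 10–50× over CPU scikit-learn are commonly cited for large datasets; the actual gain depends on the algorithm, data size, and hardware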

cuML Algorithm Categories

Regression: LinearRegression, Ridge, Lasso, ElasticNet
Classification: LogisticRegression, RandomForestClassifier, SVC
Clustering: KMeans, DBSCAN
Dimensionality Reduction: PCA, UMAP, TSNE
Neighbors: KNeighborsClassifier, NearestNeighbors

Task	Algorithm	cuML Class	Key Detail
Regression	Linear Regression	cuml.LinearRegression	GPU OLS; same as sklearn
Classification	Logistic Regression	cuml.LogisticRegression	Binary/multiclass; L1/L2 support
Classification	Random Forest	cuml.RandomForestClassifier	Parallel tree building on GPU
Classification	SVM	cuml.SVC	GPU-accelerated support vector machine
Clustering	K-Means	cuml.KMeans	Requires k specified upfront
Clustering	DBSCAN	cuml.DBSCAN	Density-based; number of clusters is not specified upfront; marks outliers as noise
Dimensionality	PCA	cuml.PCA	GPU SVD; linear compression
Dimensionality	UMAP	cuml.UMAP	Non-linear; great for visualization

2. Supervised vs Unsupervised Learning

Supervised Learning

Requires: labeled training data — each example has an input and a known output
Regression: predicts continuous output (house price, temperature)
Classification: predicts categorical output (spam/not spam, fraud/not fraud)
Goal: learn a mapping from input features to output labels
Examples: LinearRegression, LogisticRegression, RandomForest, XGBoost

Unsupervised Learning

Requires: unlabeled data — no known outputs; model finds structure
Clustering: group similar data points — KMeans, DBSCAN
Dimensionality Reduction: compress many features to fewer — PCA, UMAP
Goal: discover hidden patterns or structure in data
Semi-supervised: small labeled + large unlabeled — hybrid approach

3. XGBoost on GPU

What is XGBoost?

Extreme Gradient Boosting — ensemble of decision trees trained sequentially
Each tree corrects the errors of the previous tree
Often among the strongest performers on structured/tabular data problems
GPU acceleration: device='cuda', tree_method='hist'
Accepts cuDF DataFrames via xgb.DMatrix(cudf_df, label=cudf_labels)

Key Hyperparameters

n_estimators — number of trees; more = more accurate but slower, risk of overfitting
max_depth — maximum tree depth; deeper = more complex, more overfitting risk
learning_rate (eta) — step size; smaller = more careful learning, needs more trees
subsample — fraction of training data per tree; reduces overfitting
colsample_bytree — fraction of features per tree; reduces overfitting
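To make the idea that each tree corrects the errors of the previous one more concrete, here is a minimal sketch of plain gradient boosting for squared error, using one-split decision stumps on a single feature. It is not XGBoost's actual algorithm (which adds regularization, second-order gradient information, and histogram-based splits). The names `fit_stump` and `boost` and the toy data are illustrative assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    candidates = sorted(set(xs))
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_estimators, learning_rate):
    """Train stumps sequentially; return the training MSE after each round."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    history = []
    for _ in range(n_estimators):
        # Each new stump is fitted to what the ensemble still gets wrong.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        # learning_rate shrinks each stump's contribution.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return history


# Toy data (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]

fast = boost(xs, ys, n_estimators=10, learning_rate=0.3)
slow = boost(xs, ys, n_estimators=10, learning_rate=0.01)
print("lr=0.3 :", round(fast[-1], 4))
print("lr=0.01:", round(slow[-1], 4))
```

On this toy data the larger learning rate drives training error down faster within the same number of rounds, while the smaller rate would need many more trees to get there. That is the trade-off the learning_rate and n_estimators entries describe; training error alone says nothing about generalization.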

4. Model Evaluation Metrics

Metric	Type	Formula	When to Use
Accuracy	Classification	(TP+TN) / total	Balanced classes only — misleading for imbalanced data
Precision	Classification	TP / (TP+FP)	Minimize false positives (spam filter — don't block real emails)
Recall (Sensitivity)	Classification	TP / (TP+FN)	Minimize false negatives (disease diagnosis — don't miss sick patients)
F1-Score	Classification	2·P·R / (P+R)	Imbalanced classes; harmonic mean of precision (P) and recall (R)
AUC-ROC	Classification	Area under ROC curve	Threshold-independent model comparison; less affected by class imbalance than accuracy; 0.5=random, 1.0=perfect
MAE	Regression	mean(|y-ŷ|)	Easy to interpret; same units as target; less sensitive to outliers than RMSE
RMSE	Regression	sqrt(mean((y-ŷ)²))	Penalizes large errors more; sensitive to outliers
R²	Regression	1 - SS_res/SS_tot	Proportion of variance explained; 1.0=perfect, 0=no better than mean
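The three regression metrics in the table can be computed by hand in a few lines. The sketch below uses made-up numbers, chosen so that two sets of predictions have the same MAE but different RMSE, which shows how RMSE penalizes a single large error more heavily.

```python
import math


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative values only.
y_true = [10.0, 12.0, 14.0, 16.0, 18.0]
y_close = [11.0, 11.0, 15.0, 15.0, 19.0]    # every prediction off by 1
y_outlier = [10.0, 12.0, 14.0, 16.0, 23.0]  # four exact, one off by 5
y_mean = [14.0] * 5                         # always predict the mean

for name, pred in [("close", y_close), ("outlier", y_outlier), ("mean", y_mean)]:
    print(name, mae(y_true, pred), round(rmse(y_true, pred), 3), r2(y_true, pred))
```

Both `y_close` and `y_outlier` have an MAE of 1.0, but the single error of 5 raises the RMSE of `y_outlier` to about 2.24. Predicting the mean everywhere gives an R² of 0.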

Confusion Matrix: The foundation of all classification metrics. A 2×2 table with TP (correctly predicted positive), TN (correctly predicted negative), FP (predicted positive but actually negative — false alarm), and FN (predicted negative but actually positive — missed case).
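As a small worked example, the four confusion-matrix counts are enough to derive accuracy, precision, recall, and F1. The counts below are invented for illustration, and returning 0.0 when a denominator is zero is an assumed convention, not a standard.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive the common classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # assumed convention for 0/0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Imbalanced example: 1,000 transactions, 10 of them fraudulent.
# A model that never flags fraud: every fraud case is a false negative.
lazy = classification_metrics(tp=0, tn=990, fp=0, fn=10)
# A model that catches 8 of the 10 frauds at the cost of 12 false alarms.
useful = classification_metrics(tp=8, tn=978, fp=12, fn=2)

print("lazy  :", lazy)
print("useful:", useful)
```

The model that flags nothing scores 99% accuracy yet has zero recall, while the useful model has slightly lower accuracy but finds 80% of the fraud. This is why accuracy is misleading on imbalanced data.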

5. Overfitting vs Underfitting

Underfitting (High Bias)

Sign: high training error AND high test error
Cause: model too simple to capture patterns in data
Fix: more complex model, more features, reduce regularization, train longer

Overfitting (High Variance)

Sign: low training error BUT high test error
Cause: model memorized training data, doesn't generalize
Fix: more training data, regularization (L1/L2), simpler model, dropout (neural networks), early stopping

Bias-Variance Trade-off

Increasing model complexity reduces bias but increases variance
Goal: find the sweet spot — low train error AND low test error
Good fit: low error on the training data and on new, unseen data
Regularization controls this trade-off by penalizing complexity
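The train-versus-test comparison can be illustrated with a deliberately tiny, fully deterministic example. The sketch assumes a 1-nearest-neighbour "memorizer" as the high-variance model and a single threshold as the low-capacity model. The data, the two flipped labels, and the function names are all made up for illustration.

```python
def true_label(x):
    """The real underlying pattern: class 1 when x is 5 or more."""
    return 1 if x >= 5 else 0


# Training set with two deliberately noisy (flipped) labels.
train_x = [float(i) for i in range(10)]
train_y = [true_label(x) for x in train_x]
train_y[2] = 1
train_y[7] = 0

# Clean test set, slightly offset from the training points.
test_x = [i + 0.4 for i in range(10)]
test_y = [true_label(x) for x in test_x]


def memorizer(x):
    """1-nearest-neighbour: return the label of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def simple_rule(x):
    """Stand-in for a low-capacity model that learned only the overall trend."""
    return 1 if x >= 5 else 0


def error_rate(model, xs, ys):
    return sum(model(x) != y for x, y in zip(xs, ys)) / len(xs)


for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    print(name,
          "train:", error_rate(model, train_x, train_y),
          "test:", error_rate(model, test_x, test_y))
```

The memorizer reproduces the noisy training labels perfectly (train error 0.0) and repeats those mistakes on new points (test error 0.2), the overfitting signature. The simple rule cannot fit the noise (train error 0.2) but generalizes (test error 0.0).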

6. Cross-Validation

Why Cross-Validate?

A single train/test split is sensitive to how data was split — could be lucky or unlucky
K-Fold CV gives a more reliable estimate of true model performance
K-Fold: split into k folds; train on k-1, test on 1; repeat k times; average scores
Stratified K-Fold: each fold has same class distribution — important for classification
Rule of thumb: k=5 or k=10 most common
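The K-Fold procedure described above can be sketched in plain Python. This is an illustration of the idea rather than any library's implementation; `k_fold_indices`, `average_cv_score`, and the toy "predict the training mean" model are hypothetical names and choices, and no shuffling or stratification is done.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def average_cv_score(fit, score, xs, ys, k=5):
    """Train on k-1 folds, score on the held-out fold, repeat k times, average."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(scores) / len(scores), scores


# Toy "model": always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def score_mae(model, xs, ys):
    return sum(abs(y - model) for y in ys) / len(ys)


xs = list(range(10))
ys = [float(v) for v in range(1, 11)]
mean_score, fold_scores = average_cv_score(fit_mean, score_mae, xs, ys, k=5)
print(fold_scores, mean_score)
```

The per-fold errors range from 0.5 to 5.0 here, so any single split could look lucky or unlucky; the average over folds is the more stable estimate. Because this data is ordered, shuffling before splitting would normally be advisable.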

cuML Cross-Validation

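cuml.model_selection.train_test_split(): GPU train/test splitting
With a cuML estimator, each fold's model training runs on the GPU
cuML estimators follow the scikit-learn estimator interface, so they can generally be used with scikit-learn-style cross-validation utilities
Stratified splitting is available for classification problems
Keeping fold data in GPU memory avoids repeated CPU-to-GPU transfers; whether every intermediate stays on the GPU depends on the utilities used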

7. Hyperparameter Tuning

Hyperparameters vs Parameters

Parameters: learned from data during training (model weights, coefficients)
Hyperparameters: set before training by the data scientist (max_depth, learning_rate, n_estimators)
Hyperparameter tuning = finding the best hyperparameter values via systematic search

Tuning Strategies

Grid Search: try every combination of specified values — exhaustive but slow
Random Search: try random combinations — faster, often finds good solutions
Bayesian Optimization: uses the results of earlier trials to choose the next candidates; usually the most sample-efficient of the three
Early Stopping (XGBoost): early_stopping_rounds=10 — stop if no improvement for 10 rounds; prevents overfitting
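The difference between grid search and random search can be sketched with a toy objective. Here `validation_error` is a hypothetical stand-in for "train a model with these hyperparameters and return its validation error"; the search space and the choice to sample from fixed lists are assumptions made to keep the example small.

```python
import itertools
import random


def validation_error(max_depth, learning_rate):
    """Hypothetical stand-in for training a model and measuring validation error."""
    return (max_depth - 6) ** 2 + 100 * (learning_rate - 0.1) ** 2


SEARCH_SPACE = {"max_depth": [2, 4, 6, 8], "learning_rate": [0.01, 0.1, 0.3]}


def grid_search(space):
    """Evaluate every combination; return (best_error, best_params, n_evaluations)."""
    names = list(space)
    best_error, best_params, n_evals = float("inf"), None, 0
    for values in itertools.product(*(space[name] for name in names)):
        params = dict(zip(names, values))
        error = validation_error(**params)
        n_evals += 1
        if error < best_error:
            best_error, best_params = error, params
    return best_error, best_params, n_evals


def random_search(space, n_trials, seed=0):
    """Evaluate n_trials randomly chosen combinations."""
    rng = random.Random(seed)
    best_error, best_params = float("inf"), None
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        error = validation_error(**params)
        if error < best_error:
            best_error, best_params = error, params
    return best_error, best_params, n_trials


grid_result = grid_search(SEARCH_SPACE)
random_result = random_search(SEARCH_SPACE, n_trials=5)
print("grid  :", grid_result)
print("random:", random_result)
```

Grid search needs all 12 evaluations to guarantee it finds the best combination in the grid. Random search spends only 5 and may or may not land on it. The gap in cost grows quickly as more hyperparameters are added.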

Topic 6 — Introductory MLOps

8. MLflow — Experiment Tracking

What is MLflow?

Open-source platform to track, manage, and reproduce ML experiments
Experiment: a named collection of related runs
Run: one execution of ML code — one training job
MLflow UI: web interface to compare runs visually
Benefits: compare runs, reproduce best run, track what changed between experiments

MLflow Logging — PAM

Parameters — mlflow.log_param("max_depth", 6) — hyperparameters set before training
Artifacts — mlflow.log_artifact("model.pkl") — saved model files, plots, data files
Metrics — mlflow.log_metric("accuracy", 0.94) — evaluation scores from training/test
Memory trick: PAM — Parameters, Artifacts, Metrics
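A sketch of how the three logging calls fit together in one run follows. So that it runs without MLflow installed, it takes the tracker as an argument and uses `DryRunTracker`, a hypothetical stand-in that only records calls. The helper `log_training_run` and the example values are likewise illustrative assumptions.

```python
class DryRunTracker:
    """Hypothetical stand-in that records calls so the sketch runs without MLflow."""

    def __init__(self):
        self.calls = []

    def start_run(self):
        return self  # used as a context manager, like mlflow.start_run()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return False

    def log_param(self, key, value):
        self.calls.append(("param", key, value))

    def log_metric(self, key, value):
        self.calls.append(("metric", key, value))

    def log_artifact(self, local_path):
        self.calls.append(("artifact", local_path))


def log_training_run(tracker, params, metrics, artifact_paths):
    """Log Parameters, Artifacts, and Metrics (PAM) for a single run."""
    with tracker.start_run():
        for key, value in params.items():
            tracker.log_param(key, value)
        for path in artifact_paths:
            tracker.log_artifact(path)
        for key, value in metrics.items():
            tracker.log_metric(key, value)


tracker = DryRunTracker()
log_training_run(
    tracker,
    params={"max_depth": 6, "learning_rate": 0.1},
    metrics={"accuracy": 0.94},
    artifact_paths=["model.pkl"],
)
for call in tracker.calls:
    print(call)
```

With MLflow installed, passing the `mlflow` module itself as the tracker (`import mlflow`, then `log_training_run(mlflow, ...)`) should issue the real `start_run`, `log_param`, `log_artifact`, and `log_metric` calls. In that case the artifact file must already exist on disk.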

9. Model Saving, Loading, and Serving

Saving and Loading Models

XGBoost: model.save_model('model.json') and model.load_model('model.json')
cuML: pickle.dump(model, open('model.pkl','wb')) and pickle.load(open('model.pkl','rb'))
Save alongside metadata: training date, feature list, evaluation metrics, data version
Prediction: model.predict(X_test) works for cuML estimators and XGBoost's scikit-learn-style wrappers; a native XGBoost Booster expects a DMatrix instead
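The pickle pattern and the advice to save metadata alongside the model can be combined as in the sketch below. A plain dict of learned parameters stands in for a fitted estimator; the file names, helper functions, and every metadata value are placeholders invented for illustration.

```python
import json
import os
import pickle
import tempfile


def save_model_with_metadata(model, metadata, directory):
    """Write the pickled model and a JSON metadata file side by side."""
    model_path = os.path.join(directory, "model.pkl")
    meta_path = os.path.join(directory, "model_metadata.json")
    with open(model_path, "wb") as f:      # 'with' closes the file handle
        pickle.dump(model, f)
    with open(meta_path, "w") as f:
        json.dump(metadata, f, indent=2)
    return model_path, meta_path


def load_model_with_metadata(directory):
    """Load both files back. Only unpickle files from a trusted source."""
    with open(os.path.join(directory, "model.pkl"), "rb") as f:
        model = pickle.load(f)
    with open(os.path.join(directory, "model_metadata.json")) as f:
        metadata = json.load(f)
    return model, metadata


# Stand-in for a fitted estimator (placeholder values throughout).
model = {"threshold": 5.0}
metadata = {
    "training_date": "2024-01-15",
    "features": ["amount", "merchant_category"],
    "metrics": {"f1": 0.81},
    "data_version": "hypothetical-v1",
}

with tempfile.TemporaryDirectory() as tmp:
    save_model_with_metadata(model, metadata, tmp)
    loaded_model, loaded_metadata = load_model_with_metadata(tmp)

print(loaded_model, loaded_metadata["features"])
```

Keeping the feature list in the metadata file lets serving code check that incoming data has the same columns, in the same order, as the training data.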

Model Artifacts Best Practices

Log model files as MLflow artifacts for reproducibility
Save the full feature list used during training — essential for serving
Record the exact training dataset version (date, hash, or path)
Use MLflow Model Registry to track model versions in production
Version control ensures any past run can be reproduced exactly

10. Model Drift and Production Monitoring

Data Drift

Definition: the statistical distribution of model input features changes over time
Result: model receives data unlike what it was trained on
Example: model trained on 2023 customer behavior used in 2026 when behavior changed
Detection: monitor feature statistics in production vs training baseline
Response: retrain on recent data; alert on-call engineer
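One simple way to compare production feature statistics against a training baseline is to measure how far each feature's mean has moved, in units of the baseline standard deviation. The sketch below assumes that approach and an arbitrary alert threshold of 2.0. Real monitoring setups often use other tests (for example the Kolmogorov-Smirnov test or the Population Stability Index), and all data here is made up.

```python
import statistics


def mean_shift_in_std_units(baseline, production):
    """Distance between means, measured in baseline standard deviations."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    return abs(statistics.mean(production) - base_mean) / base_std


def drifted_features(baseline_by_feature, production_by_feature, threshold=2.0):
    """Return the features whose mean shift exceeds the (assumed) threshold."""
    return [
        name
        for name in baseline_by_feature
        if mean_shift_in_std_units(
            baseline_by_feature[name], production_by_feature[name]
        ) > threshold
    ]


# Illustrative values only.
baseline = {
    "age": [30, 32, 34, 36, 38],
    "basket_value": [20.0, 22.0, 24.0, 26.0, 28.0],
}
production = {
    "age": [31, 33, 34, 35, 37],                     # same mean as training
    "basket_value": [40.0, 42.0, 44.0, 46.0, 48.0],  # mean has moved a lot
}

drifted = drifted_features(baseline, production)
print("drifted features:", drifted)
```

Here `age` keeps the same mean as the baseline while `basket_value` has moved by more than six baseline standard deviations, so only `basket_value` is flagged. A check like this catches data drift only; it cannot see concept drift, because it never looks at labels.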

Concept Drift

Definition: the relationship between input features and the target label changes
Result: model predictions become stale even if input distribution looks the same
Example: fraud patterns change; what was "normal" before is now fraudulent
Monitoring signals: accuracy degradation, distribution shift in predictions, feature statistics drift
Tools: custom monitoring scripts, W&B monitoring, MLflow model registry

Memory Hooks

Six high-retention memory anchors to lock in the most exam-critical concepts for Topics 2 and 6.

🏫

Supervised vs Unsupervised

Supervised = teacher provides answers (labels).
Unsupervised = student finds patterns alone.

Supervised learning requires labeled data — you know the correct output for each input. Unsupervised learning gets raw data with no labels and discovers hidden structure like clusters or compressed representations.

⚖️

Precision vs Recall Trade-off

Precision: when you say YES, are you right?
Recall: did you find ALL the YESes?

Fraud detection: maximize Recall — catch all fraud even with false alarms. Spam filter: maximize Precision — don't block real emails even if some spam slips through. F1 balances both when you can't choose.
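The trade-off shows up directly when the decision threshold on a model's scores is moved. The scores and labels below are invented for illustration, and `precision_recall_at` is a hypothetical helper.

```python
def precision_recall_at(threshold, scores, labels):
    """Classify as positive when score >= threshold, then compute both metrics."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predicted, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Model scores (higher = more likely positive) and true labels; made-up data.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

strict = precision_recall_at(0.75, scores, labels)
lenient = precision_recall_at(0.35, scores, labels)
print("threshold 0.75 -> precision, recall:", strict)
print("threshold 0.35 -> precision, recall:", lenient)
```

The strict threshold is right every time it says yes (precision 1.0) but misses two of the five positives (recall 0.6). The lenient threshold finds every positive (recall 1.0) at the cost of two false alarms (precision 5/7).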

🎯

Overfitting Signs

Train good, Test bad = Overfit.
Train bad, Test bad = Underfit.
Both good = Just right.

The train vs test error comparison is the single most reliable diagnostic. High training accuracy + low test accuracy = the model memorized training examples and cannot generalize. Low both = too simple.

🌳

XGBoost Params Rule

More trees + smaller learning_rate = more careful = better generalization (but slower).

Setting learning_rate=0.01 with n_estimators=1000 learns more carefully than learning_rate=0.3 with n_estimators=100. Add early_stopping_rounds to find the right stopping point automatically.
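The stopping rule behind early_stopping_rounds can be sketched as a simple patience counter. This is a simplified illustration, not XGBoost's internal code; the list of validation errors is made up and stands in for evaluating the model on a validation set after each boosting round.

```python
def train_with_early_stopping(val_errors, early_stopping_rounds):
    """Return (best_round, rounds_run) given one validation error per round."""
    best_error = float("inf")
    best_round = -1
    for round_idx, err in enumerate(val_errors):
        if err < best_error:
            best_error, best_round = err, round_idx
        elif round_idx - best_round >= early_stopping_rounds:
            # No improvement for `early_stopping_rounds` rounds: stop here.
            return best_round, round_idx + 1
    return best_round, len(val_errors)


# Validation error improves, bottoms out at round 4, then starts to rise.
val_errors = [0.50, 0.40, 0.33, 0.30, 0.29, 0.30, 0.31, 0.31, 0.32, 0.35, 0.40]

best_round, rounds_run = train_with_early_stopping(val_errors, early_stopping_rounds=3)
print("best round:", best_round, "rounds run:", rounds_run)
```

Training stops after 8 of the 11 possible rounds, and the best model is the one from round 4 (zero-based), before validation error started climbing.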

📋

MLflow 3 Things to Log

PAM: Parameters, Artifacts, Metrics.

Parameters = hyperparameters set before training. Artifacts = saved model files, plots. Metrics = evaluation scores after training. Log all three in every experiment run so any result can be reproduced and compared later.

📉

Drift Types

Data drift = input changed.
Concept drift = relationship changed.

Both require monitoring; both require retraining. Data drift: feature distributions shift (new customer demographics). Concept drift: same features, different meaning (fraud patterns evolve). Monitor production accuracy as an early warning signal for both.

Official Resources

Authoritative sources for NCA-ADS exam preparation, cuML, XGBoost, and MLflow.

NVIDIA NCA-ADS Certification Page

Official exam guide, objectives, registration, and recommended training paths

cuML Documentation (RAPIDS)

Full API reference for all cuML algorithms — LinearRegression, KMeans, DBSCAN, PCA, UMAP, and more

XGBoost Documentation

GPU acceleration guide, hyperparameter reference, early stopping, and DMatrix usage

MLflow Documentation

Experiment tracking API, model registry, MLflow UI, and deployment guides
