Exam Overview — NCA-ADS
The NCA-ADS (Accelerated Data Science Associate) certification validates GPU-accelerated data science skills using RAPIDS. This page covers Topics 2 and 6 — Machine Learning with RAPIDS and Introductory MLOps — together worth approximately 26% of the exam.
Full Exam Topic Weight Table
| # | Topic | Weight | Covered Here |
|---|---|---|---|
| 1 | Data Science Foundations & GPU Ecosystem | ~12% | No |
| 2 | Machine Learning with RAPIDS (cuML, XGBoost) | ~16% | Yes |
| 3 | Data Preparation with cuDF | ~14% | No |
| 4 | Exploratory Data Analysis & Visualization | ~12% | No |
| 5 | Feature Engineering | ~14% | No |
| 6 | Introductory MLOps (MLflow, Drift, Serving) | ~10% | Yes |
| 7 | Performance Optimization & Profiling | ~10% | No |
| 8 | End-to-End Pipelines & Deployment | ~12% | No |
Core Concepts
Ten essential knowledge areas spanning GPU-accelerated machine learning and introductory MLOps for the NCA-ADS exam.
1. cuML — GPU-Accelerated Machine Learning
What is cuML?
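- cuML is the GPU-accelerated counterpart to scikit-learn: it covers many of the same algorithms behind a closely matching API and runs them on the GPU
- Same `.fit()` / `.predict()` / `.transform()` interface as scikit-learn
- Accepts cuDF DataFrames directly, so no `.to_pandas()` conversion is needed
- Near drop-in migration: for supported estimators, change `from sklearn` to `from cuml` and the code runs on GPU
- Speedups of roughly 10–50× over CPU scikit-learn are commonly cited for large datasets; actual gains depend on the algorithm, data size, and hardware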
cuML Algorithm Categories
- Regression: LinearRegression, Ridge, Lasso, ElasticNet
- Classification: LogisticRegression, RandomForestClassifier, SVC
- Clustering: KMeans, DBSCAN
- Dimensionality Reduction: PCA, UMAP, TSNE
- Neighbors: KNeighborsClassifier, NearestNeighbors
| Task | Algorithm | cuML Class | Key Detail |
|---|---|---|---|
| Regression | Linear Regression | cuml.LinearRegression | GPU OLS; same as sklearn |
| Classification | Logistic Regression | cuml.LogisticRegression | Binary/multiclass; L1/L2 support |
| Classification | Random Forest | cuml.RandomForestClassifier | Parallel tree building on GPU |
| Classification | SVM | cuml.SVC | GPU-accelerated support vector machine |
| Clustering | K-Means | cuml.KMeans | Requires k specified upfront |
| Clustering | DBSCAN | cuml.DBSCAN | Density-based; finds k automatically |
| Dimensionality | PCA | cuml.PCA | GPU SVD; linear compression |
| Dimensionality | UMAP | cuml.UMAP | Non-linear; great for visualization |
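The drop-in relationship between scikit-learn and cuML described above can be sketched with a small loader that prefers the GPU library and falls back to the CPU one. This is a minimal illustration, not an official pattern: `load_estimator` is a hypothetical helper, and it assumes both libraries expose the class under the same submodule name (true for `cluster.KMeans` and `linear_model.LinearRegression`).

```python
import importlib


def load_estimator(class_name, submodule):
    """Return (estimator_class, backend_name), preferring cuML over scikit-learn.

    Returns (None, None) when neither library can be imported.
    """
    for backend in ("cuml", "sklearn"):
        try:
            module = importlib.import_module(f"{backend}.{submodule}")
            return getattr(module, class_name), backend
        except Exception:
            # cuML can fail to import on machines without a usable GPU,
            # so any failure simply moves on to the next backend.
            continue
    return None, None


KMeans, backend = load_estimator("KMeans", "cluster")
print("KMeans backend:", backend)
```

Whichever backend is returned, the estimator is then used the same way, for example `KMeans(n_clusters=3).fit(X)`, which is the point of the shared interface.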
2. Supervised vs Unsupervised Learning
Supervised Learning
- Requires: labeled training data — each example has an input and a known output
- Regression: predicts continuous output (house price, temperature)
- Classification: predicts categorical output (spam/not spam, fraud/not fraud)
- Goal: learn a mapping from input features to output labels
- Examples: LinearRegression, LogisticRegression, RandomForest, XGBoost
Unsupervised Learning
- Requires: unlabeled data — no known outputs; model finds structure
- Clustering: group similar data points — KMeans, DBSCAN
- Dimensionality Reduction: compress many features to fewer — PCA, UMAP
- Goal: discover hidden patterns or structure in data
- Semi-supervised: small labeled + large unlabeled — hybrid approach
3. XGBoost on GPU
What is XGBoost?
- Extreme Gradient Boosting — ensemble of decision trees trained sequentially
- Each tree corrects the errors of the trees built before it
- Frequently a top performer on structured/tabular data problems
- GPU acceleration: `device='cuda'`, `tree_method='hist'` (the `device` parameter applies to XGBoost 2.0 and later)
- Accepts cuDF DataFrames via `xgb.DMatrix(cudf_df, label=cudf_labels)`
Key Hyperparameters
- `n_estimators`: number of trees; more = more accurate but slower, with a risk of overfitting
- `max_depth`: maximum tree depth; deeper = more complex, more overfitting risk
- `learning_rate` (`eta`): step size; smaller = more careful learning, needs more trees
- `subsample`: fraction of training data per tree; reduces overfitting
- `colsample_bytree`: fraction of features per tree; reduces overfitting
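As a rough sketch of how these settings fit together, the helper below builds a parameter dictionary for XGBoost's native training API and a second function trains a booster from it. The function names, the default values, and the binary-classification objective are assumptions chosen for illustration; `xgboost` is imported lazily so the parameter helper runs without it.

```python
def make_xgb_params(max_depth=6, learning_rate=0.1, subsample=0.8,
                    colsample_bytree=0.8, use_gpu=True):
    """Build a parameter dict for XGBoost's native training API."""
    return {
        "objective": "binary:logistic",  # assumed task: binary classification
        "tree_method": "hist",
        "device": "cuda" if use_gpu else "cpu",  # requires XGBoost 2.0+
        "max_depth": max_depth,
        "learning_rate": learning_rate,
        "subsample": subsample,
        "colsample_bytree": colsample_bytree,
    }


def train_booster(features, labels, params, num_boost_round=200):
    """Train a booster; features/labels may be cuDF, pandas, or NumPy objects."""
    import xgboost as xgb  # lazy import: only needed when training

    dtrain = xgb.DMatrix(features, label=labels)
    # num_boost_round plays the role of n_estimators in the native API.
    return xgb.train(params, dtrain, num_boost_round=num_boost_round)


params = make_xgb_params(max_depth=4, learning_rate=0.05)
print(params["device"], params["tree_method"])  # cuda hist
```

Lowering `learning_rate` usually calls for a larger `num_boost_round`, which is the trade-off the hyperparameter list describes.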
4. Model Evaluation Metrics
| Metric | Type | Formula | When to Use |
|---|---|---|---|
| Accuracy | Classification | (TP+TN) / total | Balanced classes only — misleading for imbalanced data |
| Precision | Classification | TP / (TP+FP) | Minimize false positives (spam filter — don't block real emails) |
| Recall (Sensitivity) | Classification | TP / (TP+FN) | Minimize false negatives (disease diagnosis — don't miss sick patients) |
| F1-Score | Classification | 2*(P*R)/(P+R) | Imbalanced classes — balances precision and recall |
| AUC-ROC | Classification | Area under ROC curve | Model comparison; threshold-independent; 0.5=random, 1.0=perfect; can look optimistic under severe class imbalance |
| MAE | Regression | mean(\|y-ŷ\|) | Easy to interpret; same units as target; less sensitive to outliers than RMSE |
| RMSE | Regression | sqrt(mean((y-ŷ)²)) | Penalizes large errors more; sensitive to outliers |
| R² | Regression | 1 - SS_res/SS_tot | Proportion of variance explained; 1.0=perfect, 0=no better than predicting the mean; negative when worse than the mean |
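The formulas in the table can be checked by hand with a short pure-Python sketch. The confusion-matrix counts and the regression values below are made-up numbers for illustration; in practice these metrics would come from a library such as `cuml.metrics` or `sklearn.metrics`.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R² for paired true and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1 - ss_res / ss_tot,
    }


clf = classification_metrics(tp=40, fp=10, fn=20, tn=30)
reg = regression_metrics([3, 5, 7, 9], [2, 5, 8, 12])
print(clf)
print(reg)
```

With these inputs accuracy is 0.70 and precision is 0.80 while recall is only about 0.67, which shows how a single number can hide missed positives. The one large error (9 predicted as 12) pushes RMSE (about 1.66) above MAE (1.25).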
5. Overfitting vs Underfitting
Underfitting (High Bias)
- Sign: high training error AND high test error
- Cause: model too simple to capture patterns in data
- Fix: more complex model, more features, reduce regularization, train longer
Overfitting (High Variance)
- Sign: low training error BUT high test error
- Cause: model memorized training data, doesn't generalize
- Fix: more training data, regularization (L1/L2), simpler model, dropout, early stopping
Bias-Variance Trade-off
- Increasing model complexity reduces bias but increases variance
- Goal: find the sweet spot — low train error AND low test error
- Good fit: generalizes well to new, unseen data
- Regularization controls this trade-off by penalizing complexity
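The signs listed above can be turned into a simple rule of thumb. The sketch below is only illustrative: `diagnose_fit` is a hypothetical helper, and the `acceptable_error` and `max_gap` thresholds are arbitrary values that would need to be chosen per problem.

```python
def diagnose_fit(train_error, test_error, acceptable_error=0.10, max_gap=0.05):
    """Classify a model's fit from its train and test error rates.

    acceptable_error and max_gap are illustrative thresholds, not standard values.
    """
    if train_error > acceptable_error:
        return "underfitting"   # cannot even fit the training data
    if test_error - train_error > max_gap:
        return "overfitting"    # fits training data but fails to generalize
    return "good fit"


print(diagnose_fit(0.30, 0.32))  # underfitting
print(diagnose_fit(0.02, 0.25))  # overfitting
print(diagnose_fit(0.05, 0.07))  # good fit
```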
6. Cross-Validation
Why Cross-Validate?
- A single train/test split is sensitive to how data was split — could be lucky or unlucky
- K-Fold CV gives a more reliable estimate of true model performance
- K-Fold: split into k folds; train on k-1, test on 1; repeat k times; average scores
- Stratified K-Fold: each fold has same class distribution — important for classification
- Rule of thumb: k=5 or k=10 most common
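The K-Fold procedure above (split into k folds, train on k-1, test on 1, repeat, average) can be sketched in pure Python. This version uses contiguous, unshuffled folds and a trivial "predict the training mean" baseline in place of a real model; both are simplifying assumptions, and library splitters normally shuffle and can stratify.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_val_mean_baseline(targets, k=5):
    """Average MAE across folds for a baseline that predicts the training mean."""
    scores = []
    for train, test in k_fold_indices(len(targets), k):
        mean = sum(targets[i] for i in train) / len(train)
        scores.append(sum(abs(targets[i] - mean) for i in test) / len(test))
    return sum(scores) / len(scores)


folds = list(k_fold_indices(10, 5))
print(folds[0])  # ([2, 3, 4, 5, 6, 7, 8, 9], [0, 1])
print(cross_val_mean_baseline([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], k=5))
```

Every sample lands in the test set exactly once, so the averaged score uses all of the data for evaluation without ever testing on a sample the model was trained on.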
cuML Cross-Validation
- `cuml.model_selection.train_test_split()`: GPU train/test splitting
- CV runs each fold's training entirely on GPU
- cuML estimators follow the scikit-learn estimator interface, so they can generally be used with sklearn's `cross_validate()`
- Stratified splitting available for classification problems
- All intermediate datasets remain on GPU — no CPU round-trips
7. Hyperparameter Tuning
Hyperparameters vs Parameters
- Parameters: learned from data during training (model weights, coefficients)
- Hyperparameters: set before training by the data scientist (max_depth, learning_rate, n_estimators)
- Hyperparameter tuning = finding the best hyperparameter values via systematic search
Tuning Strategies
- Grid Search: try every combination of specified values — exhaustive but slow
- Random Search: try random combinations — faster, often finds good solutions
- Bayesian Optimization: uses prior results to guide next search — most sample-efficient
- Early Stopping (XGBoost): `early_stopping_rounds=10` stops training if the validation metric has not improved for 10 rounds; helps prevent overfitting
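The strategies above can be compared on a toy problem. In this sketch `toy_score` is a made-up stand-in for a validation score (a real search would train and evaluate a model for each candidate), and `best_round_with_early_stopping` mimics the patience rule behind `early_stopping_rounds` rather than calling XGBoost itself.

```python
import itertools
import random


def toy_score(max_depth, learning_rate):
    """Stand-in for a validation score; peaks at max_depth=6, learning_rate=0.1."""
    return 1.0 - 0.01 * (max_depth - 6) ** 2 - (learning_rate - 0.1) ** 2


def grid_search(grid, score_fn):
    """Score every combination in the grid and return the best one."""
    names = list(grid)
    candidates = [dict(zip(names, values))
                  for values in itertools.product(*grid.values())]
    return max(candidates, key=lambda c: score_fn(**c))


def random_search(grid, score_fn, n_iter, seed=0):
    """Score n_iter randomly drawn combinations and return the best one."""
    rng = random.Random(seed)
    candidates = [{name: rng.choice(values) for name, values in grid.items()}
                  for _ in range(n_iter)]
    return max(candidates, key=lambda c: score_fn(**c))


def best_round_with_early_stopping(val_losses, patience):
    """Return the best round index, stopping after `patience` rounds without improvement."""
    best_round, best_loss = 0, float("inf")
    for round_index, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = round_index, loss
        elif round_index - best_round >= patience:
            break
    return best_round


grid = {"max_depth": [3, 6, 9], "learning_rate": [0.01, 0.1, 0.3]}
best_grid = grid_search(grid, toy_score)
best_random = random_search(grid, toy_score, n_iter=4)
print(best_grid)  # {'max_depth': 6, 'learning_rate': 0.1}
print(best_round_with_early_stopping([0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5], patience=3))  # 2
```

Grid search evaluates all 9 combinations here, while random search evaluates only 4 and may or may not hit the optimum. In the early-stopping example training halts before the late improvement to 0.5 is ever seen, which is the cost of a small patience value.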
8. MLflow — Experiment Tracking
What is MLflow?
- Open-source platform to track, manage, and reproduce ML experiments
- Experiment: a named collection of related runs
- Run: one execution of ML code — one training job
- MLflow UI: web interface to compare runs visually
- Benefits: compare runs, reproduce best run, track what changed between experiments
MLflow Logging — PAM
- Parameters: `mlflow.log_param("max_depth", 6)` records hyperparameters set before training
- Artifacts: `mlflow.log_artifact("model.pkl")` records saved model files, plots, data files
- Metrics: `mlflow.log_metric("accuracy", 0.94)` records evaluation scores from training/test
- Memory trick: PAM (Parameters, Artifacts, Metrics)
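The three logging calls are usually wrapped in a single run. The sketch below shows one way to do that; `log_training_run` is a hypothetical helper name, and `mlflow` is imported inside the function so that merely defining it does not require the package to be installed.

```python
def log_training_run(experiment_name, params, metrics, artifact_paths=()):
    """Record one training run in MLflow: parameters, metrics, then artifacts."""
    import mlflow  # lazy import: requires the mlflow package at call time

    mlflow.set_experiment(experiment_name)
    with mlflow.start_run() as run:
        for name, value in params.items():
            mlflow.log_param(name, value)
        for name, value in metrics.items():
            mlflow.log_metric(name, value)
        for path in artifact_paths:
            mlflow.log_artifact(path)  # each path must be an existing local file
        return run.info.run_id
```

A call such as `log_training_run("nca-ads-demo", {"max_depth": 6}, {"accuracy": 0.94}, ["model.pkl"])` (the experiment name is made up) would create one run that can then be compared against others in the MLflow UI.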
9. Model Saving, Loading, and Serving
Saving and Loading Models
- XGBoost: `model.save_model('model.json')` and `model.load_model('model.json')`
- cuML: `pickle.dump(model, open('model.pkl','wb'))` and `pickle.load(open('model.pkl','rb'))`
- Save alongside metadata: training date, feature list, evaluation metrics, data version
- Prediction: `model.predict(X_test)` uses the same API for all cuML/XGBoost models
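A pickle round trip with a metadata sidecar might look like the sketch below. A plain dictionary stands in for a fitted estimator so the example runs without a GPU; the metadata values and file names are hypothetical. Files are opened with context managers so they are closed reliably, and pickle files should only ever be loaded from trusted sources.

```python
import json
import os
import pickle
import tempfile

# Stand-in for a fitted estimator; a real cuML model object would be pickled the same way.
model = {"coef": [2.0, -1.0], "intercept": 0.5}
metadata = {
    "training_date": "2026-01-15",  # hypothetical values for illustration
    "features": ["age", "income"],
    "metrics": {"accuracy": 0.94},
    "data_version": "v1",
}

with tempfile.TemporaryDirectory() as workdir:
    model_path = os.path.join(workdir, "model.pkl")
    meta_path = os.path.join(workdir, "model_metadata.json")

    # Save the model and its metadata side by side.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(meta_path, "w") as f:
        json.dump(metadata, f, indent=2)

    # Load both back, as a serving process would.
    with open(model_path, "rb") as f:
        restored_model = pickle.load(f)
    with open(meta_path) as f:
        restored_metadata = json.load(f)

print(restored_model == model, restored_metadata["features"])  # True ['age', 'income']
```

Keeping the feature list next to the model lets the serving code check that incoming data has the same columns, in the same order, as the training data.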
Model Artifacts Best Practices
- Log model files as MLflow artifacts for reproducibility
- Save the full feature list used during training — essential for serving
- Record the exact training dataset version (date, hash, or path)
- Use MLflow Model Registry to track model versions in production
- Version control ensures any past run can be reproduced exactly
10. Model Drift and Production Monitoring
Data Drift
- Definition: the statistical distribution of model input features changes over time
- Result: model receives data unlike what it was trained on
- Example: model trained on 2023 customer behavior used in 2026 when behavior changed
- Detection: monitor feature statistics in production vs training baseline
- Response: retrain on recent data; alert on-call engineer
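Monitoring feature statistics against a training baseline can start with a very simple check. The sketch below measures how far the production mean of one feature has moved, in units of the baseline standard deviation. The function names, the sample ages, and the 0.5 threshold are illustrative assumptions; production systems often use richer tests such as the population stability index or a Kolmogorov-Smirnov test.

```python
import statistics


def mean_shift_score(baseline, production):
    """Shift of the production mean, in units of the baseline standard deviation."""
    baseline_std = statistics.stdev(baseline)
    return abs(statistics.mean(production) - statistics.mean(baseline)) / baseline_std


def has_drifted(baseline, production, threshold=0.5):
    """Flag drift when the shift exceeds threshold (an illustrative cutoff)."""
    return mean_shift_score(baseline, production) > threshold


training_ages = [30, 32, 35, 38, 40, 42, 45]
recent_ages = [48, 50, 52, 55, 57, 60, 62]
print(has_drifted(training_ages, recent_ages))    # True
print(has_drifted(training_ages, training_ages))  # False
```

A check like this would run per feature on a schedule, with a `True` result triggering the alerting or retraining response listed above.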
Concept Drift
- Definition: the relationship between input features and the target label changes
- Result: model predictions become stale even if input distribution looks the same
- Example: fraud patterns change; what was "normal" before is now fraudulent
- Monitoring signals: accuracy degradation, distribution shift in predictions, feature statistics drift
- Tools: custom monitoring scripts, W&B monitoring, MLflow model registry
Memory Hooks
Six high-retention memory anchors to lock in the most exam-critical concepts for Topics 2 and 6.
Unsupervised = student finds patterns alone.
Recall: did you find ALL the YESes?
Train bad, Test bad = Underfit.
Both good = Just right.
Concept drift = relationship changed.