NCP-ADS Exam Prep · Topic 3

Machine Learning with cuML & XGBoost

GPU Algorithms · Model Training · UMAP · Hyperparameter Tuning

cuML brings much of the scikit-learn ML algorithm catalog to the GPU, while XGBoost's CUDA backend delivers state-of-the-art gradient boosting on tabular data. Together they form the core ML toolkit covered by NCP-ADS.

cuML at a Glance

  • Drop-in replacement: same .fit(), .predict(), .transform() API as scikit-learn
  • Native cuDF: accepts cuDF DataFrames — no CPU round-trip required
  • Coverage: linear models, clustering, neighbors, decomposition, manifold, SVM, ensemble, multi-GPU
  • Speedup: often 10–50× faster than CPU scikit-learn on large datasets

XGBoost on GPU

  • Best-in-class tabular ML: gradient boosted trees consistently win on structured data
  • GPU backend: device='cuda' enables GPU-accelerated tree building
  • Histogram method: tree_method='hist' — parallelizes bin finding and node splitting
  • cuDF integration: xgb.DMatrix() accepts cuDF DataFrames directly

Model Evaluation Pipeline

  • GPU split: cuml.model_selection.train_test_split()
  • Cross-validation: sklearn's cross_validate() wraps cuML estimators; each fold trains on GPU
  • Metrics: cuml.metrics covers accuracy, AUC-ROC, F1, MSE, R², and more
  • Tuning: GridSearchCV, RandomizedSearchCV, Optuna, Bayesian optimization

ML Algorithm Quick Reference

Algorithm | cuML Class | Task | Key Note
Linear Regression | cuml.LinearRegression | Regression | GPU-accelerated OLS
Logistic Regression | cuml.LogisticRegression | Classification | L1/L2 regularization supported
K-Means | cuml.KMeans | Clustering | Parallel centroid updates on GPU
DBSCAN | cuml.DBSCAN | Clustering | No K needed; density-based
PCA | cuml.PCA | Decomposition | GPU SVD; reduces dimensions
UMAP | cuml.UMAP | Manifold/Viz | Drastically faster than CPU umap-learn
Random Forest | cuml.RandomForestClassifier | Ensemble | GPU parallel tree building
XGBoost | xgb.XGBClassifier(device='cuda') | Gradient Boosting | Best for tabular data

cuML Algorithms

cuML provides a scikit-learn-compatible GPU ML library. All algorithms accept cuDF DataFrames natively and execute entirely on GPU — no pandas round-trip needed.

API Compatibility

Scikit-learn Compatible API

  • .fit(X, y) — train the model on GPU data
  • .predict(X) — generate predictions; returns cuDF Series or cupy array
  • .transform(X) — dimensionality reduction, scaling (unsupervised transforms)
  • .fit_transform(X) — fit then transform in one call
  • Accepts cuDF DataFrames directly — no to_pandas() or CPU round-trip
  • Drop-in: change from sklearn to from cuml and the same code runs on GPU
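A minimal sketch of the drop-in pattern; the column names and values here are made up for illustration:

    import cudf
    from cuml.linear_model import LogisticRegression

    # Hypothetical GPU-resident data; in practice it would come from cudf.read_parquet(), etc.
    X = cudf.DataFrame({'f1': [0.1, 0.4, 0.9, 0.3] * 250,
                        'f2': [1.0, 0.2, 0.7, 0.5] * 250})
    y = cudf.Series([0, 0, 1, 0] * 250)

    model = LogisticRegression()   # same class name and defaults as sklearn
    model.fit(X, y)                # trains entirely on GPU
    preds = model.predict(X)       # stays on GPU; no pandas round-trip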

Linear Models

  • LinearRegression: ordinary least squares regression on GPU
  • LogisticRegression: binary and multiclass classification; supports L1/L2/ElasticNet
  • Ridge: L2-regularized linear regression — penalizes large coefficients
  • Lasso: L1-regularized regression — drives sparse coefficients to zero; feature selection
  • ElasticNet: combination of L1 + L2 penalties for balanced regularization
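As a quick illustration of the regularized variants, a sketch with synthetic CuPy data (the sizes and alpha values are arbitrary placeholders):

    import cupy as cp
    from cuml.linear_model import Ridge, Lasso

    # Synthetic regression data: only the first two features are informative
    X = cp.random.random((10_000, 20), dtype=cp.float32)
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1]

    ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients
    lasso = Lasso(alpha=0.1).fit(X, y)   # L1: zeroes out weak features
    print(lasso.coef_)                   # mostly zeros -> implicit feature selection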

Clustering & Neighbors

Clustering

  • KMeans: GPU parallel centroid updates; scales to millions of points; requires K upfront
  • DBSCAN: density-based spatial clustering; no K required; identifies noise points as outliers; GPU-accelerated neighbor search
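A short sketch contrasting the two on synthetic data (the eps and min_samples values are placeholders you would tune):

    import cupy as cp
    from cuml.cluster import KMeans, DBSCAN

    X = cp.random.random((50_000, 8), dtype=cp.float32)

    km = KMeans(n_clusters=5, random_state=42).fit(X)   # K chosen upfront
    print(km.labels_[:10])                              # cluster id per row

    db = DBSCAN(eps=0.3, min_samples=10).fit(X)         # no K; density-based
    print(int((db.labels_ == -1).sum()))                # noise points get label -1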

Neighbors

  • NearestNeighbors: brute-force and approximate nearest neighbor search on GPU
  • KNeighborsClassifier: k-NN classification using GPU-accelerated distance computation
  • Distance metrics: Euclidean, cosine, L1, L∞ — all computed in parallel on GPU cores
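For example, a brute-force k-NN query might look like this (dimensions and k are illustrative):

    import cupy as cp
    from cuml.neighbors import NearestNeighbors

    X = cp.random.random((100_000, 64), dtype=cp.float32)

    nn = NearestNeighbors(n_neighbors=5, metric='euclidean').fit(X)
    distances, indices = nn.kneighbors(X[:10])   # 5 nearest neighbors of the first 10 rows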

Decomposition

  • PCA: principal component analysis via GPU SVD — fast dimensionality reduction
  • TruncatedSVD: randomized SVD for sparse data; reduces dimensions without centering
  • Outputs stay on GPU as CuPy arrays (or cuDF, matching the input type) and can be used directly in downstream cuML models
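A minimal PCA sketch, reducing a hypothetical 256-feature matrix to 50 components:

    import cupy as cp
    from cuml.decomposition import PCA

    X = cp.random.random((100_000, 256), dtype=cp.float32)

    pca = PCA(n_components=50)
    X_reduced = pca.fit_transform(X)             # GPU SVD under the hood
    print(pca.explained_variance_ratio_[:5])     # variance captured per component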

Advanced Algorithms

UMAP — Manifold Learning

  • Uniform Manifold Approximation and Projection — non-linear dimensionality reduction
  • Best for visualizing high-dimensional embeddings (e.g., sentence embeddings, image features)
  • GPU cuML UMAP is drastically faster than CPU umap-learn — minutes vs. hours on large datasets
  • Preserves both local and global structure better than t-SNE
  • Common use: reduce 768-dim transformer embeddings to 2D for visualization
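A sketch of that common workflow, with random data standing in for real transformer embeddings:

    import cupy as cp
    from cuml.manifold import UMAP

    # Stand-in for 768-dim sentence embeddings
    embeddings = cp.random.random((200_000, 768), dtype=cp.float32)

    reducer = UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
    coords_2d = reducer.fit_transform(embeddings)   # 2D coordinates for plotting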

SVM & Ensemble

  • SVC / SVR: support vector classification and regression — GPU-accelerated via cuML kernel solvers
  • RandomForestClassifier / RandomForestRegressor: GPU parallel tree building; handles large feature counts
  • Random Forest builds trees in parallel across GPU threads — significantly faster than CPU sklearn

Multi-GPU Training

  • cuml.dask.* wrappers enable distributed training across multiple GPUs
  • Example: from cuml.dask.ensemble import RandomForestClassifier
  • Works with LocalCUDACluster or multi-node Dask clusters via RAPIDS UCX
  • Each GPU trains on a partition; results merged for final model
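A sketch of the distributed pattern, assuming a single node with multiple GPUs and a hypothetical features.parquet file with a label column:

    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client
    import dask_cudf
    from cuml.dask.ensemble import RandomForestClassifier

    cluster = LocalCUDACluster()   # one Dask worker per visible GPU
    client = Client(cluster)

    ddf = dask_cudf.read_parquet('features.parquet')   # hypothetical path
    X, y = ddf.drop(columns='label'), ddf['label']

    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(X, y)                   # each GPU trains on its own partitions
    preds = rf.predict(X)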

XGBoost on GPU

XGBoost is often the best performer on tabular data. The CUDA backend enables fully GPU-accelerated gradient boosted tree training with the same API as CPU XGBoost.

Enabling GPU Training

GPU Device Configuration

  • Functional API: xgb.train({'device': 'cuda'}, dtrain, ...) — the params dict is the first positional argument
  • Sklearn API: XGBClassifier(device='cuda') or XGBRegressor(device='cuda')
  • Tree method: tree_method='hist' — histogram-based algorithm parallelizes bin finding and node splitting on GPU
  • Histogram binning is precomputed once; GPU massively parallelizes split candidate evaluation
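Putting those settings together with the sklearn-style API (random data used as a stand-in):

    import numpy as np
    import xgboost as xgb

    X_train = np.random.random((10_000, 20)).astype(np.float32)
    y_train = np.random.randint(0, 2, 10_000)

    clf = xgb.XGBClassifier(
        device='cuda',        # GPU-accelerated training (XGBoost >= 2.0 syntax)
        tree_method='hist',   # histogram algorithm: parallel bin finding and splits
        n_estimators=500,
        max_depth=6,
    )
    clf.fit(X_train, y_train)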

DMatrix — XGBoost's Data Structure

  • xgb.DMatrix(data=df, label=y) — optimized sparse data structure for XGBoost
  • Supports cuDF DataFrames directly — no pandas conversion needed
  • Precomputes feature statistics for fast split finding
  • Stores data in compressed column format for cache efficiency
  • For GPU: data stays in GPU memory throughout training
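The same idea with the functional API and a cuDF DataFrame (column names are made up):

    import cudf
    import numpy as np
    import xgboost as xgb

    df = cudf.DataFrame({'f1': np.random.random(10_000),
                         'f2': np.random.random(10_000)})
    y = cudf.Series(np.random.randint(0, 2, 10_000))

    dtrain = xgb.DMatrix(data=df, label=y)   # accepts cuDF directly; data stays on GPU
    params = {'device': 'cuda', 'tree_method': 'hist', 'objective': 'binary:logistic'}
    booster = xgb.train(params, dtrain, num_boost_round=200)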

Key Hyperparameters

Parameter | Role | Tuning Direction
n_estimators | Number of boosting rounds (trees) | More = better fit; use early stopping
max_depth | Maximum tree depth; controls complexity | Higher = more overfitting risk; typical range 3–10
learning_rate (eta) | Shrinkage applied to each tree's contribution | Lower = needs more trees; overfits more slowly
subsample | Row sampling fraction per tree | 0.6–0.9 reduces variance
colsample_bytree | Feature sampling fraction per tree | 0.6–0.9 adds randomness; prevents overfitting
reg_alpha | L1 regularization on leaf weights | Increases sparsity; drives weights to zero
reg_lambda | L2 regularization on leaf weights | Smooths weights; default is 1

Early Stopping & Feature Importance

Early Stopping

  • early_stopping_rounds=50 — stops training when validation metric doesn't improve for 50 rounds
  • Prevents overfitting without manually tuning n_estimators
  • Requires an eval_set: model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
  • Best model is automatically restored from the round with best validation score
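An end-to-end sketch on synthetic data (the metric and the n_estimators cap are illustrative):

    import numpy as np
    import xgboost as xgb

    X_tr = np.random.random((8_000, 20)).astype(np.float32)
    y_tr = np.random.randint(0, 2, 8_000)
    X_val = np.random.random((2_000, 20)).astype(np.float32)
    y_val = np.random.randint(0, 2, 2_000)

    clf = xgb.XGBClassifier(
        device='cuda', tree_method='hist',
        n_estimators=2_000,          # generous cap; early stopping finds the real count
        early_stopping_rounds=50,    # stop after 50 rounds without improvement
        eval_metric='auc',
    )
    clf.fit(X_tr, y_tr, eval_set=[(X_val, y_val)])
    print(clf.best_iteration)        # round with the best validation score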

Feature Importance

  • model.feature_importances_ — sklearn-style array of importance scores
  • xgb.plot_importance(model) — bar chart of feature importances
  • Weight: number of times a feature is used to split
  • Gain: average training loss reduction from splits using this feature (most informative)
  • Cover: average number of samples affected by splits using this feature
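For instance, with any fitted XGBClassifier (here a quick throwaway fit on random data):

    import numpy as np
    import xgboost as xgb

    X = np.random.random((5_000, 10)).astype(np.float32)
    y = np.random.randint(0, 2, 5_000)
    clf = xgb.XGBClassifier(device='cuda', tree_method='hist', n_estimators=100).fit(X, y)

    print(clf.feature_importances_)                             # sklearn-style score array
    print(clf.get_booster().get_score(importance_type='gain'))  # avg loss reduction per feature
    xgb.plot_importance(clf, importance_type='weight')          # split-count bar chart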

Model Evaluation & Tuning

Rigorous evaluation and systematic hyperparameter tuning are critical for building production-quality models. cuML provides GPU-accelerated versions of cross-validation, metrics, and search utilities.

Train/Test Split & Cross-Validation

Data Splitting

  • cuml.model_selection.train_test_split() — GPU version of sklearn's split; data stays on GPU
  • Returns cuDF DataFrames/Series for both train and test sets
  • Parameters: test_size, random_state, stratify — same interface as sklearn
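A minimal sketch with hypothetical column names, assuming the stratify keyword behaves as in sklearn (it does in recent RAPIDS releases):

    import numpy as np
    import cudf
    from cuml.model_selection import train_test_split

    df = cudf.DataFrame({'f1': np.random.random(10_000),
                         'f2': np.random.random(10_000),
                         'label': np.random.randint(0, 2, 10_000)})

    X_train, X_test, y_train, y_test = train_test_split(
        df.drop(columns='label'), df['label'],
        test_size=0.2, random_state=42, stratify=df['label'])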

Cross-Validation

  • K-fold CV: splits data into K folds; trains on K-1, validates on 1; repeats K times
  • cross_validate(): use sklearn's implementation with cuML estimators; each fold trains and scores on GPU
  • Gives more reliable performance estimate than single train/test split
  • Stratified K-fold: preserves class proportions in each fold — use for imbalanced data
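A sketch of the wrapping pattern, assuming host NumPy inputs (cuML estimators accept them and mirror the input type on output):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_validate
    from cuml.linear_model import LogisticRegression

    X = np.random.random((10_000, 20)).astype(np.float32)
    y = np.random.randint(0, 2, 10_000)

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_validate(LogisticRegression(), X, y, cv=cv, scoring='accuracy')
    print(scores['test_score'].mean())   # mean accuracy across the 5 folds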

Metrics

Classification Metrics (cuml.metrics)

  • Accuracy: fraction of correct predictions; misleading for imbalanced classes
  • Precision: TP / (TP + FP) — of predicted positives, how many are correct
  • Recall: TP / (TP + FN) — of actual positives, how many were found
  • F1: harmonic mean of precision and recall; balances both
  • AUC-ROC: area under ROC curve; threshold-independent; 0.5 = random, 1.0 = perfect

Regression Metrics (cuml.metrics)

  • MSE (Mean Squared Error): average of squared residuals; penalizes large errors heavily
  • RMSE: square root of MSE; same units as target variable
  • MAE (Mean Absolute Error): average absolute residuals; robust to outliers
  • R² (R-squared): proportion of variance explained by model; 1.0 = perfect, 0 = mean baseline
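Both families live in cuml.metrics; a compact sketch with toy arrays for illustration:

    import cupy as cp
    from cuml.metrics import accuracy_score, roc_auc_score, mean_squared_error, r2_score

    y_true = cp.array([0, 1, 1, 0, 1])
    y_prob = cp.array([0.2, 0.8, 0.6, 0.3, 0.9])
    print(accuracy_score(y_true, (y_prob > 0.5).astype(cp.int32)))
    print(roc_auc_score(y_true, y_prob))

    y = cp.array([1.0, 2.0, 3.0])
    pred = cp.array([1.1, 1.9, 3.2])
    print(mean_squared_error(y, pred), r2_score(y, pred))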

Hyperparameter Tuning Strategies

Grid & Random Search

  • GridSearchCV: exhaustive search over all parameter combinations; expensive for large grids
  • RandomizedSearchCV: randomly samples from param distributions; more efficient for large spaces
  • Both are sklearn utilities; they can wrap cuML estimators directly since the APIs match

Bayesian Optimization

  • Most sample-efficient method for expensive GPU training runs
  • Builds a surrogate model of the objective function; samples where improvement is likely
  • Tools: Optuna, WandB Sweeps, Ax — all GPU-friendly
  • Best choice when each training run takes minutes or hours
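A hedged Optuna sketch for XGBoost on GPU (the search ranges and trial count are placeholders; Optuna's default TPE sampler provides the Bayesian-style guidance):

    import numpy as np
    import optuna
    import xgboost as xgb
    from sklearn.model_selection import train_test_split

    X = np.random.random((10_000, 20)).astype(np.float32)
    y = np.random.randint(0, 2, 10_000)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    def objective(trial):
        clf = xgb.XGBClassifier(
            device='cuda', tree_method='hist', n_estimators=300,
            max_depth=trial.suggest_int('max_depth', 3, 10),
            learning_rate=trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
            subsample=trial.suggest_float('subsample', 0.6, 0.9))
        clf.fit(X_tr, y_tr)
        return clf.score(X_val, y_val)   # validation accuracy to maximize

    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=30)
    print(study.best_params)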

Learning Curves & Diagnostics

  • Plot train vs. validation metric over training iterations
  • Overfitting: train metric improves, val metric worsens — add regularization or reduce complexity
  • Underfitting: both train and val metrics are low — increase model complexity or features
  • SHAP values: model-agnostic feature attribution; shows each feature's contribution to a prediction
  • XGBoost gain/weight/cover: built-in feature importance measures
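A sketch of plotting an XGBoost learning curve from its recorded eval history (synthetic data; matplotlib assumed available):

    import numpy as np
    import matplotlib.pyplot as plt
    import xgboost as xgb

    X_tr = np.random.random((8_000, 20)).astype(np.float32)
    y_tr = np.random.randint(0, 2, 8_000)
    X_val = np.random.random((2_000, 20)).astype(np.float32)
    y_val = np.random.randint(0, 2, 2_000)

    clf = xgb.XGBClassifier(device='cuda', tree_method='hist',
                            n_estimators=300, eval_metric='auc')
    clf.fit(X_tr, y_tr, eval_set=[(X_tr, y_tr), (X_val, y_val)])

    history = clf.evals_result()   # per-round metrics for each eval set
    plt.plot(history['validation_0']['auc'], label='train')
    plt.plot(history['validation_1']['auc'], label='validation')
    plt.xlabel('boosting round'); plt.ylabel('AUC'); plt.legend()
    plt.show()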

Memory Hooks

Six mnemonic devices to lock in the most exam-critical cuML and XGBoost concepts.

🔌
cuML API Compatibility
"Fit, Predict, Transform — Same Game, GPU Flame"
cuML uses identical .fit(), .predict(), .transform() as scikit-learn. Change the import, keep the code — it runs on GPU.
🗺️
UMAP vs PCA
"PCA is Linear Lens; UMAP is Curved Map"
PCA does linear projection (SVD). UMAP does non-linear manifold learning — preserves cluster structure in 2D. GPU cuML UMAP is drastically faster than CPU umap-learn for large embedding sets.
🌲
XGBoost GPU Key
"CUDA Device + Hist = Trees at GPU Speed"
Set device='cuda' and tree_method='hist'. The histogram method parallelizes bin computation across GPU threads — that's the speed secret.
🛑
Early Stopping
"50 Rounds No Better? Stop the Setter!"
early_stopping_rounds=50 halts training when validation metric fails to improve for 50 consecutive rounds. Best model from best round is kept automatically — prevents overfitting.
📊
Bayesian vs Grid Search
"Bayes Learns, Grid Burns"
GridSearchCV tries every combination — O(n^k) evaluations. Bayesian optimization builds a surrogate and targets promising regions — most sample-efficient for expensive GPU runs.
📈
Overfit vs Underfit
"Train Up Val Down = Overfit Town; Both Low = Time to Grow"
Learning curve diagnosis: if train metric improves but val worsens → overfitting (regularize). If both stay low → underfitting (add complexity or features).

Flashcards & Advisor

Each flashcard below pairs a prompt with its answer. The Study Advisor that follows gives topic-specific guidance.

UMAP in cuML
What type of algorithm is it and when is GPU UMAP preferred?
UMAP is a non-linear dimensionality reduction algorithm for visualizing high-dimensional embeddings. GPU cuML UMAP is drastically faster than CPU umap-learn and is preferred whenever the dataset exceeds roughly 50K points.
XGBoost device='cuda'
What two params enable full GPU training?
Set device='cuda' and tree_method='hist'. The histogram method parallelizes bin finding and node splitting across GPU threads, giving massive speedup over CPU training.
DMatrix
What is it and what's special about cuDF support?
XGBoost's optimized internal data structure. xgb.DMatrix(data=df, label=y) accepts cuDF DataFrames directly — data stays in GPU memory throughout training without CPU round-trip.
DBSCAN vs KMeans
Key difference in how they define clusters
KMeans requires K clusters upfront; assigns by centroid distance. DBSCAN requires no K; finds dense regions; marks low-density points as noise (-1) — useful for anomaly detection.
Early Stopping in XGBoost
How do you configure it and what does it prevent?
Set early_stopping_rounds=N and provide an eval_set. Training stops when validation metric doesn't improve for N consecutive rounds — prevents overfitting and avoids manual n_estimators tuning.
AUC-ROC
What does it measure and what values indicate good/bad?
Area Under ROC curve — measures classifier's ability to discriminate between classes independent of threshold. 0.5 = random guessing, 1.0 = perfect classifier. Useful for imbalanced class problems.
Bayesian Optimization
Why prefer it over GridSearchCV for GPU tuning?
Bayesian optimization builds a surrogate model of the objective and samples where improvement is most likely — most sample-efficient method. Grid search requires evaluating every combination, which is prohibitively expensive for GPU training.
cuml.dask.*
What does it enable and how does it work?
Multi-GPU distributed training via Dask. cuml.dask.ensemble.RandomForestClassifier trains each partition on a separate GPU, then merges results. Requires LocalCUDACluster or multi-node Dask cluster.

Study Advisor

cuML API Essentials

  • Import: from cuml import LinearRegression — only the import path changes from sklearn
  • Train: model.fit(X_cudf, y_cudf) — cuDF DataFrames/Series work natively (NumPy and CuPy arrays are accepted too)
  • Predict: preds = model.predict(X_test) — returns cupy/cuDF array on GPU
  • Transform: X_reduced = model.transform(X) — for PCA, UMAP, scalers
  • No .to_pandas() needed before fitting — data stays in GPU memory throughout
  • Multi-GPU: prefix with cuml.dask. and run inside a Dask cluster context

Ready to Pass NCP-ADS?

Test your cuML & XGBoost knowledge with full practice exams on FlashGenius.
