Machine Learning with cuML & XGBoost
cuML brings much of the scikit-learn ML algorithm catalog to the GPU, while XGBoost's CUDA backend delivers state-of-the-art gradient boosting on tabular data. Together they form the machine-learning core of the NCP-ADS curriculum.
cuML at a Glance
- Drop-in replacement: same `.fit()`, `.predict()`, `.transform()` API as scikit-learn
- Native cuDF: accepts cuDF DataFrames — no CPU round-trip required
- Coverage: linear models, clustering, neighbors, decomposition, manifold, SVM, ensemble, multi-GPU
- Speedup: 10–50× faster than CPU scikit-learn on large datasets
XGBoost on GPU
- Best-in-class tabular ML: gradient boosted trees consistently win on structured data
- GPU backend: `device='cuda'` enables GPU-accelerated tree building
- Histogram method: `tree_method='hist'` — parallelizes bin finding and node splitting
- cuDF integration: `xgb.DMatrix()` accepts cuDF DataFrames directly
Model Evaluation Pipeline
- GPU split: `cuml.model_selection.train_test_split()`
- Cross-validation: `cross_validate()` — GPU parallel fold evaluation
- Metrics: `cuml.metrics` covers accuracy, AUC-ROC, F1, MSE, R², and more
- Tuning: GridSearchCV, RandomizedSearchCV, Optuna, Bayesian optimization
ML Algorithm Quick Reference
| Algorithm | cuML Class | Task | Key Note |
|---|---|---|---|
| Linear Regression | cuml.LinearRegression | Regression | GPU-accelerated OLS |
| Logistic Regression | cuml.LogisticRegression | Classification | L1/L2 regularization supported |
| K-Means | cuml.KMeans | Clustering | Parallel centroid updates on GPU |
| DBSCAN | cuml.DBSCAN | Clustering | No K needed; density-based |
| PCA | cuml.PCA | Decomposition | GPU SVD; reduces dimensions |
| UMAP | cuml.UMAP | Manifold/Viz | GPU UMAP drastically faster than CPU umap-learn |
| Random Forest | cuml.RandomForestClassifier | Ensemble | GPU parallel tree building |
| XGBoost | xgb.XGBClassifier(device='cuda') | Gradient Boosting | Best for tabular data |
cuML Algorithms
cuML provides a scikit-learn-compatible GPU ML library. All algorithms accept cuDF DataFrames natively and execute entirely on GPU — no pandas round-trip needed.
Scikit-learn Compatible API
- `.fit(X, y)` — train the model on GPU data
- `.predict(X)` — generate predictions; returns a cuDF Series or CuPy array
- `.transform(X)` — dimensionality reduction, scaling (unsupervised transforms)
- `.fit_transform(X)` — fit then transform in one call
- Accepts cuDF DataFrames directly — no `to_pandas()` or CPU round-trip
- Drop-in: change `from sklearn` to `from cuml` and the same code runs on GPU
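A minimal sketch of the drop-in pattern — only the import changes, and the tiny DataFrame values here are purely illustrative:

```python
import cudf
from cuml import LinearRegression  # sklearn equivalent: from sklearn.linear_model import LinearRegression

# Toy GPU-resident data (illustrative values)
X = cudf.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0], "x2": [0.5, 1.5, 2.5, 3.5]})
y = cudf.Series([2.0, 4.0, 6.0, 8.0])

model = LinearRegression()
model.fit(X, y)            # trains entirely on the GPU
preds = model.predict(X)   # predictions stay in GPU memory
```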
Linear Models
- LinearRegression: ordinary least squares regression on GPU
- LogisticRegression: binary and multiclass classification; supports L1/L2/ElasticNet
- Ridge: L2-regularized linear regression — penalizes large coefficients
- Lasso: L1-regularized regression — drives sparse coefficients to zero; feature selection
- ElasticNet: combination of L1 + L2 penalties for balanced regularization
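A sketch contrasting the penalties, assuming synthetic CuPy data in which only the first feature drives the target (the `alpha` values are illustrative):

```python
import cupy as cp
import cudf
from cuml import Lasso, Ridge

cp.random.seed(0)
# Synthetic data: only feature 0 matters (illustrative)
X_dev = cp.random.standard_normal((1000, 5), dtype=cp.float32)
y = cudf.Series(X_dev[:, 0] * 3.0 + 0.1 * cp.random.standard_normal(1000, dtype=cp.float32))
X = cudf.DataFrame(X_dev)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty drives weak coefficients to zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty shrinks all coefficients smoothly
print(lasso.coef_)                   # near-zero entries for features 1-4
print(ridge.coef_)                   # small but nonzero entries
```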
Clustering
- KMeans: GPU parallel centroid updates; scales to millions of points; requires K upfront
- DBSCAN: density-based spatial clustering; no K required; identifies noise points as outliers; GPU-accelerated neighbor search
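A short sketch of both clusterers on synthetic blobs (cluster counts and density parameters are illustrative):

```python
from cuml import KMeans, DBSCAN
from cuml.datasets import make_blobs

# Synthetic GPU-resident blobs (illustrative)
X, _ = make_blobs(n_samples=10_000, n_features=2, centers=5, random_state=7)

km = KMeans(n_clusters=5).fit(X)       # K must be chosen upfront
print(km.cluster_centers_)

db = DBSCAN(eps=0.5, min_samples=10)   # no K; density-based
labels = db.fit_predict(X)             # noise points are labeled -1
```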
Neighbors
- NearestNeighbors: brute-force and approximate nearest neighbor search on GPU
- KNeighborsClassifier: k-NN classification using GPU-accelerated distance computation
- Distance metrics: Euclidean, cosine, L1, L∞ — all computed in parallel on GPU cores
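A sketch of GPU neighbor search on illustrative random data:

```python
import cupy as cp
from cuml.neighbors import NearestNeighbors, KNeighborsClassifier

X = cp.random.random((5000, 16), dtype=cp.float32)  # illustrative features
y = (X[:, 0] > 0.5).astype(cp.int32)                # illustrative labels

nn = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(X)
distances, indices = nn.kneighbors(X)   # GPU-parallel distance computation

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
preds = knn.predict(X)
```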
Decomposition
- PCA: principal component analysis via GPU SVD — fast dimensionality reduction
- TruncatedSVD: randomized SVD for sparse data; reduces dimensions without centering
- Outputs are cupy arrays that can be used directly in downstream cuML models
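A sketch of GPU PCA on illustrative random data:

```python
import cupy as cp
from cuml import PCA

X = cp.random.standard_normal((100_000, 64), dtype=cp.float32)  # illustrative data

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)            # GPU SVD; result stays in device memory
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
# X_reduced is a device array usable directly by downstream cuML models
```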
UMAP — Manifold Learning
- Uniform Manifold Approximation and Projection — non-linear dimensionality reduction
- Best for visualizing high-dimensional embeddings (e.g., sentence embeddings, image features)
- GPU cuML UMAP is drastically faster than CPU umap-learn — minutes vs. hours on large datasets
- Preserves both local and global structure better than t-SNE
- Common use: reduce 768-dim transformer embeddings to 2D for visualization
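A sketch of that embedding-visualization use, with a random array standing in for real transformer embeddings:

```python
import cupy as cp
from cuml import UMAP

# Stand-in for 768-dim transformer embeddings (illustrative)
embeddings = cp.random.standard_normal((50_000, 768), dtype=cp.float32)

reducer = UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
coords_2d = reducer.fit_transform(embeddings)  # minutes on GPU vs. hours on CPU
```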
SVM & Ensemble
- SVC / SVR: support vector classification and regression — GPU-accelerated via cuML kernel solvers
- RandomForestClassifier / RandomForestRegressor: GPU parallel tree building; handles large feature counts
- Random Forest builds trees in parallel across GPU threads — significantly faster than CPU sklearn
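A sketch of both estimators on illustrative data (cuML's Random Forest expects float32 features and int32 labels):

```python
import cupy as cp
from cuml.ensemble import RandomForestClassifier
from cuml.svm import SVC

X = cp.random.random((20_000, 32), dtype=cp.float32)  # illustrative features
y = (X[:, 0] > 0.5).astype(cp.int32)                  # illustrative labels

rf = RandomForestClassifier(n_estimators=100, max_depth=8).fit(X, y)
svc = SVC(kernel="rbf", C=1.0).fit(X, y)
```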
Multi-GPU Training
- `cuml.dask.*` wrappers enable distributed training across multiple GPUs
- Example: `from cuml.dask.ensemble import RandomForestClassifier`
- Works with `LocalCUDACluster` or multi-node Dask clusters via RAPIDS UCX
- Each GPU trains on a partition; results are merged for the final model (see the sketch below)
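A sketch of single-node multi-GPU training; the `train.csv` file and its `label` column are assumed inputs:

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf
from cuml.dask.ensemble import RandomForestClassifier

cluster = LocalCUDACluster()   # one Dask worker per visible GPU
client = Client(cluster)

# 'train.csv' with a 'label' column is an assumed input file
ddf = dask_cudf.read_csv("train.csv")
y = ddf["label"].astype("int32")
X = ddf.drop(columns=["label"]).astype("float32")

model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)            # each GPU trains on its own partition
preds = model.predict(X)   # distributed prediction; results merged
```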
XGBoost on GPU
XGBoost is often the best performer on tabular data. The CUDA backend enables fully GPU-accelerated gradient boosted tree training with the same API as CPU XGBoost.
GPU Device Configuration
- Functional API: `xgb.train({'device': 'cuda', ...}, dtrain)`
- Sklearn API: `XGBClassifier(device='cuda')` or `XGBRegressor(device='cuda')`
- Tree method: `tree_method='hist'` — histogram-based algorithm parallelizes bin finding and node splitting on GPU
- Histogram binning is precomputed once; the GPU massively parallelizes split-candidate evaluation
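A sketch of the sklearn-style API, assuming `X_train`, `y_train`, and `X_test` are cuDF objects from an earlier split:

```python
import xgboost as xgb

# X_train / y_train / X_test assumed to be cuDF objects from earlier steps
model = xgb.XGBClassifier(
    device="cuda",        # build trees on the GPU
    tree_method="hist",   # histogram-based split finding
    n_estimators=500,
    learning_rate=0.05,
)
model.fit(X_train, y_train)
preds = model.predict(X_test)
```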
DMatrix — XGBoost's Data Structure
- `xgb.DMatrix(data=df, label=y)` — optimized sparse data structure for XGBoost
- Supports cuDF DataFrames directly — no pandas conversion needed
- Precomputes feature statistics for fast split finding
- Stores data in compressed column format for cache efficiency
- For GPU: data stays in GPU memory throughout training
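A functional-API sketch, assuming `train_df` and `valid_df` are cuDF DataFrames with a `label` target column:

```python
import xgboost as xgb

# train_df / valid_df assumed to be cuDF DataFrames; 'label' is the target
dtrain = xgb.DMatrix(data=train_df.drop(columns=["label"]), label=train_df["label"])
dvalid = xgb.DMatrix(data=valid_df.drop(columns=["label"]), label=valid_df["label"])

params = {
    "device": "cuda",
    "tree_method": "hist",
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "max_depth": 6,
    "learning_rate": 0.1,
}
booster = xgb.train(params, dtrain, num_boost_round=500,
                    evals=[(dvalid, "valid")])
```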
Key Hyperparameters
| Parameter | Role | Tuning Direction |
|---|---|---|
| n_estimators | Number of boosting rounds (trees) | More = better fit; use early stopping |
| max_depth | Max tree depth; controls complexity | Higher = more overfitting risk; typical range 3–10 |
| learning_rate (eta) | Shrinkage applied to each tree's contribution | Lower = needs more trees; overfits more slowly |
| subsample | Row sampling fraction per tree | 0.6–0.9 reduces variance |
| colsample_bytree | Feature sampling fraction per tree | 0.6–0.9 adds randomness; prevents overfit |
| reg_alpha | L1 regularization on leaf weights | Increases sparsity; drives weights to zero |
| reg_lambda | L2 regularization on leaf weights | Smooths weights; default is 1 |
Early Stopping
- `early_stopping_rounds=50` — stops training when the validation metric doesn't improve for 50 rounds
- Prevents overfitting without manually tuning `n_estimators`
- Requires an `eval_set`: `model.fit(X_train, y_train, eval_set=[(X_val, y_val)])`
- The best model is automatically restored from the round with the best validation score
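A sketch putting the pieces together, assuming train/validation splits from earlier steps:

```python
import xgboost as xgb

# X_train/y_train and X_val/y_val assumed from an earlier GPU split
model = xgb.XGBClassifier(
    device="cuda",
    tree_method="hist",
    n_estimators=2000,          # an upper bound; early stopping picks the real count
    eval_metric="auc",
    early_stopping_rounds=50,   # stop after 50 rounds without validation improvement
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(model.best_iteration, model.best_score)  # round and score of the restored best model
```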
Feature Importance
- `model.feature_importances_` — sklearn-style array of importance scores
- `xgb.plot_importance(model)` — bar chart of feature importances
- Weight: number of times a feature is used to split
- Gain: average training loss reduction from splits using this feature (most informative)
- Cover: average number of samples affected by splits using this feature
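A sketch of the three measures, assuming `model` is the fitted `XGBClassifier` from above:

```python
import xgboost as xgb

# 'model' is a fitted XGBClassifier from the previous steps
print(model.feature_importances_)                   # sklearn-style normalized scores

booster = model.get_booster()
print(booster.get_score(importance_type="gain"))    # most informative measure
print(booster.get_score(importance_type="weight"))  # split counts per feature
print(booster.get_score(importance_type="cover"))   # samples affected per feature

xgb.plot_importance(model)                          # matplotlib bar chart
```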
Model Evaluation & Tuning
Rigorous evaluation and systematic hyperparameter tuning are critical for building production-quality models. cuML provides GPU-accelerated versions of cross-validation, metrics, and search utilities.
Data Splitting
- `cuml.model_selection.train_test_split()` — GPU version of sklearn's split; data stays on GPU
- Returns cuDF DataFrames/Series for both train and test sets
- Parameters: `test_size`, `random_state`, `stratify` — same interface as sklearn
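A sketch, assuming `df` is a cuDF DataFrame with a `label` column:

```python
from cuml.model_selection import train_test_split

# df is an assumed cuDF DataFrame with a 'label' column
X = df.drop(columns=["label"])
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# All four outputs remain cuDF objects in GPU memory
```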
Cross-Validation
- K-fold CV: splits data into K folds; trains on K-1, validates on 1; repeats K times
- `cross_validate()` — K-fold evaluation of GPU estimators; sklearn's implementation can wrap cuML models
- Gives a more reliable performance estimate than a single train/test split
- Stratified K-fold: preserves class proportions in each fold — use for imbalanced data
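One common pattern is to drive a cuML estimator with sklearn's `cross_validate` — cuML estimators expose the sklearn `get_params` interface, so cloning generally works; the data is kept as NumPy here for sklearn's fold indexing. A hedged sketch:

```python
import numpy as np
from sklearn.model_selection import cross_validate, StratifiedKFold
from cuml import LogisticRegression

# Illustrative CPU data; the cuML estimator moves each fold to the GPU
X = np.random.random((10_000, 20)).astype(np.float32)
y = (X[:, 0] > 0.5).astype(np.int32)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # preserves class proportions
scores = cross_validate(LogisticRegression(), X, y, cv=cv, scoring="accuracy")
print(scores["test_score"].mean())
```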
Classification Metrics (cuml.metrics)
- Accuracy: fraction of correct predictions; misleading for imbalanced classes
- Precision: TP / (TP + FP) — of predicted positives, how many are correct
- Recall: TP / (TP + FN) — of actual positives, how many were found
- F1: harmonic mean of precision and recall; balances both
- AUC-ROC: area under ROC curve; threshold-independent; 0.5 = random, 1.0 = perfect
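A sketch, assuming `model`, `X_test`, and `y_test` from the earlier steps and array-like probability output:

```python
from cuml.metrics import accuracy_score, roc_auc_score

# model / X_test / y_test assumed from the earlier GPU split and training steps
preds = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]  # positive-class scores

print(accuracy_score(y_test, preds))
print(roc_auc_score(y_test, proba))        # threshold-independent
```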
Regression Metrics (cuml.metrics)
- MSE (Mean Squared Error): average of squared residuals; penalizes large errors heavily
- RMSE: square root of MSE; same units as target variable
- MAE (Mean Absolute Error): average absolute residuals; robust to outliers
- R² (R-squared): proportion of variance explained by model; 1.0 = perfect, 0 = mean baseline
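A sketch, assuming `reg` is a fitted cuML regressor and `X_test`/`y_test` come from the earlier split:

```python
from cuml.metrics import mean_squared_error, mean_absolute_error, r2_score

# reg / X_test / y_test assumed from earlier steps
preds = reg.predict(X_test)

mse = mean_squared_error(y_test, preds)
print(mse, mse ** 0.5)                     # MSE and RMSE (RMSE in target units)
print(mean_absolute_error(y_test, preds))  # robust to outliers
print(r2_score(y_test, preds))             # 1.0 = perfect, 0 = mean baseline
```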
Grid & Random Search
- GridSearchCV: exhaustive search over all parameter combinations; expensive for large grids
- RandomizedSearchCV: randomly samples from param distributions; more efficient for large spaces
- Both available via cuML or sklearn (sklearn can wrap cuML estimators)
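A sketch of the wrapping pattern, assuming NumPy `X`/`y`; sklearn drives the search while each fit runs on the GPU (the parameter ranges are illustrative):

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

# NumPy X / y assumed; each candidate fit runs on the GPU
param_dist = {
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.29),
    "subsample": uniform(0.6, 0.3),
}
search = RandomizedSearchCV(
    xgb.XGBClassifier(device="cuda", tree_method="hist"),
    param_distributions=param_dist,
    n_iter=20, cv=3, scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```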
Bayesian Optimization
- Most sample-efficient method for expensive GPU training runs
- Builds a surrogate model of the objective function; samples where improvement is likely
- Tools: Optuna, WandB Sweeps, Ax — all GPU-friendly
- Best choice when each training run takes minutes or hours
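A minimal Optuna sketch, assuming the train/validation split from earlier; the search ranges mirror the hyperparameter table above:

```python
import optuna
import xgboost as xgb

# X_train/y_train and X_val/y_val assumed from an earlier split
def objective(trial):
    model = xgb.XGBClassifier(
        device="cuda",
        tree_method="hist",
        n_estimators=1000,
        early_stopping_rounds=50,
        eval_metric="auc",
        max_depth=trial.suggest_int("max_depth", 3, 10),
        learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        subsample=trial.suggest_float("subsample", 0.6, 0.9),
    )
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    return model.best_score  # validation AUC at the best round

study = optuna.create_study(direction="maximize")  # TPE surrogate model by default
study.optimize(objective, n_trials=30)
print(study.best_params)
```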
Learning Curves & Diagnostics
- Plot train vs. validation metric over training iterations
- Overfitting: train metric improves, val metric worsens — add regularization or reduce complexity
- Underfitting: both train and val metrics are low — increase model complexity or features
- SHAP values: model-agnostic feature attribution; shows each feature's contribution to a prediction
- XGBoost gain/weight/cover: built-in feature importance measures
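A sketch of a learning-curve plot, assuming `model` was fitted with `eval_set=[(X_train, y_train), (X_val, y_val)]`:

```python
import matplotlib.pyplot as plt

# 'model' assumed fitted with eval_set=[(X_train, y_train), (X_val, y_val)]
history = model.evals_result()
plt.plot(history["validation_0"]["auc"], label="train")
plt.plot(history["validation_1"]["auc"], label="validation")
plt.xlabel("boosting round")
plt.ylabel("AUC")
plt.legend()
plt.show()
# Diverging curves (train up, validation down) signal overfitting
```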
Memory Hooks
Mnemonic devices to lock in the most exam-critical cuML and XGBoost concepts.
- Drop-in API: same `.fit()`, `.predict()`, `.transform()` as scikit-learn. Change the import, keep the code — it runs on GPU.
- GPU XGBoost: `device='cuda'` and `tree_method='hist'`. The histogram method parallelizes bin computation across GPU threads — that's the speed secret.
- Early stopping: `early_stopping_rounds=50` halts training when the validation metric fails to improve for 50 consecutive rounds. The best model from the best round is kept automatically — prevents overfitting.
Flashcards & Advisor
- GPU XGBoost: `device='cuda'` and `tree_method='hist'`. The histogram method parallelizes bin finding and node splitting across GPU threads, giving a massive speedup over CPU training.
- DMatrix: `xgb.DMatrix(data=df, label=y)` accepts cuDF DataFrames directly — data stays in GPU memory throughout training without a CPU round-trip.
- Early stopping: set `early_stopping_rounds=N` and provide an `eval_set`. Training stops when the validation metric doesn't improve for N consecutive rounds — prevents overfitting and avoids manual `n_estimators` tuning.
- Multi-GPU: `cuml.dask.ensemble.RandomForestClassifier` trains each partition on a separate GPU, then merges results. Requires a `LocalCUDACluster` or multi-node Dask cluster.
cuML API Essentials
- Import: `from cuml import LinearRegression` — same as sklearn, only the import path changes
- Train: `model.fit(X_cudf, y_cudf)` — X and y as cuDF DataFrames/Series
- Predict: `preds = model.predict(X_test)` — returns a CuPy/cuDF array on GPU
- Transform: `X_reduced = model.transform(X)` — for PCA, UMAP, scalers
- No `.to_pandas()` needed before fitting — data stays in GPU memory throughout
- Multi-GPU: prefix with `cuml.dask.` and run inside a Dask cluster context