Machine Learning with cuML & XGBoost
cuML brings much of the scikit-learn ML algorithm catalog to the GPU, while XGBoost's CUDA backend delivers state-of-the-art gradient boosting on tabular data. Together they form the machine-learning core of the NCP-ADS curriculum.
cuML at a Glance
- Drop-in replacement: same `.fit()`, `.predict()`, `.transform()` API as scikit-learn
- Native cuDF: accepts cuDF DataFrames — no CPU round-trip required
- Coverage: linear models, clustering, neighbors, decomposition, manifold, SVM, ensemble, multi-GPU
- Speedup: 10–50× faster than CPU scikit-learn on large datasets
XGBoost on GPU
- Best-in-class tabular ML: gradient boosted trees consistently win on structured data
- GPU backend: `device='cuda'` enables GPU-accelerated tree building
- Histogram method: `tree_method='hist'` — parallelizes bin finding and node splitting
- cuDF integration: `xgb.DMatrix()` accepts cuDF DataFrames directly
Model Evaluation Pipeline
- GPU split: `cuml.model_selection.train_test_split()`
- Cross-validation: `cross_validate()` — GPU parallel fold evaluation
- Metrics: `cuml.metrics` covers accuracy, AUC-ROC, F1, MSE, R², and more
- Tuning: GridSearchCV, RandomizedSearchCV, Optuna, Bayesian optimization
ML Algorithm Quick Reference
| Algorithm | cuML Class | Task | Key Note |
|---|---|---|---|
| Linear Regression | cuml.LinearRegression | Regression | GPU-accelerated OLS |
| Logistic Regression | cuml.LogisticRegression | Classification | L1/L2 regularization supported |
| K-Means | cuml.KMeans | Clustering | Parallel centroid updates on GPU |
| DBSCAN | cuml.DBSCAN | Clustering | No K needed; density-based |
| PCA | cuml.PCA | Decomposition | GPU SVD; reduces dimensions |
| UMAP | cuml.UMAP | Manifold/Viz | GPU UMAP drastically faster than CPU umap-learn |
| Random Forest | cuml.RandomForestClassifier | Ensemble | GPU parallel tree building |
| XGBoost | xgb.XGBClassifier(device='cuda') | Gradient Boosting | Best for tabular data |
cuML Algorithms
cuML provides a scikit-learn-compatible GPU ML library. All algorithms accept cuDF DataFrames natively and execute entirely on GPU — no pandas round-trip needed.
Scikit-learn Compatible API
- `.fit(X, y)` — train the model on GPU data
- `.predict(X)` — generate predictions; returns a cuDF Series or CuPy array
- `.transform(X)` — dimensionality reduction, scaling (unsupervised transforms)
- `.fit_transform(X)` — fit then transform in one call
- Accepts cuDF DataFrames directly — no `to_pandas()` or CPU round-trip
- Drop-in: change `from sklearn` to `from cuml` and the same code runs on GPU
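A minimal sketch of the drop-in pattern — only the import changes, and the tiny DataFrame values here are purely illustrative:

```python
import cudf
from cuml import LinearRegression  # sklearn equivalent: from sklearn.linear_model import LinearRegression

# Toy GPU-resident data (illustrative values)
X = cudf.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0], "x2": [0.5, 1.5, 2.5, 3.5]})
y = cudf.Series([2.0, 4.0, 6.0, 8.0])

model = LinearRegression()
model.fit(X, y)            # trains entirely on the GPU
preds = model.predict(X)   # predictions stay in GPU memory
```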
Linear Models
- LinearRegression: ordinary least squares regression on GPU
- LogisticRegression: binary and multiclass classification; supports L1/L2/ElasticNet
- Ridge: L2-regularized linear regression — penalizes large coefficients
- Lasso: L1-regularized regression — drives sparse coefficients to zero; feature selection
- ElasticNet: combination of L1 + L2 penalties for balanced regularization
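A sketch contrasting the penalties, assuming synthetic CuPy data in which only the first feature drives the target (the `alpha` values are illustrative):

```python
import cupy as cp
import cudf
from cuml import Lasso, Ridge

cp.random.seed(0)
# Synthetic data: only feature 0 matters (illustrative)
X_dev = cp.random.standard_normal((1000, 5), dtype=cp.float32)
y = cudf.Series(X_dev[:, 0] * 3.0 + 0.1 * cp.random.standard_normal(1000, dtype=cp.float32))
X = cudf.DataFrame(X_dev)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty drives weak coefficients to zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty shrinks all coefficients smoothly
print(lasso.coef_)                   # near-zero entries for features 1-4
print(ridge.coef_)                   # small but nonzero entries
```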
Clustering
- KMeans: GPU parallel centroid updates; scales to millions of points; requires K upfront
- DBSCAN: density-based spatial clustering; no K required; identifies noise points as outliers; GPU-accelerated neighbor search
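A short sketch of both clusterers on synthetic blobs (cluster counts and density parameters are illustrative):

```python
from cuml import KMeans, DBSCAN
from cuml.datasets import make_blobs

# Synthetic GPU-resident blobs (illustrative)
X, _ = make_blobs(n_samples=10_000, n_features=2, centers=5, random_state=7)

km = KMeans(n_clusters=5).fit(X)       # K must be chosen upfront
print(km.cluster_centers_)

db = DBSCAN(eps=0.5, min_samples=10)   # no K; density-based
labels = db.fit_predict(X)             # noise points are labeled -1
```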
Neighbors
- NearestNeighbors: brute-force and approximate nearest neighbor search on GPU
- KNeighborsClassifier: k-NN classification using GPU-accelerated distance computation
- Distance metrics: Euclidean, cosine, L1, L∞ — all computed in parallel on GPU cores
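A sketch of GPU neighbor search on illustrative random data:

```python
import cupy as cp
from cuml.neighbors import NearestNeighbors, KNeighborsClassifier

X = cp.random.random((5000, 16), dtype=cp.float32)  # illustrative features
y = (X[:, 0] > 0.5).astype(cp.int32)                # illustrative labels

nn = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(X)
distances, indices = nn.kneighbors(X)   # GPU-parallel distance computation

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
preds = knn.predict(X)
```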
Decomposition
- PCA: principal component analysis via GPU SVD — fast dimensionality reduction
- TruncatedSVD: randomized SVD for sparse data; reduces dimensions without centering
- Outputs are cupy arrays that can be used directly in downstream cuML models
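A sketch of GPU PCA on illustrative random data:

```python
import cupy as cp
from cuml import PCA

X = cp.random.standard_normal((100_000, 64), dtype=cp.float32)  # illustrative data

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)            # GPU SVD; result stays in device memory
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
# X_reduced is a device array usable directly by downstream cuML models
```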
UMAP — Manifold Learning
- Uniform Manifold Approximation and Projection — non-linear dimensionality reduction
- Best for visualizing high-dimensional embeddings (e.g., sentence embeddings, image features)
- GPU cuML UMAP is drastically faster than CPU umap-learn — minutes vs. hours on large datasets
- Preserves both local and global structure better than t-SNE
- Common use: reduce 768-dim transformer embeddings to 2D for visualization
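A sketch of that embedding-visualization use, with a random array standing in for real transformer embeddings:

```python
import cupy as cp
from cuml import UMAP

# Stand-in for 768-dim transformer embeddings (illustrative)
embeddings = cp.random.standard_normal((50_000, 768), dtype=cp.float32)

reducer = UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
coords_2d = reducer.fit_transform(embeddings)  # minutes on GPU vs. hours on CPU
```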
SVM & Ensemble
- SVC / SVR: support vector classification and regression — GPU-accelerated via cuML kernel solvers
- RandomForestClassifier / RandomForestRegressor: GPU parallel tree building; handles large feature counts
- Random Forest builds trees in parallel across GPU threads — significantly faster than CPU sklearn
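A sketch of both estimators on illustrative data (cuML's Random Forest expects float32 features and int32 labels):

```python
import cupy as cp
from cuml.ensemble import RandomForestClassifier
from cuml.svm import SVC

X = cp.random.random((20_000, 32), dtype=cp.float32)  # illustrative features
y = (X[:, 0] > 0.5).astype(cp.int32)                  # illustrative labels

rf = RandomForestClassifier(n_estimators=100, max_depth=8).fit(X, y)
svc = SVC(kernel="rbf", C=1.0).fit(X, y)
```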
Multi-GPU Training
- `cuml.dask.*` wrappers enable distributed training across multiple GPUs
- Example: `from cuml.dask.ensemble import RandomForestClassifier`
- Works with `LocalCUDACluster` or multi-node Dask clusters via RAPIDS UCX
- Each GPU trains on a partition; results are merged for the final model (see the sketch below)
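A sketch of single-node multi-GPU training; the `train.csv` file and its `label` column are assumed inputs:

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf
from cuml.dask.ensemble import RandomForestClassifier

cluster = LocalCUDACluster()   # one Dask worker per visible GPU
client = Client(cluster)

# 'train.csv' with a 'label' column is an assumed input file
ddf = dask_cudf.read_csv("train.csv")
y = ddf["label"].astype("int32")
X = ddf.drop(columns=["label"]).astype("float32")

model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)            # each GPU trains on its own partition
preds = model.predict(X)   # distributed prediction; results merged
```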
XGBoost on GPU
XGBoost is often the best performer on tabular data. The CUDA backend enables fully GPU-accelerated gradient boosted tree training with the same API as CPU XGBoost.
GPU Device Configuration
- Functional API: `xgb.train({'device': 'cuda', ...}, dtrain)`
- Sklearn API: `XGBClassifier(device='cuda')` or `XGBRegressor(device='cuda')`
- Tree method: `tree_method='hist'` — histogram-based algorithm parallelizes bin finding and node splitting on GPU
- Histogram binning is precomputed once; the GPU massively parallelizes split-candidate evaluation
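A sketch of the sklearn-style API, assuming `X_train`, `y_train`, and `X_test` are cuDF objects from an earlier split:

```python
import xgboost as xgb

# X_train / y_train / X_test assumed to be cuDF objects from earlier steps
model = xgb.XGBClassifier(
    device="cuda",        # build trees on the GPU
    tree_method="hist",   # histogram-based split finding
    n_estimators=500,
    learning_rate=0.05,
)
model.fit(X_train, y_train)
preds = model.predict(X_test)
```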
DMatrix — XGBoost's Data Structure
- `xgb.DMatrix(data=df, label=y)` — optimized sparse data structure for XGBoost
- Supports cuDF DataFrames directly — no pandas conversion needed
- Precomputes feature statistics for fast split finding
- Stores data in compressed column format for cache efficiency
- For GPU: data stays in GPU memory throughout training
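A functional-API sketch, assuming `train_df` and `valid_df` are cuDF DataFrames with a `label` target column:

```python
import xgboost as xgb

# train_df / valid_df assumed to be cuDF DataFrames; 'label' is the target
dtrain = xgb.DMatrix(data=train_df.drop(columns=["label"]), label=train_df["label"])
dvalid = xgb.DMatrix(data=valid_df.drop(columns=["label"]), label=valid_df["label"])

params = {
    "device": "cuda",
    "tree_method": "hist",
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "max_depth": 6,
    "learning_rate": 0.1,
}
booster = xgb.train(params, dtrain, num_boost_round=500,
                    evals=[(dvalid, "valid")])
```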
Key Hyperparameters
| Parameter | Role | Tuning Direction |
|---|---|---|
| n_estimators | Number of boosting rounds (trees) | More = better fit; use early stopping |
| max_depth | Max tree depth; controls complexity | Higher = more overfitting risk; typical range 3–10 |
| learning_rate (eta) | Shrinkage applied to each tree's contribution | Lower = needs more trees; overfits more slowly |
| subsample | Row sampling fraction per tree | 0.6–0.9 reduces variance |
| colsample_bytree | Feature sampling fraction per tree | 0.6–0.9 adds randomness; prevents overfit |
| reg_alpha | L1 regularization on leaf weights | Increases sparsity; drives weights to zero |
| reg_lambda | L2 regularization on leaf weights | Smooths weights; default is 1 |
Early Stopping
- `early_stopping_rounds=50` — stops training when the validation metric doesn't improve for 50 rounds
- Prevents overfitting without manually tuning `n_estimators`
- Requires an `eval_set`: `model.fit(X_train, y_train, eval_set=[(X_val, y_val)])`
- The best model is automatically restored from the round with the best validation score
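A sketch putting the pieces together, assuming train/validation splits from earlier steps:

```python
import xgboost as xgb

# X_train/y_train and X_val/y_val assumed from an earlier GPU split
model = xgb.XGBClassifier(
    device="cuda",
    tree_method="hist",
    n_estimators=2000,          # an upper bound; early stopping picks the real count
    eval_metric="auc",
    early_stopping_rounds=50,   # stop after 50 rounds without validation improvement
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(model.best_iteration, model.best_score)  # round and score of the restored best model
```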
Feature Importance
- `model.feature_importances_` — sklearn-style array of importance scores
- `xgb.plot_importance(model)` — bar chart of feature importances
- Weight: number of times a feature is used to split
- Gain: average training loss reduction from splits using this feature (most informative)
- Cover: average number of samples affected by splits using this feature
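A sketch of the three measures, assuming `model` is the fitted `XGBClassifier` from above:

```python
import xgboost as xgb

# 'model' is a fitted XGBClassifier from the previous steps
print(model.feature_importances_)                   # sklearn-style normalized scores

booster = model.get_booster()
print(booster.get_score(importance_type="gain"))    # most informative measure
print(booster.get_score(importance_type="weight"))  # split counts per feature
print(booster.get_score(importance_type="cover"))   # samples affected per feature

xgb.plot_importance(model)                          # matplotlib bar chart
```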
Model Evaluation & Tuning
Rigorous evaluation and systematic hyperparameter tuning are critical for building production-quality models. cuML provides GPU-accelerated versions of cross-validation, metrics, and search utilities.
Data Splitting
- `cuml.model_selection.train_test_split()` — GPU version of sklearn's split; data stays on GPU
- Returns cuDF DataFrames/Series for both train and test sets
- Parameters: `test_size`, `random_state`, `stratify` — same interface as sklearn
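A sketch, assuming `df` is a cuDF DataFrame with a `label` column:

```python
from cuml.model_selection import train_test_split

# df is an assumed cuDF DataFrame with a 'label' column
X = df.drop(columns=["label"])
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# All four outputs remain cuDF objects in GPU memory
```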
Cross-Validation
- K-fold CV: splits data into K folds; trains on K-1, validates on 1; repeats K times
- `cross_validate()` — K-fold evaluation of GPU estimators; sklearn's implementation can wrap cuML models
- Gives a more reliable performance estimate than a single train/test split
- Stratified K-fold: preserves class proportions in each fold — use for imbalanced data
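One common pattern is to drive a cuML estimator with sklearn's `cross_validate` — cuML estimators expose the sklearn `get_params` interface, so cloning generally works; the data is kept as NumPy here for sklearn's fold indexing. A hedged sketch:

```python
import numpy as np
from sklearn.model_selection import cross_validate, StratifiedKFold
from cuml import LogisticRegression

# Illustrative CPU data; the cuML estimator moves each fold to the GPU
X = np.random.random((10_000, 20)).astype(np.float32)
y = (X[:, 0] > 0.5).astype(np.int32)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # preserves class proportions
scores = cross_validate(LogisticRegression(), X, y, cv=cv, scoring="accuracy")
print(scores["test_score"].mean())
```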
Classification Metrics (cuml.metrics)
- Accuracy: fraction of correct predictions; misleading for imbalanced classes
- Precision: TP / (TP + FP) — of predicted positives, how many are correct
- Recall: TP / (TP + FN) — of actual positives, how many were found
- F1: harmonic mean of precision and recall; balances both
- AUC-ROC: area under ROC curve; threshold-independent; 0.5 = random, 1.0 = perfect
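A sketch, assuming `model`, `X_test`, and `y_test` from the earlier steps and array-like probability output:

```python
from cuml.metrics import accuracy_score, roc_auc_score

# model / X_test / y_test assumed from the earlier GPU split and training steps
preds = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]  # positive-class scores

print(accuracy_score(y_test, preds))
print(roc_auc_score(y_test, proba))        # threshold-independent
```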
Regression Metrics (cuml.metrics)
- MSE (Mean Squared Error): average of squared residuals; penalizes large errors heavily
- RMSE: square root of MSE; same units as target variable
- MAE (Mean Absolute Error): average absolute residuals; robust to outliers
- R² (R-squared): proportion of variance explained by model; 1.0 = perfect, 0 = mean baseline
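A sketch, assuming `reg` is a fitted cuML regressor and `X_test`/`y_test` come from the earlier split:

```python
from cuml.metrics import mean_squared_error, mean_absolute_error, r2_score

# reg / X_test / y_test assumed from earlier steps
preds = reg.predict(X_test)

mse = mean_squared_error(y_test, preds)
print(mse, mse ** 0.5)                     # MSE and RMSE (RMSE in target units)
print(mean_absolute_error(y_test, preds))  # robust to outliers
print(r2_score(y_test, preds))             # 1.0 = perfect, 0 = mean baseline
```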
Grid & Random Search
- GridSearchCV: exhaustive search over all parameter combinations; expensive for large grids
- RandomizedSearchCV: randomly samples from param distributions; more efficient for large spaces
- Both available via cuML or sklearn (sklearn can wrap cuML estimators)
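A sketch of the wrapping pattern, assuming NumPy `X`/`y`; sklearn drives the search while each fit runs on the GPU (the parameter ranges are illustrative):

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

# NumPy X / y assumed; each candidate fit runs on the GPU
param_dist = {
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.29),
    "subsample": uniform(0.6, 0.3),
}
search = RandomizedSearchCV(
    xgb.XGBClassifier(device="cuda", tree_method="hist"),
    param_distributions=param_dist,
    n_iter=20, cv=3, scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```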
Bayesian Optimization
- Most sample-efficient method for expensive GPU training runs
- Builds a surrogate model of the objective function; samples where improvement is likely
- Tools: Optuna, WandB Sweeps, Ax — all GPU-friendly
- Best choice when each training run takes minutes or hours
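A minimal Optuna sketch, assuming the train/validation split from earlier; the search ranges mirror the hyperparameter table above:

```python
import optuna
import xgboost as xgb

# X_train/y_train and X_val/y_val assumed from an earlier split
def objective(trial):
    model = xgb.XGBClassifier(
        device="cuda",
        tree_method="hist",
        n_estimators=1000,
        early_stopping_rounds=50,
        eval_metric="auc",
        max_depth=trial.suggest_int("max_depth", 3, 10),
        learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        subsample=trial.suggest_float("subsample", 0.6, 0.9),
    )
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    return model.best_score  # validation AUC at the best round

study = optuna.create_study(direction="maximize")  # TPE surrogate model by default
study.optimize(objective, n_trials=30)
print(study.best_params)
```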
Learning Curves & Diagnostics
- Plot train vs. validation metric over training iterations
- Overfitting: train metric improves, val metric worsens — add regularization or reduce complexity
- Underfitting: both train and val metrics are low — increase model complexity or features
- SHAP values: model-agnostic feature attribution; shows each feature's contribution to a prediction
- XGBoost gain/weight/cover: built-in feature importance measures
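A sketch of a learning-curve plot, assuming `model` was fitted with `eval_set=[(X_train, y_train), (X_val, y_val)]`:

```python
import matplotlib.pyplot as plt

# 'model' assumed fitted with eval_set=[(X_train, y_train), (X_val, y_val)]
history = model.evals_result()
plt.plot(history["validation_0"]["auc"], label="train")
plt.plot(history["validation_1"]["auc"], label="validation")
plt.xlabel("boosting round")
plt.ylabel("AUC")
plt.legend()
plt.show()
# Diverging curves (train up, validation down) signal overfitting
```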
Memory Hooks
Mnemonic devices to lock in the most exam-critical cuML and XGBoost concepts.
- Drop-in API: same `.fit()`, `.predict()`, `.transform()` as scikit-learn. Change the import, keep the code — it runs on GPU.
- GPU XGBoost: `device='cuda'` and `tree_method='hist'`. The histogram method parallelizes bin computation across GPU threads — that's the speed secret.
- Early stopping: `early_stopping_rounds=50` halts training when the validation metric fails to improve for 50 consecutive rounds. The best model from the best round is kept automatically — prevents overfitting.
Flashcards & Advisor
- GPU XGBoost: `device='cuda'` and `tree_method='hist'`. The histogram method parallelizes bin finding and node splitting across GPU threads, giving a massive speedup over CPU training.
- DMatrix: `xgb.DMatrix(data=df, label=y)` accepts cuDF DataFrames directly — data stays in GPU memory throughout training without a CPU round-trip.
- Early stopping: set `early_stopping_rounds=N` and provide an `eval_set`. Training stops when the validation metric doesn't improve for N consecutive rounds — prevents overfitting and avoids manual `n_estimators` tuning.
- Multi-GPU: `cuml.dask.ensemble.RandomForestClassifier` trains each partition on a separate GPU, then merges results. Requires a `LocalCUDACluster` or multi-node Dask cluster.
cuML API Essentials
- Import: `from cuml import LinearRegression` — same as sklearn, only the import path changes
- Train: `model.fit(X_cudf, y_cudf)` — X and y as cuDF DataFrames/Series
- Predict: `preds = model.predict(X_test)` — returns a CuPy/cuDF array on GPU
- Transform: `X_reduced = model.transform(X)` — for PCA, UMAP, scalers
- No `.to_pandas()` needed before fitting — data stays in GPU memory throughout
- Multi-GPU: prefix with `cuml.dask.` and run inside a Dask cluster context