Graph Analytics, Time-Series & EDA
cuGraph scales graph analytics to billion-edge networks on GPU. cuDF powers GPU time-series feature engineering and statistical EDA — enabling fast discovery of patterns, anomalies, and data quality issues before modeling.
cuGraph at a Glance
- NetworkX-compatible API: drop-in GPU replacement for CPU graph analytics
- Input: edge list as cuDF DataFrame with `src` and `dst` columns
- Scale: handles billion-edge graphs; CPU NetworkX would time out
- Integration: graph results returnable as cuDF DataFrames for further GPU analysis
- Key algorithms: PageRank, BFS, DFS, SSSP, Betweenness Centrality, Louvain, Jaccard
Time-Series with cuDF
- Datetime parsing: `cudf.to_datetime()` — GPU date parsing
- Rolling windows: `.rolling(window=7).mean()` — sliding aggregations on GPU
- Lag features: `.shift(n)` — n-step lag for forecasting features
- Resampling: `.resample('1H').agg(...)` — downsample to coarser time grain
- Use cases: sensor telemetry, financial tick data, system metrics
EDA & Anomaly Detection
- EDA goal: understand distributions, correlations, outliers, and class balance before modeling
- GPU descriptive stats: `df.describe()`, `df.corr()`, `df.isnull().sum()`
- Z-score: |z| > 3 flags outliers assuming Gaussian distribution
- IQR method: Q1 - 1.5×IQR to Q3 + 1.5×IQR — distribution-free outlier bounds
- cuML IsolationForest: anomalies isolated by shorter path lengths in random trees
Algorithm Quick Reference
| Algorithm | Library | What it Computes | Use Case |
|---|---|---|---|
| PageRank | cuGraph | Node importance via random walk scores | Web ranking, influence analysis |
| Louvain | cuGraph | Community detection via modularity | Network clustering, segmentation |
| Betweenness Centrality | cuGraph | Nodes that bridge clusters | Infrastructure bottleneck analysis |
| Rolling Mean | cuDF | Sliding window average | Trend smoothing in time series |
| IsolationForest | cuML | Anomaly score via path length | Outlier detection in large datasets |
| DBSCAN (noise) | cuML | Low-density region labels (-1) | Density-based anomaly detection |
cuGraph Algorithms
cuGraph is NVIDIA's GPU graph analytics library. It uses the same API patterns as NetworkX but runs algorithms in parallel across thousands of GPU cores — enabling billion-edge graph processing.
Graph Construction
- Input: edge list as a cuDF DataFrame with `src` and `dst` columns (node IDs as integers)
- Optional: `weight` column for weighted graph algorithms
- NetworkX-compatible API: many functions match NetworkX signatures
- Results returned as cuDF DataFrames — seamlessly chain into downstream cuDF operations
- Handles directed and undirected graphs
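The edge-list format above can be sketched on CPU, with pandas standing in for cuDF (the API mirrors it); the column names `src` and `dst` match the convention described, and the toy edges are illustrative:

```python
import pandas as pd
from collections import defaultdict

# Toy edge list in the shape cuGraph expects: integer src/dst columns.
# (pandas stands in for cuDF here; cuDF mirrors the pandas API.)
edges = pd.DataFrame({"src": [0, 0, 1, 2], "dst": [1, 2, 2, 3]})

# Build an undirected adjacency view to inspect the graph structure
adj = defaultdict(set)
for s, d in zip(edges["src"], edges["dst"]):
    adj[s].add(d)
    adj[d].add(s)
```

In cuGraph proper, the same DataFrame would be handed to a graph constructor rather than expanded by hand, and the structure would stay on the GPU.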
Scale Advantage
- cuGraph handles billion-edge graphs in GPU memory
- CPU NetworkX would time out or run out of RAM on the same graphs
- GPU parallelism: all nodes/edges processed simultaneously in parallel threads
- Typical speedup: 10–1000× over CPU NetworkX depending on algorithm and graph size
PageRank
- Computes an importance score for each node based on a random-walk model of the graph
- Higher PageRank = more likely to be visited by a random surfer following edges
- Use: web page ranking, social influence scoring, citation analysis
- `cugraph.pagerank(G, alpha=0.85)` — damping factor `alpha` is typically 0.85
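A minimal NumPy sketch of the power iteration behind PageRank, using the same 0.85 damping factor; the 4-node toy graph is made up for illustration:

```python
import numpy as np

# Toy directed graph as an edge list; node 0 has the most in-links
edges = [(0, 1), (1, 2), (2, 0), (3, 0)]
n = 4
alpha = 0.85  # damping factor

# Column-stochastic transition matrix: M[j, i] = 1/outdeg(i) for edge i->j
out_deg = np.zeros(n)
for s, _ in edges:
    out_deg[s] += 1
M = np.zeros((n, n))
for s, d in edges:
    M[d, s] = 1.0 / out_deg[s]

# Power iteration: r = alpha * M r + (1 - alpha)/n, starting uniform
r = np.full(n, 1.0 / n)
for _ in range(100):
    r = alpha * M @ r + (1 - alpha) / n
```

Scores stay a probability distribution (they sum to 1), and node 0, with two in-links, ends up ranked highest.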
BFS & DFS
- BFS (Breadth-First Search): explores all neighbors at current depth before going deeper; finds shortest path in unweighted graphs
- DFS (Depth-First Search): explores as deep as possible before backtracking; useful for cycle detection, topological sort
- Both GPU-accelerated in cuGraph for massive graphs
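A pure-Python BFS sketch showing why it yields shortest paths in unweighted graphs: all depth-k nodes are visited before any depth-k+1 node. The adjacency sets are a toy example:

```python
from collections import deque

# Toy unweighted, undirected graph as adjacency sets
adj = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2, 4}, 4: {3}}

def bfs_distances(adj, source):
    # The first time a node is reached, its depth equals the
    # shortest unweighted path length from the source.
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist
```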
Shortest Path (SSSP)
- Single Source Shortest Path — finds shortest weighted paths from one source node to all others
- Uses Bellman-Ford or Dijkstra variants on GPU
- Applications: route optimization, network latency analysis
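A plain-Python Dijkstra sketch with a binary heap; cuGraph's SSSP computes the same distances, just in parallel on GPU. The weighted toy graph is illustrative:

```python
import heapq

# Toy weighted directed graph: adj[u] = [(v, weight), ...]
adj = {0: [(1, 4), (2, 1)], 1: [(3, 1)], 2: [(1, 2), (3, 5)], 3: []}

def sssp(adj, source):
    # Dijkstra: repeatedly settle the closest unsettled node
    dist = {u: float("inf") for u in adj}
    dist[source] = 0
    pq = [(0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(pq, (dist[v], v))
    return dist
```

Note the cheaper path 0→2→1 (cost 3) beats the direct edge 0→1 (cost 4).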
Betweenness Centrality
- Measures how often a node lies on the shortest path between other nodes
- High betweenness = bridge node connecting otherwise separate clusters
- Critical for identifying network bottlenecks, key influencers, and infrastructure vulnerabilities
- Computationally expensive on CPU; GPU parallelism makes it practical for large graphs
Jaccard Similarity
- Measures similarity between pairs of nodes based on shared neighbors
- Formula: |shared neighbors| / |union of neighbors|
- Used in recommendation systems (people with many shared connections may know each other)
- Result: similarity score 0.0 (no shared neighbors) to 1.0 (identical neighbor sets)
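A worked example of the formula on two toy neighbor sets:

```python
# Toy neighbor sets for two nodes
neighbors_a = {1, 2, 3, 4}
neighbors_b = {3, 4, 5}

# Jaccard = |shared neighbors| / |union of neighbors|
jaccard = len(neighbors_a & neighbors_b) / len(neighbors_a | neighbors_b)
# 2 shared / 5 in the union = 0.4
```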
Louvain Community Detection
- Identifies communities (clusters) in a network by maximizing modularity
- Modularity measures density of edges within communities vs. between communities
- Iterative greedy algorithm — works bottom-up merging nodes into clusters
- Applications: social network segmentation, protein interaction networks, fraud ring detection
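The modularity score Q that Louvain maximizes can be computed directly. A sketch on a toy graph of two triangles joined by one bridge edge; the partition here is given by hand rather than discovered by the algorithm:

```python
# Two triangles joined by the bridge edge (2, 3)
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

m = len(edges)
deg = {u: 0 for u in community}
for u, v in edges:
    deg[u] += 1
    deg[v] += 1

# Q = sum over communities of (internal_edges/m - (community_degree/(2m))^2)
Q = 0.0
for c in set(community.values()):
    internal = sum(1 for u, v in edges
                   if community[u] == c and community[v] == c)
    deg_c = sum(d for u, d in deg.items() if community[u] == c)
    Q += internal / m - (deg_c / (2 * m)) ** 2
```

For this partition Q = 5/14 ≈ 0.357: most edges fall inside communities, so modularity is well above zero.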
Triangle Counting
- Counts the number of closed triangles (3-node cliques) each node participates in
- High triangle count = node is part of a tightly-knit community
- Used in social network analysis (friend-of-friend clustering coefficient)
Connected Components
- Weakly Connected Components: connected if ignoring edge direction — finds isolated subgraphs
- Strongly Connected Components: directed; every node reachable from every other node in the component
- Use: fraud detection (disconnected subgraphs), data quality (isolated records)
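A weak-connectivity sketch using union-find in plain Python; edge direction is ignored, matching the weakly connected case. The edges are a toy example:

```python
def connected_components(n, edges):
    # Weakly connected components via union-find with path halving
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return [find(u) for u in range(n)]

# Three components: {0, 1, 2}, {3, 4}, and the isolated node {5}
labels = connected_components(6, [(0, 1), (1, 2), (3, 4)])
```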
Time-Series Analysis
cuDF provides GPU-accelerated datetime operations, rolling windows, lag features, and resampling — enabling fast feature engineering for forecasting and temporal analysis on large datasets.
Date Parsing & Extraction
- `cudf.to_datetime(df['timestamp'])` — GPU datetime parsing (same API as `pd.to_datetime()`)
- Extract components: `.dt.year`, `.dt.month`, `.dt.dayofweek`, `.dt.hour`, `.dt.minute`
- These extracted features become columns for ML models (e.g., hour-of-day as cyclical feature)
- All datetime operations run on GPU — no CPU round-trip needed
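A pandas sketch of the same calls (cuDF's `to_datetime` and `.dt` accessor mirror this API); the timestamps are made-up examples:

```python
import pandas as pd

# pandas stands in for cuDF here
df = pd.DataFrame({"timestamp": ["2024-01-01 09:30:00", "2024-01-02 14:00:00"]})
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Extracted components become plain numeric feature columns
df["hour"] = df["timestamp"].dt.hour
df["dayofweek"] = df["timestamp"].dt.dayofweek  # Monday = 0
```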
Rolling Windows
- `df['value'].rolling(window=7).mean()` — 7-day moving average
- Common aggregations: `.mean()`, `.sum()`, `.std()`, `.min()`, `.max()`
- Rolling features smooth noise and capture local trend — critical for forecasting models
- Window size selection: match the temporal granularity of the underlying pattern
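A pandas sketch of the rolling call (cuDF mirrors this API); the series is toy data, and a window of 3 keeps the arithmetic easy to check:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
roll = s.rolling(window=3).mean()
# The first window-1 entries are NaN; from index 2 on, each value
# averages the current point and the two before it.
```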
Lag Features
- `df['lag_1'] = df['value'].shift(1)` — value at time t-1
- `df['lag_7'] = df['value'].shift(7)` — value 7 periods ago (e.g., same day last week)
- Lag features encode temporal autocorrelation as numeric inputs to ML models
- Essential for ARIMA alternatives: XGBoost on lag features outperforms ARIMA on many real datasets
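A pandas sketch of lag creation (cuDF's `shift` behaves the same way); toy values:

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30, 40]})
df["lag_1"] = df["value"].shift(1)  # value at t-1; first row has no predecessor (NaN)
df["lag_2"] = df["value"].shift(2)  # value at t-2; first two rows are NaN
```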
Resampling
- `df.resample('1H').agg({'value': 'mean'})` — downsample to hourly means
- Downsample: aggregate fine-grain data (e.g., seconds → hours)
- Upsample: fill missing timestamps (e.g., hours → minutes with interpolation)
- Useful for aligning datasets at different time granularities before joining
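A pandas sketch of downsampling (cuDF supports the same resample pattern); four half-hourly toy readings collapse into two hourly means:

```python
import pandas as pd

# Four half-hourly readings (toy data); lowercase "1h" is the
# alias newer pandas prefers over "1H"
idx = pd.date_range("2024-01-01 00:00", periods=4, freq="30min")
s = pd.Series([1.0, 3.0, 5.0, 7.0], index=idx)

hourly = s.resample("1h").mean()
# 00:00 bucket -> (1+3)/2 = 2.0, 01:00 bucket -> (5+7)/2 = 6.0
```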
Autocorrelation & Seasonality
- Autocorrelation: correlation of a series with its own lagged values
- High autocorrelation at lag k indicates the series has memory at that period
- Seasonality detection: ACF (autocorrelation function) plot shows periodic peaks at seasonal lags
- Seasonal decomposition: trend + seasonality + residual using statsmodels or cuML
- Common use cases: sensor telemetry, financial tick data, system metrics, IoT streams
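A NumPy sketch of lag-k autocorrelation on a toy series with period-4 seasonality; the ACF peaks at the seasonal lag and is near zero off-season:

```python
import numpy as np

# Toy series repeating every 4 steps
x = np.array([1.0, 2.0, 3.0, 2.0] * 20)

def autocorr(x, lag):
    # Pearson correlation between the series and its lag-k shift
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

# autocorr(x, 4) is ~1.0 (seasonal lag); autocorr(x, 1) is near 0
```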
ARIMA & Prophet on GPU
- ARIMA (AutoRegressive Integrated Moving Average): uses past values + past errors to forecast
- cuML provides GPU-accelerated ARIMA fitting for large datasets
- Prophet: Facebook's decomposition model; handles multiple seasonality and holidays automatically
- Both suitable for NCP-ADS exam context — know they exist as GPU-compatible options
EDA & Anomaly Detection
GPU-accelerated EDA with cuDF allows descriptive statistics, correlations, and missing data audits at scale. cuML provides multiple anomaly detection algorithms for identifying outliers before modeling.
Descriptive Statistics
- `df.describe()` — count, mean, std, min, 25%, 50%, 75%, max; all GPU-computed
- `df['col'].value_counts()` — frequency per category on GPU
- `df.isnull().sum()` — null count per column; essential data quality audit
- `df.dtypes` — verify column types loaded correctly (int, float, datetime)
- EDA goal: understand distributions, correlations, outliers, and class balance before modeling
Correlation Analysis
- `df.corr()` — pairwise Pearson correlation matrix; all computed on GPU
- Visualize as heatmap using cuDF results piped to matplotlib or seaborn
- High correlation between features: may indicate multicollinearity — consider dropping one
- High correlation between feature and target: strong predictor — prioritize in modeling
- Near-zero correlation: may still be useful with non-linear models
Class Imbalance
- Check with `df['label'].value_counts()`
- Imbalanced classes cause classifiers to predict the majority class — skewed metrics
- Fixes: resampling (oversample minority, undersample majority), class_weight='balanced', SMOTE
- Use F1 or AUC-ROC instead of accuracy for imbalanced problems
Z-Score Method
- Z = (x - mean) / std; |z| > 3 flags outliers
- Assumes Gaussian distribution — fails for skewed or multimodal data
- Fast and simple; good starting point for symmetric distributions
- All stats computed on GPU via cuDF — extremely fast for large datasets
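A NumPy sketch of the rule on toy readings with one injected outlier. Note the outlier itself inflates the mean and std, so enough inlier points are needed before |z| > 3 fires:

```python
import numpy as np

# 20 well-behaved readings plus one injected outlier (toy data)
x = np.array([10, 11, 9, 10, 12, 8, 10, 11, 9, 10,
              12, 8, 10, 11, 9, 10, 12, 8, 10, 11, 50.0])

z = (x - x.mean()) / x.std()
outliers = np.abs(z) > 3  # flags only the 50.0 reading
```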
IQR Method
- Bounds: Q1 - 1.5×IQR to Q3 + 1.5×IQR; points outside are outliers
- IQR = Q3 - Q1 (interquartile range)
- Distribution-free: works on skewed data; robust to extreme outliers influencing mean/std
- Same as box plot whisker calculation — visually intuitive
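A NumPy sketch of the IQR bounds on toy readings; unlike the z-score, the quartiles barely move when the outlier is extreme:

```python
import numpy as np

x = np.array([10, 11, 9, 10, 12, 8, 10, 11, 9, 10, 50.0])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = (x < lower) | (x > upper)  # flags only the 50.0 reading
```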
Isolation Forest (cuML)
- `cuml.IsolationForest().fit_predict(X)` — returns -1 for anomaly, 1 for normal
- Principle: anomalies are easier to isolate — they require fewer splits to separate from other points
- Shorter average path length in isolation trees = more anomalous
- Does not assume any distribution — works for multivariate anomalies
- GPU-accelerated in cuML for high-dimensional data
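A CPU sketch with scikit-learn, whose estimator convention cuML mirrors; the synthetic data and fixed seed are illustrative choices:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(200, 2))   # normal cluster
X = np.vstack([X, [[8.0, 8.0]]])       # one far-away anomaly

# fit_predict returns -1 for anomalies, 1 for normal points
labels = IsolationForest(random_state=0, contamination=0.01).fit_predict(X)
```

The injected point sits far outside the cluster, so it is isolated with a short average path length and flagged -1.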
Autoencoder Anomaly Detection
- Train autoencoder on normal data only; anomalies have high reconstruction error
- Threshold: set reconstruction error cutoff based on normal distribution of training errors
- Especially effective for multivariate time series — captures complex temporal correlations
- Requires more data and tuning than statistical methods but catches complex anomalies
DBSCAN Noise Points
- DBSCAN labels points in low-density regions as noise (-1)
- These noise points effectively serve as anomaly flags in density-based outlier detection
- Advantage: no assumed distribution; finds clusters of arbitrary shape
- Key params: `eps` (neighborhood radius), `min_samples` (minimum density threshold)
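A scikit-learn sketch of noise labeling; cuML's DBSCAN exposes the same `eps` and `min_samples` parameters. The toy points form two tight clusters plus one isolated point:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight clusters plus one isolated point (toy data)
X = np.array([[0, 0], [0.1, 0], [0, 0.1],
              [5, 5], [5.1, 5], [5, 5.1],
              [20, 20]], dtype=float)

# Points in low-density regions get label -1 (noise)
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
```

The isolated point has no `eps`-neighborhood dense enough to join a cluster, so it comes back labeled -1, which is exactly the anomaly flag described above.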
Memory Hooks
Mnemonic devices to lock in the most exam-critical cuGraph, time-series, and anomaly detection concepts.
- `.shift(n)` in cuDF creates lag-n features — the value n periods ago. Essential for forecasting: XGBoost on lag features captures temporal patterns that ARIMA models capture through p, d, q parameters.
- `.rolling(window=7).mean()` computes a 7-period sliding average — removes short-term noise to reveal the underlying trend. Window size choice: match the temporal granularity of the pattern you want to capture.
cuGraph Key Points
- Input format: edge list cuDF DataFrame with `src` and `dst` integer columns
- PageRank: importance via random walk; alpha=0.85 damping factor is standard
- Louvain: maximizes modularity — finds communities; iterative greedy algorithm
- Betweenness: counts shortest paths through a node — bridge detection
- Jaccard: shared neighbors / union of neighbors — node similarity for recommendations
- Scale: billion-edge graphs on GPU; CPU NetworkX times out at a fraction of that size
- Integration: cuGraph results return as cuDF DataFrames for downstream GPU analysis