Graph Analytics, Time-Series & EDA
cuGraph scales graph analytics to billion-edge networks on GPU. cuDF powers GPU time-series feature engineering and statistical EDA — enabling fast discovery of patterns, anomalies, and data quality issues before modeling.
cuGraph at a Glance
- NetworkX-compatible API: drop-in GPU replacement for CPU graph analytics
- Input: edge list as cuDF DataFrame with `src` and `dst` columns
- Scale: handles billion-edge graphs; CPU NetworkX would time out
- Integration: graph results returnable as cuDF DataFrames for further GPU analysis
- Key algorithms: PageRank, BFS, DFS, SSSP, Betweenness Centrality, Louvain, Jaccard
Time-Series with cuDF
- Datetime parsing: `cudf.to_datetime()` — GPU date parsing
- Rolling windows: `.rolling(window=7).mean()` — sliding aggregations on GPU
- Lag features: `.shift(n)` — n-step lag for forecasting features
- Resampling: `.resample('1H').agg(...)` — downsample to coarser time grain
- Use cases: sensor telemetry, financial tick data, system metrics
EDA & Anomaly Detection
- EDA goal: understand distributions, correlations, outliers, and class balance before modeling
- GPU descriptive stats: `df.describe()`, `df.corr()`, `df.isnull().sum()`
- Z-score: |z| > 3 flags outliers assuming Gaussian distribution
- IQR method: Q1 - 1.5×IQR to Q3 + 1.5×IQR — distribution-free outlier bounds
- cuML IsolationForest: anomalies isolated by shorter path lengths in random trees
Algorithm Quick Reference
| Algorithm | Library | What it Computes | Use Case |
|---|---|---|---|
| PageRank | cuGraph | Node importance via random walk scores | Web ranking, influence analysis |
| Louvain | cuGraph | Community detection via modularity | Network clustering, segmentation |
| Betweenness Centrality | cuGraph | Nodes that bridge clusters | Infrastructure bottleneck analysis |
| Rolling Mean | cuDF | Sliding window average | Trend smoothing in time series |
| IsolationForest | cuML | Anomaly score via path length | Outlier detection in large datasets |
| DBSCAN (noise) | cuML | Low-density region labels (-1) | Density-based anomaly detection |
cuGraph Algorithms
cuGraph is NVIDIA's GPU graph analytics library. It uses the same API patterns as NetworkX but runs algorithms in parallel across thousands of GPU cores — enabling billion-edge graph processing.
Graph Construction
- Input: edge list as a cuDF DataFrame with `src` and `dst` columns (node IDs as integers)
- Optional: `weight` column for weighted graph algorithms
- NetworkX-compatible API: many functions match NetworkX signatures
- Results returned as cuDF DataFrames — seamlessly chain into downstream cuDF operations
- Handles directed and undirected graphs
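The edge-list format above can be sketched on CPU, with pandas standing in for cuDF (the API mirrors it); the column names `src` and `dst` match the convention described, and the toy edges are illustrative:

```python
import pandas as pd
from collections import defaultdict

# Toy edge list in the shape cuGraph expects: integer src/dst columns.
# (pandas stands in for cuDF here; cuDF mirrors the pandas API.)
edges = pd.DataFrame({"src": [0, 0, 1, 2], "dst": [1, 2, 2, 3]})

# Build an undirected adjacency view to inspect the graph structure
adj = defaultdict(set)
for s, d in zip(edges["src"], edges["dst"]):
    adj[s].add(d)
    adj[d].add(s)
```

In cuGraph proper, the same DataFrame would be handed to a graph constructor rather than expanded by hand, and the structure would stay on the GPU.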
Scale Advantage
- cuGraph handles billion-edge graphs in GPU memory
- CPU NetworkX would time out or run out of RAM on the same graphs
- GPU parallelism: all nodes/edges processed simultaneously in parallel threads
- Typical speedup: 10–1000× over CPU NetworkX depending on algorithm and graph size
PageRank
- Computes an importance score for each node based on a random-walk model of the graph
- Higher PageRank = more likely to be visited by a random surfer following edges
- Use: web page ranking, social influence scoring, citation analysis
- `cugraph.pagerank(G, alpha=0.85)` — damping factor `alpha` is typically 0.85
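A minimal NumPy sketch of the power iteration behind PageRank, using the same 0.85 damping factor; the 4-node toy graph is made up for illustration:

```python
import numpy as np

# Toy directed graph as an edge list; node 0 has the most in-links
edges = [(0, 1), (1, 2), (2, 0), (3, 0)]
n = 4
alpha = 0.85  # damping factor

# Column-stochastic transition matrix: M[j, i] = 1/outdeg(i) for edge i->j
out_deg = np.zeros(n)
for s, _ in edges:
    out_deg[s] += 1
M = np.zeros((n, n))
for s, d in edges:
    M[d, s] = 1.0 / out_deg[s]

# Power iteration: r = alpha * M r + (1 - alpha)/n, starting uniform
r = np.full(n, 1.0 / n)
for _ in range(100):
    r = alpha * M @ r + (1 - alpha) / n
```

Scores stay a probability distribution (they sum to 1), and node 0, with two in-links, ends up ranked highest.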
BFS & DFS
- BFS (Breadth-First Search): explores all neighbors at current depth before going deeper; finds shortest path in unweighted graphs
- DFS (Depth-First Search): explores as deep as possible before backtracking; useful for cycle detection, topological sort
- Both GPU-accelerated in cuGraph for massive graphs
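A pure-Python BFS sketch showing why it yields shortest paths in unweighted graphs: all depth-k nodes are visited before any depth-k+1 node. The adjacency sets are a toy example:

```python
from collections import deque

# Toy unweighted, undirected graph as adjacency sets
adj = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2, 4}, 4: {3}}

def bfs_distances(adj, source):
    # The first time a node is reached, its depth equals the
    # shortest unweighted path length from the source.
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist
```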
Shortest Path (SSSP)
- Single Source Shortest Path — finds shortest weighted paths from one source node to all others
- Uses Bellman-Ford or Dijkstra variants on GPU
- Applications: route optimization, network latency analysis
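A plain-Python Dijkstra sketch with a binary heap; cuGraph's SSSP computes the same distances, just in parallel on GPU. The weighted toy graph is illustrative:

```python
import heapq

# Toy weighted directed graph: adj[u] = [(v, weight), ...]
adj = {0: [(1, 4), (2, 1)], 1: [(3, 1)], 2: [(1, 2), (3, 5)], 3: []}

def sssp(adj, source):
    # Dijkstra: repeatedly settle the closest unsettled node
    dist = {u: float("inf") for u in adj}
    dist[source] = 0
    pq = [(0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(pq, (dist[v], v))
    return dist
```

Note the cheaper path 0→2→1 (cost 3) beats the direct edge 0→1 (cost 4).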
Betweenness Centrality
- Measures how often a node lies on the shortest path between other nodes
- High betweenness = bridge node connecting otherwise separate clusters
- Critical for identifying network bottlenecks, key influencers, and infrastructure vulnerabilities
- Computationally expensive on CPU; GPU parallelism makes it practical for large graphs
Jaccard Similarity
- Measures similarity between pairs of nodes based on shared neighbors
- Formula: |shared neighbors| / |union of neighbors|
- Used in recommendation systems (people with many shared connections may know each other)
- Result: similarity score 0.0 (no shared neighbors) to 1.0 (identical neighbor sets)
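A worked example of the formula on two toy neighbor sets:

```python
# Toy neighbor sets for two nodes
neighbors_a = {1, 2, 3, 4}
neighbors_b = {3, 4, 5}

# Jaccard = |shared neighbors| / |union of neighbors|
jaccard = len(neighbors_a & neighbors_b) / len(neighbors_a | neighbors_b)
# 2 shared / 5 in the union = 0.4
```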
Louvain Community Detection
- Identifies communities (clusters) in a network by maximizing modularity
- Modularity measures density of edges within communities vs. between communities
- Iterative greedy algorithm — works bottom-up merging nodes into clusters
- Applications: social network segmentation, protein interaction networks, fraud ring detection
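The modularity score Q that Louvain maximizes can be computed directly. A sketch on a toy graph of two triangles joined by one bridge edge; the partition here is given by hand rather than discovered by the algorithm:

```python
# Two triangles joined by the bridge edge (2, 3)
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

m = len(edges)
deg = {u: 0 for u in community}
for u, v in edges:
    deg[u] += 1
    deg[v] += 1

# Q = sum over communities of (internal_edges/m - (community_degree/(2m))^2)
Q = 0.0
for c in set(community.values()):
    internal = sum(1 for u, v in edges
                   if community[u] == c and community[v] == c)
    deg_c = sum(d for u, d in deg.items() if community[u] == c)
    Q += internal / m - (deg_c / (2 * m)) ** 2
```

For this partition Q = 5/14 ≈ 0.357: most edges fall inside communities, so modularity is well above zero.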
Triangle Counting
- Counts the number of closed triangles (3-node cliques) each node participates in
- High triangle count = node is part of a tightly-knit community
- Used in social network analysis (friend-of-friend clustering coefficient)
Connected Components
- Weakly Connected Components: connected if ignoring edge direction — finds isolated subgraphs
- Strongly Connected Components: directed; every node reachable from every other node in the component
- Use: fraud detection (disconnected subgraphs), data quality (isolated records)
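A weak-connectivity sketch using union-find in plain Python; edge direction is ignored, matching the weakly connected case. The edges are a toy example:

```python
def connected_components(n, edges):
    # Weakly connected components via union-find with path halving
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return [find(u) for u in range(n)]

# Three components: {0, 1, 2}, {3, 4}, and the isolated node {5}
labels = connected_components(6, [(0, 1), (1, 2), (3, 4)])
```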
Time-Series Analysis
cuDF provides GPU-accelerated datetime operations, rolling windows, lag features, and resampling — enabling fast feature engineering for forecasting and temporal analysis on large datasets.
Date Parsing & Extraction
- `cudf.to_datetime(df['timestamp'])` — GPU datetime parsing (same API as `pd.to_datetime()`)
- Extract components: `.dt.year`, `.dt.month`, `.dt.dayofweek`, `.dt.hour`, `.dt.minute`
- These extracted features become columns for ML models (e.g., hour-of-day as cyclical feature)
- All datetime operations run on GPU — no CPU round-trip needed
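A pandas sketch of the same calls (cuDF's `to_datetime` and `.dt` accessor mirror this API); the timestamps are made-up examples:

```python
import pandas as pd

# pandas stands in for cuDF here
df = pd.DataFrame({"timestamp": ["2024-01-01 09:30:00", "2024-01-02 14:00:00"]})
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Extracted components become plain numeric feature columns
df["hour"] = df["timestamp"].dt.hour
df["dayofweek"] = df["timestamp"].dt.dayofweek  # Monday = 0
```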
Rolling Windows
- `df['value'].rolling(window=7).mean()` — 7-day moving average
- Common aggregations: `.mean()`, `.sum()`, `.std()`, `.min()`, `.max()`
- Rolling features smooth noise and capture local trend — critical for forecasting models
- Window size selection: match the temporal granularity of the underlying pattern
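A pandas sketch of the rolling call (cuDF mirrors this API); the series is toy data, and a window of 3 keeps the arithmetic easy to check:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
roll = s.rolling(window=3).mean()
# The first window-1 entries are NaN; from index 2 on, each value
# averages the current point and the two before it.
```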
Lag Features
- `df['lag_1'] = df['value'].shift(1)` — value at time t-1
- `df['lag_7'] = df['value'].shift(7)` — value 7 periods ago (e.g., same day last week)
- Lag features encode temporal autocorrelation as numeric inputs to ML models
- Essential for ARIMA alternatives: XGBoost on lag features outperforms ARIMA on many real datasets
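A pandas sketch of lag creation (cuDF's `shift` behaves the same way); toy values:

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30, 40]})
df["lag_1"] = df["value"].shift(1)  # value at t-1; first row has no predecessor (NaN)
df["lag_2"] = df["value"].shift(2)  # value at t-2; first two rows are NaN
```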
Resampling
- `df.resample('1H').agg({'value': 'mean'})` — downsample to hourly means
- Downsample: aggregate fine-grain data (e.g., seconds → hours)
- Upsample: fill missing timestamps (e.g., hours → minutes with interpolation)
- Useful for aligning datasets at different time granularities before joining
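A pandas sketch of downsampling (cuDF supports the same resample pattern); four half-hourly toy readings collapse into two hourly means:

```python
import pandas as pd

# Four half-hourly readings (toy data); lowercase "1h" is the
# alias newer pandas prefers over "1H"
idx = pd.date_range("2024-01-01 00:00", periods=4, freq="30min")
s = pd.Series([1.0, 3.0, 5.0, 7.0], index=idx)

hourly = s.resample("1h").mean()
# 00:00 bucket -> (1+3)/2 = 2.0, 01:00 bucket -> (5+7)/2 = 6.0
```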
Autocorrelation & Seasonality
- Autocorrelation: correlation of a series with its own lagged values
- High autocorrelation at lag k indicates the series has memory at that period
- Seasonality detection: ACF (autocorrelation function) plot shows periodic peaks at seasonal lags
- Seasonal decomposition: trend + seasonality + residual using statsmodels or cuML
- Common use cases: sensor telemetry, financial tick data, system metrics, IoT streams
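A NumPy sketch of lag-k autocorrelation on a toy series with period-4 seasonality; the ACF peaks at the seasonal lag and is near zero off-season:

```python
import numpy as np

# Toy series repeating every 4 steps
x = np.array([1.0, 2.0, 3.0, 2.0] * 20)

def autocorr(x, lag):
    # Pearson correlation between the series and its lag-k shift
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

# autocorr(x, 4) is ~1.0 (seasonal lag); autocorr(x, 1) is near 0
```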
ARIMA & Prophet on GPU
- ARIMA (AutoRegressive Integrated Moving Average): uses past values + past errors to forecast
- cuML provides GPU-accelerated ARIMA fitting for large datasets
- Prophet: Facebook's decomposition model; handles multiple seasonality and holidays automatically
- Both suitable for NCP-ADS exam context — know they exist as GPU-compatible options
EDA & Anomaly Detection
GPU-accelerated EDA with cuDF allows descriptive statistics, correlations, and missing data audits at scale. cuML provides multiple anomaly detection algorithms for identifying outliers before modeling.
Descriptive Statistics
- `df.describe()` — count, mean, std, min, 25%, 50%, 75%, max; all GPU-computed
- `df['col'].value_counts()` — frequency per category on GPU
- `df.isnull().sum()` — null count per column; essential data quality audit
- `df.dtypes` — verify column types loaded correctly (int, float, datetime)
- EDA goal: understand distributions, correlations, outliers, and class balance before modeling
Correlation Analysis
- `df.corr()` — pairwise Pearson correlation matrix; all computed on GPU
- Visualize as heatmap using cuDF results piped to matplotlib or seaborn
- High correlation between features: may indicate multicollinearity — consider dropping one
- High correlation between feature and target: strong predictor — prioritize in modeling
- Near-zero correlation: may still be useful with non-linear models
Class Imbalance
- Check with `df['label'].value_counts()`
- Imbalanced classes cause classifiers to predict the majority class — skewed metrics
- Fixes: resampling (oversample minority, undersample majority), class_weight='balanced', SMOTE
- Use F1 or AUC-ROC instead of accuracy for imbalanced problems
Z-Score Method
- Z = (x - mean) / std; |z| > 3 flags outliers
- Assumes Gaussian distribution — fails for skewed or multimodal data
- Fast and simple; good starting point for symmetric distributions
- All stats computed on GPU via cuDF — extremely fast for large datasets
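A NumPy sketch of the rule on toy readings with one injected outlier. Note the outlier itself inflates the mean and std, so enough inlier points are needed before |z| > 3 fires:

```python
import numpy as np

# 20 well-behaved readings plus one injected outlier (toy data)
x = np.array([10, 11, 9, 10, 12, 8, 10, 11, 9, 10,
              12, 8, 10, 11, 9, 10, 12, 8, 10, 11, 50.0])

z = (x - x.mean()) / x.std()
outliers = np.abs(z) > 3  # flags only the 50.0 reading
```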
IQR Method
- Bounds: Q1 - 1.5×IQR to Q3 + 1.5×IQR; points outside are outliers
- IQR = Q3 - Q1 (interquartile range)
- Distribution-free: works on skewed data; robust to extreme outliers influencing mean/std
- Same as box plot whisker calculation — visually intuitive
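A NumPy sketch of the IQR bounds on toy readings; unlike the z-score, the quartiles barely move when the outlier is extreme:

```python
import numpy as np

x = np.array([10, 11, 9, 10, 12, 8, 10, 11, 9, 10, 50.0])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = (x < lower) | (x > upper)  # flags only the 50.0 reading
```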
Isolation Forest (cuML)
- `cuml.IsolationForest().fit_predict(X)` — returns -1 for anomaly, 1 for normal
- Principle: anomalies are easier to isolate — they require fewer splits to separate from other points
- Shorter average path length in isolation trees = more anomalous
- Does not assume any distribution — works for multivariate anomalies
- GPU-accelerated in cuML for high-dimensional data
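A CPU sketch with scikit-learn, whose estimator convention cuML mirrors; the synthetic data and fixed seed are illustrative choices:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(200, 2))   # normal cluster
X = np.vstack([X, [[8.0, 8.0]]])       # one far-away anomaly

# fit_predict returns -1 for anomalies, 1 for normal points
labels = IsolationForest(random_state=0, contamination=0.01).fit_predict(X)
```

The injected point sits far outside the cluster, so it is isolated with a short average path length and flagged -1.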
Autoencoder Anomaly Detection
- Train autoencoder on normal data only; anomalies have high reconstruction error
- Threshold: set reconstruction error cutoff based on normal distribution of training errors
- Especially effective for multivariate time series — captures complex temporal correlations
- Requires more data and tuning than statistical methods but catches complex anomalies
DBSCAN Noise Points
- DBSCAN labels points in low-density regions as noise (-1)
- These noise points effectively serve as anomaly flags in density-based outlier detection
- Advantage: no assumed distribution; finds clusters of arbitrary shape
- Key params: `eps` (neighborhood radius), `min_samples` (minimum density threshold)
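A scikit-learn sketch of noise labeling; cuML's DBSCAN exposes the same `eps` and `min_samples` parameters. The toy points form two tight clusters plus one isolated point:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight clusters plus one isolated point (toy data)
X = np.array([[0, 0], [0.1, 0], [0, 0.1],
              [5, 5], [5.1, 5], [5, 5.1],
              [20, 20]], dtype=float)

# Points in low-density regions get label -1 (noise)
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
```

The isolated point has no `eps`-neighborhood dense enough to join a cluster, so it comes back labeled -1, which is exactly the anomaly flag described above.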
Memory Hooks
Mnemonic devices to lock in the most exam-critical cuGraph, time-series, and anomaly detection concepts.
- `.shift(n)` in cuDF creates lag-n features — the value n periods ago. Essential for forecasting: XGBoost on lag features captures temporal patterns that ARIMA models capture through p, d, q parameters.
- `.rolling(window=7).mean()` computes a 7-period sliding average — removes short-term noise to reveal the underlying trend. Window size choice: match the temporal granularity of the pattern you want to capture.
cuGraph Key Points
- Input format: edge list cuDF DataFrame with `src` and `dst` integer columns
- PageRank: importance via random walk; alpha=0.85 damping factor is standard
- Louvain: maximizes modularity — finds communities; iterative greedy algorithm
- Betweenness: counts shortest paths through a node — bridge detection
- Jaccard: shared neighbors / union of neighbors — node similarity for recommendations
- Scale: billion-edge graphs on GPU; CPU NetworkX times out at a fraction of that size
- Integration: cuGraph results return as cuDF DataFrames for downstream GPU analysis