FlashGenius
NCP-ADS Exam Prep · Topic 4

Graph Analytics, Time-Series & EDA

cuGraph · PageRank · Anomaly Detection · GPU-Accelerated EDA



cuGraph scales graph analytics to billion-edge networks on GPU. cuDF powers GPU time-series feature engineering and statistical EDA — enabling fast discovery of patterns, anomalies, and data quality issues before modeling.

cuGraph at a Glance

  • NetworkX-compatible API: drop-in GPU replacement for CPU graph analytics
  • Input: edge list as cuDF DataFrame with src and dst columns
  • Scale: handles billion-edge graphs; CPU NetworkX would time out
  • Integration: graph results returnable as cuDF DataFrames for further GPU analysis
  • Key algorithms: PageRank, BFS, DFS, SSSP, Betweenness Centrality, Louvain, Jaccard

Time-Series with cuDF

  • Datetime parsing: cudf.to_datetime() — GPU date parsing
  • Rolling windows: .rolling(window=7).mean() — sliding aggregations on GPU
  • Lag features: .shift(n) — n-step lag for forecasting features
  • Resampling: .resample('1H').agg(...) — downsample to coarser time grain
  • Use cases: sensor telemetry, financial tick data, system metrics

EDA & Anomaly Detection

  • EDA goal: understand distributions, correlations, outliers, and class balance before modeling
  • GPU descriptive stats: df.describe(), df.corr(), df.isnull().sum()
  • Z-score: |z| > 3 flags outliers assuming Gaussian distribution
  • IQR method: Q1 - 1.5×IQR to Q3 + 1.5×IQR — distribution-free outlier bounds
  • cuML IsolationForest: anomalies isolated by shorter path lengths in random trees

Algorithm Quick Reference

Algorithm              | Library | What it Computes                      | Use Case
PageRank               | cuGraph | Node importance via random-walk scores | Web ranking, influence analysis
Louvain                | cuGraph | Community detection via modularity     | Network clustering, segmentation
Betweenness Centrality | cuGraph | Nodes that bridge clusters             | Infrastructure bottleneck analysis
Rolling Mean           | cuDF    | Sliding-window average                 | Trend smoothing in time series
IsolationForest        | cuML    | Anomaly score via path length          | Outlier detection in large datasets
DBSCAN (noise)         | cuML    | Low-density region labels (-1)         | Density-based anomaly detection

cuGraph Algorithms

cuGraph is NVIDIA's GPU graph analytics library. It uses the same API patterns as NetworkX but runs algorithms in parallel across thousands of GPU cores — enabling billion-edge graph processing.

Data Input & Setup

Graph Construction

  • Input: edge list as a cuDF DataFrame with src and dst columns (node IDs as integers)
  • Optional: weight column for weighted graph algorithms
  • NetworkX-compatible API: many functions match NetworkX signatures
  • Results returned as cuDF DataFrames — seamlessly chain into downstream cuDF operations
  • Handles directed and undirected graphs

Scale Advantage

  • cuGraph handles billion-edge graphs in GPU memory
  • CPU NetworkX would time out or run out of RAM on the same graphs
  • GPU parallelism: nodes and edges processed concurrently across thousands of GPU threads
  • Typical speedup: 10–1000× over CPU NetworkX depending on algorithm and graph size
Core Algorithms

PageRank

  • Computes importance score for each node via random walk simulation
  • Higher PageRank = more likely to be visited by a random surfer following edges
  • Use: web page ranking, social influence scoring, citation analysis
  • cugraph.pagerank(G, alpha=0.85) — damping factor alpha is typically 0.85
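The random-walk intuition above can be made concrete with a small power-iteration sketch. This is a plain-Python illustration of the algorithm cuGraph parallelizes on GPU, not the cuGraph API; the `pagerank` helper and the toy edge list are ours.

```python
# Minimal CPU sketch of PageRank power iteration (the idea cuGraph parallelizes).
def pagerank(edges, alpha=0.85, iters=50):
    nodes = {n for e in edges for n in e}
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # teleport term: every node gets (1 - alpha) / N
        nxt = {n: (1 - alpha) / len(nodes) for n in nodes}
        for src in nodes:
            if out[src]:
                share = alpha * rank[src] / len(out[src])
                for dst in out[src]:
                    nxt[dst] += share
            else:  # dangling node: spread its rank uniformly
                for n in nodes:
                    nxt[n] += alpha * rank[src] / len(nodes)
        rank = nxt
    return rank

edges = [(0, 1), (1, 2), (2, 0), (3, 2)]
ranks = pagerank(edges)
# Node 2 receives edges from both 1 and 3, so it scores highest;
# ranks always sum to 1 (a probability distribution over nodes).
```

Note the two ingredients the bullet list names: the damping factor alpha (follow an edge with probability 0.85, teleport otherwise) and the random-walk share each node passes to its out-neighbors.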

BFS & DFS

  • BFS (Breadth-First Search): explores all neighbors at current depth before going deeper; finds shortest path in unweighted graphs
  • DFS (Depth-First Search): explores as deep as possible before backtracking; useful for cycle detection, topological sort
  • cuGraph provides GPU-accelerated BFS for massive graphs; DFS is inherently sequential and gains less from GPU parallelism
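The "explore all neighbors at the current depth first" behavior is what makes BFS return shortest hop counts in unweighted graphs. A minimal CPU sketch (the adjacency dict and `bfs_distances` helper are illustrative, not the cuGraph API, which returns distances as a cuDF DataFrame):

```python
# CPU sketch of BFS level traversal: first visit = shortest hop count.
from collections import deque

def bfs_distances(adj, source):
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in adj.get(u, []):
            if v not in dist:          # unvisited: record depth, enqueue
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

adj = {0: [1, 2], 1: [3], 2: [3], 3: [4]}
print(bfs_distances(adj, 0))  # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```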

Shortest Path (SSSP)

  • Single Source Shortest Path — finds shortest weighted paths from one source node to all others
  • Uses Bellman-Ford or Dijkstra variants on GPU
  • Applications: route optimization, network latency analysis
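To ground the SSSP bullets, here is a Dijkstra sketch in plain Python. The `sssp` helper and toy graph are ours for illustration; cuGraph's GPU implementation computes the same per-node distances at billion-edge scale.

```python
# CPU Dijkstra sketch of single-source shortest weighted paths.
import heapq

def sssp(adj, source):
    # adj: {node: [(neighbor, weight), ...]}
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                    # stale heap entry, skip
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd            # found a shorter route to v
                heapq.heappush(heap, (nd, v))
    return dist

adj = {0: [(1, 4.0), (2, 1.0)], 2: [(1, 2.0)], 1: [(3, 1.0)]}
print(sssp(adj, 0))  # {0: 0.0, 1: 3.0, 2: 1.0, 3: 4.0}
```

Note how the direct edge 0→1 (weight 4.0) loses to the detour 0→2→1 (weight 3.0), exactly the route-optimization behavior the bullets describe.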

Betweenness Centrality

  • Measures how often a node lies on the shortest path between other nodes
  • High betweenness = bridge node connecting otherwise separate clusters
  • Critical for identifying network bottlenecks, key influencers, and infrastructure vulnerabilities
  • Computationally expensive on CPU; GPU parallelism makes it practical for large graphs

Jaccard Similarity

  • Measures similarity between pairs of nodes based on shared neighbors
  • Formula: |shared neighbors| / |union of neighbors|
  • Used in recommendation systems (people with many shared connections may know each other)
  • Result: similarity score 0.0 (no shared neighbors) to 1.0 (identical neighbor sets)
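The formula above is a one-liner over neighbor sets. A pure-Python illustration (names and data are ours; cuGraph computes the same score per node pair on GPU):

```python
# Jaccard similarity: |shared neighbors| / |union of neighbors|.
def jaccard(neighbors, a, b):
    na, nb = neighbors[a], neighbors[b]
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0

neighbors = {
    "alice": {"bob", "carol", "dave"},
    "erin":  {"bob", "carol", "frank"},
}
print(jaccard(neighbors, "alice", "erin"))  # 2 shared / 4 in union = 0.5
```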

Louvain Community Detection

  • Identifies communities (clusters) in a network by maximizing modularity
  • Modularity measures density of edges within communities vs. between communities
  • Iterative greedy algorithm — works bottom-up merging nodes into clusters
  • Applications: social network segmentation, protein interaction networks, fraud ring detection
Additional Algorithms

Triangle Counting

  • Counts the number of closed triangles (3-node cliques) each node participates in
  • High triangle count = node is part of a tightly-knit community
  • Used in social network analysis (friend-of-friend clustering coefficient)
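A triangle exists whenever a node and one of its neighbors share a third neighbor. The following CPU sketch counts per-node triangles by neighbor-set intersection; the helper name and toy graph are ours, not the cuGraph API.

```python
# Per-node triangle count via neighbor-set intersection.
def triangles_per_node(adj):
    # adj: {node: set(neighbors)} for an undirected graph
    counts = {}
    for u, nu in adj.items():
        t = 0
        for v in nu:
            # each common neighbor w of u and v closes a triangle u-v-w
            t += len(nu & adj[v])
        counts[u] = t // 2  # each triangle at u is seen twice (via v and via w)
    return counts

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(triangles_per_node(adj))  # {0: 1, 1: 1, 2: 1, 3: 0}
```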

Connected Components

  • Weakly Connected Components: connected if ignoring edge direction — finds isolated subgraphs
  • Strongly Connected Components: directed; every node reachable from every other node in the component
  • Use: fraud detection (disconnected subgraphs), data quality (isolated records)
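Weakly connected components can be sketched with union-find, ignoring edge direction as the bullets describe. This is a CPU illustration of the output shape (one component label per node), not cuGraph's implementation:

```python
# Weakly connected components via union-find (edge direction ignored).
def connected_components(n_nodes, edges):
    parent = list(range(n_nodes))

    def find(x):                          # root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:                    # union the endpoints of every edge
        parent[find(u)] = find(v)
    return [find(i) for i in range(n_nodes)]

labels = connected_components(6, [(0, 1), (1, 2), (3, 4)])
# nodes 0-2 share one label, 3-4 another, node 5 is an isolated component
print(len(set(labels)))  # 3
```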

Time-Series Analysis

cuDF provides GPU-accelerated datetime operations, rolling windows, lag features, and resampling — enabling fast feature engineering for forecasting and temporal analysis on large datasets.

Datetime Operations

Date Parsing & Extraction

  • cudf.to_datetime(df['timestamp']) — GPU datetime parsing (same as pd.to_datetime())
  • Extract components: .dt.year, .dt.month, .dt.dayofweek, .dt.hour, .dt.minute
  • These extracted features become columns for ML models (e.g., hour-of-day as cyclical feature)
  • All datetime operations run on GPU — no CPU round-trip needed

Rolling Windows

  • df['value'].rolling(window=7).mean() — 7-day moving average
  • Common aggregations: .mean(), .sum(), .std(), .min(), .max()
  • Rolling features smooth noise and capture local trend — critical for forecasting models
  • Window size selection: match the temporal granularity of the underlying pattern
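The semantics of a rolling mean are easy to pin down in plain Python (window=3 here for brevity; the `rolling_mean` helper is an illustration of what cuDF's `.rolling(window=7).mean()` computes on GPU):

```python
# Sliding-window mean; incomplete leading windows yield None (NaN in cuDF).
def rolling_mean(values, window):
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)   # not enough history yet
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out

print(rolling_mean([1, 2, 3, 4, 5], 3))  # [None, None, 2.0, 3.0, 4.0]
```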
Feature Engineering for Forecasting

Lag Features

  • df['lag_1'] = df['value'].shift(1) — value at time t-1
  • df['lag_7'] = df['value'].shift(7) — value 7 periods ago (e.g., same day last week)
  • Lag features encode temporal autocorrelation as numeric inputs to ML models
  • Essential for ARIMA alternatives: XGBoost on lag features can outperform ARIMA on many real datasets
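`.shift(n)` simply moves the series forward by n slots, leaving the first n slots empty. A CPU sketch of the behavior (the `shift` helper mirrors cuDF's `.shift(n)`, with None standing in for NaN):

```python
# Lag features via shifting: value at t-n becomes a column at t.
def shift(values, n):
    # first n slots have no history, like NaN after .shift(n)
    return [None] * n + values[:-n] if n else list(values)

values = [10, 12, 15, 14]
print(shift(values, 1))  # [None, 10, 12, 15]   -> lag-1 feature
print(shift(values, 2))  # [None, None, 10, 12] -> lag-2 feature
```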

Resampling

  • df.resample('1H').agg({'value':'mean'}) — downsample to hourly means
  • Downsample: aggregate fine-grain data (e.g., seconds → hours)
  • Upsample: fill missing timestamps (e.g., hours → minutes with interpolation)
  • Useful for aligning datasets at different time granularities before joining
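Downsampling is just bucketing timestamps to a coarser grain and aggregating each bucket. A plain-Python sketch of what `df.resample('1H').agg({'value': 'mean'})` does (the `resample_hourly_mean` helper and sample rows are ours):

```python
# Hour-bucket downsampling: truncate each timestamp to its hour, then average.
from datetime import datetime
from collections import defaultdict

def resample_hourly_mean(rows):
    # rows: list of (datetime, value)
    buckets = defaultdict(list)
    for ts, v in rows:
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(v)
    return {hour: sum(vs) / len(vs) for hour, vs in sorted(buckets.items())}

rows = [
    (datetime(2024, 1, 1, 9, 15), 2.0),
    (datetime(2024, 1, 1, 9, 45), 4.0),
    (datetime(2024, 1, 1, 10, 5), 6.0),
]
print(resample_hourly_mean(rows))  # hour 09:00 -> 3.0, hour 10:00 -> 6.0
```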

Autocorrelation & Seasonality

  • Autocorrelation: correlation of a series with its own lagged values
  • High autocorrelation at lag k indicates the series has memory at that period
  • Seasonality detection: ACF (autocorrelation function) plot shows periodic peaks at seasonal lags
  • Seasonal decomposition: trend + seasonality + residual using statsmodels or cuML
  • Common use cases: sensor telemetry, financial tick data, system metrics, IoT streams
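Sample autocorrelation at lag k is the quantity an ACF plot shows peaks of at seasonal lags. A minimal sketch (the `autocorr` helper is ours; on a period-4 series it is high at lag 4 and negative at lag 2):

```python
# Sample autocorrelation at lag k: lagged covariance over variance.
def autocorr(x, k):
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + k] - mean) for i in range(n - k))
    return cov / var

x = [1, 2, 3, 4] * 6          # sawtooth with period 4
print(round(autocorr(x, 4), 3))  # ~0.833: strong memory at the seasonal lag
print(round(autocorr(x, 2), 3))  # ~-0.55: anti-correlated at the half period
```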
Forecasting Methods

ARIMA & Prophet on GPU

  • ARIMA (AutoRegressive Integrated Moving Average): uses past values + past errors to forecast
  • cuML provides GPU-accelerated ARIMA fitting for large datasets
  • Prophet: Facebook's decomposition model; handles multiple seasonality and holidays automatically
  • Both suitable for NCP-ADS exam context — know they exist as GPU-compatible options

EDA & Anomaly Detection

GPU-accelerated EDA with cuDF allows descriptive statistics, correlations, and missing data audits at scale. cuML provides multiple anomaly detection algorithms for identifying outliers before modeling.

Exploratory Data Analysis

Descriptive Statistics

  • df.describe() — count, mean, std, min, 25%, 50%, 75%, max; all GPU-computed
  • df['col'].value_counts() — frequency per category on GPU
  • df.isnull().sum() — null count per column — essential data quality audit
  • df.dtypes — verify column types loaded correctly (int, float, datetime)
  • EDA goal: understand distributions, correlations, outliers, and class balance before modeling

Correlation Analysis

  • df.corr() — pairwise Pearson correlation matrix; all computed on GPU
  • Visualize as a heatmap: convert the cuDF correlation matrix with .to_pandas(), then plot with matplotlib or seaborn
  • High correlation between features: may indicate multicollinearity — consider dropping one
  • High correlation between feature and target: strong predictor — prioritize in modeling
  • Near-zero correlation: may still be useful with non-linear models

Class Imbalance

  • Check with df['label'].value_counts()
  • Imbalanced classes cause classifiers to predict majority class — skewed metrics
  • Fixes: resampling (oversample minority, undersample majority), class_weight='balanced', SMOTE
  • Use F1 or AUC-ROC instead of accuracy for imbalanced problems
Anomaly Detection Methods

Z-Score Method

  • Z = (x - mean) / std; |z| > 3 flags outliers
  • Assumes Gaussian distribution — fails for skewed or multimodal data
  • Fast and simple; good starting point for symmetric distributions
  • All stats computed on GPU via cuDF — extremely fast for large datasets
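The Z-score rule fits in a few lines. A CPU sketch using the stdlib `statistics` module (the `zscore_outliers` helper and sample data are ours; cuDF would compute the mean and std on GPU):

```python
# Z-score outlier flagging: compute mean/std once, flag |z| > threshold.
from statistics import mean, pstdev

def zscore_outliers(x, thresh=3.0):
    m, s = mean(x), pstdev(x)
    return [v for v in x if abs((v - m) / s) > thresh]

data = [10.0] * 30 + [10.5, 9.5, 50.0]   # one gross outlier
print(zscore_outliers(data))  # [50.0]
```

Note the method's weakness from the bullets: the outlier itself inflates the mean and std it is judged against, which is one reason the IQR method below is more robust.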

IQR Method

  • Bounds: Q1 - 1.5×IQR to Q3 + 1.5×IQR; points outside are outliers
  • IQR = Q3 - Q1 (interquartile range)
  • Distribution-free: works on skewed data; robust to extreme outliers influencing mean/std
  • Same as box plot whisker calculation — visually intuitive
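The IQR bounds are equally short to compute. A sketch using `statistics.quantiles` (the `iqr_bounds` helper is ours; note that quartile conventions differ slightly between libraries, so exact bounds may vary):

```python
# IQR outlier bounds: Q1 - 1.5*IQR to Q3 + 1.5*IQR (box-plot whiskers).
from statistics import quantiles

def iqr_bounds(x):
    q1, _, q3 = quantiles(x, n=4)   # quartiles (exclusive method by default)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]
lo, hi = iqr_bounds(data)
outliers = [v for v in data if v < lo or v > hi]
print(outliers)  # [40]
```

Because the bounds come from quartiles rather than the mean and std, the extreme value 40 barely shifts them, which is exactly the robustness the bullets claim.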

Isolation Forest (cuML)

  • cuml.IsolationForest().fit_predict(X) — returns -1 for anomaly, 1 for normal
  • Principle: anomalies are easier to isolate — they require fewer splits to separate from others
  • Shorter average path length in isolation trees = more anomalous
  • Does not assume any distribution — works for multivariate anomalies
  • GPU-accelerated in cuML for high-dimensional data

Autoencoder Anomaly Detection

  • Train autoencoder on normal data only; anomalies have high reconstruction error
  • Threshold: set reconstruction error cutoff based on normal distribution of training errors
  • Especially effective for multivariate time series — captures complex temporal correlations
  • Requires more data and tuning than statistical methods but catches complex anomalies

DBSCAN Noise Points

  • DBSCAN labels points in low-density regions as noise (-1)
  • These noise points effectively serve as anomaly flags in density-based outlier detection
  • Advantage: no assumed distribution; finds clusters of arbitrary shape
  • Key params: eps (neighborhood radius), min_samples (minimum density threshold)

Practice Quiz — Graph Analytics, Time-Series & EDA

10 questions covering cuGraph algorithms, time-series feature engineering, and anomaly detection. Select one answer per question, then click Submit.

Memory Hooks

Six mnemonic devices to lock in the most exam-critical cuGraph, time-series, and anomaly detection concepts.

🌐
PageRank Logic
"Random Walker Visits Important Nodes More Often"
PageRank simulates a random surfer following edges. Nodes that many high-rank nodes point to get higher scores. GPU cuGraph computes this for billion-edge graphs where CPU NetworkX times out.
🏘️
Louvain Communities
"Max Modularity = Tight Clusters, Loose Bridges"
Louvain maximizes modularity — dense edges within communities, sparse between. It iteratively merges nodes to maximize this score. Use for social network segmentation and fraud ring detection.
📅
Lag Features
"Shift Back to See the Past"
.shift(n) in cuDF creates lag-n features — value n periods ago. Essential for forecasting: XGBoost on lag features captures temporal patterns that ARIMA models capture through p, d, q parameters.
🌡️
Anomaly Detection Methods
"Z Scores for Gaussian; IQR for Skewed; Forest for Complex"
Z-score assumes Gaussian — fails on skewed data. IQR is distribution-free. Isolation Forest handles multivariate anomalies. Autoencoder targets high reconstruction error in time series.
🕸️
Betweenness Centrality
"Bridge Nodes Have High Betweenness"
Betweenness centrality counts how many shortest paths pass through a node. High betweenness = the node connects otherwise separate clusters. Remove it and the graph fragments — critical infrastructure node.
🔄
Rolling Window
"Slide the Window, Smooth the Noise"
.rolling(window=7).mean() computes a 7-period sliding average — removes short-term noise to reveal underlying trend. Window size choice: match the temporal granularity of the pattern you want to capture.

Flashcards & Advisor

Click any flashcard to flip and reveal the answer. Then use the Study Advisor for targeted review guidance.


PageRank in cuGraph
What does it measure and how is it computed?
PageRank assigns an importance score to each node based on a random walk simulation. Nodes pointed to by many high-rank nodes get higher scores. GPU cuGraph processes billion-edge graphs where CPU NetworkX times out.
Louvain Algorithm
What does it optimize and what does it find?
Louvain finds communities in a network by maximizing modularity — a measure of edge density within communities vs. between them. Iterative greedy approach; widely used for social network segmentation.
Betweenness Centrality
What does high betweenness mean for a node?
High betweenness centrality means the node lies on many shortest paths between other nodes — it is a bridge connecting otherwise separate clusters. Removing it fragments the network.
cuDF .shift(n)
What does it create and why is it useful for forecasting?
.shift(n) creates a lag-n feature — the value n periods earlier. Encodes temporal autocorrelation as a numeric column that ML models (XGBoost, Random Forest) can learn from directly.
Isolation Forest
What is the core principle behind anomaly detection?
Anomalies are easier to isolate — they require fewer random splits to separate from the rest. Shorter average path length in isolation trees = higher anomaly score. No distributional assumption required.
IQR Anomaly Bounds
What are the outlier thresholds and what is the advantage?
Outlier bounds: Q1 - 1.5×IQR to Q3 + 1.5×IQR. The IQR method is distribution-free — works on skewed or non-Gaussian data where Z-score fails. Same as box plot whisker calculation.
Jaccard Similarity in cuGraph
How is it computed and what is it used for?
Jaccard = |shared neighbors| / |union of neighbors|. Measures node pair similarity based on neighborhood overlap. Used in recommendation systems — people with many mutual connections are likely to know each other.
cuDF Rolling Window
What does .rolling(7).mean() compute?
Computes a 7-period sliding average — for each point, the mean of the current and 6 preceding values. GPU-computed on cuDF. Smooths noise to reveal underlying trend; window size matches the temporal pattern period.

Study Advisor

cuGraph Key Points

  • Input format: edge list cuDF DataFrame with src and dst integer columns
  • PageRank: importance via random walk; alpha=0.85 damping factor is standard
  • Louvain: maximizes modularity — finds communities; iterative greedy algorithm
  • Betweenness: counts shortest paths through a node — bridge detection
  • Jaccard: shared neighbors / union of neighbors — node similarity for recommendations
  • Scale: billion-edge graphs on GPU; CPU NetworkX times out at a fraction of that size
  • Integration: cuGraph results return as cuDF DataFrames for downstream GPU analysis

Ready to Pass NCP-ADS?

Test your graph analytics and EDA knowledge with full practice exams on FlashGenius.

Unlock Full Practice Tests on FlashGenius →