NCA-ADS: Descriptive Analysis, Visualization & Advanced Data Structures

NCA-ADS Exam Overview

This page covers Topic 4 (Descriptive Analysis & Visualization, ~13%) and Topic 7 (Advanced Data Structures, ~7%) of the NCA-ADS exam. Together they make up roughly 20% of exam content: EDA workflows, statistical interpretation, plot selection, hypothesis testing, time-series handling, and graph-based data with RAPIDS cuGraph.

Full Exam Topic Weight Table

Topic	Domain	Weight
1	Introduction to Accelerated Data Science	~10%
2	GPU-Accelerated Data Loading & Preprocessing	~15%
3	Feature Engineering & Data Transformation	~15%
4	Descriptive Analysis & Visualization	~13%
5	Machine Learning with cuML	~20%
6	Model Evaluation & Hyperparameter Tuning	~15%
7	Advanced Data Structures	~7%
8	End-to-End Pipelines & Deployment	~5%

Page Summary

Topics Covered

4 (Descriptive Analysis) & 7 (Advanced Data Structures)

Combined Exam Weight

~20% of NCA-ADS exam

Key Libraries

cuDF, cuGraph, matplotlib, seaborn, cuXfilter

EDA Functions

.describe(), .value_counts(), .corr(), .isnull().sum(), .nunique()

Plot Types to Know

Histogram, Box, Scatter, Line, Bar, Heatmap, Violin, Pie

Graph Algorithms

PageRank, BFS, Louvain Community Detection

Time-Series Key Ops

to_datetime(), .dt.*, rolling(), shift(), resample()

Statistical Tests

t-test, Chi-square, ANOVA, p-value interpretation

Core Concepts

Eight foundational concept areas for Topics 4 and 7. Study each group carefully — the exam tests conceptual understanding and the ability to select the right tool or approach for a given scenario.

1. Exploratory Data Analysis (EDA) — The Foundation

Goal: understand data before modeling — distribution, relationships, anomalies, missing data. Always do EDA before feature engineering: garbage in, garbage out.

Key cuDF EDA Functions

.describe() — count, mean, std, min, 25th/50th/75th percentile, max
.value_counts() — frequency of each unique value (good for categorical)
.corr() — Pearson correlation matrix between all numerical columns
.isnull().sum() — count missing values per column
.nunique() — number of unique values per column

EDA Workflow — SDMCOA

Shape — df.shape: rows and columns
Dtypes — df.dtypes: check column types
Missing — df.isnull().sum(): find gaps
Correlations — df.corr(): feature relationships
Outliers — box plots, IQR method
Anomalies — check for impossible values, duplicates
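
The SDMCOA steps above can be sketched as follows. This is a minimal illustration, not an official workflow: it uses pandas so that it runs without a GPU, on the assumption that cuDF mirrors the pandas API for these calls. The DataFrame contents and column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, None, 38, 29, 300],   # 300 is an impossible value
    "income": [40_000, 52_000, 88_000, 95_000, 61_000, 70_000, 45_000, 58_000],
    "segment": ["a", "b", "a", "c", "b", "a", "a", "b"],
})

shape = df.shape                          # S: rows and columns
dtypes = df.dtypes                        # D: column types
missing = df.isnull().sum()               # M: missing values per column
corr = df[["age", "income"]].corr()       # C: Pearson correlation matrix

# O: IQR method for outliers in "age" (NaN rows never satisfy the comparisons)
q1, q3 = df["age"].quantile(0.25), df["age"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

# A: anomalies such as duplicate rows and impossible values
n_duplicates = int(df.duplicated().sum())
impossible_age = df[df["age"] > 120]

# Supporting summaries from the function list above
summary = df.describe()
segment_counts = df["segment"].value_counts()

print(shape, missing.to_dict(), len(outliers), n_duplicates)
```

With cuDF installed, replacing the import with `import cudf` and building the frame with `cudf.DataFrame` would be the expected change, though that substitution is not tested here.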

2. Descriptive Statistics — Know What They Mean

Central Tendency Measures

Mean: average — sensitive to outliers
Median: middle value — robust to outliers; better for skewed data
Mode: most frequent value — useful for categorical data
Rule: when mean >> median → right-skewed distribution → consider log transform

Spread & Shape Measures

Standard Deviation: spread of data around the mean
Variance: std squared
Skewness: positive (right tail), negative (left tail) — affects model assumptions
Kurtosis: tail heaviness of the distribution (often loosely described as peakedness); high kurtosis = heavy tails (more outliers)
Percentiles: P25 (Q1), P50 (median), P75 (Q3); IQR = Q3 - Q1
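
The measures above can be checked on a small sample. The sketch below uses only Python's standard `statistics` module; the salary figures are made up, and the "inclusive" quartile method is one of several conventions, so other tools may report slightly different Q1 and Q3 values.

```python
import statistics

# Hypothetical salaries (in thousands); one extreme value creates a right tail.
salaries = [42, 45, 47, 50, 52, 55, 58, 60, 65, 400]

mean = statistics.mean(salaries)
median = statistics.median(salaries)
stdev = statistics.stdev(salaries)          # sample standard deviation
variance = statistics.variance(salaries)    # sample variance = stdev ** 2

# Quartiles via linear interpolation between data points ("inclusive" method)
q1, q2, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in salaries if x < lower_fence or x > upper_fence]

print(f"mean={mean}, median={median}")      # mean far above median: right skew
print(f"IQR={iqr}, outliers={outliers}")
```

The single value of 400 pulls the mean well above the median while leaving the median and IQR almost unchanged, which is why the median is preferred for skewed data.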

3. Visualization — Plot Selection Guide

Associate level: know WHICH plot for WHICH purpose. RAPIDS workflow: compute on GPU (cuDF), then .to_pandas() + matplotlib/seaborn for plotting. cuXfilter enables GPU-accelerated interactive dashboards.

Plot Type	When to Use	Example
Histogram	Distribution of a single numerical variable	Age distribution of customers
Box Plot	Distribution + outliers; compare across groups	Salary by department
Scatter Plot	Relationship between two numerical variables	Height vs weight
Line Chart	Trends over time (time-series)	Sales per month
Bar Chart	Comparison of categorical groups	Revenue by product category
Heatmap	Correlation matrix, feature relationships	Feature correlation heatmap
Violin Plot	Distribution shape across groups (box + KDE)	Model accuracy by method
Pie Chart	Proportional composition (max 5 categories)	Market share

4. Correlation Analysis

Pearson Correlation

Measures linear relationship between two numerical variables
Range: -1 to +1
+1 = perfect positive linear correlation
-1 = perfect negative linear correlation
0 = no linear correlation (may still have non-linear relationship)
df.corr() in cuDF computes the Pearson correlation matrix
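
To make the "0 = no linear correlation" point concrete, here is a small sketch in plain Python. The helper `pearson_r` is a hypothetical name written for this example, not a cuDF function; it applies the textbook formula (covariance divided by the product of the standard deviations).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [-3, -2, -1, 0, 1, 2, 3]
r_linear_pos = pearson_r(x, [2 * v + 1 for v in x])    # perfect positive line
r_linear_neg = pearson_r(x, [-0.5 * v for v in x])     # perfect negative line
r_quadratic = pearson_r(x, [v ** 2 for v in x])        # exact but non-linear relationship

print(r_linear_pos, r_linear_neg, r_quadratic)
```

The quadratic case has a deterministic relationship between the two variables, yet its Pearson coefficient is zero because the relationship is symmetric rather than linear.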

Multicollinearity & Causation

Highly correlated features (|r| > 0.9): consider dropping one — multicollinearity harms linear models
Heatmap: seaborn.heatmap(df.to_pandas().corr())
Correlation does NOT equal causation — always investigate causality separately
Rule of thumb: |r| > 0.9 = one must go (for linear models)
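
One way to apply the 0.9 rule is sketched below. It uses pandas as a GPU-free stand-in for cuDF (API parity is assumed, not verified here), and the feature names and values are hypothetical. Dropping the later column of each correlated pair is an arbitrary choice made for the example; in practice domain knowledge should decide which feature to keep.

```python
import pandas as pd

# Hypothetical features: height_in is height_cm in other units, so the two are redundant.
df = pd.DataFrame({
    "height_cm": [150, 160, 165, 170, 180, 190],
    "height_in": [59.1, 63.0, 65.0, 66.9, 70.9, 74.8],
    "shoe_size": [44, 37, 42, 39, 45, 40],
})

corr = df.corr().abs()      # absolute Pearson correlations
threshold = 0.9

# Check each pair once (upper triangle); drop the later column of a correlated pair.
to_drop = set()
cols = list(corr.columns)
for i, first in enumerate(cols):
    for second in cols[i + 1:]:
        if first not in to_drop and corr.loc[first, second] > threshold:
            to_drop.add(second)

reduced = df.drop(columns=sorted(to_drop))
print(sorted(to_drop))
```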

5. Hypothesis Testing and Statistical Significance

Hypothesis Framework

Null hypothesis (H0): no effect / no difference
Alternative hypothesis (H1): there is an effect / difference
p-value: probability of observing results at least as extreme as those seen, assuming H0 is true
p < 0.05: statistically significant at the conventional α = 0.05 level; reject H0
p ≥ 0.05: fail to reject H0; the result is consistent with chance variation (this does not prove H0)
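
The definition of a p-value can be made tangible with an exact permutation test, sketched below in plain Python. This is a different technique from the t-test and is used here only because it computes the p-value directly from its definition: under H0 the group labels are interchangeable, so the p-value is the share of relabelings whose difference in means is at least as extreme as the observed one. The function name and the data are hypothetical.

```python
from itertools import combinations
from statistics import fmean

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = abs(fmean(group_a) - fmean(group_b))
    extreme = total = 0
    # Enumerate every way of assigning n_a of the pooled values to group A.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(fmean(a) - fmean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

variant_a = [12, 14, 15, 16, 18]
variant_b = [8, 9, 10, 11, 13]
p_value = permutation_p_value(variant_a, variant_b)
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"p = {p_value:.4f} -> {decision}")
```

Only 4 of the 252 possible relabelings are as extreme as the observed split, so the p-value falls below 0.05 and H0 is rejected.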

Common Statistical Tests

t-test: compare means of two groups (e.g., A/B test)
Chi-square test: independence between two categorical variables
ANOVA: compare means of 3+ groups
Practical vs statistical significance: large datasets can make tiny, unimportant differences statistically significant — always check effect size
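
The gap between statistical and practical significance can be illustrated with summary statistics alone. The sketch below uses a large-sample z approximation in place of a t-test (the standard library has no t distribution, and with 100,000 observations per group the two are nearly identical). The function name, the numbers, and the equal-variance, equal-size setup are all assumptions made for the example.

```python
import math
from statistics import NormalDist

def two_sample_z(mean_a, mean_b, sd, n):
    """Large-sample z-test from summary statistics (same sd and n in both groups)."""
    standard_error = sd * math.sqrt(2 / n)
    z = (mean_b - mean_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided
    cohens_d = (mean_b - mean_a) / sd              # standardized effect size
    return z, p_value, cohens_d

# Hypothetical A/B test: 100,000 users per arm, means differ by 0.3 on a scale with sd 10.
z, p_value, d = two_sample_z(mean_a=100.0, mean_b=100.3, sd=10.0, n=100_000)
print(f"z = {z:.2f}, p = {p_value:.2e}, Cohen's d = {d:.3f}")
```

The p-value is far below 0.05, yet Cohen's d is about 0.03, well under the 0.2 usually called a small effect. The difference is real but may not matter in practice.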

6. Time-Series Data Handling

Parsing & Extraction

Parse dates: cudf.to_datetime(df['date_col']) — convert string to datetime
Extract: .dt.year, .dt.month, .dt.day, .dt.dayofweek, .dt.hour
Resampling: df.set_index('date').resample('1M').mean() — monthly averages
Missing timestamps: .interpolate() — fill gaps with interpolated values
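
A short sketch of these operations follows. It uses pandas so that it runs without a GPU, assuming cuDF exposes the same calls. The data is hypothetical. The sketch uses the "MS" (month start) resampling alias rather than '1M', because recent pandas releases deprecate the plain 'M' alias; which aliases a given cuDF version accepts should be checked against its documentation.

```python
import pandas as pd

# Hypothetical daily readings with one missing value.
df = pd.DataFrame({
    "date": ["2024-01-30", "2024-01-31", "2024-02-01", "2024-02-02"],
    "sales": [100.0, None, 140.0, 160.0],
})

df["date"] = pd.to_datetime(df["date"])         # string -> datetime64
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["dayofweek"] = df["date"].dt.dayofweek       # Monday = 0

df["sales"] = df["sales"].interpolate()         # linear fill: 100 -> 120 -> 140

# Monthly averages; each bin is labeled by the first day of its month.
monthly = df.set_index("date")["sales"].resample("MS").mean()
print(monthly)
```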

Feature Engineering & Split Rules

Rolling stats: df['sales'].rolling(window=7).mean() — 7-day moving average
Lag features: df['sales_lag1'] = df['sales'].shift(1) — previous period value
Time-series split: ALWAYS split by time (never random split) — future data cannot train the model
Example: train on 2020–2024, test on 2025
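
The rolling, lag, and split rules above can be sketched together. As elsewhere, pandas stands in for cuDF on the assumption that the calls match; the data, the 3-day window, and the cutoff date are hypothetical choices for the example.

```python
import pandas as pd

# Hypothetical daily sales for ten days.
df = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=10, freq="D"),
    "sales": [10, 12, 11, 15, 14, 18, 17, 20, 19, 23],
})

df["sales_ma3"] = df["sales"].rolling(window=3).mean()   # first 2 rows are NaN
df["sales_lag1"] = df["sales"].shift(1)                  # previous day's value

# Chronological split: rows before the cutoff train, rows on or after it test.
cutoff = pd.Timestamp("2025-01-08")
train = df[df["date"] < cutoff]
test = df[df["date"] >= cutoff]

print(len(train), len(test))
```

Every training date precedes every test date, so no future information reaches the model. A random split of the same rows would break that guarantee.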

7. Graph-Based Data Representation

Graph = nodes (entities) + edges (relationships). Use graphs when relationships between entities are as important as the entities themselves. cuGraph input: an edge list as a cuDF DataFrame with one source column and one destination column (for example 'src' and 'dst').

Graph Use Cases

Social networks: users + friendships
Fraud detection: transactions + accounts
Recommendation systems: users + items
Key concepts: Degree (edges per node), Path (connecting edges), Component (connected group)
Directed graphs: in-degree vs out-degree
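
In-degree and out-degree can be read straight off an edge list. The sketch below uses plain Python rather than cuGraph so that it runs anywhere; the node names and edges are hypothetical.

```python
from collections import Counter

# Hypothetical directed edge list: (src, dst) means "src follows dst".
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("dan", "cat"), ("cat", "ann")]

out_degree = Counter(src for src, _ in edges)   # edges leaving each node
in_degree = Counter(dst for _, dst in edges)    # edges arriving at each node
nodes = sorted(set(out_degree) | set(in_degree))

for node in nodes:
    print(node, "in:", in_degree[node], "out:", out_degree[node])
```

Here "cat" has the highest in-degree (3 followers) while "dan" has none, even though both have an out-degree of 1.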

Graph Algorithms (Associate Level)

PageRank: identifies the most important/influential nodes (originally developed for Google Search)
BFS (Breadth-First Search): explores the graph layer by layer from a start node; finds shortest paths in unweighted graphs
Louvain Community Detection: finds clusters/communities in the graph
Exam tip: understand purpose, not implementation details
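
Although the exam focuses on purpose, a few lines of plain Python show what "layer by layer" means for BFS. The function name and graph are hypothetical, and this is a CPU sketch of the idea rather than how cuGraph implements it.

```python
from collections import deque

def bfs_hops(adjacency, start):
    """Return the minimum number of edges from start to every reachable node."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in hops:          # first visit is the shortest route
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops

# Hypothetical undirected graph stored as adjacency lists.
adjacency = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
    "z": [],            # isolated node: never reached from "a"
}
print(bfs_hops(adjacency, "a"))
```

Nodes are discovered in order of distance from the start (0 hops, then 1, then 2), and the isolated node never appears in the result because it belongs to a different component.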

8. Node Importance and Network Relationships

Centrality Measures

Degree centrality: most direct connections = most important (simplest)
Betweenness centrality: how often a node appears on shortest paths — identifies bridges/brokers
PageRank: importance weighted by quality of connections — being connected to important nodes matters more
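
The difference between degree centrality and PageRank shows up even on a five-node graph. The sketch below is a simplified power-iteration PageRank in plain Python, with a damping factor of 0.85 and the rank of nodes without outgoing edges spread evenly; these are common conventions assumed for the example, not a description of cuGraph's implementation. The graph is hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank; rank held by dangling nodes is spread uniformly."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iterations):
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1 - damping + damping * dangling) / len(nodes) for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# Three nodes point at "c"; "c" points only at "e".
edges = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "e")]
in_degree = {n: sum(1 for _, dst in edges if dst == n) for n in "abcde"}
scores = pagerank(edges)

top_by_degree = max(in_degree, key=in_degree.get)
top_by_pagerank = max(scores, key=scores.get)
print(top_by_degree, top_by_pagerank)
```

Node "c" wins on in-degree with three incoming edges, but node "e" wins on PageRank with a single incoming edge, because that edge comes from the well-connected "c".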

cuGraph at Scale

Computes PageRank and Betweenness Centrality on GPU
Enables analysis of billion-edge graphs
Network visualization: node size proportional to importance score (for small graphs)
Results returned as cuDF DataFrames for downstream analysis

Memory Hooks

Six memorable hooks to lock in the most-tested concepts. Each hook gives you a mental shortcut you can recall under exam pressure.

📊

Plot Selection — HLSBHV

"Histograms Look at Singles, Bars Compare, Heatmaps Visualize Relationships, Scatter Shows pairs, Line is for time"

Remember plots by purpose, not by name. Each letter maps to a plot type and its job: H=Histogram (distribution), L=Line (time), S=Scatter (two variables), B=Bar (categories), H=Heatmap (correlations), V=Violin (grouped distribution).

🔍

EDA Workflow — SDMCOA

"Shape → Dtypes → Missing → Correlations → Outliers → Anomalies"

Always follow this order — never skip steps. Shape and dtypes first (structural check), then data quality (missing), then relationships (correlations), then unusual values (outliers and anomalies). Skipping ahead causes you to miss data issues that corrupt your model.

📉

p-value Rule

"p under 0.05, hypothesis is done"

p < 0.05 = statistically significant = reject the null hypothesis. p ≥ 0.05 = insufficient evidence = fail to reject. Remember: statistical significance does not prove causation, and a big dataset can make tiny effects significant — always check effect size.

⏱️

Time-Series Split Rule

"Time data splits by time, not by random"

Train on past, validate on future — never use random split for time-series. Random split leaks future information into training, artificially inflating performance. Always simulate real forecasting conditions: earlier dates train, later dates test.

⚠️

Correlation Warning

"r > 0.9 = one must go"

Highly correlated features (Pearson |r| > 0.9) cause multicollinearity: linear models become unstable and coefficients become unreliable. Keep only one of the two correlated features. Tree-based models (XGBoost, Random Forest) are less affected, but redundant features still split importance scores between them and make the model harder to interpret.

🕸️

Graph vs Table

"When relationships matter as much as entities, use a graph"

If your data is about connections between things (users-friends, transactions-accounts, buyers-products), a graph is the right structure. Social networks, fraud rings, and recommendation systems all depend on relationship patterns that a flat table cannot capture.

Flashcards

12 cards covering EDA functions, statistics, visualization, time-series, and graph concepts.

EDA

.describe()

.describe() computes summary stats for all numerical columns: count, mean, std, min, 25%/50%/75% percentile, max. Quick way to spot scale differences, potential outliers, and data ranges. Works on cuDF DataFrames directly.

Statistics

Mean vs Median

Mean = sum / count — sensitive to outliers. Median = middle value — robust to outliers. When mean >> median: right-skewed data. When mean << median: left-skewed data. Use median for salary, housing prices, and income data.

Statistics

p-value

Probability of seeing a result at least as extreme as the one observed if the null hypothesis is true. p < 0.05: statistically significant at the α = 0.05 level; reject the null. p ≥ 0.05: insufficient evidence to reject the null. Common mistakes: treating p < 0.05 as proof of causation, or reading the p-value as the probability that the null is true.

Visualization

Histogram

Shows distribution of a single numerical variable. X-axis: value ranges (bins). Y-axis: frequency. Use to identify: skewness (left/right tail), modality (peaks), outliers (isolated bars). Different from bar chart which compares categories.

Visualization

Box Plot

Shows distribution summary: median line, IQR box (Q1–Q3), whiskers extending to the most extreme points within 1.5×IQR of the box, and individual outlier points beyond the whiskers. Best for: comparing distributions across multiple groups and identifying outliers visually.

Statistics

Pearson Correlation

Measures linear relationship between two numerical variables. Range: -1 (perfect negative) to +1 (perfect positive). 0 = no linear relationship. Computed with df.corr() in cuDF. Warning: correlation does not imply causation.

Time-Series

Rolling Mean

Moving average over a sliding window. Usage: df['col'].rolling(window=7).mean() for 7-period moving average. Smooths out noise to reveal trends. First (window-1) values are NaN — insufficient data for the window. Common windows: 7-day, 30-day, 52-week.

Time-Series

Lag Features

Previous time period values as predictors. df['sales_lag1'] = df['sales'].shift(1) creates yesterday's sales as a feature. df['sales_lag7'] = df['sales'].shift(7) creates last week's sales. Essential for time-series forecasting models.

Time-Series

Time-Series Train/Test Split

ALWAYS split by time — train on earlier data, test on later data. NEVER use random split — it leaks future information into training. Example: train on 2020–2024, test on 2025. This simulates real forecasting conditions.

Graph

PageRank

Ranks nodes by importance based on quality of connections. High PageRank = connected to other important nodes. Used by Google Search, fraud detection, social network influence analysis. Available in cuGraph. Being connected to important nodes matters more than just having many connections.

Graph

Louvain Community Detection

Finds clusters (communities) in a graph by maximizing modularity, a measure of how densely connected nodes are within a community compared with between communities. Use for: customer segmentation, fraud ring detection, social group discovery. Available as cugraph.louvain().

EDA

.corr()

Computes pairwise Pearson correlation matrix for all numerical columns. df.corr() in cuDF. Values range -1 to +1. Use to identify multicollinearity (drop features with |r|>0.9 for linear models) and understand relationships between features and target variable.

Study Advisor

Personalized study plans for Topics 4 & 7 based on your background.

Business Analyst Study Plan

Start with the Plot Selection Guide; you likely already use charts, so focus on matching the right chart to the data type and goal (priority: HIGH)
Learn the p-value rule first (p < 0.05 = significant); you'll encounter this in A/B testing questions (priority: HIGH)
Study mean vs. median carefully; right-skew vs. left-skew scenarios are common exam questions (priority: HIGH)
Memorize the EDA SDMCOA workflow; this is the structured approach examiners expect (priority: MED)
For graph data: understand use cases (fraud, social, recommendations); you don't need to code cuGraph (priority: MED)
Practice identifying when NOT to use random split for time-series data (priority: MED)
Review the correlation warning; understand why |r| > 0.9 is a problem for linear models (priority: MED)

Official Resources

Primary references for NCA-ADS Topics 4 and 7. Use these for deep dives beyond what the exam requires — but the APIs listed here may appear in scenario questions.

Certification

NVIDIA NCA-ADS Exam Page

Official certification overview, exam domains, and registration details from NVIDIA Learning.

nvidia.com

API Reference

cuDF API Documentation

Full reference for cuDF DataFrame operations including .describe(), .corr(), .rolling(), .shift(), and datetime methods.

docs.rapids.ai/api/cudf

API Reference

cuGraph Documentation

cuGraph graph analytics library — PageRank, BFS, Louvain, Betweenness Centrality, and Jaccard similarity on GPU.

docs.rapids.ai/api/cugraph

Visualization

cuXfilter — RAPIDS Visualization

GPU-accelerated visualization library for creating interactive dashboards directly from cuDF DataFrames without .to_pandas().

github.com/rapidsai/cuxfilter
