NCA-ADS Exam Overview
This page covers Topic 4 (Descriptive Analysis & Visualization, ~13%) and Topic 7 (Advanced Data Structures, ~7%) of the NCA-ADS exam. Together they form roughly 20% of exam content, covering EDA workflows, statistical interpretation, plot selection, hypothesis testing, time-series handling, and graph-based data with RAPIDS cuGraph.
Full Exam Topic Weight Table
| Topic | Domain | Weight |
|---|---|---|
| 1 | Introduction to Accelerated Data Science | ~10% |
| 2 | GPU-Accelerated Data Loading & Preprocessing | ~15% |
| 3 | Feature Engineering & Data Transformation | ~15% |
| 4 | Descriptive Analysis & Visualization | ~13% |
| 5 | Machine Learning with cuML | ~20% |
| 6 | Model Evaluation & Hyperparameter Tuning | ~15% |
| 7 | Advanced Data Structures | ~7% |
| 8 | End-to-End Pipelines & Deployment | ~5% |
Core Concepts
Eight foundational concept areas for Topics 4 and 7. Study each group carefully — the exam tests conceptual understanding and the ability to select the right tool or approach for a given scenario.
Key cuDF EDA Functions
- .describe(): count, mean, std, min, 25th/50th/75th percentile, max
- .value_counts(): frequency of each unique value (good for categorical)
- .corr(): Pearson correlation matrix between all numerical columns
- .isnull().sum(): count missing values per column
- .nunique(): number of unique values per column
EDA Workflow — SDMCOA
- Shape: df.shape (rows and columns)
- Dtypes: df.dtypes (check column types)
- Missing: df.isnull().sum() (find gaps)
- Correlations: df.corr() (feature relationships)
- Outliers: box plots, IQR method
- Anomalies: check for impossible values, duplicates
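The six steps above can be sketched end to end as follows. This is a minimal illustration, not a reference implementation: it uses pandas so it runs without a GPU, on the assumption that cuDF exposes the same calls on a cudf.DataFrame. The sample data, the column names, and the 1.5 × IQR outlier fence are all illustrative choices.

```python
import pandas as pd

# Hypothetical sample data (column names are made up for illustration).
df = pd.DataFrame({
    "age": [25, 32, 47, 51, None, 38],
    "income": [40_000, 52_000, 88_000, 95_000, 61_000, 1_000_000],
    "segment": ["a", "b", "a", "c", "b", "a"],
})

# S - Shape: number of rows and columns
print(df.shape)

# D - Dtypes: confirm each column has the expected type
print(df.dtypes)

# M - Missing: count of missing values per column
missing = df.isnull().sum()

# C - Correlations: Pearson matrix over the numerical columns
corr = df[["age", "income"]].corr()

# O - Outliers: IQR method with the conventional 1.5 * IQR fences
q1 = df["income"].quantile(0.25)
q3 = df["income"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]

# A - Anomalies: duplicated rows and impossible values (negative ages)
n_duplicates = int(df.duplicated().sum())
n_impossible = int((df["age"] < 0).sum())
```

Running the steps in this order means structural problems (wrong dtypes, missing values) surface before they can distort the correlation and outlier checks.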
Central Tendency Measures
- Mean: average — sensitive to outliers
- Median: middle value — robust to outliers; better for skewed data
- Mode: most frequent value — useful for categorical data
- Rule: when mean >> median → right-skewed distribution → consider log transform
Spread & Shape Measures
- Standard Deviation: spread of data around the mean
- Variance: std squared
- Skewness: positive (right tail), negative (left tail) — affects model assumptions
- Kurtosis: tail weight of the distribution (often loosely described as peakedness); high kurtosis = heavy tails (more outliers)
- Percentiles: P25 (Q1), P50 (median), P75 (Q3); IQR = Q3 - Q1
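To make the "mean >> median means right skew" rule concrete, here is a small sketch on a made-up sample with one large value. It uses pandas and NumPy (an assumption for portability; cuDF Series offer similarly named methods), and log1p is just one common choice of log transform.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample: nine small values and one large outlier.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 50], dtype="float64")

mean = values.mean()            # pulled upward by the outlier
median = values.median()        # robust middle value
mode = values.mode().iloc[0]    # most frequent value
skewness = values.skew()        # positive for a right tail
kurt = values.kurt()            # excess kurtosis (0 for a normal distribution)

# A log transform compresses the right tail and reduces the skew.
logged = np.log1p(values)
skew_after_log = logged.skew()

print(f"mean={mean:.2f} median={median:.2f} mode={mode:.0f}")
print(f"skew before={skewness:.2f} after log1p={skew_after_log:.2f}")
```

The mean lands well above the median here, which is the signal to prefer the median as the summary statistic and to consider a log transform before modeling.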
For plotting, convert the data with .to_pandas() and use matplotlib/seaborn. cuXfilter enables GPU-accelerated interactive dashboards.
| Plot Type | When to Use | Example |
|---|---|---|
| Histogram | Distribution of a single numerical variable | Age distribution of customers |
| Box Plot | Distribution + outliers; compare across groups | Salary by department |
| Scatter Plot | Relationship between two numerical variables | Height vs weight |
| Line Chart | Trends over time (time-series) | Sales per month |
| Bar Chart | Comparison of categorical groups | Revenue by product category |
| Heatmap | Correlation matrix, feature relationships | Feature correlation heatmap |
| Violin Plot | Distribution shape across groups (box + KDE) | Model accuracy by method |
| Pie Chart | Proportional composition (max 5 categories) | Market share |
Pearson Correlation
- Measures linear relationship between two numerical variables
- Range: -1 to +1
- +1 = perfect positive linear correlation
- -1 = perfect negative linear correlation
- 0 = no linear correlation (may still have non-linear relationship)
- df.corr() in cuDF computes the Pearson correlation matrix
Multicollinearity & Causation
- Highly correlated features (|r| > 0.9): consider dropping one — multicollinearity harms linear models
- Heatmap: seaborn.heatmap(df.to_pandas().corr())
- Correlation does NOT equal causation; always investigate causality separately
- Rule of thumb: |r| > 0.9 = one must go (for linear models)
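The |r| > 0.9 rule can be applied programmatically by scanning the upper triangle of the correlation matrix. The helper below is a hypothetical illustration (the name highly_correlated_pairs and the sample columns are not from any library), written with pandas on the assumption that cuDF's df.corr() returns an equivalent matrix.

```python
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, |r|) for every column pair with |r| above threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    # Only visit each unordered pair once (upper triangle, skipping the diagonal).
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if r > threshold:
                pairs.append((cols[i], cols[j], r))
    return pairs

# Hypothetical data: height in cm and inches are redundant; shoe_size is not.
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_size": [40, 38, 43, 39, 42],
})

pairs = highly_correlated_pairs(df)
for col_a, col_b, r in pairs:
    print(f"{col_a} vs {col_b}: |r| = {r:.3f} -> consider dropping one")
```

Which of the two columns to drop is a judgment call; a common choice is to keep the one that is easier to interpret or has fewer missing values.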
Hypothesis Framework
- Null hypothesis (H0): no effect / no difference
- Alternative hypothesis (H1): there is an effect / difference
- p-value: probability of observing results at least as extreme as those seen, assuming H0 is true
- p < 0.05: statistically significant at the 5% significance level; reject H0
- p ≥ 0.05: fail to reject H0 — effect may be due to chance
Common Statistical Tests
- t-test: compare means of two groups (e.g., A/B test)
- Chi-square test: independence between two categorical variables
- ANOVA: compare means of 3+ groups
- Practical vs statistical significance: large datasets can make tiny, unimportant differences statistically significant — always check effect size
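As a sketch of an A/B-style comparison, the example below runs a two-sample t-test with scipy.stats.ttest_ind and then computes Cohen's d as an effect size. The data is invented, the Welch variant (equal_var=False) is one reasonable default rather than the only option, and the 0.05 threshold follows the rule stated above.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for two groups in an A/B test.
control = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7])
variant = np.array([10.9, 11.2, 10.8, 11.0, 11.3, 10.7, 11.1, 10.9])

# Welch's t-test: does not assume the two groups share a variance.
result = stats.ttest_ind(control, variant, equal_var=False)
p_value = result.pvalue
reject_h0 = p_value < 0.05

# Cohen's d: difference in means in units of the pooled standard deviation.
pooled_sd = np.sqrt((control.var(ddof=1) + variant.var(ddof=1)) / 2)
cohens_d = (variant.mean() - control.mean()) / pooled_sd

print(f"p-value = {p_value:.3g}, reject H0: {reject_h0}")
print(f"Cohen's d = {cohens_d:.2f}")
```

Reporting the effect size next to the p-value guards against the large-dataset trap: a tiny d with a tiny p-value is statistically significant but may not matter in practice.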
Parsing & Extraction
- Parse dates: cudf.to_datetime(df['date_col']) (convert string to datetime)
- Extract: .dt.year, .dt.month, .dt.day, .dt.dayofweek, .dt.hour
- Resampling: df.set_index('date').resample('1M').mean() (monthly averages)
- Missing timestamps: .interpolate() (fill gaps with interpolated values)
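The parsing, extraction, interpolation, and resampling steps might look like this in practice. The sketch uses pandas (pd.to_datetime in place of cudf.to_datetime) and invented data. It also uses the "MS" (month start) alias instead of "1M", because recent pandas releases deprecate "M" in favor of "ME"; cuDF may accept a different set of aliases, so treat the frequency string as an assumption.

```python
import pandas as pd

# Hypothetical daily sales with string dates and one missing value.
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-02-01", "2024-02-02"],
    "sales": [10.0, None, 30.0, 40.0, 60.0],
})

# Parse: convert strings to datetime64 values.
df["date"] = pd.to_datetime(df["date"])

# Extract: calendar components become ordinary numeric features.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["dayofweek"] = df["date"].dt.dayofweek  # Monday = 0

# Fill the gap by linear interpolation between neighboring values.
df["sales"] = df["sales"].interpolate()

# Resample: monthly averages, labeled by the first day of each month.
monthly = df.set_index("date")["sales"].resample("MS").mean()
print(monthly)
```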
Feature Engineering & Split Rules
- Rolling stats: df['sales'].rolling(window=7).mean() (7-day moving average)
- Lag features: df['sales_lag1'] = df['sales'].shift(1) (previous period value)
- Time-series split: ALWAYS split by time (never random split); future data cannot train the model
- Example: train on 2020–2024, test on 2025
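A compact sketch of the three rules above, using pandas and ten days of invented data. A 3-day window replaces the 7-day window only to keep the example short, and the cutoff date is an arbitrary illustration.

```python
import pandas as pd

# Hypothetical daily sales: 1.0, 2.0, ..., 10.0 over ten consecutive days.
dates = pd.date_range("2024-01-01", periods=10, freq="D")
df = pd.DataFrame({"date": dates, "sales": [float(v) for v in range(1, 11)]})

# Rolling mean: the first (window - 1) rows are NaN because the window is incomplete.
df["sales_ma3"] = df["sales"].rolling(window=3).mean()

# Lag feature: the previous day's sales; the first row has no predecessor, so it is NaN.
df["sales_lag1"] = df["sales"].shift(1)

# Time-based split: everything before the cutoff trains, everything after tests.
cutoff = pd.Timestamp("2024-01-08")
train = df[df["date"] < cutoff]
test = df[df["date"] >= cutoff]

print(len(train), "training rows,", len(test), "test rows")
```

Because every training date precedes every test date, no future information leaks into training, which is the property a random split would violate.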
Graph Use Cases
- Social networks: users + friendships
- Fraud detection: transactions + accounts
- Recommendation systems: users + items
- Key concepts: Degree (edges per node), Path (sequence of edges connecting two nodes), Component (connected group of nodes)
- Directed graphs: in-degree vs out-degree
Graph Algorithms (Associate Level)
- PageRank: identifies most important/influential nodes (used by Google Search)
- BFS (Breadth-First Search): explores the graph layer by layer from a start node; finds shortest paths in unweighted graphs
- Louvain Community Detection: finds clusters/communities in the graph
- Exam tip: understand purpose, not implementation details
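Although the exam focuses on purpose rather than implementation, a tiny pure-Python BFS can make the "layer by layer" idea tangible. This is a CPU sketch on a hypothetical friendship graph, not the cuGraph implementation; the function name bfs_distances is invented for illustration.

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Return the hop count from source to every reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            # The first visit to a node is along a shortest path (unweighted graph).
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances

# Hypothetical undirected friendship graph as an adjacency list.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
    "frank": [],  # isolated node: its own connected component
}

hops = bfs_distances(graph, "alice")
print(hops)
```

Nodes missing from the result (here, frank) are unreachable from the source, which is also how BFS can be used to discover connected components.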
Centrality Measures
- Degree centrality: most direct connections = most important (simplest)
- Betweenness centrality: how often a node appears on shortest paths — identifies bridges/brokers
- PageRank: importance weighted by quality of connections — being connected to important nodes matters more
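The difference between counting connections and weighting them by quality shows up even on a five-node graph. The sketch below computes in-degree and a basic power-iteration PageRank in pure Python. It is an illustration under common textbook assumptions (damping factor 0.85, rank of nodes without outgoing links spread uniformly), not the cuGraph algorithm, and the function names and edge list are hypothetical.

```python
def in_degree(edges):
    """Count incoming edges per node."""
    counts = {}
    for src, dst in edges:
        counts.setdefault(src, 0)
        counts[dst] = counts.get(dst, 0) + 1
    return counts

def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is shared by everyone.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# Hypothetical graph: three nodes point at "hub", and "hub" points only at "d".
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "d")]

degrees = in_degree(edges)
ranks = pagerank(edges)
top_by_degree = max(degrees, key=degrees.get)
top_by_pagerank = max(ranks, key=ranks.get)
print("highest in-degree:", top_by_degree)
print("highest PageRank:", top_by_pagerank)
```

In this graph "hub" has the most incoming links, but "d" ends up with the highest PageRank because its single incoming link comes from the most important node.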
cuGraph at Scale
- Computes PageRank and Betweenness Centrality on GPU
- Enables analysis of billion-edge graphs
- Network visualization: node size proportional to importance score (for small graphs)
- Results returned as cuDF DataFrames for downstream analysis
Memory Hooks
Six memorable hooks to lock in the most-tested concepts. Each hook gives you a mental shortcut you can recall under exam pressure.
Plot Selection — HLSBHV
Remember plots by purpose, not by name. Each letter maps to a plot type and its job: H=Histogram (distribution), L=Line (time), S=Scatter (two variables), B=Bar (categories), H=Heatmap (correlations), V=Violin (grouped distribution).
EDA Workflow — SDMCOA
Always follow this order — never skip steps. Shape and dtypes first (structural check), then data quality (missing), then relationships (correlations), then unusual values (outliers and anomalies). Skipping ahead causes you to miss data issues that corrupt your model.
p-value Rule
p < 0.05 = statistically significant = reject the null hypothesis. p ≥ 0.05 = insufficient evidence = fail to reject. Remember: statistical significance does not prove causation, and a big dataset can make tiny effects significant — always check effect size.
Time-Series Split Rule
Train on past, validate on future — never use random split for time-series. Random split leaks future information into training, artificially inflating performance. Always simulate real forecasting conditions: earlier dates train, later dates test.
Correlation Warning
Highly correlated features (Pearson |r| > 0.9) cause multicollinearity: linear models become unstable and coefficients become unreliable. Keep only one of the two correlated features. Tree-based models (XGBoost, Random Forest) are less affected, but the redundant feature still adds noise.
Graph vs Table
If your data is about connections between things (users-friends, transactions-accounts, buyers-products), a graph is the right structure. Social networks, fraud rings, and recommendation systems all depend on relationship patterns that a flat table cannot capture.
Flashcards
12 cards covering EDA functions, statistics, visualization, time-series, and graph concepts.
.describe()
Mean vs Median
p-value
Histogram
Box Plot
Pearson Correlation
Computed with df.corr() in cuDF. Warning: correlation does not imply causation.
Rolling Mean
Use df['col'].rolling(window=7).mean() for a 7-period moving average. Smooths out noise to reveal trends. The first (window-1) values are NaN because there is insufficient data for the window. Common windows: 7-day, 30-day, 52-week.
Lag Features
df['sales_lag1'] = df['sales'].shift(1) creates yesterday's sales as a feature. df['sales_lag7'] = df['sales'].shift(7) creates last week's sales. Essential for time-series forecasting models.
Time-Series Train/Test Split
PageRank
Louvain Community Detection
Available in cuGraph as cugraph.louvain().
.corr()
df.corr() in cuDF. Values range from -1 to +1. Use it to identify multicollinearity (drop features with |r| > 0.9 for linear models) and to understand relationships between features and the target variable.
Study Advisor
Personalized study plans for Topics 4 & 7 based on your background.
Business Analyst Study Plan
- Start with the Plot Selection Guide: you likely already use charts, so focus on matching the right chart to the data type and goal (priority: HIGH)
- Learn the p-value rule first (p < 0.05 = significant); you'll encounter this in A/B testing questions (priority: HIGH)
- Study mean vs. median carefully; right-skew vs. left-skew scenarios are common exam questions (priority: HIGH)
- Memorize the EDA SDMCOA workflow; this is the structured approach examiners expect (priority: MED)
- For graph data: understand use cases (fraud, social, recommendations); you don't need to code cuGraph (priority: MED)
- Practice identifying when NOT to use random split for time-series data (priority: MED)
- Review the correlation warning and understand why |r| > 0.9 is a problem for linear models (priority: MED)
Official Resources
Primary references for NCA-ADS Topics 4 and 7. Use these for deep dives beyond what the exam requires — but the APIs listed here may appear in scenario questions.
NVIDIA NCA-ADS Exam Page
Official certification overview, exam domains, and registration details from NVIDIA Learning.
nvidia.com
cuDF API Documentation
Full reference for cuDF DataFrame operations including .describe(), .corr(), .rolling(), .shift(), and datetime methods.
docs.rapids.ai/api/cudf
cuGraph Documentation
cuGraph graph analytics library: PageRank, BFS, Louvain, Betweenness Centrality, and Jaccard similarity on GPU.
docs.rapids.ai/api/cugraph
cuXfilter: RAPIDS Visualization
GPU-accelerated visualization library for creating interactive dashboards directly from cuDF DataFrames without .to_pandas().
github.com/rapidsai/cuxfilter