FlashGenius
NVIDIA Certified · NCA-ADS · Associate Level

NCA-ADS: Descriptive Analysis & Visualization

Topics 4 & 7 of 8  |  ~20% of Exam  |  Accelerated Data Science Associate

Weight: 20%
Exam: 60 min
Passing: 70%
Questions: 50–60
Level: Associate

NCA-ADS Exam Overview

This page covers Topic 4 (Descriptive Analysis & Visualization, ~13%) and Topic 7 (Advanced Data Structures, ~7%) of the NCA-ADS exam. Together they form roughly 20% of exam content, covering EDA workflows, statistical interpretation, plot selection, hypothesis testing, time-series handling, and graph-based data with RAPIDS cuGraph.

Full Exam Topic Weight Table

Topic | Domain | Weight
1 | Introduction to Accelerated Data Science | ~10%
2 | GPU-Accelerated Data Loading & Preprocessing | ~15%
3 | Feature Engineering & Data Transformation | ~15%
4 | Descriptive Analysis & Visualization | ~13%
5 | Machine Learning with cuML | ~20%
6 | Model Evaluation & Hyperparameter Tuning | ~15%
7 | Advanced Data Structures | ~7%
8 | End-to-End Pipelines & Deployment | ~5%

Page Summary

Topics Covered: 4 (Descriptive Analysis) & 7 (Advanced Data Structures)
Combined Exam Weight: ~20% of NCA-ADS exam
Key Libraries: cuDF, cuGraph, matplotlib, seaborn, cuXfilter
EDA Functions: .describe(), .value_counts(), .corr(), .isnull().sum(), .nunique()
Plot Types to Know: Histogram, Box, Scatter, Line, Bar, Heatmap, Violin, Pie
Graph Algorithms: PageRank, BFS, Louvain Community Detection
Time-Series Key Ops: to_datetime(), .dt.*, rolling(), shift(), resample()
Statistical Tests: t-test, Chi-square, ANOVA, p-value interpretation

Core Concepts

Eight foundational concept areas for Topics 4 and 7. Study each group carefully — the exam tests conceptual understanding and the ability to select the right tool or approach for a given scenario.

1. Exploratory Data Analysis (EDA) — The Foundation
Goal: understand data before modeling — distribution, relationships, anomalies, missing data. Always do EDA before feature engineering: garbage in, garbage out.

Key cuDF EDA Functions

  • .describe() — count, mean, std, min, 25th/50th/75th percentile, max
  • .value_counts() — frequency of each unique value (good for categorical)
  • .corr() — Pearson correlation matrix between all numerical columns
  • .isnull().sum() — count missing values per column
  • .nunique() — number of unique values per column

EDA Workflow — SDMCOA

  • Shape — df.shape: rows and columns
  • Dtypes — df.dtypes: check column types
  • Missing — df.isnull().sum(): find gaps
  • Correlations — df.corr(): feature relationships
  • Outliers — box plots, IQR method
  • Anomalies — check for impossible values, duplicates
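
The SDMCOA steps above can be sketched as follows. This is a minimal illustration that uses pandas on the CPU so it runs without a GPU; it assumes the same calls work on a cuDF DataFrame, since cuDF is designed to mirror the pandas API, though individual methods may differ. The toy data and the variable names (`impossible`, `n_duplicates`, `city_counts`) are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: one missing age, one impossible age, one duplicate row.
df = pd.DataFrame({
    "age":    [25, 32, 47, None, 51, -3, 32],
    "income": [40_000, 52_000, 88_000, 61_000, 95_000, 30_000, 52_000],
    "city":   ["Austin", "Boston", "Austin", "Denver", "Boston", "Denver", "Boston"],
})

print(df.shape)                       # Shape: (rows, columns)
print(df.dtypes)                      # Dtypes: column types
missing = df.isnull().sum()           # Missing: gaps per column
corr = df[["age", "income"]].corr()   # Correlations: Pearson, numeric columns

# Quick categorical checks
city_counts = df["city"].value_counts()
unique_per_column = df.nunique()

# Outliers: IQR fences on income
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]

# Anomalies: impossible values and duplicate rows
impossible = df[df["age"] < 0]
n_duplicates = int(df.duplicated().sum())

print(missing, city_counts, sep="\n")
print(len(outliers), len(impossible), n_duplicates)
```

Running the structural checks first means later steps operate on columns whose types and gaps are already known; for example, the correlation step here is restricted to the numeric columns identified by `df.dtypes`.
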
2. Descriptive Statistics — Know What They Mean

Central Tendency Measures

  • Mean: average — sensitive to outliers
  • Median: middle value — robust to outliers; better for skewed data
  • Mode: most frequent value — useful for categorical data
  • Rule: when mean >> median → right-skewed distribution → consider log transform

Spread & Shape Measures

  • Standard Deviation: spread of data around the mean
  • Variance: std squared
  • Skewness: positive (right tail), negative (left tail) — affects model assumptions
  • Kurtosis: tail heaviness of the distribution (often loosely described as peakedness); high kurtosis = heavy tails (more outliers)
  • Percentiles: P25 (Q1), P50 (median), P75 (Q3); IQR = Q3 - Q1
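
To make the mean-versus-median rule and the IQR method concrete, here is a small sketch using only Python's standard `statistics` module. The salary figures are invented, and `method="inclusive"` is one of several quartile conventions, so other tools may report slightly different Q1 and Q3 values.

```python
import statistics

# Hypothetical salaries (in thousands); one very large value creates a right tail.
salaries = [38, 41, 44, 45, 47, 50, 52, 55, 58, 250]

mean = statistics.mean(salaries)          # pulled upward by the 250
median = statistics.median(salaries)      # barely affected by the 250
std = statistics.stdev(salaries)          # sample standard deviation
variance = statistics.variance(salaries)  # sample variance = std squared

# Quartiles and IQR; "inclusive" treats the data as the full population.
q1, q2, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1

# IQR rule: values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in salaries if x < lower or x > upper]

print(f"mean={mean:.1f} median={median} right_skewed={mean > median}")
print(f"Q1={q1} Q3={q3} IQR={iqr} outliers={outliers}")
```

The single extreme salary lifts the mean well above the median, which is the right-skew signal described above, and the same value is the only one the IQR fences flag.
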
3. Visualization — Plot Selection Guide
Associate level: know WHICH plot for WHICH purpose. RAPIDS workflow: compute on GPU (cuDF), then .to_pandas() + matplotlib/seaborn for plotting. cuXfilter enables GPU-accelerated interactive dashboards.
Plot Type | When to Use | Example
Histogram | Distribution of a single numerical variable | Age distribution of customers
Box Plot | Distribution + outliers; compare across groups | Salary by department
Scatter Plot | Relationship between two numerical variables | Height vs weight
Line Chart | Trends over time (time-series) | Sales per month
Bar Chart | Comparison of categorical groups | Revenue by product category
Heatmap | Correlation matrix, feature relationships | Feature correlation heatmap
Violin Plot | Distribution shape across groups (box + KDE) | Model accuracy by method
Pie Chart | Proportional composition (max 5 categories) | Market share
4. Correlation Analysis

Pearson Correlation

  • Measures linear relationship between two numerical variables
  • Range: -1 to +1
  • +1 = perfect positive linear correlation
  • -1 = perfect negative linear correlation
  • 0 = no linear correlation (may still have non-linear relationship)
  • df.corr() in cuDF computes the Pearson correlation matrix

Multicollinearity & Causation

  • Highly correlated features (|r| > 0.9): consider dropping one — multicollinearity harms linear models
  • Heatmap: seaborn.heatmap(df.to_pandas().corr())
  • Correlation does NOT equal causation — always investigate causality separately
  • Rule of thumb: |r| > 0.9 = one must go (for linear models)
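
One way the |r| > 0.9 rule might be applied in practice is sketched below. It uses pandas so it runs on a CPU; the assumption is that `df.corr()` behaves the same way on a cuDF DataFrame. The feature names, the data, and the choice to drop the later column of each correlated pair are illustrative, not a prescribed method.

```python
import pandas as pd

# Hypothetical features: height_in is height_cm in different units (redundant).
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_size": [42.0, 38.0, 44.0, 40.0, 43.0],
})

corr = df.corr().abs()  # absolute Pearson correlation matrix

# Check each pair once (upper triangle) and mark the later column for removal.
threshold = 0.9
to_drop = set()
cols = list(corr.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if a not in to_drop and corr.loc[a, b] > threshold:
            to_drop.add(b)

reduced = df.drop(columns=sorted(to_drop))
print(sorted(to_drop), list(reduced.columns))
```

Which member of a correlated pair to keep is a judgment call; a common alternative is to keep the feature that correlates more strongly with the target.
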
5. Hypothesis Testing and Statistical Significance

Hypothesis Framework

  • Null hypothesis (H0): no effect / no difference
  • Alternative hypothesis (H1): there is an effect / difference
  • p-value: probability of observing results at least as extreme as the data, assuming H0 is true
  • p < 0.05: statistically significant; reject H0 at the 5% significance level (often loosely phrased as 95% confidence)
  • p ≥ 0.05: fail to reject H0 — effect may be due to chance

Common Statistical Tests

  • t-test: compare means of two groups (e.g., A/B test)
  • Chi-square test: independence between two categorical variables
  • ANOVA: compare means of 3+ groups
  • Practical vs statistical significance: large datasets can make tiny, unimportant differences statistically significant — always check effect size
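
The gap between statistical and practical significance can be illustrated with a simulated A/B test. The sketch below uses only the standard library: it computes Welch's t statistic by hand and, as a stated simplification, converts it to a two-sided p-value with the normal approximation, which is reasonable only because the samples are large. The data are synthetic, and 0.2 is Cohen's conventional cutoff for a "small" effect.

```python
import math
import random
import statistics
from statistics import NormalDist

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 50_000

# Hypothetical A/B data: true means differ by only 0.05 standard deviations.
group_a = [rng.gauss(0.00, 1.0) for _ in range(n)]
group_b = [rng.gauss(0.05, 1.0) for _ in range(n)]

mean_a, mean_b = statistics.fmean(group_a), statistics.fmean(group_b)
var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)

# Welch's t statistic: difference in means over its standard error.
t_stat = (mean_b - mean_a) / math.sqrt(var_a / n + var_b / n)

# Two-sided p-value via the normal approximation (assumption: n is large).
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))

# Cohen's d: difference in means in units of the pooled standard deviation.
cohens_d = (mean_b - mean_a) / math.sqrt((var_a + var_b) / 2)

print(f"t={t_stat:.2f} p={p_value:.2g} d={cohens_d:.3f}")
print("significant:", p_value < 0.05, "| practically large:", abs(cohens_d) >= 0.2)
```

With this many rows the test rejects H0 comfortably even though the effect size stays far below the "small" threshold, which is why effect size should be reported alongside the p-value.
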
6. Time-Series Data Handling

Parsing & Extraction

  • Parse dates: cudf.to_datetime(df['date_col']) — convert string to datetime
  • Extract: .dt.year, .dt.month, .dt.day, .dt.dayofweek, .dt.hour
  • Resampling: df.set_index('date').resample('1M').mean() — monthly averages
  • Missing timestamps: .interpolate() — fill gaps with interpolated values

Feature Engineering & Split Rules

  • Rolling stats: df['sales'].rolling(window=7).mean() — 7-day moving average
  • Lag features: df['sales_lag1'] = df['sales'].shift(1) — previous period value
  • Time-series split: ALWAYS split by time (never random split) — future data cannot train the model
  • Example: train on 2020–2024, test on 2025
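
The parsing, feature-engineering, and split rules above can be sketched together. The example uses pandas so it runs on a CPU, on the assumption that the equivalent cuDF calls (`cudf.to_datetime`, `.dt`, `.rolling`, `.shift`) behave the same way. The data and the cutoff date are invented, and the monthly resample uses the month-start alias `"MS"` because pandas versions differ on how the month-end alias is spelled.

```python
import pandas as pd

# Hypothetical daily sales for 60 days, stored with string dates.
dates = pd.date_range("2025-01-01", periods=60, freq="D")
df = pd.DataFrame({
    "date": dates.strftime("%Y-%m-%d"),
    "sales": [100 + i for i in range(60)],
})

df["date"] = pd.to_datetime(df["date"])      # parse strings to datetime
df["month"] = df["date"].dt.month            # calendar features
df["dayofweek"] = df["date"].dt.dayofweek    # Monday = 0

df["sales_ma7"] = df["sales"].rolling(window=7).mean()  # 7-day moving average
df["sales_lag1"] = df["sales"].shift(1)                 # previous day's value

# Time-based split: everything before the cutoff trains, the rest tests.
cutoff = pd.Timestamp("2025-02-15")
train = df[df["date"] < cutoff]
test = df[df["date"] >= cutoff]

# Monthly averages, labeled by the first day of each month.
monthly = df.set_index("date")["sales"].resample("MS").mean()

print(df.head(8))
print(len(train), len(test))
print(monthly)
```

The first six moving-average values and the first lag value are missing because there is not yet enough history to fill them; rows like these are usually dropped or imputed before training.
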
7. Graph-Based Data Representation
Graph = nodes (entities) + edges (relationships). Use graphs when relationships between entities are as important as the entities themselves. cuGraph input: edge list as cuDF DataFrame with 'src' and 'dst' columns.

Graph Use Cases

  • Social networks: users + friendships
  • Fraud detection: transactions + accounts
  • Recommendation systems: users + items
  • Key concepts: Degree (edges per node), Path (connecting edges), Component (connected group)
  • Directed graphs: in-degree vs out-degree

Graph Algorithms (Associate Level)

  • PageRank: identifies most important/influential nodes (used by Google Search)
  • BFS (Breadth-First Search): shortest path exploration layer by layer
  • Louvain Community Detection: finds clusters/communities in the graph
  • Exam tip: understand purpose, not implementation details
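
Although the exam focuses on purpose rather than implementation, a small CPU sketch can make degree, components, and BFS concrete. The code below is plain Python, not cuGraph; the edge list only imitates the src/dst shape described above, and the node names and the `bfs_distances` helper are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical edge list in src/dst form.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e"), ("x", "y")]

# Build an undirected adjacency list.
adj = defaultdict(set)
for src, dst in edges:
    adj[src].add(dst)
    adj[dst].add(src)

def bfs_distances(start):
    """Return hop counts from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:  # first visit is along a shortest path
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

dist = bfs_distances("a")
degree = {node: len(nbrs) for node, nbrs in adj.items()}

print(dist)    # x and y are absent: they sit in a different component
print(degree)
```

Because BFS expands one layer at a time, the hop count recorded on a node's first visit is its shortest-path distance in an unweighted graph.
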
8. Node Importance and Network Relationships

Centrality Measures

  • Degree centrality: most direct connections = most important (simplest)
  • Betweenness centrality: how often a node appears on shortest paths — identifies bridges/brokers
  • PageRank: importance weighted by quality of connections — being connected to important nodes matters more

cuGraph at Scale

  • Computes PageRank and Betweenness Centrality on GPU
  • Enables analysis of billion-edge graphs
  • Network visualization: node size proportional to importance score (for small graphs)
  • Results returned as cuDF DataFrames for downstream analysis
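
To show why PageRank differs from simple degree counting, here is a plain-Python power-iteration sketch. It is a CPU illustration under stated assumptions, not cuGraph's implementation: the graph is a hypothetical five-node example, the damping factor of 0.85 is the commonly cited default, and the `pagerank` helper is written for this example only.

```python
# Hypothetical directed edge list: (src, dst) means src links to dst.
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "d"), ("d", "a")]

nodes = sorted({n for edge in edges for n in edge})
out_links = {n: [] for n in nodes}
in_degree = {n: 0 for n in nodes}
for src, dst in edges:
    out_links[src].append(dst)
    in_degree[dst] += 1

def pagerank(damping=0.85, iterations=100):
    """Power iteration; dangling nodes spread their rank evenly."""
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new = {n: (1 - damping + damping * dangling) / len(nodes) for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new[dst] += damping * rank[src] / len(out_links[src])
        rank = new
    return rank

rank = pagerank()
for n in sorted(nodes, key=rank.get, reverse=True):
    print(f"{n}: in_degree={in_degree[n]} pagerank={rank[n]:.3f}")
```

Nodes `a` and `d` each have a single incoming edge, yet `d` ranks higher because its one link comes from `hub`, the most important node. That is the "quality of connections" idea in miniature.
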

Memory Hooks

Six memorable hooks to lock in the most-tested concepts. Each hook gives you a mental shortcut you can recall under exam pressure.

📊

Plot Selection — HLSBHV

"Histograms Look at Singles, Bars Compare, Heatmaps Visualize Relationships, Scatter Shows pairs, Line is for time"

Remember plots by purpose, not by name. Each letter maps to a plot type and its job: H=Histogram (distribution), L=Line (time), S=Scatter (two variables), B=Bar (categories), H=Heatmap (correlations), V=Violin (grouped distribution).

🔍

EDA Workflow — SDMCOA

"Shape → Dtypes → Missing → Correlations → Outliers → Anomalies"

Always follow this order — never skip steps. Shape and dtypes first (structural check), then data quality (missing), then relationships (correlations), then unusual values (outliers and anomalies). Skipping ahead causes you to miss data issues that corrupt your model.

📉

p-value Rule

"p under 0.05, hypothesis is done"

p < 0.05 = statistically significant = reject the null hypothesis. p ≥ 0.05 = insufficient evidence = fail to reject. Remember: statistical significance does not prove causation, and a big dataset can make tiny effects significant — always check effect size.

⏱️

Time-Series Split Rule

"Time data splits by time, not by random"

Train on past, validate on future — never use random split for time-series. Random split leaks future information into training, artificially inflating performance. Always simulate real forecasting conditions: earlier dates train, later dates test.

⚠️

Correlation Warning

"r > 0.9 = one must go"

Highly correlated features (Pearson |r| > 0.9) cause multicollinearity — linear models become unstable and coefficients become unreliable. Keep only one of the two correlated features. Tree-based models (XGBoost, Random Forest) are less affected but it still adds noise.

🕸️

Graph vs Table

"When relationships matter as much as entities, use a graph"

If your data is about connections between things (users-friends, transactions-accounts, buyers-products), a graph is the right structure. Social networks, fraud rings, and recommendation systems all depend on relationship patterns that a flat table cannot capture.

Flashcards

12 cards covering EDA functions, statistics, visualization, time-series, and graph concepts.

EDA

.describe()

.describe() computes summary stats for all numerical columns: count, mean, std, min, 25%/50%/75% percentile, max. Quick way to spot scale differences, potential outliers, and data ranges. Works on cuDF DataFrames directly.
Statistics

Mean vs Median

Mean = sum / count — sensitive to outliers. Median = middle value — robust to outliers. When mean >> median: right-skewed data. When mean << median: left-skewed data. Use median for salary, housing prices, and income data.
Statistics

p-value

Probability of seeing a result at least as extreme as the one observed if the null hypothesis is true. p < 0.05: statistically significant; reject the null at the 5% significance level (often loosely phrased as 95% confidence). p ≥ 0.05: insufficient evidence to reject the null. Common mistake: p < 0.05 does not prove causation.
Visualization

Histogram

Shows distribution of a single numerical variable. X-axis: value ranges (bins). Y-axis: frequency. Use to identify: skewness (left/right tail), modality (peaks), outliers (isolated bars). Different from bar chart which compares categories.
Visualization

Box Plot

Shows distribution summary: median line, IQR box (Q1–Q3), whiskers (1.5×IQR), and individual outlier points beyond whiskers. Best for: comparing distributions across multiple groups and identifying outliers visually.
Statistics

Pearson Correlation

Measures linear relationship between two numerical variables. Range: -1 (perfect negative) to +1 (perfect positive). 0 = no linear relationship. Computed with df.corr() in cuDF. Warning: correlation does not imply causation.
Time-Series

Rolling Mean

Moving average over a sliding window. Usage: df['col'].rolling(window=7).mean() for 7-period moving average. Smooths out noise to reveal trends. First (window-1) values are NaN — insufficient data for the window. Common windows: 7-day, 30-day, 52-week.
Time-Series

Lag Features

Previous time period values as predictors. df['sales_lag1'] = df['sales'].shift(1) creates yesterday's sales as a feature. df['sales_lag7'] = df['sales'].shift(7) creates last week's sales. Essential for time-series forecasting models.
Time-Series

Time-Series Train/Test Split

ALWAYS split by time — train on earlier data, test on later data. NEVER use random split — it leaks future information into training. Example: train on 2020–2024, test on 2025. This simulates real forecasting conditions.
Graph

PageRank

Ranks nodes by importance based on quality of connections. High PageRank = connected to other important nodes. Used by Google Search, fraud detection, social network influence analysis. Available in cuGraph. Being connected to important nodes matters more than just having many connections.
Graph

Louvain Community Detection

Finds clusters (communities) in a graph by maximizing modularity, a measure of how densely connected nodes are within a community versus between communities. Use for: customer segmentation, fraud ring detection, social group discovery. Available as cugraph.louvain().
EDA

.corr()

Computes pairwise Pearson correlation matrix for all numerical columns. df.corr() in cuDF. Values range -1 to +1. Use to identify multicollinearity (drop features with |r|>0.9 for linear models) and understand relationships between features and target variable.

Study Advisor

Personalized study plans for Topics 4 & 7 based on your background.

Business Analyst Study Plan

  • [HIGH] Start with the Plot Selection Guide: you likely already use charts, so focus on matching the right chart to the data type and goal
  • [HIGH] Learn the p-value rule first (p < 0.05 = significant); you'll encounter this in A/B testing questions
  • [HIGH] Study mean vs. median carefully; right-skew vs. left-skew scenarios are common exam questions
  • [MED] Memorize the EDA SDMCOA workflow; this is the structured approach examiners expect
  • [MED] For graph data: understand use cases (fraud, social, recommendations); you don't need to code cuGraph
  • [MED] Practice identifying when NOT to use random split for time-series data
  • [MED] Review the correlation warning and understand why |r| > 0.9 is a problem for linear models

Official Resources

Primary references for NCA-ADS Topics 4 and 7. Use these for deep dives beyond what the exam requires — but the APIs listed here may appear in scenario questions.

Certification

NVIDIA NCA-ADS Exam Page

Official certification overview, exam domains, and registration details from NVIDIA Learning.

nvidia.com →
API Reference

cuDF API Documentation

Full reference for cuDF DataFrame operations including .describe(), .corr(), .rolling(), .shift(), and datetime methods.

docs.rapids.ai/api/cudf →
API Reference

cuGraph Documentation

cuGraph graph analytics library — PageRank, BFS, Louvain, Betweenness Centrality, and Jaccard similarity on GPU.

docs.rapids.ai/api/cugraph →
Visualization

cuXfilter — RAPIDS Visualization

GPU-accelerated visualization library for creating interactive dashboards directly from cuDF DataFrames without .to_pandas().

github.com/rapidsai/cuxfilter →
