Domain 2: Data Analysis & Visualization
This domain covers how to mine, engineer, analyze, and visualize data in the context of generative and multimodal AI. While it carries 10% of exam weight (≈5 questions), these concepts bridge all other domains — data quality drives model quality.
What This Domain Covers
- Data mining techniques: clustering, classification, pattern discovery
- Feature engineering: transforming raw data into informative model inputs
- Explainability: Grad-CAM, attention maps, SHAP for multimodal models
- Data visualization: selecting and interpreting the right chart type
- Trend and anomaly detection in AI system outputs
Exam Strategy
- 10% = ≈5 questions — few questions, but they test specific facts
- Know Grad-CAM vs attention maps: which applies to CNNs vs transformers
- Chart selection: heatmaps for matrices (correlation, attention), line charts for trends over time, scatter plots for the relationship between two continuous variables
- Feature engineering: one-hot encoding for categoricals, normalization for continuous
- K-means requires K specified upfront; DBSCAN does not
Domain 2 Subtopics
| Subtopic | Key Concepts | Exam Priority |
|---|---|---|
| 2.1 — Data Mining & Feature Engineering | Clustering, classification, normalization, encoding, PCA | ⭐⭐⭐ |
| 2.2 — Attention Maps & Explainability | Grad-CAM (CNNs), attention weights (transformers), SHAP, LIME | ⭐⭐⭐ |
| 2.3 — Charts & Visualization Tools | Bar, line, scatter, heatmap, histogram, confusion matrix | ⭐⭐ |
| 2.4 — Trend & Anomaly Detection | Moving averages, Z-score, IQR, learning curves, ROC curves | ⭐⭐ |
Data Mining & Feature Engineering
Data mining extracts useful patterns from large datasets. Feature engineering transforms raw data into the structured numerical inputs that machine learning models can learn from effectively.
Clustering (Unsupervised)
- K-means: requires K (number of clusters) specified upfront; assigns each point to nearest centroid; minimizes within-cluster variance; sensitive to outliers and initial centroid placement
- DBSCAN: does NOT require K; finds clusters by density; handles irregular shapes and outliers; marks low-density points as noise
- Hierarchical: builds a tree (dendrogram) of cluster merges; agglomerative (bottom-up) most common
- Use case: customer segmentation, document grouping, anomaly detection
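The K-means/DBSCAN contrast above can be demonstrated on toy data. A minimal sketch, assuming scikit-learn is available; the blob coordinates are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# Two dense blobs plus one far-away outlier.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0, 0], scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.3, size=(20, 2))
outlier = np.array([[20.0, 20.0]])
X = np.vstack([blob_a, blob_b, outlier])

# K-means needs K upfront; the outlier is forced into one of the
# K clusters and can drag that centroid toward it.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN needs no K; points with too few neighbors within eps
# are labeled -1 (noise).
db_labels = DBSCAN(eps=1.0, min_samples=4).fit_predict(X)

print("K-means label of outlier:", km_labels[-1])
print("DBSCAN label of outlier:", db_labels[-1])  # -1 → noise
```

The outlier has no neighbors within `eps`, so DBSCAN flags it as noise, while K-means must assign it to one of the two requested clusters.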
Classification & Pattern Discovery
- Association rules: discover co-occurrence patterns (market basket analysis); support, confidence, lift metrics
- Decision trees: interpretable; split on features that maximize information gain (entropy reduction)
- Random forests: ensemble of trees; reduces variance via bagging
- Gradient boosting (XGBoost): sequential trees correct prior errors; often best on tabular data
- Evaluation: accuracy, precision, recall, F1, AUC-ROC
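The evaluation metrics listed above reduce to simple ratios over confusion-matrix counts. A minimal worked example with made-up labels:

```python
# Toy binary predictions vs. ground truth.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 4
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # 1
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # 1
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # 4

accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # of predicted positives, how many are real
recall    = tp / (tp + fn)  # of real positives, how many were found
f1        = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.8 0.8 0.8 0.8
```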
Numerical Feature Transforms
- Min-max normalization: scale to [0,1] range: (x − min)/(max − min)
- Z-score standardization: scale to mean=0, std=1: (x − μ)/σ; better when data follows a normal distribution
- Log transform: compress right-skewed distributions (e.g. income, word frequency)
- Binning: convert continuous to categorical ranges (age → 0–18, 19–35, 36+)
- Polynomial features: add interaction terms (x₁×x₂) to capture non-linear relationships
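The numerical transforms above can be sketched in a few lines of NumPy (toy values chosen to make the effects visible):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])  # right-skewed toy feature

# Min-max normalization → values land in [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization → mean 0, std 1
z = (x - x.mean()) / x.std()

# Log transform compresses the long right tail
# (use np.log1p instead when the data contains zeros)
logged = np.log(x)

print(minmax.min(), minmax.max())  # 0.0 1.0
print(logged)  # equal steps: the geometric series became arithmetic
```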
Categorical Encoding
- One-hot encoding: create binary column for each category value; avoids ordinal assumption; high cardinality → many columns
- Label encoding: assign integer to each category; only for ordinal variables (Low=1, Medium=2, High=3)
- Target encoding: replace category with mean target value; risk of leakage without proper validation
- Embedding layers: learned dense vector for each category (NLP tokens, entity IDs)
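A minimal encoding sketch, assuming pandas is available; the category values are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"size": ["Low", "High", "Medium", "Low"],
                   "color": ["red", "green", "red", "blue"]})

# One-hot: one binary column per category value; no ordering implied.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: only safe for ordinal variables like size.
order = {"Low": 1, "Medium": 2, "High": 3}
df["size_encoded"] = df["size"].map(order)

print(one_hot.columns.tolist())     # ['color_blue', 'color_green', 'color_red']
print(df["size_encoded"].tolist())  # [1, 3, 2, 1]
```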
Dimensionality Reduction
- PCA (Principal Component Analysis): find orthogonal axes of maximum variance; project data to fewer dimensions; preserves global structure; output: principal components ranked by explained variance
- t-SNE: non-linear; preserves local neighborhood structure; great for visualizing high-dim embeddings in 2D/3D; not deterministic
- UMAP: faster than t-SNE; better at preserving global structure; used for embedding visualization
- Feature selection: filter (correlation), wrapper (RFE), embedded (Lasso L1 penalty)
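PCA can be sketched directly as SVD of the centered data matrix (NumPy only; the synthetic data is deliberately stretched along one direction so the first component dominates):

```python
import numpy as np

rng = np.random.default_rng(1)
# 200 points stretched along the direction (3, 1), plus small isotropic noise.
X = rng.normal(size=(200, 1)) @ np.array([[3.0, 1.0]]) \
    + rng.normal(scale=0.1, size=(200, 2))

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_variance = S**2 / (len(Xc) - 1)
ratio = explained_variance / explained_variance.sum()

# Project onto the first principal component (2D → 1D).
projected = Xc @ Vt[0]

print(ratio)  # first component carries almost all the variance
```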
Missing Value Handling
- Mean/median imputation: simple; distorts the distribution; often unnecessary for gradient-boosted trees (e.g. XGBoost and LightGBM handle missing values natively)
- Mode imputation: for categorical variables
- KNN imputation: fill missing values using similar rows; better than mean for non-normal data
- Multiple imputation: generate multiple complete datasets and pool results; gold standard
- Indicator column: add binary "was_missing" column to let model learn from missingness pattern
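A short imputation sketch with pandas, combining median fill with a "was_missing" indicator on a toy series:

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 4.0, np.nan, 8.0, np.nan])

# Build the indicator first, so the model can still
# learn from the missingness pattern after filling.
was_missing = s.isna().astype(int)

# Median imputation: robust to outliers, but flattens the distribution.
filled = s.fillna(s.median())

print(was_missing.tolist())  # [0, 0, 1, 0, 1]
print(filled.tolist())       # [2.0, 4.0, 4.0, 8.0, 4.0]
```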
Attention Maps & Explainability
Understanding what a model "looks at" when making predictions is critical for debugging, trust, and regulatory compliance. Different architectures require different explainability techniques.
Grad-CAM (Gradient-weighted Class Activation Mapping)
- Target architecture: CNNs — uses the final convolutional layer feature maps
- Mechanism: compute gradients of the target class score with respect to the last conv layer feature maps → pool gradients to get per-channel importance weights → weighted sum → ReLU → resize to input size
- Output: heatmap overlaid on input image showing which regions were most influential
- Use case: diagnose model failures, verify that model attends to correct image regions, build trust
- Limitation: coarse resolution (tied to final conv feature map size); doesn't apply directly to transformers (attention-based methods are used there instead)
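The mechanism above can be sketched with NumPy stand-ins. Here `A` and `G` are random placeholders for the activations and gradients a real CNN framework would supply; only the combination step is shown:

```python
import numpy as np

# Stand-ins for what a framework would provide for one image and one class:
# A: final conv layer activations, G: gradients of the class score w.r.t. A.
C, H, W = 4, 7, 7
rng = np.random.default_rng(0)
A = rng.random((C, H, W))
G = rng.normal(size=(C, H, W))

# 1) Global-average-pool the gradients → one importance weight per channel.
weights = G.mean(axis=(1, 2))  # shape (C,)

# 2) Weighted sum of feature maps, then ReLU to keep positive evidence only.
cam = np.maximum(np.tensordot(weights, A, axes=1), 0.0)  # shape (H, W)

# 3) Normalize to [0, 1]; in practice this coarse map is then
#    resized to the input image size and overlaid as a heatmap.
if cam.max() > 0:
    cam = cam / cam.max()

print(cam.shape)  # (7, 7)
```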
Attention Map Visualization
- Target architecture: transformers (not CNNs)
- Self-attention: shows which tokens in a sequence attend to which other tokens; can reveal long-range dependencies (e.g. pronoun resolves to distant noun)
- Cross-attention (multimodal): shows which image regions a text token attends to — directly interprets text-image alignment in VLMs
- Multi-head: each attention head captures different relationships; aggregate or select specific heads
- Rollout: propagate attention across layers to get end-to-end attention from input tokens to output
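A rough attention-rollout sketch in NumPy. The attention matrices here are random stand-ins already averaged over heads, and the 0.5/0.5 residual mixing is a common simplifying assumption rather than a fixed rule:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_tokens = 3, 5

# Per-layer attention matrices (head-averaged); each row sums to 1.
raw = rng.random((n_layers, n_tokens, n_tokens))
attn = raw / raw.sum(axis=-1, keepdims=True)

# Rollout: mix in the residual connection (identity), then multiply
# layer matrices to follow attention end-to-end through the stack.
rollout = np.eye(n_tokens)
for layer in attn:
    a = 0.5 * layer + 0.5 * np.eye(n_tokens)  # residual mixing assumption
    rollout = a @ rollout

print(rollout.shape)         # (5, 5)
print(rollout.sum(axis=-1))  # rows still sum to ~1 (product of stochastic matrices)
```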
SHAP (SHapley Additive exPlanations)
- Based on game theory cooperative Shapley values
- Measures each feature's marginal contribution to a specific prediction
- Consistent and locally accurate — satisfies mathematical fairness axioms
- Works for any model (tree, neural net, linear)
- SHAP waterfall plot: shows feature impact for single prediction; SHAP summary: shows feature importance across dataset
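For a tiny two-feature "model", exact Shapley values can be computed by averaging each feature's marginal contribution over all join orders (the payoff numbers are hypothetical):

```python
from itertools import permutations

# Toy coalition payoff: individual effects plus an interaction bonus
# when f1 and f2 appear together. Numbers are invented for illustration.
def coalition_value(coalition):
    v = 0.0
    if "f1" in coalition: v += 10.0
    if "f2" in coalition: v += 5.0
    if "f1" in coalition and "f2" in coalition: v += 4.0
    return v

features = ["f1", "f2"]

# Shapley value: average marginal contribution over all join orders.
shapley = {f: 0.0 for f in features}
orders = list(permutations(features))
for order in orders:
    seen = set()
    for f in order:
        before = coalition_value(seen)
        seen.add(f)
        shapley[f] += (coalition_value(seen) - before) / len(orders)

print(shapley)  # {'f1': 12.0, 'f2': 7.0} — the interaction is split evenly
```

Note that the two values sum to the full coalition's payoff (19.0); this additivity is what "locally accurate" refers to.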
LIME (Local Interpretable Model-agnostic Explanations)
- Perturbs the input (e.g. masks image patches or words), observes prediction changes
- Fits a simple interpretable model (linear) locally around the instance
- Works for images (superpixels), text (word removal), tabular data
- Faster than SHAP for some models; less theoretically grounded
- Result: which features most influenced this specific prediction
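The perturb-and-fit idea can be sketched for tabular data with a proximity-weighted local linear fit (NumPy only; the black-box model is an invented stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)

# Black-box model: nonlinear in x0, mild linear effect from x1.
def model(X):
    return np.sin(X[:, 0]) + 0.1 * X[:, 1]

instance = np.array([0.0, 1.0])

# 1) Perturb around the instance, 2) query the black box,
# 3) fit a linear surrogate, weighting samples by proximity.
perturbed = instance + rng.normal(scale=0.1, size=(500, 2))
preds = model(perturbed)
weights = np.exp(-np.sum((perturbed - instance) ** 2, axis=1) / 0.02)

Xd = np.hstack([perturbed - instance, np.ones((500, 1))])  # centered + bias
sw = np.sqrt(weights)
coef, *_ = np.linalg.lstsq(Xd * sw[:, None], preds * sw, rcond=None)

print(coef[:2])  # local slopes ≈ [1.0, 0.1] (cos(0) = 1 near x0 = 0)
```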
Explainability in Multimodal Models
- Cross-attention maps reveal which image patches influence each text token
- Grad-CAM can be adapted to VLMs by targeting the visual encoder's final layer
- Probing classifiers: train lightweight classifiers on intermediate representations to understand what each layer encodes
- Concept Activation Vectors (CAVs): test whether human-defined concepts are encoded in model representations
Explainability Technique Comparison
| Technique | Architecture | Output | Granularity |
|---|---|---|---|
| Grad-CAM | CNNs only | Spatial heatmap on image | Coarse (conv map resolution) |
| Attention maps | Transformers | Token-to-token attention weights | Fine (per token) |
| SHAP | Any model | Feature importance per prediction | Feature-level |
| LIME | Any model | Local linear approximation | Superpixel / word / feature |
Data Visualization & Trend Analysis
Choosing the right chart type and correctly interpreting trends, anomalies, and model performance curves are core data analysis skills tested on the NCA-GENM exam.
When to Use Each Chart
- Bar chart: compare values across discrete categories (accuracy per model, revenue per quarter)
- Line chart: show trends over continuous time or ordered sequence (training loss over epochs, stock price)
- Scatter plot: show relationship/correlation between two continuous variables; add color for third variable
- Heatmap: visualize a matrix of values — correlation matrices, confusion matrices, attention weights
- Histogram: show frequency distribution of a single continuous variable (pixel intensity, embedding magnitude)
- Box plot: show distribution summary (median, IQR, min/max, outliers) for one or more groups
Model Performance Charts
- Learning curve: plot training and validation loss/accuracy over epochs; diagnose overfitting (training keeps improving while validation degrades) or underfitting (both plateau at poor values)
- ROC curve: plot True Positive Rate vs False Positive Rate across thresholds; AUC = area under curve (1.0 = perfect)
- Precision-Recall curve: better than ROC for class-imbalanced datasets; shows tradeoff between precision and recall
- Confusion matrix: heatmap of TP/TN/FP/FN per class; reveals which classes are confused
- Feature importance plot: horizontal bar chart ranked by SHAP or impurity-based importance
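AUC also equals the probability that a randomly chosen positive is scored above a randomly chosen negative; a few lines of plain Python verify this rank interpretation on toy scores:

```python
# AUC as a rank statistic: fraction of (positive, negative) pairs
# where the positive is scored higher (ties count half).
def auc(y_true, scores):
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

print(auc(y_true, scores))  # 8/9: one positive is out-ranked by one negative
print(auc(y_true, y_true))  # perfect ranking → 1.0
```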
Statistical Trend Detection
- Moving average: smooth time series noise; simple (SMA), exponential (EMA weights recent more)
- Seasonal decomposition: separate trend, seasonality, and residual components from a time series
- Autocorrelation: detect repeating patterns by measuring correlation with lagged versions of itself
- Linear regression trend: fit line to detect upward/downward trends in scatter data
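Both moving averages can be sketched in a few lines (a toy series with a single spike):

```python
import numpy as np

series = np.array([1.0, 2.0, 3.0, 10.0, 3.0, 2.0, 1.0])  # spike at index 3

# Simple moving average (window 3): equal weight to the last 3 points.
window = 3
sma = np.convolve(series, np.ones(window) / window, mode="valid")

# Exponential moving average: recent points weighted more (alpha = 0.5).
alpha = 0.5
ema = [float(series[0])]
for x in series[1:]:
    ema.append(float(alpha * x + (1 - alpha) * ema[-1]))

print(sma)  # the spike is spread across neighboring windows
print([round(v, 3) for v in ema])
```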
Anomaly Detection Methods
- Z-score: flag points where |z| > 3; z = (x − μ)/σ; assumes normal distribution
- IQR method: outlier if x < Q1 − 1.5×IQR or x > Q3 + 1.5×IQR; distribution-free
- Isolation Forest: anomalies are easier to isolate in random feature splits → shorter path lengths
- Autoencoder: high reconstruction error = anomaly; especially useful for image/sequence anomalies
- DBSCAN noise points: low-density points classified as anomalies automatically
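A small sketch of the two statistical rules on toy data. Note that the single extreme value inflates σ enough here that the |z| > 3 rule misses it (a known weakness called masking), while the distribution-free IQR rule still fires:

```python
import numpy as np

data = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 50.0])  # one clear outlier

# Z-score rule: |z| > 3 under an assumed normal distribution.
# The 50.0 drags σ up to ~14, so its own z-score is only ~2.4.
z = (data - data.mean()) / data.std()
z_outliers = np.abs(z) > 3

# IQR rule: fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR, no distribution assumed.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)

print(np.where(iqr_outliers)[0])  # [6] — only the 50.0 is flagged
```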
Python Visualization Tools
- Matplotlib: low-level; full control; plots, histograms, scatter, any chart type
- Seaborn: statistical visualization on top of Matplotlib; correlation heatmaps, distribution plots, pairplots
- Plotly: interactive web-based charts; hover tooltips, zoom, drill-down
- TensorBoard: visualize training metrics (loss, accuracy), histograms, embeddings, model graphs during training
- Weights & Biases (WandB): experiment tracking, hyperparameter sweep visualization, model comparison dashboards
Memory Hooks
Anchor these key concepts before exam day.
Data Mining Techniques
- K-means: specify K upfront; minimizes within-cluster variance; sensitive to outliers and init
- DBSCAN: no K needed; density-based; marks low-density points as noise; handles arbitrary cluster shapes
- Hierarchical clustering: builds dendrogram of cluster merges; agglomerative (bottom-up) is most common
- Association rules: support (how often), confidence (precision), lift (co-occurrence above random)
- Decision trees: split on feature that maximizes information gain (entropy reduction)
- Random forests: ensemble of decision trees via bagging; reduces variance over a single tree
- XGBoost: sequential boosting; each tree corrects residuals of previous; often strongest on tabular data