
Experimentation & Model Testing

Model Dev · Data Preprocessing · Explainability · NVIDIA Riva · Diffusion Models

Covers 25% of NCA-GENM Exam — Largest Single Domain


Domain: Experimentation

Exam Weight: 25% — Subtopics 3.1–3.5

Experimentation is the largest NCA-GENM domain. It covers the full experimental workflow: developing and testing multimodal models, preprocessing diverse data sources, using models to improve explainability, ensuring data quality, and evaluating model accuracy.

Official Subtopics

#     Subtopic                                                     Key Skills
3.1   Assist in developing and testing multimodal AI models       Model selection, prototyping, evaluation pipelines, A/B testing
3.2   Manage and preprocess data from various sources             Data cleaning, augmentation, normalization, multimodal alignment
3.3   Use multimodal models to improve explainability             Attention maps, Grad-CAM, saliency, cross-modal explainability
3.4   Test data quality and consistency in a multimodal setting   Missing modality handling, label verification, consistency checks
3.5   Test AI models to ensure accuracy and effectiveness         Evaluation metrics, holdout sets, benchmark comparison

Why This Domain Is 25%

Practical Developer Focus

NCA-GENM targets developers who spend most of their time running experiments, evaluating model variants, and iterating on data. Experimentation is the day-to-day work of an AI associate.

Cross-Domain Knowledge

Experimentation integrates all other domains — you need Core ML knowledge to design experiments, Software skills to implement pipelines, and Data Analysis skills to interpret results.

Exam Focus Areas

Expect questions on: evaluation metrics (FID, BLEU, WER, accuracy), A/B testing methodology, data augmentation techniques, attention map interpretation, and NVIDIA Riva for ASR/TTS.

Model Development & Testing

Developing multimodal AI models involves selecting architectures, building evaluation pipelines, running controlled experiments, and interpreting results — all core to subtopics 3.1, 3.3, and 3.5.

Model Evaluation Metrics by Modality

Image Generation — FID

Fréchet Inception Distance measures statistical similarity between real and generated image distributions. Lower = better. Standard GAN/diffusion model quality metric.
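
To make the formula concrete, here is a minimal sketch (assuming NumPy and SciPy) that computes FID from feature statistics. In practice the means and covariances come from Inception-v3 features of real and generated images; the random arrays below are stand-ins.

  import numpy as np
  from scipy.linalg import sqrtm

  def fid_from_stats(mu1, sigma1, mu2, sigma2):
      """FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2*sqrt(S1 @ S2))."""
      diff = mu1 - mu2
      covmean = sqrtm(sigma1 @ sigma2)
      if np.iscomplexobj(covmean):  # matrix sqrt can pick up tiny imaginary noise
          covmean = covmean.real
      return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

  # Toy usage: two 4-D "feature" distributions standing in for Inception features
  rng = np.random.default_rng(0)
  real = rng.normal(size=(1000, 4))
  fake = rng.normal(loc=0.1, size=(1000, 4))
  print(fid_from_stats(real.mean(0), np.cov(real, rowvar=False),
                       fake.mean(0), np.cov(fake, rowvar=False)))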

Text Generation — BLEU

Bilingual Evaluation Understudy measures n-gram overlap between generated and reference text. Range 0–1. Used in machine translation and image captioning evaluation.
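
A quick sentence-level check with NLTK (assuming the nltk package is installed). Real evaluations typically use corpus-level BLEU; smoothing avoids zero scores when a higher-order n-gram has no match.

  from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

  reference = [["a", "cat", "sits", "on", "the", "mat"]]  # list of reference token lists
  candidate = ["a", "cat", "is", "on", "the", "mat"]      # generated tokens
  score = sentence_bleu(reference, candidate,
                        smoothing_function=SmoothingFunction().method1)
  print(f"BLEU: {score:.3f}")  # in [0, 1]; higher = closer to the reference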

Speech Recognition — WER

Word Error Rate = (Substitutions + Deletions + Insertions) / Total Words. Lower = better. Primary metric for ASR models. Measures transcription accuracy.
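
The formula translates directly into code; a self-contained sketch using word-level edit distance:

  def wer(reference: str, hypothesis: str) -> float:
      ref, hyp = reference.split(), hypothesis.split()
      # d[i][j] = edit distance between the first i ref words and first j hyp words
      d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
      for i in range(len(ref) + 1):
          d[i][0] = i
      for j in range(len(hyp) + 1):
          d[0][j] = j
      for i in range(1, len(ref) + 1):
          for j in range(1, len(hyp) + 1):
              sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
              d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # sub / del / ins
      return d[-1][-1] / len(ref)

  print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33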

Classification — Accuracy / F1

Accuracy = correct/total. F1 = harmonic mean of Precision and Recall. F1 is preferred for imbalanced classes. Macro-F1 averages across all classes equally.
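
A two-line check with scikit-learn (assuming the sklearn package):

  from sklearn.metrics import f1_score

  y_true = [0, 0, 0, 0, 1, 1, 2]
  y_pred = [0, 0, 0, 1, 1, 0, 2]
  print(f1_score(y_true, y_pred, average="macro"))  # per-class F1 (0.75, 0.5, 1.0) -> 0.75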

Image Segmentation — IoU

Intersection over Union measures overlap between predicted and ground truth segmentation masks. mIoU averages across all classes. Standard segmentation metric.
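
A minimal NumPy sketch for binary masks; mIoU would average this over classes:

  import numpy as np

  def iou(pred: np.ndarray, target: np.ndarray) -> float:
      pred, target = pred.astype(bool), target.astype(bool)
      intersection = np.logical_and(pred, target).sum()
      union = np.logical_or(pred, target).sum()
      return intersection / union if union else 1.0  # two empty masks agree perfectly

  pred = np.array([[1, 1, 0], [0, 1, 0]])
  gt = np.array([[1, 0, 0], [0, 1, 1]])
  print(iou(pred, gt))  # 2 overlapping pixels / 4 in the union = 0.5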

Text Summarization — ROUGE

Recall-Oriented Understudy for Gisting Evaluation. ROUGE-1 (unigrams), ROUGE-2 (bigrams), ROUGE-L (longest common subsequence). Measures summary quality.
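
A short sketch using Google's rouge-score package (pip install rouge-score); the call signature below is to the best of our knowledge, so verify against the library docs:

  from rouge_score import rouge_scorer

  scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
  scores = scorer.score("the cat sat on the mat",        # reference summary
                        "a cat was sitting on the mat")  # generated summary
  print(scores["rougeL"].fmeasure)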

A/B Testing in AI Experimentation (3.1, 3.5)

How A/B Testing Works in ML

A/B testing compares two model variants (or prompt strategies, architectures, preprocessing pipelines) under controlled, identical conditions (a significance-check sketch follows the list):

  • Control (A): baseline model or approach
  • Treatment (B): new variant with one changed component
  • Same evaluation set: both variants tested on identical held-out data
  • Statistical significance: ensure differences aren't due to chance (p-value, confidence intervals)
  • One change at a time: isolates which variable caused the performance difference
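
A minimal sketch of the significance check on synthetic per-example scores: paired, because both variants are scored on the same evaluation examples. (For binary correct/incorrect outcomes, McNemar's test is the usual alternative.)

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(42)
  scores_a = rng.normal(0.80, 0.05, size=200)             # variant A, per-example scores
  scores_b = scores_a + rng.normal(0.02, 0.03, size=200)  # variant B on the SAME examples

  t_stat, p_value = stats.ttest_rel(scores_b, scores_a)   # paired t-test
  print(f"mean A={scores_a.mean():.3f}  mean B={scores_b.mean():.3f}  p={p_value:.4f}")
  if p_value < 0.05:
      print("Difference is statistically significant at alpha = 0.05")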

Explainability in Multimodal Models (3.3)

Grad-CAM

Gradient-weighted Class Activation Mapping highlights which image regions most influence a CNN's prediction. Uses gradients flowing into the final conv layer to generate a heatmap overlay. Helps identify model biases (e.g., background vs. object focus).
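
A minimal PyTorch sketch of the idea, assuming torchvision's ResNet-18; a random tensor stands in for a real preprocessed image:

  import torch
  import torch.nn.functional as F
  from torchvision.models import resnet18

  model = resnet18(weights="IMAGENET1K_V1").eval()
  acts, grads = {}, {}
  model.layer4.register_forward_hook(lambda m, i, o: acts.update(value=o))
  model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(value=go[0]))

  x = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed input image
  logits = model(x)
  logits[0, logits.argmax()].backward()  # gradient of the top class score

  # Global-average-pool the gradients to get per-channel weights, then take the
  # ReLU'd weighted sum of activations and upsample to input resolution.
  weights = grads["value"].mean(dim=(2, 3), keepdim=True)
  cam = F.relu((weights * acts["value"]).sum(dim=1, keepdim=True))
  cam = F.interpolate(cam, size=(224, 224), mode="bilinear", align_corners=False)
  print(cam.shape)  # (1, 1, 224, 224) heatmap to overlay on the image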

Attention Maps

Transformer attention weights show which tokens/patches the model attends to when making predictions. In vision-language models, cross-attention maps reveal the image regions linked to specific words. Directly tested in Data Analysis domain (2.2).

LIME / SHAP

Model-agnostic explainability. LIME perturbs inputs and fits a local linear model. SHAP uses Shapley values from game theory to fairly attribute feature importance. Both work for text, image, and tabular modalities.

Multimodal Explainability

Identify which modality (text or image) the model relied on for a given prediction. Cross-modal attention scores reveal modality contribution. Important for debugging why a model ignores one modality.

Model Evaluation Workflow (3.5)

1. Split Data: Train / Validation / Test — never leak test data (see the split sketch below)
2. Baseline: Establish baseline model performance on evaluation set
3. Experiment: Change one variable, retrain, evaluate on same val set
4. Compare: A/B test on held-out test set; check statistical significance
5. Iterate: Document findings, repeat with next hypothesis
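
A scikit-learn sketch of step 1 on a hypothetical toy dataset, carving out the test set first so it is never touched until the final comparison (60/20/20 overall):

  from sklearn.model_selection import train_test_split

  X, y = list(range(1000)), [i % 2 for i in range(1000)]  # hypothetical data and labels

  # Hold out 20% as the test set, stratified to preserve class balance
  X_tmp, X_test, y_tmp, y_test = train_test_split(
      X, y, test_size=0.20, stratify=y, random_state=0)
  # Split the remainder 75/25 into train and validation (60/20 of the total)
  X_train, X_val, y_train, y_val = train_test_split(
      X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0)
  print(len(X_train), len(X_val), len(X_test))  # 600 200 200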

Data Preprocessing & Quality

Subtopics 3.2 and 3.4 cover managing and preprocessing diverse data sources and ensuring data quality — critical steps before any model training.

Data Augmentation Techniques (3.2)

Image Flipping

Horizontal/vertical flips. Simple and effective. Doubles dataset size with one transformation.

Random Crop

Crop random regions at training time. Forces model to be location-invariant. Standard in ImageNet training.

Color Jitter

Randomly adjust brightness, contrast, saturation, hue. Prevents model from relying on color alone.

Gaussian Noise

Add random noise to images or audio. Improves robustness to noisy real-world inputs.

Rotation / Affine

Rotate, scale, shear images. Teaches rotational invariance. Avoid for text where orientation matters.

Mixup / CutMix

Blend two training examples linearly (Mixup) or paste regions between images (CutMix). Improves generalization.

Text Augmentation

Synonym replacement, back-translation, random deletion/insertion. Increases textual diversity without new labeling.

Audio Augmentation

Time stretching, pitch shifting, adding background noise, SpecAugment (mask frequency/time bands). Standard for ASR robustness.
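
A torchvision sketch combining several of the image techniques above; the parameter values are illustrative defaults, not prescribed. Augmentation applies at training time only:

  from torchvision import transforms

  train_transform = transforms.Compose([
      transforms.RandomHorizontalFlip(p=0.5),                 # flipping
      transforms.RandomResizedCrop(224),                      # random crop
      transforms.ColorJitter(brightness=0.2, contrast=0.2,
                             saturation=0.2, hue=0.1),        # color jitter
      transforms.RandomAffine(degrees=15, scale=(0.9, 1.1)),  # rotation / affine
      transforms.ToTensor(),
      transforms.Normalize(mean=[0.485, 0.456, 0.406],        # ImageNet statistics
                           std=[0.229, 0.224, 0.225]),
  ])

  # Evaluation uses deterministic preprocessing only (no random augmentation)
  eval_transform = transforms.Compose([
      transforms.Resize(256),
      transforms.CenterCrop(224),
      transforms.ToTensor(),
      transforms.Normalize(mean=[0.485, 0.456, 0.406],
                           std=[0.229, 0.224, 0.225]),
  ])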

Data Preprocessing Pipeline (3.2)

Step           Purpose                                      Multimodal Specifics
Collection     Gather raw data from sources                 Ensure paired data (image+caption, audio+transcript)
Cleaning       Remove duplicates, corrupt files, outliers   Detect misaligned pairs (wrong audio with video)
Normalization  Scale features to consistent range           Images: [0,1] or [-1,1]; audio: normalize waveform amplitude
Tokenization   Convert text to model-ready tokens           Use model-specific tokenizer (BERT tokenizer, CLIP tokenizer)
Augmentation   Artificially expand training set             Apply augmentations consistently to paired modalities
Batching       Package into mini-batches for GPU training   Handle variable-length sequences with padding/collation
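
A PyTorch sketch of the batching row: padding variable-length token sequences in a DataLoader collate function (the token IDs here are hypothetical):

  import torch
  from torch.nn.utils.rnn import pad_sequence

  def collate(batch):
      """batch: list of (token_id_tensor, label) pairs of varying length."""
      seqs, labels = zip(*batch)
      padded = pad_sequence(list(seqs), batch_first=True, padding_value=0)  # 0 = PAD id
      mask = padded != 0  # attention mask marking real tokens
      return padded, mask, torch.tensor(labels)

  batch = [(torch.tensor([5, 8, 2]), 1), (torch.tensor([7, 3, 9, 4, 1]), 0)]
  ids, mask, labels = collate(batch)
  print(ids.shape)  # (2, 5): both sequences padded to the longest in the batch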

Data Quality Testing in Multimodal Settings (3.4)

Missing Modality Detection

Check for samples where one modality is absent (image with no caption, video with no audio). Decide: impute, exclude, or use modality-dropout training to handle missing modalities at inference.

Modality Alignment Verification

Ensure paired samples are correctly matched. Common issue: metadata errors causing wrong audio paired with video. Use hash-based checks or model-based verification (CLIP similarity score threshold).

Label Consistency

Check for conflicting annotations across modalities. Use inter-annotator agreement scores (Cohen's kappa). Resolve disagreements through majority voting or expert review.

Class Balance Analysis

Plot class distribution histograms. Imbalanced classes → biased model. Remedies: oversample minority class, undersample majority, or use class-weighted loss.
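
A sketch of three of these checks on a hypothetical list of records; the alignment check (e.g., thresholding a CLIP image-text similarity score) requires a pretrained model and is only noted in a comment:

  from collections import Counter
  from sklearn.metrics import cohen_kappa_score

  records = [
      {"image": "a.jpg", "caption": "a dog", "label": "dog"},
      {"image": "b.jpg", "caption": None, "label": "cat"},  # missing modality
      {"image": "c.jpg", "caption": "a cat", "label": "cat"},
  ]

  # 1. Completeness: flag records with a missing modality
  incomplete = [r for r in records if not all(r.values())]
  print(f"{len(incomplete)} record(s) with a missing modality")

  # 2. Consistency: inter-annotator agreement on a doubly-labeled subset
  annotator_1 = ["dog", "cat", "cat", "dog"]
  annotator_2 = ["dog", "cat", "dog", "dog"]
  print(f"Cohen's kappa: {cohen_kappa_score(annotator_1, annotator_2):.2f}")

  # 3. Balance: inspect the class distribution for imbalance
  print(Counter(r["label"] for r in records))

  # 4. Alignment would score each (image, caption) pair with CLIP and flag
  #    pairs whose similarity falls below a chosen threshold.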

NVIDIA Riva & Diffusion Models

Two key technology areas explicitly called out in the Experimentation domain: NVIDIA Riva for conversational AI (ASR/TTS), and diffusion models for generative image experimentation.

NVIDIA Riva — Conversational AI SDK

🎤 ASR — Automatic Speech Recognition

  • Converts spoken audio → text transcriptions
  • Built on acoustic model + language model pipeline
  • Customizable with domain-specific vocabulary
  • Deployed on NVIDIA GPU with Triton Inference Server
  • Key metric: Word Error Rate (WER) — lower is better
  • Supports real-time (streaming) and batch transcription

🔊 TTS — Text-to-Speech

  • Converts text → natural-sounding speech audio
  • Models: FastPitch (spectrogram generation) + HiFi-GAN (vocoder)
  • Customizable voice, pitch, speaking rate
  • Supports multiple languages and speaker styles
  • End-to-end pipeline: text → mel spectrogram → waveform
  • Deployed alongside ASR for full conversational AI

End-to-End Conversational AI Pipeline on Riva

ASR (speech → text) → NLP processing (intent/entity extraction) → Response generation (LLM) → TTS (text → speech). Deployed on Kubernetes with Helm charts for production scaling. NVIDIA Riva handles the ASR and TTS components; NeMo handles the NLP/LLM components.

Generative AI with Diffusion Models (3.1, 3.5)

1. Forward Process: Gradually add Gaussian noise to a real image over T timesteps until pure noise (sketched below)
2. Reverse Process: Train a U-Net to predict and remove noise at each step (denoising)
3. Conditioning: Add text embeddings (via CLIP) to guide image generation toward the prompt
4. Sampling: Start from random noise, run T denoising steps to generate the final image
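
The forward process has a closed form, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, where ᾱ_t is the cumulative product of (1 − β_t), so any timestep can be sampled directly. A PyTorch sketch with the commonly cited DDPM linear schedule (values illustrative):

  import torch

  T = 1000
  betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
  alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # cumulative product: alpha_bar_t

  def q_sample(x0, t, noise):
      """Sample x_t directly from x_0 at timestep t."""
      ab = alpha_bars[t]
      return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

  x0 = torch.randn(1, 3, 64, 64)  # stand-in for a real training image
  noise = torch.randn_like(x0)
  x_mid = q_sample(x0, t=500, noise=noise)  # halfway to pure noise
  # Training: the U-Net receives (x_t, t) and is trained to predict `noise` (MSE loss)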

DDPM — Denoising Diffusion Probabilistic Models

The foundational diffusion model training objective. Trains a U-Net to predict the noise added at each timestep. At inference, starts from pure Gaussian noise and iteratively denoises over hundreds of steps.

Classifier-Free Guidance (CFG)

Technique to strengthen text conditioning. Trains the model both with and without text conditioning. At inference, combines conditional and unconditional predictions: output = uncond + scale × (cond − uncond). Higher scale = stronger adherence to prompt.
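
The guidance formula is a one-liner inside the sampling loop; a minimal sketch:

  import torch

  def cfg_combine(eps_uncond, eps_cond, guidance_scale=7.5):
      # output = uncond + scale * (cond - uncond); scale = 1 recovers the pure conditional
      return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

  eps_u, eps_c = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
  eps = cfg_combine(eps_u, eps_c, guidance_scale=7.5)  # guided noise prediction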

Latent Diffusion Models (Stable Diffusion)

Run the diffusion process in a compressed latent space (via a VAE) rather than pixel space. Dramatically more efficient — same quality at fraction of compute cost. Standard architecture for modern text-to-image models.
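
A usage sketch with Hugging Face's diffusers library; the model ID and parameter values are illustrative, not exam-mandated:

  import torch
  from diffusers import StableDiffusionPipeline

  pipe = StableDiffusionPipeline.from_pretrained(
      "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

  image = pipe(
      "a photo of an astronaut riding a horse",
      num_inference_steps=50,  # more steps: higher quality, up to a point
      guidance_scale=7.5,      # CFG scale: prompt adherence vs. diversity
  ).images[0]
  image.save("sample.png")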

Evaluating Diffusion Model Quality

FID score (lower = better quality vs. real images), CLIP score (measures image-text alignment), human evaluation. Experiment with guidance scale, number of inference steps, and prompt phrasing to optimize outputs.

Experimentation with Generative Models (3.1)

Experiment Variable    What Changes                             How to Evaluate
Prompt engineering     Text prompt wording, style modifiers     CLIP score, human preference rating
Guidance scale (CFG)   Strength of text conditioning (1–20)     Trade-off: high scale = prompt adherence but lower diversity
Inference steps        Number of denoising steps (20–1000)      FID vs. speed trade-off; more steps = higher quality up to a point
Model architecture     U-Net size, attention resolution         FID score at same compute budget
Fine-tuning method     Full fine-tune vs. LoRA vs. DreamBooth   Style adaptation quality, identity preservation

Practice Quiz — Experimentation

10 NCA-GENM–style questions on model testing, data preprocessing, explainability, NVIDIA Riva, and diffusion models.


Memory Hooks

Mnemonics to lock in the Experimentation domain's key concepts.

Evaluation Metrics by Modality
FID / BLEU / WER / F1
FID = image generation quality (lower = better). BLEU = text generation n-gram overlap. WER = ASR word error rate (lower = better). F1 = classification with class imbalance. Match metric to task type.

A/B Testing Rule
One Variable. Same Data. Statistical Significance.
Always change one variable at a time. Always evaluate on the same held-out test set. Always check statistical significance before declaring a winner. Never compare models tested on different datasets.

Grad-CAM vs. Attention Maps
Grad-CAM = CNNs. Attention = Transformers.
Grad-CAM uses gradients in the last convolutional layer to produce a class-specific heatmap — for CNNs. Attention maps are built into transformers — show which tokens/patches the model attends to. Both help explain model decisions.

NVIDIA Riva Pipeline
Speech → ASR → NLP → TTS → Speech
Riva handles the audio ends: ASR (speech→text) and TTS (text→speech). NeMo handles the NLP/LLM middle. The full conversational AI pipeline: hear → understand → respond → speak. Deployed on Kubernetes with Helm charts.

Diffusion Model Flow
Add Noise → Learn to Denoise → Generate
Forward: add Gaussian noise over T steps. Training: U-Net learns to predict/remove that noise. Inference: start from random noise, run T denoising steps. Text conditioning via CLIP embeddings guides the output image.

Data Quality Checklist
Align · Complete · Consistent · Balanced
Aligned = paired modalities match (correct audio+video). Complete = no missing modalities. Consistent = labels agree across annotators. Balanced = class distribution is acceptable. Run all four checks before training.

Flashcards

What metric measures the quality of generated images by comparing statistical distributions?
FID — Fréchet Inception Distance. Measures the distance between the distribution of real images and generated images using CNN features. Lower FID = higher quality generation. Standard for evaluating GANs and diffusion models.

What does NVIDIA Riva provide, and what are its two core components?
Riva is NVIDIA's conversational AI SDK. Two core components: ASR (Automatic Speech Recognition — speech to text) and TTS (Text-to-Speech — text to audio). Deployed on Kubernetes with Helm charts. Integrates with NeMo for the NLP middle layer.

What is the key requirement for a valid A/B test between two AI model variants?
Change ONE variable at a time, evaluate BOTH variants on the SAME held-out test set, and verify STATISTICAL SIGNIFICANCE of any performance difference. Without these controls, you cannot attribute the difference to the change you made.

How does Grad-CAM explain CNN predictions?
Grad-CAM uses gradients flowing into the final convolutional layer to produce a class-specific heatmap over the input image. Brighter regions = more influential for the predicted class. Helps identify if the model is attending to the right regions or spurious features.

What is Word Error Rate (WER) and what does a lower value mean?
WER = (Substitutions + Deletions + Insertions) ÷ Total Reference Words. The primary metric for ASR (speech recognition) quality. Lower WER = better transcription accuracy. A WER of 0 = perfect transcription.

In diffusion model inference, what is classifier-free guidance (CFG) and what does a higher scale value do?
CFG combines conditional (with prompt) and unconditional predictions: output = uncond + scale × (cond − uncond). Higher CFG scale = stronger adherence to the text prompt but lower image diversity. Typical values: 7–12 for image generation.

What is data augmentation and why is it important?
Data augmentation artificially expands the training dataset by applying transformations (flipping, cropping, noise, color jitter) that preserve labels. It reduces overfitting and improves model generalization to unseen real-world variations without requiring new labeled data.

What four data quality checks should be run before multimodal model training?
1. Alignment — paired modalities correctly matched. 2. Completeness — no missing modalities. 3. Consistency — annotation agreement across annotators. 4. Balance — acceptable class distribution. Missing any check risks training a biased or broken model.

Exam Advisor


Are You Ready for the Experimentation Domain?

You're ready if you can… Name the right evaluation metric for each modality: FID (images), BLEU (text), WER (speech), F1 (classification). Describe the three requirements for a valid A/B test.

You're ready if you can… Explain Grad-CAM for CNNs and attention maps for transformers. Describe NVIDIA Riva's two core components (ASR and TTS) and the full conversational pipeline.

Study more if you can't… Walk through the 4-step diffusion model inference process. List 4 data augmentation techniques for images. Identify the 4 data quality checks for multimodal datasets.

Common Experimentation Exam Traps

FID vs. BLEU Confusion

FID = images (lower is better). BLEU = text (measures n-gram overlap). WER = speech (lower is better). The exam may present a scenario and ask which metric to use — match metric to modality and task.

A/B Test Validity

Changing more than one variable at a time invalidates an A/B test. Evaluating on different datasets invalidates comparison. Statistical significance is required before declaring results conclusive.

Augmentation Timing

Augmentation is applied during training only — NOT at inference. Applying augmentation to the test set would give artificially varied test conditions. This is a common conceptual error.

Grad-CAM ≠ Attention Maps

Grad-CAM is for CNNs (uses convolutional layer gradients). Attention maps are native to transformers (from attention weight matrices). They're different techniques for different architectures.

Highest-Priority Topics for This Domain

Evaluation Metrics

FID / BLEU / WER / F1 / IoU / ROUGE — know which metric applies to which task and modality. This is directly testable and appears in multiple question formats.

NVIDIA Riva

ASR (speech→text, WER metric) and TTS (text→speech, FastPitch+HiFi-GAN). Full pipeline: ASR → NLP → TTS. Deployed on Kubernetes. Explicitly listed in Experimentation recommended training.

Diffusion Model Concepts

Forward process (add noise) → train U-Net to denoise → inference (start from noise, denoise T steps, guided by CLIP embeddings). Classifier-free guidance controls prompt adherence.

Data Augmentation

Know image techniques (flip, crop, jitter, noise) and audio techniques (SpecAugment, time stretch, pitch shift). Primary purpose: reduce overfitting and improve generalization.

NVIDIA Riva — What to Know for the Exam

ASR Components

Acoustic model (audio features → phonemes) + language model (phonemes → words/sentences). Customizable with domain vocabulary. Supports streaming (real-time) and batch modes.

TTS Components

FastPitch generates mel spectrograms from text. HiFi-GAN (vocoder) converts spectrograms to audio waveforms. Result: natural-sounding speech from any text input.

Deployment

Riva runs on NVIDIA GPUs. Production deployment uses Kubernetes + Helm charts for scaling. Triton Inference Server handles the model serving backend.

NeMo vs. Riva

NeMo = framework for training/fine-tuning ASR, TTS, and NLP models. Riva = optimized runtime for deploying those models in production. NeMo trains it; Riva serves it.

Last-Minute Review — Experimentation

Metrics Shortlist

FID (image gen), BLEU (text), WER (speech/ASR), F1 (classification, imbalanced), IoU (segmentation), ROUGE (summarization).

A/B Test = 3 Rules

One variable. Same data. Statistical significance.

Riva = ASR + TTS

Speech→text (ASR, WER metric) + text→speech (TTS, FastPitch+HiFi-GAN). NeMo trains models, Riva deploys them on Kubernetes.

Diffusion = Noise→Denoise→Generate

U-Net denoiser, T timesteps, CLIP text conditioning, CFG scale controls prompt strength.
