FlashGenius
NCA-GENM Exam Prep · Domains 5 & 7

Performance Optimization & Trustworthy AI

Quantization · Pruning · Mixed Precision · Fairness · Privacy · NeMo Guardrails

15% of the NCA-GENM Exam — Domains 5 (10%) & 7 (5%)


Domains 5 & 7: Performance Optimization & Trustworthy AI

This final page combines two domains: Domain 5 (Performance Optimization, 10%) and Domain 7 (Trustworthy AI, 5%). Together they make up 15% of the exam — covering how to make models faster, cheaper, and ethically responsible.

Domain 5: Performance Optimization (10%)

  • Quantization: reduce numerical precision (FP32→FP16→INT8) for smaller, faster models
  • Pruning: remove redundant weights to reduce model size and FLOPS
  • Mixed precision training: FP16 for speed, FP32 for stability
  • Hyperparameter tuning: learning rate, batch size, optimizer, schedule
  • Inference optimization: TensorRT, batching, model distillation

Domain 7: Trustworthy AI (5%)

  • AI ethics principles: fairness, accountability, transparency, safety
  • Data privacy: informed consent, differential privacy, federated learning
  • Bias types and mitigation strategies
  • NVIDIA tools: NeMo Guardrails for LLM safety
  • Model cards and responsible AI documentation

Combined Domain Subtopics

Domain | Subtopic | Key Concepts | Weight
5 | 5.1 — Quantization | FP32→FP16→INT8, PTQ vs QAT, TensorRT | ⭐⭐⭐
5 | 5.2 — Pruning | Structured vs unstructured, magnitude pruning | ⭐⭐⭐
5 | 5.3 — Mixed Precision | FP16 forward/backward, FP32 updates, loss scaling | ⭐⭐⭐
5 | 5.4 — Hyperparameter Tuning | Learning rate schedule, grid/random/Bayesian search | ⭐⭐
7 | 7.1 — Ethical Principles | Fairness, accountability, transparency, safety | ⭐⭐⭐
7 | 7.2 — Data Privacy | Differential privacy, federated learning, consent | ⭐⭐⭐
7 | 7.3 — NVIDIA Tools | NeMo Guardrails, model cards | ⭐⭐
7 | 7.4 — Bias Mitigation | Bias types, resampling, fairness constraints | ⭐⭐

Model Optimization Techniques

Performance optimization reduces model size, memory, and latency without unacceptably degrading accuracy. These techniques are essential for deploying AI at scale on NVIDIA hardware.

Quantization

Quantization: Reducing Numerical Precision

  • FP32 (32-bit float): standard training precision; 4 bytes per value; high accuracy baseline
  • FP16 (16-bit float): 2 bytes per value; 2× memory savings; slight accuracy trade-off; Tensor Core accelerated
  • INT8 (8-bit integer): 1 byte per value; 4× memory savings vs FP32; fastest inference; requires calibration to minimize accuracy loss
  • BF16 (Brain Float 16): same exponent range as FP32 but fewer mantissa bits; better training stability than FP16; preferred for LLMs

PTQ vs QAT

  • Post-Training Quantization (PTQ): quantize a fully trained FP32 model; fast, no retraining; uses calibration dataset to determine optimal quantization ranges; slight accuracy drop
  • Quantization-Aware Training (QAT): simulate quantization noise during training itself; model learns to be robust to precision reduction; better accuracy but requires retraining
  • TensorRT: automates INT8/FP16 quantization and calibration for NVIDIA GPUs; produces optimized engine file
  • Key benefit: INT8 inference can be 4× faster than FP32 with minimal accuracy loss after calibration
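The mapping that PTQ calibration computes per tensor can be sketched in a few lines. This is an illustrative affine (scale + zero-point) INT8 quantizer using simple min/max calibration; the function names are invented, and production tools like TensorRT use more sophisticated range selection (e.g. entropy calibration):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine INT8 quantization: map the observed float range onto [-128, 127]."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0                    # float step per integer level
    zero_point = -128 - int(round(x_min / scale))      # integer offset so x_min -> -128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(64).astype(np.float32)
q, scale, zp = quantize_int8(weights)
recovered = dequantize(q, scale, zp)
# INT8 storage is 4x smaller than FP32; per-value error is bounded by the step size
assert np.max(np.abs(recovered - weights)) <= scale
```

The calibration dataset's role in real PTQ is exactly to pick good `x_min`/`x_max` ranges so that `scale` stays small for the values the model actually sees.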
Pruning

Unstructured Pruning

  • Remove individual weight values (set to zero) anywhere in the network
  • Creates a sparse weight matrix — most values are zero
  • High compression ratios possible (80–90% sparsity)
  • Challenge: sparse matrix operations require special hardware support to actually speed up inference
  • NVIDIA A100/H100 Tensor Cores support 2:4 structured sparsity (2 non-zeros per 4-weight block)

Structured Pruning

  • Remove entire filters, channels, attention heads, or layers
  • Results in a smaller, denser model — no special sparse hardware required
  • Directly reduces model FLOPS and parameter count
  • L1 magnitude pruning: rank channels by sum of absolute weight values; remove smallest
  • After pruning: fine-tune on original task to recover lost accuracy
  • Hardware-friendly — compatible with all standard inference frameworks
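The L1 magnitude rule above can be sketched directly: score each output filter by the sum of its absolute weights and keep the strongest. This is a minimal illustration (the function name and `keep_ratio` parameter are assumptions, and a real pipeline would also slice the next layer's input channels and fine-tune afterwards):

```python
import numpy as np

def prune_channels_l1(weight: np.ndarray, keep_ratio: float) -> np.ndarray:
    """L1 structured pruning: keep the output filters with the largest sum of |w|.

    weight: conv kernel of shape (out_channels, in_channels, kH, kW).
    Returns a smaller *dense* kernel -- no sparse hardware support required.
    """
    scores = np.abs(weight).reshape(weight.shape[0], -1).sum(axis=1)  # L1 norm per filter
    n_keep = max(1, int(round(weight.shape[0] * keep_ratio)))
    keep = np.sort(np.argsort(scores)[-n_keep:])   # indices of the strongest filters
    return weight[keep]

w = np.random.randn(16, 3, 3, 3)
pruned = prune_channels_l1(w, keep_ratio=0.5)
print(pruned.shape)  # (8, 3, 3, 3): half the filters removed, still a dense tensor
```

Because the result is a plain smaller tensor, FLOPS and parameter count drop on any inference framework, which is the key contrast with unstructured sparsity.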
Mixed Precision Training

How Mixed Precision Works

  • FP16 for forward and backward pass: leverages Tensor Core acceleration for matrix operations; 2× less memory; faster computation
  • FP32 master weights: weight updates are accumulated in FP32 to prevent precision loss; ensures stable convergence
  • Loss scaling: multiply loss by a large scale factor (e.g. 128) before backward pass; prevents FP16 underflow of small gradients; scale back after gradient computation
  • PyTorch implementation: torch.cuda.amp.autocast() context manager + GradScaler

Mixed Precision Benefits & Trade-offs

  • Memory reduction: activations stored in FP16 — up to 2× less GPU memory → larger batch sizes
  • Speed: NVIDIA Tensor Cores multiply FP16 matrices and accumulate in FP32 — major throughput gains
  • Accuracy: near-identical to FP32 training when loss scaling is applied correctly
  • BF16 alternative: no loss scaling needed; better numerical stability; supported on A100+
  • When to avoid: some models (e.g. with very small gradient magnitudes) may still diverge in FP16
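The underflow problem that loss scaling solves can be demonstrated numerically. This sketch uses NumPy's `float16` to simulate what happens to a tiny gradient (the scale factor is illustrative; PyTorch's GradScaler picks and adjusts it dynamically):

```python
import numpy as np

# A gradient of 1e-8 is below FP16's smallest subnormal (~6e-8) and flushes to zero.
grad = 1e-8
assert np.float16(grad) == 0.0             # FP16 underflow: the update signal is lost

# Loss scaling: multiply the loss (hence all gradients) by a constant before backward...
scale = 1024.0
scaled = np.float16(grad * scale)          # 1.024e-5 IS representable in FP16
assert scaled > 0.0

# ...then divide by the same constant in FP32 before the weight update.
unscaled = np.float32(scaled) / scale
print(unscaled)                            # ~1e-8, the gradient survives
```

This is why the FP32 master copy matters too: the unscaled gradient is applied in FP32, where 1e-8 is representable.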
Knowledge Distillation

Distillation Overview

  • Train a small student model to mimic a large teacher model's output distributions (soft labels)
  • Student learns from teacher's softmax probabilities (temperature-scaled) — richer signal than hard labels
  • Result: compact model that approximates teacher performance
  • Temperature T > 1 softens probability distribution, revealing class relationships the student can learn
  • Used in DistilBERT (~40% fewer parameters, retains ~97% of BERT's performance) and the MobileNet family
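The effect of the temperature can be seen in a few lines. This sketch (logit values are made up for illustration) shows how T > 1 turns a near-one-hot teacher output into a softer distribution that exposes which wrong classes the teacher considers "close":

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; larger T flattens the distribution."""
    z = np.asarray(logits, dtype=np.float64) / T
    z -= z.max()                     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [8.0, 2.0, 1.0]             # teacher logits for 3 classes
hard = softmax(logits, T=1.0)        # ~[0.997, 0.002, 0.001] -- near one-hot
soft = softmax(logits, T=4.0)        # ~[0.72, 0.16, 0.12]    -- class similarities visible
print(np.round(hard, 3), np.round(soft, 3))
```

The student is trained against the soft distribution (typically with a KL-divergence loss at the same temperature), which carries far more signal per example than the hard label alone.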

Optimization Technique Comparison

  • Quantization: reduce numerical precision; no structural change; 2–4× speed/memory gain
  • Pruning: remove weights/channels; structural or sparse; reduces parameter count
  • Mixed precision: FP16 computation + FP32 updates; 2× memory; keeps accuracy
  • Distillation: train smaller model using teacher; requires training; most flexible compression
  • Batching (Triton): combine requests; maximizes GPU utilization without model change

Hyperparameter Tuning

Hyperparameters control the training process and model architecture — they are set before training begins and are not learned from data. Efficient tuning is essential for achieving high model performance.

Key Hyperparameters

Learning Rate & Schedules

  • Learning rate (LR): most important hyperparameter; controls step size for gradient descent
  • Too high: oscillates or diverges (loss spikes); too low: extremely slow convergence
  • Warmup + cosine decay: start with very low LR, linearly increase to peak LR, then cosine-decay to near zero; standard for LLMs and transformers
  • Step decay: reduce LR by factor (e.g. 0.1) at fixed epoch milestones; used in CNNs
  • Reduce on plateau: reduce LR when validation metric stops improving
  • Cyclical LR: oscillate between min and max LR; can escape local minima
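The warmup + cosine schedule described above reduces to a short function. This is a generic sketch (the peak LR, warmup length, and step counts are arbitrary illustration values, not recommendations):

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_steps=100, min_lr=0.0):
    """Linear warmup from ~0 to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps               # linear ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [lr_at_step(s, total_steps=1000) for s in range(1000)]
assert max(schedule) == schedule[99]    # peak is reached at the end of warmup
assert schedule[-1] < 1e-5              # decayed to near zero by the final step
```

The warmup phase protects the randomly-initialized (or freshly fine-tuned) model from large early updates; the cosine tail gives progressively finer steps as training converges.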

Batch Size & Optimizers

  • Batch size: larger = more stable gradient estimates but more memory; smaller = noisier gradients (regularization effect)
  • Gradient accumulation: simulate large batch by accumulating gradients over multiple small batches before updating
  • Adam: adaptive per-parameter LR; most common; fast convergence; may generalize worse than SGD
  • AdamW: Adam + decoupled weight decay; standard for LLMs and transformers
  • SGD with momentum: simple; often better generalization on vision tasks; slower to tune
  • Weight decay (L2): regularization; penalizes large weights; reduces overfitting
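Gradient accumulation is easy to verify numerically: summing appropriately scaled micro-batch gradients reproduces the large-batch gradient exactly. This sketch uses a toy quadratic loss (the batch sizes and loss are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(32, 10))       # "big batch" of 32, split into 4 micro-batches of 8
w = np.zeros(10)

def grad_fn(w, batch):
    """Gradient of the mean loss 0.5*||w - x||^2 over a batch."""
    return (w - batch).mean(axis=0)

big = grad_fn(w, data)                 # one big-batch gradient (needs 4x the memory)

accum = np.zeros_like(w)               # same gradient via accumulation:
for i in range(4):
    accum += grad_fn(w, data[i * 8:(i + 1) * 8]) / 4   # divide so the sum equals the mean
# Only after all micro-batches does the optimizer step run; then accum is reset to zero.
assert np.allclose(accum, big)         # identical effective gradient
```

In frameworks like PyTorch this corresponds to calling `backward()` on each micro-batch and stepping the optimizer only every N batches.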
Search Strategies

Grid Search

  • Exhaustively try all combinations of hyperparameter values
  • Guaranteed to find optimum within specified grid
  • Computation grows exponentially with number of hyperparameters
  • Only practical for ≤2–3 hyperparameters with small grids

Random Search

  • Sample random combinations from hyperparameter ranges
  • Better coverage than grid search for same compute budget
  • Effective when only a few hyperparameters actually matter
  • Embarrassingly parallel — easy to scale

Bayesian Optimization

  • Use results from previous trials to build a probabilistic surrogate model (Gaussian Process)
  • Acquisition function selects next trial that balances exploration vs exploitation
  • Most sample-efficient — finds good configurations with fewest trials
  • Tools: Optuna, Weights & Biases Sweeps, Ray Tune, NeMo hyperparameter search
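Random search in particular fits in a few lines. In this sketch a cheap analytic function stands in for validation loss (a real search would train and evaluate a model per trial); note the learning rate is sampled on a log scale, which is standard practice since its useful values span orders of magnitude:

```python
import random

def val_loss(lr, wd):
    """Toy stand-in for validation loss, minimized near lr=3e-4, wd=0.01."""
    return (lr - 3e-4) ** 2 * 1e6 + (wd - 0.01) ** 2 * 100

random.seed(0)
best = None
for _ in range(50):                        # fixed trial budget
    lr = 10 ** random.uniform(-5, -2)      # log-uniform sample over [1e-5, 1e-2]
    wd = random.uniform(0.0, 0.1)
    loss = val_loss(lr, wd)
    if best is None or loss < best[0]:
        best = (loss, lr, wd)
print(best)                                # best (loss, lr, wd) found within the budget
```

Bayesian optimization replaces the blind `random.uniform` draws with proposals from a surrogate model fitted to the `(config, loss)` pairs seen so far.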

Hyperparameter Search Comparison

Strategy | Efficiency | Best For | Parallelizable
Grid Search | Low (exhaustive) | ≤2–3 hyperparams, small ranges | Yes
Random Search | Medium | High-dim spaces, quick baseline | Yes (very easy)
Bayesian Optimization | High (sample-efficient) | Expensive evaluations, limited budget | Partial (sequential by design)
Population-Based Training (PBT) | High | Adaptive schedule tuning during training | Yes (parallel population)

Trustworthy AI

Domain 7 covers the ethical, privacy, and safety principles that govern responsible AI development. These concepts are increasingly central to enterprise AI deployment and regulatory compliance.

Ethical AI Principles

Core Principles

  • Fairness: AI should not discriminate based on protected attributes (race, gender, age, disability); equal performance across demographic groups
  • Accountability: humans remain responsible for AI system decisions; audit trails document decision pathways
  • Transparency: model decisions should be explainable (XAI); users should understand how decisions are made
  • Safety: systems should not cause harm; especially critical in high-stakes domains (medical, legal, autonomous vehicles)
  • Reliability: consistent, predictable performance across contexts and user groups
  • Privacy: minimize collection and retention of personal data; protect individual information

Fairness Metrics

  • Demographic parity: prediction rates should be equal across groups (equal positive prediction rates)
  • Equal opportunity: true positive rates (recall) should be equal across groups
  • Disparate impact ratio: ratio of positive outcome rates between groups; values <0.8 often indicate unfair treatment
  • Calibration: predicted probabilities should match actual outcomes equally for all groups
  • Trade-off: some fairness criteria are mathematically incompatible — must choose based on context
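These metrics are simple to compute from predictions once you have a group label. The sketch below (function name and the tiny example data are invented for illustration) reports demographic parity difference, disparate impact ratio, and equal opportunity difference for two groups:

```python
import numpy as np

def fairness_report(y_pred, y_true, group):
    """Basic group-fairness metrics for binary predictions and two groups (0 and 1)."""
    y_pred, y_true, group = map(np.asarray, (y_pred, y_true, group))
    rates = {g: y_pred[group == g].mean() for g in (0, 1)}                   # positive prediction rate
    tpr = {g: y_pred[(group == g) & (y_true == 1)].mean() for g in (0, 1)}   # recall per group
    return {
        "demographic_parity_diff": abs(rates[0] - rates[1]),
        "disparate_impact_ratio": min(rates.values()) / max(rates.values()),
        "equal_opportunity_diff": abs(tpr[0] - tpr[1]),
    }

y_pred = [1, 1, 0, 1, 0, 0, 1, 0]
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]
report = fairness_report(y_pred, y_true, group)
print(report)   # disparate impact ratio here is 1/3 -- well under the 0.8 rule of thumb
```

Reporting these numbers disaggregated by group is exactly the kind of evaluation a model card should include.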
Data Privacy

Privacy Principles

  • Informed consent: users must understand and agree to how their data will be collected and used before collection begins
  • Data minimization: collect only what is strictly necessary for the stated purpose
  • Purpose limitation: data collected for one purpose should not be used for another
  • Right to deletion (GDPR Art. 17): users can request erasure of their personal data
  • Anonymization: irreversibly remove identifying information; Pseudonymization: replace identifiers with pseudonyms — can be reversed with the key

Privacy-Preserving ML Techniques

  • Differential privacy: add carefully calibrated mathematical noise to training data or model outputs so that individual records cannot be identified; privacy budget (ε) controls the trade-off between privacy and utility
  • Federated learning: model trains on data located on user devices; only model gradients (not raw data) are sent to the central server; protects raw data privacy
  • Secure aggregation: combine gradients from federated clients without the server seeing individual contributions
  • Synthetic data generation: use GANs or diffusion models to create privacy-safe training data
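The differential privacy idea can be illustrated with the classic Laplace mechanism for a counting query. This is a minimal sketch (the function name and numbers are invented; real DP training uses mechanisms like DP-SGD with gradient clipping plus Gaussian noise):

```python
import numpy as np

def dp_count(true_count, epsilon, rng):
    """Laplace mechanism for a count query (sensitivity 1): add noise ~ Lap(1/epsilon)."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(42)
exact = 1000                                        # e.g. patients with some condition
loose = dp_count(exact, epsilon=1.0, rng=rng)       # modest privacy, small noise
tight = dp_count(exact, epsilon=0.01, rng=rng)      # strong privacy, ~100x more noise
print(loose, tight)
```

The scale of the noise is 1/ε, which makes the privacy/utility trade-off concrete: shrinking ε by 100× multiplies the expected noise by 100×.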
Bias in AI

Types of Bias

  • Historical bias: training data reflects past societal inequities (e.g. hiring data skewed by historical discrimination)
  • Representation bias: training data underrepresents certain groups (e.g. face recognition worse for darker skin tones)
  • Measurement bias: different accuracy or error rates across groups due to inconsistent data collection
  • Aggregation bias: treating heterogeneous subgroups as one homogeneous population

Bias Mitigation Strategies

  • Pre-processing: resampling (oversample minority), reweighting, diverse data collection
  • In-processing: fairness constraints during training (adversarial debiasing); regularization terms penalizing disparate impact
  • Post-processing: calibrate decision thresholds per group; reject-option classification
  • Evaluation: measure performance disaggregated by demographic group; report disparate impact ratios

NVIDIA Trustworthy AI Tools

  • NeMo Guardrails: define rules for what an LLM application can/cannot say; blocks off-topic, harmful, or jailbreak responses; programmable YAML/Colang policies
  • Model cards: standardized documentation covering model purpose, training data, limitations, evaluation metrics, demographic performance breakdowns
  • NVIDIA AI Safety: alignment research, red-teaming, safety testing frameworks
  • Bias detection: evaluate model on fairness benchmarks before deployment

Practice Quiz — Domains 5 & 7

10 questions covering performance optimization and trustworthy AI principles. Select the best answer for each question.

Memory Hooks

Lock in the key distinctions with these memorable phrases before exam day.

🗜️
Quantization Levels
"FP32 → FP16 → INT8: Shrink Precision, Shrink Size"
Each step halves the bits: 32→16→8. INT8 = 4× smaller than FP32. The trade-off: less precision, more speed. TensorRT automates this calibration for NVIDIA GPUs.
✂️
Structured vs Unstructured Pruning
"Structured = Hardware Friendly; Unstructured = Just Sparse"
Structured pruning removes whole channels/filters → smaller dense model → works on any hardware. Unstructured pruning removes individual weights → sparse matrix → needs special hardware to actually speed up.
⚖️
Mixed Precision
"FP16 to Fly, FP32 to Stay Stable"
FP16 for the fast computation (forward + backward passes); FP32 master weights to keep updates stable. Loss scaling prevents FP16 underflow. PyTorch: autocast() + GradScaler.
🔍
Bayesian Optimization
"Bayesian: Learn from Each Trial to Pick the Next"
Unlike grid/random which ignore past results, Bayesian optimization builds a model of the hyperparameter space from previous runs, then picks the most promising next trial. Most sample-efficient = fewest expensive evaluations needed.
🛡️
Differential Privacy
"Add Noise to Protect the Individual"
Differential privacy adds calibrated mathematical noise so that any single person's data can't be identified in the model's outputs or training. Privacy budget ε: smaller = more private, less utility.
🚦
NeMo Guardrails
"NeMo Guardrails = Traffic Lights for LLMs"
Just like traffic lights control which directions are allowed, NeMo Guardrails define what an LLM can and cannot say. It enforces topic restrictions, blocks jailbreaks, and adds behavioral safety policies programmatically.

Flashcards & Advisor


Quantization
What are the levels and what does each achieve?
FP32→FP16→INT8. INT8 = 4× smaller than FP32, fastest inference. PTQ: quantize after training (fast). QAT: simulate during training (more accurate). TensorRT automates calibration for NVIDIA GPUs.
Structured vs Unstructured Pruning
Key difference and hardware implications?
Structured: remove whole filters/channels → smaller dense model → hardware-friendly, works everywhere. Unstructured: remove individual weights → sparse matrix → needs sparse hardware support to accelerate (e.g. NVIDIA 2:4 sparsity).
Mixed Precision Training
Which precision for which operation, and what is loss scaling?
FP16: forward and backward passes (fast, Tensor Core accelerated). FP32: master weights and weight updates (stable). Loss scaling: multiply loss before backward pass to prevent FP16 gradient underflow; divide after.
Bayesian Optimization
How does it differ from grid and random search?
Uses results of previous trials to build a probabilistic model of the objective function, then picks the most promising next trial via an acquisition function. Most sample-efficient — ideal when each training run is expensive.
Differential Privacy
What does it do and what is the privacy budget?
Adds calibrated mathematical noise to training data or model outputs so individual records cannot be identified. Privacy budget ε: smaller ε = stronger privacy, larger utility loss. Enables ML on sensitive data while protecting individuals.
Federated Learning
How does it protect privacy?
Trains on data residing on user devices. Only model gradients (not raw data) are sent to the central server. Raw user data never leaves the device. Used in mobile keyboard prediction, healthcare applications.
NeMo Guardrails
What does it do and what problem does it solve?
NVIDIA SDK for adding programmable safety guardrails to LLM applications. Defines allowed/blocked topics and behaviors using Colang policies. Prevents jailbreaks, off-topic responses, and harmful outputs in production LLM deployments.
Warmup + Cosine Decay
Why is this schedule used for LLMs?
Start with very low LR, linearly increase to peak LR (warmup phase) to prevent early instability, then cosine-decay to near zero for smooth convergence. Standard for transformer and LLM training. Avoids early divergence and allows gradual fine-grained optimization.

Study Advisor

Model Optimization Techniques

  • Quantization: FP32 → FP16 → INT8; each step reduces bits and memory; INT8 = 4× smaller than FP32
  • PTQ: quantize after training (fast, uses calibration dataset); QAT: simulate during training (better accuracy)
  • TensorRT: automates INT8/FP16 quantization, layer fusion, kernel autotuning for NVIDIA GPUs
  • Structured pruning: remove whole filters/channels; hardware-friendly; directly reduces FLOPS
  • Unstructured pruning: remove individual weights; creates sparse matrix; needs sparse hardware to accelerate
  • Mixed precision: FP16 for forward/backward (fast); FP32 for weight updates (stable); loss scaling for FP16 underflow
  • Knowledge distillation: train small student model on teacher's soft labels; compact model approximates teacher quality

You've Covered All the NCA-GENM Domains!

Test yourself with full-length practice exams, timed simulations, and detailed explanations
