Domains 5 & 7: Performance Optimization & Trustworthy AI
This final page combines two domains: Domain 5 (Performance Optimization, 10%) and Domain 7 (Trustworthy AI, 5%). Together they account for 15% of the exam, covering how to make models faster and cheaper and how to develop and deploy AI responsibly.
Domain 5: Performance Optimization (10%)
- Quantization: reduce numerical precision (FP32→FP16→INT8) for smaller, faster models
- Pruning: remove redundant weights to reduce model size and FLOPS
- Mixed precision training: FP16 for speed, FP32 for stability
- Hyperparameter tuning: learning rate, batch size, optimizer, schedule
- Inference optimization: TensorRT, batching, model distillation
Domain 7: Trustworthy AI (5%)
- AI ethics principles: fairness, accountability, transparency, safety
- Data privacy: informed consent, differential privacy, federated learning
- Bias types and mitigation strategies
- NVIDIA tools: NeMo Guardrails for LLM safety
- Model cards and responsible AI documentation
Combined Domain Subtopics
| Domain | Subtopic | Key Concepts | Weight |
|---|---|---|---|
| 5 | 5.1 — Quantization | FP32→FP16→INT8, PTQ vs QAT, TensorRT | ⭐⭐⭐ |
| 5 | 5.2 — Pruning | Structured vs unstructured, magnitude pruning | ⭐⭐⭐ |
| 5 | 5.3 — Mixed Precision | FP16 forward/backward, FP32 updates, loss scaling | ⭐⭐⭐ |
| 5 | 5.4 — Hyperparameter Tuning | Learning rate schedule, grid/random/Bayesian search | ⭐⭐ |
| 7 | 7.1 — Ethical Principles | Fairness, accountability, transparency, safety | ⭐⭐⭐ |
| 7 | 7.2 — Data Privacy | Differential privacy, federated learning, consent | ⭐⭐⭐ |
| 7 | 7.3 — NVIDIA Tools | NeMo Guardrails, model cards | ⭐⭐ |
| 7 | 7.4 — Bias Mitigation | Bias types, resampling, fairness constraints | ⭐⭐ |
Model Optimization Techniques
Performance optimization reduces model size, memory, and latency without unacceptably degrading accuracy. These techniques are essential for deploying AI at scale on NVIDIA hardware.
Quantization: Reducing Numerical Precision
- FP32 (32-bit float): standard training precision; 4 bytes per value; high accuracy baseline
- FP16 (16-bit float): 2 bytes per value; 2× memory savings; slight accuracy trade-off; Tensor Core accelerated
- INT8 (8-bit integer): 1 byte per value; 4× memory savings vs FP32; fastest inference; requires calibration to minimize accuracy loss
- BF16 (Brain Float 16): same exponent range as FP32 but fewer mantissa bits; better training stability than FP16; preferred for LLMs
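To make the precision levels concrete, here is a minimal sketch of symmetric INT8 quantization of a single tensor (illustrative only; production toolkits such as TensorRT choose scales via calibration):

```python
import torch

def quantize_int8(x: torch.Tensor):
    # Symmetric quantization: map [-max|x|, +max|x|] onto [-127, 127].
    scale = x.abs().max() / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an FP32 approximation of the original tensor.
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)            # FP32: 4 bytes per value
q, scale = quantize_int8(w)      # INT8: 1 byte per value (plus one scale)
print((w - dequantize(q, scale)).abs().max())  # small rounding error
```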
PTQ vs QAT
- Post-Training Quantization (PTQ): quantize a fully trained FP32 model; fast, no retraining; uses calibration dataset to determine optimal quantization ranges; slight accuracy drop
- Quantization-Aware Training (QAT): simulate quantization noise during training itself; model learns to be robust to precision reduction; better accuracy but requires retraining
- TensorRT: automates INT8/FP16 quantization and calibration for NVIDIA GPUs; produces an optimized engine file
- Key benefit: INT8 inference can be 4× faster than FP32 with minimal accuracy loss after calibration
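As one concrete PTQ flow, PyTorch can quantize the Linear layers of a trained model in a single call (a minimal sketch; TensorRT's calibration-based INT8 flow is the NVIDIA-native equivalent):

```python
import torch
import torch.nn as nn

# Stand-in for a fully trained FP32 model.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
model.eval()

# Post-training dynamic quantization: weights stored as INT8,
# activations quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 10])
```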
Unstructured Pruning
- Remove individual weight values (set to zero) anywhere in the network
- Creates a sparse weight matrix — most values are zero
- High compression ratios possible (80–90% sparsity)
- Challenge: sparse matrix operations require special hardware support to actually speed up inference
- NVIDIA A100/H100 Tensor Cores support 2:4 structured sparsity (2 non-zeros per 4-weight block)
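A minimal sketch of magnitude-based unstructured pruning with PyTorch's built-in pruning utilities (note that the zeroed weights alone do not speed up inference without sparse-capable hardware or kernels):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)

# Zero out the 80% of weights with the smallest absolute values.
prune.l1_unstructured(layer, name="weight", amount=0.8)
print(f"sparsity: {(layer.weight == 0).float().mean().item():.0%}")  # ~80%

# Make the pruning permanent (drops the mask bookkeeping).
prune.remove(layer, "weight")
```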
Structured Pruning
- Remove entire filters, channels, attention heads, or layers
- Results in a smaller, denser model — no special sparse hardware required
- Directly reduces model FLOPS and parameter count
- L1 magnitude pruning: rank channels by sum of absolute weight values; remove smallest
- After pruning: fine-tune on original task to recover lost accuracy
- Hardware-friendly — compatible with all standard inference frameworks
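L1 magnitude pruning of whole channels uses the same module (a sketch; a real pipeline would fine-tune afterwards to recover accuracy):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(64, 128, kernel_size=3)

# Zero the 25% of output channels (dim=0) with the smallest L1 norm;
# exporting without those channels yields a smaller, dense model.
prune.ln_structured(conv, name="weight", amount=0.25, n=1, dim=0)
```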
How Mixed Precision Works
- FP16 for forward and backward pass: leverages Tensor Core acceleration for matrix operations; 2× less memory; faster computation
- FP32 master weights: weight updates are accumulated in FP32 to prevent precision loss; ensures stable convergence
- Loss scaling: multiply loss by a large scale factor (e.g. 128) before backward pass; prevents FP16 underflow of small gradients; scale back after gradient computation
- PyTorch implementation: the `torch.cuda.amp.autocast()` context manager plus `torch.cuda.amp.GradScaler`, as in the sketch below
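A minimal mixed-precision training loop (toy model and random data as stand-ins):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(512, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
loss_fn = nn.CrossEntropyLoss()

for _ in range(10):
    x = torch.randn(32, 512, device=device)
    y = torch.randint(0, 10, (32,), device=device)
    optimizer.zero_grad(set_to_none=True)

    # Forward/backward run in FP16 where safe, FP32 elsewhere.
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = loss_fn(model(x), y)

    scaler.scale(loss).backward()  # scale loss to avoid FP16 underflow
    scaler.step(optimizer)         # unscales grads, FP32 weight update
    scaler.update()                # adjusts the scale factor dynamically
```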
Mixed Precision Benefits & Trade-offs
- Memory reduction: activations stored in FP16 — up to 2× less GPU memory → larger batch sizes
- Speed: NVIDIA Tensor Cores multiply FP16 matrices and accumulate in FP32 — major throughput gains
- Accuracy: near-identical to FP32 training when loss scaling is applied correctly
- BF16 alternative: no loss scaling needed; better numerical stability; supported on A100+
- When to avoid: some models (e.g. with very small gradient magnitudes) may still diverge in FP16
Distillation Overview
- Train a small student model to mimic a large teacher model's output distributions (soft labels)
- Student learns from teacher's softmax probabilities (temperature-scaled) — richer signal than hard labels
- Result: compact model that approximates teacher performance
- Temperature T > 1 softens probability distribution, revealing class relationships the student can learn
- Used in DistilBERT (40% smaller, 60% faster, retains ~97% of BERT's performance) and in compact vision models such as the MobileNet family; see the loss sketch below
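A sketch of the standard distillation loss, blending temperature-softened KL divergence with hard-label cross-entropy (the T² factor and the α blend follow Hinton et al.'s formulation; hyperparameter values here are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 4.0, alpha: float = 0.5):
    # Soft targets: KL divergence between temperature-softened
    # distributions; T**2 keeps gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against ground truth.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```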
Optimization Technique Comparison
- Quantization: reduce numerical precision; no structural change; 2–4× speed/memory gain
- Pruning: remove weights/channels; structural or sparse; reduces parameter count
- Mixed precision: FP16 computation + FP32 updates; 2× memory; keeps accuracy
- Distillation: train smaller model using teacher; requires training; most flexible compression
- Batching (Triton): combine requests; maximizes GPU utilization without model change
Hyperparameter Tuning
Hyperparameters control the training process and model architecture — they are set before training begins and are not learned from data. Efficient tuning is essential for achieving high model performance.
Learning Rate & Schedules
- Learning rate (LR): most important hyperparameter; controls step size for gradient descent
- Too high: oscillates or diverges (loss spikes); too low: extremely slow convergence
- Warmup + cosine decay: start with very low LR, linearly increase to peak LR, then cosine-decay to near zero; standard for LLMs and transformers (see the sketch after this list)
- Step decay: reduce LR by factor (e.g. 0.1) at fixed epoch milestones; used in CNNs
- Reduce on plateau: reduce LR when validation metric stops improving
- Cyclical LR: oscillate between min and max LR; can escape local minima
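A sketch of linear warmup followed by cosine decay via LambdaLR (step counts and the peak LR are illustrative):

```python
import math
import torch

def warmup_cosine(step, warmup_steps=500, total_steps=10_000):
    if step < warmup_steps:                      # linear warmup: 0 -> peak
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))  # peak -> ~0

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # peak LR
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)
# Per training step: optimizer.step(), then scheduler.step().
```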
Batch Size & Optimizers
- Batch size: larger = more stable gradient estimates but more memory; smaller = noisier gradients (regularization effect)
- Gradient accumulation: simulate a large batch by accumulating gradients over multiple small batches before updating (sketch after this list)
- Adam: adaptive per-parameter LR; most common; fast convergence; may generalize worse than SGD
- AdamW: Adam + decoupled weight decay; standard for LLMs and transformers
- SGD with momentum: simple; often better generalization on vision tasks; slower to tune
- Weight decay (L2): regularization; penalizes large weights; reduces overfitting
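Gradient accumulation in a few lines (a sketch; here 4 micro-batches of 8 give an effective batch size of 32):

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 4  # effective batch = micro-batch size * accum_steps

optimizer.zero_grad(set_to_none=True)
for step in range(100):
    x = torch.randn(8, 512)
    y = torch.randint(0, 10, (8,))
    loss = loss_fn(model(x), y) / accum_steps  # average across micro-batches
    loss.backward()                            # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```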
Grid Search
- Exhaustively try all combinations of hyperparameter values
- Guaranteed to find optimum within specified grid
- Computation grows exponentially with number of hyperparameters
- Only practical for ≤2–3 hyperparameters with small grids
Random Search
- Sample random combinations from hyperparameter ranges
- Better coverage than grid search for same compute budget
- Effective when only a few hyperparameters actually matter
- Embarrassingly parallel — easy to scale
Bayesian Optimization
- Use results from previous trials to build a probabilistic surrogate model (Gaussian Process)
- Acquisition function selects next trial that balances exploration vs exploitation
- Most sample-efficient — finds good configurations with fewest trials
- Tools: Optuna, Weights & Biases Sweeps, Ray Tune, NeMo hyperparameter search
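A minimal Optuna sketch (the toy objective stands in for a real train-and-validate run; Optuna's default TPE sampler plays the role of the probabilistic surrogate):

```python
import math
import optuna

def objective(trial):
    # Define the search space; in practice, return a real validation loss.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64, 128])
    # Toy objective with a known optimum near lr = 1e-3.
    return (math.log10(lr) + 3) ** 2 + 0.01 * math.log2(batch_size)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```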
Hyperparameter Search Comparison
| Strategy | Efficiency | Best For | Parallelizable |
|---|---|---|---|
| Grid Search | Low (exhaustive) | ≤2–3 hyperparams, small ranges | Yes |
| Random Search | Medium | High-dim spaces, quick baseline | Yes (very easy) |
| Bayesian Optimization | High (sample-efficient) | Expensive evaluations, limited budget | Partial (sequential by design) |
| Population-Based (PBT) | High | Adaptive schedule tuning during training | Yes (parallel population) |
Trustworthy AI
Domain 7 covers the ethical, privacy, and safety principles that govern responsible AI development. These concepts are increasingly central to enterprise AI deployment and regulatory compliance.
Core Principles
- Fairness: AI should not discriminate based on protected attributes (race, gender, age, disability); equal performance across demographic groups
- Accountability: humans remain responsible for AI system decisions; audit trails document decision pathways
- Transparency: model decisions should be explainable (XAI); users should understand how decisions are made
- Safety: systems should not cause harm; especially critical in high-stakes domains (medical, legal, autonomous vehicles)
- Reliability: consistent, predictable performance across contexts and user groups
- Privacy: minimize collection and retention of personal data; protect individual information
Fairness Metrics
- Demographic parity: prediction rates should be equal across groups (equal positive prediction rates)
- Equal opportunity: true positive rates (recall) should be equal across groups
- Disparate impact ratio: ratio of positive outcome rates between groups; values < 0.8 fail the four-fifths rule and often indicate unfair treatment
- Calibration: predicted probabilities should match actual outcomes equally for all groups
- Trade-off: some fairness criteria are mathematically incompatible — must choose based on context
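A sketch of demographic parity and the disparate impact ratio on toy binary predictions (groups and data are illustrative):

```python
import numpy as np

y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])  # binary predictions
group  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # protected attribute

rate_a = y_pred[group == 0].mean()  # positive prediction rate, group A
rate_b = y_pred[group == 1].mean()  # positive prediction rate, group B

parity_gap = abs(rate_a - rate_b)                    # 0 = perfect parity
disparate_impact = min(rate_a, rate_b) / max(rate_a, rate_b)

print(f"gap={parity_gap:.2f}, DI ratio={disparate_impact:.2f}")
print("fails four-fifths rule" if disparate_impact < 0.8 else "passes")
```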
Privacy Principles
- Informed consent: users must understand and agree to how their data will be collected and used before any collection takes place
- Data minimization: collect only what is strictly necessary for the stated purpose
- Purpose limitation: data collected for one purpose should not be used for another
- Right to deletion (GDPR Art. 17): users can request erasure of their personal data
- Anonymization: irreversibly remove identifying information
- Pseudonymization: replace identifiers with pseudonyms; reversible by anyone holding the key
Privacy-Preserving ML Techniques
- Differential privacy: add carefully calibrated noise to training data, gradients, or model outputs so that no individual record can be identified; the privacy budget (ε) controls the trade-off between privacy and utility (see the sketch after this list)
- Federated learning: model trains on data located on user devices; only model gradients (not raw data) are sent to the central server; protects raw data privacy
- Secure aggregation: combine gradients from federated clients without the server seeing individual contributions
- Synthetic data generation: use GANs or diffusion models to create privacy-safe training data
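A sketch of the Laplace mechanism on a simple counting query (sensitivity 1, so noise scales as 1/ε; lower ε means stronger privacy and noisier answers):

```python
import numpy as np

def private_count(data, predicate, epsilon=1.0, sensitivity=1.0):
    # Laplace mechanism: true count plus noise of scale sensitivity/epsilon.
    true_count = sum(1 for x in data if predicate(x))
    return true_count + np.random.laplace(0.0, sensitivity / epsilon)

ages = [23, 35, 41, 29, 52, 44, 31]
print(private_count(ages, lambda a: a > 30, epsilon=0.5))  # noisier
print(private_count(ages, lambda a: a > 30, epsilon=5.0))  # closer to 5
```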
Types of Bias
- Historical bias: training data reflects past societal inequities (e.g. hiring data skewed by historical discrimination)
- Representation bias: training data underrepresents certain groups (e.g. face recognition worse for darker skin tones)
- Measurement bias: different accuracy or error rates across groups due to inconsistent data collection
- Aggregation bias: treating heterogeneous subgroups as one homogeneous population
Bias Mitigation Strategies
- Pre-processing: resampling (oversample minority), reweighting (see the sketch after this list), diverse data collection
- In-processing: fairness constraints during training (adversarial debiasing); regularization terms penalizing disparate impact
- Post-processing: calibrate decision thresholds per group; reject-option classification
- Evaluation: measure performance disaggregated by demographic group; report disparate impact ratios
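As one pre-processing example, a sketch of inverse-frequency reweighting so each group contributes equal total weight to the loss (group labels are illustrative):

```python
import numpy as np

group = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # group 1 underrepresented

# Per-sample weight inversely proportional to group frequency,
# normalized so each group carries equal total weight.
counts = np.bincount(group)
weights = (len(group) / (len(counts) * counts))[group]

print(weights)  # group 0 samples: 0.625; group 1 samples: 2.5
# Pass these as per-sample weights to the training loss
# (e.g. sample_weight in many scikit-learn estimators' fit()).
```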
NVIDIA Trustworthy AI Tools
- NeMo Guardrails: define rules for what an LLM application can/cannot say; blocks off-topic, harmful, or jailbreak responses; programmable YAML/Colang policies
- Model cards: standardized documentation covering model purpose, training data, limitations, evaluation metrics, demographic performance breakdowns
- NVIDIA AI Safety: alignment research, red-teaming, safety testing frameworks
- Bias detection: evaluate model on fairness benchmarks before deployment
Quick Review: Model Optimization Techniques
- Quantization: FP32 → FP16 → INT8; each step reduces bits and memory; INT8 = 4× smaller than FP32
- PTQ: quantize after training (fast, uses calibration dataset); QAT: simulate during training (better accuracy)
- TensorRT: automates INT8/FP16 quantization, layer fusion, kernel autotuning for NVIDIA GPUs
- Structured pruning: remove whole filters/channels; hardware-friendly; directly reduces FLOPS
- Unstructured pruning: remove individual weights; creates sparse matrix; needs sparse hardware to accelerate
- Mixed precision: FP16 for forward/backward (fast); FP32 for weight updates (stable); loss scaling for FP16 underflow
- Knowledge distillation: train small student model on teacher's soft labels; compact model approximates teacher quality