FlashGenius
NCA-GENM Exam Prep · Domains 5 & 7

Performance Optimization & Trustworthy AI

Quantization · Pruning · Mixed Precision · Fairness · Privacy · NeMo Guardrails

15% of the NCA-GENM Exam — Domains 5 (10%) & 7 (5%)


Domains 5 & 7: Performance Optimization & Trustworthy AI

This final page combines two domains: Domain 5 (Performance Optimization, 10%) and Domain 7 (Trustworthy AI, 5%). Together they make up 15% of the exam — covering how to make models faster, cheaper, and ethically responsible.

Domain 5: Performance Optimization (10%)

  • Quantization: reduce numerical precision (FP32→FP16→INT8) for smaller, faster models
  • Pruning: remove redundant weights to reduce model size and FLOPS
  • Mixed precision training: FP16 for speed, FP32 for stability
  • Hyperparameter tuning: learning rate, batch size, optimizer, schedule
  • Inference optimization: TensorRT, batching, model distillation

Domain 7: Trustworthy AI (5%)

  • AI ethics principles: fairness, accountability, transparency, safety
  • Data privacy: informed consent, differential privacy, federated learning
  • Bias types and mitigation strategies
  • NVIDIA tools: NeMo Guardrails for LLM safety
  • Model cards and responsible AI documentation

Combined Domain Subtopics

Domain | Subtopic | Key Concepts | Weight
5 | 5.1 — Quantization | FP32→FP16→INT8, PTQ vs QAT, TensorRT | ⭐⭐⭐
5 | 5.2 — Pruning | Structured vs unstructured, magnitude pruning | ⭐⭐⭐
5 | 5.3 — Mixed Precision | FP16 forward/backward, FP32 updates, loss scaling | ⭐⭐⭐
5 | 5.4 — Hyperparameter Tuning | Learning rate schedule, grid/random/Bayesian search | ⭐⭐
7 | 7.1 — Ethical Principles | Fairness, accountability, transparency, safety | ⭐⭐⭐
7 | 7.2 — Data Privacy | Differential privacy, federated learning, consent | ⭐⭐⭐
7 | 7.3 — NVIDIA Tools | NeMo Guardrails, model cards | ⭐⭐
7 | 7.4 — Bias Mitigation | Bias types, resampling, fairness constraints | ⭐⭐

Model Optimization Techniques

Performance optimization reduces model size, memory, and latency without unacceptably degrading accuracy. These techniques are essential for deploying AI at scale on NVIDIA hardware.

Quantization

Quantization: Reducing Numerical Precision

  • FP32 (32-bit float): standard training precision; 4 bytes per value; high accuracy baseline
  • FP16 (16-bit float): 2 bytes per value; 2× memory savings; slight accuracy trade-off; Tensor Core accelerated
  • INT8 (8-bit integer): 1 byte per value; 4× memory savings vs FP32; fastest inference; requires calibration to minimize accuracy loss
  • BF16 (Brain Float 16): same exponent range as FP32 but fewer mantissa bits; better training stability than FP16; preferred for LLMs

PTQ vs QAT

  • Post-Training Quantization (PTQ): quantize a fully trained FP32 model; fast, no retraining; uses calibration dataset to determine optimal quantization ranges; slight accuracy drop
  • Quantization-Aware Training (QAT): simulate quantization noise during training itself; model learns to be robust to precision reduction; better accuracy but requires retraining
  • TensorRT: automates INT8/FP16 quantization and calibration for NVIDIA GPUs; produces optimized engine file
  • Key benefit: INT8 inference can be 4× faster than FP32 with minimal accuracy loss after calibration
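The mapping that PTQ calibration computes per tensor can be sketched in a few lines. This is an illustrative affine (scale + zero-point) INT8 quantizer using simple min/max calibration; the function names are invented, and production tools like TensorRT use more sophisticated range selection (e.g. entropy calibration):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine INT8 quantization: map the observed float range onto [-128, 127]."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0                    # float step per integer level
    zero_point = -128 - int(round(x_min / scale))      # integer offset so x_min -> -128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(64).astype(np.float32)
q, scale, zp = quantize_int8(weights)
recovered = dequantize(q, scale, zp)
# INT8 storage is 4x smaller than FP32; per-value error is bounded by the step size
assert np.max(np.abs(recovered - weights)) <= scale
```

The calibration dataset's role in real PTQ is exactly to pick good `x_min`/`x_max` ranges so that `scale` stays small for the values the model actually sees.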
Pruning

Unstructured Pruning

  • Remove individual weight values (set to zero) anywhere in the network
  • Creates a sparse weight matrix — most values are zero
  • High compression ratios possible (80–90% sparsity)
  • Challenge: sparse matrix operations require special hardware support to actually speed up inference
  • NVIDIA A100/H100 Tensor Cores support 2:4 structured sparsity (2 non-zeros per 4-weight block)

Structured Pruning

  • Remove entire filters, channels, attention heads, or layers
  • Results in a smaller, denser model — no special sparse hardware required
  • Directly reduces model FLOPS and parameter count
  • L1 magnitude pruning: rank channels by sum of absolute weight values; remove smallest
  • After pruning: fine-tune on original task to recover lost accuracy
  • Hardware-friendly — compatible with all standard inference frameworks
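The L1 magnitude rule above can be sketched directly: score each output filter by the sum of its absolute weights and keep the strongest. This is a minimal illustration (the function name and `keep_ratio` parameter are assumptions, and a real pipeline would also slice the next layer's input channels and fine-tune afterwards):

```python
import numpy as np

def prune_channels_l1(weight: np.ndarray, keep_ratio: float) -> np.ndarray:
    """L1 structured pruning: keep the output filters with the largest sum of |w|.

    weight: conv kernel of shape (out_channels, in_channels, kH, kW).
    Returns a smaller *dense* kernel -- no sparse hardware support required.
    """
    scores = np.abs(weight).reshape(weight.shape[0], -1).sum(axis=1)  # L1 norm per filter
    n_keep = max(1, int(round(weight.shape[0] * keep_ratio)))
    keep = np.sort(np.argsort(scores)[-n_keep:])   # indices of the strongest filters
    return weight[keep]

w = np.random.randn(16, 3, 3, 3)
pruned = prune_channels_l1(w, keep_ratio=0.5)
print(pruned.shape)  # (8, 3, 3, 3): half the filters removed, still a dense tensor
```

Because the result is a plain smaller tensor, FLOPS and parameter count drop on any inference framework, which is the key contrast with unstructured sparsity.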
Mixed Precision Training

How Mixed Precision Works

  • FP16 for forward and backward pass: leverages Tensor Core acceleration for matrix operations; 2× less memory; faster computation
  • FP32 master weights: weight updates are accumulated in FP32 to prevent precision loss; ensures stable convergence
  • Loss scaling: multiply loss by a large scale factor (e.g. 128) before backward pass; prevents FP16 underflow of small gradients; scale back after gradient computation
  • PyTorch implementation: torch.cuda.amp.autocast() context manager + GradScaler

Mixed Precision Benefits & Trade-offs

  • Memory reduction: activations stored in FP16 — up to 2× less GPU memory → larger batch sizes
  • Speed: NVIDIA Tensor Cores multiply FP16 matrices and accumulate in FP32 — major throughput gains
  • Accuracy: near-identical to FP32 training when loss scaling is applied correctly
  • BF16 alternative: no loss scaling needed; better numerical stability; supported on A100+
  • When to avoid: some models (e.g. with very small gradient magnitudes) may still diverge in FP16
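The underflow problem that loss scaling solves can be demonstrated numerically. This sketch uses NumPy's `float16` to simulate what happens to a tiny gradient (the scale factor is illustrative; PyTorch's GradScaler picks and adjusts it dynamically):

```python
import numpy as np

# A gradient of 1e-8 is below FP16's smallest subnormal (~6e-8) and flushes to zero.
grad = 1e-8
assert np.float16(grad) == 0.0             # FP16 underflow: the update signal is lost

# Loss scaling: multiply the loss (hence all gradients) by a constant before backward...
scale = 1024.0
scaled = np.float16(grad * scale)          # 1.024e-5 IS representable in FP16
assert scaled > 0.0

# ...then divide by the same constant in FP32 before the weight update.
unscaled = np.float32(scaled) / scale
print(unscaled)                            # ~1e-8, the gradient survives
```

This is why the FP32 master copy matters too: the unscaled gradient is applied in FP32, where 1e-8 is representable.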
Knowledge Distillation

Distillation Overview

  • Train a small student model to mimic a large teacher model's output distributions (soft labels)
  • Student learns from teacher's softmax probabilities (temperature-scaled) — richer signal than hard labels
  • Result: compact model that approximates teacher performance
  • Temperature T > 1 softens probability distribution, revealing class relationships the student can learn
  • Used in DistilBERT (~40% fewer parameters, retains ~97% of BERT's performance) and the MobileNet family
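The effect of the temperature can be seen in a few lines. This sketch (logit values are made up for illustration) shows how T > 1 turns a near-one-hot teacher output into a softer distribution that exposes which wrong classes the teacher considers "close":

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; larger T flattens the distribution."""
    z = np.asarray(logits, dtype=np.float64) / T
    z -= z.max()                     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [8.0, 2.0, 1.0]             # teacher logits for 3 classes
hard = softmax(logits, T=1.0)        # ~[0.997, 0.002, 0.001] -- near one-hot
soft = softmax(logits, T=4.0)        # ~[0.72, 0.16, 0.12]    -- class similarities visible
print(np.round(hard, 3), np.round(soft, 3))
```

The student is trained against the soft distribution (typically with a KL-divergence loss at the same temperature), which carries far more signal per example than the hard label alone.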

Optimization Technique Comparison

  • Quantization: reduce numerical precision; no structural change; 2–4× speed/memory gain
  • Pruning: remove weights/channels; structural or sparse; reduces parameter count
  • Mixed precision: FP16 computation + FP32 updates; 2× memory; keeps accuracy
  • Distillation: train smaller model using teacher; requires training; most flexible compression
  • Batching (Triton): combine requests; maximizes GPU utilization without model change

Hyperparameter Tuning

Hyperparameters control the training process and model architecture — they are set before training begins and are not learned from data. Efficient tuning is essential for achieving high model performance.

Key Hyperparameters

Learning Rate & Schedules

  • Learning rate (LR): most important hyperparameter; controls step size for gradient descent
  • Too high: oscillates or diverges (loss spikes); too low: extremely slow convergence
  • Warmup + cosine decay: start with very low LR, linearly increase to peak LR, then cosine-decay to near zero; standard for LLMs and transformers
  • Step decay: reduce LR by factor (e.g. 0.1) at fixed epoch milestones; used in CNNs
  • Reduce on plateau: reduce LR when validation metric stops improving
  • Cyclical LR: oscillate between min and max LR; can escape local minima
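The warmup + cosine schedule described above reduces to a short function. This is a generic sketch (the peak LR, warmup length, and step counts are arbitrary illustration values, not recommendations):

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup_steps=100, min_lr=0.0):
    """Linear warmup from ~0 to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps               # linear ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [lr_at_step(s, total_steps=1000) for s in range(1000)]
assert max(schedule) == schedule[99]    # peak is reached at the end of warmup
assert schedule[-1] < 1e-5              # decayed to near zero by the final step
```

The warmup phase protects the randomly-initialized (or freshly fine-tuned) model from large early updates; the cosine tail gives progressively finer steps as training converges.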

Batch Size & Optimizers

  • Batch size: larger = more stable gradient estimates but more memory; smaller = noisier gradients (regularization effect)
  • Gradient accumulation: simulate large batch by accumulating gradients over multiple small batches before updating
  • Adam: adaptive per-parameter LR; most common; fast convergence; may generalize worse than SGD
  • AdamW: Adam + decoupled weight decay; standard for LLMs and transformers
  • SGD with momentum: simple; often better generalization on vision tasks; slower to tune
  • Weight decay (L2): regularization; penalizes large weights; reduces overfitting
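Gradient accumulation is easy to verify numerically: summing appropriately scaled micro-batch gradients reproduces the large-batch gradient exactly. This sketch uses a toy quadratic loss (the batch sizes and loss are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(32, 10))       # "big batch" of 32, split into 4 micro-batches of 8
w = np.zeros(10)

def grad_fn(w, batch):
    """Gradient of the mean loss 0.5*||w - x||^2 over a batch."""
    return (w - batch).mean(axis=0)

big = grad_fn(w, data)                 # one big-batch gradient (needs 4x the memory)

accum = np.zeros_like(w)               # same gradient via accumulation:
for i in range(4):
    accum += grad_fn(w, data[i * 8:(i + 1) * 8]) / 4   # divide so the sum equals the mean
# Only after all micro-batches does the optimizer step run; then accum is reset to zero.
assert np.allclose(accum, big)         # identical effective gradient
```

In frameworks like PyTorch this corresponds to calling `backward()` on each micro-batch and stepping the optimizer only every N batches.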
Search Strategies

Grid Search

  • Exhaustively try all combinations of hyperparameter values
  • Guaranteed to find optimum within specified grid
  • Computation grows exponentially with number of hyperparameters
  • Only practical for ≤2–3 hyperparameters with small grids

Random Search

  • Sample random combinations from hyperparameter ranges
  • Better coverage than grid search for same compute budget
  • Effective when only a few hyperparameters actually matter
  • Embarrassingly parallel — easy to scale

Bayesian Optimization

  • Use results from previous trials to build a probabilistic surrogate model (Gaussian Process)
  • Acquisition function selects next trial that balances exploration vs exploitation
  • Most sample-efficient — finds good configurations with fewest trials
  • Tools: Optuna, Weights & Biases Sweeps, Ray Tune, NeMo hyperparameter search
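Random search in particular fits in a few lines. In this sketch a cheap analytic function stands in for validation loss (a real search would train and evaluate a model per trial); note the learning rate is sampled on a log scale, which is standard practice since its useful values span orders of magnitude:

```python
import random

def val_loss(lr, wd):
    """Toy stand-in for validation loss, minimized near lr=3e-4, wd=0.01."""
    return (lr - 3e-4) ** 2 * 1e6 + (wd - 0.01) ** 2 * 100

random.seed(0)
best = None
for _ in range(50):                        # fixed trial budget
    lr = 10 ** random.uniform(-5, -2)      # log-uniform sample over [1e-5, 1e-2]
    wd = random.uniform(0.0, 0.1)
    loss = val_loss(lr, wd)
    if best is None or loss < best[0]:
        best = (loss, lr, wd)
print(best)                                # best (loss, lr, wd) found within the budget
```

Bayesian optimization replaces the blind `random.uniform` draws with proposals from a surrogate model fitted to the `(config, loss)` pairs seen so far.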

Hyperparameter Search Comparison

Strategy | Efficiency | Best For | Parallelizable
Grid Search | Low (exhaustive) | ≤2–3 hyperparams, small ranges | Yes
Random Search | Medium | High-dim spaces, quick baseline | Yes (very easy)
Bayesian Optimization | High (sample-efficient) | Expensive evaluations, limited budget | Partial (sequential by design)
Population-Based Training (PBT) | High | Adaptive schedule tuning during training | Yes (parallel population)

Trustworthy AI

Domain 7 covers the ethical, privacy, and safety principles that govern responsible AI development. These concepts are increasingly central to enterprise AI deployment and regulatory compliance.

Ethical AI Principles

Core Principles

  • Fairness: AI should not discriminate based on protected attributes (race, gender, age, disability); equal performance across demographic groups
  • Accountability: humans remain responsible for AI system decisions; audit trails document decision pathways
  • Transparency: model decisions should be explainable (XAI); users should understand how decisions are made
  • Safety: systems should not cause harm; especially critical in high-stakes domains (medical, legal, autonomous vehicles)
  • Reliability: consistent, predictable performance across contexts and user groups
  • Privacy: minimize collection and retention of personal data; protect individual information

Fairness Metrics

  • Demographic parity: prediction rates should be equal across groups (equal positive prediction rates)
  • Equal opportunity: true positive rates (recall) should be equal across groups
  • Disparate impact ratio: ratio of positive outcome rates between groups; values <0.8 often indicate unfair treatment
  • Calibration: predicted probabilities should match actual outcomes equally for all groups
  • Trade-off: some fairness criteria are mathematically incompatible — must choose based on context
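These metrics are simple to compute from predictions once you have a group label. The sketch below (function name and the tiny example data are invented for illustration) reports demographic parity difference, disparate impact ratio, and equal opportunity difference for two groups:

```python
import numpy as np

def fairness_report(y_pred, y_true, group):
    """Basic group-fairness metrics for binary predictions and two groups (0 and 1)."""
    y_pred, y_true, group = map(np.asarray, (y_pred, y_true, group))
    rates = {g: y_pred[group == g].mean() for g in (0, 1)}                   # positive prediction rate
    tpr = {g: y_pred[(group == g) & (y_true == 1)].mean() for g in (0, 1)}   # recall per group
    return {
        "demographic_parity_diff": abs(rates[0] - rates[1]),
        "disparate_impact_ratio": min(rates.values()) / max(rates.values()),
        "equal_opportunity_diff": abs(tpr[0] - tpr[1]),
    }

y_pred = [1, 1, 0, 1, 0, 0, 1, 0]
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]
report = fairness_report(y_pred, y_true, group)
print(report)   # disparate impact ratio here is 1/3 -- well under the 0.8 rule of thumb
```

Reporting these numbers disaggregated by group is exactly the kind of evaluation a model card should include.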
Data Privacy

Privacy Principles

  • Informed consent: users must understand and agree to how their data will be collected and used before collection begins
  • Data minimization: collect only what is strictly necessary for the stated purpose
  • Purpose limitation: data collected for one purpose should not be used for another
  • Right to deletion (GDPR Art. 17): users can request erasure of their personal data
  • Anonymization: irreversibly remove identifying information; Pseudonymization: replace identifiers with pseudonyms — can be reversed with the key

Privacy-Preserving ML Techniques

  • Differential privacy: add carefully calibrated mathematical noise to training data or model outputs so that individual records cannot be identified; privacy budget (ε) controls the trade-off between privacy and utility
  • Federated learning: model trains on data located on user devices; only model gradients (not raw data) are sent to the central server; protects raw data privacy
  • Secure aggregation: combine gradients from federated clients without the server seeing individual contributions
  • Synthetic data generation: use GANs or diffusion models to create privacy-safe training data
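The differential privacy idea can be illustrated with the classic Laplace mechanism for a counting query. This is a minimal sketch (the function name and numbers are invented; real DP training uses mechanisms like DP-SGD with gradient clipping plus Gaussian noise):

```python
import numpy as np

def dp_count(true_count, epsilon, rng):
    """Laplace mechanism for a count query (sensitivity 1): add noise ~ Lap(1/epsilon)."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(42)
exact = 1000                                        # e.g. patients with some condition
loose = dp_count(exact, epsilon=1.0, rng=rng)       # modest privacy, small noise
tight = dp_count(exact, epsilon=0.01, rng=rng)      # strong privacy, ~100x more noise
print(loose, tight)
```

The scale of the noise is 1/ε, which makes the privacy/utility trade-off concrete: shrinking ε by 100× multiplies the expected noise by 100×.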
Bias in AI

Types of Bias

  • Historical bias: training data reflects past societal inequities (e.g. hiring data skewed by historical discrimination)
  • Representation bias: training data underrepresents certain groups (e.g. face recognition worse for darker skin tones)
  • Measurement bias: different accuracy or error rates across groups due to inconsistent data collection
  • Aggregation bias: treating heterogeneous subgroups as one homogeneous population

Bias Mitigation Strategies

  • Pre-processing: resampling (oversample minority), reweighting, diverse data collection
  • In-processing: fairness constraints during training (adversarial debiasing); regularization terms penalizing disparate impact
  • Post-processing: calibrate decision thresholds per group; reject-option classification
  • Evaluation: measure performance disaggregated by demographic group; report disparate impact ratios

NVIDIA Trustworthy AI Tools

  • NeMo Guardrails: define rules for what an LLM application can/cannot say; blocks off-topic, harmful, or jailbreak responses; programmable YAML/Colang policies
  • Model cards: standardized documentation covering model purpose, training data, limitations, evaluation metrics, demographic performance breakdowns
  • NVIDIA AI Safety: alignment research, red-teaming, safety testing frameworks
  • Bias detection: evaluate model on fairness benchmarks before deployment

Practice Quiz — Domains 5 & 7

10 questions covering performance optimization and trustworthy AI principles. Select the best answer for each question.

Memory Hooks

Lock in the key distinctions with these memorable phrases before exam day.

🗜️
Quantization Levels
"FP32 → FP16 → INT8: Shrink Precision, Shrink Size"
Each step halves the bits: 32→16→8. INT8 = 4× smaller than FP32. The trade-off: less precision, more speed. TensorRT automates this calibration for NVIDIA GPUs.
✂️
Structured vs Unstructured Pruning
"Structured = Hardware Friendly; Unstructured = Just Sparse"
Structured pruning removes whole channels/filters → smaller dense model → works on any hardware. Unstructured pruning removes individual weights → sparse matrix → needs special hardware to actually speed up.
⚖️
Mixed Precision
"FP16 to Fly, FP32 to Stay Stable"
FP16 for the fast computation (forward + backward passes); FP32 master weights to keep updates stable. Loss scaling prevents FP16 underflow. PyTorch: autocast() + GradScaler.
🔍
Bayesian Optimization
"Bayesian: Learn from Each Trial to Pick the Next"
Unlike grid/random which ignore past results, Bayesian optimization builds a model of the hyperparameter space from previous runs, then picks the most promising next trial. Most sample-efficient = fewest expensive evaluations needed.
🛡️
Differential Privacy
"Add Noise to Protect the Individual"
Differential privacy adds calibrated mathematical noise so that any single person's data can't be identified in the model's outputs or training. Privacy budget ε: smaller = more private, less utility.
🚦
NeMo Guardrails
"NeMo Guardrails = Traffic Lights for LLMs"
Just like traffic lights control which directions are allowed, NeMo Guardrails define what an LLM can and cannot say. It enforces topic restrictions, blocks jailbreaks, and adds behavioral safety policies programmatically.

Flashcards & Advisor


Quantization
What are the levels and what does each achieve?
FP32→FP16→INT8. INT8 = 4× smaller than FP32, fastest inference. PTQ: quantize after training (fast). QAT: simulate during training (more accurate). TensorRT automates calibration for NVIDIA GPUs.
Structured vs Unstructured Pruning
Key difference and hardware implications?
Structured: remove whole filters/channels → smaller dense model → hardware-friendly, works everywhere. Unstructured: remove individual weights → sparse matrix → needs sparse hardware support to accelerate (e.g. NVIDIA 2:4 sparsity).
Mixed Precision Training
Which precision for which operation, and what is loss scaling?
FP16: forward and backward passes (fast, Tensor Core accelerated). FP32: master weights and weight updates (stable). Loss scaling: multiply loss before backward pass to prevent FP16 gradient underflow; divide after.
Bayesian Optimization
How does it differ from grid and random search?
Uses results of previous trials to build a probabilistic model of the objective function, then picks the most promising next trial via an acquisition function. Most sample-efficient — ideal when each training run is expensive.
Differential Privacy
What does it do and what is the privacy budget?
Adds calibrated mathematical noise to training data or model outputs so individual records cannot be identified. Privacy budget ε: smaller ε = stronger privacy, larger utility loss. Enables ML on sensitive data while protecting individuals.
Federated Learning
How does it protect privacy?
Trains on data residing on user devices. Only model gradients (not raw data) are sent to the central server. Raw user data never leaves the device. Used in mobile keyboard prediction, healthcare applications.
NeMo Guardrails
What does it do and what problem does it solve?
NVIDIA SDK for adding programmable safety guardrails to LLM applications. Defines allowed/blocked topics and behaviors using Colang policies. Prevents jailbreaks, off-topic responses, and harmful outputs in production LLM deployments.
Warmup + Cosine Decay
Why is this schedule used for LLMs?
Start with very low LR, linearly increase to peak LR (warmup phase) to prevent early instability, then cosine-decay to near zero for smooth convergence. Standard for transformer and LLM training. Avoids early divergence and allows gradual fine-grained optimization.

Study Advisor

Model Optimization Techniques

  • Quantization: FP32 → FP16 → INT8; each step reduces bits and memory; INT8 = 4× smaller than FP32
  • PTQ: quantize after training (fast, uses calibration dataset); QAT: simulate during training (better accuracy)
  • TensorRT: automates INT8/FP16 quantization, layer fusion, kernel autotuning for NVIDIA GPUs
  • Structured pruning: remove whole filters/channels; hardware-friendly; directly reduces FLOPS
  • Unstructured pruning: remove individual weights; creates sparse matrix; needs sparse hardware to accelerate
  • Mixed precision: FP16 for forward/backward (fast); FP32 for weight updates (stable); loss scaling for FP16 underflow
  • Knowledge distillation: train small student model on teacher's soft labels; compact model approximates teacher quality

You've Covered All the NCA-GENM Domains!

Test yourself with full-length practice exams, timed simulations, and detailed explanations
