Master LoRA, QLoRA, quantization, KV caching, continuous batching, and the NVIDIA TensorRT-LLM stack — high-weight topics on the NCA-GENL certification.
The Four Optimization Pillars
Adapting, compressing, accelerating, and deploying LLMs — the full NCA-GENL optimization domain
Fine-tuning adapts a pre-trained foundation model to a specific domain or task. Full fine-tuning updates all parameters. Parameter-efficient methods (PEFT) like LoRA and QLoRA update only a small fraction — enabling customization on a single GPU without sacrificing much performance.
Quantization reduces the numerical precision of model weights from FP32 → FP16 → INT8 → INT4, cutting memory use and increasing throughput. Post-Training Quantization (PTQ) requires no retraining. Quantization-Aware Training (QAT) recovers accuracy after aggressive compression.
Inference optimization reduces latency and increases throughput without changing model accuracy. The KV cache eliminates redundant computation. Continuous batching maximizes GPU utilization. Flash Attention reduces memory bandwidth. Speculative decoding parallelizes token generation.
NVIDIA's tool stack covers the full pipeline: NeMo for training and fine-tuning, TensorRT-LLM for inference optimization, Triton Inference Server for model serving, and NIM for packaged one-click deployment. Together they deliver optimized LLMs from research to production.
How It Works
Deep-dive mechanics for fine-tuning methods, quantization strategies, inference optimizations, and the NVIDIA deployment stack
From Full Fine-tuning to LoRA, QLoRA & DPO
PEFT Spectrum — most to least compute-intensive
Research shows that the weight updates needed to fine-tune a large model for a specific task have a low intrinsic rank — they live in a low-dimensional subspace of the full weight matrix. LoRA exploits this: instead of updating the full d×d weight matrix (where d can be 4096+), it learns a rank-r update ΔW = B×A via two small matrices A and B. At rank r=8, a 4096×4096 weight matrix has 4096×4096 ≈ 16.7M parameters, while the LoRA update has only 2×4096×8 = 65K trainable parameters — 256× fewer.
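To make the mechanics concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. The class name, rank, and alpha values are illustrative, not any specific library's API:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update ΔW = B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # freeze all pre-trained weights
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # r x d_in, small random init
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # d_out x r, starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the low-rank correction; only A and B receive gradients
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 trainable params vs. ~16.8M in the frozen base layer
```

At inference time the learned update B×A can be added back into the base weight matrix, which is why LoRA adds no serving overhead once merged.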
QLoRA (Quantized LoRA) combines three innovations: (1) NF4 quantization of the frozen base model — weights stored in 4-bit NormalFloat format, which is information-theoretically optimal for normally distributed (Gaussian) weights. (2) Double quantization — also quantizes the quantization constants themselves, saving ~0.4 bits per parameter. (3) Paged optimizers — uses NVIDIA unified memory to page optimizer states to CPU RAM during memory spikes (e.g., gradient checkpointing), preventing OOM errors. Net result: fine-tune a 65B model on a single 48GB GPU.
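A hedged sketch of how QLoRA is typically configured with Hugging Face transformers, peft, and bitsandbytes. The model id is a placeholder and exact argument names can vary between library versions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 base model with double quantization (QLoRA innovations 1 and 2)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

# Trainable BF16 LoRA adapters on top of the frozen 4-bit base
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()       # typically well under 1% of total params
```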
RLHF requires training a separate Reward Model and running a complex PPO reinforcement learning loop — expensive and unstable. DPO simplifies this by reformulating the preference optimization problem as a supervised classification task directly on the policy model. Given a (prompt, chosen response, rejected response) triple, DPO increases the likelihood of the chosen response and decreases the rejected one — no reward model needed. Simpler, more stable, and competitive with RLHF. DPO is now widely used as a practical RLHF alternative.
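The core of DPO fits in a few lines. A minimal sketch of the loss on per-sequence log-probabilities (function and variable names are illustrative):

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a batch of (prompt, chosen, rejected) triples.

    Inputs are summed log-probabilities of each response under the trainable
    policy (pi_*) and the frozen reference model (ref_*).
    """
    chosen_logratio = pi_chosen_logp - ref_chosen_logp        # how much the policy favors the chosen answer vs. the reference
    rejected_logratio = pi_rejected_logp - ref_rejected_logp
    # Logistic loss: push the chosen log-ratio above the rejected one, scaled by beta
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```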
PTQ, QAT, GPTQ, AWQ & Knowledge Distillation
PTQ quantizes a fully-trained model without any further training passes. The process: determine the range of weight and activation values using a small calibration dataset, then map those values to the lower-precision format (e.g., INT8). Fast and requires no GPU cluster. Works well for INT8 (minimal accuracy loss). For INT4, accuracy can degrade for complex tasks — QAT or advanced algorithms like GPTQ/AWQ are preferred.
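As a rough illustration of what calibration does, here is a toy symmetric INT8 quantization of a weight tensor. Real toolkits such as TensorRT use more sophisticated per-channel and activation calibration; this only shows the range-to-scale mapping:

```python
import torch

def quantize_int8_symmetric(w: torch.Tensor):
    """Toy post-training quantization: map FP32 weights to INT8 with one scale."""
    scale = w.abs().max() / 127.0                 # calibration: observed range -> INT8 range
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale            # approximate reconstruction at inference time

w = torch.randn(4096, 4096)
q, scale = quantize_int8_symmetric(w)
error = (dequantize(q, scale) - w).abs().mean()
print(f"mean abs quantization error: {error:.5f}")  # small relative to typical weight magnitude
```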
QAT simulates quantization noise during a training or fine-tuning pass, allowing the model weights to adjust to compensate for the precision loss. Fake-quantization nodes apply quantize/dequantize rounding in the forward pass, while gradients bypass the non-differentiable rounding step (straight-through estimator). The result is a model whose weights are already "calibrated" for the target precision — significantly better accuracy than PTQ at INT4/INT8 for difficult tasks, at the cost of more compute than PTQ.
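A minimal sketch of fake quantization with a straight-through estimator in PyTorch. Frameworks expose this through dedicated QAT APIs; this only demonstrates the gradient trick:

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Forward: quantize-dequantize to simulate INT8 rounding.
    Backward: pass gradients straight through the non-differentiable rounding."""
    @staticmethod
    def forward(ctx, x, scale):
        q = torch.clamp(torch.round(x / scale), -127, 127)
        return q * scale                        # dequantized value used by the rest of the network

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                # straight-through estimator: identity gradient

x = torch.randn(8, requires_grad=True)
y = FakeQuantSTE.apply(x, torch.tensor(0.05))
y.sum().backward()
print(x.grad)                                   # all ones: rounding was ignored in the backward pass
```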
GPTQ (Generative Pre-trained Transformer Quantization) uses second-order (Hessian) information to find optimal INT4 quantization values that minimize reconstruction error — layer by layer. AWQ (Activation-aware Weight Quantization) protects the 1% of "salient" weights that have the most impact on output quality from heavy quantization, quantizing only the remaining 99% aggressively. Both achieve near-FP16 quality at INT4 weight precision — making 70B models runnable on 2× 24GB consumer GPUs.
Distillation trains a smaller "student" model to mimic the output distribution of a larger "teacher" model. Instead of training on hard labels (0 or 1), the student learns from the teacher's soft probability outputs — which contain richer information about the relationships between classes. Example: distilling a 70B teacher into a 7B student that achieves 80–90% of the teacher's performance at 10% of the inference cost. Used to create models like DistilBERT (40% smaller than BERT, 97% of performance).
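A common way to implement the soft-label objective is a temperature-scaled KL divergence between teacher and student logits; a minimal sketch, where the temperature and loss weighting are typical but tunable choices:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of hard-label cross-entropy and soft-label KL against the teacher."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                       # T^2 keeps gradient magnitudes comparable across temperatures
    return alpha * hard + (1 - alpha) * soft
```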
KV Cache, Continuous Batching, Flash Attention & Parallelism
During autoregressive generation, each new token requires attention over all previous tokens. Without a cache, generating token t requires recomputing K and V for tokens 1 through t-1 from scratch, so per-token compute grows as O(t²). The KV cache stores the K and V tensors for all past tokens, so generating token t only requires computing K and V for token t itself and reading the rest from cache. This reduces per-token compute from O(t²) to O(t) — a massive speedup, especially for long outputs.
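A stripped-down single-head decode step with a KV cache, to show what is stored and what is recomputed (shapes and names are illustrative):

```python
import torch

def decode_step(x_new, W_q, W_k, W_v, kv_cache):
    """Attention output for one new token, reusing cached K and V.

    x_new:    (1, d) hidden state of the newly generated token
    kv_cache: dict with "K" and "V" tensors of shape (t-1, d) for past tokens
    """
    q = x_new @ W_q                                   # only the new token's query is needed
    k_new, v_new = x_new @ W_k, x_new @ W_v           # compute K, V for the new token only
    K = torch.cat([kv_cache["K"], k_new], dim=0)      # append to cache: (t, d)
    V = torch.cat([kv_cache["V"], v_new], dim=0)
    kv_cache["K"], kv_cache["V"] = K, V
    attn = torch.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1)  # (1, t) scores
    return attn @ V                                   # (1, d) output; O(t) work, not O(t^2)
```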
Traditional KV caches pre-allocate a contiguous block of GPU memory per sequence based on the maximum sequence length — wasting memory for short sequences and causing fragmentation. Paged KV cache (inspired by OS virtual memory paging) allocates KV cache in fixed-size "pages" and manages them dynamically. This allows higher batch sizes (more concurrent users), eliminates memory fragmentation, and enables memory sharing across requests with common prefixes (prefix caching). TensorRT-LLM implements paged KV cache natively.
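Conceptually, a paged KV cache behaves like an OS page table. A toy allocator sketch follows; block size and bookkeeping are illustrative, not TensorRT-LLM's or vLLM's actual implementation:

```python
class PagedKVAllocator:
    """Toy page-table allocator: KV memory is handed out in fixed-size pages."""
    def __init__(self, num_pages: int, tokens_per_page: int = 16):
        self.free_pages = list(range(num_pages))
        self.tokens_per_page = tokens_per_page
        self.page_table = {}   # seq_id -> list of physical page ids
        self.seq_lens = {}     # seq_id -> number of cached tokens

    def append_token(self, seq_id: int):
        """Reserve cache space for one more token of this sequence."""
        pages = self.page_table.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(pages) * self.tokens_per_page:
            pages.append(self.free_pages.pop())   # allocate a page only when the current one fills
        self.seq_lens[seq_id] = length + 1

    def release(self, seq_id: int):
        """Return all of a finished sequence's pages to the free pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because no large contiguous region is reserved up front, short sequences waste no memory and pages freed by finished requests are immediately reusable.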
Static batching processes a fixed group of requests together and waits for all to finish before accepting new ones. If one request completes early, its GPU slot idles. Continuous batching inserts new requests into available slots at each generation iteration — keeping GPU utilization near 100% regardless of variable response lengths. For typical LLM serving workloads with mixed short and long responses, continuous batching improves throughput by 5–10× compared to static batching.
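The scheduling idea can be sketched in a few lines: after every generation iteration, finished sequences are evicted and waiting requests fill the freed slots. This is a toy scheduler, not any engine's real implementation:

```python
from collections import deque

def continuous_batching_loop(requests, generate_one_token, max_batch_size=8):
    """Toy serving loop: refill the batch every iteration instead of waiting
    for the whole batch to finish (static batching)."""
    waiting = deque(requests)
    active, finished = [], []
    while waiting or active:
        # Fill any free slots with waiting requests before the next iteration
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())
        for req in list(active):
            done = generate_one_token(req)   # one decode step for this request
            if done:
                active.remove(req)           # slot is freed immediately, not at batch end
                finished.append(req)
    return finished
```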
Standard attention computes a full n×n attention score matrix in GPU High Bandwidth Memory (HBM) — O(n²) memory. For long sequences, this becomes the bottleneck. Flash Attention reorders the attention computation into tiles that fit in the GPU's fast SRAM (on-chip cache), dramatically reducing HBM reads and writes. The mathematical result is identical, but it requires far fewer memory operations — 2–4× faster and uses O(n) memory instead of O(n²). Flash Attention 2 and 3 further improve parallelism on modern GPU architectures.
Tensor parallelism splits individual weight matrices across multiple GPUs horizontally — each GPU computes a slice of each layer's output in parallel. Requires all-reduce communication at each layer boundary. Best for reducing latency within a single request. Pipeline parallelism assigns consecutive Transformer layers to different GPUs — GPU 0 runs layers 1–12, GPU 1 runs layers 13–24, etc. More communication-efficient for large models but requires micro-batching to keep all GPUs busy. Most production deployments combine both (3D parallelism: tensor + pipeline + data).
NeMo, TensorRT-LLM, Triton & NIM
NeMo is an end-to-end framework for building and customizing LLMs. Fine-tuning capabilities: full SFT, LoRA/QLoRA (configurable rank and target modules), P-Tuning, Adapters, and DPO/RLHF pipelines (reward-model training + PPO). Distributed training via Megatron-LM — tensor, pipeline, and data parallelism across hundreds of GPUs. NeMo exports models in formats ready for TensorRT-LLM optimization. Use NeMo when you need to create a custom domain-specific model or align a foundation model to specific behavior.
TensorRT-LLM compiles and optimizes LLMs for maximum inference performance on NVIDIA GPUs. Optimization techniques applied: (1) Quantization: INT8/INT4/FP8 weight and activation quantization. (2) Kernel fusion: multiple sequential GPU operations merged into single optimized kernels to reduce memory-bandwidth overhead. (3) In-flight (continuous) batching. (4) Paged KV cache. (5) Tensor parallelism across multiple GPUs. (6) Optimized attention kernels (including Flash Attention variants). Net result: 2–5× lower latency and 5–10× higher throughput vs. a PyTorch baseline on the same hardware.
Triton is a production-grade model serving framework that supports multiple backends (TensorRT, TensorRT-LLM, PyTorch, ONNX, TensorFlow) via a single unified API. Features: dynamic batching, model ensembles (chaining multiple models), concurrent model execution, GPU/CPU/custom backend support, Prometheus metrics integration, and health-check endpoints. Triton is the serving layer underneath NIM and the standard way to serve TensorRT-LLM engines — it handles request routing, queuing, and scaling.
NIM (NVIDIA Inference Microservices) packages pre-optimized TensorRT-LLM engines with Triton serving, health checks, and an OpenAI-compatible REST API into a single Docker container. Deployment is a single docker run command — NIM auto-selects the best TensorRT-LLM engine for the detected GPU hardware. Enterprise features: security scanning, support SLAs, regular model updates. NIM is designed for organizations that need production LLM APIs on their own infrastructure without the complexity of manual TensorRT-LLM optimization.
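Because NIM exposes an OpenAI-compatible API, an existing OpenAI client can simply point at the local container. A hedged sketch; the container port, model id, and API-key handling depend on the specific NIM you deploy:

```python
from openai import OpenAI

# A locally deployed NIM container serves an OpenAI-compatible API;
# the endpoint and model name below are placeholders for whatever NIM you run.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",     # placeholder model id
    messages=[{"role": "user", "content": "Summarize LoRA in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```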
Compare
Filter by pillar to compare fine-tuning methods, quantization formats, inference techniques, and NVIDIA tools
| Technique | Category | Key Metric | What It Does | NCA-GENL Exam Point |
|---|---|---|---|---|
| Full Fine-tuning | Fine-tuning | 100% params updated | Updates every weight — best accuracy, highest cost | Requires storing full model gradients + optimizer states (~3–4× model size in GPU memory) |
| LoRA | Fine-tuning | Rank r=4–64; ~0.1–1% params | Trains low-rank matrices injected into attention layers; base model frozen | ΔW = B×A · No inference overhead (can merge) · 100–10,000× fewer params than full FT |
| QLoRA | Fine-tuning | NF4 base + BF16 adapters | LoRA + 4-bit base model quantization for single-GPU 70B fine-tuning | NF4 = NormalFloat4, optimal for Gaussian-distributed weights · Double quantization adds extra savings |
| DPO | Fine-tuning | No reward model needed | Directly optimizes policy on (prompt, chosen, rejected) triples | Simpler and more stable than RLHF/PPO · Competitive alignment results · Growing adoption |
| Adapter Tuning | Fine-tuning | ~3–5% params | Small bottleneck layers inserted between Transformer blocks; only adapters trained | Older PEFT approach; LoRA generally preferred for LLMs due to lower overhead |
| Prefix / P-Tuning | Fine-tuning | <0.1% params | Trains soft prompt embeddings prepended to input; no model weights changed | Very low compute; works for some tasks but less effective than LoRA for complex domain adaptation |
| FP32 → FP16 / BF16 | Quantization | 2× memory reduction | Half-precision floating point; standard for training and inference | BF16 preferred for training (wider dynamic range); FP16 used for inference on older GPUs |
| INT8 (PTQ) | Quantization | 4× memory reduction vs FP32 | Post-training quantization to 8-bit integer; minimal accuracy loss | TensorRT-LLM default for many deployments · LLM.int8() from the bitsandbytes library is a common implementation |
| INT4 / NF4 (GPTQ / AWQ) | Quantization | 8× memory reduction vs FP32 | 4-bit weight quantization; advanced algorithms preserve quality | GPTQ uses Hessian-based optimal quantization · AWQ protects salient weights · Both near-FP16 quality |
| FP8 (H100 native) | Quantization | 4× memory reduction vs FP32 | 8-bit floating point with hardware support on H100 | Better accuracy/range trade-off than INT8 · Supported natively by H100 Transformer Engine · TensorRT-LLM default on H100 |
| Knowledge Distillation | Quantization | Smaller student model | Trains student to mimic teacher's soft output probabilities | Not quantization per se, but a compression technique · DistilBERT = 40% smaller, 97% BERT performance |
| Pruning | Quantization | Remove low-magnitude weights | Zeros out (unstructured) or removes (structured) weights with small absolute values | Structured pruning removes entire heads/layers — hardware speedup. Unstructured needs sparse compute support. |
| KV Cache | Inference | O(t²) → O(t) compute per token | Stores past K,V tensors to avoid recomputing attention over previous tokens | Essential for all production LLM inference · Grows with seq_len · Major source of GPU memory consumption |
| Paged KV Cache | Inference | Eliminates memory fragmentation | KV cache in fixed-size pages; dynamically allocated like OS virtual memory | Higher batch sizes · Enables prefix caching · Used by vLLM and TensorRT-LLM |
| Continuous Batching | Inference | 5–10× throughput improvement | Inserts new requests as slots free up; keeps GPU near 100% utilization | vs. static batching where finished requests leave idle GPU slots · Default in TensorRT-LLM |
| Flash Attention | Inference | O(n²) → O(n) HBM memory | Tiled attention computation in SRAM; reduces memory bandwidth usage | Same math as standard attention but 2–4× faster · Supports longer sequences · FA2/FA3 improve further |
| Tensor Parallelism | Inference | Splits weight matrices across GPUs | Each GPU computes a slice of each layer in parallel | Reduces latency per request · All-reduce communication overhead at each layer · Best for latency-critical serving |
| Speculative Decoding | Inference | 2–3× decode speedup | Small draft model proposes tokens; large model verifies multiple at once | Decode phase is memory-bandwidth bound — speculative decoding adds compute to speed it up |
| NVIDIA NeMo | NVIDIA Tools | Training + fine-tuning FW | End-to-end LLM training, SFT, LoRA/QLoRA, RLHF, DPO — multi-GPU | Use for custom model training/fine-tuning · Exports to TensorRT-LLM · Supports Megatron parallelism |
| NVIDIA TensorRT-LLM | NVIDIA Tools | 2–5× latency reduction | Quantization, kernel fusion, batching, paged KV cache, parallelism | Open-source · Inference-only (not training) · Foundation of NIM |
| NVIDIA Triton Server | NVIDIA Tools | Multi-framework serving | Unified serving for TensorRT-LLM, PyTorch, ONNX, TensorFlow backends | Dynamic batching · Model ensembles · Prometheus metrics · Production serving layer |
| NVIDIA NIM | NVIDIA Tools | One-command deployment | Pre-optimized containers with OpenAI-compatible API; auto-selects best TRT-LLM engine | Enterprise support + security · Abstracts TensorRT-LLM complexity · docker run to deploy |
Real Examples
Concrete scenarios showing how fine-tuning and inference optimization decisions play out in practice
Practice Quiz
10 NCA-GENL style questions across all four pillars — instant explanations after each answer
Optimization Advisor
Describe your goal and get a targeted recommendation for fine-tuning, quantization, or inference optimization
Memory Hooks
Click any card to flip — 8 high-yield fine-tuning and inference optimization mnemonics