Master LoRA, QLoRA, quantization, KV caching, continuous batching, and the NVIDIA TensorRT-LLM stack — high-weight topics on the NCA-GENL certification.
The Four Optimization Pillars
Adapting, compressing, accelerating, and deploying LLMs — the full NCA-GENL optimization domain
Fine-tuning adapts a pre-trained foundation model to a specific domain or task. Full fine-tuning updates all parameters. Parameter-efficient methods (PEFT) like LoRA and QLoRA update only a small fraction — enabling customization on a single GPU without sacrificing much performance.
Quantization reduces the numerical precision of model weights from FP32 → FP16 → INT8 → INT4, cutting memory use and increasing throughput. Post-Training Quantization (PTQ) requires no retraining. Quantization-Aware Training (QAT) recovers accuracy after aggressive compression.
Inference optimization reduces latency and increases throughput without changing model accuracy. The KV cache eliminates redundant computation. Continuous batching maximizes GPU utilization. Flash Attention reduces memory bandwidth. Speculative decoding parallelizes token generation.
NVIDIA's tool stack covers the full pipeline: NeMo for training and fine-tuning, TensorRT-LLM for inference optimization, Triton Inference Server for model serving, and NIM for packaged one-click deployment. Together they deliver optimized LLMs from research to production.
How It Works
Deep-dive mechanics for fine-tuning methods, quantization strategies, inference optimizations, and the NVIDIA deployment stack
From Full Fine-tuning to LoRA, QLoRA & DPO
PEFT Spectrum — most to least compute-intensive
Research shows that the weight updates needed to fine-tune a large model for a specific task have a low intrinsic rank — they live in a low-dimensional subspace of the full weight matrix. LoRA exploits this: instead of updating the full d×d weight matrix (where d can be 4096+), it learns a rank-r update ΔW = B×A via two small matrices A and B. At rank r=8, a 4096×4096 weight matrix has 4096×4096 ≈ 16.7M parameters, while the LoRA update has only 2×4096×8 = 65K trainable parameters — 256× fewer.
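To make the mechanics concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. The class name, rank, and alpha values are illustrative, not any specific library's API:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update ΔW = B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # freeze all pre-trained weights
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # r x d_in, small random init
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # d_out x r, starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the low-rank correction; only A and B receive gradients
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 trainable params vs. ~16.8M in the frozen base layer
```

At inference time the learned update B×A can be added back into the base weight matrix, which is why LoRA adds no serving overhead once merged.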
QLoRA (Quantized LoRA) combines three innovations: (1) NF4 quantization of the frozen base model — weights stored in 4-bit NormalFloat format, which is information-theoretically optimal for normally distributed (Gaussian) weights. (2) Double quantization — also quantizes the quantization constants themselves, saving ~0.4 bits per parameter. (3) Paged optimizers — uses NVIDIA unified memory to page optimizer states to CPU RAM during memory spikes (e.g., gradient checkpointing), preventing OOM errors. Net result: fine-tune a 65B model on a single 48GB GPU.
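A hedged sketch of how QLoRA is typically configured with Hugging Face transformers, peft, and bitsandbytes. The model id is a placeholder and exact argument names can vary between library versions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 base model with double quantization (QLoRA innovations 1 and 2)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

# Trainable BF16 LoRA adapters on top of the frozen 4-bit base
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()       # typically well under 1% of total params
```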
RLHF requires training a separate Reward Model and running a complex PPO reinforcement learning loop — expensive and unstable. DPO simplifies this by reformulating the preference optimization problem as a supervised classification task directly on the policy model. Given a (prompt, chosen response, rejected response) triple, DPO increases the likelihood of the chosen response and decreases the rejected one — no reward model needed. Simpler, more stable, and competitive with RLHF. DPO is now widely used as a practical RLHF alternative.
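The core of DPO fits in a few lines. A minimal sketch of the loss on per-sequence log-probabilities (function and variable names are illustrative):

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a batch of (prompt, chosen, rejected) triples.

    Inputs are summed log-probabilities of each response under the trainable
    policy (pi_*) and the frozen reference model (ref_*).
    """
    chosen_logratio = pi_chosen_logp - ref_chosen_logp        # how much the policy favors the chosen answer vs. the reference
    rejected_logratio = pi_rejected_logp - ref_rejected_logp
    # Logistic loss: push the chosen log-ratio above the rejected one, scaled by beta
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```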
PTQ, QAT, GPTQ, AWQ & Knowledge Distillation
PTQ quantizes a fully-trained model without any further training passes. The process: determine the range of weight and activation values using a small calibration dataset, then map those values to the lower-precision format (e.g., INT8). Fast and requires no GPU cluster. Works well for INT8 (minimal accuracy loss). For INT4, accuracy can degrade for complex tasks — QAT or advanced algorithms like GPTQ/AWQ are preferred.
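As a rough illustration of what calibration does, here is a toy symmetric INT8 quantization of a weight tensor. Real toolkits such as TensorRT use more sophisticated per-channel and activation calibration; this only shows the range-to-scale mapping:

```python
import torch

def quantize_int8_symmetric(w: torch.Tensor):
    """Toy post-training quantization: map FP32 weights to INT8 with one scale."""
    scale = w.abs().max() / 127.0                 # calibration: observed range -> INT8 range
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale            # approximate reconstruction at inference time

w = torch.randn(4096, 4096)
q, scale = quantize_int8_symmetric(w)
error = (dequantize(q, scale) - w).abs().mean()
print(f"mean abs quantization error: {error:.5f}")  # small relative to typical weight magnitude
```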
QAT simulates quantization noise during a training or fine-tuning pass, allowing the model weights to adjust to compensate for the precision loss. Fake-quantization nodes apply quantize/dequantize rounding in the forward pass, while gradients bypass the non-differentiable rounding step (straight-through estimator). The result is a model whose weights are already "calibrated" for the target precision — significantly better accuracy than PTQ at INT4/INT8 for difficult tasks, at the cost of more compute than PTQ.
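A minimal sketch of fake quantization with a straight-through estimator in PyTorch. Frameworks expose this through dedicated QAT APIs; this only demonstrates the gradient trick:

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Forward: quantize-dequantize to simulate INT8 rounding.
    Backward: pass gradients straight through the non-differentiable rounding."""
    @staticmethod
    def forward(ctx, x, scale):
        q = torch.clamp(torch.round(x / scale), -127, 127)
        return q * scale                        # dequantized value used by the rest of the network

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                # straight-through estimator: identity gradient

x = torch.randn(8, requires_grad=True)
y = FakeQuantSTE.apply(x, torch.tensor(0.05))
y.sum().backward()
print(x.grad)                                   # all ones: rounding was ignored in the backward pass
```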
GPTQ (Generative Pre-trained Transformer Quantization) uses second-order (Hessian) information to find optimal INT4 quantization values that minimize reconstruction error — layer by layer. AWQ (Activation-aware Weight Quantization) protects the 1% of "salient" weights that have the most impact on output quality from heavy quantization, quantizing only the remaining 99% aggressively. Both achieve near-FP16 quality at INT4 weight precision — making 70B models runnable on 2× 24GB consumer GPUs.
Distillation trains a smaller "student" model to mimic the output distribution of a larger "teacher" model. Instead of training on hard labels (0 or 1), the student learns from the teacher's soft probability outputs — which contain richer information about the relationships between classes. Example: distilling a 70B teacher into a 7B student that achieves 80–90% of the teacher's performance at 10% of the inference cost. Used to create models like DistilBERT (40% smaller than BERT, 97% of performance).
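A common way to implement the soft-label objective is a temperature-scaled KL divergence between teacher and student logits; a minimal sketch, where the temperature and loss weighting are typical but tunable choices:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of hard-label cross-entropy and soft-label KL against the teacher."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                       # T^2 keeps gradient magnitudes comparable across temperatures
    return alpha * hard + (1 - alpha) * soft
```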
KV Cache, Continuous Batching, Flash Attention & Parallelism
During autoregressive generation, each new token requires attention over all previous tokens. Without a cache, generating token t requires recomputing K and V for tokens 1 through t-1 from scratch, so per-token compute grows as O(t²). The KV cache stores the K and V tensors for all past tokens, so generating token t only requires computing K and V for token t itself and reading the rest from cache. This reduces per-token compute from O(t²) to O(t) — a massive speedup, especially for long outputs.
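A stripped-down single-head decode step with a KV cache, to show what is stored and what is recomputed (shapes and names are illustrative):

```python
import torch

def decode_step(x_new, W_q, W_k, W_v, kv_cache):
    """Attention output for one new token, reusing cached K and V.

    x_new:    (1, d) hidden state of the newly generated token
    kv_cache: dict with "K" and "V" tensors of shape (t-1, d) for past tokens
    """
    q = x_new @ W_q                                   # only the new token's query is needed
    k_new, v_new = x_new @ W_k, x_new @ W_v           # compute K, V for the new token only
    K = torch.cat([kv_cache["K"], k_new], dim=0)      # append to cache: (t, d)
    V = torch.cat([kv_cache["V"], v_new], dim=0)
    kv_cache["K"], kv_cache["V"] = K, V
    attn = torch.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1)  # (1, t) scores
    return attn @ V                                   # (1, d) output; O(t) work, not O(t^2)
```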
Traditional KV caches pre-allocate a contiguous block of GPU memory per sequence based on the maximum sequence length — wasting memory for short sequences and causing fragmentation. Paged KV cache (inspired by OS virtual memory paging) allocates KV cache in fixed-size "pages" and manages them dynamically. This allows higher batch sizes (more concurrent users), eliminates memory fragmentation, and enables memory sharing across requests with common prefixes (prefix caching). TensorRT-LLM implements paged KV cache natively.
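Conceptually, a paged KV cache behaves like an OS page table. A toy allocator sketch follows; block size and bookkeeping are illustrative, not TensorRT-LLM's or vLLM's actual implementation:

```python
class PagedKVAllocator:
    """Toy page-table allocator: KV memory is handed out in fixed-size pages."""
    def __init__(self, num_pages: int, tokens_per_page: int = 16):
        self.free_pages = list(range(num_pages))
        self.tokens_per_page = tokens_per_page
        self.page_table = {}   # seq_id -> list of physical page ids
        self.seq_lens = {}     # seq_id -> number of cached tokens

    def append_token(self, seq_id: int):
        """Reserve cache space for one more token of this sequence."""
        pages = self.page_table.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(pages) * self.tokens_per_page:
            pages.append(self.free_pages.pop())   # allocate a page only when the current one fills
        self.seq_lens[seq_id] = length + 1

    def release(self, seq_id: int):
        """Return all of a finished sequence's pages to the free pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because no large contiguous region is reserved up front, short sequences waste no memory and pages freed by finished requests are immediately reusable.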
Static batching processes a fixed group of requests together and waits for all to finish before accepting new ones. If one request completes early, its GPU slot idles. Continuous batching inserts new requests into available slots at each generation iteration — keeping GPU utilization near 100% regardless of variable response lengths. For typical LLM serving workloads with mixed short and long responses, continuous batching improves throughput by 5–10× compared to static batching.
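The scheduling idea can be sketched in a few lines: after every generation iteration, finished sequences are evicted and waiting requests fill the freed slots. This is a toy scheduler, not any engine's real implementation:

```python
from collections import deque

def continuous_batching_loop(requests, generate_one_token, max_batch_size=8):
    """Toy serving loop: refill the batch every iteration instead of waiting
    for the whole batch to finish (static batching)."""
    waiting = deque(requests)
    active, finished = [], []
    while waiting or active:
        # Fill any free slots with waiting requests before the next iteration
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())
        for req in list(active):
            done = generate_one_token(req)   # one decode step for this request
            if done:
                active.remove(req)           # slot is freed immediately, not at batch end
                finished.append(req)
    return finished
```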
Standard attention computes a full n×n attention score matrix in GPU High Bandwidth Memory (HBM) — O(n²) memory. For long sequences, this becomes the bottleneck. Flash Attention reorders the attention computation into tiles that fit in the GPU's fast SRAM (on-chip cache), dramatically reducing HBM reads and writes. The mathematical result is identical, but it requires far fewer memory operations — 2–4× faster and uses O(n) memory instead of O(n²). Flash Attention 2 and 3 further improve parallelism on modern GPU architectures.
Tensor parallelism splits individual weight matrices across multiple GPUs horizontally — each GPU computes a slice of each layer's output in parallel. Requires all-reduce communication at each layer boundary. Best for reducing latency within a single request. Pipeline parallelism assigns consecutive Transformer layers to different GPUs — GPU 0 runs layers 1–12, GPU 1 runs layers 13–24, etc. More communication-efficient for large models but requires micro-batching to keep all GPUs busy. Most production deployments combine both (3D parallelism: tensor + pipeline + data).
NeMo, TensorRT-LLM, Triton & NIM
NeMo is an end-to-end framework for building and customizing LLMs. Fine-tuning capabilities: full SFT, LoRA/QLoRA (configurable rank and target modules), P-Tuning, Adapters, and DPO/RLHF pipelines (reward-model training + PPO). Distributed training via Megatron-LM — tensor, pipeline, and data parallelism across hundreds of GPUs. NeMo exports models in formats ready for TensorRT-LLM optimization. Use NeMo when you need to create a custom domain-specific model or align a foundation model to specific behavior.
TensorRT-LLM compiles and optimizes LLMs for maximum inference performance on NVIDIA GPUs. Optimization techniques applied: (1) Quantization: INT8/INT4/FP8 weight and activation quantization. (2) Kernel fusion: multiple sequential GPU operations merged into single optimized kernels to reduce memory-bandwidth overhead. (3) In-flight (continuous) batching. (4) Paged KV cache. (5) Tensor parallelism across multiple GPUs. (6) Optimized attention kernels (including Flash Attention variants). Net result: 2–5× lower latency and 5–10× higher throughput vs. a PyTorch baseline on the same hardware.
Triton is a production-grade model serving framework that supports multiple backends (TensorRT, TensorRT-LLM, PyTorch, ONNX, TensorFlow) via a single unified API. Features: dynamic batching, model ensembles (chaining multiple models), concurrent model execution, GPU/CPU/custom backend support, Prometheus metrics integration, and health-check endpoints. Triton is the serving layer underneath NIM and the standard way to serve TensorRT-LLM engines — it handles request routing, queuing, and scaling.
NIM (NVIDIA Inference Microservices) packages pre-optimized TensorRT-LLM engines with Triton serving, health checks, and an OpenAI-compatible REST API into a single Docker container. Deployment is a single docker run command — NIM auto-selects the best TensorRT-LLM engine for the detected GPU hardware. Enterprise features: security scanning, support SLAs, regular model updates. NIM is designed for organizations that need production LLM APIs on their own infrastructure without the complexity of manual TensorRT-LLM optimization.
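Because NIM exposes an OpenAI-compatible API, an existing OpenAI client can simply point at the local container. A hedged sketch; the container port, model id, and API-key handling depend on the specific NIM you deploy:

```python
from openai import OpenAI

# A locally deployed NIM container serves an OpenAI-compatible API;
# the endpoint and model name below are placeholders for whatever NIM you run.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",     # placeholder model id
    messages=[{"role": "user", "content": "Summarize LoRA in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```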
Compare
Filter by pillar to compare fine-tuning methods, quantization formats, inference techniques, and NVIDIA tools
| Technique | Category | Key Metric | What It Does | NCA-GENL Exam Point |
|---|---|---|---|---|
| Full Fine-tuning | Fine-tuning | 100% params updated | Updates every weight — best accuracy, highest cost | Requires storing full model gradients + optimizer states (~3–4× model size in GPU memory) |
| LoRA | Fine-tuning | Rank r=4–64; ~0.1–1% params | Trains low-rank matrices injected into attention layers; base model frozen | ΔW = B×A · No inference overhead (can merge) · 100–10,000× fewer params than full FT |
| QLoRA | Fine-tuning | NF4 base + BF16 adapters | LoRA + 4-bit base model quantization for single-GPU 70B fine-tuning | NF4 = NormalFloat4, optimal for Gaussian-distributed weights · Double quantization adds extra savings |
| DPO | Fine-tuning | No reward model needed | Directly optimizes policy on (prompt, chosen, rejected) triples | Simpler and more stable than RLHF/PPO · Competitive alignment results · Growing adoption |
| Adapter Tuning | Fine-tuning | ~3–5% params | Small bottleneck layers inserted between Transformer blocks; only adapters trained | Older PEFT approach; LoRA generally preferred for LLMs due to lower overhead |
| Prefix / P-Tuning | Fine-tuning | <0.1% params | Trains soft prompt embeddings prepended to input; no model weights changed | Very low compute; works for some tasks but less effective than LoRA for complex domain adaptation |
| FP32 → FP16 / BF16 | Quantization | 2× memory reduction | Half-precision floating point; standard for training and inference | BF16 preferred for training (wider dynamic range); FP16 used for inference on older GPUs |
| INT8 (PTQ) | Quantization | 4× memory reduction vs FP32 | Post-training quantization to 8-bit integer; minimal accuracy loss | TensorRT-LLM default for many deployments · LLM.int8() from the bitsandbytes library is a common implementation |
| INT4 / NF4 (GPTQ / AWQ) | Quantization | 8× memory reduction vs FP32 | 4-bit weight quantization; advanced algorithms preserve quality | GPTQ uses Hessian-based optimal quantization · AWQ protects salient weights · Both near-FP16 quality |
| FP8 (H100 native) | Quantization | 4× memory reduction vs FP32 | 8-bit floating point with hardware support on H100 | Better accuracy/range trade-off than INT8 · Supported natively by H100 Transformer Engine · TensorRT-LLM default on H100 |
| Knowledge Distillation | Quantization | Smaller student model | Trains student to mimic teacher's soft output probabilities | Not quantization per se, but a compression technique · DistilBERT = 40% smaller, 97% BERT performance |
| Pruning | Quantization | Remove low-magnitude weights | Zeros out (unstructured) or removes (structured) weights with small absolute values | Structured pruning removes entire heads/layers — hardware speedup. Unstructured needs sparse compute support. |
| KV Cache | Inference | O(t²) → O(t) compute per token | Stores past K,V tensors to avoid recomputing attention over previous tokens | Essential for all production LLM inference · Grows with seq_len · Major source of GPU memory consumption |
| Paged KV Cache | Inference | Eliminates memory fragmentation | KV cache in fixed-size pages; dynamically allocated like OS virtual memory | Higher batch sizes · Enables prefix caching · Used by vLLM and TensorRT-LLM |
| Continuous Batching | Inference | 5–10× throughput improvement | Inserts new requests as slots free up; keeps GPU near 100% utilization | vs. static batching where finished requests leave idle GPU slots · Default in TensorRT-LLM |
| Flash Attention | Inference | O(n²) → O(n) HBM memory | Tiled attention computation in SRAM; reduces memory bandwidth usage | Same math as standard attention but 2–4× faster · Supports longer sequences · FA2/FA3 improve further |
| Tensor Parallelism | Inference | Splits weight matrices across GPUs | Each GPU computes a slice of each layer in parallel | Reduces latency per request · All-reduce communication overhead at each layer · Best for latency-critical serving |
| Speculative Decoding | Inference | 2–3× decode speedup | Small draft model proposes tokens; large model verifies multiple at once | Decode phase is memory-bandwidth bound — speculative decoding adds compute to speed it up |
| NVIDIA NeMo | NVIDIA Tools | Training + fine-tuning FW | End-to-end LLM training, SFT, LoRA/QLoRA, RLHF, DPO — multi-GPU | Use for custom model training/fine-tuning · Exports to TensorRT-LLM · Supports Megatron parallelism |
| NVIDIA TensorRT-LLM | NVIDIA Tools | 2–5× latency reduction | Quantization, kernel fusion, batching, paged KV cache, parallelism | Open-source · Inference-only (not training) · Foundation of NIM |
| NVIDIA Triton Server | NVIDIA Tools | Multi-framework serving | Unified serving for TensorRT-LLM, PyTorch, ONNX, TensorFlow backends | Dynamic batching · Model ensembles · Prometheus metrics · Production serving layer |
| NVIDIA NIM | NVIDIA Tools | One-command deployment | Pre-optimized containers with OpenAI-compatible API; auto-selects best TRT-LLM engine | Enterprise support + security · Abstracts TensorRT-LLM complexity · docker run to deploy |
Real Examples
Concrete scenarios showing how fine-tuning and inference optimization decisions play out in practice
Practice Quiz
10 NCA-GENL style questions across all four pillars — instant explanations after each answer
Optimization Advisor
Describe your goal and get a targeted recommendation for fine-tuning, quantization, or inference optimization
Memory Hooks
Click any card to flip — 8 high-yield fine-tuning and inference optimization mnemonics