
LLM Fine-Tuning & Inference Optimization

Master LoRA, QLoRA, quantization, KV caching, continuous batching, and the NVIDIA TensorRT-LLM stack — high-weight topics on the NCA-GENL certification.


The Four Optimization Pillars

Adapting, compressing, accelerating, and deploying LLMs — the full NCA-GENL optimization domain

Fine-tuning Methods

Adapting Models to Your Task

Fine-tuning adapts a pre-trained foundation model to a specific domain or task. Full fine-tuning updates all parameters. Parameter-efficient methods (PEFT) like LoRA and QLoRA update only a small fraction — enabling customization on a single GPU without sacrificing much performance.

LoRA · PEFT Leader · ~1% Params Trained
Quantization & Compression

Shrinking Models Without Losing Quality

Quantization reduces the numerical precision of model weights from FP32 → FP16 → INT8 → INT4, cutting memory use and increasing throughput. Post-Training Quantization (PTQ) requires no retraining. Quantization-Aware Training (QAT) recovers accuracy after aggressive compression.

FP32 → INT8 · FP32 → INT4
Inference Optimization

Making Generation Faster & Cheaper

Inference optimization reduces latency and increases throughput without changing model accuracy. The KV cache eliminates redundant computation. Continuous batching maximizes GPU utilization. Flash Attention reduces memory bandwidth. Speculative decoding parallelizes token generation.

KV Cache · 5–10× Batch Speedup
NVIDIA Tools & Deployment

Production-Grade LLM Infrastructure

NVIDIA's tool stack covers the full pipeline: NeMo for training and fine-tuning, TensorRT-LLM for inference optimization, Triton Inference Server for model serving, and NIM for packaged one-click deployment. Together they deliver optimized LLMs from research to production.

NeMo (Train/FT) · NIM (Deploy)

Numerical Precision Reference — Memory Impact per Weight

FP32
32 bits · 4 bytes
Baseline memory (1×)
Training baseline; highest precision
BF16
16 bits · 2 bytes
2× reduction vs FP32
Standard training on A100/H100
FP16
16 bits · 2 bytes
2× reduction vs FP32
Mixed-precision training; inference
FP8
8 bits · 1 byte
4× reduction vs FP32
H100 native; fine-grained inference
INT8
8 bits · 1 byte
4× reduction vs FP32
PTQ inference; TensorRT-LLM default
NF4
4 bits · 0.5 bytes
8× reduction vs FP32
QLoRA base model quantization
💡 Rule of thumb for GPU memory: A model with N billion parameters in FP16 requires approximately N × 2 GB of GPU memory for weights alone. A 7B model ≈ 14 GB. A 70B model ≈ 140 GB. Quantizing to INT8 halves this; INT4/NF4 quarters it — making large models accessible on consumer or mid-range GPUs.
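
To sanity-check the rule of thumb, the arithmetic fits in a few lines of plain Python; the model sizes below are illustrative.

```python
# Hypothetical weights-only memory estimator for the rule of thumb above.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16/BF16": 2.0, "INT8": 1.0, "INT4/NF4": 0.5}

def weight_memory_gb(num_params_billion: float, fmt: str) -> float:
    """Weights-only memory in GB (ignores KV cache, activations, optimizer states)."""
    return num_params_billion * 1e9 * BYTES_PER_PARAM[fmt] / 1e9

for size in (7, 13, 70):
    print(f"{size}B:", {fmt: f"{weight_memory_gb(size, fmt):.1f} GB" for fmt in BYTES_PER_PARAM})
# 7B -> 14 GB in FP16, 7 GB in INT8, 3.5 GB in INT4/NF4
```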

How It Works

Deep-dive mechanics for fine-tuning methods, quantization strategies, inference optimizations, and the NVIDIA deployment stack

Fine-tuning Methods

From Full Fine-tuning to LoRA, QLoRA & DPO

PEFT Spectrum — most to least compute-intensive

Full Fine-tuning
All parameters updated. Maximum accuracy, maximum cost.
100% params
Supervised FT (SFT)
All params updated on instruction-response data. Turns base → chat.
100% params
Adapter Tuning
Small bottleneck layers inserted between Transformer blocks. Only adapters trained.
~3–5% params
LoRA
Low-rank matrices injected into attention layers. Base model frozen.
~0.1–1% params
QLoRA
LoRA + 4-bit base model quantization. Single GPU fine-tuning of 70B models.
~0.1–1% params
Prefix / P-tuning
Trainable soft prompt tokens prepended to input. No model weight changes.
<0.1% params
LoRA Weight Update
W' = W₀ + ΔW = W₀ + (α / r) · B × A
W₀ = frozen pre-trained weight · B ∈ ℝ^(d×r) · A ∈ ℝ^(r×d) · r = rank (4–64) · α = scaling factor · Only A and B are trained
1. Why LoRA Works: Low-Rank Hypothesis

Research shows that the weight updates needed to fine-tune a large model for a specific task have a low intrinsic rank — they live in a low-dimensional subspace of the full weight matrix. LoRA exploits this: instead of updating the full d×d weight matrix (where d can be 4096+), it learns a rank-r approximation via two small matrices A and B. At rank r=8, a 4096×4096 weight matrix needs 4096×4096=16.7M updates vs. 2×4096×8=65K LoRA updates — 256× fewer.
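
As a concrete illustration of ΔW = B × A, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer; the layer size and rank are illustrative, and this is not the reference implementation of any particular library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W0 x + (alpha / r) * B(A(x)). W0 stays frozen."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                            # freeze pre-trained weights
        self.A = nn.Linear(base.in_features, r, bias=False)    # maps d_in -> r
        self.B = nn.Linear(r, base.out_features, bias=False)   # maps r -> d_out
        nn.init.zeros_(self.B.weight)                          # start with delta-W = 0
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.B(self.A(x))

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 4096 * 8 = 65_536 trainable params vs ~16.7M in the full matrix
```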

2. QLoRA: 4-bit Base Model + BF16 Adapters

QLoRA (Quantized LoRA) combines three innovations: (1) NF4 quantization of the frozen base model — weights stored in 4-bit NormalFloat format, which is information-theoretically optimal for normally distributed (Gaussian) weights. (2) Double quantization — the quantization constants themselves are also quantized, saving ~0.4 bits per parameter. (3) Paged optimizers — NVIDIA unified memory pages optimizer states to CPU RAM during memory spikes (such as those caused by gradient checkpointing), preventing OOM errors. Net result: fine-tune a 65B model on a single 48 GB GPU.
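
One common open-source route to this setup is the Hugging Face transformers + peft + bitsandbytes stack; the sketch below is illustrative, and the checkpoint name, rank, and target modules are placeholders for your own choices.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# NF4 base-model quantization with double quantization; compute runs in BF16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",            # illustrative checkpoint name
    quantization_config=bnb_config,
    device_map="auto",
)

# BF16 LoRA adapters on the attention projections; the 4-bit base stays frozen.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()          # ~0.1-1% of total params
```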

3. DPO: Direct Preference Optimization

RLHF requires training a separate Reward Model and running a complex PPO reinforcement learning loop — expensive and unstable. DPO simplifies this by reformulating the preference optimization problem as a supervised classification task directly on the policy model. Given a (prompt, chosen response, rejected response) triple, DPO increases the likelihood of the chosen response and decreases the rejected one — no reward model needed. Simpler, more stable, and competitive with RLHF. DPO is now widely used as a practical RLHF alternative.
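
The DPO objective itself fits in a few lines; the sketch below assumes the per-token log-probabilities have already been summed into one value per sequence for both the policy and the frozen reference model, and the beta value is illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """DPO loss on summed per-sequence log-probs (one scalar per example).

    Widens the margin by which the policy prefers the chosen response over the
    rejected one, measured relative to the frozen reference model.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with pre-computed sequence log-probabilities:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```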

Quantization & Compression

PTQ, QAT, GPTQ, AWQ & Knowledge Distillation

1. Post-Training Quantization (PTQ)

PTQ quantizes a fully-trained model without any further training passes. The process: determine the range of weight and activation values using a small calibration dataset, then map those values to the lower-precision format (e.g., INT8). Fast and requires no GPU cluster. Works well for INT8 (minimal accuracy loss). For INT4, accuracy can degrade for complex tasks — QAT or advanced algorithms like GPTQ/AWQ are preferred.
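
A toy PyTorch sketch of the weight side of PTQ (symmetric, per-tensor, INT8); real toolchains also calibrate activation ranges on a small dataset and typically use per-channel scales.

```python
import torch

def ptq_int8(w: torch.Tensor):
    """Toy symmetric per-tensor PTQ for one weight matrix: map FP32 -> INT8."""
    scale = w.abs().max() / 127.0                           # observed range -> scale factor
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

w = torch.randn(4096, 4096)                                 # stand-in for a layer weight
q, scale = ptq_int8(w)
w_hat = q.float() * scale                                   # dequantized view of the weights
print(f"mean abs error: {(w - w_hat).abs().mean():.5f}")    # small rounding error
```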

2. Quantization-Aware Training (QAT)

QAT simulates quantization noise during a training or fine-tuning pass, allowing the model weights to adjust to compensate for the precision loss. Fake-quantization nodes apply quantize-dequantize rounding in the forward pass, while gradients flow through the rounding unchanged (straight-through estimator). The result is a model whose weights are already "calibrated" for the target precision — significantly better accuracy than PTQ at INT4/INT8 for difficult tasks. More compute-intensive than PTQ.
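
The core trick, fake quantization with a straight-through estimator, can be sketched with a custom autograd function; this is a simplified illustration, not a production QAT recipe.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Simulate INT8 rounding in the forward pass, pass gradients straight through."""
    @staticmethod
    def forward(ctx, w, scale):
        return torch.clamp(torch.round(w / scale), -127, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None              # straight-through estimator

w = torch.randn(512, 512, requires_grad=True)
scale = w.detach().abs().max() / 127.0
w_q = FakeQuantSTE.apply(w, scale)            # used in the layer's forward pass
loss = (w_q ** 2).mean()                      # stand-in for the real training loss
loss.backward()                               # gradients still reach the FP weights
```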

3. GPTQ & AWQ — Advanced Weight-Only Quantization

GPTQ (Generative Pre-trained Transformer Quantization) uses second-order (Hessian) information to choose INT4 weight values that minimize reconstruction error, layer by layer. AWQ (Activation-aware Weight Quantization) identifies the ~1% of "salient" weight channels (those multiplied by the largest activations) and rescales them before quantization so they lose the least precision, while the remaining weights are quantized aggressively. Both achieve near-FP16 quality at INT4 weight precision — making 70B models runnable on 2× 24GB consumer GPUs.
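
A toy sketch of the AWQ intuition in PyTorch: derive per-input-channel scales from calibration activations, fold them into the weights before 4-bit rounding, and undo them on the activation side. The scaling exponent and ranges are illustrative, not the published algorithm.

```python
import torch

def awq_style_scale(w: torch.Tensor, act: torch.Tensor, alpha: float = 0.5):
    """Toy activation-aware scaling before 4-bit rounding.

    w:   (out_features, in_features) weight matrix
    act: (tokens, in_features) calibration activations
    """
    s = act.abs().mean(dim=0).clamp(min=1e-5) ** alpha   # per-input-channel scale
    w_scaled = w * s                                      # fold s into the weights
    scale = w_scaled.abs().max() / 7.0                    # symmetric 4-bit range
    q = torch.clamp(torch.round(w_scaled / scale), -8, 7)
    # At inference the activations are divided by s, so the product is preserved:
    # (W * s) @ (x / s) == W @ x
    return q, scale, s

q, scale, s = awq_style_scale(torch.randn(4096, 4096), torch.randn(512, 4096))
```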

4. Knowledge Distillation

Distillation trains a smaller "student" model to mimic the output distribution of a larger "teacher" model. Instead of training only on hard labels (one-hot targets), the student learns from the teacher's soft probability outputs — which carry richer information about the relationships between classes. Example: distilling a 70B teacher into a 7B student that achieves 80–90% of the teacher's performance at roughly 10% of the inference cost. The same idea produced DistilBERT (40% smaller than BERT, 97% of its performance).
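
The standard distillation loss blends hard-label cross-entropy with a temperature-softened KL term toward the teacher; a minimal PyTorch sketch follows, with temperature and mixing weight chosen for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend of hard-label cross-entropy and soft-label KL to the teacher."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)                    # standard T^2 gradient rescaling
    return alpha * hard + (1 - alpha) * soft

# Toy usage: 4 examples, vocabulary of 10 "classes"
loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10),
                         torch.tensor([1, 3, 0, 7]))
```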

Inference Optimization

KV Cache, Continuous Batching, Flash Attention & Parallelism

KV Cache Memory Growth
KV cache size = 2 × num_layers × num_heads × head_dim × seq_len × batch_size × bytes_per_element
Grows linearly with sequence length and batch size · A 7B model at seq_len=4096, batch=32, FP16 needs ≈ 68 GB of KV cache alone · Paged KV cache (vLLM) solves memory fragmentation
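
Plugging numbers into the formula shows how quickly the cache grows; the sketch below assumes Llama-2-7B-like shapes (32 layers, 32 heads, head_dim 128) in FP16.

```python
def kv_cache_gb(layers=32, heads=32, head_dim=128, seq_len=4096,
                batch_size=32, bytes_per_elem=2):
    """2 (K and V) x layers x heads x head_dim x seq_len x batch x bytes, in GB."""
    return 2 * layers * heads * head_dim * seq_len * batch_size * bytes_per_elem / 1e9

print(f"{kv_cache_gb():.1f} GB")                              # ~68.7 GB at seq 4096, batch 32
print(f"{kv_cache_gb(seq_len=2000, batch_size=100):.1f} GB")  # ~104.9 GB for the 100-user example below
```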
1. KV Cache — Eliminating Redundant Attention Computation

During autoregressive generation, each new token requires attention over all previous tokens. Without a cache, generating token t requires recomputing K and V for tokens 1 through t-1 from scratch — O(t²) total compute. The KV cache stores K and V tensors for all past tokens, so generating token t only requires computing K,V for token t itself and retrieving the rest from cache. This reduces generation from O(t²) to O(t) compute — a massive speedup, especially for long outputs.
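
A toy single-head decode step shows the mechanic: only the new token's K and V are computed, everything else is read back from the cache. Shapes and projection matrices are illustrative.

```python
import torch
import torch.nn.functional as F

def decode_step(x_new, w_q, w_k, w_v, kv_cache):
    """Attention for ONE new token (B, 1, d), reusing cached K/V of past tokens."""
    q = x_new @ w_q                                   # query for the new token only
    k_new, v_new = x_new @ w_k, x_new @ w_v
    k = torch.cat([kv_cache["k"], k_new], dim=1)      # append instead of recompute
    v = torch.cat([kv_cache["v"], v_new], dim=1)
    kv_cache["k"], kv_cache["v"] = k, v
    attn = F.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v                                   # (B, 1, d) context vector

d = 64
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
cache = {"k": torch.zeros(1, 0, d), "v": torch.zeros(1, 0, d)}
for t in range(5):                                    # generate 5 tokens
    out = decode_step(torch.randn(1, 1, d), w_q, w_k, w_v, cache)
print(cache["k"].shape)                               # torch.Size([1, 5, 64]) - grows per token
```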

2. Paged KV Cache (vLLM / TensorRT-LLM)

Traditional KV caches pre-allocate a contiguous block of GPU memory per sequence based on the maximum sequence length — wasting memory for short sequences and causing fragmentation. Paged KV cache (inspired by OS virtual memory paging) allocates KV cache in fixed-size "pages" and manages them dynamically. This allows higher batch sizes (more concurrent users), eliminates memory fragmentation, and enables memory sharing across requests with common prefixes (prefix caching). TensorRT-LLM implements paged KV cache natively.
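
A toy allocator illustrates the bookkeeping (it is not how TensorRT-LLM or vLLM are actually implemented): pages come from a shared free pool, are handed out only when a sequence crosses a page boundary, and return to the pool when the request finishes.

```python
PAGE_SIZE = 16                                  # tokens per page (vLLM-style block)

class PagedKVAllocator:
    def __init__(self, total_pages: int):
        self.free_pages = list(range(total_pages))
        self.page_table = {}                    # request_id -> list of physical page ids

    def append_token(self, request_id: str, token_index: int):
        """Grab a new page only when the sequence crosses a page boundary."""
        pages = self.page_table.setdefault(request_id, [])
        if token_index % PAGE_SIZE == 0:        # current page full (or first token)
            pages.append(self.free_pages.pop())
        return pages[-1], token_index % PAGE_SIZE   # (physical page, slot within page)

    def release(self, request_id: str):
        """Return all pages to the free pool when a request finishes."""
        self.free_pages.extend(self.page_table.pop(request_id, []))

alloc = PagedKVAllocator(total_pages=1024)
for t in range(40):                             # a 40-token sequence uses ceil(40/16) = 3 pages
    alloc.append_token("req-1", t)
print(len(alloc.page_table["req-1"]))           # 3
alloc.release("req-1")
```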

3. Continuous (Iteration-Level) Batching

Static batching processes a fixed group of requests together and waits for all to finish before accepting new ones. If one request completes early, its GPU slot idles. Continuous batching inserts new requests into available slots at each generation iteration — keeping GPU utilization near 100% regardless of variable response lengths. For typical LLM serving workloads with mixed short and long responses, continuous batching improves throughput by 5–10× compared to static batching.
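
A toy scheduler loop makes the difference concrete; the slot count, response lengths, and request names are made up, and each loop iteration stands in for one decode step.

```python
import random
from collections import deque

MAX_SLOTS = 4
waiting = deque(f"req-{i}" for i in range(10))   # queued requests
active = {}                                      # request_id -> tokens still to generate

step = 0
while waiting or active:
    # Fill any free slot immediately (the "continuous" part; static batching
    # would wait for the whole batch to finish before admitting new requests).
    while waiting and len(active) < MAX_SLOTS:
        active[waiting.popleft()] = random.randint(3, 12)   # variable response length
    # One generation iteration across every active request.
    for rid in list(active):
        active[rid] -= 1
        if active[rid] == 0:                     # finished -> slot frees up this step
            del active[rid]
    step += 1

print(f"served 10 requests in {step} iterations with {MAX_SLOTS} slots")
```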

4. Flash Attention — Tiled Memory-Efficient Attention

Standard attention computes a full n×n attention score matrix in GPU High Bandwidth Memory (HBM) — O(n²) memory. For long sequences, this becomes the bottleneck. Flash Attention reorders the attention computation into tiles that fit in the GPU's fast SRAM (on-chip cache), dramatically reducing HBM reads and writes. The mathematical result is identical, but it requires far fewer memory operations — 2–4× faster and uses O(n) memory instead of O(n²). Flash Attention 2 and 3 further improve parallelism on modern GPU architectures.
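
In PyTorch, the fused scaled_dot_product_attention call can dispatch to a FlashAttention kernel on supported GPUs, so you get the tiled algorithm without writing custom kernels; the shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# (batch, heads, seq_len, head_dim) - illustrative shapes
q, k, v = (torch.randn(1, 32, 4096, 128, device=device, dtype=dtype) for _ in range(3))

# Fused attention: PyTorch selects an efficient backend (a FlashAttention kernel
# on supported GPUs) instead of materializing the full 4096 x 4096 score matrix in HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # torch.Size([1, 32, 4096, 128]) - same result as standard attention
```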

5. Tensor & Pipeline Parallelism

Tensor parallelism splits individual weight matrices across multiple GPUs horizontally — each GPU computes a slice of each layer's output in parallel. Requires all-reduce communication at each layer boundary. Best for reducing latency within a single request. Pipeline parallelism assigns consecutive Transformer layers to different GPUs — GPU 0 runs layers 1–12, GPU 1 runs layers 13–24, etc. More communication-efficient for large models but requires micro-batching to keep all GPUs busy. Most production deployments combine both (3D parallelism: tensor + pipeline + data).
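
A single-device toy of the column-split idea: each shard owns a slice of the weight matrix and computes a slice of the output, which is then gathered. In a real Megatron-style setup the shards live on different GPUs and paired column/row splits use an all-reduce instead of this simple concatenation.

```python
import torch

d_in, d_out, num_shards = 1024, 4096, 4
w = torch.randn(d_in, d_out)
shards = torch.chunk(w, num_shards, dim=1)           # one (d_in, d_out/4) slice per "GPU"

x = torch.randn(8, d_in)
partial_outputs = [x @ shard for shard in shards]    # each device computes its slice in parallel
y_parallel = torch.cat(partial_outputs, dim=1)       # gather of the output slices

assert torch.allclose(y_parallel, x @ w, atol=1e-4)  # same result as the unsplit layer
```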

ℹ️ Prefill vs. Decode phases: LLM inference has two distinct phases. Prefill (prompt processing): the entire input prompt is processed in parallel — compute-bound, fast. Decode (generation): tokens are generated one at a time, autoregressively — memory-bandwidth-bound, slower. Most latency optimization targets the decode phase. Speculative decoding uses a small draft model to propose multiple tokens, then verifies them with the large model in parallel — speeding up the decode phase without changing output quality.
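
A simplified, greedy-match sketch of a speculative decoding step (the published algorithm uses rejection sampling over probabilities); draft_model and target_model stand for any callables returning (batch, seq, vocab) logits.

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, input_ids, k: int = 4):
    """One simplified speculative decoding step.

    The small draft model proposes k tokens one at a time; the large target model
    scores all of them in a single forward pass, and we keep the longest prefix
    where both models agree.
    """
    draft_ids = input_ids
    for _ in range(k):                                        # cheap, sequential proposals
        logits = draft_model(draft_ids)                       # (1, seq, vocab)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=1)

    target_logits = target_model(draft_ids)                   # one expensive parallel pass
    target_pred = target_logits[:, input_ids.shape[1] - 1:-1].argmax(dim=-1)
    proposed = draft_ids[:, input_ids.shape[1]:]

    accepted = 0
    while accepted < k and proposed[0, accepted] == target_pred[0, accepted]:
        accepted += 1                                         # models agree -> keep the token
    return torch.cat([input_ids, proposed[:, :accepted]], dim=1), accepted
```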
NVIDIA Tools & Deployment

NeMo, TensorRT-LLM, Triton & NIM

1. NVIDIA NeMo — Training & Fine-tuning Framework

NeMo is an end-to-end framework for building and customizing LLMs. Fine-tuning capabilities: full SFT, LoRA/QLoRA (rank and target modules configurable), P-Tuning, Adapters, and DPO/RLHF pipelines (reward-model training + PPO). Distributed training via Megatron-LM — tensor, pipeline, and data parallelism across hundreds of GPUs. NeMo exports models in formats ready for TensorRT-LLM optimization. Use NeMo when you need to create a custom domain-specific model or align a foundation model to specific behavior.

2. NVIDIA TensorRT-LLM — Inference Engine

TensorRT-LLM compiles and optimizes LLMs for maximum inference performance on NVIDIA GPUs. Optimization techniques applied: (1) Quantization: INT8/INT4/FP8 weight and activation quantization. (2) Kernel fusion: multiple sequential GPU operations merged into single optimized kernels to reduce memory-bandwidth overhead. (3) In-flight (continuous) batching with a paged KV cache. (4) Tensor parallelism across multiple GPUs. (5) Optimized attention kernels (including Flash Attention variants). Net result: 2–5× lower latency and 5–10× higher throughput vs. a PyTorch baseline on the same hardware.

3. NVIDIA Triton Inference Server

Triton is a production-grade model serving framework that supports multiple backends (TensorRT, TensorRT-LLM, PyTorch, ONNX, TensorFlow) via a single unified API. Features: dynamic batching, model ensembles (chaining multiple models), concurrent model execution, GPU/CPU/custom backend support, Prometheus metrics integration, and health check endpoints. Triton is the serving layer that TensorRT-LLM and NIM are built on top of — it handles request routing, queuing, and scaling.

4. NVIDIA NIM — One-Command Enterprise Deployment

NIM (NVIDIA Inference Microservices) packages pre-optimized TensorRT-LLM engines with Triton serving, health checks, and an OpenAI-compatible REST API into a single Docker container. Deployment is a single docker run command — NIM auto-selects the best TensorRT-LLM engine for the detected GPU hardware. Enterprise features: security scanning, support SLAs, regular model updates. NIM is designed for organizations that need production LLM APIs on their own infrastructure without the complexity of manual TensorRT-LLM optimization.
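
Because the endpoint is OpenAI-compatible, any OpenAI-style client can talk to a NIM; in the hedged sketch below, the URL, port, and model name are placeholders for whatever your container actually serves.

```python
from openai import OpenAI

# Point the standard OpenAI client at the locally running NIM container
# (base_url, port, and model name are placeholders - check your deployment).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",          # whatever model the NIM serves
    messages=[{"role": "user", "content": "Summarize LoRA in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```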

🟢 NVIDIA Tool Selection Guide: Building/fine-tuning a custom model → NeMo. Optimizing an existing model for faster inference → TensorRT-LLM. Serving multiple models with complex routing → Triton. Deploying a standard LLM as an API with minimal setup → NIM. For production, the typical stack is: NeMo (training) → TensorRT-LLM (optimization) → Triton/NIM (serving).

Compare

Filter by pillar to compare fine-tuning methods, quantization formats, inference techniques, and NVIDIA tools

Technique | Category | Key Metric | What It Does | NCA-GENL Exam Point
Full Fine-tuning | Fine-tuning | 100% params updated | Updates every weight — best accuracy, highest cost | Requires storing full model gradients + optimizer states (~3–4× model size in GPU memory)
LoRA | Fine-tuning | Rank r=4–64; ~0.1–1% params | Trains low-rank matrices injected into attention layers; base model frozen | ΔW = B×A · No inference overhead (can merge) · 100–10,000× fewer params than full FT
QLoRA | Fine-tuning | NF4 base + BF16 adapters | LoRA + 4-bit base model quantization for single-GPU 70B fine-tuning | NF4 = NormalFloat4, optimal for Gaussian-distributed weights · Double quantization adds extra savings
DPO | Fine-tuning | No reward model needed | Directly optimizes policy on (prompt, chosen, rejected) triples | Simpler and more stable than RLHF/PPO · Competitive alignment results · Growing adoption
Adapter Tuning | Fine-tuning | ~3–5% params | Small bottleneck layers inserted between Transformer blocks; only adapters trained | Older PEFT approach; LoRA generally preferred for LLMs due to lower overhead
Prefix / P-Tuning | Fine-tuning | <0.1% params | Trains soft prompt embeddings prepended to input; no model weights changed | Very low compute; works for some tasks but less effective than LoRA for complex domain adaptation
FP32 → FP16 / BF16 | Quantization | 2× memory reduction | Half-precision floating point; standard for training and inference | BF16 preferred for training (wider dynamic range); FP16 used for inference on older GPUs
INT8 (PTQ) | Quantization | 4× memory reduction vs FP32 | Post-training quantization to 8-bit integer; minimal accuracy loss | TensorRT-LLM default for many deployments · LLM.int8() from bitsandbytes is a common library
INT4 / NF4 (GPTQ / AWQ) | Quantization | 8× memory reduction vs FP32 | 4-bit weight quantization; advanced algorithms preserve quality | GPTQ uses Hessian-based optimal quantization · AWQ protects salient weights · Both near-FP16 quality
FP8 (H100 native) | Quantization | 4× memory reduction vs FP32 | 8-bit floating point with hardware support on H100 | Better accuracy/range trade-off than INT8 · Supported natively by H100 Transformer Engine · TensorRT-LLM default on H100
Knowledge Distillation | Quantization | Smaller student model | Trains student to mimic teacher's soft output probabilities | Not quantization per se, but a compression technique · DistilBERT = 40% smaller, 97% BERT performance
Pruning | Quantization | Remove low-magnitude weights | Zeros out (unstructured) or removes (structured) weights with small absolute values | Structured pruning removes entire heads/layers — hardware speedup. Unstructured needs sparse compute support.
KV Cache | Inference | O(t²) → O(t) compute | Stores past K,V tensors to avoid recomputing attention over previous tokens | Essential for all production LLM inference · Grows with seq_len · Major source of GPU memory consumption
Paged KV Cache | Inference | Eliminates memory fragmentation | KV cache in fixed-size pages; dynamically allocated like OS virtual memory | Higher batch sizes · Enables prefix caching · Used by vLLM and TensorRT-LLM
Continuous Batching | Inference | 5–10× throughput improvement | Inserts new requests as slots free up; keeps GPU near 100% utilization | vs. static batching where finished requests leave idle GPU slots · Default in TensorRT-LLM
Flash Attention | Inference | O(n²) → O(n) HBM memory | Tiled attention computation in SRAM; reduces memory bandwidth usage | Same math as standard attention but 2–4× faster · Supports longer sequences · FA2/FA3 improve further
Tensor Parallelism | Inference | Splits weight matrices across GPUs | Each GPU computes a slice of each layer in parallel | Reduces latency per request · All-reduce communication overhead at each layer · Best for latency-critical serving
Speculative Decoding | Inference | 2–3× decode speedup | Small draft model proposes tokens; large model verifies multiple at once | Decode phase is memory-bandwidth bound — speculative decoding adds compute to speed it up
NVIDIA NeMo | NVIDIA Tools | Training + fine-tuning FW | End-to-end LLM training, SFT, LoRA/QLoRA, RLHF, DPO — multi-GPU | Use for custom model training/fine-tuning · Exports to TensorRT-LLM · Supports Megatron parallelism
NVIDIA TensorRT-LLM | NVIDIA Tools | 2–5× latency reduction | Quantization, kernel fusion, batching, paged KV cache, parallelism | Open-source · Inference-only (not training) · Foundation of NIM
NVIDIA Triton Server | NVIDIA Tools | Multi-framework serving | Unified serving for TensorRT-LLM, PyTorch, ONNX, TensorFlow backends | Dynamic batching · Model ensembles · Prometheus metrics · Production serving layer
NVIDIA NIM | NVIDIA Tools | One-command deployment | Pre-optimized containers with OpenAI-compatible API; auto-selects best TRT-LLM engine | Enterprise support + security · Abstracts TensorRT-LLM complexity · docker run to deploy

Real Examples

Concrete scenarios showing how fine-tuning and inference optimization decisions play out in practice

Fine-tuning Methods

Fine-tuning a 13B Model on a Single GPU Using QLoRA

Your team has a single NVIDIA A100 80GB GPU. You need to fine-tune a 13B-parameter LLM on 50,000 customer support conversations to improve domain-specific response quality. Full fine-tuning would require ~200GB of GPU memory for weights, gradients, and optimizer states. What do you do?
  • A 13B model in FP16 requires ~26 GB for weights. Full fine-tuning adds gradients (~26 GB), FP32 master weights (~52 GB), and Adam moment states (~104 GB) — roughly 200 GB total, far exceeding the 80 GB A100.
  • QLoRA solution: Load the 13B base model quantized to NF4 (4-bit) — ~6.5 GB. The base model gradients are never computed (frozen), so no gradient memory for the base.
  • Add LoRA adapters (rank=16) to all attention layers in BF16 — approximately 8–15M trainable parameters, ~30–60 MB.
  • Use gradient checkpointing and paged AdamW optimizer to handle peak memory spikes. Total GPU memory: ~12–16 GB — comfortably fits on one A100.
  • Fine-tune for 3–5 epochs on the 50K examples. Merge the LoRA adapters into the base model weights at the end for zero-overhead inference.
✅ Key point: QLoRA turns a multi-GPU fine-tuning job into a single-GPU job by quantizing the frozen base model to 4-bit. The LoRA adapters remain in full BF16 precision for training quality.
Quantization & Compression

Choosing Between PTQ and AWQ for a 70B Production Model

You're deploying a 70B model for a financial analysis chatbot. The model must run on 4× A100 80GB GPUs. In FP16, the 70B model requires ~140 GB — just barely fitting across 4 GPUs with little room for KV cache or batch size. You need to quantize. Which method do you choose?
  • Option A — INT8 PTQ: Reduces 70B to ~70 GB (the 4 GPUs easily accommodate it). Fast to apply, and INT8 generally preserves accuracy well, even on reasoning-heavy tasks like financial analysis. A viable choice.
  • Option B — AWQ INT4: Reduces 70B to ~35 GB — frees up significant GPU memory for larger KV cache and higher batch sizes. AWQ's activation-aware protection of salient weights keeps quality near FP16 level.
  • The financial chatbot handles complex numerical reasoning — INT4 without salient-weight protection (e.g., naive round-to-nearest 4-bit quantization) might degrade on multi-step calculations. AWQ's salient-weight protection makes INT4 much safer here.
  • Decision: AWQ INT4. The ~35 GB footprint (vs. ~70 GB at INT8 or ~140 GB at FP16) leaves far more room for KV cache across the 4 GPUs, enabling roughly 2× larger batches — doubling throughput at the same cost.
  • Validation step: always benchmark accuracy on a held-out financial Q&A set at each quantization level before deploying. A 1–2% accuracy drop may be acceptable; 5%+ requires reconsideration.
✅ Key point: AWQ INT4 is preferred over naive INT4 for accuracy-sensitive tasks. The memory savings from INT4 vs INT8 directly translate to larger KV cache and higher throughput — not just smaller model footprint.
Inference Optimization

Why KV Cache Size Becomes the Bottleneck at Scale

You deploy a 7B model on an A100 GPU with 80 GB memory. The model weights in FP16 use ~14 GB. You expect to support 100 concurrent users with average 2,000 token conversations. You calculate that weights + batch KV cache = ~60 GB. But under load, you're getting OOM errors. Why?
  • The KV cache memory formula: 2 (K+V) × num_layers × num_heads × head_dim × seq_len × batch_size × bytes_per_element.
  • For a 7B model (32 layers, 32 heads, head_dim 128, FP16): 2 × 32 × 32 × 128 × 2000 tokens × 100 users × 2 bytes ≈ 105 GB — far more than the ~66 GB available after weights.
  • The problem: static KV cache allocation pre-allocates max_seq_len × batch_size memory at startup — memory fragmentation means many allocations fail even before actually reaching capacity.
  • Solution: switch to paged KV cache (as implemented in TensorRT-LLM and vLLM). Instead of one contiguous block per sequence, allocate 16-token pages dynamically. Memory fragmentation is eliminated.
  • With paged KV cache, the same 66 GB can serve more concurrent requests because unused page slots are recycled between requests rather than reserved per-sequence.
✅ Key point: KV cache is often the actual memory bottleneck, not model weights. At 100 concurrent users with long sequences, KV cache can dwarf model weight memory. Paged KV cache solves fragmentation.
NVIDIA Tools & Deployment

The NeMo → TensorRT-LLM → NIM Pipeline in Practice

A biotech company needs a custom LLM fine-tuned on internal research papers (domain adaptation) and deployed as an internal API. Walk through the NVIDIA tool stack from fine-tuning to production.
  • Step 1 — Fine-tuning with NeMo: Use NeMo's LoRA fine-tuning pipeline. Load the base Llama 3 8B model, configure rank=16 LoRA adapters on all attention layers, train on 200K biotech paper excerpts for 3 epochs across 4× A100 GPUs using tensor parallelism. NeMo handles distributed training automatically.
  • Step 2 — Merge and export: Merge LoRA adapters into base model weights. Export the merged model from NeMo in a TensorRT-LLM-compatible checkpoint format.
  • Step 3 — TensorRT-LLM optimization: Run TensorRT-LLM's build pipeline: quantize weights to FP8 (H100) or INT8 (A100), fuse kernels, enable paged KV cache and continuous batching. This produces an optimized TRT engine file.
  • Step 4 — Deploy via NIM (or Triton): Package the TRT engine in a NIM container or directly serve via Triton Inference Server with an OpenAI-compatible /v1/chat/completions endpoint. Internal teams access the model exactly like they would the OpenAI API.
  • Result: a domain-specific 8B model serving 200 researchers with 3× lower latency than the unoptimized baseline, running entirely on-premises with no data leaving the company network.
✅ Key point: NeMo (fine-tune) → TensorRT-LLM (optimize) → NIM/Triton (serve) is the standard NVIDIA production pipeline. Each tool has a distinct role — don't conflate them on the exam.

Practice Quiz

10 NCA-GENL style questions across all four pillars — instant explanations after each answer



Memory Hooks

8 high-yield fine-tuning and inference optimization mnemonics

🔧
LoRA: what's frozen and what's trained?
Frozen: base model weights. Trained: low-rank matrices A and B.
ΔW = B×A · rank r = 4–64 · ~0.1–1% of total params trained · Can merge at inference for zero overhead
QLoRA vs LoRA — one key difference?
QLoRA quantizes the frozen base model to 4-bit NF4
LoRA keeps base in FP16 (~14 GB for 7B). QLoRA keeps base in NF4 (~3.5 GB for 7B). Makes 70B fine-tuning possible on a single 48GB GPU.
📐
FP32 → INT8: how much memory reduction?
4× reduction (32 bits → 8 bits)
FP32=4 bytes · FP16/BF16=2 bytes (2×) · INT8=1 byte (4×) · NF4/INT4=0.5 bytes (8×). A 70B model in FP16=140GB; in INT4≈35GB.
🧠
What does the KV cache store — and why?
Past attention Key & Value tensors — avoids recomputing them
Without KV cache, generating token t requires O(t²) compute. With cache, O(t). Essential for any production LLM inference. Grows with sequence length × batch size.
🚀
Continuous batching vs static batching — key win?
Inserts new requests as slots free — near-100% GPU utilization
Static: wait for full batch to finish → idle GPU slots. Continuous: fill slots immediately as requests complete. 5–10× throughput improvement for typical mixed-length workloads.
Flash Attention: same math, what's different?
Tiled computation in SRAM — reduces HBM reads/writes
Standard attention materializes full n×n score matrix in HBM (O(n²) memory). Flash Attention tiles it through fast SRAM: same output, 2–4× faster, O(n) memory usage.
🔀
Tensor parallelism vs pipeline parallelism?
Tensor: split weight matrices across GPUs. Pipeline: split layers across GPUs.
Tensor parallelism = horizontal split (reduces per-layer latency, needs all-reduce). Pipeline parallelism = vertical split (layer groups per GPU, needs micro-batching). Production uses both.
🟢
NeMo vs TensorRT-LLM vs NIM — one-line each?
NeMo = train/fine-tune · TRT-LLM = optimize inference · NIM = deploy as API
NeMo: LoRA, SFT, DPO, multi-GPU training. TensorRT-LLM: quantization, kernel fusion, batching. NIM: pre-packaged Docker container, OpenAI-compatible API, one command to deploy.

Ready to Pass the NCA-GENL? Get Everything You Need in One Place.

These concept pages are just the start. FlashGenius gives you a complete NCA-GENL prep toolkit — practice tests, flashcard decks, concept cheat sheets, and scenario quizzes built for the NVIDIA Generative AI LLMs exam.