Master transformer architecture, tokenization, pre-training, fine-tuning, and text generation mechanics — the core concepts tested on the NVIDIA NCA-GENL certification.
The Four LLM Knowledge Pillars
Every NCA-GENL question on LLM fundamentals traces back to one of these four interconnected areas
The Transformer is the neural network architecture that powers all modern LLMs. Its key innovation — self-attention — allows every token to directly relate to every other token in the input, enabling the model to capture context and meaning regardless of how far apart words are in a sequence.
LLMs don't read words — they read tokens. Tokenization splits text into subword units using algorithms like Byte-Pair Encoding (BPE). Each token is mapped to a high-dimensional embedding vector. Context windows limit how many tokens a model can process at once — a key practical constraint.
Foundation models are pre-trained on massive text corpora using self-supervised objectives (next-token prediction for decoder models; masked language modeling for encoders). Fine-tuning adapts them to specific tasks. RLHF aligns models with human preferences. LoRA makes fine-tuning efficient.
LLMs generate text autoregressively — one token at a time. Temperature, top-p, and top-k control the randomness of outputs. NVIDIA NeMo, TensorRT-LLM, and NIM provide the infrastructure for training, optimizing, and deploying LLMs at scale on NVIDIA GPU hardware.
How It Works
Deep-dive into each pillar — transformer internals, tokenization, training objectives, and generation mechanics
Self-Attention, Multi-Head Attention & the Transformer Stack
Transformer Layer Stack (decoder-only, e.g. GPT)
For each token, three vectors are created: a Query (what am I looking for?), a Key (what do I represent?), and a Value (what information do I carry?). The dot product of a token's Query with every other token's Key produces raw attention scores. After scaling by √dₖ and applying softmax, these become attention weights — how much focus each token places on every other. The output is a weighted sum of the Value vectors.
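As a minimal NumPy sketch of this mechanism (the toy sizes and random Q/K/V matrices below are purely illustrative, not from any real model):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # raw attention scores, shape (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ V, weights                          # weighted sum of Value vectors

# Toy example: 4 tokens, d_k = 8 (illustrative sizes)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
output, attn = scaled_dot_product_attention(Q, K, V)
print(attn.round(2))   # each row sums to 1: how much that token attends to every other
```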
Instead of running attention once, Multi-Head Attention runs h parallel attention heads, each with its own learned Q, K, V projection matrices. Each head can learn to attend to different types of relationships simultaneously — syntactic, semantic, long-range, short-range. Outputs from all heads are concatenated and linearly projected back to the model dimension.
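A rough sketch of the split-attend-concatenate-project flow, assuming toy dimensions and random projection matrices (real models learn Wq, Wk, Wv, Wo during training):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (n, d_model). Wq/Wk/Wv/Wo: (d_model, d_model). h: number of heads."""
    n, d_model = X.shape
    d_head = d_model // h
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    split = lambda M: M.reshape(n, h, d_head).transpose(1, 0, 2)   # -> (h, n, d_head)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)          # (h, n, n), one map per head
    heads = softmax(scores) @ Vh                                   # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)          # concatenate the h heads
    return concat @ Wo                                             # final linear projection

# Toy sizes: 4 tokens, d_model = 16, h = 4 heads (illustrative)
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 16))
Wq, Wk, Wv, Wo = (rng.normal(size=(16, 16)) * 0.1 for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, h=4).shape)   # (4, 16)
```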
Transformers process all tokens in parallel (unlike RNNs), so they have no inherent sense of order. Positional encodings are added to token embeddings to inject sequence position information. The original Transformer used fixed sinusoidal encodings. Modern LLMs use learned positional embeddings or Rotary Positional Encoding (RoPE), which enables better generalization to longer sequences than seen during training.
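A short sketch of the original fixed sinusoidal scheme (RoPE itself rotates the Q and K vectors inside attention and is more involved; the sizes here are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """Fixed encodings from the original Transformer:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(n_positions)[:, None]            # (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]            # even dimension indices
    angle = pos / np.power(10000, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

# These vectors are simply added to the token embeddings before the first layer.
pe = sinusoidal_positional_encoding(n_positions=128, d_model=64)
print(pe.shape)   # (128, 64)
```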
Decoder-only models (GPT, Llama) use causal masking — when generating token N, the model can only attend to tokens 1 through N-1. Future tokens are masked out with -∞ before softmax. This ensures autoregressive generation is valid: each token is predicted only from preceding context, not from future tokens it hasn't generated yet.
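A minimal sketch of how the mask is built and applied, assuming a toy 4-token sequence with uniform scores:

```python
import numpy as np

def causal_mask(n):
    """Position i may attend only to positions <= i; future positions get -inf."""
    future = np.triu(np.ones((n, n)), k=1)          # 1s above the diagonal = future tokens
    return np.where(future == 1, -np.inf, 0.0)      # added to attention scores before softmax

scores = np.zeros((4, 4)) + causal_mask(4)          # toy scores, all zero except the mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))
# Row i is uniform over tokens 0..i and exactly 0 for every future token.
```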
BPE, Context Windows & Embedding Spaces
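As a rough illustration of the BPE idea summarized in the tokenization pillar above, here is a toy sketch of the merge loop: count the most frequent adjacent symbol pair in a tiny corpus and fuse it into a new subword token. The corpus and the number of merges below are purely illustrative; production tokenizers learn tens of thousands of merges over huge corpora.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words (each word is a list of symbols)."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Toy corpus, split into characters (illustrative)
words = [list("lower"), list("lowest"), list("low"), list("slower")]
for _ in range(3):                       # learn 3 merges
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", words)
```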
Foundation Models, RLHF & Parameter-Efficient Fine-tuning
Decoder-only models (GPT, Llama) are pre-trained with causal language modeling — given a sequence of tokens, predict the next token. No labels are needed — the next token IS the label. Training on trillions of tokens from internet text, books, and code gives the model broad world knowledge and language understanding. This is self-supervised learning at massive scale.
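A minimal sketch of the objective: the label for position t is simply the token at position t+1, and the loss is the cross-entropy of the model's prediction. The toy logits and token ids below are illustrative.

```python
import numpy as np

def causal_lm_loss(logits, token_ids):
    """Average cross-entropy of predicting token t+1 from the logits at position t.
    logits: (seq_len, vocab_size); token_ids: (seq_len,).
    The data itself supplies the labels, so no human annotation is required."""
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))  # log-softmax
    preds = log_probs[:-1]        # positions 0 .. T-2 predict ...
    targets = token_ids[1:]       # ... tokens 1 .. T-1 (shifted by one)
    return -preds[np.arange(len(targets)), targets].mean()

# Toy example: vocabulary of 10, sequence of 6 token ids (illustrative)
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10))
tokens = rng.integers(0, 10, size=6)
print(causal_lm_loss(logits, tokens))
```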
Encoder-only models (BERT) are pre-trained with masked language modeling (MLM) — randomly mask 15% of tokens and train the model to predict the masked tokens using both left and right context. This forces bidirectional understanding. BERT also uses Next Sentence Prediction (NSP) as a secondary objective. The result: deep contextual text representations ideal for classification and extraction tasks.
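A simplified sketch of the masking step (BERT's full recipe also replaces some selected tokens with random tokens or leaves them unchanged; the token ids and mask id below are illustrative):

```python
import numpy as np

def mask_tokens(token_ids, mask_id, mask_prob=0.15, seed=0):
    """Randomly replace ~15% of tokens with [MASK]; the model is trained to
    recover the originals using context from BOTH sides."""
    rng = np.random.default_rng(seed)
    token_ids = np.array(token_ids)
    mask = rng.random(token_ids.shape) < mask_prob
    labels = np.where(mask, token_ids, -100)     # -100 = position ignored by the loss
    inputs = np.where(mask, mask_id, token_ids)
    return inputs, labels

inputs, labels = mask_tokens([12, 7, 99, 3, 42, 8, 15, 61], mask_id=103)
print(inputs)   # masked positions now hold the [MASK] id (103)
print(labels)   # original ids at masked positions, -100 everywhere else
```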
A pre-trained base model generates raw text continuations — it's not yet a chat assistant. Supervised fine-tuning on curated (instruction, response) pairs teaches the model to follow instructions. This transforms a text-predictor into an instruction-following assistant (e.g., InstructGPT, Llama-Instruct). The fine-tuning dataset is much smaller than pre-training data — thousands to millions of examples vs. trillions of tokens.
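A hedged sketch of how one (instruction, response) pair might be serialized into a training string; the tag format below is purely illustrative and not any particular model's official chat template.

```python
def format_sft_example(instruction, response):
    """Serialize one (instruction, response) pair into a single training string.
    The tags are illustrative; real models each define their own template."""
    return (
        "### Instruction:\n" + instruction.strip() + "\n\n"
        "### Response:\n" + response.strip()
    )

pair = ("Summarize the benefit of LoRA in one sentence.",
        "LoRA fine-tunes a frozen base model by training small low-rank matrices, "
        "cutting trainable parameters dramatically.")
print(format_sft_example(*pair))
# During SFT the loss is typically computed only on the response tokens.
```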
RLHF is the alignment technique used to make models more helpful, harmless, and honest. Process: (1) Collect human preference data — rank model outputs from best to worst. (2) Train a Reward Model (RM) to predict human preferences. (3) Fine-tune the LLM using PPO (Proximal Policy Optimization) to maximize reward model scores, with a KL-divergence penalty to prevent drifting too far from the base model. ChatGPT, Claude, and Gemini all use RLHF variants.
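A very rough sketch of the quantity step (3) maximizes: the reward-model score minus a KL-style penalty that keeps the policy close to the frozen reference model. The per-token log-probabilities, reward value, and beta coefficient below are illustrative assumptions; a real PPO loop (advantages, clipping, value function) is far more involved.

```python
import numpy as np

def rlhf_objective(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model score minus a penalty for drifting from the base model.
    policy_logprobs / ref_logprobs: per-token log-probs of the generated response."""
    kl_penalty = np.sum(np.array(policy_logprobs) - np.array(ref_logprobs))
    return reward - beta * kl_penalty

# Toy numbers (illustrative): the fine-tuned policy is slightly more confident
# on its own response than the frozen reference model is.
policy    = [-1.2, -0.8, -0.5, -1.0]
reference = [-1.4, -1.1, -0.9, -1.3]
print(rlhf_objective(reward=2.3, policy_logprobs=policy, ref_logprobs=reference))
```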
Full fine-tuning of a 70B-parameter model requires enormous GPU resources. LoRA freezes all original model weights and injects small trainable low-rank matrices (A and B, where rank r ≪ model_dim) into each attention layer. The weight update ΔW = A × B is low-rank. This reduces trainable parameters by 100–10,000× while achieving performance very close to full fine-tuning. QLoRA extends this by quantizing the frozen base model to 4-bit, further reducing memory requirements.
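A minimal sketch of a LoRA-adapted linear layer, assuming illustrative sizes (d_model = 512, r = 8) and a conventional alpha/r scaling factor:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """y = x W + (alpha / r) * x (A B). W is frozen; only A and B are trained."""
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A @ B)

d_model, r = 512, 8                            # rank r << d_model (illustrative sizes)
rng = np.random.default_rng(0)
W = rng.normal(size=(d_model, d_model))        # frozen pre-trained weight
A = rng.normal(size=(d_model, r)) * 0.01       # trainable, d_model x r
B = np.zeros((r, d_model))                     # trainable, r x d_model (init to zero)
x = rng.normal(size=(1, d_model))
print(lora_forward(x, W, A, B).shape)          # (1, 512)

# Trainable params for this one layer: 2 * d_model * r = 8,192
# versus d_model^2 = 262,144 for full fine-tuning of the same matrix.
print(2 * d_model * r, "vs", d_model * d_model)
```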
Sampling Strategies, Inference Optimization & NVIDIA Stack
Temperature T scales the logits before softmax: logits_scaled = logits / T. T = 0: deterministic greedy decoding (implementations special-case this as argmax, since dividing by zero is undefined) — always picks the highest-probability token, so the same input always gives the same output. T = 1: the model's unmodified probabilities. T > 1: flatter distribution — more random, more diverse, but more likely to make errors. Typical production values: 0.0 for factual tasks, 0.7–0.9 for balanced chat, 1.0+ for creative tasks.
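A small sketch of the scaling, using toy next-token logits (illustrative values):

```python
import numpy as np

def apply_temperature(logits, T):
    """Scale logits by 1/T, then softmax. Lower T sharpens, higher T flattens."""
    scaled = np.array(logits) / T
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.2, -1.0])       # toy next-token logits
for T in (0.2, 1.0, 1.5):
    print(T, apply_temperature(logits, T).round(3))
# Low T concentrates almost all mass on the argmax token (greedy in the limit);
# T > 1 spreads probability toward lower-ranked tokens.
```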
At each generation step, restrict sampling to only the top-k most probable tokens (discard all others). Common values: k = 50 or k = 100. Prevents very unlikely tokens from ever being sampled. The weakness: k is a fixed number regardless of how concentrated or spread the probability distribution is — sometimes k tokens may cover almost 100% of probability mass, sometimes much less.
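A minimal sketch of one top-k sampling step over a toy next-token distribution (values illustrative):

```python
import numpy as np

def top_k_sample(probs, k, rng):
    """Keep only the k most probable tokens, renormalize, then sample."""
    probs = np.array(probs)
    top = np.argsort(probs)[-k:]                # indices of the k highest probabilities
    filtered = np.zeros_like(probs)
    filtered[top] = probs[top]
    filtered /= filtered.sum()                  # renormalize over the survivors
    return rng.choice(len(probs), p=filtered)

rng = np.random.default_rng(0)
probs = [0.50, 0.25, 0.15, 0.06, 0.03, 0.01]    # toy next-token distribution
print(top_k_sample(probs, k=3, rng=rng))        # only tokens 0, 1, 2 can ever be drawn
```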
Select the smallest set of tokens whose cumulative probability mass is ≥ p (e.g., p = 0.9). The size of this set adapts dynamically: when the model is very confident, only a few tokens may cover 90% of probability — so sampling is tight. When uncertain, many tokens may be needed — so the set expands. This is generally preferred over top-k alone because it adapts to the model's confidence. Top-p and top-k are often combined.
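A minimal sketch of one nucleus-sampling step; the two toy distributions show how the nucleus size adapts to the model's confidence (values illustrative):

```python
import numpy as np

def top_p_sample(probs, p, rng):
    """Keep the smallest set of tokens whose cumulative probability >= p, then sample."""
    probs = np.array(probs)
    order = np.argsort(probs)[::-1]              # tokens sorted most to least probable
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # size of the nucleus
    nucleus = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[nucleus] = probs[nucleus]
    filtered /= filtered.sum()
    return rng.choice(len(probs), p=filtered)

rng = np.random.default_rng(0)
confident = [0.85, 0.10, 0.03, 0.01, 0.01]       # nucleus at p=0.9 is just 2 tokens
uncertain = [0.35, 0.25, 0.20, 0.12, 0.08]       # nucleus at p=0.9 grows to 4 tokens
print(top_p_sample(confident, 0.9, rng), top_p_sample(uncertain, 0.9, rng))
```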
Instead of sampling one token at a time, beam search maintains b candidate sequences ("beams") in parallel and expands each at every step, keeping only the top b most probable full sequences. Produces more coherent and grammatically correct output than pure sampling. Commonly used in translation and summarization. Downside: deterministic and less diverse than sampling; computationally more expensive than greedy decoding.
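A compact sketch of the expand-and-prune loop, driven by a toy "model" whose next-token distribution depends only on the last token; the transition table, beam width, and special token ids are illustrative assumptions.

```python
import numpy as np

def beam_search(next_logprobs, bos_id, eos_id, beam_width=3, max_len=6):
    """Keep the beam_width highest-scoring partial sequences; expand each every step.
    next_logprobs(seq) -> log-probabilities over the vocabulary for the next token."""
    beams = [([bos_id], 0.0)]                      # (token sequence, total log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:                  # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            for tok, lp in enumerate(next_logprobs(seq)):
                candidates.append((seq + [tok], score + lp))
        # Keep only the beam_width best candidates by total log-probability
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]

# Toy "model" (illustrative): a 5x5 log-probability transition table
rng = np.random.default_rng(0)
table = np.log(rng.dirichlet(np.ones(5), size=5))
best_seq, best_score = beam_search(lambda seq: table[seq[-1]], bos_id=0, eos_id=4)
print(best_seq, round(best_score, 3))
```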
Compare
Filter by pillar to compare architectures, training methods, sampling strategies, and NVIDIA tools
| Concept | Category | Key Value / Range | What It Controls | NCA-GENL Exam Point |
|---|---|---|---|---|
| Self-Attention | Architecture | O(n²) complexity | How tokens relate to each other; captures long-range dependencies | Uses Q, K, V vectors. Score = QKᵀ/√dₖ then softmax. Foundation of all Transformers. |
| Multi-Head Attention | Architecture | h parallel heads (e.g., 32) | Multiple relationship types learned simultaneously | Each head learns different attention patterns. Outputs concatenated then projected. |
| Feed-Forward Network (FFN) | Architecture | 4× model dimension | Per-token nonlinear transformation after attention | Applied position-wise (independently to each token). Usually ReLU or GELU activation. |
| Decoder-Only (GPT-style) | Architecture | Causal mask | Text generation — attends only to past tokens | GPT, Llama, Mistral, Falcon. Best for chat, generation, code completion. |
| Encoder-Only (BERT-style) | Architecture | Bidirectional attention | Text understanding — attends to all tokens in both directions | BERT, RoBERTa. Best for classification, NER, embeddings. Cannot generate freely. |
| Encoder-Decoder (T5-style) | Architecture | Cross-attention bridge | Sequence-to-sequence tasks | T5, BART. Encoder processes input; decoder generates output. Best for translation, summarization. |
| Model Parameters | Architecture | Millions → Trillions | Total learnable weights across all layers | More params → more capacity. GPT-3 = 175B. Llama 3.1 = 8B–405B. Compute scales with params. |
| BPE (Byte-Pair Encoding) | Tokenization | 32K–100K vocab | How text is split into subword tokens | Dominant algorithm. Merges most-frequent character pairs iteratively. Used by GPT models. |
| Token | Tokenization | ≈4 chars avg (English) | Atomic unit the model processes | 1K tokens ≈ 750 words. Numbers and rare words may tokenize inefficiently (more tokens). |
| Token Embedding | Tokenization | Dims: 768–8192+ | Dense vector representation of each token | Learned during pre-training. Similar meanings → nearby vectors. Foundation of semantic search. |
| Context Window | Tokenization | 4K–1M tokens | Max tokens (input + output) processed at once | GPT-4 = 128K. Gemini 1.5 = 1M. Longer = more context but quadratic attention cost. |
| Positional Encoding (RoPE) | Tokenization | Rotary embedding | Injects token position into attention mechanism | RoPE used by Llama, Mistral, many modern LLMs. Better length generalization than learned PE. |
| Causal Language Modeling | Training | Next token prediction | Pre-training objective for decoder-only models | Self-supervised — no labels needed. Predicts next token from preceding context. Used by GPT, Llama. |
| Masked Language Modeling | Training | 15% tokens masked | Pre-training objective for encoder-only models | Predict masked tokens using bidirectional context. Used by BERT, RoBERTa. |
| Supervised Fine-tuning (SFT) | Training | Thousands–millions of examples | Adapts base model to follow instructions | Uses (instruction, response) pairs. Transforms base LLM into chat/instruction-following model. |
| RLHF | Training | Human preference pairs | Aligns model with human values and preferences | Train Reward Model → PPO optimization. Used by ChatGPT, Claude, Gemini. Reduces harmful outputs. |
| LoRA (Low-Rank Adaptation) | Training | Rank r = 4–64 | Reduces trainable params for fine-tuning | Freezes base model; trains small A×B matrices. 100–10,000× fewer params than full fine-tuning. |
| QLoRA | Training | 4-bit quantization | LoRA with quantized base model for lower memory | Fine-tune 70B model on a single GPU. Base model in NF4; LoRA adapters in BF16. |
| Temperature | Generation | 0.0 – 2.0 | Randomness of token sampling | T=0 → greedy (deterministic). T=1 → raw probs. T>1 → more random/creative. T=0.7-0.9 typical. |
| Top-k Sampling | Generation | k = 10–100 | Restricts to top-k tokens at each step | Fixed number regardless of distribution shape. Prevents very unlikely tokens. |
| Top-p (Nucleus) Sampling | Generation | p = 0.8–0.95 | Dynamic token selection by cumulative probability | Adapts to confidence. Tight set when confident; expands when uncertain. Generally preferred over top-k alone. |
| Beam Search | Generation | b beams (e.g., 4–10) | Maintains multiple candidate sequences in parallel | More coherent output than greedy. Deterministic. Higher compute. Used in translation/summarization. |
| NVIDIA NeMo | NVIDIA Tools | Training + fine-tuning FW | End-to-end LLM training and customization | Supports LoRA, SFT, RLHF, multi-GPU distributed training. Exports to TensorRT-LLM. |
| NVIDIA TensorRT-LLM | NVIDIA Tools | Inference optimization | Accelerates LLM inference on NVIDIA GPUs | Quantization (INT8/FP8), kernel fusion, continuous batching, paged KV-cache. Major latency reduction. |
| NVIDIA NIM | NVIDIA Tools | Inference microservices | Packaged model deployment with OpenAI-compatible API | Deploy optimized LLMs on own infrastructure. Abstracts GPU selection and TensorRT optimization. |
Real Examples
Intuitive scenarios that make abstract LLM concepts concrete
Practice Quiz
10 NCA-GENL style questions across all four pillars — with instant explanations after each answer
LLM Navigator
Answer a few questions to get targeted explanations on any LLM fundamentals concept
Memory Hooks
Click any card to flip it — 8 high-yield LLM mnemonics for the NCA-GENL exam