Master transformer architecture, tokenization, pre-training, fine-tuning, and text generation mechanics — the core concepts tested on the NVIDIA NCA-GENL certification.
The Four LLM Knowledge Pillars
Every NCA-GENL question on LLM fundamentals traces back to one of these four interconnected areas
The Transformer is the neural network architecture that powers all modern LLMs. Its key innovation — self-attention — allows every token to directly relate to every other token in the input, enabling the model to capture context and meaning regardless of how far apart words are in a sequence.
LLMs don't read words — they read tokens. Tokenization splits text into subword units using algorithms like Byte-Pair Encoding (BPE). Each token is mapped to a high-dimensional embedding vector. Context windows limit how many tokens a model can process at once — a key practical constraint.
Foundation models are pre-trained on massive text corpora using self-supervised objectives (next-token prediction for decoder models; masked language modeling for encoders). Fine-tuning adapts them to specific tasks. RLHF aligns models with human preferences. LoRA makes fine-tuning efficient.
LLMs generate text autoregressively — one token at a time. Temperature, top-p, and top-k control the randomness of outputs. NVIDIA NeMo, TensorRT-LLM, and NIM provide the infrastructure for training, optimizing, and deploying LLMs at scale on NVIDIA GPU hardware.
How It Works
Deep-dive into each pillar — transformer internals, tokenization, training objectives, and generation mechanics
Self-Attention, Multi-Head Attention & the Transformer Stack
Transformer Layer Stack (decoder-only, e.g. GPT)
For each token, three vectors are created: a Query (what am I looking for?), a Key (what do I represent?), and a Value (what information do I carry?). The dot product of a token's Query with every other token's Key produces raw attention scores. After scaling by √dₖ and applying softmax, these become attention weights — how much focus each token places on every other. The output is a weighted sum of the Value vectors.
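As a minimal NumPy sketch of this mechanism (the toy sizes and random Q/K/V matrices below are purely illustrative, not from any real model):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # raw attention scores, shape (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ V, weights                          # weighted sum of Value vectors

# Toy example: 4 tokens, d_k = 8 (illustrative sizes)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
output, attn = scaled_dot_product_attention(Q, K, V)
print(attn.round(2))   # each row sums to 1: how much that token attends to every other
```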
Instead of running attention once, Multi-Head Attention runs h parallel attention heads, each with its own learned Q, K, V projection matrices. Each head can learn to attend to different types of relationships simultaneously — syntactic, semantic, long-range, short-range. Outputs from all heads are concatenated and linearly projected back to the model dimension.
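A rough sketch of the split-attend-concatenate-project flow, assuming toy dimensions and random projection matrices (real models learn Wq, Wk, Wv, Wo during training):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (n, d_model). Wq/Wk/Wv/Wo: (d_model, d_model). h: number of heads."""
    n, d_model = X.shape
    d_head = d_model // h
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    split = lambda M: M.reshape(n, h, d_head).transpose(1, 0, 2)   # -> (h, n, d_head)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)          # (h, n, n), one map per head
    heads = softmax(scores) @ Vh                                   # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)          # concatenate the h heads
    return concat @ Wo                                             # final linear projection

# Toy sizes: 4 tokens, d_model = 16, h = 4 heads (illustrative)
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 16))
Wq, Wk, Wv, Wo = (rng.normal(size=(16, 16)) * 0.1 for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, h=4).shape)   # (4, 16)
```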
Transformers process all tokens in parallel (unlike RNNs), so they have no inherent sense of order. Positional encodings are added to token embeddings to inject sequence position information. The original Transformer used fixed sinusoidal encodings. Modern LLMs use learned positional embeddings or Rotary Positional Encoding (RoPE), which enables better generalization to longer sequences than seen during training.
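A short sketch of the original fixed sinusoidal scheme (RoPE itself rotates the Q and K vectors inside attention and is more involved; the sizes here are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """Fixed encodings from the original Transformer:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(n_positions)[:, None]            # (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]            # even dimension indices
    angle = pos / np.power(10000, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

# These vectors are simply added to the token embeddings before the first layer.
pe = sinusoidal_positional_encoding(n_positions=128, d_model=64)
print(pe.shape)   # (128, 64)
```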
Decoder-only models (GPT, Llama) use causal masking — when generating token N, the model can only attend to tokens 1 through N-1. Future tokens are masked out with -∞ before softmax. This ensures autoregressive generation is valid: each token is predicted only from preceding context, not from future tokens it hasn't generated yet.
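A minimal sketch of how the mask is built and applied, assuming a toy 4-token sequence with uniform scores:

```python
import numpy as np

def causal_mask(n):
    """Position i may attend only to positions <= i; future positions get -inf."""
    future = np.triu(np.ones((n, n)), k=1)          # 1s above the diagonal = future tokens
    return np.where(future == 1, -np.inf, 0.0)      # added to attention scores before softmax

scores = np.zeros((4, 4)) + causal_mask(4)          # toy scores, all zero except the mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))
# Row i is uniform over tokens 0..i and exactly 0 for every future token.
```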
BPE, Context Windows & Embedding Spaces
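As a rough illustration of the BPE idea summarized in the tokenization pillar above, here is a toy sketch of the merge loop: count the most frequent adjacent symbol pair in a tiny corpus and fuse it into a new subword token. The corpus and the number of merges below are purely illustrative; production tokenizers learn tens of thousands of merges over huge corpora.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words (each word is a list of symbols)."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Toy corpus, split into characters (illustrative)
words = [list("lower"), list("lowest"), list("low"), list("slower")]
for _ in range(3):                       # learn 3 merges
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", words)
```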
Foundation Models, RLHF & Parameter-Efficient Fine-tuning
Decoder-only models (GPT, Llama) are pre-trained with causal language modeling — given a sequence of tokens, predict the next token. No labels are needed — the next token IS the label. Training on trillions of tokens from internet text, books, and code gives the model broad world knowledge and language understanding. This is self-supervised learning at massive scale.
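A minimal sketch of the objective: the label for position t is simply the token at position t+1, and the loss is the cross-entropy of the model's prediction. The toy logits and token ids below are illustrative.

```python
import numpy as np

def causal_lm_loss(logits, token_ids):
    """Average cross-entropy of predicting token t+1 from the logits at position t.
    logits: (seq_len, vocab_size); token_ids: (seq_len,).
    The data itself supplies the labels, so no human annotation is required."""
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))  # log-softmax
    preds = log_probs[:-1]        # positions 0 .. T-2 predict ...
    targets = token_ids[1:]       # ... tokens 1 .. T-1 (shifted by one)
    return -preds[np.arange(len(targets)), targets].mean()

# Toy example: vocabulary of 10, sequence of 6 token ids (illustrative)
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10))
tokens = rng.integers(0, 10, size=6)
print(causal_lm_loss(logits, tokens))
```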
Encoder-only models (BERT) are pre-trained with masked language modeling (MLM) — randomly mask 15% of tokens and train the model to predict the masked tokens using both left and right context. This forces bidirectional understanding. BERT also uses Next Sentence Prediction (NSP) as a secondary objective. The result: deep contextual text representations ideal for classification and extraction tasks.
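A simplified sketch of the masking step (BERT's full recipe also replaces some selected tokens with random tokens or leaves them unchanged; the token ids and mask id below are illustrative):

```python
import numpy as np

def mask_tokens(token_ids, mask_id, mask_prob=0.15, seed=0):
    """Randomly replace ~15% of tokens with [MASK]; the model is trained to
    recover the originals using context from BOTH sides."""
    rng = np.random.default_rng(seed)
    token_ids = np.array(token_ids)
    mask = rng.random(token_ids.shape) < mask_prob
    labels = np.where(mask, token_ids, -100)     # -100 = position ignored by the loss
    inputs = np.where(mask, mask_id, token_ids)
    return inputs, labels

inputs, labels = mask_tokens([12, 7, 99, 3, 42, 8, 15, 61], mask_id=103)
print(inputs)   # masked positions now hold the [MASK] id (103)
print(labels)   # original ids at masked positions, -100 everywhere else
```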
A pre-trained base model generates raw text continuations — it's not yet a chat assistant. Supervised fine-tuning on curated (instruction, response) pairs teaches the model to follow instructions. This transforms a text-predictor into an instruction-following assistant (e.g., InstructGPT, Llama-Instruct). The fine-tuning dataset is much smaller than pre-training data — thousands to millions of examples vs. trillions of tokens.
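A hedged sketch of how one (instruction, response) pair might be serialized into a training string; the tag format below is purely illustrative and not any particular model's official chat template.

```python
def format_sft_example(instruction, response):
    """Serialize one (instruction, response) pair into a single training string.
    The tags are illustrative; real models each define their own template."""
    return (
        "### Instruction:\n" + instruction.strip() + "\n\n"
        "### Response:\n" + response.strip()
    )

pair = ("Summarize the benefit of LoRA in one sentence.",
        "LoRA fine-tunes a frozen base model by training small low-rank matrices, "
        "cutting trainable parameters dramatically.")
print(format_sft_example(*pair))
# During SFT the loss is typically computed only on the response tokens.
```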
RLHF is the alignment technique used to make models more helpful, harmless, and honest. Process: (1) Collect human preference data — rank model outputs from best to worst. (2) Train a Reward Model (RM) to predict human preferences. (3) Fine-tune the LLM using PPO (Proximal Policy Optimization) to maximize reward model scores, with a KL-divergence penalty to prevent drifting too far from the base model. ChatGPT, Claude, and Gemini all use RLHF variants.
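A very rough sketch of the quantity step (3) maximizes: the reward-model score minus a KL-style penalty that keeps the policy close to the frozen reference model. The per-token log-probabilities, reward value, and beta coefficient below are illustrative assumptions; a real PPO loop (advantages, clipping, value function) is far more involved.

```python
import numpy as np

def rlhf_objective(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model score minus a penalty for drifting from the base model.
    policy_logprobs / ref_logprobs: per-token log-probs of the generated response."""
    kl_penalty = np.sum(np.array(policy_logprobs) - np.array(ref_logprobs))
    return reward - beta * kl_penalty

# Toy numbers (illustrative): the fine-tuned policy is slightly more confident
# on its own response than the frozen reference model is.
policy    = [-1.2, -0.8, -0.5, -1.0]
reference = [-1.4, -1.1, -0.9, -1.3]
print(rlhf_objective(reward=2.3, policy_logprobs=policy, ref_logprobs=reference))
```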
Full fine-tuning of a 70B-parameter model requires enormous GPU resources. LoRA freezes all original model weights and injects small trainable low-rank matrices (A and B, where rank r ≪ model_dim) into each attention layer. The weight update ΔW = A × B is low-rank. This reduces trainable parameters by 100–10,000× while achieving performance very close to full fine-tuning. QLoRA extends this by quantizing the frozen base model to 4-bit, further reducing memory requirements.
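A minimal sketch of a LoRA-adapted linear layer, assuming illustrative sizes (d_model = 512, r = 8) and a conventional alpha/r scaling factor:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """y = x W + (alpha / r) * x (A B). W is frozen; only A and B are trained."""
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A @ B)

d_model, r = 512, 8                            # rank r << d_model (illustrative sizes)
rng = np.random.default_rng(0)
W = rng.normal(size=(d_model, d_model))        # frozen pre-trained weight
A = rng.normal(size=(d_model, r)) * 0.01       # trainable, d_model x r
B = np.zeros((r, d_model))                     # trainable, r x d_model (init to zero)
x = rng.normal(size=(1, d_model))
print(lora_forward(x, W, A, B).shape)          # (1, 512)

# Trainable params for this one layer: 2 * d_model * r = 8,192
# versus d_model^2 = 262,144 for full fine-tuning of the same matrix.
print(2 * d_model * r, "vs", d_model * d_model)
```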
Sampling Strategies, Inference Optimization & NVIDIA Stack
Temperature T scales the logits before softmax: logits_scaled = logits / T. T = 0: deterministic greedy decoding (implementations special-case this as argmax, since dividing by zero is undefined) — always picks the highest-probability token, so the same input always gives the same output. T = 1: the model's unmodified probabilities. T > 1: flatter distribution — more random, more diverse, but more likely to make errors. Typical production values: 0.0 for factual tasks, 0.7–0.9 for balanced chat, 1.0+ for creative tasks.
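A small sketch of the scaling, using toy next-token logits (illustrative values):

```python
import numpy as np

def apply_temperature(logits, T):
    """Scale logits by 1/T, then softmax. Lower T sharpens, higher T flattens."""
    scaled = np.array(logits) / T
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.2, -1.0])       # toy next-token logits
for T in (0.2, 1.0, 1.5):
    print(T, apply_temperature(logits, T).round(3))
# Low T concentrates almost all mass on the argmax token (greedy in the limit);
# T > 1 spreads probability toward lower-ranked tokens.
```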
At each generation step, restrict sampling to only the top-k most probable tokens (discard all others). Common values: k = 50 or k = 100. Prevents very unlikely tokens from ever being sampled. The weakness: k is a fixed number regardless of how concentrated or spread the probability distribution is — sometimes k tokens may cover almost 100% of probability mass, sometimes much less.
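A minimal sketch of one top-k sampling step over a toy next-token distribution (values illustrative):

```python
import numpy as np

def top_k_sample(probs, k, rng):
    """Keep only the k most probable tokens, renormalize, then sample."""
    probs = np.array(probs)
    top = np.argsort(probs)[-k:]                # indices of the k highest probabilities
    filtered = np.zeros_like(probs)
    filtered[top] = probs[top]
    filtered /= filtered.sum()                  # renormalize over the survivors
    return rng.choice(len(probs), p=filtered)

rng = np.random.default_rng(0)
probs = [0.50, 0.25, 0.15, 0.06, 0.03, 0.01]    # toy next-token distribution
print(top_k_sample(probs, k=3, rng=rng))        # only tokens 0, 1, 2 can ever be drawn
```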
Select the smallest set of tokens whose cumulative probability mass is ≥ p (e.g., p = 0.9). The size of this set adapts dynamically: when the model is very confident, only a few tokens may cover 90% of probability — so sampling is tight. When uncertain, many tokens may be needed — so the set expands. This is generally preferred over top-k alone because it adapts to the model's confidence. Top-p and top-k are often combined.
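A minimal sketch of one nucleus-sampling step; the two toy distributions show how the nucleus size adapts to the model's confidence (values illustrative):

```python
import numpy as np

def top_p_sample(probs, p, rng):
    """Keep the smallest set of tokens whose cumulative probability >= p, then sample."""
    probs = np.array(probs)
    order = np.argsort(probs)[::-1]              # tokens sorted most to least probable
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # size of the nucleus
    nucleus = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[nucleus] = probs[nucleus]
    filtered /= filtered.sum()
    return rng.choice(len(probs), p=filtered)

rng = np.random.default_rng(0)
confident = [0.85, 0.10, 0.03, 0.01, 0.01]       # nucleus at p=0.9 is just 2 tokens
uncertain = [0.35, 0.25, 0.20, 0.12, 0.08]       # nucleus at p=0.9 grows to 4 tokens
print(top_p_sample(confident, 0.9, rng), top_p_sample(uncertain, 0.9, rng))
```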
Instead of sampling one token at a time, beam search maintains b candidate sequences ("beams") in parallel and expands each at every step, keeping only the top b most probable full sequences. Produces more coherent and grammatically correct output than pure sampling. Commonly used in translation and summarization. Downside: deterministic and less diverse than sampling; computationally more expensive than greedy decoding.
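A compact sketch of the expand-and-prune loop, driven by a toy "model" whose next-token distribution depends only on the last token; the transition table, beam width, and special token ids are illustrative assumptions.

```python
import numpy as np

def beam_search(next_logprobs, bos_id, eos_id, beam_width=3, max_len=6):
    """Keep the beam_width highest-scoring partial sequences; expand each every step.
    next_logprobs(seq) -> log-probabilities over the vocabulary for the next token."""
    beams = [([bos_id], 0.0)]                      # (token sequence, total log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:                  # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            for tok, lp in enumerate(next_logprobs(seq)):
                candidates.append((seq + [tok], score + lp))
        # Keep only the beam_width best candidates by total log-probability
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]

# Toy "model" (illustrative): a 5x5 log-probability transition table
rng = np.random.default_rng(0)
table = np.log(rng.dirichlet(np.ones(5), size=5))
best_seq, best_score = beam_search(lambda seq: table[seq[-1]], bos_id=0, eos_id=4)
print(best_seq, round(best_score, 3))
```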
Compare
Filter by pillar to compare architectures, training methods, sampling strategies, and NVIDIA tools
| Concept | Category | Key Value / Range | What It Controls | NCA-GENL Exam Point |
|---|---|---|---|---|
| Self-Attention | Architecture | O(n²) complexity | How tokens relate to each other; captures long-range dependencies | Uses Q, K, V vectors. Score = QKᵀ/√dₖ then softmax. Foundation of all Transformers. |
| Multi-Head Attention | Architecture | h parallel heads (e.g., 32) | Multiple relationship types learned simultaneously | Each head learns different attention patterns. Outputs concatenated then projected. |
| Feed-Forward Network (FFN) | Architecture | 4× model dimension | Per-token nonlinear transformation after attention | Applied position-wise (independently to each token). Usually ReLU or GELU activation. |
| Decoder-Only (GPT-style) | Architecture | Causal mask | Text generation — attends only to past tokens | GPT, Llama, Mistral, Falcon. Best for chat, generation, code completion. |
| Encoder-Only (BERT-style) | Architecture | Bidirectional attention | Text understanding — attends to all tokens in both directions | BERT, RoBERTa. Best for classification, NER, embeddings. Cannot generate freely. |
| Encoder-Decoder (T5-style) | Architecture | Cross-attention bridge | Sequence-to-sequence tasks | T5, BART. Encoder processes input; decoder generates output. Best for translation, summarization. |
| Model Parameters | Architecture | Millions → Trillions | Total learnable weights across all layers | More params → more capacity. GPT-3 = 175B. Llama 3.1 = 8B–405B. Compute scales with params. |
| BPE (Byte-Pair Encoding) | Tokenization | 32K–100K vocab | How text is split into subword tokens | Dominant algorithm. Merges most-frequent character pairs iteratively. Used by GPT models. |
| Token | Tokenization | ≈4 chars avg (English) | Atomic unit the model processes | 1K tokens ≈ 750 words. Numbers and rare words may tokenize inefficiently (more tokens). |
| Token Embedding | Tokenization | Dims: 768–8192+ | Dense vector representation of each token | Learned during pre-training. Similar meanings → nearby vectors. Foundation of semantic search. |
| Context Window | Tokenization | 4K–1M tokens | Max tokens (input + output) processed at once | GPT-4 = 128K. Gemini 1.5 = 1M. Longer = more context but quadratic attention cost. |
| Positional Encoding (RoPE) | Tokenization | Rotary embedding | Injects token position into attention mechanism | RoPE used by Llama, Mistral, many modern LLMs. Better length generalization than learned PE. |
| Causal Language Modeling | Training | Next token prediction | Pre-training objective for decoder-only models | Self-supervised — no labels needed. Predicts next token from preceding context. Used by GPT, Llama. |
| Masked Language Modeling | Training | 15% tokens masked | Pre-training objective for encoder-only models | Predict masked tokens using bidirectional context. Used by BERT, RoBERTa. |
| Supervised Fine-tuning (SFT) | Training | Thousands–millions of examples | Adapts base model to follow instructions | Uses (instruction, response) pairs. Transforms base LLM into chat/instruction-following model. |
| RLHF | Training | Human preference pairs | Aligns model with human values and preferences | Train Reward Model → PPO optimization. Used by ChatGPT, Claude, Gemini. Reduces harmful outputs. |
| LoRA (Low-Rank Adaptation) | Training | Rank r = 4–64 | Reduces trainable params for fine-tuning | Freezes base model; trains small A×B matrices. 100–10,000× fewer params than full fine-tuning. |
| QLoRA | Training | 4-bit quantization | LoRA with quantized base model for lower memory | Fine-tune 70B model on a single GPU. Base model in NF4; LoRA adapters in BF16. |
| Temperature | Generation | 0.0 – 2.0 | Randomness of token sampling | T=0 → greedy (deterministic). T=1 → raw probs. T>1 → more random/creative. T=0.7-0.9 typical. |
| Top-k Sampling | Generation | k = 10–100 | Restricts to top-k tokens at each step | Fixed number regardless of distribution shape. Prevents very unlikely tokens. |
| Top-p (Nucleus) Sampling | Generation | p = 0.8–0.95 | Dynamic token selection by cumulative probability | Adapts to confidence. Tight set when confident; expands when uncertain. Generally preferred over top-k alone. |
| Beam Search | Generation | b beams (e.g., 4–10) | Maintains multiple candidate sequences in parallel | More coherent output than greedy. Deterministic. Higher compute. Used in translation/summarization. |
| NVIDIA NeMo | NVIDIA Tools | Training + fine-tuning FW | End-to-end LLM training and customization | Supports LoRA, SFT, RLHF, multi-GPU distributed training. Exports to TensorRT-LLM. |
| NVIDIA TensorRT-LLM | NVIDIA Tools | Inference optimization | Accelerates LLM inference on NVIDIA GPUs | Quantization (INT8/FP8), kernel fusion, continuous batching, paged KV-cache. Major latency reduction. |
| NVIDIA NIM | NVIDIA Tools | Inference microservices | Packaged model deployment with OpenAI-compatible API | Deploy optimized LLMs on own infrastructure. Abstracts GPU selection and TensorRT optimization. |
Real Examples
Intuitive scenarios that make abstract LLM concepts concrete
Practice Quiz
10 NCA-GENL style questions across all four pillars — with instant explanations after each answer
LLM Navigator
Answer a few questions to get targeted explanations on any LLM fundamentals concept
Memory Hooks
Click any card to flip it — 8 high-yield LLM mnemonics for the NCA-GENL exam