Master RAG pipeline architecture, vector embeddings, chunking strategies, hybrid retrieval, re-ranking, HyDE, Self-RAG, and NVIDIA NeMo Retriever for the NCA-GENL certification.
A RAG pipeline has two phases: at index time, documents are chunked, embedded into vectors, and stored in a vector database. At query time, the query is embedded, similar vectors are retrieved, and the retrieved text is injected into the LLM context before generation.
Dense retrieval uses neural embeddings for semantic similarity. Sparse retrieval uses BM25/TF-IDF for keyword precision. Hybrid combines both. Re-ranking uses a cross-encoder to re-score the top-K results for final precision — the most impactful single improvement to retrieval quality.
Advanced RAG addresses naive RAG's failure modes: HyDE closes the query-document embedding gap; query decomposition breaks complex questions into sub-queries; contextual compression reduces noise in retrieved chunks; Self-RAG adds adaptive retrieval and self-critique; Agentic RAG enables multi-hop reasoning across multiple retrieval steps.
NVIDIA's RAG stack: NeMo Retriever provides GPU-accelerated embedding and retrieval services; cuVS (CUDA Vector Search) delivers high-throughput approximate nearest neighbor (ANN) search; NIM microservices expose optimized embedding model endpoints; the NVIDIA RAG Blueprint wires the full pipeline end-to-end.
RAG has an offline indexing phase (run once, or on document updates) and an online query phase (run for every user request). Understanding both is essential for the exam.
Chunking is the process of splitting source documents into segments before embedding. Chunk size, overlap, and strategy have a dramatic effect on retrieval quality and generation accuracy.
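As a concrete illustration, here is a minimal fixed-size chunking sketch with overlap. It approximates tokens with whitespace-split words for simplicity, and the 512/64 sizes are illustrative defaults rather than recommendations from this guide; production chunkers count real tokenizer tokens and respect document structure (the recursive strategy covered later).

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping fixed-size chunks (word-based approximation of tokens)."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the document
    return chunks
```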
Documents and queries are embedded into high-dimensional vectors (768–4096 dims). Retrieval = nearest neighbor search by cosine similarity or dot product. Captures semantic meaning — "automobile" matches "car".
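A minimal sketch of the nearest-neighbor step with placeholder vectors; the embedding model that produces them (a sentence-transformers checkpoint or a NIM endpoint) is out of scope here.

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k documents most similar to the query by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                    # cosine similarity per document
    return np.argsort(-scores)[:k]    # indices of the top-k matches
```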
Documents represented as sparse term-frequency vectors. BM25 scores based on term frequency (TF), inverse document frequency (IDF), and document length normalization. Exact keyword matching — "NVIDIA A100" will match "NVIDIA A100".
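A small sparse-retrieval sketch using the third-party rank_bm25 package (an assumption; any BM25 implementation follows the same shape): score each document against the tokenized query and take the highest scorers.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25 (assumed available)

corpus = [
    "NVIDIA A100 NVLink bandwidth is 600 GB/s",
    "An overview of GPU memory hierarchies",
    "BM25 is a sparse retrieval scoring function",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "nvidia a100 nvlink bandwidth".split()
scores = bm25.get_scores(query)   # one BM25 score per document
best = scores.argmax()            # the exact-keyword document wins
```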
Run both dense and sparse retrieval in parallel, then combine ranked result lists using Reciprocal Rank Fusion (RRF): each document gets score = Σ 1/(k + rank_i) across all rankers. The hybrid consistently outperforms either method alone — especially on queries that mix semantic meaning with precise terminology (e.g., "side effects of atorvastatin in elderly patients").
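The fusion step itself is only a few lines; the sketch below applies the RRF formula to two ranked ID lists (k = 60 is the conventional smoothing constant, used here as an illustrative default).

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked doc-ID lists: each doc scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

dense_hits  = ["doc7", "doc2", "doc9"]   # e.g. from vector search
sparse_hits = ["doc2", "doc7", "doc4"]   # e.g. from BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])  # doc2 and doc7 rise to the top
```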
Bi-encoders (used for vector retrieval) embed query and document independently — fast but imprecise. Cross-encoders take the (query, document) pair together and score their relevance jointly — much more accurate but 100× slower. Strategy: retrieve Top-50 with fast bi-encoder, then re-rank with cross-encoder, keep Top-5 for the LLM context. Typical precision gain: +10–20%.
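A sketch of the two-stage pattern with the sentence-transformers library; the checkpoint names are illustrative open models, not ones prescribed by this guide, and the corpus is reduced to three sentences so the whole thing runs in memory.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "The A100 uses third-generation NVLink at 600 GB/s.",
    "LoRA reduces the number of trainable parameters.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
]
query = "How fast is NVLink on the A100?"

# Stage 1: bi-encoder retrieves a wide candidate set (Top-50 in production).
doc_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
candidates = util.semantic_search(query_emb, doc_emb, top_k=3)[0]

# Stage 2: cross-encoder re-scores each (query, document) pair jointly.
pairs = [(query, corpus[hit["corpus_id"]]) for hit in candidates]
rerank_scores = cross_encoder.predict(pairs)   # keep only the top few for the LLM context
```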
Problem: user query "What causes hallucinations in LLMs?" is short and conversational, while indexed documents are long and technical — their embedding spaces don't align well. HyDE fix: use the LLM to generate a hypothetical answer paragraph (as if it were a document), then embed that for retrieval. The hypothetical document is stylistically and topically closer to indexed content, dramatically improving recall for complex queries.
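A HyDE sketch under stated assumptions: generate_hypothetical_answer is a placeholder for whatever LLM call you use (a NIM chat endpoint, an OpenAI-compatible API, a local model), and the embedding model name is illustrative.

```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # illustrative model

def generate_hypothetical_answer(query: str) -> str:
    # Placeholder for an LLM call such as:
    #   "Write a short passage that answers the question: {query}"
    return ("Hallucinations in LLMs arise when the model fills knowledge gaps with "
            "fluent but unsupported text, often due to training data limitations "
            "and decoding that rewards plausibility over factuality.")

query = "What causes hallucinations in LLMs?"
hypothetical_doc = generate_hypothetical_answer(query)

# Embed the hypothetical document instead of the raw query, then run the
# usual vector search with this embedding.
hyde_vector = embedder.encode(hypothetical_doc, normalize_embeddings=True)
```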
For complex multi-part questions, generate N rephrased or decomposed sub-queries, retrieve for each independently, then de-duplicate and merge all retrieved chunks. Covers different facets of the question that a single query embedding would miss. E.g., "Compare LoRA and QLoRA in terms of memory and accuracy" → two sub-queries: "LoRA memory efficiency" + "QLoRA vs LoRA accuracy comparison".
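A minimal decompose-retrieve-merge sketch. Both decompose (an LLM call that produces sub-queries) and retrieve (a call into the vector index returning (chunk_id, text) pairs) are hypothetical placeholders, not library functions.

```python
def multi_query_retrieve(query: str, decompose, retrieve, k: int = 5) -> list[str]:
    """Retrieve for each LLM-generated sub-query, de-duplicate by chunk ID, and merge."""
    sub_queries = decompose(query)      # e.g. ["LoRA memory efficiency", "QLoRA vs LoRA accuracy comparison"]
    seen: set[str] = set()
    merged: list[str] = []
    for sub_query in sub_queries:
        for chunk_id, text in retrieve(sub_query, k):
            if chunk_id not in seen:    # skip chunks already collected by another sub-query
                seen.add(chunk_id)
                merged.append(text)
    return merged
```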
Top-K retrieved chunks often contain large sections irrelevant to the specific query. Contextual compression passes each retrieved chunk through an LLM (or a smaller extractor model) to extract only the sentences relevant to the query. Reduces context length, lowers hallucination risk, and keeps the LLM focused on relevant evidence.
Standard RAG always retrieves — even for simple queries that don't need it. Self-RAG trains the LLM with special reflection tokens: [Retrieve] (should I retrieve?), [IsRel] (is this passage relevant?), [IsSup] (does my response cite this passage faithfully?), [IsUse] (is this response useful overall?). The model decides when retrieval helps, critiques the relevance of retrieved passages, and verifies that its output is actually supported by what it retrieved.
NVIDIA's GPU-accelerated library for Approximate Nearest Neighbor (ANN) search. Supports CAGRA (graph-based), IVF-Flat, IVF-PQ, and BRUTE-FORCE index types. Delivers 10–100× speedup over CPU FAISS for billion-scale vector indices. Integrated into NVIDIA NeMo Retriever and usable standalone via Python/C++ APIs.
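cuVS's own Python API is not reproduced here; as a stand-in, the sketch below shows the same build-then-search ANN pattern with CPU FAISS and an IVF-Flat index (one of the index families listed above). The cluster count and nprobe values are illustrative.

```python
import faiss
import numpy as np

dim, n_docs = 768, 100_000
vectors = np.random.rand(n_docs, dim).astype("float32")   # stand-in for real embeddings

quantizer = faiss.IndexFlatL2(dim)                  # coarse quantizer for IVF
index = faiss.IndexIVFFlat(quantizer, dim, 1024)    # 1024 coarse clusters (illustrative)
index.train(vectors)                                # learn the clustering
index.add(vectors)

index.nprobe = 16                                   # clusters scanned per query: recall/speed knob
query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)             # approximate top-5 neighbors
```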
NVIDIA NeMo Retriever provides a production-grade RAG microservice: GPU-accelerated document ingestion, chunking, embedding (via NIM endpoints), vector indexing (cuVS), and retrieval APIs. Designed for enterprise deployment with support for multi-modal documents (text, tables, images) and integration with enterprise data sources.
NVIDIA NIM provides optimized inference endpoints for embedding models (E5-Large, NV-Embed, BGE, etc.) running on GPU with TensorRT-LLM optimization. NIM embedding endpoints produce high-throughput, low-latency embeddings at scale — critical for both batch indexing and real-time query embedding in production RAG pipelines.
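A hedged request sketch assuming a NIM embedding microservice that exposes the OpenAI-compatible /v1/embeddings route; the URL, model identifier, and input_type field are assumptions to confirm against your deployment's documentation.

```python
import requests

NIM_EMBEDDING_URL = "http://localhost:8000/v1/embeddings"   # assumed local NIM endpoint

response = requests.post(
    NIM_EMBEDDING_URL,
    json={
        "model": "nvidia/nv-embedqa-e5-v5",                  # assumed model identifier
        "input": ["What is the NVLink bandwidth of the A100?"],
        "input_type": "query",                               # typically "passage" when indexing documents
    },
    timeout=30,
)
embedding = response.json()["data"][0]["embedding"]          # one vector per input string
```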
A reference architecture that wires together: NeMo Retriever (ingestion + retrieval) + cuVS (vector search) + NIM LLM (generation) + NeMo Guardrails (safety). Deployed as a set of containerized microservices. Provides a production-ready starting point for enterprise RAG that avoids manual integration of each component.
| Concept | Option A | Option B | When to Choose |
|---|---|---|---|
| Fixed-Size vs Semantic Chunking | Fixed-size chunking — Split at N tokens. Simple, fast, deterministic. May split sentences mid-thought. | Semantic chunking — Split where embedding cosine similarity drops between adjacent sentences. Preserves conceptual units. | Fixed-size as default baseline for speed; semantic chunking when retrieval quality is critical and indexing time allows |
| Chunk Size Trade-off | Small chunks (128–256 tokens) — Higher retrieval precision, pinpoints specific facts. Less context per chunk. | Large chunks (512–1024 tokens) — More surrounding context, better for synthesis. Lower precision, more noise in retrieved passage. | Small for precise fact retrieval (Q&A, citation); large for tasks requiring contextual synthesis (summarization, analysis) |
| Chunk Overlap | No overlap — Minimal index size. Risk: context split across chunk boundaries and lost. | Overlap (10–20%) — Adjacent chunks share 50–100 tokens. Boundary context preserved. Index size increases proportionally. | Always use overlap for production; 10–20% is the standard starting point. Only skip overlap if index storage is severely constrained. |
| RAG vs Fine-Tuning for Knowledge | RAG — Knowledge stored externally. Updatable without retraining. Provides citable sources. Higher latency per query (retrieval overhead). | Fine-tuning — Knowledge baked into weights. No retrieval latency. Cannot be updated without retraining. No built-in citations. | RAG for dynamic, updatable, or verifiable knowledge; fine-tuning for stable domain style/format; combine both for best results |
| Single-Stage vs Multi-Stage Retrieval | Single-stage — One retrieval pass, take Top-K results directly. Fast, simple, less precise. | Multi-stage — First pass retrieves Top-50 (fast), second pass re-ranks to Top-5 (accurate cross-encoder). Higher latency, much better precision. | Single-stage for latency-critical simple queries; multi-stage for precision-critical applications where retrieval quality drives answer accuracy |
| Dense vs Sparse Retrieval | Dense (semantic) — Embedding model converts text to vectors. Cosine similarity search. Captures meaning. GPU-accelerated. | Sparse (keyword) — BM25/TF-IDF. Term frequency scoring. Exact keyword match. CPU-efficient, no embeddings needed. | Dense for semantic queries; sparse for precise keyword matching (product codes, names, technical terms); hybrid for production |
| Cosine Similarity vs Dot Product | Cosine similarity — Measures angle between vectors, normalized by magnitude. Scale-invariant. | Dot product — Measures projection of one vector onto another. Faster computation. Equivalent to cosine for normalized (unit) embeddings. | Use dot product when embeddings are L2-normalized (most modern embedding models do this) — same result as cosine, faster compute |
| ANN vs Exact Search | Approximate Nearest Neighbor (ANN) — Fast, sublinear time. Small recall tradeoff (typically 95–99%). Required at million+ scale. | Exact search (brute force) — 100% recall, O(n) time complexity. Only feasible for <100K vectors. | Exact search for small corpora during development; ANN (FAISS, cuVS CAGRA) for production at scale |
| Bi-encoder vs Cross-encoder | Bi-encoder — Encodes query and document independently. Fast, scales to millions of docs. Lower precision. | Cross-encoder — Takes (query, document) pair jointly. Much higher precision. ~100× slower — use only for re-ranking Top-K. | Bi-encoder for first-pass retrieval over the full corpus; cross-encoder for re-ranking the Top-20–50 results only |
| Top-K vs MMR Retrieval | Top-K — Return the K most similar chunks. May return near-duplicate passages covering the same content. | MMR (Maximum Marginal Relevance) — Balances relevance and diversity. Penalizes documents too similar to already-selected ones (see the sketch after this table). | Top-K when all passages should be highly focused on the query; MMR when query covers multiple aspects and diverse coverage matters |
| BM25 vs TF-IDF | BM25 — Extends TF-IDF with document length normalization and TF saturation. The gold standard sparse retrieval method. | TF-IDF — Simple term frequency × inverse document frequency. Older baseline. BM25 outperforms in virtually all benchmarks. | Always prefer BM25 over TF-IDF for new projects. TF-IDF only relevant when discussing historical context or working with legacy systems. |
| Naive RAG vs Advanced RAG | Naive RAG — Fixed chunking → embed → retrieve Top-K → inject all into prompt → generate. Simple but brittle. | Advanced RAG — Query transformation + hybrid retrieval + re-ranking + contextual compression + iterative/adaptive retrieval. | Naive RAG for prototyping; Advanced RAG for production where answer quality drives user value |
| HyDE vs Multi-Query Retrieval | HyDE — Generates a hypothetical answer and embeds it for retrieval. Bridges query-document embedding gap. | Multi-query — Rephrases the query N different ways, retrieves for each, merges results. Broader coverage of the question space. | HyDE for improving recall on semantic mismatches; multi-query for complex questions with multiple facets or ambiguous wording |
| Self-RAG vs CRAG | Self-RAG — LLM uses special tokens to decide when to retrieve and to self-critique retrieved passage relevance and response faithfulness. | CRAG (Corrective RAG) — Evaluates retrieval quality; if poor, triggers a web search fallback to find better sources before generating. | Self-RAG for models that need adaptive retrieval decisions; CRAG for systems where retrieval from static index may be insufficient |
| Contextual Compression vs Full Chunks | Full chunks — Inject entire retrieved chunks into context. Simple. May include irrelevant sentences that dilute precision. | Contextual compression — Extract only query-relevant sentences from each chunk before injecting. Shorter context, less noise, better answer focus. | Full chunks for simple Q&A; contextual compression for long documents, complex queries, or when context window is a constraint |
| Agentic RAG vs Single-Pass RAG | Single-pass RAG — One retrieval → one generation. Fast, deterministic. Fails for multi-hop questions. | Agentic RAG — LLM plans multiple retrieval steps, evaluates intermediate results, and iterates until the question is fully answered. | Single-pass for factoid questions; Agentic RAG for multi-hop reasoning (e.g., "Who founded the company that makes the GPU used in the DGX H100?") |
| cuVS vs CPU FAISS | cuVS (GPU) — CUDA-accelerated ANN search. CAGRA graph index. 10–100× faster than CPU. Requires NVIDIA GPU. | CPU FAISS — Meta's open-source CPU ANN library. No GPU required. Sufficient for <10M vectors. Slower at large scale. | CPU FAISS for development and small corpora; cuVS for production workloads with >10M vectors or strict latency requirements |
| NeMo Retriever vs LangChain RAG | NVIDIA NeMo Retriever — Production-grade, GPU-optimized, enterprise-focused. Integrated cuVS + NIM. Multi-modal support. | LangChain RAG — Developer-friendly Python framework. Broad integrations. CPU-default. Easier for rapid prototyping. | LangChain for fast prototyping and flexibility; NeMo Retriever for production-grade GPU-accelerated enterprise deployment |
| NIM Embedding Models | NV-Embed-QA — NVIDIA-optimized embedding model specifically trained for question-answering retrieval tasks. High accuracy for RAG. | Generic embeddings (E5, BGE) — Broad-purpose embedding models. Strong baselines but not RAG-specialized. | NV-Embed-QA as the default for NVIDIA NIM RAG pipelines; evaluate against domain-specific embedding models for specialized corpora |
| Milvus / Weaviate vs cuVS | Milvus / Weaviate (open-source) — Full-featured vector database with filtering, metadata, and multi-tenancy. Self-hosted or cloud. | cuVS (NVIDIA) — GPU-accelerated ANN library (not a full DB). Highest throughput. Best embedded into NVIDIA stack. | cuVS when maximum ANN throughput matters and you control the pipeline; Milvus/Weaviate for full vector DB features (filtering, metadata, multi-tenancy) |
| NVIDIA RAG Blueprint vs Custom Pipeline | NVIDIA RAG Blueprint — Reference architecture. Pre-wired: NeMo Retriever + cuVS + NIM + Guardrails. Production-ready starting point. | Custom RAG pipeline — Build each component individually. More flexibility. Higher integration effort and maintenance burden. | NVIDIA RAG Blueprint as the starting point for NVIDIA-stack deployments; custom pipeline when specific components (e.g., proprietary vector DB) are required |
| Sparse + Dense Index Storage | Separate indices — Maintain distinct BM25 and vector indices. Query both at retrieval time. Higher storage, simpler architecture. | Unified hybrid store — Systems like Weaviate support both sparse and dense in one index. Single query path. Operationally simpler. | Separate indices for maximum control over each retrieval method; unified store when operational simplicity is prioritized |
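The MMR selection referenced in the table above is short enough to sketch directly; this version assumes pre-computed, L2-normalized query and document vectors, with lambda_ trading off relevance against diversity.

```python
import numpy as np

def mmr(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5, lambda_: float = 0.7) -> list[int]:
    """Greedy Maximum Marginal Relevance over normalized vectors; returns selected doc indices."""
    relevance = doc_vecs @ query_vec                       # cosine similarity to the query
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        if not selected:
            best = candidates[int(np.argmax(relevance[candidates]))]
        else:
            redundancy = (doc_vecs[candidates] @ doc_vecs[selected].T).max(axis=1)
            scores = lambda_ * relevance[candidates] - (1 - lambda_) * redundancy
            best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected
```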
A professional services firm has 50,000 internal policy documents, contracts, and research reports. Employees need accurate answers to questions like "What is our liability cap for software contracts under $100K?" Standard LLM answers hallucinate specific clause details.
Solution: Full RAG pipeline with recursive chunking and dense retrieval.
A hardware company's RAG-powered support chatbot retrieves the wrong documents 28% of the time. Users asking "NVIDIA A100 NVLink bandwidth" get generic GPU pages instead of the specific NVLink spec sheet. Dense-only retrieval misses the exact technical identifier.
Solution: Hybrid retrieval + cross-encoder re-ranking.
A medical research platform uses RAG over 2M PubMed abstracts. Clinicians ask complex queries like "What are the cardiac safety concerns for second-generation antipsychotics in elderly patients with existing arrhythmias?" Dense retrieval returns generic cardiology papers instead of the specific intersection.
Solution: HyDE + multi-query + contextual compression pipeline.
A Fortune 500 company needs a RAG system over 5 million internal documents (10 billion tokens) with sub-200ms P95 query latency. CPU-based retrieval takes 800ms at this scale. The existing LangChain prototype is too slow for production SLAs.
Solution: Full NVIDIA RAG Blueprint deployment.
Index time (offline): Chunk → Embed → Store in vector DB
Query time (online): Embed query → Retrieve Top-K → Inject into prompt → Generate
Same embedding model must be used for both phases.
Smaller chunks: more precise retrieval, less context per chunk
Larger chunks: more context, more noise, less precision
Overlap (10–20%): prevents boundary context loss
Recursive: best default — respects paragraph → sentence → word structure
Dense: embeddings + cosine similarity → semantic meaning
Sparse: BM25/TF-IDF → exact keyword match
Hybrid: both + Reciprocal Rank Fusion → best of both worlds
Hybrid consistently outperforms either alone in production.
Bi-encoder: encodes query and doc separately → fast, scales to millions, used for first-pass retrieval
Cross-encoder: takes (query + doc) jointly → 100× slower but much more accurate → used for re-ranking Top-20–50 only
Problem: User query is short/conversational; indexed docs are long/technical → embedding spaces don't align
HyDE fix: LLM generates a hypothetical answer → embed that for retrieval instead of the raw query → hypothetical doc is closer to indexed content → better recall
[Retrieve] → should I retrieve at all?
[IsRel] → is this passage relevant to the query?
[IsSup] → does my response faithfully cite this passage?
[IsUse] → is the overall response useful?
LLM uses these to self-critique instead of always blindly retrieving.
cuVS = CUDA Vector Search — NVIDIA's GPU-accelerated ANN library
Supports: CAGRA (graph-based), IVF-Flat, IVF-PQ
Speed: 10–100× faster than CPU FAISS at billion-scale
Used inside NVIDIA NeMo Retriever for production RAG
cos(q, d) = (q · d) / (‖q‖ × ‖d‖)
Range: −1 (opposite) to +1 (identical)
For normalized embeddings: dot product = cosine → use dot product (faster)
>0.85 = high relevance · 0.7–0.85 = moderate · <0.7 = likely irrelevant