Master RAG pipeline architecture, vector embeddings, chunking strategies, hybrid retrieval, re-ranking, HyDE, Self-RAG, and NVIDIA NeMo Retriever for the NCA-GENL certification.
A RAG pipeline has two phases: at index time, documents are chunked, embedded into vectors, and stored in a vector database. At query time, the query is embedded, similar vectors are retrieved, and the retrieved text is injected into the LLM context before generation.
Dense retrieval uses neural embeddings for semantic similarity. Sparse retrieval uses BM25/TF-IDF for keyword precision. Hybrid combines both. Re-ranking uses a cross-encoder to re-score the top-K results for final precision — the most impactful single improvement to retrieval quality.
Advanced RAG addresses naive RAG's failure modes: HyDE closes the query-document embedding gap; query decomposition breaks complex questions into sub-queries; contextual compression reduces noise in retrieved chunks; Self-RAG adds adaptive retrieval and self-critique; Agentic RAG enables multi-hop reasoning across multiple retrieval steps.
NVIDIA's RAG stack: NeMo Retriever provides GPU-accelerated embedding and retrieval services; cuVS (CUDA Vector Search) delivers high-throughput approximate nearest neighbor (ANN) search; NIM microservices expose optimized embedding model endpoints; the NVIDIA RAG Blueprint wires the full pipeline end-to-end.
RAG has an offline indexing phase (run once, or on document updates) and an online query phase (run for every user request). Understanding both is essential for the exam.
Chunking is the process of splitting source documents into segments before embedding. Chunk size, overlap, and strategy have a dramatic effect on retrieval quality and generation accuracy.
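As a concrete illustration, here is a minimal fixed-size chunking sketch with overlap. It approximates tokens with whitespace-split words for simplicity, and the 512/64 sizes are illustrative defaults rather than recommendations from this guide; production chunkers count real tokenizer tokens and respect document structure (the recursive strategy covered later).

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping fixed-size chunks (word-based approximation of tokens)."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the document
    return chunks
```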
Documents and queries are embedded into high-dimensional vectors (768–4096 dims). Retrieval = nearest neighbor search by cosine similarity or dot product. Captures semantic meaning — "automobile" matches "car".
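A minimal sketch of the nearest-neighbor step with placeholder vectors; the embedding model that produces them (a sentence-transformers checkpoint or a NIM endpoint) is out of scope here.

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k documents most similar to the query by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                    # cosine similarity per document
    return np.argsort(-scores)[:k]    # indices of the top-k matches
```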
Documents represented as sparse term-frequency vectors. BM25 scores based on term frequency (TF), inverse document frequency (IDF), and document length normalization. Exact keyword matching — "NVIDIA A100" will match "NVIDIA A100".
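A small sparse-retrieval sketch using the third-party rank_bm25 package (an assumption; any BM25 implementation follows the same shape): score each document against the tokenized query and take the highest scorers.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25 (assumed available)

corpus = [
    "NVIDIA A100 NVLink bandwidth is 600 GB/s",
    "An overview of GPU memory hierarchies",
    "BM25 is a sparse retrieval scoring function",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "nvidia a100 nvlink bandwidth".split()
scores = bm25.get_scores(query)   # one BM25 score per document
best = scores.argmax()            # the exact-keyword document wins
```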
Run both dense and sparse retrieval in parallel, then combine ranked result lists using Reciprocal Rank Fusion (RRF): each document gets score = Σ 1/(k + rank_i) across all rankers. The hybrid consistently outperforms either method alone — especially on queries that mix semantic meaning with precise terminology (e.g., "side effects of atorvastatin in elderly patients").
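The fusion step itself is only a few lines; the sketch below applies the RRF formula to two ranked ID lists (k = 60 is the conventional smoothing constant, used here as an illustrative default).

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked doc-ID lists: each doc scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

dense_hits  = ["doc7", "doc2", "doc9"]   # e.g. from vector search
sparse_hits = ["doc2", "doc7", "doc4"]   # e.g. from BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])  # doc2 and doc7 rise to the top
```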
Bi-encoders (used for vector retrieval) embed query and document independently — fast but imprecise. Cross-encoders take the (query, document) pair together and score their relevance jointly — much more accurate but 100× slower. Strategy: retrieve Top-50 with fast bi-encoder, then re-rank with cross-encoder, keep Top-5 for the LLM context. Typical precision gain: +10–20%.
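A sketch of the two-stage pattern with the sentence-transformers library; the checkpoint names are illustrative open models, not ones prescribed by this guide, and the corpus is reduced to three sentences so the whole thing runs in memory.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "The A100 uses third-generation NVLink at 600 GB/s.",
    "LoRA reduces the number of trainable parameters.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
]
query = "How fast is NVLink on the A100?"

# Stage 1: bi-encoder retrieves a wide candidate set (Top-50 in production).
doc_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
candidates = util.semantic_search(query_emb, doc_emb, top_k=3)[0]

# Stage 2: cross-encoder re-scores each (query, document) pair jointly.
pairs = [(query, corpus[hit["corpus_id"]]) for hit in candidates]
rerank_scores = cross_encoder.predict(pairs)   # keep only the top few for the LLM context
```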
Problem: user query "What causes hallucinations in LLMs?" is short and conversational, while indexed documents are long and technical — their embedding spaces don't align well. HyDE fix: use the LLM to generate a hypothetical answer paragraph (as if it were a document), then embed that for retrieval. The hypothetical document is stylistically and topically closer to indexed content, dramatically improving recall for complex queries.
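A HyDE sketch under stated assumptions: generate_hypothetical_answer is a placeholder for whatever LLM call you use (a NIM chat endpoint, an OpenAI-compatible API, a local model), and the embedding model name is illustrative.

```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # illustrative model

def generate_hypothetical_answer(query: str) -> str:
    # Placeholder for an LLM call such as:
    #   "Write a short passage that answers the question: {query}"
    return ("Hallucinations in LLMs arise when the model fills knowledge gaps with "
            "fluent but unsupported text, often due to training data limitations "
            "and decoding that rewards plausibility over factuality.")

query = "What causes hallucinations in LLMs?"
hypothetical_doc = generate_hypothetical_answer(query)

# Embed the hypothetical document instead of the raw query, then run the
# usual vector search with this embedding.
hyde_vector = embedder.encode(hypothetical_doc, normalize_embeddings=True)
```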
For complex multi-part questions, generate N rephrased or decomposed sub-queries, retrieve for each independently, then de-duplicate and merge all retrieved chunks. Covers different facets of the question that a single query embedding would miss. E.g., "Compare LoRA and QLoRA in terms of memory and accuracy" → two sub-queries: "LoRA memory efficiency" + "QLoRA vs LoRA accuracy comparison".
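A minimal decompose-retrieve-merge sketch. Both decompose (an LLM call that produces sub-queries) and retrieve (a call into the vector index returning (chunk_id, text) pairs) are hypothetical placeholders, not library functions.

```python
def multi_query_retrieve(query: str, decompose, retrieve, k: int = 5) -> list[str]:
    """Retrieve for each LLM-generated sub-query, de-duplicate by chunk ID, and merge."""
    sub_queries = decompose(query)      # e.g. ["LoRA memory efficiency", "QLoRA vs LoRA accuracy comparison"]
    seen: set[str] = set()
    merged: list[str] = []
    for sub_query in sub_queries:
        for chunk_id, text in retrieve(sub_query, k):
            if chunk_id not in seen:    # skip chunks already collected by another sub-query
                seen.add(chunk_id)
                merged.append(text)
    return merged
```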
Top-K retrieved chunks often contain large sections irrelevant to the specific query. Contextual compression passes each retrieved chunk through an LLM (or a smaller extractor model) to extract only the sentences relevant to the query. Reduces context length, lowers hallucination risk, and keeps the LLM focused on relevant evidence.
Standard RAG always retrieves — even for simple queries that don't need it. Self-RAG trains the LLM with special reflection tokens: [Retrieve] (should I retrieve?), [IsRel] (is this passage relevant?), [IsSup] (does my response cite this passage faithfully?), [IsUse] (is this response useful overall?). The model decides when retrieval helps, critiques the relevance of retrieved passages, and verifies that its output is actually supported by what it retrieved.
NVIDIA's GPU-accelerated library for Approximate Nearest Neighbor (ANN) search. Supports CAGRA (graph-based), IVF-Flat, IVF-PQ, and BRUTE-FORCE index types. Delivers 10–100× speedup over CPU FAISS for billion-scale vector indices. Integrated into NVIDIA NeMo Retriever and usable standalone via Python/C++ APIs.
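cuVS's own Python API is not reproduced here; as a stand-in, the sketch below shows the same build-then-search ANN pattern with CPU FAISS and an IVF-Flat index (one of the index families listed above). The cluster count and nprobe values are illustrative.

```python
import faiss
import numpy as np

dim, n_docs = 768, 100_000
vectors = np.random.rand(n_docs, dim).astype("float32")   # stand-in for real embeddings

quantizer = faiss.IndexFlatL2(dim)                  # coarse quantizer for IVF
index = faiss.IndexIVFFlat(quantizer, dim, 1024)    # 1024 coarse clusters (illustrative)
index.train(vectors)                                # learn the clustering
index.add(vectors)

index.nprobe = 16                                   # clusters scanned per query: recall/speed knob
query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)             # approximate top-5 neighbors
```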
NVIDIA NeMo Retriever provides a production-grade RAG microservice: GPU-accelerated document ingestion, chunking, embedding (via NIM endpoints), vector indexing (cuVS), and retrieval APIs. Designed for enterprise deployment with support for multi-modal documents (text, tables, images) and integration with enterprise data sources.
NVIDIA NIM provides optimized inference endpoints for embedding models (E5-Large, NV-Embed, BGE, etc.) running on GPU with TensorRT-LLM optimization. NIM embedding endpoints produce high-throughput, low-latency embeddings at scale — critical for both batch indexing and real-time query embedding in production RAG pipelines.
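A hedged request sketch assuming a NIM embedding microservice that exposes the OpenAI-compatible /v1/embeddings route; the URL, model identifier, and input_type field are assumptions to confirm against your deployment's documentation.

```python
import requests

NIM_EMBEDDING_URL = "http://localhost:8000/v1/embeddings"   # assumed local NIM endpoint

response = requests.post(
    NIM_EMBEDDING_URL,
    json={
        "model": "nvidia/nv-embedqa-e5-v5",                  # assumed model identifier
        "input": ["What is the NVLink bandwidth of the A100?"],
        "input_type": "query",                               # typically "passage" when indexing documents
    },
    timeout=30,
)
embedding = response.json()["data"][0]["embedding"]          # one vector per input string
```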
A reference architecture that wires together: NeMo Retriever (ingestion + retrieval) + cuVS (vector search) + NIM LLM (generation) + NeMo Guardrails (safety). Deployed as a set of containerized microservices. Provides a production-ready starting point for enterprise RAG that avoids manual integration of each component.
| Concept | Option A | Option B | When to Choose |
|---|---|---|---|
| Fixed-Size vs Semantic Chunking | Fixed-size chunking — Split at N tokens. Simple, fast, deterministic. May split sentences mid-thought. | Semantic chunking — Split where embedding cosine similarity drops between adjacent sentences. Preserves conceptual units. | Fixed-size as default baseline for speed; semantic chunking when retrieval quality is critical and indexing time allows |
| Chunk Size Trade-off | Small chunks (128–256 tokens) — Higher retrieval precision, pinpoints specific facts. Less context per chunk. | Large chunks (512–1024 tokens) — More surrounding context, better for synthesis. Lower precision, more noise in retrieved passage. | Small for precise fact retrieval (Q&A, citation); large for tasks requiring contextual synthesis (summarization, analysis) |
| Chunk Overlap | No overlap — Minimal index size. Risk: context split across chunk boundaries and lost. | Overlap (10–20%) — Adjacent chunks share 50–100 tokens. Boundary context preserved. Index size increases proportionally. | Always use overlap for production; 10–20% is the standard starting point. Only skip overlap if index storage is severely constrained. |
| RAG vs Fine-Tuning for Knowledge | RAG — Knowledge stored externally. Updatable without retraining. Provides citable sources. Higher latency per query (retrieval overhead). | Fine-tuning — Knowledge baked into weights. No retrieval latency. Cannot be updated without retraining. No built-in citations. | RAG for dynamic, updatable, or verifiable knowledge; fine-tuning for stable domain style/format; combine both for best results |
| Single-Stage vs Multi-Stage Retrieval | Single-stage — One retrieval pass, take Top-K results directly. Fast, simple, less precise. | Multi-stage — First pass retrieves Top-50 (fast), second pass re-ranks to Top-5 (accurate cross-encoder). Higher latency, much better precision. | Single-stage for latency-critical simple queries; multi-stage for precision-critical applications where retrieval quality drives answer accuracy |
| Dense vs Sparse Retrieval | Dense (semantic) — Embedding model converts text to vectors. Cosine similarity search. Captures meaning. GPU-accelerated. | Sparse (keyword) — BM25/TF-IDF. Term frequency scoring. Exact keyword match. CPU-efficient, no embeddings needed. | Dense for semantic queries; sparse for precise keyword matching (product codes, names, technical terms); hybrid for production |
| Cosine Similarity vs Dot Product | Cosine similarity — Measures angle between vectors, normalized by magnitude. Scale-invariant. | Dot product — Measures projection of one vector onto another. Faster computation. Equivalent to cosine for normalized (unit) embeddings. | Use dot product when embeddings are L2-normalized (most modern embedding models do this) — same result as cosine, faster compute |
| ANN vs Exact Search | Approximate Nearest Neighbor (ANN) — Fast, sublinear time. Small recall tradeoff (typically 95–99%). Required at million+ scale. | Exact search (brute force) — 100% recall, O(n) time complexity. Only feasible for <100K vectors. | Exact search for small corpora during development; ANN (FAISS, cuVS CAGRA) for production at scale |
| Bi-encoder vs Cross-encoder | Bi-encoder — Encodes query and document independently. Fast, scales to millions of docs. Lower precision. | Cross-encoder — Takes (query, document) pair jointly. Much higher precision. ~100× slower — use only for re-ranking Top-K. | Bi-encoder for first-pass retrieval over the full corpus; cross-encoder for re-ranking the Top-20–50 results only |
| Top-K vs MMR Retrieval | Top-K — Return the K most similar chunks. May return near-duplicate passages covering the same content. | MMR (Maximum Marginal Relevance) — Balances relevance and diversity. Penalizes documents too similar to already-selected ones (see the sketch after this table). | Top-K when all passages should be highly focused on the query; MMR when query covers multiple aspects and diverse coverage matters |
| BM25 vs TF-IDF | BM25 — Extends TF-IDF with document length normalization and TF saturation. The gold standard sparse retrieval method. | TF-IDF — Simple term frequency × inverse document frequency. Older baseline. BM25 outperforms in virtually all benchmarks. | Always prefer BM25 over TF-IDF for new projects. TF-IDF only relevant when discussing historical context or working with legacy systems. |
| Naive RAG vs Advanced RAG | Naive RAG — Fixed chunking → embed → retrieve Top-K → inject all into prompt → generate. Simple but brittle. | Advanced RAG — Query transformation + hybrid retrieval + re-ranking + contextual compression + iterative/adaptive retrieval. | Naive RAG for prototyping; Advanced RAG for production where answer quality drives user value |
| HyDE vs Multi-Query Retrieval | HyDE — Generates a hypothetical answer and embeds it for retrieval. Bridges query-document embedding gap. | Multi-query — Rephrases the query N different ways, retrieves for each, merges results. Broader coverage of the question space. | HyDE for improving recall on semantic mismatches; multi-query for complex questions with multiple facets or ambiguous wording |
| Self-RAG vs CRAG | Self-RAG — LLM uses special tokens to decide when to retrieve and to self-critique retrieved passage relevance and response faithfulness. | CRAG (Corrective RAG) — Evaluates retrieval quality; if poor, triggers a web search fallback to find better sources before generating. | Self-RAG for models that need adaptive retrieval decisions; CRAG for systems where retrieval from static index may be insufficient |
| Contextual Compression vs Full Chunks | Full chunks — Inject entire retrieved chunks into context. Simple. May include irrelevant sentences that dilute precision. | Contextual compression — Extract only query-relevant sentences from each chunk before injecting. Shorter context, less noise, better answer focus. | Full chunks for simple Q&A; contextual compression for long documents, complex queries, or when context window is a constraint |
| Agentic RAG vs Single-Pass RAG | Single-pass RAG — One retrieval → one generation. Fast, deterministic. Fails for multi-hop questions. | Agentic RAG — LLM plans multiple retrieval steps, evaluates intermediate results, and iterates until the question is fully answered. | Single-pass for factoid questions; Agentic RAG for multi-hop reasoning (e.g., "Who founded the company that makes the GPU used in the DGX H100?") |
| cuVS vs CPU FAISS | cuVS (GPU) — CUDA-accelerated ANN search. CAGRA graph index. 10–100× faster than CPU. Requires NVIDIA GPU. | CPU FAISS — Meta's open-source CPU ANN library. No GPU required. Sufficient for <10M vectors. Slower at large scale. | CPU FAISS for development and small corpora; cuVS for production workloads with >10M vectors or strict latency requirements |
| NeMo Retriever vs LangChain RAG | NVIDIA NeMo Retriever — Production-grade, GPU-optimized, enterprise-focused. Integrated cuVS + NIM. Multi-modal support. | LangChain RAG — Developer-friendly Python framework. Broad integrations. CPU-default. Easier for rapid prototyping. | LangChain for fast prototyping and flexibility; NeMo Retriever for production-grade GPU-accelerated enterprise deployment |
| NIM Embedding Models | NV-Embed-QA — NVIDIA-optimized embedding model specifically trained for question-answering retrieval tasks. High accuracy for RAG. | Generic embeddings (E5, BGE) — Broad-purpose embedding models. Strong baselines but not RAG-specialized. | NV-Embed-QA as the default for NVIDIA NIM RAG pipelines; evaluate against domain-specific embedding models for specialized corpora |
| Milvus / Weaviate vs cuVS | Milvus / Weaviate (open-source) — Full-featured vector database with filtering, metadata, and multi-tenancy. Self-hosted or cloud. | cuVS (NVIDIA) — GPU-accelerated ANN library (not a full DB). Highest throughput. Best embedded into NVIDIA stack. | cuVS when maximum ANN throughput matters and you control the pipeline; Milvus/Weaviate for full vector DB features (filtering, metadata, multi-tenancy) |
| NVIDIA RAG Blueprint vs Custom Pipeline | NVIDIA RAG Blueprint — Reference architecture. Pre-wired: NeMo Retriever + cuVS + NIM + Guardrails. Production-ready starting point. | Custom RAG pipeline — Build each component individually. More flexibility. Higher integration effort and maintenance burden. | NVIDIA RAG Blueprint as the starting point for NVIDIA-stack deployments; custom pipeline when specific components (e.g., proprietary vector DB) are required |
| Sparse + Dense Index Storage | Separate indices — Maintain distinct BM25 and vector indices. Query both at retrieval time. Higher storage, simpler architecture. | Unified hybrid store — Systems like Weaviate support both sparse and dense in one index. Single query path. Operationally simpler. | Separate indices for maximum control over each retrieval method; unified store when operational simplicity is prioritized |
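The MMR selection referenced in the table above is short enough to sketch directly; this version assumes pre-computed, L2-normalized query and document vectors, with lambda_ trading off relevance against diversity.

```python
import numpy as np

def mmr(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5, lambda_: float = 0.7) -> list[int]:
    """Greedy Maximum Marginal Relevance over normalized vectors; returns selected doc indices."""
    relevance = doc_vecs @ query_vec                       # cosine similarity to the query
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        if not selected:
            best = candidates[int(np.argmax(relevance[candidates]))]
        else:
            redundancy = (doc_vecs[candidates] @ doc_vecs[selected].T).max(axis=1)
            scores = lambda_ * relevance[candidates] - (1 - lambda_) * redundancy
            best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected
```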
A professional services firm has 50,000 internal policy documents, contracts, and research reports. Employees need accurate answers to questions like "What is our liability cap for software contracts under $100K?" Standard LLM answers hallucinate specific clause details.
Solution: Full RAG pipeline with recursive chunking and dense retrieval.
A hardware company's RAG-powered support chatbot retrieves the wrong documents 28% of the time. Users asking "NVIDIA A100 NVLink bandwidth" get generic GPU pages instead of the specific NVLink spec sheet. Dense-only retrieval misses the exact technical identifier.
Solution: Hybrid retrieval + cross-encoder re-ranking.
A medical research platform uses RAG over 2M PubMed abstracts. Clinicians ask complex queries like "What are the cardiac safety concerns for second-generation antipsychotics in elderly patients with existing arrhythmias?" Dense retrieval returns generic cardiology papers instead of the specific intersection.
Solution: HyDE + multi-query + contextual compression pipeline.
A Fortune 500 company needs a RAG system over 5 million internal documents (10 billion tokens) with sub-200ms P95 query latency. CPU-based retrieval takes 800ms at this scale. The existing LangChain prototype is too slow for production SLAs.
Solution: Full NVIDIA RAG Blueprint deployment.
Index time (offline): Chunk → Embed → Store in vector DB
Query time (online): Embed query → Retrieve Top-K → Inject into prompt → Generate
Same embedding model must be used for both phases.
Smaller chunks: more precise retrieval, less context per chunk
Larger chunks: more context, more noise, less precision
Overlap (10–20%): prevents boundary context loss
Recursive: best default — respects paragraph → sentence → word structure
Dense: embeddings + cosine similarity → semantic meaning
Sparse: BM25/TF-IDF → exact keyword match
Hybrid: both + Reciprocal Rank Fusion → best of both worlds
Hybrid consistently outperforms either alone in production.
Bi-encoder: encodes query and doc separately → fast, scales to millions, used for first-pass retrieval
Cross-encoder: takes (query + doc) jointly → 100× slower but much more accurate → used for re-ranking Top-20–50 only
Problem: User query is short/conversational; indexed docs are long/technical → embedding spaces don't align
HyDE fix: LLM generates a hypothetical answer → embed that for retrieval instead of the raw query → hypothetical doc is closer to indexed content → better recall
[Retrieve] → should I retrieve at all?
[IsRel] → is this passage relevant to the query?
[IsSup] → does my response faithfully cite this passage?
[IsUse] → is the overall response useful?
LLM uses these to self-critique instead of always blindly retrieving.
cuVS = CUDA Vector Search — NVIDIA's GPU-accelerated ANN library
Supports: CAGRA (graph-based), IVF-Flat, IVF-PQ
Speed: 10–100× faster than CPU FAISS at billion-scale
Used inside NVIDIA NeMo Retriever for production RAG
cos(q, d) = (q · d) / (‖q‖ × ‖d‖)
Range: −1 (opposite) to +1 (identical)
For normalized embeddings: dot product = cosine → use dot product (faster)
>0.85 = high relevance · 0.7–0.85 = moderate · <0.7 = likely irrelevant