NCP-AAI · Topic 3 of 5 · 10% Exam Weight

Knowledge Integration & RAG for Agents

From document ingestion to GraphRAG — master every layer of the retrieval stack that powers production agentic AI systems.

4 retrieval modes · 4+ advanced patterns · 10 practice questions · 8 memory cards

What Is Knowledge Integration?

Agents exceed their training data by connecting to external knowledge — RAG is the primary mechanism.

Why Agents Need External Knowledge

LLMs are frozen at training time. RAG lets agents answer questions about current events, proprietary documents, and domain-specific data without expensive retraining — by retrieving relevant context and injecting it into the prompt before generating a response.

📚 RAG Foundations (~30% of topic)
Indexing pipeline, chunking strategies, embeddings, vector stores.
Key tools: FAISS, Chroma, pgvector

🔎 Retrieval Strategies (~25% of topic)
Dense vs sparse, hybrid search, query rewriting, reranking.
Key tools: BM25, RRF, cross-encoder

🧩 Advanced RAG Patterns (~25% of topic)
HyDE, self-RAG, multi-hop, CRAG, agentic retrieval.
Key tools: LangGraph, LlamaIndex

🕸️ GraphRAG & Structured Data (~20% of topic)
Knowledge graphs, entity relationships, SQL agents, NeMo Retriever.
Key tools: Neo4j, NVIDIA NIM

RAG vs Alternatives

| Approach | How It Works | Best When | Weakness |
| --- | --- | --- | --- |
| RAG | Retrieve docs at runtime, inject into prompt | Large, changing knowledge bases | Retrieval quality bottleneck |
| Fine-tuning | Update model weights on domain data | Stable knowledge, style adaptation | Expensive; can't update daily |
| Long context | Stuff entire corpus into context window | Small corpora, long-context models (Gemini 1.5 class) | Cost, latency, lost-in-the-middle effect |
| RAG + fine-tune | Fine-tune retriever + generator together | High-accuracy production systems | Complex to set up and maintain |

The RAG Pipeline

Two phases: offline indexing and online retrieval + generation.

📥 Phase 1 — Offline Indexing

Runs once (or on schedule). Prepares your knowledge base for fast retrieval.

📄 Raw documents (PDFs, HTML, DOCX, DB records) → ✂️ Chunking (split into retrievable units) → 🔢 Embedding (text-embedding model → vector) → 🗄️ Vector store (index in FAISS, pgvector, …)
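A minimal sketch of the offline phase, assuming sentence-transformers for embeddings and FAISS as the vector store; the model name and chunk parameters are illustrative choices, not exam requirements:

```python
# Offline indexing sketch: chunk -> embed -> index.
# Assumes `pip install sentence-transformers faiss-cpu`.
import faiss
from sentence_transformers import SentenceTransformer

def fixed_chunks(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Fixed-size chunking with a sliding overlap window (word-based here)."""
    words, step = text.split(), size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

docs = ["...raw document text...", "...another document..."]
chunks = [c for d in docs for c in fixed_chunks(d)]

model = SentenceTransformer("BAAI/bge-small-en-v1.5")   # any bi-encoder works
embeddings = model.encode(chunks, normalize_embeddings=True)

# Inner product on normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
```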

🔍 Phase 2 — Online Retrieval & Generation

Runs for every user query. Retrieves relevant chunks and generates a grounded response.

User query ("What is our refund policy?") → 🔢 Query embed (same embedding model) → 🎯 Vector search (top-k by cosine similarity) → 📋 Retrieved chunks (k = 3–10 most relevant) → 📝 Augment prompt (context + question → LLM) → 💬 Answer (grounded response)
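The online phase, continuing the indexing sketch above: reuse the same embedding model, search the index, and build a grounded prompt. The grounding instruction wording is an illustrative convention:

```python
# Online retrieval + generation sketch (continues the indexing example).
query = "What is our refund policy?"
q_vec = model.encode([query], normalize_embeddings=True)

scores, ids = index.search(q_vec, 5)               # top-k by cosine similarity
context = "\n\n".join(chunks[i] for i in ids[0])

prompt = (
    "Answer using ONLY the context below. If the answer is not there, "
    "say you don't know.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}"
)
# `prompt` is then sent to any chat LLM endpoint for the grounded answer.
```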

Chunking Strategies

Chunk size and method dramatically affect retrieval quality. Too small = missing context; too large = noisy.

Fixed-Size (~512 tokens, 10–15% overlap)
Split every N tokens with a sliding overlap window (e.g., 512 tokens, 50-token overlap). Simple and fast, but may cut mid-sentence.

Sentence / Paragraph (boundary-aware)
Split on sentence boundaries using spaCy or NLTK. Preserves linguistic units — good for prose documents.

Semantic (content-aware)
Embed sentences, compute cosine similarity between consecutive sentences, and split where similarity drops sharply (a topic change) — sketched below.

Hierarchical / Parent-Child (parent ≈1024 tokens, child ≈256)
Index small child chunks for retrieval precision; return the larger parent chunk to the LLM for full context. Best of both worlds.
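A minimal sketch of the semantic strategy, assuming sentence-transformers and a pre-split sentence list; the 0.6 threshold is an illustrative tuning knob:

```python
# Semantic chunking sketch: split where embedding similarity between
# consecutive sentences drops sharply (a topic change).
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(sentences: list[str], threshold: float = 0.6) -> list[str]:
    model = SentenceTransformer("BAAI/bge-small-en-v1.5")
    vecs = model.encode(sentences, normalize_embeddings=True)
    # Cosine similarity of each sentence to the next (normalized -> dot product).
    sims = (vecs[:-1] * vecs[1:]).sum(axis=1)
    chunks, current = [], [sentences[0]]
    for sent, sim in zip(sentences[1:], sims):
        if sim < threshold:            # similarity drop -> start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```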

Common Vector Stores

| Store | Type | ANN Algorithm | Best For | NVIDIA Fit |
| --- | --- | --- | --- | --- |
| FAISS | Local library | IVF-PQ, HNSW | In-process, GPU-accelerated search | GPU FAISS via RAPIDS cuVS |
| pgvector | Postgres extension | IVFFlat, HNSW | Existing Postgres stacks | Works with NeMo Retriever |
| Chroma | Embedded / server | HNSW (hnswlib) | Prototyping, LangChain default | Easy local dev |
| Weaviate | Self-hosted / cloud | HNSW + BM25 | Hybrid search out of the box | Supports NIM embedding endpoints |
| Milvus | Distributed | IVF, HNSW, DiskANN | Billion-scale production deployments | GPU-native via NVIDIA RAFT |

NVIDIA NeMo Retriever NIM

OpenAI-compatible API endpoints for embedding and reranking — drop-in replacement for OpenAI text-embedding endpoints, optimized with TensorRT on NVIDIA GPUs.

Embedding NIM: nvidia/nv-embedqa-e5-v5, nvidia/nv-embedqa-mistral-7b-v2
Reranking NIM: nvidia/nv-rerankqa-mistral-4b-v3
API endpoints: POST /v1/embeddings | POST /v1/ranking
Integrations: LangChain NVIDIAEmbeddings, LlamaIndex NVIDIARerank
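A minimal sketch of calling the embedding endpoint over HTTP, assuming a locally deployed NIM; the base URL is a placeholder, and the input_type field follows the NIM embedding API as documented, so verify against your deployment:

```python
# Calling a NeMo Retriever Embedding NIM via its OpenAI-compatible
# /v1/embeddings endpoint.
import requests

BASE_URL = "http://localhost:8000"   # placeholder: your NIM deployment URL
resp = requests.post(
    f"{BASE_URL}/v1/embeddings",
    json={
        "model": "nvidia/nv-embedqa-e5-v5",
        "input": ["What is our refund policy?"],
        "input_type": "query",       # use "passage" when indexing documents
    },
)
vector = resp.json()["data"][0]["embedding"]
```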

Retrieval Methods Compared

Choosing the right retrieval approach is one of the highest-leverage decisions in a RAG system.

Dense Semantic Retrieval
cosine_similarity(q_embed, doc_embed)
  • Captures meaning, not just keywords
  • Works across synonyms & paraphrases
  • Bi-encoder: encode query + docs separately
  • Models: E5, BGE, NeMo Retriever, text-embedding-ada-002
Best for: Conversational queries, semantic similarity tasks, multilingual search.

Sparse Keyword Retrieval
BM25(q, doc): TF·IDF-family score with term-frequency saturation and document-length normalization
  • Exact term matching — no embedding needed
  • Fast and interpretable
  • Great for rare terms, codes, identifiers
  • Tools: Elasticsearch, OpenSearch, Weaviate BM25
Best for: Product codes, names, jargon, legal citations — where exact match matters.

Hybrid Combined Search
RRF(rank_dense, rank_sparse)
  • Runs dense + sparse in parallel
  • Merges result lists with Reciprocal Rank Fusion (sketched below)
  • Higher recall than either method alone
  • Industry default for production RAG
Best for: General-purpose RAG where you can't predict the query type in advance.
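A minimal sketch of Reciprocal Rank Fusion over two ranked id lists; the document ids are illustrative:

```python
# Reciprocal Rank Fusion: merge dense and sparse ranked lists.
# k=60 is the conventional constant from the original RRF paper.
def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d7"]   # ids ranked by cosine similarity
sparse = ["d1", "d9", "d3"]   # ids ranked by BM25
print(rrf([dense, sparse]))   # d1 and d3 rise to the top
```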

🔄 The Two-Stage Retrieval Pipeline

Query (user input) → 🎯 Recall (k=100: dense + sparse, fast bi-encoder) → 🏆 Rerank (top 5: cross-encoder scores query × chunk jointly) → 📋 Final context (high-precision chunks to the LLM)

Key insight: Bi-encoders are fast (pre-compute doc embeddings) but less accurate. Cross-encoders see both query and document together — much more accurate but too slow to run on all documents. Two-stage gives you speed and precision.
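A sketch of stage 2 on top of the RRF output above, using a sentence-transformers cross-encoder; `chunk_text` (an id → text mapping) and the model choice are assumptions for illustration:

```python
# Stage 2 sketch: cross-encoder reranking of the recall-stage shortlist.
from sentence_transformers import CrossEncoder

candidates = rrf([dense, sparse])[:100]      # stage 1: high-recall shortlist
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# The cross-encoder sees query and chunk TOGETHER, scoring each pair jointly.
pairs = [(query, chunk_text[doc_id]) for doc_id in candidates]  # chunk_text: assumed id->text map
scores = reranker.predict(pairs)

# Keep only the 5 highest-scoring chunks for the final context.
reranked = sorted(zip(scores, candidates), reverse=True)
top5 = [doc_id for _, doc_id in reranked[:5]]
```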

Full RAG Concept Comparison

| Concept | RAG Foundations | Retrieval Strategy | Advanced Patterns | GraphRAG |
| --- | --- | --- | --- | --- |
| Primary goal | Index knowledge for lookup | Find the right chunks fast | Improve RAG quality | Traverse entity relationships |
| Key algorithm | Embedding + ANN index | BM25, cosine, RRF | HyDE, CRAG, self-RAG | Graph traversal (Cypher) |
| Chunk size | Fixed or semantic split | Affects recall@k | HyDE generates virtual doc | Entities, not chunks |
| Multi-hop support | ❌ Single retrieval | ❌ Single retrieval | ✅ Multi-hop chaining | ✅ Native graph traversal |
| Handles structured data | Partially (metadata filters) | Partially (SQL hybrid) | SQL agent pattern | ✅ Primary use case |
| NVIDIA tool | NeMo Retriever Embedding NIM | NeMo Retriever Reranking NIM | LangGraph + NIM LLM | Neo4j + NVIDIA GPU |
| Latency | One-time index cost | +5–50 ms reranker | +100–500 ms (extra LLM calls) | + graph query time |
| When to skip | Never — always needed | Low-latency constraints | Simple factoid questions | Flat unstructured corpus |

Advanced RAG Patterns

Standard RAG fails on complex questions — these patterns fix specific failure modes.

HyDE — Hypothetical Document Embeddings
Ask the LLM to generate a hypothetical ideal answer, then embed that answer and use it as the search vector — not the original query (sketched after this list).
💡 Use when: Query vocabulary is very different from document vocabulary (e.g., the question asks "why does X cause Y?" but the docs use passive-voice technical prose).

Self-RAG — Self-Reflective Retrieval
The model generates special tokens to decide: [Retrieve] when needed, [Relevant] or [Irrelevant] per passage, then [IsSupported] vs [Contradicts] for the final answer.
💡 Use when: You need adaptive retrieval — some queries need retrieval, others don't. Avoids unnecessary latency and citation errors.

Multi-Hop RAG — Iterative Retrieval Chaining
Break a complex question into sub-questions. Retrieve for Q1, inject the result into Q2's retrieval query, and so on — chaining until the full answer is assembled.
💡 Use when: Questions require combining facts from multiple documents, e.g., "What did the CEO who joined in 2019 say about Q3 margins?"

CRAG — Corrective RAG
After retrieval, a lightweight evaluator scores each chunk's relevance. If relevance is low, it triggers a web-search fallback and strips irrelevant content before generation.
💡 Use when: Your vector store may be out of date, or queries fall outside indexed content. Automatically falls back to live search.
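A minimal sketch of HyDE, reusing the embedding model and FAISS index from the earlier examples; `llm` (a chat-completion callable returning text) and the prompt wording are assumptions:

```python
# HyDE sketch: embed a hypothetical ANSWER instead of the raw query.
def hyde_search(query: str, llm, index, model, k: int = 5):
    hypothetical = llm(
        f"Write a short passage that would perfectly answer: {query}"
    )
    # Search with the fake answer's embedding: it closes the vocabulary gap
    # between question phrasing and document phrasing.
    vec = model.encode([hypothetical], normalize_embeddings=True)
    scores, ids = index.search(vec, k)
    return ids[0]
```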

🤖 Agentic RAG — The Agent Decides What to Retrieve

The LLM orchestrates retrieval as a tool call, choosing when to retrieve, what to search for, and how many times — unlike passive RAG which always retrieves once.

User question → LLM agent: plan retrieval steps → generate a search query → retrieve(query) → observe chunks → sufficient? If no, refine the query and retrieve again; if yes, generate the answer → Final response

Key: The agent can refine its search query based on partial results — a single fixed query is replaced by dynamic retrieval planning.
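A minimal sketch of that loop; `llm` and `retrieve` are assumed callables, and the SUFFICIENT reply convention is an illustrative protocol, not a library feature:

```python
# Agentic RAG loop sketch: the LLM decides whether to retrieve again.
def agentic_rag(question: str, llm, retrieve, max_steps: int = 3) -> str:
    notes: list[str] = []
    search_query = question
    for _ in range(max_steps):
        notes.extend(retrieve(search_query))            # retrieval as a tool call
        verdict = llm(
            "Context:\n" + "\n".join(notes) +
            f"\n\nCan you fully answer '{question}' from this context? "
            "Reply SUFFICIENT, or propose a better search query."
        )
        if verdict.strip().startswith("SUFFICIENT"):
            break
        search_query = verdict                          # refine and retry
    return llm("Context:\n" + "\n".join(notes) + f"\n\nAnswer: {question}")
```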

🕸️ GraphRAG — Knowledge Graph Retrieval

Instead of chunking text, GraphRAG extracts entities (people, orgs, products) and relationships, stores them as a graph, and traverses that graph at query time.

Standard Vector RAG
  • Splits docs into flat text chunks
  • Retrieves top-k similar chunks
  • Can't traverse relationships
  • Multi-hop requires extra calls
  • Loses entity context across chunks
GraphRAG
  • Extracts entities + relationships into graph
  • Traverses graph paths (Cypher queries)
  • Native multi-hop: A→B→C in one query
  • Community detection for summarization
  • Combines vector + graph for hybrid retrieval

Example: "Which drugs interact with medications taken by patients diagnosed with condition X in 2024?" — requires entity traversal across patients → medications → interactions → diagnoses. GraphRAG handles this natively; standard RAG requires many retrieval steps.

Query Rewriting — Often Overlooked, High Impact

Before embedding, reformulate the user query:
1. Step-back prompting — ask a broader question to retrieve more context.
2. Multi-query — generate 3 paraphrases and union the results (sketched below).
3. Sub-question decomposition — break complex questions into atomic retrievals.
These simple techniques often improve recall@k more than switching embedding models.
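A minimal sketch of the multi-query technique; `llm` and `retrieve` are assumed callables (an LLM completion function and a top-k retriever):

```python
# Multi-query rewriting sketch: paraphrase, retrieve per paraphrase, union.
def multi_query_retrieve(query: str, llm, retrieve, n: int = 3) -> list[str]:
    prompt = (f"Rewrite the question below {n} different ways, one per line, "
              f"keeping its meaning:\n{query}")
    rewrites = [query] + llm(prompt).splitlines()[:n]
    seen: dict[str, None] = {}
    for q in rewrites:
        for chunk in retrieve(q):
            seen.setdefault(chunk, None)   # union, preserving first-seen order
    return list(seen)
```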


Memory Hooks


📚 RAG — full form + core idea
Retrieval-Augmented Generation: retrieve relevant external docs at query time → inject into prompt → LLM generates a grounded answer without retraining.

🔢 Bi-encoder vs Cross-encoder — when to use each
Bi-encoder = fast retrieval (recall stage): embeds query and docs separately → cosine similarity. Cross-encoder = accurate reranking: sees query + doc together → jointly scored. Use both in a two-stage pipeline.

🎲 RRF — what it combines and how
Reciprocal Rank Fusion: merges dense + sparse ranked lists with score = Σ 1/(k + rank). Simple, robust, no training needed. k=60 is the standard constant.

💡 HyDE — the counterintuitive trick
Hypothetical Document Embeddings: generate a fake ideal answer → embed the fake answer → use it as the search vector. Closes the vocabulary gap between questions and documents.

🤖 Self-RAG tokens — the 3 decision points
[Retrieve] → [Relevant/Irrelevant] → [IsSupported/Contradicts]: the model decides IF to retrieve, WHICH passages to use, and WHETHER its own answer is faithful to them.

🕸️ GraphRAG advantage — what standard RAG can't do
Native multi-hop reasoning: traverses entity→relationship→entity paths in a single graph query. Standard RAG requires N separate retrievals to answer the same question.

👨‍👩‍👧 Parent-Child Chunking — why use both sizes?
Index small child chunks (precise retrieval) → return the large parent chunk to the LLM (full context). Small chunks = high precision in search; the large parent gives the LLM enough context to answer correctly.

🟢 NeMo Retriever NIM — what endpoints it provides
POST /v1/embeddings + POST /v1/ranking: OpenAI-compatible API. Embedding NIM (E5, Mistral-based) and Reranking NIM (Mistral-4B cross-encoder), both TensorRT-optimized on NVIDIA GPUs.