NCP-AAI · Topic 3 of 5 · 10% Exam Weight

Knowledge Integration & RAG for Agents

From document ingestion to GraphRAG — master every layer of the retrieval stack that powers production agentic AI systems.

4 retrieval modes · 4+ advanced patterns · 10 practice questions · 8 memory cards

What Is Knowledge Integration?

Agents exceed their training data by connecting to external knowledge — RAG is the primary mechanism.

Why Agents Need External Knowledge

LLMs are frozen at training time. RAG lets agents answer questions about current events, proprietary documents, and domain-specific data without expensive retraining — by retrieving relevant context and injecting it into the prompt before generating a response.

📚 RAG Foundations (~30% of topic)
Indexing pipeline, chunking strategies, embeddings, vector stores.
Key tools: FAISS, Chroma, pgvector

🔎 Retrieval Strategies (~25% of topic)
Dense vs sparse, hybrid search, query rewriting, reranking.
Key tools: BM25, RRF, cross-encoder

🧩 Advanced RAG Patterns (~25% of topic)
HyDE, self-RAG, multi-hop, CRAG, agentic retrieval.
Key tools: LangGraph, LlamaIndex

🕸️ GraphRAG & Structured Data (~20% of topic)
Knowledge graphs, entity relationships, SQL agents, NeMo Retriever.
Key tools: Neo4j, NVIDIA NIM

RAG vs Alternatives

| Approach | How It Works | Best When | Weakness |
| --- | --- | --- | --- |
| RAG | Retrieve docs at runtime, inject into prompt | Large, changing knowledge bases | Retrieval quality bottleneck |
| Fine-tuning | Update model weights on domain data | Stable knowledge, style adaptation | Expensive; can't update daily |
| Long context | Stuff entire corpus into context window | Small corpora, long-context models (Gemini 1.5 class) | Cost, latency, lost-in-the-middle effect |
| RAG + fine-tune | Fine-tune retriever + generator together | High-accuracy production systems | Complex to set up and maintain |

The RAG Pipeline

Two phases: offline indexing and online retrieval + generation.

📥 Phase 1 — Offline Indexing

Runs once (or on schedule). Prepares your knowledge base for fast retrieval.

📄 Raw documents (PDFs, HTML, DOCX, DB records) → ✂️ Chunking (split into retrievable units) → 🔢 Embedding (text-embedding model → vector) → 🗄️ Vector store (index in FAISS, pgvector, …)
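A minimal sketch of the offline phase, assuming sentence-transformers for embeddings and FAISS as the vector store; the model name and chunk parameters are illustrative choices, not exam requirements:

```python
# Offline indexing sketch: chunk -> embed -> index.
# Assumes `pip install sentence-transformers faiss-cpu`.
import faiss
from sentence_transformers import SentenceTransformer

def fixed_chunks(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Fixed-size chunking with a sliding overlap window (word-based here)."""
    words, step = text.split(), size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

docs = ["...raw document text...", "...another document..."]
chunks = [c for d in docs for c in fixed_chunks(d)]

model = SentenceTransformer("BAAI/bge-small-en-v1.5")   # any bi-encoder works
embeddings = model.encode(chunks, normalize_embeddings=True)

# Inner product on normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
```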

🔍 Phase 2 — Online Retrieval & Generation

Runs for every user query. Retrieves relevant chunks and generates a grounded response.

User query ("What is our refund policy?") → 🔢 Query embed (same embedding model) → 🎯 Vector search (top-k by cosine similarity) → 📋 Retrieved chunks (k = 3–10 most relevant) → 📝 Augment prompt (context + question → LLM) → 💬 Answer (grounded response)
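The online phase, continuing the indexing sketch above: reuse the same embedding model, search the index, and build a grounded prompt. The grounding instruction wording is an illustrative convention:

```python
# Online retrieval + generation sketch (continues the indexing example).
query = "What is our refund policy?"
q_vec = model.encode([query], normalize_embeddings=True)

scores, ids = index.search(q_vec, 5)               # top-k by cosine similarity
context = "\n\n".join(chunks[i] for i in ids[0])

prompt = (
    "Answer using ONLY the context below. If the answer is not there, "
    "say you don't know.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}"
)
# `prompt` is then sent to any chat LLM endpoint for the grounded answer.
```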

Chunking Strategies

Chunk size and method dramatically affect retrieval quality. Too small = missing context; too large = noisy.

Fixed-Size (~512 tokens, 10–15% overlap)
Split every N tokens with a sliding overlap window (e.g., 512 tokens, 50-token overlap). Simple and fast, but may cut mid-sentence.

Sentence / Paragraph (boundary-aware)
Split on sentence boundaries using spaCy or NLTK. Preserves linguistic units — good for prose documents.

Semantic (content-aware)
Embed sentences, compute cosine similarity between consecutive sentences, and split where similarity drops sharply (a topic change) — sketched below.

Hierarchical / Parent-Child (parent ≈1024 tokens, child ≈256)
Index small child chunks for retrieval precision; return the larger parent chunk to the LLM for full context. Best of both worlds.
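A minimal sketch of the semantic strategy, assuming sentence-transformers and a pre-split sentence list; the 0.6 threshold is an illustrative tuning knob:

```python
# Semantic chunking sketch: split where embedding similarity between
# consecutive sentences drops sharply (a topic change).
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(sentences: list[str], threshold: float = 0.6) -> list[str]:
    model = SentenceTransformer("BAAI/bge-small-en-v1.5")
    vecs = model.encode(sentences, normalize_embeddings=True)
    # Cosine similarity of each sentence to the next (normalized -> dot product).
    sims = (vecs[:-1] * vecs[1:]).sum(axis=1)
    chunks, current = [], [sentences[0]]
    for sent, sim in zip(sentences[1:], sims):
        if sim < threshold:            # similarity drop -> start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```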

Common Vector Stores

| Store | Type | ANN Algorithm | Best For | NVIDIA Fit |
| --- | --- | --- | --- | --- |
| FAISS | Local library | IVF-PQ, HNSW | In-process, GPU-accelerated search | GPU FAISS via RAPIDS cuVS |
| pgvector | Postgres extension | IVFFlat, HNSW | Existing Postgres stacks | Works with NeMo Retriever |
| Chroma | Embedded / server | HNSW (hnswlib) | Prototyping, LangChain default | Easy local dev |
| Weaviate | Self-hosted / cloud | HNSW + BM25 | Hybrid search out of the box | Supports NIM embedding endpoints |
| Milvus | Distributed | IVF, HNSW, DiskANN | Billion-scale production deployments | GPU-native via NVIDIA RAFT |

NVIDIA NeMo Retriever NIM

OpenAI-compatible API endpoints for embedding and reranking — drop-in replacement for OpenAI text-embedding endpoints, optimized with TensorRT on NVIDIA GPUs.

Embedding NIM: nvidia/nv-embedqa-e5-v5, nvidia/nv-embedqa-mistral-7b-v2
Reranking NIM: nvidia/nv-rerankqa-mistral-4b-v3
API endpoints: POST /v1/embeddings | POST /v1/ranking
Integrations: LangChain NVIDIAEmbeddings, LlamaIndex NVIDIARerank
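A minimal sketch of calling the embedding endpoint over HTTP, assuming a locally deployed NIM; the base URL is a placeholder, and the input_type field follows the NIM embedding API as documented, so verify against your deployment:

```python
# Calling a NeMo Retriever Embedding NIM via its OpenAI-compatible
# /v1/embeddings endpoint.
import requests

BASE_URL = "http://localhost:8000"   # placeholder: your NIM deployment URL
resp = requests.post(
    f"{BASE_URL}/v1/embeddings",
    json={
        "model": "nvidia/nv-embedqa-e5-v5",
        "input": ["What is our refund policy?"],
        "input_type": "query",       # use "passage" when indexing documents
    },
)
vector = resp.json()["data"][0]["embedding"]
```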

Retrieval Methods Compared

Choosing the right retrieval approach is one of the highest-leverage decisions in a RAG system.

Dense Semantic Retrieval
cosine_similarity(q_embed, doc_embed)
  • Captures meaning, not just keywords
  • Works across synonyms & paraphrases
  • Bi-encoder: encode query + docs separately
  • Models: E5, BGE, NeMo Retriever, text-embedding-ada-002
Best for: Conversational queries, semantic similarity tasks, multilingual search.

Sparse Keyword Retrieval
BM25(q, doc): TF·IDF-family score with term-frequency saturation and document-length normalization
  • Exact term matching — no embedding needed
  • Fast and interpretable
  • Great for rare terms, codes, identifiers
  • Tools: Elasticsearch, OpenSearch, Weaviate BM25
Best for: Product codes, names, jargon, legal citations — where exact match matters.

Hybrid Combined Search
RRF(rank_dense, rank_sparse)
  • Runs dense + sparse in parallel
  • Merges result lists with Reciprocal Rank Fusion (sketched below)
  • Higher recall than either method alone
  • Industry default for production RAG
Best for: General-purpose RAG where you can't predict the query type in advance.
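A minimal sketch of Reciprocal Rank Fusion over two ranked id lists; the document ids are illustrative:

```python
# Reciprocal Rank Fusion: merge dense and sparse ranked lists.
# k=60 is the conventional constant from the original RRF paper.
def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d7"]   # ids ranked by cosine similarity
sparse = ["d1", "d9", "d3"]   # ids ranked by BM25
print(rrf([dense, sparse]))   # d1 and d3 rise to the top
```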

🔄 The Two-Stage Retrieval Pipeline

Query (user input) → 🎯 Recall (k=100: dense + sparse, fast bi-encoder) → 🏆 Rerank (top 5: cross-encoder scores query × chunk jointly) → 📋 Final context (high-precision chunks to the LLM)

Key insight: Bi-encoders are fast (pre-compute doc embeddings) but less accurate. Cross-encoders see both query and document together — much more accurate but too slow to run on all documents. Two-stage gives you speed and precision.
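A sketch of stage 2 on top of the RRF output above, using a sentence-transformers cross-encoder; `chunk_text` (an id → text mapping) and the model choice are assumptions for illustration:

```python
# Stage 2 sketch: cross-encoder reranking of the recall-stage shortlist.
from sentence_transformers import CrossEncoder

candidates = rrf([dense, sparse])[:100]      # stage 1: high-recall shortlist
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# The cross-encoder sees query and chunk TOGETHER, scoring each pair jointly.
pairs = [(query, chunk_text[doc_id]) for doc_id in candidates]  # chunk_text: assumed id->text map
scores = reranker.predict(pairs)

# Keep only the 5 highest-scoring chunks for the final context.
reranked = sorted(zip(scores, candidates), reverse=True)
top5 = [doc_id for _, doc_id in reranked[:5]]
```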

Full RAG Concept Comparison

| Concept | RAG Foundations | Retrieval Strategy | Advanced Patterns | GraphRAG |
| --- | --- | --- | --- | --- |
| Primary goal | Index knowledge for lookup | Find the right chunks fast | Improve RAG quality | Traverse entity relationships |
| Key algorithm | Embedding + ANN index | BM25, cosine, RRF | HyDE, CRAG, self-RAG | Graph traversal (Cypher) |
| Chunk size | Fixed or semantic split | Affects recall@k | HyDE generates virtual doc | Entities, not chunks |
| Multi-hop support | ❌ Single retrieval | ❌ Single retrieval | ✅ Multi-hop chaining | ✅ Native graph traversal |
| Handles structured data | Partially (metadata filters) | Partially (SQL hybrid) | SQL agent pattern | ✅ Primary use case |
| NVIDIA tool | NeMo Retriever Embedding NIM | NeMo Retriever Reranking NIM | LangGraph + NIM LLM | Neo4j + NVIDIA GPU |
| Latency | One-time index cost | +5–50 ms reranker | +100–500 ms (extra LLM calls) | + graph query time |
| When to skip | Never — always needed | Low-latency constraints | Simple factoid questions | Flat unstructured corpus |

Advanced RAG Patterns

Standard RAG fails on complex questions — these patterns fix specific failure modes.

HyDE — Hypothetical Document Embeddings
Ask the LLM to generate a hypothetical ideal answer, then embed that answer and use it as the search vector — not the original query (sketched after this list).
💡 Use when: Query vocabulary is very different from document vocabulary (e.g., the question asks "why does X cause Y?" but the docs use passive-voice technical prose).

Self-RAG — Self-Reflective Retrieval
The model generates special tokens to decide: [Retrieve] when needed, [Relevant] or [Irrelevant] per passage, then [IsSupported] vs [Contradicts] for the final answer.
💡 Use when: You need adaptive retrieval — some queries need retrieval, others don't. Avoids unnecessary latency and citation errors.

Multi-Hop RAG — Iterative Retrieval Chaining
Break a complex question into sub-questions. Retrieve for Q1, inject the result into Q2's retrieval query, and so on — chaining until the full answer is assembled.
💡 Use when: Questions require combining facts from multiple documents, e.g., "What did the CEO who joined in 2019 say about Q3 margins?"

CRAG — Corrective RAG
After retrieval, a lightweight evaluator scores each chunk's relevance. If relevance is low, it triggers a web-search fallback and strips irrelevant content before generation.
💡 Use when: Your vector store may be out of date, or queries fall outside indexed content. Automatically falls back to live search.
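A minimal sketch of HyDE, reusing the embedding model and FAISS index from the earlier examples; `llm` (a chat-completion callable returning text) and the prompt wording are assumptions:

```python
# HyDE sketch: embed a hypothetical ANSWER instead of the raw query.
def hyde_search(query: str, llm, index, model, k: int = 5):
    hypothetical = llm(
        f"Write a short passage that would perfectly answer: {query}"
    )
    # Search with the fake answer's embedding: it closes the vocabulary gap
    # between question phrasing and document phrasing.
    vec = model.encode([hypothetical], normalize_embeddings=True)
    scores, ids = index.search(vec, k)
    return ids[0]
```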

🤖 Agentic RAG — The Agent Decides What to Retrieve

The LLM orchestrates retrieval as a tool call, choosing when to retrieve, what to search for, and how many times — unlike passive RAG which always retrieves once.

User question → LLM agent: plan retrieval steps → generate a search query → retrieve(query) → observe chunks → sufficient? If no, refine the query and retrieve again; if yes, generate the answer → Final response

Key: The agent can refine its search query based on partial results — a single fixed query is replaced by dynamic retrieval planning.
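A minimal sketch of that loop; `llm` and `retrieve` are assumed callables, and the SUFFICIENT reply convention is an illustrative protocol, not a library feature:

```python
# Agentic RAG loop sketch: the LLM decides whether to retrieve again.
def agentic_rag(question: str, llm, retrieve, max_steps: int = 3) -> str:
    notes: list[str] = []
    search_query = question
    for _ in range(max_steps):
        notes.extend(retrieve(search_query))            # retrieval as a tool call
        verdict = llm(
            "Context:\n" + "\n".join(notes) +
            f"\n\nCan you fully answer '{question}' from this context? "
            "Reply SUFFICIENT, or propose a better search query."
        )
        if verdict.strip().startswith("SUFFICIENT"):
            break
        search_query = verdict                          # refine and retry
    return llm("Context:\n" + "\n".join(notes) + f"\n\nAnswer: {question}")
```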

🕸️ GraphRAG — Knowledge Graph Retrieval

Instead of chunking text, GraphRAG extracts entities (people, orgs, products) and relationships, stores them as a graph, and traverses that graph at query time.

Standard Vector RAG
  • Splits docs into flat text chunks
  • Retrieves top-k similar chunks
  • Can't traverse relationships
  • Multi-hop requires extra calls
  • Loses entity context across chunks
GraphRAG
  • Extracts entities + relationships into graph
  • Traverses graph paths (Cypher queries)
  • Native multi-hop: A→B→C in one query
  • Community detection for summarization
  • Combines vector + graph for hybrid retrieval

Example: "Which drugs interact with medications taken by patients diagnosed with condition X in 2024?" — requires entity traversal across patients → medications → interactions → diagnoses. GraphRAG handles this natively; standard RAG requires many retrieval steps.

Query Rewriting — Often Overlooked, High Impact

Before embedding, reformulate the user query:
1. Step-back prompting — ask a broader question to retrieve more context.
2. Multi-query — generate 3 paraphrases and union the results (sketched below).
3. Sub-question decomposition — break complex questions into atomic retrievals.
These simple techniques often improve recall@k more than switching embedding models.
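A minimal sketch of the multi-query technique; `llm` and `retrieve` are assumed callables (an LLM completion function and a top-k retriever):

```python
# Multi-query rewriting sketch: paraphrase, retrieve per paraphrase, union.
def multi_query_retrieve(query: str, llm, retrieve, n: int = 3) -> list[str]:
    prompt = (f"Rewrite the question below {n} different ways, one per line, "
              f"keeping its meaning:\n{query}")
    rewrites = [query] + llm(prompt).splitlines()[:n]
    seen: dict[str, None] = {}
    for q in rewrites:
        for chunk in retrieve(q):
            seen.setdefault(chunk, None)   # union, preserving first-seen order
    return list(seen)
```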


Memory Hooks


📚 RAG — full form + core idea
Retrieval-Augmented Generation: retrieve relevant external docs at query time → inject into prompt → LLM generates a grounded answer without retraining.

🔢 Bi-encoder vs Cross-encoder — when to use each
Bi-encoder = fast retrieval (recall stage): embeds query and docs separately → cosine similarity. Cross-encoder = accurate reranking: sees query + doc together → jointly scored. Use both in a two-stage pipeline.

🎲 RRF — what it combines and how
Reciprocal Rank Fusion: merges dense + sparse ranked lists with score = Σ 1/(k + rank). Simple, robust, no training needed. k=60 is the standard constant.

💡 HyDE — the counterintuitive trick
Hypothetical Document Embeddings: generate a fake ideal answer → embed the fake answer → use it as the search vector. Closes the vocabulary gap between questions and documents.

🤖 Self-RAG tokens — the 3 decision points
[Retrieve] → [Relevant/Irrelevant] → [IsSupported/Contradicts]: the model decides IF to retrieve, WHICH passages to use, and WHETHER its own answer is faithful to them.

🕸️ GraphRAG advantage — what standard RAG can't do
Native multi-hop reasoning: traverses entity→relationship→entity paths in a single graph query. Standard RAG requires N separate retrievals to answer the same question.

👨‍👩‍👧 Parent-Child Chunking — why use both sizes?
Index small child chunks (precise retrieval) → return the large parent chunk to the LLM (full context). Small chunks = high precision in search; the large parent gives the LLM enough context to answer correctly.

🟢 NeMo Retriever NIM — what endpoints it provides
POST /v1/embeddings + POST /v1/ranking: OpenAI-compatible API. Embedding NIM (E5, Mistral-based) and Reranking NIM (Mistral-4B cross-encoder), both TensorRT-optimized on NVIDIA GPUs.