From document ingestion to GraphRAG — master every layer of the retrieval stack that powers production agentic AI systems.
Agents reach beyond their training data by connecting to external knowledge, and RAG is the primary mechanism for doing so.
LLMs are frozen at training time. RAG lets agents answer questions about current events, proprietary documents, and domain-specific data without expensive retraining — by retrieving relevant context and injecting it into the prompt before generating a response.
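In code, the core mechanism is just prompt assembly: retrieved text is placed ahead of the question so the frozen model can ground its answer. A minimal sketch, where the `retrieved_chunks` list stands in for whatever your retriever returns:

```python
# Context injection: retrieved chunks are concatenated into the prompt.
retrieved_chunks = [
    "Q3 revenue grew 12% year over year.",     # placeholder retrieval results
    "The Q3 report was filed on 2024-10-15.",
]

context = "\n\n".join(retrieved_chunks)
prompt = (
    "Answer using ONLY the context below. "
    "If the context is insufficient, say so.\n\n"
    f"Context:\n{context}\n\n"
    "Question: When was the Q3 report filed?"
)
# `prompt` is then sent to any chat/completions endpoint.
```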
| Approach | How It Works | Best When | Weakness |
|---|---|---|---|
| RAG | Retrieve docs at runtime, inject into prompt | Large, changing knowledge bases | Retrieval quality bottleneck |
| Fine-tuning | Update model weights on domain data | Stable knowledge, style adaptation | Expensive, can't update daily |
| Long Context | Stuff entire corpus into context window | Small corpora, Gemini 1.5-class models | Cost, latency, lost-in-middle effect |
| RAG + Fine-tune | Fine-tune retriever + generator together | High-accuracy production systems | Complex to set up and maintain |
Two phases: offline indexing and online retrieval + generation.
Offline indexing runs once (or on a schedule) and prepares your knowledge base for fast retrieval.
Online retrieval + generation runs for every user query: it retrieves relevant chunks and generates a grounded response.
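A compact sketch of both phases; the `embed` helper is a stand-in (seeded random unit vectors) for a real embedding model such as a NeMo Retriever embedding NIM:

```python
import zlib
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Stand-in embedder (random unit vectors, seeded per text).
    Replace with a real model; the shapes are what matter here."""
    vecs = []
    for t in texts:
        rng = np.random.default_rng(zlib.crc32(t.encode()))
        v = rng.standard_normal(384)
        vecs.append(v / np.linalg.norm(v))
    return np.array(vecs)

# --- Offline indexing: runs once or on a schedule ---------------------
chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]
index = embed(chunks)                      # (n_chunks, dim) matrix

# --- Online retrieval: runs for every user query ----------------------
q = embed(["user question"])[0]
scores = index @ q                         # cosine similarity (unit vectors)
top_k = np.argsort(-scores)[:2]
context = [chunks[i] for i in top_k]       # injected into the prompt as above
```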
Chunk size and method dramatically affect retrieval quality. Too small = missing context; too large = noisy.
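A minimal fixed-size chunker with overlap makes the tradeoff concrete (sizes here are in characters; token-based splitting works the same way):

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Fixed-size chunking with overlap, so a sentence that spans a
    boundary still appears intact in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# chunk_size=100 fragments ideas (missing context);
# chunk_size=4000 buries the relevant sentence (noise).
chunks = chunk_text("some long document text " * 200)
```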
| Store | Type | ANN Algorithm | Best For | NVIDIA Fit |
|---|---|---|---|---|
| FAISS | Local lib | IVF-PQ, HNSW | In-process, GPU-accelerated search | GPU FAISS via RAPIDS cuVS |
| pgvector | Postgres ext | IVFFlat, HNSW | Existing Postgres stacks | Works with NeMo Retriever |
| Chroma | Embedded / server | HNSW (hnswlib) | Prototyping, LangChain default | Easy local dev |
| Weaviate | Self-hosted / Cloud | HNSW + BM25 | Hybrid search out of the box | Supports NIM embedding endpoints |
| Milvus | Distributed | IVF, HNSW, DiskANN | Billion-scale production deployments | GPU-native with NVIDIA RAFT |
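As a concrete example of the first row, a minimal FAISS HNSW index. This is the CPU build; the GPU path via RAPIDS cuVS uses different index classes:

```python
import numpy as np
import faiss                                   # pip install faiss-cpu

dim = 384
corpus = np.random.rand(10_000, dim).astype("float32")   # toy embeddings

index = faiss.IndexHNSWFlat(dim, 32)           # 32 = graph neighbors per node (M)
index.hnsw.efConstruction = 200                # build-time quality knob
index.add(corpus)

index.hnsw.efSearch = 64                       # query-time recall/latency knob
query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)        # top-5 approximate neighbors
print(ids[0])
```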
OpenAI-compatible API endpoints for embedding and reranking — drop-in replacement for OpenAI text-embedding endpoints, optimized with TensorRT on NVIDIA GPUs.
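Because the endpoint speaks the OpenAI API, the stock `openai` client works unchanged. The base URL, port, and model name below are deployment-specific placeholders, and some NVIDIA retrieval models additionally expect an `input_type` hint:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",    # placeholder: your NIM endpoint
    api_key="not-used-for-local-nim",
)

resp = client.embeddings.create(
    model="nvidia/nv-embedqa-e5-v5",        # placeholder embedding NIM model
    input=["What is the warranty period?"],
    # Some NVIDIA models also take extra_body={"input_type": "query"}.
)
print(len(resp.data[0].embedding))          # embedding dimensionality
```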
Choosing the right retrieval approach is one of the highest-leverage decisions in a RAG system.
Key insight: Bi-encoders are fast (pre-compute doc embeddings) but less accurate. Cross-encoders see both query and document together — much more accurate but too slow to run on all documents. Two-stage gives you speed and precision.
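A two-stage sketch with the `sentence-transformers` library: the bi-encoder narrows the corpus to a small candidate set, then the cross-encoder reranks only those candidates. The checkpoints are common public models; in production you would swap in NeMo Retriever embedding and reranking NIMs:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

docs = ["Doc about GPU memory.", "Doc about warranty terms.", "Doc about returns."]
query = "How long is the warranty?"

# Stage 1: bi-encoder. Doc vectors are precomputed once; queries are cheap.
bi = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = bi.encode(docs, convert_to_tensor=True)
hits = util.semantic_search(bi.encode(query, convert_to_tensor=True),
                            doc_vecs, top_k=2)[0]

# Stage 2: cross-encoder. Reads query + document jointly; slow but precise.
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, docs[h["corpus_id"]]) for h in hits]
scores = ce.predict(pairs)
print(pairs[int(scores.argmax())][1])       # best passage after reranking
```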
| Concept | RAG Foundations | Retrieval Strategy | Advanced Patterns | GraphRAG |
|---|---|---|---|---|
| Primary goal | Index knowledge for lookup | Find the right chunks fast | Improve RAG quality | Traverse entity relationships |
| Key algorithm | Embedding + ANN index | BM25, cosine, RRF | HyDE, CRAG, self-RAG | Graph traversal (Cypher) |
| Chunk size | Fixed or semantic split | Affects recall@k | HyDE generates virtual doc | Entities, not chunks |
| Multi-hop support | ❌ Single retrieval | ❌ Single retrieval | ✅ Multi-hop chaining | ✅ Native graph traversal |
| Handles structured data | Partially (metadata filters) | Partially (SQL hybrid) | SQL agent pattern | ✅ Primary use case |
| NVIDIA tool | NeMo Retriever Embedding NIM | NeMo Retriever Reranking NIM | LangGraph + NIM LLM | Neo4j + NVIDIA GPU |
| Latency | One-time index cost | +5–50ms reranker | +100–500ms (extra LLM calls) | +graph query time |
| When to skip | Never — always needed | Low-latency constraints | Simple factoid questions | Flat unstructured corpus |
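The strategy column above mentions RRF (reciprocal rank fusion), the standard way to merge a BM25 ranking with a vector ranking without normalizing their incomparable scores. A minimal implementation:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum of 1/(k + rank_i(d)).
    Each ranking is a list of doc IDs, best first; k=60 is the
    conventional constant from the original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a BM25 ranking with a vector-search ranking.
print(rrf([["d3", "d1", "d7"], ["d1", "d9", "d3"]]))   # ['d1', 'd3', 'd9', 'd7']
```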
Standard RAG fails on complex questions — these patterns fix specific failure modes.
Self-RAG trains the model to emit reflection tokens: [Retrieve] when needed, [Relevant] or [Irrelevant] per passage, then [IsSupported] vs [Contradicts] for the final answer.
In agentic RAG, the LLM orchestrates retrieval as a tool call, choosing when to retrieve, what to search for, and how many times, unlike passive RAG, which always retrieves exactly once.
Key: The agent can refine its search query based on partial results — a single fixed query is replaced by dynamic retrieval planning.
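A sketch of that control flow under two stated assumptions: `call_llm` and `search` are hypothetical stand-ins for your chat endpoint and retriever, and the model is prompted to reply with either `SEARCH: <query>` or `ANSWER: <text>`:

```python
def call_llm(transcript: str) -> str:
    """Hypothetical: send transcript to any chat endpoint; the system
    prompt asks for 'SEARCH: <query>' or 'ANSWER: <text>'."""
    raise NotImplementedError

def search(query: str) -> str:
    """Hypothetical: return concatenated top-k chunks for the query."""
    raise NotImplementedError

def agentic_rag(question: str, max_steps: int = 4) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        move = call_llm(transcript)
        if move.startswith("ANSWER:"):
            return move[len("ANSWER:"):].strip()
        if move.startswith("SEARCH:"):
            query = move[len("SEARCH:"):].strip()     # model-chosen query,
            transcript += (f"\nSearched: {query}"     # refined from partial
                           f"\nResults: {search(query)}")  # results so far
    return call_llm(transcript + "\nAnswer now.")
```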
Instead of chunking text, GraphRAG extracts entities (people, orgs, products) and relationships, stores them as a graph, and traverses that graph at query time.
Example: "Which drugs interact with medications taken by patients diagnosed with condition X in 2024?" — requires entity traversal across patients → medications → interactions → diagnoses. GraphRAG handles this natively; standard RAG requires many retrieval steps.
Before embedding, reformulate the user query: (1) Step-back prompting — ask a broader question to retrieve more context; (2) Multi-query — generate 3 paraphrases and union the results; (3) Sub-question decomposition — break complex questions into atomic retrievals. These simple techniques often improve recall@k more than switching embedding models.
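A sketch of the multi-query variant, with `retrieve(query, k)` and `call_llm(prompt)` passed in as caller-supplied placeholders for your retriever and LLM:

```python
def multi_query_retrieve(question: str, retrieve, call_llm, k: int = 5) -> list[str]:
    """Multi-query expansion: retrieve for the original question plus
    three LLM-generated paraphrases, then union the results."""
    prompt = "Rewrite this question three different ways, one per line:\n" + question
    queries = [question] + call_llm(prompt).splitlines()[:3]

    seen: dict[str, None] = {}           # dict preserves insertion order
    for q in queries:
        for chunk in retrieve(q, k):
            seen.setdefault(chunk, None)  # order-preserving dedup
    return list(seen)
```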
10 NCP-AAI-style questions covering all 4 knowledge areas.
Describe your situation — get a targeted recommendation.