Microsoft Certified · AI-103 · Beta April 2026

AI-103: Information Extraction
Solutions

Domain 5 of 5 | 10–15% of Exam | Azure AI Apps and Agents Developer Associate

100

Minutes

700

Passing Score

40–60

Questions

Pearson VUE

Delivery

Foundry

Platform

🗂️ Exam Domain Weights

Highlighted row = this study page. All percentages are approximate.

#	Domain	Weight
1	Plan and manage an Azure AI solution	25–30%
2	Implement generative AI and agentic solutions	30–35%
3	Implement computer vision solutions	10–15%
4	Implement text analysis solutions	10–15%
5	🎯 Implement information extraction solutions THIS PAGE	10–15%

📌 What Is Information Extraction?

Information extraction transforms unstructured content — PDFs, scanned documents, images, audio, and video — into structured, queryable data. For AI-103, this domain covers three core Azure services working in concert: Azure AI Document Intelligence (extract fields from documents), Azure AI Search (index, enrich, and retrieve that data at scale), and Azure AI Content Understanding (multimodal analyzers for documents, images, audio, and video).

These services are the foundation of every production RAG pipeline — the ingestion side that feeds your Azure OpenAI models with precise, up-to-date grounding data.

Exam focus: Expect scenario-based questions where you must choose between Document Intelligence prebuilt models vs custom models, select the correct Azure AI Search indexer/skillset configuration, design hybrid search strategies (BM25 + vector + Semantic Ranker), and select the right chunking strategy for a RAG pipeline.

📖 What's on This Page (Domain 5)

📄Document Intelligence Prebuilt Models

🛠️Custom Template vs Custom Neural

🔍Azure AI Search Architecture

⚡Hybrid Search & Semantic Ranker

🔗Integrated Vectorization & RAG

🎬Azure AI Content Understanding

✂️Chunking Strategies

📊Decision Tables & Service Selection

🧭 Service Selection Decision Table

Scenario	Use This Service	Why
Extract invoice fields (vendor, total, line items) from PDFs	Document Intelligence — Invoice prebuilt	Pre-trained on millions of invoices, no training needed
Extract fields from proprietary insurance claim forms	Document Intelligence — Custom Template or Neural	Proprietary layout not covered by prebuilt models
Full-text + semantic search over enterprise documents	Azure AI Search with Semantic Ranker	Hybrid BM25 + vector + L2 reranking for best relevance
Ground Azure OpenAI with up-to-date enterprise knowledge	Azure AI Search (RAG pipeline)	Retrieval + generation separation; fresh data without retraining
Analyze video for meeting transcription + speaker ID	Azure AI Content Understanding — Video analyzer	Multimodal; Audio + visual analysis in one service
Classify documents and extract key-value pairs at scale	Document Intelligence + AI Search indexer + skillset	End-to-end pipeline: extract → enrich → index → query

📄 Azure AI Document Intelligence

Azure AI Document Intelligence (formerly Form Recognizer) uses machine learning to extract text, key-value pairs, tables, and structured fields from documents. It exposes a REST API and SDK (Python, C#, Java, JavaScript).

Prebuilt Models — Know These Cold

Model	Extracts	Input
Invoice	Vendor name, invoice #, due date, line items, subtotal, tax, total	PDF, JPEG, PNG, TIFF, BMP
Receipt	Merchant name, date, items, totals, payment method	Images, PDF
Layout	Text, tables, selection marks, reading order — no field semantics	Any document type
ID Document	Name, DOB, address, ID #, expiry from passports & driver's licenses	Images, PDF
W-2	Tax year, employer EIN, wages, withholdings — US tax forms	PDF
Contract	Parties, dates, clauses, obligations from legal contracts	PDF
Health Insurance Card	Member ID, plan, group number, payer name	Images

Layout vs Invoice: Layout extracts ALL text and table structure without knowing what the fields mean. Invoice understands semantics (it knows "Total Due" is the total amount). Use Layout when you need structure for downstream processing; use Invoice/Receipt when you need business-domain field extraction.

Custom Models

Custom Template Model

Fixed, predictable layouts (printed forms)
Fast training: as few as 5 labeled samples
Rule-based field anchoring to page regions
Lower compute cost
Fails with layout variations
Use when: Consistent form structure, high volume

Custom Neural Model

Variable layouts (handwritten, multi-format)
Requires more labeled samples (typically 50+)
Deep learning field extraction — understands context
Handles layout variations gracefully
Higher accuracy, more compute
Use when: Diverse layouts or handwriting

Composed Models

A composed model combines multiple custom models into a single endpoint. Document Intelligence automatically routes each submitted document to the best-matching component model. Use when you process multiple form types (purchase orders, receipts, contracts) through one API call.

Confidence Scores

Every extracted field includes a confidence score from 0.0 to 1.0. Values below 0.5 typically indicate extraction uncertainty — implement business logic to flag low-confidence fields for human review.

POST /documentModels/{modelId}:analyze?api-version=2024-11-30
Content-Type: application/json
{ "urlSource": "https://storage.blob.core.windows.net/docs/invoice.pdf" }

# Poll result:
GET /documentModels/{modelId}/analyzeResults/{resultId}
# Response includes: fields[].value, fields[].confidence

🔍 Azure AI Search

Azure AI Search (formerly Cognitive Search) is a cloud search service that indexes, enriches, and queries content at scale. It is the retrieval backbone of enterprise RAG solutions on Azure.

Core Architecture Components

📚 Index 🔌 Data Source 🤖 Indexer 🧠 Skillset 📦 Knowledge Store

Index Field Attributes

Attribute	Purpose	Example Use
key	Unique document identifier	Document ID field
searchable	Full-text indexed (BM25 / inverted index)	Document content, title
filterable	Supports $filter OData expressions	Category, date, status
sortable	Supports $orderby	Date, price, score
facetable	Supports aggregation buckets (facet navigation)	Category, department
retrievable	Returned in search results (stored)	Display fields

Vector fields must be typed as Collection(Edm.Single) with vectorSearchProfile assigned. Vector fields cannot be filterable or facetable.

Indexers and Skillsets

An indexer pulls documents from a data source (Blob Storage, SQL, Cosmos DB, etc.) and optionally applies a skillset — a pipeline of AI enrichment steps — before writing results to the index.

Built-in cognitive skills include: OCR, language detection, key phrase extraction, entity recognition, image analysis, document cracking. Custom skills call any Azure Function or web API that implements the skill API contract (inputs/outputs JSON schema).

// Custom skill API contract (must implement this shape)
POST https://your-function.azurewebsites.net/api/skill
{
  "values": [
    { "recordId": "doc1", "data": { "text": "..." } }
  ]
}
// Response:
{
  "values": [
    { "recordId": "doc1", "data": { "myField": "extracted value" }, "errors": [], "warnings": [] }
  ]
}

Search Types and When to Use Them

Search Type	Algorithm	Best For
Full-text	BM25 (TF-IDF variant)	Keyword precision, exact term matching
Vector	HNSW approximate nearest neighbor	Semantic similarity, cross-lingual queries
Hybrid	BM25 + Vector merged via RRF	Best relevance — combines both signals
Semantic	Hybrid + L2 reranking (language model)	Highest relevance; generates captions and answers

HNSW — Hierarchical Navigable Small World

HNSW is the approximate nearest-neighbor algorithm used for vector search. Key parameters:

m (connections per node, default 4): More connections = better recall, more memory
efConstruction (build-time exploration, default 400): Higher = better index quality, slower build
efSearch (query-time exploration): Higher = better recall, higher latency
metric: cosine (recommended for normalized embeddings), dotProduct, euclidean

Reciprocal Rank Fusion (RRF)

RRF merges BM25 and vector search result lists into a single ranked list. For each document in each result set, its RRF score is computed as 1 / (k + rank) where k=60 (default). Scores from both lists are summed. Documents appearing high in both lists get the highest combined scores.

Why RRF wins: It is rank-based, not score-based, so BM25 scores (unbounded) and vector cosine scores (0–1) can be combined without normalization. A document in position 1 in both lists gets max RRF score regardless of raw score magnitude.

Semantic Ranker

Semantic Ranker is an optional L2 (second-pass) reranker that takes the top-N results from hybrid search and reranks them using a language model. It also generates semantic captions (highlighted passages) and semantic answers (direct extractions for factual questions).

Cost: Semantic Ranker is an add-on (requires Standard tier or above) billed per query. It does NOT change the index — it is a query-time operation only.

Integrated Vectorization

Integrated vectorization embeds chunking and embedding generation directly into the indexer pipeline — no external preprocessing step needed. You configure a vectorizer (pointing to Azure OpenAI text-embedding model) and Azure AI Search handles chunking documents, calling the embedding API, and storing vectors automatically.

Integrated vectorization pipeline: Blob Storage → Indexer pulls document → Document cracking skill splits content → Text split skill chunks text → Azure OpenAI vectorizer embeds each chunk → Vectors written to index field.

Knowledge Store

Knowledge Store persists AI enrichment outputs from the skillset pipeline to Azure Storage (Blob or Table Storage) in addition to the search index. It enables downstream analytics, Power BI reports, or other applications to consume enriched data without going through the search API.

Projection types:

Table projections → Azure Table Storage (rows/columns — good for structured data, Power BI)
Object projections → Blob Storage as JSON (full enriched document)
File projections → Blob Storage as binary (normalized images from image analysis skill)

✂️ Chunking Strategies for RAG

Chunking splits source documents into smaller pieces before embedding. Chunk size directly impacts retrieval quality: too large = context dilution, too small = loss of context.

Strategy	How It Works	Best For	Trade-off
Fixed-size	Split every N tokens regardless of content boundaries	Simple pipelines, uniform content	May split sentences/concepts mid-thought
Sliding window	Fixed-size with overlap (e.g., 512 tokens, 50-token overlap)	Preserving context across chunk boundaries	More chunks to store/embed; higher cost
Semantic	Split at sentence/paragraph boundaries; respect natural language structure	Narrative text, articles, reports	Variable chunk sizes; more complex to implement
Hierarchical	Parent chunks (e.g., full page) + child chunks (e.g., paragraph). Retrieve child, send parent to LLM	RAG needing precision retrieval + broad context	Double storage; complex index design

Hierarchical chunking advantage: Child chunks (smaller) are used for vector search precision. Parent chunks (larger) are sent to the LLM as context — giving the model more surrounding information to generate accurate answers without noisy retrieval.

Overlap and Chunk Size Guidance

Chunk size: 256–1024 tokens is typical. Match to your LLM's context budget and typical answer length.
Overlap: 10–20% of chunk size (e.g., 100-token overlap for 512-token chunks)
Azure AI Search default: Text split skill uses 2000 tokens with 500-token overlap by default

🎬 Azure AI Content Understanding

Azure AI Content Understanding is a newer Azure AI service (2024–2025) for multimodal document and media analysis. It provides pre-trained and customizable analyzers for documents, images, audio, and video — going beyond Document Intelligence's form extraction to understand richer content types.

Analyzer Types

Analyzer	Input	Output
Document analyzer	PDF, Word, HTML	Structured fields, key-value pairs, tables, reading order
Image analyzer	JPEG, PNG, TIFF	Detected objects, text (OCR), captions, descriptions
Audio analyzer	WAV, MP3, MP4 audio	Transcription, speaker diarization, intent classification
Video analyzer	MP4, MOV	Transcription, keyframe extraction, scene description, speaker ID

vs Document Intelligence: Document Intelligence specializes in structured form field extraction with prebuilt business models. Content Understanding supports richer multimodal inputs (audio, video) and uses customizable analyzer schemas. They complement each other in complex pipelines.

🔗 End-to-End RAG Pipeline Architecture

A production RAG pipeline on Azure has two phases: Ingestion (offline) and Retrieval + Generation (online).

Ingestion Phase (Offline)

Source Documents (Blob Storage)
  ↓ Azure AI Document Intelligence / Content Understanding
Extract text + structure
  ↓ Text Split Skill (Azure AI Search indexer)
Chunk into overlapping segments
  ↓ Azure OpenAI Embedding Model (text-embedding-3-large)
Generate vector embeddings per chunk
  ↓ Azure AI Search Index
Store: text chunk + vector + metadata (source, page, date)

Retrieval + Generation Phase (Online)

User Query
  ↓ Azure OpenAI Embedding Model
Embed query → query vector
  ↓ Azure AI Search Hybrid Query
BM25 (keyword) + Vector (HNSW) → RRF merge → Semantic Ranker rerank
  ↓ Top-K Retrieved Chunks (with citations)
  ↓ Azure OpenAI Chat Completions API
System prompt + retrieved chunks + user query → LLM
  ↓ Grounded Answer (with source references)

Groundedness check: After generation, use Azure AI Content Safety Groundedness Detection to verify the LLM answer is supported by the retrieved context — not hallucinated. This closes the RAG quality loop.

🪝 FISH — Chunking Strategies

F · I · S · H

Fixed-size — split every N tokens (simple, fast)
Intelligent/Semantic — split at language boundaries (sentence, paragraph)
Sliding window — fixed-size with overlap (preserves boundary context)
Hierarchical — parent + child chunks (precision retrieval, broad context)

Remember: "When you need context at the edges, slide. When you need both precision and breadth, go hierarchical."

🪝 DIAL — Document Intelligence Custom Model Decision

D · I · A · L

Diversity of layout → Custom Neural (handles variation)
Identical structure → Custom Template (fast, cheap, 5 samples)
Already covered → Prebuilt model (Invoice, Receipt, Layout, ID...)
Lots of form types → Composed model (route to best-match component)

🪝 HNSW Tuning — "More M, More Memory. More EF, More Effort."

M = Memory · EF = Effort

m (connections): Higher m = better recall, more RAM. Default 4.
efConstruction (build effort): Higher = better index quality, slower build. Default 400.
efSearch (query effort): Higher = better recall at query time, slower. Set per query.
metric: Use cosine for normalized embeddings (OpenAI models output normalized vectors).

🪝 RRF — "Rank, Don't Score"

score = 1 / (60 + rank)

BM25 and vector scores are on different scales — RRF ignores raw scores
Only the rank position matters: rank 1 = 1/61, rank 2 = 1/62, etc.
Scores from BM25 list and vector list are summed per document
A document appearing #1 in both lists gets maximum combined RRF score

Memory: "k=60 is the 'fairness constant' — it dampens the advantage of top rank vs rank 5."

🪝 Semantic Ranker — "Query Time Only, Never Index Time"

L2 Rerank → Captions → Answers

Semantic Ranker operates ONLY at query time — it does not change the index
Takes top-N results from hybrid search (default top 50)
Reranks using a language model that understands semantic intent
Generates captions (highlighted passages) and answers (direct answers)
Requires Standard tier or above; billed per semantic query

🪝 Knowledge Store Projections — "TOF"

T · O · F

Table projections → Azure Table Storage (Power BI, analytics)
Object projections → Blob Storage as JSON (full enriched doc)
File projections → Blob Storage as binary (normalized images)

Knowledge Store is defined in the skillset. It persists enrichment outputs for consumption OUTSIDE the search service.

🪝 RAG Pipeline Mnemonic — "ECERGA"

E · C · E · R · G · A

Extract — Document Intelligence cracks documents into text
Chunk — Text split skill segments content
Embed — Azure OpenAI embedding model vectorizes chunks
Retrieve — Hybrid search (BM25 + vector + Semantic Ranker)
Generate — Azure OpenAI Chat Completions with retrieved context
Audit — Content Safety groundedness detection validates answer

Question 1 of 10

Score: 0

Close the loop: after building the RAG pipeline, evaluate it with Groundedness, Relevance, Fluency, and Coherence metrics in Azure AI Foundry. Understand that poor Groundedness points to chunking or retrieval issues (wrong chunks returned), while poor Coherence points to generation issues. Use this causal mapping for exam diagnostic scenarios.

MED PRIORITY

🔗 Official Resources

Verified links for AI-103 Domain 5 preparation

📄

Azure AI Document Intelligence Documentation

Full reference for all prebuilt models, custom model training, composed models, REST API, and SDK samples for Python, C#, Java, and JavaScript.

learn.microsoft.com → Document Intelligence ↗

🔍

Azure AI Search Documentation

Complete reference for index design, indexers, skillsets, hybrid search, vector search, Semantic Ranker, integrated vectorization, and Knowledge Store.

learn.microsoft.com → Azure AI Search ↗

⚡

Integrated Vectorization Quickstart

Step-by-step tutorial to configure integrated vectorization with Azure OpenAI embedding models and the Azure AI Search indexer pipeline.

learn.microsoft.com → Integrated Vectorization ↗

🔗

RAG Solution Design Guide

Microsoft's recommended architecture for RAG solutions using Azure AI Search, Azure OpenAI, and Azure AI Foundry — includes chunking strategy guidance.

learn.microsoft.com → RAG Overview ↗

🎬

Azure AI Content Understanding Documentation

Documentation for the Content Understanding service including document, image, audio, and video analyzers, and how to create custom analyzer schemas.

learn.microsoft.com → Content Understanding ↗

📚

AI-103 Official Study Guide

Microsoft's official exam study guide with skills measured breakdown, recommended learning paths, and sample questions for all 5 AI-103 domains.

learn.microsoft.com → AI-103 Study Guide ↗

🎓

AI-103 Certification Page

Official Microsoft certification page for Azure AI Apps and Agents Developer Associate. Register for the exam, access sandbox environments, and view exam details.

learn.microsoft.com → AI-103 Certification ↗

⚡

FlashGenius — More AI-103 Study Pages

Access interactive flashcards, quizzes, and study guides for all 5 AI-103 domains plus hundreds of other certification exams. Free to get started.

→ Start Free on FlashGenius ↗

AI-103: Information ExtractionSolutions