FlashGenius Logo FlashGenius
Microsoft Certified ยท AI-103 ยท Beta April 2026

AI-103: Information Extraction
Solutions

Domain 5 of 5  |  10โ€“15% of Exam  |  Azure AI Apps and Agents Developer Associate

100
Minutes
700
Passing Score
40โ€“60
Questions
Pearson VUE
Delivery
Foundry
Platform

๐Ÿ—‚๏ธ Exam Domain Weights

Highlighted row = this study page. All percentages are approximate.

#DomainWeight
1Plan and manage an Azure AI solution25โ€“30%
2Implement generative AI and agentic solutions30โ€“35%
3Implement computer vision solutions10โ€“15%
4Implement text analysis solutions10โ€“15%
5 ๐ŸŽฏ Implement information extraction solutions THIS PAGE 10โ€“15%

๐Ÿ“Œ What Is Information Extraction?

Information extraction transforms unstructured content โ€” PDFs, scanned documents, images, audio, and video โ€” into structured, queryable data. For AI-103, this domain covers three core Azure services working in concert: Azure AI Document Intelligence (extract fields from documents), Azure AI Search (index, enrich, and retrieve that data at scale), and Azure AI Content Understanding (multimodal analyzers for documents, images, audio, and video).

These services are the foundation of every production RAG pipeline โ€” the ingestion side that feeds your Azure OpenAI models with precise, up-to-date grounding data.

Exam focus: Expect scenario-based questions where you must choose between Document Intelligence prebuilt models vs custom models, select the correct Azure AI Search indexer/skillset configuration, design hybrid search strategies (BM25 + vector + Semantic Ranker), and select the right chunking strategy for a RAG pipeline.

๐Ÿ“– What's on This Page (Domain 5)

๐Ÿ“„Document Intelligence Prebuilt Models
๐Ÿ› ๏ธCustom Template vs Custom Neural
๐Ÿ”Azure AI Search Architecture
โšกHybrid Search & Semantic Ranker
๐Ÿ”—Integrated Vectorization & RAG
๐ŸŽฌAzure AI Content Understanding
โœ‚๏ธChunking Strategies
๐Ÿ“ŠDecision Tables & Service Selection

๐Ÿงญ Service Selection Decision Table

ScenarioUse This ServiceWhy
Extract invoice fields (vendor, total, line items) from PDFsDocument Intelligence โ€” Invoice prebuiltPre-trained on millions of invoices, no training needed
Extract fields from proprietary insurance claim formsDocument Intelligence โ€” Custom Template or NeuralProprietary layout not covered by prebuilt models
Full-text + semantic search over enterprise documentsAzure AI Search with Semantic RankerHybrid BM25 + vector + L2 reranking for best relevance
Ground Azure OpenAI with up-to-date enterprise knowledgeAzure AI Search (RAG pipeline)Retrieval + generation separation; fresh data without retraining
Analyze video for meeting transcription + speaker IDAzure AI Content Understanding โ€” Video analyzerMultimodal; Audio + visual analysis in one service
Classify documents and extract key-value pairs at scaleDocument Intelligence + AI Search indexer + skillsetEnd-to-end pipeline: extract โ†’ enrich โ†’ index โ†’ query

๐Ÿ“„ Azure AI Document Intelligence

Azure AI Document Intelligence (formerly Form Recognizer) uses machine learning to extract text, key-value pairs, tables, and structured fields from documents. It exposes a REST API and SDK (Python, C#, Java, JavaScript).

Prebuilt Models โ€” Know These Cold

ModelExtractsInput
InvoiceVendor name, invoice #, due date, line items, subtotal, tax, totalPDF, JPEG, PNG, TIFF, BMP
ReceiptMerchant name, date, items, totals, payment methodImages, PDF
LayoutText, tables, selection marks, reading order โ€” no field semanticsAny document type
ID DocumentName, DOB, address, ID #, expiry from passports & driver's licensesImages, PDF
W-2Tax year, employer EIN, wages, withholdings โ€” US tax formsPDF
ContractParties, dates, clauses, obligations from legal contractsPDF
Health Insurance CardMember ID, plan, group number, payer nameImages

Layout vs Invoice: Layout extracts ALL text and table structure without knowing what the fields mean. Invoice understands semantics (it knows "Total Due" is the total amount). Use Layout when you need structure for downstream processing; use Invoice/Receipt when you need business-domain field extraction.

Custom Models

Custom Template Model
  • Fixed, predictable layouts (printed forms)
  • Fast training: as few as 5 labeled samples
  • Rule-based field anchoring to page regions
  • Lower compute cost
  • Fails with layout variations
  • Use when: Consistent form structure, high volume
Custom Neural Model
  • Variable layouts (handwritten, multi-format)
  • Requires more labeled samples (typically 50+)
  • Deep learning field extraction โ€” understands context
  • Handles layout variations gracefully
  • Higher accuracy, more compute
  • Use when: Diverse layouts or handwriting

Composed Models

A composed model combines multiple custom models into a single endpoint. Document Intelligence automatically routes each submitted document to the best-matching component model. Use when you process multiple form types (purchase orders, receipts, contracts) through one API call.

Confidence Scores

Every extracted field includes a confidence score from 0.0 to 1.0. Values below 0.5 typically indicate extraction uncertainty โ€” implement business logic to flag low-confidence fields for human review.

POST /documentModels/{modelId}:analyze?api-version=2024-11-30 Content-Type: application/json { "urlSource": "https://storage.blob.core.windows.net/docs/invoice.pdf" } # Poll result: GET /documentModels/{modelId}/analyzeResults/{resultId} # Response includes: fields[].value, fields[].confidence

๐Ÿ” Azure AI Search

Azure AI Search (formerly Cognitive Search) is a cloud search service that indexes, enriches, and queries content at scale. It is the retrieval backbone of enterprise RAG solutions on Azure.

Core Architecture Components

๐Ÿ“š Index ๐Ÿ”Œ Data Source ๐Ÿค– Indexer ๐Ÿง  Skillset ๐Ÿ“ฆ Knowledge Store

Index Field Attributes

AttributePurposeExample Use
keyUnique document identifierDocument ID field
searchableFull-text indexed (BM25 / inverted index)Document content, title
filterableSupports $filter OData expressionsCategory, date, status
sortableSupports $orderbyDate, price, score
facetableSupports aggregation buckets (facet navigation)Category, department
retrievableReturned in search results (stored)Display fields

Vector fields must be typed as Collection(Edm.Single) with vectorSearchProfile assigned. Vector fields cannot be filterable or facetable.

Indexers and Skillsets

An indexer pulls documents from a data source (Blob Storage, SQL, Cosmos DB, etc.) and optionally applies a skillset โ€” a pipeline of AI enrichment steps โ€” before writing results to the index.

Built-in cognitive skills include: OCR, language detection, key phrase extraction, entity recognition, image analysis, document cracking. Custom skills call any Azure Function or web API that implements the skill API contract (inputs/outputs JSON schema).

// Custom skill API contract (must implement this shape) POST https://your-function.azurewebsites.net/api/skill { "values": [ { "recordId": "doc1", "data": { "text": "..." } } ] } // Response: { "values": [ { "recordId": "doc1", "data": { "myField": "extracted value" }, "errors": [], "warnings": [] } ] }

Search Types and When to Use Them

Search TypeAlgorithmBest For
Full-textBM25 (TF-IDF variant)Keyword precision, exact term matching
VectorHNSW approximate nearest neighborSemantic similarity, cross-lingual queries
HybridBM25 + Vector merged via RRFBest relevance โ€” combines both signals
SemanticHybrid + L2 reranking (language model)Highest relevance; generates captions and answers

HNSW โ€” Hierarchical Navigable Small World

HNSW is the approximate nearest-neighbor algorithm used for vector search. Key parameters:

  • m (connections per node, default 4): More connections = better recall, more memory
  • efConstruction (build-time exploration, default 400): Higher = better index quality, slower build
  • efSearch (query-time exploration): Higher = better recall, higher latency
  • metric: cosine (recommended for normalized embeddings), dotProduct, euclidean

Reciprocal Rank Fusion (RRF)

RRF merges BM25 and vector search result lists into a single ranked list. For each document in each result set, its RRF score is computed as 1 / (k + rank) where k=60 (default). Scores from both lists are summed. Documents appearing high in both lists get the highest combined scores.

Why RRF wins: It is rank-based, not score-based, so BM25 scores (unbounded) and vector cosine scores (0โ€“1) can be combined without normalization. A document in position 1 in both lists gets max RRF score regardless of raw score magnitude.

Semantic Ranker

Semantic Ranker is an optional L2 (second-pass) reranker that takes the top-N results from hybrid search and reranks them using a language model. It also generates semantic captions (highlighted passages) and semantic answers (direct extractions for factual questions).

Cost: Semantic Ranker is an add-on (requires Standard tier or above) billed per query. It does NOT change the index โ€” it is a query-time operation only.

Integrated Vectorization

Integrated vectorization embeds chunking and embedding generation directly into the indexer pipeline โ€” no external preprocessing step needed. You configure a vectorizer (pointing to Azure OpenAI text-embedding model) and Azure AI Search handles chunking documents, calling the embedding API, and storing vectors automatically.

Integrated vectorization pipeline: Blob Storage โ†’ Indexer pulls document โ†’ Document cracking skill splits content โ†’ Text split skill chunks text โ†’ Azure OpenAI vectorizer embeds each chunk โ†’ Vectors written to index field.

Knowledge Store

Knowledge Store persists AI enrichment outputs from the skillset pipeline to Azure Storage (Blob or Table Storage) in addition to the search index. It enables downstream analytics, Power BI reports, or other applications to consume enriched data without going through the search API.

Projection types:

  • Table projections โ†’ Azure Table Storage (rows/columns โ€” good for structured data, Power BI)
  • Object projections โ†’ Blob Storage as JSON (full enriched document)
  • File projections โ†’ Blob Storage as binary (normalized images from image analysis skill)

โœ‚๏ธ Chunking Strategies for RAG

Chunking splits source documents into smaller pieces before embedding. Chunk size directly impacts retrieval quality: too large = context dilution, too small = loss of context.

StrategyHow It WorksBest ForTrade-off
Fixed-size Split every N tokens regardless of content boundaries Simple pipelines, uniform content May split sentences/concepts mid-thought
Sliding window Fixed-size with overlap (e.g., 512 tokens, 50-token overlap) Preserving context across chunk boundaries More chunks to store/embed; higher cost
Semantic Split at sentence/paragraph boundaries; respect natural language structure Narrative text, articles, reports Variable chunk sizes; more complex to implement
Hierarchical Parent chunks (e.g., full page) + child chunks (e.g., paragraph). Retrieve child, send parent to LLM RAG needing precision retrieval + broad context Double storage; complex index design

Hierarchical chunking advantage: Child chunks (smaller) are used for vector search precision. Parent chunks (larger) are sent to the LLM as context โ€” giving the model more surrounding information to generate accurate answers without noisy retrieval.

Overlap and Chunk Size Guidance

  • Chunk size: 256โ€“1024 tokens is typical. Match to your LLM's context budget and typical answer length.
  • Overlap: 10โ€“20% of chunk size (e.g., 100-token overlap for 512-token chunks)
  • Azure AI Search default: Text split skill uses 2000 tokens with 500-token overlap by default

๐ŸŽฌ Azure AI Content Understanding

Azure AI Content Understanding is a newer Azure AI service (2024โ€“2025) for multimodal document and media analysis. It provides pre-trained and customizable analyzers for documents, images, audio, and video โ€” going beyond Document Intelligence's form extraction to understand richer content types.

Analyzer Types

AnalyzerInputOutput
Document analyzerPDF, Word, HTMLStructured fields, key-value pairs, tables, reading order
Image analyzerJPEG, PNG, TIFFDetected objects, text (OCR), captions, descriptions
Audio analyzerWAV, MP3, MP4 audioTranscription, speaker diarization, intent classification
Video analyzerMP4, MOVTranscription, keyframe extraction, scene description, speaker ID

vs Document Intelligence: Document Intelligence specializes in structured form field extraction with prebuilt business models. Content Understanding supports richer multimodal inputs (audio, video) and uses customizable analyzer schemas. They complement each other in complex pipelines.

๐Ÿ”— End-to-End RAG Pipeline Architecture

A production RAG pipeline on Azure has two phases: Ingestion (offline) and Retrieval + Generation (online).

Ingestion Phase (Offline)

Source Documents (Blob Storage) โ†“ Azure AI Document Intelligence / Content Understanding Extract text + structure โ†“ Text Split Skill (Azure AI Search indexer) Chunk into overlapping segments โ†“ Azure OpenAI Embedding Model (text-embedding-3-large) Generate vector embeddings per chunk โ†“ Azure AI Search Index Store: text chunk + vector + metadata (source, page, date)

Retrieval + Generation Phase (Online)

User Query โ†“ Azure OpenAI Embedding Model Embed query โ†’ query vector โ†“ Azure AI Search Hybrid Query BM25 (keyword) + Vector (HNSW) โ†’ RRF merge โ†’ Semantic Ranker rerank โ†“ Top-K Retrieved Chunks (with citations) โ†“ Azure OpenAI Chat Completions API System prompt + retrieved chunks + user query โ†’ LLM โ†“ Grounded Answer (with source references)

Groundedness check: After generation, use Azure AI Content Safety Groundedness Detection to verify the LLM answer is supported by the retrieved context โ€” not hallucinated. This closes the RAG quality loop.

๐Ÿช FISH โ€” Chunking Strategies

F ยท I ยท S ยท H
  • Fixed-size โ€” split every N tokens (simple, fast)
  • Intelligent/Semantic โ€” split at language boundaries (sentence, paragraph)
  • Sliding window โ€” fixed-size with overlap (preserves boundary context)
  • Hierarchical โ€” parent + child chunks (precision retrieval, broad context)

Remember: "When you need context at the edges, slide. When you need both precision and breadth, go hierarchical."

๐Ÿช DIAL โ€” Document Intelligence Custom Model Decision

D ยท I ยท A ยท L
  • Diversity of layout โ†’ Custom Neural (handles variation)
  • Identical structure โ†’ Custom Template (fast, cheap, 5 samples)
  • Already covered โ†’ Prebuilt model (Invoice, Receipt, Layout, ID...)
  • Lots of form types โ†’ Composed model (route to best-match component)

๐Ÿช HNSW Tuning โ€” "More M, More Memory. More EF, More Effort."

M = Memory ยท EF = Effort
  • m (connections): Higher m = better recall, more RAM. Default 4.
  • efConstruction (build effort): Higher = better index quality, slower build. Default 400.
  • efSearch (query effort): Higher = better recall at query time, slower. Set per query.
  • metric: Use cosine for normalized embeddings (OpenAI models output normalized vectors).

๐Ÿช RRF โ€” "Rank, Don't Score"

score = 1 / (60 + rank)
  • BM25 and vector scores are on different scales โ€” RRF ignores raw scores
  • Only the rank position matters: rank 1 = 1/61, rank 2 = 1/62, etc.
  • Scores from BM25 list and vector list are summed per document
  • A document appearing #1 in both lists gets maximum combined RRF score

Memory: "k=60 is the 'fairness constant' โ€” it dampens the advantage of top rank vs rank 5."

๐Ÿช Semantic Ranker โ€” "Query Time Only, Never Index Time"

L2 Rerank โ†’ Captions โ†’ Answers
  • Semantic Ranker operates ONLY at query time โ€” it does not change the index
  • Takes top-N results from hybrid search (default top 50)
  • Reranks using a language model that understands semantic intent
  • Generates captions (highlighted passages) and answers (direct answers)
  • Requires Standard tier or above; billed per semantic query

๐Ÿช Knowledge Store Projections โ€” "TOF"

T ยท O ยท F
  • Table projections โ†’ Azure Table Storage (Power BI, analytics)
  • Object projections โ†’ Blob Storage as JSON (full enriched doc)
  • File projections โ†’ Blob Storage as binary (normalized images)

Knowledge Store is defined in the skillset. It persists enrichment outputs for consumption OUTSIDE the search service.

๐Ÿช RAG Pipeline Mnemonic โ€” "ECERGA"

E ยท C ยท E ยท R ยท G ยท A
  • Extract โ€” Document Intelligence cracks documents into text
  • Chunk โ€” Text split skill segments content
  • Embed โ€” Azure OpenAI embedding model vectorizes chunks
  • Retrieve โ€” Hybrid search (BM25 + vector + Semantic Ranker)
  • Generate โ€” Azure OpenAI Chat Completions with retrieved context
  • Audit โ€” Content Safety groundedness detection validates answer
Question 1 of 10
Score: 0

๐Ÿ“Š Personalized Study Advisor

Select your background for a tailored Domain 5 study plan.

1

Call Document Intelligence REST API โ€” Hands-On

As a developer, the best way to internalize this is to actually call the API. Submit a PDF invoice to the prebuilt Invoice model via REST or SDK. Inspect the JSON response: understand the fields array, confidence scores, and bounding regions. Then try the Layout model on the same document and compare outputs โ€” this locks in the Layout vs Invoice distinction.

HIGH PRIORITYHands-On
2

Build a Minimal RAG Pipeline with Azure AI Search SDK

Use the azure-search-documents Python SDK to create an index with vector fields, upload chunks with embeddings, and run hybrid queries. Focus on understanding HybridSearch request structure, top-K, and how @search.score differs from @search.rerankerScore. The exam will test you on query parameters and result fields, not just architecture.

HIGH PRIORITYSDK Focus
3

Understand Index Schema Design for RAG

Study index field attributes โ€” specifically which attributes are required for vector fields (Collection(Edm.Single), vectorSearchProfile). Understand that vector fields cannot be filterable or facetable. Design an index schema that supports both BM25 and vector search on the same content field (contentVector for vector, content for text).

HIGH PRIORITY
4

Study Integrated Vectorization Configuration

Integrated vectorization is a developer productivity feature โ€” understand what it eliminates (external preprocessing, separate embedding calls) and what it requires (vectorizer configuration in the index, skillset with Text Split + Azure OpenAI skill). Practice reading the indexer/skillset JSON definitions in the Azure portal.

MED PRIORITY
5

Review Custom Skill API Contract

Custom skills bridge Azure AI Search to any external logic. Memorize the request/response shape: values array, recordId, data input/output, errors, warnings. Know that the skill endpoint must respond within the timeout (default 230 seconds) or the indexer marks the document as an error. This is frequently tested in scenario questions.

MED PRIORITY
1

Map Data Pipeline Concepts to AI Search Indexers

You know ETL pipelines โ€” an indexer is Azure AI Search's ETL. The data source is the extract; the skillset is the transform; the index write is the load. Study the indexer scheduling (every 5 minutes minimum), change detection (high water mark, soft-delete detection), and error handling (maxFailedItems). These are the same operational concerns you handle in Databricks or ADF pipelines.

HIGH PRIORITYData Focus
2

Deep Dive: Knowledge Store Projections

As a data engineer, Knowledge Store is your domain. Understand table vs object vs file projections. A table projection maps skillset output fields to Azure Table Storage rows โ€” think of it like writing enriched data to a structured sink. Object projections write full JSON enrichment trees to Blob Storage. Practice defining projection groups in the skillset JSON definition.

HIGH PRIORITY
3

Chunking Strategy Trade-offs for Your Use Cases

Apply chunking strategy decisions to data engineering scenarios: batch processing of PDFs (fixed-size for speed), streaming document ingestion (semantic for quality), large knowledge bases (hierarchical for both). Understand that overlap increases storage and embedding cost โ€” trade-off for context preservation at chunk boundaries.

HIGH PRIORITY
4

Document Intelligence for Structured Data Extraction

Study Document Intelligence from the pipeline perspective: it is the document cracking step that converts unstructured PDFs into structured data your pipeline can process. Focus on the composed model pattern โ€” one API endpoint routing to multiple custom models โ€” which is the equivalent of a dispatch pattern in data engineering.

MED PRIORITY
5

RRF Score Calculation and Search Result Interpretation

Understand RRF mathematically: 1/(k+rank) per result list, summed. This is important for explaining why a document might rank higher in hybrid search than in either BM25 or vector-only search โ€” it appeared in both lists, even if not at the very top of either. Know that @search.score (BM25/hybrid) and @search.rerankerScore (Semantic Ranker) are separate fields in the API response.

MED PRIORITY
1

Focus on Chunking Strategy Impact on RAG Quality

You understand embeddings โ€” focus on how chunking strategy affects retrieval quality metrics. Smaller chunks improve retrieval precision (the right chunk comes back) but reduce answer quality (not enough context). Larger chunks improve answer quality but reduce precision. Hierarchical chunking solves both but doubles storage. Map each strategy to groundedness and relevance metrics in the Azure AI Foundry evaluation framework.

HIGH PRIORITYML Focus
2

HNSW Parameter Tuning โ€” The Recall/Latency Trade-off

You know approximate nearest neighbor algorithms โ€” apply that knowledge to Azure AI Search's HNSW configuration. The exam may ask: which parameter controls build-time index quality (efConstruction), which controls query-time recall (efSearch), and which controls graph connectivity (m). Know that cosine distance is recommended for OpenAI embeddings because they are L2-normalized.

HIGH PRIORITY
3

Semantic Ranker vs Fine-tuned Embeddings

Understand the trade-off: Semantic Ranker is a zero-shot query-time reranker (no training, add-on billing). Fine-tuned embeddings (Custom Neural via Azure OpenAI fine-tuning) produce domain-specific vectors but require training data and are baked into the index. For exam scenarios, Semantic Ranker is the recommended first step for relevance improvement before investing in fine-tuning.

HIGH PRIORITY
4

Content Understanding Multimodal Analyzer Design

Azure AI Content Understanding is newer and less documented. Focus on the analyzer type decision: document analyzer for PDFs/Word, image analyzer for visual content, audio/video analyzers for meeting recordings. The key exam scenario: choosing Content Understanding over Document Intelligence for video meeting summarization or audio transcript extraction with speaker diarization.

MED PRIORITY
5

End-to-End RAG Evaluation with Azure AI Foundry

Close the loop: after building the RAG pipeline, evaluate it with Groundedness, Relevance, Fluency, and Coherence metrics in Azure AI Foundry. Understand that poor Groundedness points to chunking or retrieval issues (wrong chunks returned), while poor Coherence points to generation issues. Use this causal mapping for exam diagnostic scenarios.

MED PRIORITY

๐Ÿ”— Official Resources

Verified links for AI-103 Domain 5 preparation

๐Ÿ“„

Azure AI Document Intelligence Documentation

Full reference for all prebuilt models, custom model training, composed models, REST API, and SDK samples for Python, C#, Java, and JavaScript.

learn.microsoft.com โ†’ Document Intelligence โ†—
๐Ÿ”

Azure AI Search Documentation

Complete reference for index design, indexers, skillsets, hybrid search, vector search, Semantic Ranker, integrated vectorization, and Knowledge Store.

learn.microsoft.com โ†’ Azure AI Search โ†—
โšก

Integrated Vectorization Quickstart

Step-by-step tutorial to configure integrated vectorization with Azure OpenAI embedding models and the Azure AI Search indexer pipeline.

learn.microsoft.com โ†’ Integrated Vectorization โ†—
๐Ÿ”—

RAG Solution Design Guide

Microsoft's recommended architecture for RAG solutions using Azure AI Search, Azure OpenAI, and Azure AI Foundry โ€” includes chunking strategy guidance.

learn.microsoft.com โ†’ RAG Overview โ†—
๐ŸŽฌ

Azure AI Content Understanding Documentation

Documentation for the Content Understanding service including document, image, audio, and video analyzers, and how to create custom analyzer schemas.

learn.microsoft.com โ†’ Content Understanding โ†—
๐Ÿ“š

AI-103 Official Study Guide

Microsoft's official exam study guide with skills measured breakdown, recommended learning paths, and sample questions for all 5 AI-103 domains.

learn.microsoft.com โ†’ AI-103 Study Guide โ†—
๐ŸŽ“

AI-103 Certification Page

Official Microsoft certification page for Azure AI Apps and Agents Developer Associate. Register for the exam, access sandbox environments, and view exam details.

learn.microsoft.com โ†’ AI-103 Certification โ†—
โšก

FlashGenius โ€” More AI-103 Study Pages

Access interactive flashcards, quizzes, and study guides for all 5 AI-103 domains plus hundreds of other certification exams. Free to get started.

โ†’ Start Free on FlashGenius โ†—