Domain 5 of 5 | 10–15% of Exam | Azure AI Apps and Agents Developer Associate
Highlighted row = this study page. All percentages are approximate.
| # | Domain | Weight |
|---|---|---|
| 1 | Plan and manage an Azure AI solution | 25–30% |
| 2 | Implement generative AI and agentic solutions | 30–35% |
| 3 | Implement computer vision solutions | 10–15% |
| 4 | Implement text analysis solutions | 10–15% |
| 5 | Implement information extraction solutions (this page) | 10–15% |
Information extraction transforms unstructured content (PDFs, scanned documents, images, audio, and video) into structured, queryable data. For AI-103, this domain covers three core Azure services working in concert: Azure AI Document Intelligence (extract fields from documents), Azure AI Search (index, enrich, and retrieve that data at scale), and Azure AI Content Understanding (multimodal analyzers for documents, images, audio, and video).
These services form the foundation of a production RAG pipeline on Azure: the ingestion side that feeds your Azure OpenAI models with precise, up-to-date grounding data.
Exam focus: Expect scenario-based questions where you must choose between Document Intelligence prebuilt models vs custom models, select the correct Azure AI Search indexer/skillset configuration, design hybrid search strategies (BM25 + vector + Semantic Ranker), and select the right chunking strategy for a RAG pipeline.
| Scenario | Use This Service | Why |
|---|---|---|
| Extract invoice fields (vendor, total, line items) from PDFs | Document Intelligence (Invoice prebuilt) | Pre-trained on millions of invoices, no training needed |
| Extract fields from proprietary insurance claim forms | Document Intelligence (Custom Template or Neural) | Proprietary layout not covered by prebuilt models |
| Full-text + semantic search over enterprise documents | Azure AI Search with Semantic Ranker | Hybrid BM25 + vector + L2 reranking for best relevance |
| Ground Azure OpenAI with up-to-date enterprise knowledge | Azure AI Search (RAG pipeline) | Retrieval + generation separation; fresh data without retraining |
| Analyze video for meeting transcription + speaker ID | Azure AI Content Understanding (Video analyzer) | Multimodal: audio and visual analysis in one service |
| Classify documents and extract key-value pairs at scale | Document Intelligence + AI Search indexer + skillset | End-to-end pipeline: extract → enrich → index → query |
Azure AI Document Intelligence (formerly Form Recognizer) uses machine learning to extract text, key-value pairs, tables, and structured fields from documents. It exposes a REST API and SDK (Python, C#, Java, JavaScript).
| Model | Extracts | Input |
|---|---|---|
| Invoice | Vendor name, invoice #, due date, line items, subtotal, tax, total | PDF, JPEG, PNG, TIFF, BMP |
| Receipt | Merchant name, date, items, totals, payment method | Images, PDF |
| Layout | Text, tables, selection marks, reading order; no field semantics | Any document type |
| ID Document | Name, DOB, address, ID #, expiry from passports & driver's licenses | Images, PDF |
| W-2 | Tax year, employer EIN, wages, withholdings (US tax forms) | |
| Contract | Parties, dates, clauses, obligations from legal contracts | |
| Health Insurance Card | Member ID, plan, group number, payer name | Images |
Layout vs Invoice: Layout extracts ALL text and table structure without knowing what the fields mean. Invoice understands semantics (it knows "Total Due" is the total amount). Use Layout when you need structure for downstream processing; use Invoice/Receipt when you need business-domain field extraction.
A composed model combines multiple custom models into a single endpoint. Document Intelligence automatically routes each submitted document to the best-matching component model. Use when you process multiple form types (purchase orders, receipts, contracts) through one API call.
Every extracted field includes a confidence score from 0.0 to 1.0. Values below 0.5 typically indicate extraction uncertainty, so implement business logic to flag low-confidence fields for human review.
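The review rule above can be sketched in a few lines. This is a minimal illustration, not the Document Intelligence SDK: the input is assumed to be a plain dictionary of `field name -> (value, confidence)` that you have already pulled out of the service response, and `split_by_confidence` and `REVIEW_THRESHOLD` are hypothetical names.

```python
from typing import Dict, List, Tuple

# Hypothetical cut-off taken from the 0.5 rule of thumb above; tune it per field in practice.
REVIEW_THRESHOLD = 0.5


def split_by_confidence(
    fields: Dict[str, Tuple[str, float]],
    threshold: float = REVIEW_THRESHOLD,
) -> Tuple[Dict[str, str], List[str]]:
    """Return (accepted values, field names that need human review)."""
    accepted: Dict[str, str] = {}
    needs_review: List[str] = []
    for name, (value, confidence) in fields.items():
        if confidence < threshold:
            needs_review.append(name)   # route to a human reviewer
        else:
            accepted[name] = value      # safe to process automatically
    return accepted, needs_review


# Made-up sample values, shaped like fields extracted from an invoice.
sample = {
    "VendorName": ("Contoso", 0.97),
    "InvoiceTotal": ("110.00", 0.42),
    "DueDate": ("2025-01-31", 0.88),
}
accepted, needs_review = split_by_confidence(sample)
print(accepted)
print(needs_review)
```

In a real pipeline the threshold is usually stricter for high-risk fields such as totals than for descriptive fields.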
Azure AI Search (formerly Cognitive Search) is a cloud search service that indexes, enriches, and queries content at scale. It is the retrieval backbone of enterprise RAG solutions on Azure.
| Attribute | Purpose | Example Use |
|---|---|---|
| key | Unique document identifier | Document ID field |
| searchable | Full-text indexed (BM25 / inverted index) | Document content, title |
| filterable | Supports $filter OData expressions | Category, date, status |
| sortable | Supports $orderby | Date, price, score |
| facetable | Supports aggregation buckets (facet navigation) | Category, department |
| retrievable | Returned in search results (stored) | Display fields |
Vector fields must be typed as Collection(Edm.Single) with vectorSearchProfile assigned. Vector fields cannot be filterable or facetable.
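To make the attribute rules concrete, here is a sketch of an index definition as a Python dictionary shaped like the REST request body. The index, field, algorithm, and profile names are placeholders, and the `dimensions` value is an assumption that must match whichever embedding model you use.

```python
import json


def build_index_definition(index_name: str, dimensions: int) -> dict:
    """Build an index body with one text field and one vector field for the same content."""
    return {
        "name": index_name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
            # Text field: full-text (BM25) searchable and returned in results.
            {"name": "content", "type": "Edm.String", "searchable": True, "retrievable": True},
            # Metadata field used for $filter expressions and facet navigation.
            {"name": "category", "type": "Edm.String", "filterable": True, "facetable": True},
            # Vector field: typed Collection(Edm.Single), tied to a profile, no filterable/facetable.
            {
                "name": "contentVector",
                "type": "Collection(Edm.Single)",
                "searchable": True,
                "dimensions": dimensions,
                "vectorSearchProfile": "default-profile",  # placeholder profile name
            },
        ],
        "vectorSearch": {
            "algorithms": [{"name": "default-hnsw", "kind": "hnsw"}],
            "profiles": [{"name": "default-profile", "algorithm": "default-hnsw"}],
        },
    }


index_definition = build_index_definition("docs-index", dimensions=1536)
print(json.dumps(index_definition["fields"][-1], indent=2))
```

Keeping `content` and `contentVector` side by side is what lets one query run BM25 and vector search over the same chunk.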
An indexer pulls documents from a data source (Blob Storage, SQL, Cosmos DB, etc.) and optionally applies a skillset (a pipeline of AI enrichment steps) before writing results to the index.
Built-in skills include OCR, language detection, key phrase extraction, entity recognition, image analysis, and document extraction. Document cracking (opening each file and extracting its text and images) is performed by the indexer itself before the skills run. Custom skills call any Azure Function or web API that implements the skill API contract (inputs/outputs JSON schema).
| Search Type | Algorithm | Best For |
|---|---|---|
| Full-text | BM25 (TF-IDF variant) | Keyword precision, exact term matching |
| Vector | HNSW approximate nearest neighbor | Semantic similarity, cross-lingual queries |
| Hybrid | BM25 + Vector merged via RRF | Best relevance; combines both signals |
| Semantic | Hybrid + L2 reranking (language model) | Highest relevance; generates captions and answers |
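HNSW is the approximate nearest-neighbor algorithm used for vector search. Key parameters: m (graph connectivity), efConstruction (build-time index quality), and efSearch (query-time recall).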
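As a hedged sketch, an HNSW algorithm entry in the index's `vectorSearch` section looks roughly like the dictionary below. The defaults and allowed ranges in the code are the documented values as best I recall them (m 4 to 10, efConstruction and efSearch 100 to 1000), so verify them against the current REST reference; `hnsw_algorithm` is a hypothetical helper.

```python
def hnsw_algorithm(
    name: str,
    m: int = 4,
    ef_construction: int = 400,
    ef_search: int = 500,
    metric: str = "cosine",
) -> dict:
    """Build an HNSW algorithm configuration, rejecting values outside the assumed ranges."""
    if not 4 <= m <= 10:
        raise ValueError("m (graph connectivity) is assumed to range from 4 to 10")
    if not 100 <= ef_construction <= 1000:
        raise ValueError("efConstruction (build-time quality) is assumed to range from 100 to 1000")
    if not 100 <= ef_search <= 1000:
        raise ValueError("efSearch (query-time recall) is assumed to range from 100 to 1000")
    return {
        "name": name,
        "kind": "hnsw",
        "hnswParameters": {
            "m": m,                             # links per node: more links, more memory
            "efConstruction": ef_construction,  # neighbors considered while building the graph
            "efSearch": ef_search,              # neighbors considered while answering a query
            "metric": metric,
        },
    }


config = hnsw_algorithm("my-hnsw")
print(config)
```

Raising `efSearch` generally trades query latency for recall, while `m` and `efConstruction` are fixed at index build time.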
RRF merges BM25 and vector search result lists into a single ranked list. For each document in each result set, its RRF score is computed as 1 / (k + rank) where k=60 (default). Scores from both lists are summed. Documents appearing high in both lists get the highest combined scores.
Why RRF wins: It is rank-based, not score-based, so BM25 scores (unbounded) and vector cosine scores (0–1) can be combined without normalization. A document in position 1 in both lists gets the maximum RRF score regardless of raw score magnitude.
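The fusion rule is small enough to work through by hand. The sketch below implements the 1 / (k + rank) formula exactly as described, on two made-up result lists; it is an illustration of the math, not the service's internal code, and the document IDs are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rrf_fuse(ranked_lists: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Sum 1 / (k + rank) for each document across all ranked lists (ranks start at 1)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_results = ["A", "B", "C"]     # keyword ranking
vector_results = ["B", "D", "A"]   # vector ranking
fused = rrf_fuse([bm25_results, vector_results])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Document B (ranks 2 and 1) scores 1/62 + 1/61, which beats A (ranks 1 and 3) at 1/61 + 1/63, and both beat C and D, which appear in only one list.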
Semantic Ranker is an optional L2 (second-pass) reranker that takes the top-N results from hybrid search and reranks them using a language model. It also generates semantic captions (highlighted passages) and semantic answers (direct extractions for factual questions).
Cost: Semantic Ranker is an add-on (requires Standard tier or above) billed per query. It does NOT change the index; it is a query-time operation only.
Integrated vectorization builds chunking and embedding generation directly into the indexer pipeline, so no external preprocessing step is needed. You configure a vectorizer (pointing to an Azure OpenAI text-embedding model) and Azure AI Search handles chunking documents, calling the embedding API, and storing vectors automatically.
Integrated vectorization pipeline: Blob Storage → Indexer pulls document → Document cracking extracts content → Text split skill chunks text → Azure OpenAI vectorizer embeds each chunk → Vectors written to index field.
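The two skills in that pipeline can be sketched as a skillset body. This is an assumption-heavy outline: the `@odata.type` names and property names follow the Text Split and Azure OpenAI Embedding skill references as I understand them, while the endpoint, deployment, model name, and chunk sizes are placeholders. Authentication and the index projection that maps chunks to index documents are omitted.

```python
def build_vectorization_skillset(
    name: str,
    aoai_endpoint: str,
    deployment_id: str,
    model_name: str = "text-embedding-3-small",  # placeholder model name
) -> dict:
    """Build a skillset body that chunks document text and embeds every chunk."""
    split_skill = {
        "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
        "context": "/document",
        "textSplitMode": "pages",
        "maximumPageLength": 2000,  # chunk length (characters by default); example value
        "pageOverlapLength": 200,   # overlap between neighboring chunks; example value
        "inputs": [{"name": "text", "source": "/document/content"}],
        "outputs": [{"name": "textItems", "targetName": "pages"}],
    }
    embedding_skill = {
        "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
        "context": "/document/pages/*",  # run once per chunk produced by the split skill
        "resourceUri": aoai_endpoint,
        "deploymentId": deployment_id,
        "modelName": model_name,
        "inputs": [{"name": "text", "source": "/document/pages/*"}],
        "outputs": [{"name": "embedding", "targetName": "vector"}],
    }
    return {"name": name, "skills": [split_skill, embedding_skill]}


skillset = build_vectorization_skillset(
    "chunk-and-embed",
    "https://example-resource.openai.azure.com",  # placeholder endpoint
    "embedding-deployment",                        # placeholder deployment name
)
print([skill["@odata.type"] for skill in skillset["skills"]])
```

Reading a definition like this in the portal is mostly a matter of following each skill's `source` path back to the output that produced it.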
Knowledge Store persists AI enrichment outputs from the skillset pipeline to Azure Storage (Blob or Table Storage) in addition to the search index. It enables downstream analytics, Power BI reports, or other applications to consume enriched data without going through the search API.
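Projection types: table projections (rows in Azure Table Storage), object projections (JSON enrichment trees in Blob Storage), and file projections (binary files such as extracted images in Blob Storage).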
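A Knowledge Store definition lives inside the skillset and groups its projections. The sketch below shows one projection group per type; the table name, container names, and `source` paths are hypothetical (in practice the sources usually point at the output of a Shaper skill).

```python
def build_knowledge_store(connection_string: str) -> dict:
    """Build a knowledgeStore section with separate table, object, and file projection groups."""
    return {
        "storageConnectionString": connection_string,
        "projections": [
            {   # Group 1: tabular rows for analytics tools such as Power BI.
                "tables": [
                    {
                        "tableName": "Documents",
                        "generatedKeyName": "DocumentId",
                        "source": "/document/tableprojection",  # hypothetical shaped node
                    }
                ],
                "objects": [],
                "files": [],
            },
            {   # Group 2: full JSON enrichment trees written to a blob container.
                "tables": [],
                "objects": [
                    {"storageContainer": "enriched-docs", "source": "/document/objectprojection"}
                ],
                "files": [],
            },
            {   # Group 3: binary image files extracted from the source documents.
                "tables": [],
                "objects": [],
                "files": [
                    {"storageContainer": "images", "source": "/document/normalized_images/*"}
                ],
            },
        ],
    }


knowledge_store = build_knowledge_store("<storage-connection-string>")
print(len(knowledge_store["projections"]))
```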
Chunking splits source documents into smaller pieces before embedding. Chunk size directly impacts retrieval quality: too large = context dilution, too small = loss of context.
| Strategy | How It Works | Best For | Trade-off |
|---|---|---|---|
| Fixed-size | Split every N tokens regardless of content boundaries | Simple pipelines, uniform content | May split sentences/concepts mid-thought |
| Sliding window | Fixed-size with overlap (e.g., 512 tokens, 50-token overlap) | Preserving context across chunk boundaries | More chunks to store/embed; higher cost |
| Semantic | Split at sentence/paragraph boundaries; respect natural language structure | Narrative text, articles, reports | Variable chunk sizes; more complex to implement |
| Hierarchical | Parent chunks (e.g., full page) + child chunks (e.g., paragraph). Retrieve child, send parent to LLM | RAG needing precision retrieval + broad context | Double storage; complex index design |
Hierarchical chunking advantage: Child chunks (smaller) are used for vector search precision. Parent chunks (larger) are sent to the LLM as context, giving the model more surrounding information to generate accurate answers without noisy retrieval.
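The sliding-window and hierarchical strategies from the table can be sketched together. This is a simplified illustration that treats a list of strings as tokens; a real pipeline would use the embedding model's tokenizer, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def sliding_window_chunks(tokens: List[str], chunk_size: int = 512, overlap: int = 50) -> List[List[str]]:
    """Fixed-size chunks where each chunk repeats the last `overlap` tokens of the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final chunk already reaches the end of the text
    return chunks


def build_hierarchy(
    tokens: List[str], parent_size: int, child_size: int, child_overlap: int
) -> Tuple[Dict[int, List[str]], List[dict]]:
    """Split into parent chunks, then into overlapping child chunks that remember their parent."""
    parents: Dict[int, List[str]] = {}
    children: List[dict] = []
    for parent_id, start in enumerate(range(0, len(tokens), parent_size)):
        parent_tokens = tokens[start:start + parent_size]
        parents[parent_id] = parent_tokens
        for child_tokens in sliding_window_chunks(parent_tokens, child_size, child_overlap):
            children.append({"parent_id": parent_id, "tokens": child_tokens})
    return parents, children


tokens = [f"t{i}" for i in range(10)]
chunks = sliding_window_chunks(tokens, chunk_size=4, overlap=1)
parents, children = build_hierarchy(tokens, parent_size=5, child_size=3, child_overlap=1)

# Retrieval matches a small child chunk; the larger parent is what gets sent to the LLM.
matched_child = children[2]
context_for_llm = parents[matched_child["parent_id"]]
print(chunks)
print(context_for_llm)
```

With ten tokens, a window of 4, and an overlap of 1, the text becomes three chunks that share one token at each boundary, which is the storage and embedding overhead the table's trade-off column refers to.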
Azure AI Content Understanding is a newer Azure AI service (2024–2025) for multimodal document and media analysis. It provides pre-trained and customizable analyzers for documents, images, audio, and video, going beyond Document Intelligence's form extraction to understand richer content types.
| Analyzer | Input | Output |
|---|---|---|
| Document analyzer | PDF, Word, HTML | Structured fields, key-value pairs, tables, reading order |
| Image analyzer | JPEG, PNG, TIFF | Detected objects, text (OCR), captions, descriptions |
| Audio analyzer | WAV, MP3, MP4 audio | Transcription, speaker diarization, intent classification |
| Video analyzer | MP4, MOV | Transcription, keyframe extraction, scene description, speaker ID |
vs Document Intelligence: Document Intelligence specializes in structured form field extraction with prebuilt business models. Content Understanding supports richer multimodal inputs (audio, video) and uses customizable analyzer schemas. They complement each other in complex pipelines.
A production RAG pipeline on Azure has two phases: Ingestion (offline) and Retrieval + Generation (online).
Groundedness check: After generation, use Azure AI Content Safety Groundedness Detection to verify the LLM answer is supported by the retrieved context rather than hallucinated. This closes the RAG quality loop.
Remember: "When you need context at the edges, slide. When you need both precision and breadth, go hierarchical."
Memory: "k=60 is the 'fairness constant': it dampens the advantage of top rank vs rank 5."
Knowledge Store is defined in the skillset. It persists enrichment outputs for consumption OUTSIDE the search service.
As a developer, the best way to internalize this is to actually call the API. Submit a PDF invoice to the prebuilt Invoice model via REST or SDK. Inspect the JSON response: understand the fields array, confidence scores, and bounding regions. Then try the Layout model on the same document and compare outputs; this locks in the Layout vs Invoice distinction.
Use the azure-search-documents Python SDK to create an index with vector fields, upload chunks with embeddings, and run hybrid queries. Focus on understanding HybridSearch request structure, top-K, and how @search.score differs from @search.rerankerScore. The exam will test you on query parameters and result fields, not just architecture.
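Before reaching for the SDK, it helps to see the shape of a hybrid request. The sketch below builds the JSON body for a REST search call as I understand the stable API: `search` carries the keyword query, `vectorQueries` carries the embedding, and `queryType` plus `semanticConfiguration` switch on the Semantic Ranker. The field name `contentVector`, the configuration name, and the three-element vector are placeholders.

```python
from typing import List, Optional


def build_hybrid_query(
    text: str,
    query_vector: List[float],
    k: int = 5,
    top: int = 5,
    semantic_configuration: Optional[str] = None,
) -> dict:
    """Build a hybrid (keyword + vector) search body, optionally with semantic reranking."""
    body = {
        "search": text,                   # BM25 side of the hybrid query
        "vectorQueries": [
            {
                "kind": "vector",
                "vector": query_vector,   # embedding of the same query text
                "fields": "contentVector",
                "k": k,                   # nearest neighbors to pull from the vector side
            }
        ],
        "select": "id,content",
        "top": top,                       # results returned after fusion
    }
    if semantic_configuration:
        body["queryType"] = "semantic"
        body["semanticConfiguration"] = semantic_configuration
    return body


hybrid = build_hybrid_query("invoice payment terms", [0.1, 0.2, 0.3])
reranked = build_hybrid_query("invoice payment terms", [0.1, 0.2, 0.3], semantic_configuration="default")
print(sorted(reranked))
```

Each result of the plain hybrid query carries `@search.score` (the fused RRF score); when semantic ranking is enabled, results also carry `@search.rerankerScore`.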
Study index field attributes โ specifically which attributes are required for vector fields (Collection(Edm.Single), vectorSearchProfile). Understand that vector fields cannot be filterable or facetable. Design an index schema that supports both BM25 and vector search on the same content field (contentVector for vector, content for text).
Integrated vectorization is a developer productivity feature: understand what it eliminates (external preprocessing, separate embedding calls) and what it requires (vectorizer configuration in the index, skillset with Text Split + Azure OpenAI skill). Practice reading the indexer/skillset JSON definitions in the Azure portal.
Custom skills bridge Azure AI Search to any external logic. Memorize the request/response shape: values array, recordId, data input/output, errors, warnings. Know that the skill endpoint must respond within the configured timeout (30 seconds by default, configurable up to a maximum of 230 seconds) or the indexer records an error for that document. This is frequently tested in scenario questions.
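The request/response contract can be sketched as a pure function that you would wrap in an Azure Function or any web API. The `run_custom_skill` and `word_count` helpers are hypothetical; the envelope (a `values` array where every record echoes its `recordId` and returns `data`, `errors`, and `warnings`) follows the shape described above.

```python
from typing import Callable, Dict


def run_custom_skill(request_body: Dict, transform: Callable[[Dict], Dict]) -> Dict:
    """Apply `transform` to each input record and return the custom-skill response envelope."""
    results = []
    for record in request_body.get("values", []):
        result = {"recordId": record["recordId"], "data": {}, "errors": [], "warnings": []}
        try:
            result["data"] = transform(record.get("data", {}))
        except Exception as exc:
            # Report the failure on this record only, so the rest of the batch still succeeds.
            result["errors"].append({"message": f"Could not process record: {exc}"})
        results.append(result)
    return {"values": results}


def word_count(data: Dict) -> Dict:
    """Example enrichment: count the words in the incoming text field."""
    return {"wordCount": len(data["text"].split())}


request = {
    "values": [
        {"recordId": "1", "data": {"text": "hello azure search"}},
        {"recordId": "2", "data": {}},  # missing "text" triggers a per-record error
    ]
}
response = run_custom_skill(request, word_count)
print(response)
```

Returning one result per incoming `recordId`, even on failure, is what lets the indexer match outputs back to documents.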
You know ETL pipelines: an indexer is Azure AI Search's ETL. The data source is the extract; the skillset is the transform; the index write is the load. Study the indexer scheduling (every 5 minutes minimum), change detection (high water mark, soft-delete detection), and error handling (maxFailedItems). These are the same operational concerns you handle in Databricks or ADF pipelines.
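Those operational settings map onto a handful of properties in the indexer definition. The sketch below builds one as a dictionary shaped like the REST body; the resource names are placeholders, and the interval uses an ISO 8601 duration where `PT5M` is the 5-minute minimum mentioned above.

```python
def build_indexer(
    name: str,
    data_source: str,
    target_index: str,
    skillset: str,
    interval: str = "PT5M",
    max_failed_items: int = 0,
) -> dict:
    """Build an indexer body: extract (data source), transform (skillset), load (index)."""
    return {
        "name": name,
        "dataSourceName": data_source,     # extract
        "skillsetName": skillset,          # transform
        "targetIndexName": target_index,   # load
        "schedule": {"interval": interval},
        "parameters": {
            # How many documents may fail before the whole run is marked as failed.
            "maxFailedItems": max_failed_items,
        },
    }


indexer = build_indexer(
    "docs-indexer", "blob-datasource", "docs-index", "chunk-and-embed", max_failed_items=10
)
print(indexer["schedule"], indexer["parameters"])
```

Change detection and soft-delete policies are configured on the data source rather than the indexer, so they do not appear in this body.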
As a data engineer, Knowledge Store is your domain. Understand table vs object vs file projections. A table projection maps skillset output fields to Azure Table Storage rows; think of it like writing enriched data to a structured sink. Object projections write full JSON enrichment trees to Blob Storage. Practice defining projection groups in the skillset JSON definition.
Apply chunking strategy decisions to data engineering scenarios: batch processing of PDFs (fixed-size for speed), streaming document ingestion (semantic for quality), large knowledge bases (hierarchical for both). Understand that overlap increases storage and embedding cost, which is the trade-off for context preservation at chunk boundaries.
Study Document Intelligence from the pipeline perspective: it is the document cracking step that converts unstructured PDFs into structured data your pipeline can process. Focus on the composed model pattern (one API endpoint routing to multiple custom models), which is the equivalent of a dispatch pattern in data engineering.
Understand RRF mathematically: 1/(k+rank) per result list, summed. This is important for explaining why a document might rank higher in hybrid search than in either BM25 or vector-only search: it appeared in both lists, even if not at the very top of either. Know that @search.score (BM25/hybrid) and @search.rerankerScore (Semantic Ranker) are separate fields in the API response.
You understand embeddings, so focus on how chunking strategy affects retrieval quality metrics. Smaller chunks improve retrieval precision (the right chunk comes back) but reduce answer quality (not enough context). Larger chunks improve answer quality but reduce precision. Hierarchical chunking solves both but doubles storage. Map each strategy to groundedness and relevance metrics in the Azure AI Foundry evaluation framework.
You know approximate nearest neighbor algorithms, so apply that knowledge to Azure AI Search's HNSW configuration. The exam may ask: which parameter controls build-time index quality (efConstruction), which controls query-time recall (efSearch), and which controls graph connectivity (m). Know that cosine distance is recommended for OpenAI embeddings because they are L2-normalized.
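The note about L2-normalized embeddings is easy to check numerically: for unit-length vectors, cosine similarity and the dot product are the same number. The sketch below uses tiny made-up vectors rather than real embeddings.

```python
import math
from typing import List


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = l2_normalize([3.0, 4.0])   # becomes [0.6, 0.8]
b = l2_normalize([1.0, 0.0])   # already unit length
print(cosine_similarity(a, b), dot(a, b))
```

Because both lengths are 1, the denominator drops out, so ranking by cosine and ranking by dot product give the same order for normalized embeddings.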
Understand the trade-off: Semantic Ranker is a zero-shot query-time reranker (no training, add-on billing). Fine-tuned embeddings (Custom Neural via Azure OpenAI fine-tuning) produce domain-specific vectors but require training data and are baked into the index. For exam scenarios, Semantic Ranker is the recommended first step for relevance improvement before investing in fine-tuning.
Azure AI Content Understanding is newer and less documented. Focus on the analyzer type decision: document analyzer for PDFs/Word, image analyzer for visual content, audio/video analyzers for meeting recordings. The key exam scenario: choosing Content Understanding over Document Intelligence for video meeting summarization or audio transcript extraction with speaker diarization.
Close the loop: after building the RAG pipeline, evaluate it with Groundedness, Relevance, Fluency, and Coherence metrics in Azure AI Foundry. Understand that poor Groundedness points to chunking or retrieval issues (wrong chunks returned), while poor Coherence points to generation issues. Use this causal mapping for exam diagnostic scenarios.
Verified links for AI-103 Domain 5 preparation
Document Intelligence (learn.microsoft.com): Full reference for all prebuilt models, custom model training, composed models, REST API, and SDK samples for Python, C#, Java, and JavaScript.
Azure AI Search (learn.microsoft.com): Complete reference for index design, indexers, skillsets, hybrid search, vector search, Semantic Ranker, integrated vectorization, and Knowledge Store.
Integrated Vectorization (learn.microsoft.com): Step-by-step tutorial to configure integrated vectorization with Azure OpenAI embedding models and the Azure AI Search indexer pipeline.
RAG Overview (learn.microsoft.com): Microsoft's recommended architecture for RAG solutions using Azure AI Search, Azure OpenAI, and Azure AI Foundry, including chunking strategy guidance.
Content Understanding (learn.microsoft.com): Documentation for the Content Understanding service including document, image, audio, and video analyzers, and how to create custom analyzer schemas.
AI-103 Study Guide (learn.microsoft.com): Microsoft's official exam study guide with skills measured breakdown, recommended learning paths, and sample questions for all 5 AI-103 domains.
AI-103 Certification (learn.microsoft.com): Official Microsoft certification page for Azure AI Apps and Agents Developer Associate. Register for the exam, access sandbox environments, and view exam details.