About This Page
This FlashGenius study page covers Domain 3 — Implement Computer Vision Solutions, worth 10–15% of the AI-103 exam. Use all seven tabs to master concepts, burn in memory hooks, test yourself, and find the right resources.
Exam Domain Weights
| Domain | Topic | Weight |
|---|---|---|
| Domain 1 | Plan and Manage Azure AI Solutions | 15–20% |
| Domain 2 | Implement Generative AI Solutions | 25–30% |
| Domain 3 | Implement Computer Vision Solutions | 10–15% |
| Domain 4 | Implement Natural Language Processing Solutions | 20–25% |
| Domain 5 | Implement Agentic AI Solutions | 20–25% |
Key Services in This Domain
- Azure AI Vision: Image Analysis 4.0, OCR, spatial analysis
- DALL-E 3: text-to-image generation via Azure OpenAI
- GPT-4o Vision: multimodal understanding and reasoning
- Content Understanding: unified multimodal extraction service
- Azure Face API: detection, verification, liveness
- Custom Vision: custom classifiers and object detectors
- Video Indexer: shot detection, transcripts, OCR in video
- Content Safety: image moderation, severity scoring
Azure AI Vision Service
Image Analysis 4.0 API
- Dense Captions — up to 10 region-level captions + one overall caption
- Smart Crops — area of interest detection for thumbnail generation
- Object Detection — returns bounding boxes with labels and confidence scores
- Tag Generation — flat list of descriptive tags with confidence
- Read API (OCR) — text recognition from images and documents; returns lines and words with bounding polygons
- People Detection: locates people in the image and returns bounding boxes with confidence scores; face detection and recognition are handled by the Face API (recognition is limited access)
- Background Removal / Segmentation — foreground extraction and semantic segmentation
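The feature list above maps onto a single analyze request in which the caller picks the features it wants. A minimal sketch of building that request URL follows. The feature names and the `api-version` value are assumptions taken from the public REST reference and should be checked against current documentation; `build_analyze_url` and the example endpoint are hypothetical.

```python
from urllib.parse import urlencode

# Feature names assumed from the Image Analysis 4.0 REST reference.
KNOWN_FEATURES = {
    "caption", "denseCaptions", "smartCrops",
    "objects", "tags", "read", "people",
}


def build_analyze_url(endpoint, features, api_version="2024-02-01"):
    """Build the URL for one analyze call that requests several features."""
    unknown = [f for f in features if f not in KNOWN_FEATURES]
    if unknown:
        raise ValueError(f"Unknown features: {unknown}")
    # safe="," keeps the comma-separated feature list readable in the URL.
    query = urlencode(
        {"api-version": api_version, "features": ",".join(features)},
        safe=",",
    )
    return f"{endpoint.rstrip('/')}/computervision/imageanalysis:analyze?{query}"


url = build_analyze_url(
    "https://example.cognitiveservices.azure.com/",  # hypothetical resource
    ["caption", "read"],
)
print(url)
```

The image itself would be sent in the POST body, as binary data or as a JSON object containing a URL; that part is omitted here.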
Spatial Analysis
- People counting, zone crossing, social distancing monitoring
- Deployed via Docker container on edge or Azure
- May still appear on the exam even though the feature has been deprecated
Video Indexer
- Shot detection — automatic chapter/scene segmentation
- Speaker diarization — who spoke when
- Transcript generation — speech-to-text across multiple languages
- OCR in video — reads on-screen text frame by frame
- Scene understanding — labels, brands, celebrities
Custom Vision
- Classification vs Object Detection: classification gives a label to the whole image; object detection returns bounding boxes
- Training datasets: need labeled images per class or annotated bounding boxes
- Evaluation metrics: Precision, Recall, Average Precision (AP), Mean Average Precision (mAP)
- Iterative training: add images, retrain, evaluate, repeat
Multimodal AI with GPT-4o Vision
Passing Images in Messages
{
  "role": "user",
  "content": [
    {"type": "text", "text": "Describe this chart."},
    {"type": "image_url",
     "image_url": {"url": "https://...jpg"}}
  ]
}
- Supported formats: JPEG, PNG, GIF, WEBP
- Two input methods: image URL or base64-encoded data URI
- An image URL must be reachable by the service, which fetches it at request time; base64 keeps the image bytes inside the request
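The base64 input method can be sketched as follows. This is an illustrative helper rather than an official SDK function: `image_message` is a hypothetical name, and the MIME-type check simply mirrors the four formats listed above.

```python
import base64

# MIME types for the formats listed above (JPEG, PNG, GIF, WEBP).
SUPPORTED_MIME_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}


def image_message(prompt, image_bytes, mime_type):
    """Build a user message that carries an image as a base64 data URI."""
    if mime_type not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"Unsupported image type: {mime_type}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
        ],
    }


# Placeholder bytes stand in for a real image file read with open(path, "rb").
message = image_message("Describe this chart.", b"abc", "image/png")
print(message["content"][1]["image_url"]["url"])
```

The returned dictionary has the same shape as the URL-based message, so it can be appended to the messages list in the same way.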
Vision Use Cases
- Document understanding (forms, invoices, reports)
- Chart reading and trend explanation
- Product inspection and defect detection
- Scene description and VQA (Visual Question Answering)
- Multimodal agent: vision + text reasoning + function calling in one loop
Prompt Engineering for Vision
- Provide detailed instructions specifying what to look for
- Request specific output formats (JSON, bullet list, table)
- Combine with system messages for role context
Image & Video Generation
DALL-E 3 via Azure OpenAI
- API call: client.images.generate(...)
- prompt — text description of the desired image
- size — 1024x1024 (square), 1792x1024 (landscape), 1024x1792 (portrait)
- quality — standard (faster) or hd (higher detail)
- style — vivid (dramatic) or natural (photorealistic)
- n — number of images (DALL-E 3 supports n=1 only)
- Revised prompts: DALL-E 3 rewrites the user prompt for safety — the revised version is returned in the response
- Content policy filtering rejects unsafe prompts before generation
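The parameter rules above can be checked before a request is sent. The sketch below validates the arguments and reads the revised prompt from a response. `dalle3_kwargs` and `revised_prompt` are hypothetical helpers, and the deployment name is a placeholder; the commented usage assumes the `openai` Python package and is not executed here.

```python
# Allowed values as listed in the parameter summary above.
SIZES = {"1024x1024", "1792x1024", "1024x1792"}
QUALITIES = {"standard", "hd"}
STYLES = {"vivid", "natural"}


def dalle3_kwargs(prompt, deployment, size="1024x1024",
                  quality="standard", style="vivid"):
    """Validate DALL-E 3 parameters and return keyword arguments."""
    if size not in SIZES:
        raise ValueError(f"size must be one of {sorted(SIZES)}")
    if quality not in QUALITIES:
        raise ValueError(f"quality must be one of {sorted(QUALITIES)}")
    if style not in STYLES:
        raise ValueError(f"style must be one of {sorted(STYLES)}")
    return {
        "model": deployment,  # Azure OpenAI deployment name (placeholder)
        "prompt": prompt,
        "size": size,
        "quality": quality,
        "style": style,
        "n": 1,  # DALL-E 3 supports a single image per request
    }


def revised_prompt(response):
    """Return the prompt DALL-E 3 actually used for the first image."""
    return response.data[0].revised_prompt


# Usage sketch (assumes the openai package and a configured client):
#   from openai import AzureOpenAI
#   client = AzureOpenAI(azure_endpoint=..., api_key=..., api_version=...)
#   response = client.images.generate(**dalle3_kwargs("a red bicycle", "my-dalle3"))
#   print(revised_prompt(response))
```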
Model Catalog & Other Generators
- Azure AI Foundry catalog includes Stable Diffusion and other open-weight image models
- DALL-E 2 supports image editing (inpainting) and image variations — DALL-E 3 does not
Azure AI Content Understanding
- New unified service in Azure AI Foundry that consolidates document, image, audio, and video extraction
- Analyzers: pre-built analyzers for common schemas; custom analyzers for bespoke field extraction
- Replaces some Form Recognizer (now Azure AI Document Intelligence) and Vision features with a single unified API
- Integrates with Azure AI Search for multimodal indexing and retrieval
- Field extraction from unstructured content (invoices, receipts, images, audio recordings, video clips)
- Configured and deployed inside an Azure AI Foundry project
Content Safety for Vision
- Image content moderation categories: Hate, Violence, Sexual, Self-Harm
- Severity scale 0–7 per category (0 = safe, 7 = most severe); image analysis currently returns only the trimmed levels 0, 2, 4, and 6
- Custom blocklists, protected material detection, and groundedness detection are Content Safety features for text (protected material detection also covers code); they do not operate on image input
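A moderation decision usually comes down to comparing each category's severity with a platform threshold. The sketch below shows that comparison only. The dictionary shape and the names `flagged_categories` and `should_reject` are hypothetical and do not reflect the service's response schema; the default threshold of 4 is an arbitrary choice for illustration.

```python
# Category labels as used in the list above.
CATEGORIES = ("Hate", "Violence", "Sexual", "Self-Harm")


def flagged_categories(severities, threshold=4):
    """Return the categories whose severity meets or exceeds the threshold."""
    return [
        category
        for category in CATEGORIES
        if severities.get(category, 0) >= threshold
    ]


def should_reject(severities, threshold=4):
    """Reject the image if any category reaches the threshold."""
    return bool(flagged_categories(severities, threshold))


# Hypothetical severities for one analyzed image.
result = {"Hate": 0, "Violence": 4, "Sexual": 0, "Self-Harm": 2}
print(flagged_categories(result))             # ['Violence']
print(should_reject(result))                  # True
print(should_reject(result, threshold=6))     # False
```

A stricter platform would lower the threshold, for example to 2, which would also flag the Self-Harm score in this example.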
Azure Face API
- Face Detection: locate face bounding boxes; attributes such as head pose, glasses, and blur (no approval required)
- Face Verification: compare two faces; returns an isIdentical flag and a confidence score (limited access required)
- Face Identification: match a face against a PersonGroup (limited access required, which means Microsoft approval)
- Liveness Detection: anti-spoofing that determines whether the subject is a real person rather than a photo (limited access required)
- PersonGroup / FaceList: storage structures for known identities
- Emotion, age, and gender attributes were retired from the Face API
Custom Vision: Detail
- Multi-class classification: one label per image (mutually exclusive)
- Multi-label classification: multiple labels per image allowed
- Object detection: bounding box annotation required per object
- Quick Training vs Advanced Training: Advanced Training lets you set a compute time budget (in hours) and can yield better accuracy
- Export formats: TensorFlow, CoreML, ONNX, Docker container
- Precision: of all predicted positives, what fraction is correct?
- Recall: of all actual positives, what fraction was found?
- mAP (mean Average Precision): average AP across all classes
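The three metric definitions above reduce to simple arithmetic. The following sketch computes them from raw counts. The function names and sample numbers are illustrative, and the per-class AP values are assumed to be already computed, since AP itself depends on the full precision-recall curve.

```python
def precision(true_positives, false_positives):
    """Of all predicted positives, what fraction is correct?"""
    predicted = true_positives + false_positives
    return true_positives / predicted if predicted else 0.0


def recall(true_positives, false_negatives):
    """Of all actual positives, what fraction was found?"""
    actual = true_positives + false_negatives
    return true_positives / actual if actual else 0.0


def mean_average_precision(ap_by_class):
    """Average the per-class AP values (AP is assumed precomputed)."""
    if not ap_by_class:
        return 0.0
    return sum(ap_by_class.values()) / len(ap_by_class)


# A conservative detector: 8 correct detections, 2 false alarms, 8 misses.
print(precision(8, 2))   # 0.8
print(recall(8, 8))      # 0.5
print(mean_average_precision({"cat": 1.0, "dog": 0.5}))  # 0.75
```

The example shows the high-precision, low-recall pattern: most predictions are right, but half of the real objects are missed.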
Decision Guide: Which Vision Service?
| Need | Use This Service |
|---|---|
| Generic image captioning / tagging | Azure AI Vision (Image Analysis 4.0) |
| Read text from images or documents | Vision Read API (OCR) |
| Generate images from text prompts | DALL-E 3 via Azure OpenAI |
| Understand image + reason over it | GPT-4o Vision |
| Custom object detection / classification | Custom Vision |
| Video analysis at scale | Video Indexer |
| Multimodal document field extraction | Content Understanding |
| Moderate image content for safety | Content Safety |
| Face identification / verification | Face API (limited access) |
Memory Hooks for Domain 3
Short mnemonics and mental models to lock in key facts for exam day.
"C.A.T.S" for Image Analysis 4.0
Captions (dense captions & smart crops) • Analysis (object detection & tag generation) • Text (Read API / OCR) • Segmentation (background removal)
DALL-E 3 Parameters: "PSQS"
Remember the four parameters in order: Prompt → Size → Quality → Style. Sizes: square (1024×1024), landscape (1792×1024), portrait (1024×1792). Quality: standard vs hd. Style: vivid vs natural.
📷 GPT-4o Vision Input: "URL or Base64 — no other way in"
Images reach GPT-4o Vision only as an image URL or a base64-encoded data URI inside the messages array. Supported formats: JPEG, PNG, GIF, WEBP.
Custom Vision Decision Tree
Do you need custom categories not in the standard model? Yes → Custom Vision. No → Azure AI Vision (standard). Does the customer need bounding boxes? Yes → Custom Vision Object Detection. No → Custom Vision Classification.
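The two questions in this decision tree can be written as a small function. This is only a study aid encoding the rule as stated here; `choose_vision_service` is a hypothetical name, not part of any SDK.

```python
def choose_vision_service(needs_custom_categories, needs_bounding_boxes):
    """Map the two decision-tree questions to a service choice."""
    if not needs_custom_categories:
        # The standard model already covers the required categories.
        return "Azure AI Vision"
    if needs_bounding_boxes:
        return "Custom Vision Object Detection"
    return "Custom Vision Classification"


print(choose_vision_service(False, False))  # Azure AI Vision
print(choose_vision_service(True, True))    # Custom Vision Object Detection
print(choose_vision_service(True, False))   # Custom Vision Classification
```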
Face API: "Face ID = Need a Form"
Face Detection (bounding box only) = open to all customers, no approval. Face Identification and Verification = Limited Access, so you must submit a Microsoft approval form. Remember: "if you want to know who, you need approval."
⚖️ Precision vs Recall
Precision = "When you say YES, are you right?" (true positives / all predicted positives). Recall = "Of all the actual YESes, did you find them all?" (true positives / all actual positives). High precision + low recall = conservative detector that misses many objects.
🛠️ Content Understanding: "Swiss Army Knife"
One service for documents + images + audio + video. Think of it as the replacement and unification of Form Recognizer + parts of Vision into a single Foundry-native service.
🎥 Video Indexer: "YouTube Chapters for Enterprise"
Shot detection = automatic chapter markers. Speaker diarization = who said what. Transcript = speech-to-text. OCR in video = reads text on screen. Scene understanding = labels, brands, celebrities.
🛡️ Content Safety Severity: 0–7
Categories: Hate | Violence | Sexual | Self-Harm. Each is rated on a 0–7 scale (0 = safe, 7 = most severe); image analysis returns only the levels 0, 2, 4, and 6. Typical production threshold: reject severity ≥ 2 or ≥ 4, depending on platform strictness.
🚀 DALL-E 3 "Revised Prompt" Gotcha
DALL-E 3 rewrites your prompt before generating. The response object includes a revised_prompt field showing what was actually used. This is for safety filtering and prompt enhancement — know this for scenario questions!
Domain 3 Knowledge Check
10 scenario-based questions covering all key vision topics.
Domain 3 Flashcards
20 cards.
Study Advisor
Tailored study paths for Domain 3 based on your background.
💻 Coming From General Development (No Prior Azure AI)
1. Start with GPT-4o Vision: most flexible and conceptually familiar if you know the Chat Completions API. Learn the messages array image format, URL vs base64, and VQA patterns. (Topic: GPT-4o Vision)
2. Add DALL-E 3: simple API, key parameters (PSQS), and the revised prompt behavior. One lab is enough. (Topic: DALL-E 3)
3. Learn Azure AI Vision 4.0: focus on the decision guide, that is, which feature to use for captioning vs OCR vs segmentation. (Topic: Azure AI Vision)
4. Survey the specialized services: Custom Vision (precision/recall), Face API (limited access rules), Content Safety (categories + severity), Video Indexer (capabilities). No deep implementation needed for the exam. (Topics: Custom Vision, Face API, Content Safety)
🏃 Coming From AI-102 (Azure AI Engineer)
1. Focus on what's new in AI-103: GPT-4o Vision (multimodal reasoning) and Content Understanding (replaces Form Recognizer patterns). These are the biggest deltas. (Topics: GPT-4o Vision, Content Understanding)
2. Review DALL-E 3 parameters, especially size options and the revised prompt behavior (AI-102 focused more on DALL-E 2). (Topic: DALL-E 3)
3. Validate your Azure AI Vision 4.0 knowledge: the C.A.T.S. features are mostly familiar, but confirm you know the new dense captions and smart crop APIs. (Topic: Image Analysis 4.0)
4. Face API limited access rules: know exactly which features (identification, verification, liveness) require Microsoft approval. This is a common distractor in scenario questions. (Topic: Face API)
📜 Building a Real Product (Practical Focus)
1. Custom Vision + Content Safety: core for any user-generated content platform. Know the Custom Vision training pipeline, evaluation metrics, and export formats for edge deployment. (Topics: Custom Vision, Content Safety)
2. Content Understanding: if building document processing pipelines, understand the analyzer pattern and how it integrates with Azure AI Search. (Topic: Content Understanding)
3. GPT-4o Vision for reasoning tasks: use it when you need chart analysis, product inspection reasoning, or open-ended VQA rather than structured extraction. (Topic: GPT-4o Vision)
4. Decision guide mastery: on the exam, scenario questions require you to instantly map a business need to the right service. Run through the decision table until it's automatic. (Topic: Decision Guide)
⌛ Short on Time (48-Hour Cram)
1. Memorize the Decision Guide table: 9 rows, 2 columns. This alone answers 3–4 exam questions. (Topic: Decision Guide)
2. Lock in DALL-E 3 parameters (PSQS) and GPT-4o image input formats (URL or base64, JPEG/PNG/GIF/WEBP). (Topics: DALL-E 3, GPT-4o Vision)
3. Know Face API limited access rules and Custom Vision precision vs recall. These are high-frequency exam topics. (Topics: Face API, Custom Vision)
4. Run through all 10 quiz questions and all 20 flashcards at least twice. Review explanations for any you miss. (Topics: Quiz, Flashcards)
Official Study Resources
All links go to official Microsoft Learn documentation.
- AI-103 Official Study Guide (learn.microsoft.com): full exam skills outline
- AI-103 Certification Page (learn.microsoft.com): exam registration and overview
- Azure AI Vision Documentation (learn.microsoft.com): Image Analysis 4.0, OCR, spatial analysis
- DALL-E via Azure OpenAI (learn.microsoft.com): parameters, safety, API reference
- GPT-4o Vision How-To (learn.microsoft.com): image input formats, use cases
- Azure AI Content Understanding (learn.microsoft.com): analyzers, multimodal extraction