FlashGenius
Microsoft Azure

AI-103: Computer Vision Solutions

Domain 3 of 5 — Azure AI Apps & Agents Developer Associate

📋 40–60 Questions ⌛ 100 Minutes 🎯 Pass: 700/1000 🔍 Domain Weight: 10–15%

About This Page

This FlashGenius study page covers Domain 3 — Implement Computer Vision Solutions, worth 10–15% of the AI-103 exam. Use all seven tabs to master concepts, burn in memory hooks, test yourself, and find the right resources.

Exam Domain Weights

Domain | Topic | Weight
Domain 1 | Plan and Manage Azure AI Solutions | 15–20%
Domain 2 | Implement Generative AI Solutions | 25–30%
Domain 3 | Implement Computer Vision Solutions | 10–15%
Domain 4 | Implement Natural Language Processing Solutions | 20–25%
Domain 5 | Implement Agentic AI Solutions | 20–25%

Key Services in This Domain

  • 👁 Azure AI Vision: Image Analysis 4.0, OCR, spatial analysis
  • 🎨 DALL-E 3: Text-to-image generation via Azure OpenAI
  • 📷 GPT-4o Vision: Multimodal understanding & reasoning
  • 📄 Content Understanding: Unified multimodal extraction service
  • 👥 Azure Face API: Detection, verification, liveness
  • 🎉 Custom Vision: Custom classifiers & object detectors
  • 🎥 Video Indexer: Shot detection, transcripts, OCR in video
  • 🛡 Content Safety: Image moderation, severity scoring

Azure AI Vision Service

Image Analysis 4.0 API

  • Dense Captions — up to 10 region-level captions + one overall caption
  • Smart Crops — area of interest detection for thumbnail generation
  • Object Detection — returns bounding boxes with labels and confidence scores
  • Tag Generation — flat list of descriptive tags with confidence
  • Read API (OCR) — text recognition from images and documents; returns lines and words with bounding polygons
  • Face Detection — detects face bounding boxes only; recognition requires Face API (limited access)
  • Background Removal / Segmentation — foreground extraction and semantic segmentation
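To see how the features above are selected in one request, here is a minimal sketch that only builds the REST URL for an Image Analysis 4.0 call and sends nothing over the network. The endpoint host is a placeholder, and the `api-version` value and feature spellings are assumptions from memory that should be checked against the current REST reference.

```python
from urllib.parse import urlencode

# Feature names as passed in the query string (assumed spellings; verify in the docs).
KNOWN_FEATURES = {"caption", "denseCaptions", "smartCrops", "objects", "tags", "read", "people"}


def build_analyze_url(endpoint: str, features: list[str], api_version: str = "2024-02-01") -> str:
    """Return the analyze URL that requests the given visual features."""
    unknown = [f for f in features if f not in KNOWN_FEATURES]
    if unknown:
        raise ValueError(f"Unsupported feature(s): {unknown}")
    # Keep commas unescaped so the feature list stays readable in the URL.
    query = urlencode({"api-version": api_version, "features": ",".join(features)}, safe=",")
    return f"{endpoint.rstrip('/')}/computervision/imageanalysis:analyze?{query}"


# Placeholder resource name; replace with your own Azure AI Vision endpoint.
url = build_analyze_url("https://example.cognitiveservices.azure.com", ["denseCaptions", "read"])
print(url)
```

Requesting several features in one call is the usual pattern: one image upload, one response containing a section per feature.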

Spatial Analysis

  • People counting, zone crossing, social distancing monitoring
  • Deployed via Docker container on edge or Azure
  • May still appear on the exam even though the capability is deprecated in newer versions of the service

Video Indexer

  • Shot detection — automatic chapter/scene segmentation
  • Speaker diarization — who spoke when
  • Transcript generation — speech-to-text across multiple languages
  • OCR in video — reads on-screen text frame by frame
  • Scene understanding — labels, brands, celebrities

Custom Vision

  • Classification vs Object Detection: classification gives a label to the whole image; object detection returns bounding boxes
  • Training datasets: need labeled images per class or annotated bounding boxes
  • Evaluation metrics: Precision, Recall, Average Precision (AP), Mean Average Precision (mAP)
  • Iterative training: add images, retrain, evaluate, repeat

Multimodal AI with GPT-4o Vision

Passing Images in Messages

{
  "role": "user",
  "content": [
    {"type": "text", "text": "Describe this chart."},
    {"type": "image_url",
     "image_url": {"url": "https://...jpg"}}
  ]
}
  • Supported formats: JPEG, PNG, GIF, WEBP
  • Two input methods: image URL or base64-encoded data URI
  • An image URL must be reachable by the service, which fetches the image itself; base64 keeps the image data inside the request body
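The base64 input method can be sketched as follows. This is a minimal illustration using only the standard library; the helper names `image_part_from_bytes` and `user_message` are hypothetical, and the four bytes used as "image data" are just the start of a PNG signature, not a real image.

```python
import base64

ALLOWED_MIME_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}


def image_part_from_bytes(data: bytes, mime_type: str = "image/jpeg") -> dict:
    """Wrap raw image bytes as an image_url content part using a data URI."""
    if mime_type not in ALLOWED_MIME_TYPES:
        raise ValueError(f"Unsupported image type: {mime_type}")
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}}


def user_message(text: str, image_part: dict) -> dict:
    """Combine a text part and an image part into one user message."""
    return {"role": "user", "content": [{"type": "text", "text": text}, image_part]}


# Stand-in bytes for illustration; in practice read them from a file opened in "rb" mode.
message = user_message("Describe this chart.", image_part_from_bytes(b"\x89PNG", "image/png"))
print(message["content"][1]["image_url"]["url"])
```

The resulting dictionary has the same shape as the URL-based message shown earlier; only the value of `url` changes.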

Vision Use Cases

  • Document understanding (forms, invoices, reports)
  • Chart reading and trend explanation
  • Product inspection and defect detection
  • Scene description and VQA (Visual Question Answering)
  • Multimodal agent: vision + text reasoning + function calling in one loop

Prompt Engineering for Vision

  • Provide detailed instructions specifying what to look for
  • Request specific output formats (JSON, bullet list, table)
  • Combine with system messages for role context
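The three prompt tips above can be combined in one small helper. This is a sketch under stated assumptions: `build_vision_messages` is a hypothetical name, the image URL is a placeholder, and the exact wording of the format instruction is only one way to phrase it.

```python
def build_vision_messages(role_context: str, instructions: str, image_url: str,
                          output_format: str = "JSON") -> list[dict]:
    """Build a system + user message pair for an image question."""
    # State what to look for, then pin down the response format explicitly.
    prompt = f"{instructions}\nRespond only as {output_format}."
    return [
        {"role": "system", "content": role_context},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        },
    ]


messages = build_vision_messages(
    role_context="You are a quality inspector for printed circuit boards.",
    instructions="List every visible defect with its approximate location.",
    image_url="https://example.com/board.jpg",  # placeholder URL
)
print(messages[1]["content"][0]["text"])
```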

Image & Video Generation

DALL-E 3 via Azure OpenAI

  • API call: client.images.generate(...)
  • prompt: text description of the desired image
  • size: 1024x1024 (square), 1792x1024 (landscape), 1024x1792 (portrait)
  • quality: standard (faster) or hd (higher detail)
  • style: vivid (dramatic, hyper-real) or natural (more subdued, less hyper-real)
  • n: number of images (DALL-E 3 supports n=1 only)
  • Revised prompts: DALL-E 3 rewrites the user prompt for safety — the revised version is returned in the response
  • Content policy filtering rejects unsafe prompts before generation
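The parameter rules above can be checked before a request is sent. The sketch below only validates and assembles keyword arguments; it does not call any service. The function name `dalle3_request` is hypothetical, and the `model` value is an assumption (in Azure OpenAI it would be the name of your own deployment).

```python
ALLOWED_SIZES = {"1024x1024", "1792x1024", "1024x1792"}
ALLOWED_QUALITIES = {"standard", "hd"}
ALLOWED_STYLES = {"vivid", "natural"}


def dalle3_request(prompt: str, size: str = "1024x1024", quality: str = "standard",
                   style: str = "vivid", n: int = 1, model: str = "dall-e-3") -> dict:
    """Validate DALL-E 3 options and return them as keyword arguments."""
    if size not in ALLOWED_SIZES:
        raise ValueError(f"size must be one of {sorted(ALLOWED_SIZES)}")
    if quality not in ALLOWED_QUALITIES:
        raise ValueError(f"quality must be one of {sorted(ALLOWED_QUALITIES)}")
    if style not in ALLOWED_STYLES:
        raise ValueError(f"style must be one of {sorted(ALLOWED_STYLES)}")
    if n != 1:
        raise ValueError("DALL-E 3 supports n=1 only")
    return {"model": model, "prompt": prompt, "size": size,
            "quality": quality, "style": style, "n": n}


kwargs = dalle3_request("A lighthouse at dawn, watercolor", size="1792x1024", quality="hd")
print(kwargs)
```

With a configured client, these arguments would typically be passed as `client.images.generate(**kwargs)`, and the rewritten prompt read from the `revised_prompt` field of the returned image entry.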

Model Catalog & Other Generators

  • Azure AI Foundry catalog includes Stable Diffusion and other open-weight image models
  • DALL-E 2 supports image editing (inpainting) and image variations — DALL-E 3 does not

Azure AI Content Understanding

  • New unified service in Azure AI Foundry that consolidates document, image, audio, and video extraction
  • Analyzers: pre-built analyzers for common schemas; custom analyzers for bespoke field extraction
  • Replaces some Form Recognizer (now Azure AI Document Intelligence) and Vision features with a single unified API
  • Integrates with Azure AI Search for multimodal indexing and retrieval
  • Field extraction from unstructured content (invoices, receipts, images, audio recordings, video clips)
  • Configured and deployed inside an Azure AI Foundry project

Content Safety for Vision

  • Image content moderation categories: Hate, Violence, Sexual, Self-Harm
  • Severity is scored per category on a 0–7 scale (0 = safe, 7 = most severe); image analysis reports only the trimmed levels 0, 2, 4, and 6
  • Custom blocklists for images (hash-based matching)
  • Protected material detection for images (copyright / IP)
  • Groundedness check for image-based claims
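A moderation decision based on per-category severity might look like the sketch below. It assumes the service response has already been reduced to a plain dictionary of category name to severity; the function names and the default threshold of 4 are illustrative choices, not service defaults.

```python
def blocked_categories(severities: dict[str, int], threshold: int = 4) -> list[str]:
    """Return the categories whose severity meets or exceeds the threshold."""
    return sorted(name for name, level in severities.items() if level >= threshold)


def should_block(severities: dict[str, int], threshold: int = 4) -> bool:
    """Block the image if any single category reaches the threshold."""
    return bool(blocked_categories(severities, threshold))


# Hypothetical severities for one uploaded image.
result = {"Hate": 0, "Violence": 4, "Sexual": 0, "Self-Harm": 2}
print(should_block(result), blocked_categories(result))
print(should_block(result, threshold=6))
```

A stricter platform would lower the threshold to 2, which in this example would also flag the Self-Harm score.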

Azure Face API

  • Face Detection: locate face bounding boxes; attributes (age estimate, head pose) — no approval required
  • Face Verification: compare two faces, returns similarity score — limited access required
  • Face Identification: match a face to a PersonGroup — limited access required (requires Microsoft approval)
  • Liveness Detection: anti-spoofing (determines if the subject is a real person, not a photo) — requires limited access
  • PersonGroup / FaceList: storage structures for known identities
  • Emotion detection removed in newer API versions

Custom Vision: Detail

  • Multi-class classification: one label per image (mutually exclusive)
  • Multi-label classification: multiple labels per image allowed
  • Object detection: bounding box annotation required per object
  • Quick Training vs Advanced Training: Quick Training finishes fast; Advanced Training lets you set a compute-time budget in exchange for better accuracy
  • Export formats: TensorFlow, CoreML, ONNX, Docker container
  • Precision: of all predicted positives, what fraction is correct?
  • Recall: of all actual positives, what fraction was found?
  • mAP (mean Average Precision): average AP across all classes
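The three metric definitions above reduce to simple arithmetic. The sketch below works from raw counts and per-class AP values that are made up for illustration; it does not reproduce how Custom Vision computes AP internally.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Of all predicted positives, the fraction that is correct."""
    predicted = true_positives + false_positives
    return true_positives / predicted if predicted else 0.0


def recall(true_positives: int, false_negatives: int) -> float:
    """Of all actual positives, the fraction that was found."""
    actual = true_positives + false_negatives
    return true_positives / actual if actual else 0.0


def mean_average_precision(ap_by_class: dict[str, float]) -> float:
    """Average the per-class AP values."""
    return sum(ap_by_class.values()) / len(ap_by_class) if ap_by_class else 0.0


# A conservative detector: 8 correct detections, 2 false alarms, 8 missed objects.
print(precision(8, 2))   # 0.8
print(recall(8, 8))      # 0.5
print(mean_average_precision({"scratch": 0.75, "dent": 0.5}))  # 0.625
```

The example shows the high-precision, low-recall case: most predictions are right, but half of the real objects are missed.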

Decision Guide: Which Vision Service?

Need | Use This Service
Generic image captioning / tagging | Azure AI Vision (Image Analysis 4.0)
Read text from images or documents | Vision Read API (OCR)
Generate images from text prompts | DALL-E 3 via Azure OpenAI
Understand image + reason over it | GPT-4o Vision
Custom object detection / classification | Custom Vision
Video analysis at scale | Video Indexer
Multimodal document field extraction | Content Understanding
Moderate image content for safety | Content Safety
Face identification / verification | Face API (limited access)
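For drilling, the decision guide can be expressed as a lookup table. This is only a study aid: the short keys such as `read_text` are hypothetical labels invented here, while the service names are taken from the guide.

```python
SERVICE_FOR_NEED = {
    "caption_or_tag": "Azure AI Vision (Image Analysis 4.0)",
    "read_text": "Vision Read API (OCR)",
    "generate_image": "DALL-E 3 via Azure OpenAI",
    "reason_over_image": "GPT-4o Vision",
    "custom_model": "Custom Vision",
    "video_analysis": "Video Indexer",
    "multimodal_field_extraction": "Content Understanding",
    "moderate_image": "Content Safety",
    "face_identity": "Face API (limited access)",
}


def recommend_service(need: str) -> str:
    """Map a short need label to the service named in the decision guide."""
    try:
        return SERVICE_FOR_NEED[need]
    except KeyError:
        raise ValueError(f"Unknown need: {need!r}") from None


print(recommend_service("read_text"))
```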

Memory Hooks for Domain 3

Short mnemonics and mental models to lock in key facts for exam day.

"C.A.T.S." for Image Analysis 4.0
Captions (dense captions & smart crops) • Analysis (object detection & tag generation) • Text (Read API / OCR) • Segmentation (background removal)
DALL-E 3 Parameters: "PSQS"
Remember the four parameters in order: Prompt → Size → Quality → Style. Sizes: square (1024×1024), landscape (1792×1024), portrait (1024×1792). Quality: standard vs hd. Style: vivid vs natural.
📷 GPT-4o Vision Input: "URL or Base64 — no other way in"
Images reach GPT-4o Vision only as an image URL or a base64-encoded data URI inside the messages array. Supported formats: JPEG, PNG, GIF, WEBP.
Custom Vision Decision Tree
Do you need custom categories not in the standard model? Yes → Custom Vision. No → Azure AI Vision (standard). Does the customer need bounding boxes? Yes → Custom Vision Object Detection. No → Custom Vision Classification.
Face API: "Face ID = Need a Form"
Face Detection (bounding box only) = open to all, no approval. Face Identification, Verification, and Liveness = Limited Access, which means submitting a Microsoft approval form. Remember: "if you want to know who, you need approval."
⚖️ Precision vs Recall
Precision = "When you say YES, are you right?" (true positives / all predicted positives). Recall = "Of all the actual YESes, did you find them all?" (true positives / all actual positives). High precision + low recall = conservative detector that misses many objects.
🛠️ Content Understanding: "Swiss Army Knife"
One service for documents + images + audio + video. Think of it as the replacement and unification of Form Recognizer + parts of Vision into a single Foundry-native service.
🎥 Video Indexer: "YouTube Chapters for Enterprise"
Shot detection = automatic chapter markers. Speaker diarization = who said what. Transcript = speech-to-text. OCR in video = reads text on screen. Scene understanding = labels, brands, celebrities.
🛡️ Content Safety Severity: 0–7
Categories: Hate | Violence | Sexual | Self-Harm. Each is rated on a 0–7 scale, where 0 = safe and 7 = most severe; image analysis reports only the trimmed levels 0, 2, 4, and 6. Typical production threshold: reject severity ≥ 2 or ≥ 4, depending on platform strictness.
🚀 DALL-E 3 "Revised Prompt" Gotcha
DALL-E 3 rewrites your prompt before generating. The response object includes a revised_prompt field showing what was actually used. This is for safety filtering and prompt enhancement — know this for scenario questions!

Domain 3 Knowledge Check

10 scenario-based questions covering all key vision topics.

Domain 3 Flashcards

20 cards.

Study Advisor

Tailored study paths for Domain 3 based on your background.

💻 Coming From General Development (No Prior Azure AI)

1. Start with GPT-4o Vision: most flexible and conceptually familiar if you know the Chat Completions API. Learn the messages array image format, URL vs base64, and VQA patterns.
2. Add DALL-E 3: simple API, key parameters (PSQS), and the revised prompt behavior. One lab is enough.
3. Learn Azure AI Vision 4.0: focus on the decision guide, that is, which feature to use for captioning vs OCR vs segmentation.
4. Survey the specialized services: Custom Vision (precision/recall), Face API (limited access rules), Content Safety (categories + severity), Video Indexer (capabilities). No deep implementation needed for the exam.

🏃 Coming From AI-102 (Azure AI Engineer)

1. Focus on what's new in AI-103: GPT-4o Vision (multimodal reasoning) and Content Understanding (replaces Form Recognizer patterns). These are the biggest deltas.
2. Review DALL-E 3 parameters: especially size options and the revised prompt behavior (AI-102 focused more on DALL-E 2).
3. Validate your Azure AI Vision 4.0 knowledge: the C.A.T.S. features are mostly familiar, but confirm you know the new dense captions and smart crop APIs.
4. Face API limited access rules: know exactly which features (identification, verification, liveness) require Microsoft approval. This is a common distractor in scenario questions.

📜 Building a Real Product (Practical Focus)

1. Custom Vision + Content Safety: core for any user-generated content platform. Know the Custom Vision training pipeline, evaluation metrics, and export formats for edge deployment.
2. Content Understanding: if building document processing pipelines, understand the analyzer pattern and how it integrates with Azure AI Search.
3. GPT-4o Vision for reasoning tasks: when you need chart analysis, product inspection reasoning, or open-ended VQA rather than structured extraction.
4. Decision guide mastery: on the exam, scenario questions require you to instantly map a business need to the right service. Run through the decision table until it's automatic.

⌛ Short on Time (48-Hour Cram)

1. Memorize the Decision Guide table: 9 rows, 2 columns. This alone answers 3–4 exam questions.
2. Lock in DALL-E 3 parameters (PSQS) and GPT-4o image input formats (URL or base64, JPEG/PNG/GIF/WEBP).
3. Know Face API limited access rules and Custom Vision precision vs recall. These are high-frequency exam topics.
4. Run through all 10 quiz questions and all 20 flashcards at least twice. Review explanations for any you miss.