AI-103: Computer Vision Solutions

About This Page

This FlashGenius study page covers Domain 3 — Implement Computer Vision Solutions, worth 10–15% of the AI-103 exam. Use all seven tabs to master concepts, burn in memory hooks, test yourself, and find the right resources.

Exam Domain Weights

Domain	Topic	Weight
Domain 1	Plan and Manage Azure AI Solutions	15–20%
Domain 2	Implement Generative AI Solutions	25–30%
Domain 3	Implement Computer Vision Solutions	10–15%
Domain 4	Implement Natural Language Processing Solutions	20–25%
Domain 5	Implement Agentic AI Solutions	20–25%

Key Services in This Domain

Azure AI Vision: Image Analysis 4.0, OCR, spatial analysis
DALL-E 3: Text-to-image generation via Azure OpenAI
GPT-4o Vision: Multimodal understanding & reasoning
Content Understanding: Unified multimodal extraction service
Azure Face API: Detection, verification, liveness
Custom Vision: Custom classifiers & object detectors
Video Indexer: Shot detection, transcripts, OCR in video
Content Safety: Image moderation, severity scoring

Azure AI Vision Service

Image Analysis 4.0 API

Dense Captions — up to 10 region-level captions + one overall caption
Smart Crops — area of interest detection for thumbnail generation
Object Detection — returns bounding boxes with labels and confidence scores
Tag Generation — flat list of descriptive tags with confidence
Read API (OCR) — text recognition from images and documents; returns lines and words with bounding polygons
Face Detection — detects face bounding boxes only; recognition requires Face API (limited access)
Background Removal / Segmentation — foreground extraction and semantic segmentation
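
As a rough sketch of how several of these features can be requested in one call, the helper below builds a request URL for the Image Analysis REST endpoint. The path `computervision/imageanalysis:analyze`, the `api-version` value `2023-10-01`, and the feature names are assumptions based on the public 4.0 REST API and should be checked against the current documentation. `build_analyze_url` and the example endpoint are hypothetical names used only for illustration.

```python
from urllib.parse import urlencode

# Feature names assumed from the Image Analysis 4.0 REST API; verify before use.
KNOWN_FEATURES = {
    "caption", "denseCaptions", "smartCrops",
    "objects", "tags", "read", "people",
}


def build_analyze_url(endpoint, features, api_version="2023-10-01"):
    """Build an analyze URL that requests the given visual features."""
    if not features:
        raise ValueError("At least one feature is required")
    unknown = [f for f in features if f not in KNOWN_FEATURES]
    if unknown:
        raise ValueError(f"Unknown features: {unknown}")
    # Keep commas unescaped so the feature list stays readable in the URL.
    query = urlencode(
        {"api-version": api_version, "features": ",".join(features)},
        safe=",",
    )
    return f"{endpoint.rstrip('/')}/computervision/imageanalysis:analyze?{query}"


if __name__ == "__main__":
    # Hypothetical resource endpoint.
    print(build_analyze_url(
        "https://example.cognitiveservices.azure.com/",
        ["caption", "denseCaptions", "read"],
    ))
```

The image itself would be sent in the POST body (a URL or binary data) together with the resource key; that part is omitted here.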

Spatial Analysis

People counting, zone crossing, social distancing monitoring
Deployed via Docker container on edge or Azure
May still appear on the exam even though it is deprecated in newer versions of the service

Video Indexer

Shot detection — automatic chapter/scene segmentation
Speaker diarization — who spoke when
Transcript generation — speech-to-text across multiple languages
OCR in video — reads on-screen text frame by frame
Scene understanding — labels, brands, celebrities

Custom Vision

Classification vs Object Detection: classification gives a label to the whole image; object detection returns bounding boxes
Training datasets: need labeled images per class or annotated bounding boxes
Evaluation metrics: Precision, Recall, Average Precision (AP), Mean Average Precision (mAP)
Iterative training: add images, retrain, evaluate, repeat

Multimodal AI with GPT-4o Vision

Passing Images in Messages

{
  "role": "user",
  "content": [
    {"type": "text", "text": "Describe this chart."},
    {"type": "image_url",
     "image_url": {"url": "https://...jpg"}}
  ]
}

Supported formats: JPEG, PNG, GIF, WEBP
Two input methods: image URL or base64-encoded data URI
Image URL requires the service to fetch the image from a reachable URL; base64 keeps everything in the request, at the cost of a larger payload
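
The base64 route can be sketched as follows. This is a minimal illustration that builds the same message shape as the JSON above, with the image embedded as a data URI. `image_message` and `MIME_TYPES` are hypothetical helper names, not part of any SDK.

```python
import base64

# Extension-to-MIME mapping for the four supported formats.
MIME_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def image_message(prompt, image_bytes, extension):
    """Return a user message carrying text plus a base64 data-URI image."""
    mime = MIME_TYPES.get(extension.lower())
    if mime is None:
        raise ValueError(f"Unsupported image format: {extension}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }


if __name__ == "__main__":
    # Placeholder bytes; in practice, read the file with open(path, "rb").
    message = image_message("Describe this chart.", b"abc", ".png")
    print(message["content"][1]["image_url"]["url"])
```

The resulting dictionary can be placed in the `messages` list of a chat completion request in the same position as the URL-based example.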

Vision Use Cases

Document understanding (forms, invoices, reports)
Chart reading and trend explanation
Product inspection and defect detection
Scene description and VQA (Visual Question Answering)
Multimodal agent: vision + text reasoning + function calling in one loop

Prompt Engineering for Vision

Provide detailed instructions specifying what to look for
Request specific output formats (JSON, bullet list, table)
Combine with system messages for role context

Image & Video Generation

DALL-E 3 via Azure OpenAI

API call: client.images.generate(...)
prompt — text description of the desired image
size — 1024x1024 (square), 1792x1024 (landscape), 1024x1792 (portrait)
quality — standard (faster) or hd (higher detail)
style — vivid (dramatic) or natural (photorealistic)
n — number of images (DALL-E 3 supports n=1 only)
Revised prompts: DALL-E 3 rewrites the user prompt for safety — the revised version is returned in the response
Content policy filtering rejects unsafe prompts before generation
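
To make the parameter constraints concrete, here is a small sketch that validates the values listed above and returns keyword arguments for `client.images.generate(...)`. The helper name `dalle3_request` is hypothetical, and the `model` value is an assumption: on Azure OpenAI it would be the name of your own DALL-E 3 deployment.

```python
ALLOWED_SIZES = {"1024x1024", "1792x1024", "1024x1792"}
ALLOWED_QUALITY = {"standard", "hd"}
ALLOWED_STYLE = {"vivid", "natural"}


def dalle3_request(prompt, size="1024x1024", quality="standard",
                   style="vivid", model="dall-e-3"):
    """Validate DALL-E 3 parameters and return request keyword arguments."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if size not in ALLOWED_SIZES:
        raise ValueError(f"size must be one of {sorted(ALLOWED_SIZES)}")
    if quality not in ALLOWED_QUALITY:
        raise ValueError(f"quality must be one of {sorted(ALLOWED_QUALITY)}")
    if style not in ALLOWED_STYLE:
        raise ValueError(f"style must be one of {sorted(ALLOWED_STYLE)}")
    # DALL-E 3 generates one image per request, so n is fixed at 1.
    return {"model": model, "prompt": prompt, "size": size,
            "quality": quality, "style": style, "n": 1}


if __name__ == "__main__":
    print(dalle3_request("A lighthouse at dawn", size="1792x1024", quality="hd"))
```

With a configured client, the call would look like `client.images.generate(**dalle3_request(...))`, and the rewritten prompt would then be read from the `revised_prompt` field of the returned image data.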

Model Catalog & Other Generators

Azure AI Foundry catalog includes Stable Diffusion and other open-weight image models
DALL-E 2 supports image editing (inpainting) and image variations — DALL-E 3 does not

Azure AI Content Understanding

New unified service in Azure AI Foundry that consolidates document, image, audio, and video extraction
Analyzers: pre-built analyzers for common schemas; custom analyzers for bespoke field extraction
Replaces some Form Recognizer + Vision features in a single unified API
Integrates with Azure AI Search for multimodal indexing and retrieval
Field extraction from unstructured content (invoices, receipts, images, audio recordings, video clips)
Configured and deployed inside an Azure AI Foundry project

Content Safety for Vision

Image content moderation categories: Hate, Violence, Sexual, Self-Harm
Severity levels on a 0–7 scale per category (0 = safe, 7 = most severe); image analysis currently reports a trimmed scale of 0, 2, 4, and 6
Custom blocklists for images (hash-based matching)
Protected material detection for images (copyright / IP)
Groundedness check for image-based claims
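
A threshold check over the per-category severities might look like the sketch below. The response shape (`categoriesAnalysis` with `category` and `severity` fields) and the category spelling `SelfHarm` are assumptions about the service's JSON output, and the sample values are invented for illustration only.

```python
def blocked_categories(analysis, threshold=4):
    """Return the categories whose severity meets or exceeds the threshold."""
    return sorted(
        item["category"]
        for item in analysis.get("categoriesAnalysis", [])
        if item.get("severity", 0) >= threshold
    )


if __name__ == "__main__":
    # Hypothetical analysis result; real values come from the service.
    sample = {
        "categoriesAnalysis": [
            {"category": "Hate", "severity": 0},
            {"category": "Violence", "severity": 4},
            {"category": "Sexual", "severity": 2},
            {"category": "SelfHarm", "severity": 0},
        ]
    }
    print(blocked_categories(sample))               # stricter platforms may use 2
    print(blocked_categories(sample, threshold=2))
```

Lowering the threshold makes moderation stricter: the same image passes fewer categories.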

Azure Face API

Face Detection: locate face bounding boxes and return attributes such as head pose and occlusion (age estimation was retired along with emotion); no approval required
Face Verification: compare two faces, returns similarity score — limited access required
Face Identification: match a face to a PersonGroup — limited access required (requires Microsoft approval)
Liveness Detection: anti-spoofing (determines if the subject is a real person, not a photo) — requires limited access
PersonGroup / FaceList: storage structures for known identities
Emotion detection removed in newer API versions

Custom Vision: Detail

Multi-class classification: one label per image (mutually exclusive)
Multi-label classification: multiple labels per image allowed
Object detection: bounding box annotation required per object
Quick Training vs Advanced Training: Quick Training is faster; Advanced Training lets you set a compute-time budget in exchange for potentially better accuracy
Export formats: TensorFlow, CoreML, ONNX, Docker container
Precision: of all predicted positives, what fraction is correct?
Recall: of all actual positives, what fraction was found?
mAP (mean Average Precision): average AP across all classes
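
The three metrics above can be worked through with a few lines of arithmetic. This is a minimal sketch: the counts and per-class AP values are made-up numbers, and the function names are chosen for illustration.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual positives that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mean_average_precision(ap_by_class):
    """Average the per-class AP values into a single mAP score."""
    if not ap_by_class:
        raise ValueError("At least one class is required")
    return sum(ap_by_class.values()) / len(ap_by_class)


if __name__ == "__main__":
    # Illustrative counts: 6 true positives, 2 false positives, 6 false negatives.
    print(precision(tp=6, fp=2))   # 0.75
    print(recall(tp=6, fn=6))      # 0.5
    print(mean_average_precision({"cat": 1.0, "dog": 0.5}))  # 0.75
```

In this example the detector is right three times out of four when it predicts an object, but it finds only half of the objects that are actually present.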

Decision Guide: Which Vision Service?

Need	Use This Service
Generic image captioning / tagging	Azure AI Vision (Image Analysis 4.0)
Read text from images or documents	Vision Read API (OCR)
Generate images from text prompts	DALL-E 3 via Azure OpenAI
Understand image + reason over it	GPT-4o Vision
Custom object detection / classification	Custom Vision
Video analysis at scale	Video Indexer
Multimodal document field extraction	Content Understanding
Moderate image content for safety	Content Safety
Face identification / verification	Face API (limited access)

Memory Hooks for Domain 3

Short mnemonics and mental models to lock in key facts for exam day.

"C.A.T.S" for Image Analysis 4.0

Captions (dense captions & smart crops) • Analysis (object detection & tag generation) • Text (Read API / OCR) • Segmentation (background removal)

DALL-E 3 Parameters: "PSQS"

Remember the four parameters in order: Prompt → Size → Quality → Style. Sizes: square (1024×1024), landscape (1792×1024), portrait (1024×1792). Quality: standard vs hd. Style: vivid vs natural.

📷 GPT-4o Vision Input: "URL or Base64 — no other way in"

Images reach GPT-4o Vision only as an image URL or a base64-encoded data URI inside the messages array. Supported formats: JPEG, PNG, GIF, WEBP.

Custom Vision Decision Tree

Do you need custom categories not in the standard model? Yes → Custom Vision. No → Azure AI Vision (standard). Does the customer need bounding boxes? Yes → Custom Vision Object Detection. No → Custom Vision Classification.
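
The two questions in this decision tree translate directly into a small function. This is only a sketch of the mnemonic; `choose_vision_service` is a hypothetical name and the returned strings simply repeat the labels used above.

```python
def choose_vision_service(needs_custom_categories, needs_bounding_boxes):
    """Map the two decision-tree questions to a service choice."""
    if not needs_custom_categories:
        return "Azure AI Vision (standard)"
    if needs_bounding_boxes:
        return "Custom Vision Object Detection"
    return "Custom Vision Classification"


if __name__ == "__main__":
    # A retailer wants to locate its own product SKUs on shelf photos.
    print(choose_vision_service(needs_custom_categories=True,
                                needs_bounding_boxes=True))
```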

Face API: "Face ID = Need a Form"

Face Detection (bounding box only) needs no approval. Face Identification and Verification are Limited Access features: you must submit a Microsoft approval form. Remember: "if you want to know who, you need approval."

⚖️ Precision vs Recall

Precision = "When you say YES, are you right?" (true positives / all predicted positives). Recall = "Of all the actual YESes, did you find them all?" (true positives / all actual positives). High precision + low recall = conservative detector that misses many objects.

🛠️ Content Understanding: "Swiss Army Knife"

One service for documents + images + audio + video. Think of it as the replacement and unification of Form Recognizer + parts of Vision into a single Foundry-native service.

🎥 Video Indexer: "YouTube Chapters for Enterprise"

Shot detection = automatic chapter markers. Speaker diarization = who said what. Transcript = speech-to-text. OCR in video = reads text on screen. Scene understanding = labels, brands, celebrities.

🛡️ Content Safety Severity: 0–7

Categories: Hate | Violence | Sexual | Self-Harm. Each rated 0–7. Typical production threshold: reject severity ≥ 2 or 4 depending on platform strictness. 0 = safe, 7 = most severe.

🚀 DALL-E 3 "Revised Prompt" Gotcha

DALL-E 3 rewrites your prompt before generating. The response object includes a revised_prompt field showing what was actually used. This is for safety filtering and prompt enhancement — know this for scenario questions!

Domain 3 Knowledge Check

10 scenario-based questions covering all key vision topics.

Domain 3 Flashcards

20 cards.

Study Advisor

Tailored study paths for Domain 3 based on your background.

💻 Coming From General Development (No Prior Azure AI)

Start with GPT-4o Vision: it is the most flexible service and conceptually familiar if you know the Chat Completions API. Learn the messages array image format, URL vs base64, and VQA patterns.

Add DALL-E 3: simple API, key parameters (PSQS), and the revised prompt behavior. One lab is enough.

Learn Azure AI Vision 4.0: focus on the decision guide and which feature to use for captioning vs OCR vs segmentation.

Survey the specialized services: Custom Vision (precision/recall), Face API (limited access rules), Content Safety (categories + severity), Video Indexer (capabilities). No deep implementation is needed for the exam.

🏃 Coming From AI-102 (Azure AI Engineer)

Focus on what's new in AI-103: GPT-4o Vision (multimodal reasoning) and Content Understanding (replaces Form Recognizer patterns). These are the biggest deltas.

Review DALL-E 3 parameters, especially size options and the revised prompt behavior (AI-102 focused more on DALL-E 2).

Validate your Azure AI Vision 4.0 knowledge: the C.A.T.S. features are mostly familiar, but confirm you know the new dense captions and smart crop APIs.

Face API limited access rules: know exactly which features (identification, verification, liveness) require Microsoft approval. This is a common distractor in scenario questions.

📜 Building a Real Product (Practical Focus)

Custom Vision + Content Safety: core for any user-generated content platform. Know the Custom Vision training pipeline, evaluation metrics, and export formats for edge deployment.

Content Understanding: if building document processing pipelines, understand the analyzer pattern and how it integrates with Azure AI Search.

GPT-4o Vision for reasoning tasks: use it when you need chart analysis, product inspection reasoning, or open-ended VQA rather than structured extraction.

Decision guide mastery: on the exam, scenario questions require you to instantly map a business need to the right service. Run through the decision table until it's automatic.

⌛ Short on Time (48-Hour Cram)

Memorize the Decision Guide table: 9 rows, 2 columns. This alone answers 3–4 exam questions.

Lock in DALL-E 3 parameters (PSQS) and GPT-4o image input formats (URL or base64, JPEG/PNG/GIF/WEBP).

Know Face API limited access rules and Custom Vision precision vs recall. These are high-frequency exam topics.

Run through all 10 quiz questions and all 20 flashcards at least twice. Review explanations for any you miss.

Official Study Resources

All links go to official Microsoft Learn documentation.

AI-103 Official Study Guide (learn.microsoft.com): Full exam skills outline
AI-103 Certification Page (learn.microsoft.com): Exam registration & overview
Azure AI Vision Documentation (learn.microsoft.com): Image Analysis 4.0, OCR, spatial analysis
DALL-E via Azure OpenAI (learn.microsoft.com): Parameters, safety, API reference
GPT-4o Vision How-To (learn.microsoft.com): Image input formats, use cases
Azure AI Content Understanding (learn.microsoft.com): Analyzers, multimodal extraction
