AI-103 · Text Analysis & Speech Solutions

Domain	Topic	Weight
Domain 1	Design and plan AI solutions on Azure	10–15%
Domain 2	Implement computer vision solutions	10–15%
Domain 3	Implement natural language processing solutions	25–30%
Domain 4	Implement text analysis solutions ← This Page	10–15%
Domain 5	Implement generative AI solutions	30–35%

Core Concepts

Master the Azure AI services you need for Domain 4. Each section covers the key features, decision points, and exam-relevant details.

Azure AI Language Service

Named Entity Recognition (NER) & Custom NER

Prebuilt NER categories: Person, Location, Organization, DateTime, Quantity, URL, IP Address, Email

Custom NER: Train the model with your own labeled entities. Evaluation uses precision (of what the model predicted, how much was correct), recall (of all actual entities, how many were found), and F1 score (harmonic mean of precision and recall).

Label training documents with entity spans in Language Studio
Requires a minimum amount of labeled data per entity type; more labeled data generally improves performance
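
The three evaluation metrics above can be sketched in a few lines. This is a minimal illustration that assumes exact-match scoring over (start, end, label) spans; the span values are made up, and the scoring done by the service itself may differ in detail.

```python
def span_metrics(predicted, actual):
    """Exact-match precision, recall, and F1 over (start, end, label) spans."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labeled spans: (start offset, end offset, entity type)
actual = [(0, 5, "Person"), (10, 16, "Location"),
          (20, 30, "Organization"), (35, 40, "Person")]
predicted = [(0, 5, "Person"), (10, 16, "Location"), (50, 55, "Person")]

precision, recall, f1 = span_metrics(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here the model found two of the four true entities (recall 0.50) and two of its three predictions were right (precision about 0.67), so recall is the weaker side and more labeled examples of the missed types would be the likely remedy.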

Key Phrase Extraction

Identifies the main topics or talking points from unstructured text. Returns a list of phrases that represent the key ideas. No training required — fully prebuilt.

Sentiment Analysis & Opinion Mining

Document-level sentiment: Classifies the entire text as positive, neutral, or negative (with confidence scores for each).

Opinion Mining (aspect-based sentiment): Goes deeper — identifies specific aspects (e.g., "coffee", "service") and the sentiment expressed toward each. Example: "The coffee was great but the service was slow" → coffee=positive, service=negative.

Entity Linking

Disambiguates recognized entities to known entries in a knowledge base (Wikipedia). Example: "Mercury" in an astronomy context links to the planet, not the element or the god. Returns a data source (Wikipedia URL) and confidence score.

Language Detection

Identifies the language of input text and returns a confidence score (0.0–1.0). Handles mixed-language content by returning the dominant language. Use when the source language is unknown before translation.

PII (Personally Identifiable Information) Detection

Categories detected: name, phone number, Social Security number (SSN), credit card number, email address, IP address, date of birth, passport number, and more.

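Redaction: The API returns a redacted copy of the text in which detected PII is masked or, depending on the redaction settings, replaced by category labels (e.g., "[PERSON]") for safe downstream processing. For PHI (Protected Health Information), use the same PII operation with its domain option set to PHI; this is a parameter on the PII request rather than a separate service.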
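
To make the redaction idea concrete, here is a minimal local sketch that applies category labels to detected spans. It assumes each detected entity carries a category, an offset, and a length, and that offsets line up with Python string indices (true for plain ASCII text). The sample text and entity list are invented; the real service does this server-side.

```python
def redact(text, entities):
    """Replace each detected span with its bracketed category label."""
    redacted = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for entity in sorted(entities, key=lambda e: e["offset"], reverse=True):
        start = entity["offset"]
        end = start + entity["length"]
        label = "[" + entity["category"].upper() + "]"
        redacted = redacted[:start] + label + redacted[end:]
    return redacted


# Hypothetical detection result for illustration only.
text = "Call Ana Diaz at 555-0100."
entities = [
    {"category": "Person", "offset": 5, "length": 8},
    {"category": "PhoneNumber", "offset": 17, "length": 8},
]

print(redact(text, entities))
```

Processing spans from the end of the string backward is the design choice that matters: replacing a span changes the length of the text, which would shift every offset after it.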

Text Classification

Single-label classification: Each document gets exactly one category. Used for simple categorization (e.g., "sports", "politics", "tech").

Multi-label classification: A document can belong to multiple categories simultaneously. Used when content spans multiple topics.

Both require custom training with labeled examples in Language Studio.

Summarization

Extractive summarization: Selects and returns the most important existing sentences from the source document. Output sentences are verbatim from input.

Abstractive summarization: Generates a new summary in the model's own words. May not use exact source sentences.

Conversation summarization: Summarizes multi-turn dialogue — returns issue, resolution, and chapter structure.

Question Answering (QnA)

Build FAQ-style knowledge bases from documents, URLs, or manually entered Q&A pairs. Returns answers with a confidence score (0–1). Supports follow-up prompts for multi-turn conversations. Hosted in Azure AI Language (replaces QnA Maker).

Conversational Language Understanding (CLU)

The modern replacement for LUIS. Understands natural language input by predicting intents (what the user wants) and extracting entities (key data). Training utterances teach the model variations of each intent.

Intents: BookFlight, CancelOrder, GetWeather
Entities: prebuilt (datetime, number) or custom (FlightDestination)
Integrates with Speech SDK for voice-driven applications
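
A prediction is only useful once the application acts on it. The sketch below routes on the top-scoring intent and falls back when confidence is low. The dictionary shape (intents with a category and confidenceScore, entities with a category and text) is a simplified assumption about the prediction result, and the 0.7 threshold is an arbitrary illustrative value.

```python
def route(prediction, min_confidence=0.7):
    """Return (intent, entities), or ("None", {}) when the top score is too low."""
    top = max(prediction["intents"], key=lambda i: i["confidenceScore"])
    if top["confidenceScore"] < min_confidence:
        return "None", {}
    entities = {e["category"]: e["text"] for e in prediction["entities"]}
    return top["category"], entities


# Hypothetical prediction for the utterance "book me a flight to Paris".
prediction = {
    "intents": [
        {"category": "BookFlight", "confidenceScore": 0.92},
        {"category": "GetWeather", "confidenceScore": 0.05},
        {"category": "CancelOrder", "confidenceScore": 0.03},
    ],
    "entities": [{"category": "FlightDestination", "text": "Paris"}],
}

intent, slots = route(prediction)
print(intent, slots)
```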

Azure OpenAI for Text Tasks

When to Use Azure OpenAI (GPT-4o) vs Azure AI Language

Factor	Azure AI Language	GPT-4o (Azure OpenAI)
Output structure	Consistent, typed JSON	Structured outputs via JSON Schema
Latency	Lower	Higher
Cost	Lower	Higher (token-based)
Training data needed	None for prebuilt features; labeled examples for custom models	Few-shot or zero-shot
Complex reasoning	Limited	Chain-of-thought capable
Compliance/audit	Easier (deterministic)	More variable outputs

Use Language service for: PII detection at scale, structured NER, compliance scenarios, cost-sensitive applications.

Use GPT-4o for: nuanced sentiment with reasoning, complex entity extraction with context, when you have few labeled examples.

Azure AI Translator

Translation Modes

Text Translation: Synchronous API call, source language auto-detected or specified, target language required. Supports 100+ languages.
Document Translation: Asynchronous batch translation of complete documents (PDF, DOCX, PPTX, etc.). Use for large-scale document processing. Results stored in Azure Blob Storage.
Custom Translator: Train a domain-specific translation model with a parallel corpus (source + target sentence pairs). Use when standard NMT produces poor results for specialized vocabulary (legal, medical, technical).
Transliteration: Converts text from one script to another without changing the language. Example: Arabic text → Latin characters. Not a translation — pronunciation stays the same.
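
The asynchronous flow used by Document Translation (submit a job, then poll until it finishes) can be sketched generically. In this illustration `fetch_status` is a hypothetical callable standing in for whatever status request your client makes, and the status names are assumptions; check the names your API version actually returns.

```python
import time

# Assumed terminal status names; verify against the service response.
TERMINAL_STATES = {"Succeeded", "Failed", "Cancelled", "ValidationFailed"}


def wait_for_job(fetch_status, poll_seconds=5.0, max_polls=60, sleep=time.sleep):
    """Poll fetch_status() until the job reaches a terminal state."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("job did not finish within the polling budget")


# Simulated job that succeeds on the fourth status check (no real waiting).
statuses = iter(["NotStarted", "Running", "Running", "Succeeded"])
final = wait_for_job(lambda: next(statuses), poll_seconds=0, sleep=lambda s: None)
print(final)
```

The same submit-then-poll shape applies to any long-running batch job; a synchronous text translation call returns its result directly and needs none of this.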

Azure AI Speech Service

Speech-to-Text (STT)

Real-time recognition: Continuous (ongoing stream, e.g., live call) or single utterance (one phrase, then stops). Use Speech SDK.
Batch transcription: Asynchronous — submit audio files (WAV, MP3, OGG) to REST API, poll for results. Best for large volumes of recorded audio.
Custom Speech: Fine-tune the acoustic and language models with domain-specific audio data and pronunciation dictionaries. Use when standard STT misrecognizes industry terms.
Diarization: Labels who spoke each segment in multi-speaker audio. Returns speaker IDs (Speaker 1, Speaker 2...) with timestamps.
Word-level timestamps: Returns the start/end time of each recognized word.
Language identification: Detect which language is being spoken before or during transcription.

Text-to-Speech (TTS)

Prebuilt neural voices: Hundreds of voices across languages — no training required.
Custom Neural Voice: Create a unique AI voice from voice talent recordings. Requires Microsoft approval (limited access program). Voice talent must give explicit written consent.
SSML (Speech Synthesis Markup Language): XML-based markup to control speech rate, pitch, emphasis, pauses (breaks), volume, and pronunciation. Wrap text in <speak> and use tags like <prosody rate="slow">, <break time="500ms"/>, <emphasis level="strong">.
Audio formats: WAV (uncompressed), MP3 (compressed), OGG. Choose based on quality vs. file size needs.
Real-time vs batch synthesis: Real-time for interactive apps, batch for pre-generating large audio libraries.
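
A small SSML document using the tags named above might look like the following. This sketch only builds the markup and checks that it is well-formed XML; it does not call the Speech service. The voice name is an assumption chosen for illustration, and support for individual tags such as emphasis can vary by voice.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"

# Voice name is an illustrative assumption; pick one available in your region.
ssml = (
    f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang="en-US">'
    '<voice name="en-US-JennyNeural">'
    '<prosody rate="slow">Your order has shipped.</prosody>'
    '<break time="500ms"/>'
    'It arrives <emphasis level="strong">tomorrow</emphasis>.'
    '</voice>'
    '</speak>'
)

# Parsing fails loudly if a tag is left unclosed or misnested.
root = ET.fromstring(ssml)
tags = [element.tag.split("}")[-1] for element in root.iter()]
print(tags)
```

Validating the markup locally before sending it is a cheap way to catch unclosed tags, which are a common cause of rejected synthesis requests.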

Speech Translation, Intent Recognition, Speaker Recognition

Speech Translation: Pipeline of STT → translation → TTS. Speak in one language, get speech output in another. Single SDK call.
Keyword Recognition: Detect specific wake words or trigger phrases locally (on-device). Low-latency, always-listening capability.
Intent Recognition: Combines Speech SDK (STT) with CLU to understand natural language voice commands. One round trip: audio → text → intent + entities.
Speaker Verification: Confirm that a speaker is who they claim to be (1:1 comparison against enrolled voiceprint).
Speaker Identification: Determine which of several enrolled speakers is talking (1:N comparison against a group).
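
The 1:1 versus 1:N distinction is easy to see in code. The sketch below is purely conceptual: it treats each voiceprint as a small made-up vector and compares vectors with cosine similarity. This is not how the service is implemented, and the vectors, names, and 0.8 threshold are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(sample, claimed_profile, threshold=0.8):
    """Verification (1:1): does the sample match the one claimed profile?"""
    return cosine(sample, claimed_profile) >= threshold


def identify(sample, profiles, threshold=0.8):
    """Identification (1:N): which enrolled profile matches best, if any?"""
    best_name, best_score = max(
        ((name, cosine(sample, vector)) for name, vector in profiles.items()),
        key=lambda pair: pair[1],
    )
    return best_name if best_score >= threshold else None


# Hypothetical enrolled voiceprints and an incoming sample.
profiles = {"alice": [0.9, 0.1, 0.2], "bob": [0.1, 0.8, 0.3]}
sample = [0.85, 0.15, 0.25]

print(verify(sample, profiles["alice"]), identify(sample, profiles))
```

Verification answers a yes/no question about one claimed identity, while identification searches the whole enrolled group and may return no match at all.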

Content Safety for Text

Harm Categories & Prompt Shield

Harm categories: Hate, Violence, Self-harm, Sexual. Each scored 0–7 (0 = safe, 7 = severe).
Prompt Shield — User jailbreak detection: Detects when a user message attempts to override model safety guidelines or extract unsafe behaviors ("ignore previous instructions...").
Prompt Shield — Indirect injection detection: Detects malicious instructions embedded in documents fed to the model (the document tells the model to do something harmful). Two separate shields, each independently configurable.
Groundedness detection: Verifies that a model's response is factually supported by the provided grounding documents. Helps detect hallucination in RAG systems.
Protected material detection: Detects if model output contains copyrighted text (song lyrics, news articles, books).
Custom blocklists: Define your own banned words, phrases, or regex patterns specific to your use case.
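
Severity scores only become a moderation decision once you apply thresholds. The sketch below assumes the analysis result is a list of category and severity pairs and that the application chooses its own per-category cutoffs; the threshold values and the sample result are invented for illustration.

```python
# Illustrative per-category cutoffs on the 0-7 severity scale (assumed values).
DEFAULT_THRESHOLDS = {"Hate": 2, "Violence": 4, "SelfHarm": 2, "Sexual": 4}


def blocked_categories(categories_analysis, thresholds=DEFAULT_THRESHOLDS):
    """Return the categories whose severity meets or exceeds its threshold."""
    return [
        item["category"]
        for item in categories_analysis
        # Unknown categories are blocked at any nonzero severity.
        if item["severity"] >= thresholds.get(item["category"], 1)
    ]


# Hypothetical analysis result for one piece of text.
analysis = [
    {"category": "Hate", "severity": 0},
    {"category": "Violence", "severity": 4},
    {"category": "SelfHarm", "severity": 0},
    {"category": "Sexual", "severity": 2},
]

print(blocked_categories(analysis))
```

Keeping thresholds per category lets a platform be stricter on some harms than others, for example tolerating mild violence in news content while blocking any self-harm signal.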

Decision Table: Text Task → Service

Task	Recommended Service / Feature
Extract named entities from text	Azure AI Language — NER
Detect sentiment + specific aspect opinions	Azure AI Language — Sentiment + Opinion Mining
Transcribe meeting audio with speaker labels	Speech Service — Batch Transcription + Diarization
Translate 10,000 documents to French	Azure Translator — Document Translation (async batch)
Build FAQ chatbot from existing docs	Azure AI Language — Question Answering
Detect PII in medical records	Azure AI Language — PII Detection (PHI endpoint)
Build voice assistant with intent understanding	Speech SDK (STT) + CLU + TTS
Complex nuanced text reasoning	Azure OpenAI — GPT-4o
Detect harmful / unsafe text	Azure Content Safety
Convert Arabic script to Latin characters	Azure Translator — Transliteration
Detect jailbreak in user chat message	Content Safety — Prompt Shield (user)
Summarize using exact source sentences	Azure AI Language — Extractive Summarization
Generate fluent paraphrased summary	Azure AI Language — Abstractive Summarization
Understand spoken commands ("book me a flight")	Speech SDK Intent Recognition (STT + CLU)
Translate technical documents with jargon	Azure Translator — Custom Translator

Memory Hooks

High-impact mnemonics and mental models to anchor exam concepts. These are the patterns that stick when exam pressure is high.

🧠

Language Service Tasks

"NERVES" for Language Tasks

NER • Entity Linking • Redaction (PII) • Verification (sentiment) • Extraction (key phrases) • Summarization. Six core Azure AI Language capabilities in one word.

🎙️

STT Modes

Real-time = live phone call; Batch = recorded voicemail

Real-time recognition streams audio as it's spoken — like a live call center agent. Batch transcription processes stored audio files asynchronously — like transcribing all yesterday's voicemails overnight.

✏️

Summarization Types

Extractive = highlight pen; Abstractive = your own words

Extractive summary: you take a yellow highlighter and mark sentences that already exist. Abstractive summary: you close the book and write what you remember in your own words. Same distinction in the Azure Language API.

🔄

CLU vs LUIS

CLU is LUIS 2.0 — same concept, newer service

CLU (Conversational Language Understanding) replaces LUIS and keeps the same intent/entity model. If an exam scenario describes "LUIS" functionality, the answer is CLU. LUIS is retired; all new development uses CLU.

🎤

Custom Neural Voice

Must apply — Microsoft doesn't let anyone clone voices freely

Custom Neural Voice requires submitting an application to Microsoft and getting approved. Voice talent must give explicit written consent. This is gated to prevent unauthorized voice cloning.

🗣️

SSML Mnemonic

"Slow Speech Makes Listeners" — SSML

SSML controls Speech Speed, eMphasis, and the Length of pauses. If you need to control how text is spoken (rate, pitch, breaks, pronunciation), reach for SSML tags inside your TTS call.

😊

Sentiment: Document vs Aspect

Document = how the whole review feels; Aspect = how they feel about the coffee specifically

A 3-star restaurant review might score "neutral" at the document level, yet opinion mining finds food=positive, noise=negative, service=negative. Aspect-level is what you need when you care about what drove the sentiment.

🚫

PII Categories

Name, phone, SSN, credit card, email — 5 things you'd never write on a public whiteboard

These are the core PII categories Azure AI Language detects. If a scenario involves protecting any of these from appearing in logs, transcripts, or output, PII Detection (with redaction) is the answer.

👥

Diarization

Diarization = "Who said that?" — speaker labeling in transcripts

Diarization segments a transcript by speaker. A meeting recording comes back labeled "Speaker 1: ... Speaker 2: ..." rather than one unbroken wall of text. Combine with batch transcription for recorded meetings.

🛡️

Prompt Shield

Front door guard + mail screener

Prompt Shield has two modes: user jailbreak (the front door guard — stops bad user messages before they reach the model) and indirect injection (the mail screener — checks documents you feed to the model for hidden instructions). Both can be enabled independently.

📊

Custom NER Evaluation

Precision = "Did I predict correctly?" Recall = "Did I find them all?"

Precision: of all entity spans the model returned, what % were actually correct? Recall: of all true entity spans in the data, what % did the model find? F1 = the balance between both. Low recall = missing entities; low precision = false positives.

🌐

Transliteration vs Translation

Transliteration = same sound, new alphabet. Translation = same meaning, new language.

Transliteration converts the script only — "مرحبا" → "Marhaba" (still Arabic, just Latin letters). Translation changes the language — "مرحبا" → "Hello" (now English). Know which one a scenario is asking for.

Practice Quiz

10 scenario-based questions covering the key decision points in Domain 4. Select the best answer for each question.

Flashcards

20 cards covering essential Domain 4 concepts. Click any card to flip and reveal the answer.

Study Advisor

Personalized focus recommendations based on what you're building. Match your use case to the services that matter most for your scenario.

📞 Building a Call Center App

Primary Focus

Azure AI Speech — STT (real-time + batch), Diarization, Custom Speech, TTS

Secondary Focus

Azure AI Language — NER (for caller intent), Sentiment Analysis (call quality scoring), PII Redaction (remove sensitive data from transcripts)

Day 1: Speech STT modes — real-time vs batch, when to use each

Day 2: Diarization setup, word-level timestamps, batch transcription async flow

Day 3: Language NER + Sentiment for post-call analytics

Day 4: PII detection and redaction pipeline

Day 5: Content Safety integration + Custom Speech for domain jargon

📝 Building a Content Platform

Primary Focus

Content Safety — harm categories, Prompt Shield, custom blocklists, groundedness detection

Secondary Focus

Azure AI Language — PII Detection (user-generated content), Text Classification (auto-categorize articles), Summarization (content previews)

Day 1: Content Safety harm categories (Hate, Violence, Self-harm, Sexual) and severity 0–7

Day 2: Prompt Shield — user jailbreak vs document indirect injection, differences

Day 3: PII Detection + redaction for user content

Day 4: Text Classification (single vs multi-label) for content tagging

Day 5: Extractive vs abstractive summarization for previews and digests

💻 General Azure AI Developer

Start Here

Azure AI Language service overview — all capabilities and when each applies

Then Expand

Speech SDK fundamentals, Azure Translator (text vs document vs custom), Content Safety basics

Day 1: Azure AI Language — NER, Key Phrase, Sentiment, Entity Linking

Day 2: PII detection, Text Classification, Summarization (both types)

Day 3: QnA and CLU (what replaces LUIS and QnA Maker)

Day 4: Speech SDK — STT modes, TTS + SSML, translation pipeline

Day 5: Azure Translator modes, Content Safety, Language vs GPT-4o decision

🌐 Building a Global App (Translation)

Primary Focus

Azure AI Translator — text translation, document translation (async), Custom Translator, transliteration

Secondary Focus

Language Detection (pre-translation), Speech Translation (voice apps), Content Safety (multilingual content moderation)

Day 1: Translator text API — source/target language, auto-detect, 100+ languages

Day 2: Document Translation async flow — Blob Storage input/output, polling

Day 3: Custom Translator — when standard NMT fails, parallel corpus training

Day 4: Transliteration vs translation distinction, language detection confidence

Day 5: Speech Translation pipeline (STT → translate → TTS), Azure AI Language + Translator combination

Key Exam Traps to Avoid

Trap 1: "Build a FAQ bot" → Answer is Question Answering (not CLU, which is for intent/entity extraction from conversation)
Trap 2: "Custom Speech" is about recognition accuracy for domain vocabulary — not about creating a custom voice (that's Custom Neural Voice)
Trap 3: Document Translation is async — you submit, get a job ID, then poll for results. Not synchronous.
Trap 4: Extractive summarization returns existing sentences verbatim. If the scenario says "do not paraphrase," choose extractive.
Trap 5: Speaker Verification = 1 person confirming identity. Speaker Identification = figuring out which of N enrolled speakers is talking.
