The AI lifecycle is a continuous loop — not a one-time project. Operationalizing AI well requires disciplined engineering practices (MLOps) applied at every phase from data to decommission.
Auditor's Focus: At each lifecycle phase ask: Is there an accountable owner? Is there documented evidence this step was completed? Are risks addressed? The AI lifecycle is only governable if each phase produces artifacts — data cards, model cards, test reports, deployment approvals, monitoring dashboards — that auditors can inspect.
AI/ML Lifecycle — 8 Phases
1
Problem Definition & Scoping
Deliverable: Problem statement, success criteria, constraints, use-case risk classification
Define the business problem and whether AI is the right solution. Establish measurable success criteria (accuracy targets, latency SLAs), regulatory constraints, and a preliminary risk tier classification. Gate: business and risk sign-off before resources are committed.
Audit check: Is the AI use case formally approved? Is a risk tier assigned? Is the intended use documented?
2
Data Collection & Ingestion
Deliverable: Data catalog entry, consent/provenance documentation, ingestion pipeline
Identify, acquire, and ingest data from internal systems, third parties, or public sources. Document data provenance, consent basis (GDPR), and legal authority to use the data. Establish data lineage tracking from this phase forward.
Audit check: Is consent documented? Is data provenance traceable? Are third-party data agreements in place?
3
Data Preparation & Feature Engineering
Deliverable: Versioned transformation scripts, feature documentation, prepared training dataset
Clean, normalize, and transform raw data. Engineer features (derived variables) for model input. Critical risks: introducing data leakage (future information contaminating training) and creating proxy variables for protected attributes during feature engineering.
Audit check: Are transformations versioned and documented? Is data leakage testing performed? Are proxy variables identified?
4
Model Development & Training
Deliverable: Trained model artifact, experiment tracking logs, hyperparameter records
Select algorithm, define architecture, run training experiments. Track hyperparameters, metrics, and artifacts in an experiment tracker (MLflow, W&B). Versioning of code, data, and model outputs enables reproducibility — a core MLOps requirement.
Audit check: Are experiments logged? Are model artifacts versioned? Is training reproducible from versioned inputs?
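A minimal sketch of what phase-4 experiment tracking can look like with MLflow (one of the trackers named above); the dataset, model choice, hyperparameter values, and the dataset_version tag are illustrative assumptions, not a prescribed setup:

```python
# Sketch: log one training run so it is reproducible and auditable.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

params = {"n_estimators": 200, "max_depth": 5, "random_state": 42}  # illustrative

with mlflow.start_run(run_name="candidate-rf-v1"):
    mlflow.log_params(params)                     # hyperparameters
    mlflow.set_tag("dataset_version", "v1.3")     # tie the run to a data version (assumed tag)
    model = RandomForestClassifier(**params).fit(X_train, y_train)
    mlflow.log_metric("f1", f1_score(y_test, model.predict(X_test)))  # evaluation metric
    mlflow.sklearn.log_model(model, "model")      # versioned model artifact
```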
5
Model Validation & Testing
Deliverable: Validation report, bias test results, go/no-go sign-off
Evaluate the model on holdout test data. Run accuracy testing, bias testing (demographic parity), robustness testing, and explainability analysis, plus a challenger comparison against the current production model if one exists. Formal sign-off is required before deployment.
Audit check: Is the test set truly unseen? Are bias tests run across demographic groups? Is validation sign-off documented?
6
Deployment
Deliverable: Change approval record, model registry entry, rollback plan
Move the validated model to production via the approved change management process. Select a deployment pattern (batch, real-time, canary, A/B). Register the model in the model registry with version, metadata, and lineage. Maintain rollback capability to the prior production version.
Audit check: Is there a change approval record? Is a rollback procedure tested and documented? Is the model version in the registry?
7
Monitoring & Maintenance
Deliverable: Monitoring dashboard, drift alerts, performance reports, retraining records
Continuously monitor model performance (accuracy, PSI drift, prediction drift) and input data quality. Detect degradation early and trigger retraining or rollback when thresholds are breached. Document all maintenance actions with change records.
Audit check: Are monitoring thresholds defined and documented? Are drift alerts actioned within SLA? Are retraining events approved?
8
Retirement & Decommission
Deliverable: Decommission plan, data retention/deletion record, successor model handoff
Formally retire models that are superseded, no longer fit-for-purpose, or pose unacceptable risk. Document the retirement decision, preserve audit trails and model artifacts per retention policy, and ensure data deletion obligations (GDPR right to erasure) are met.
Audit check: Is model retirement formally approved? Are artifacts retained per policy? Is data deletion documented?
MLOps — Operationalizing AI
🗄️
MLOps Component
Data Pipeline Automation
Automated, reproducible pipelines for data ingestion, transformation, and feature computation. Triggers reprocessing when source data changes. Enables consistent data delivery to training and serving.
🔄
MLOps Component
CI/CD/CT Pipeline
CI = Continuous Integration (test code changes). CD = Continuous Delivery (deploy validated models). CT = Continuous Training (retrain models automatically when drift or performance triggers are met).
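A minimal sketch of the CT idea under stated assumptions — the monitoring metrics are assumed to be computed elsewhere, and the threshold values are illustrative, not prescribed:

```python
# Sketch: a Continuous Training trigger fires when drift or performance breaches a threshold.
def should_retrain(psi: float, f1: float, psi_threshold: float = 0.2, f1_floor: float = 0.80) -> bool:
    """Fire CT when input drift or performance degradation crosses a defined threshold."""
    return psi > psi_threshold or f1 < f1_floor

if should_retrain(psi=0.27, f1=0.84):
    # The retrained model must still pass automated validation gates before deployment.
    print("Trigger retraining pipeline")
```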
🧪
MLOps Component
Experiment Tracking
Tools (MLflow, W&B, Neptune) log every training run: hyperparameters, metrics, dataset versions, and model artifacts. Enables reproducibility, comparison across experiments, and audit trail for model development decisions.
📦
MLOps Component
Model Registry
Central repository storing all model versions with metadata, lineage, and lifecycle status (Staging → Production → Archived). Provides single source of truth for model governance and deployment approvals.
📡
MLOps Component
Model Monitoring
Ongoing measurement of data drift (PSI), prediction drift, performance metrics (Precision/Recall/F1), and infrastructure health. Triggers automated alerts and retraining when thresholds are breached.
📋
MLOps Component
Feature Store
Centralized repository for computed features. Offline store: historical features for model training. Online store: low-latency feature retrieval for real-time inference. Ensures training-serving consistency — preventing training-serving skew.
DevOps vs. MLOps — Key Differences
⚙️ DevOps
🤖 MLOps
Versioned artifacts
Application code
Code + Data + Model (all three must be versioned)
Testing focus
Unit tests, integration tests, functional correctness
+ Model accuracy, bias tests, data quality, drift tests
Pipeline triggers
Code commits, PRs
+ Data changes, performance degradation, drift alerts (CT)
Reproducibility
Same code + environment = same output
Same code + data + hyperparameters = same model
Failure mode
Build fails, tests fail — visible immediately
Silent degradation — model drifts slowly, no code changes
Monitoring concern
Uptime, error rates, latency
+ Data drift, model accuracy, fairness, prediction distributions
Model Development & Validation
Sound model development requires disciplined data preparation, reproducible training, and rigorous validation before any model touches production.
Key Audit Risk — Data Leakage: Data leakage occurs when information from the test set (or future data) contaminates the training process, artificially inflating performance metrics. A leaky model appears excellent in testing but fails catastrophically in production. Temporal separation (never using future data to predict past) is the primary safeguard.
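One common leakage source — fitting a scaler on the full dataset before splitting, discussed in the steps below — can be shown in a short leakage-safe sketch; the synthetic data is purely illustrative:

```python
# Sketch: split first, then fit preprocessing on the training data ONLY.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 4))          # illustrative features
y = (X[:, 0] > 0).astype(int)           # illustrative labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)   # fit on training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test) # reuse training statistics — no leakage

# Anti-pattern (leakage): StandardScaler().fit(X) on the full dataset before splitting.
```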
Data Preparation — Key Steps & Risks
Step 1
Data Profiling
Assess data quality: completeness, distributions, outliers, missing values, and class imbalance. Establishes a baseline before any transformations.
Risk: Undocumented profiling = unknown data quality entering the pipeline
Step 2
Data Cleaning
Handle missing values (imputation, deletion), remove duplicates, correct format errors, clip outliers. All decisions must be documented in transformation scripts.
Risk: Inappropriate imputation strategies introduce systematic bias into training data
Step 3
Feature Engineering
Create derived features (aggregations, interactions, embeddings). All feature transformations must be versioned and applied consistently to training and inference data.
Risk: Proxy variables — features that correlate with protected attributes (e.g., zip code → race)
Step 4
Train/Val/Test Split
Divide data into training (model learning), validation (hyperparameter tuning), and holdout test (final unbiased evaluation). Test set must remain unseen until final evaluation.
Risk: Data leakage if test data is seen during training; temporal leakage for time-series data
Step 5
Normalization & Encoding
Scale numeric features (min-max, z-score) and encode categoricals (one-hot, ordinal, embeddings). Scalers fitted on training data only — then applied to validation/test.
Risk: Fitting scalers on full dataset before splitting causes data leakage
Step 6
Class Imbalance Handling
Address skewed class distributions: oversampling (SMOTE), undersampling, class weights, or threshold tuning. Imbalanced data causes models to predict majority class and miss rare events.
Risk: Accuracy is misleading on imbalanced data — use Precision/Recall/F1 instead
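A small sketch of why accuracy misleads on imbalanced data, using a hypothetical 1%-positive label distribution and a naive majority-class predictor:

```python
# Sketch: a model that always predicts the majority class scores 99% accuracy but misses every rare event.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([0] * 990 + [1] * 10)   # 1% positive (rare event) class — illustrative
y_pred = np.zeros(1000, dtype=int)        # naive "always majority class" predictions

print("accuracy :", accuracy_score(y_true, y_pred))                    # 0.99 — looks great
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("recall   :", recall_score(y_true, y_pred))                      # 0.0 — misses every positive
print("f1       :", f1_score(y_true, y_pred))                          # 0.0
```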
Model Training — Key Concepts
Concept
Definition
Audit/Risk Relevance
Overfitting
Model memorizes training data; performs well on training, poorly on unseen data
Model appears performant in testing but fails in production — a silent risk
Underfitting
Model is too simple to capture patterns; poor performance on both training and test
Detectable — but model may still be deployed if business pressure overrides technical rigor
Hyperparameter Tuning
Optimizing model configuration parameters (learning rate, depth, regularization) that are set before training
Must use validation set only — using test set for tuning = data leakage
Reproducibility
Ability to re-run training with the same inputs (code, data, hyperparameters) and obtain the same model
Core MLOps requirement — without it, model lineage and audit trails are unverifiable
Experiment Tracking
Systematic logging of every training run: parameters, metrics, artifacts, dataset version
Provides audit trail for model development decisions; required for regulatory compliance
Transfer Learning
Adapting a pre-trained model (e.g., foundation model) to a new task using fine-tuning
Inherits risks from original training data — provenance and bias from source model must be assessed
Model Validation Approaches
Pre-Deployment
Holdout Testing
Model evaluated on a fixed, unseen test set that was set aside before training. The gold standard for unbiased performance estimation.
Audit requirement: Confirm the holdout set was not touched until final evaluation
Pre-Deployment
Cross-Validation
Data is split into k folds; model trains on k-1 folds and validates on the remaining fold, rotating through all combinations. More robust for small datasets.
Used during development — not for final holdout evaluation; each fold must maintain temporal integrity for time-series
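A minimal sketch of k-fold cross-validation during development, with the time-series variant noted; the dataset and model are illustrative:

```python
# Sketch: k-fold cross-validation for development-time robustness estimation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

cv = KFold(n_splits=5, shuffle=True, random_state=42)        # standard k-fold
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print("Mean F1 across folds:", scores.mean())

ts_cv = TimeSeriesSplit(n_splits=5)  # chronological folds preserve temporal integrity for time-series
```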
Pre-Deployment
Champion-Challenger
New candidate model (challenger) is validated against the current production model (champion) on the same test set. The challenger must meet or exceed the champion's metrics before being promoted.
Best practice for model updates — ensures new model is demonstrably better, not just different
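A minimal sketch of a champion-challenger promotion gate under stated assumptions — the metric values and the zero-uplift default are illustrative, not a required policy:

```python
# Sketch: promote the challenger only if it meets or exceeds the champion on the SAME holdout set.
def promote_challenger(champion_f1: float, challenger_f1: float, min_uplift: float = 0.0) -> bool:
    """Return True only when the challenger is demonstrably at least as good."""
    return challenger_f1 >= champion_f1 + min_uplift

champion_f1, challenger_f1 = 0.87, 0.89   # illustrative metrics, both from the same test set
if promote_challenger(champion_f1, challenger_f1):
    print("Challenger cleared the gate — record the rationale and route for approval")
else:
    print("Keep the champion in production")
```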
Live Testing
Shadow Mode Testing
New model runs in parallel with the production model. Receives live inputs and generates predictions, but its outputs are NOT used for real decisions. Results are logged for comparison.
Low risk — zero production impact. Ideal for validating model behavior on real traffic before exposure
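A minimal sketch of shadow-mode serving; the model objects and logger name are hypothetical placeholders, and only the champion's output is ever returned to the caller:

```python
# Sketch: the shadow model sees the same live input but its prediction is only logged.
import logging

shadow_log = logging.getLogger("shadow_comparison")  # hypothetical logger name

def predict(request_features, champion_model, shadow_model):
    decision = champion_model.predict([request_features])[0]    # used for the real decision
    shadow_pred = shadow_model.predict([request_features])[0]   # logged, never acted on
    shadow_log.info({"features": request_features,
                     "champion": int(decision),
                     "shadow": int(shadow_pred)})
    return decision
```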
Live Testing
A/B Testing
Live traffic is split between model A (current) and model B (new). Both serve real users. Statistical comparison of outcomes determines the winner after sufficient sample accumulates.
Both model variants make real decisions, so exposure is higher than shadow or canary testing
Live Testing
Canary Deployment
New model receives a small percentage of live traffic (e.g., 5%) while the current model handles the remainder. Traffic shifts gradually as confidence builds. Full rollback is available at any point.
Balances risk and learning: early exposure catches real-world failures while limiting blast radius
Deployment & Operations
How a model is deployed determines its performance profile, governance requirements, and failure modes. Operational decisions made at deployment shape the entire monitoring strategy.
Training-Serving Skew: One of the most common and damaging operational failures. Occurs when the features computed at inference time differ from those computed during training — different code paths, different data sources, or stale feature values. A feature store solves this by providing a single consistent feature computation layer for both training and serving.
AI Deployment Patterns
📦
Pattern
Batch Inference
Latency: Minutes to hours · Throughput: Very high
Model runs on large datasets on a schedule (nightly, weekly). Outputs stored for downstream consumption. Does not require low latency.
Use cases: Credit score updates, recommendation pre-computation, fraud batch screening
⚡
Pattern
Real-Time (Online) Inference
Latency: Milliseconds · Throughput: Moderate
Model serves individual predictions synchronously in response to API calls. Requires low latency infrastructure (model serving endpoints, feature caching).
Use cases: Real-time fraud detection, chatbot responses, real-time personalization
📱
Pattern
Edge Deployment
Latency: Sub-millisecond · Network: None required
Model runs on-device (IoT sensors, smartphones, edge servers). Requires model compression (quantization, pruning). No round-trip to cloud — critical for offline and privacy-sensitive use cases.
Use cases: Autonomous vehicles, medical devices, industrial IoT, on-device NLP
🌊
Pattern
Streaming Inference
Latency: Low · Throughput: Continuous
Model processes continuous event streams (Kafka, Kinesis) in near-real time. Each event triggers a prediction as it arrives. Combines low latency with high volume processing.
Use cases: Transaction fraud streams, clickstream personalization, network anomaly detection
MLOps Infrastructure Components
Model Registry
Centralized store for all model versions with metadata (training data version, hyperparameters, metrics, lineage) and lifecycle stages: Staging → Production → Archived. Acts as the deployment gate — only registry-approved models can go to production.
Audit: Verify every production model has a registry entry with approval records and lineage
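A minimal sketch of registry-gated promotion using the MLflow Model Registry as one concrete example; the model name, local SQLite tracking URI, and stage choice are illustrative assumptions:

```python
# Sketch: log, register, and stage a model version through the MLflow Model Registry.
import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# The registry needs a database-backed tracking store (illustrative local SQLite URI).
mlflow.set_tracking_uri("sqlite:///mlflow.db")

X, y = load_iris(return_X_y=True)
with mlflow.start_run():
    model = LogisticRegression(max_iter=1000).fit(X, y)
    # Log the artifact and register it in one step (model name is illustrative).
    mlflow.sklearn.log_model(model, "model", registered_model_name="demo_risk_model")

client = MlflowClient()
version = client.get_latest_versions("demo_risk_model", stages=["None"])[0]
# Stage transitions should only follow documented approval.
client.transition_model_version_stage(
    name="demo_risk_model", version=version.version, stage="Staging"
)
```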
Feature Store
Offline store: historical features for model training (batch processing). Online store: real-time low-latency feature retrieval for inference. Solves training-serving skew by using the same feature definitions in both environments.
Audit: Confirm training and serving use identical feature definitions; check feature freshness SLAs
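A conceptual sketch (not a real feature-store API) of the offline/online split: one shared feature definition is materialized into both stores so training and serving cannot diverge. Names and the in-memory dictionaries are illustrative:

```python
# Sketch: a single feature definition feeds both the offline (training) and online (serving) stores.
from datetime import datetime

def txn_count_7d(transactions: list[datetime], as_of: datetime) -> int:
    """Shared feature definition used by BOTH the training and serving paths."""
    return sum((as_of - t).days < 7 for t in transactions)

offline_store = {}   # entity_id -> historical feature rows for batch training
online_store = {}    # entity_id -> latest value for low-latency inference

def materialize(entity_id: str, transactions: list[datetime], as_of: datetime) -> None:
    value = txn_count_7d(transactions, as_of)                       # one code path, no skew
    offline_store.setdefault(entity_id, []).append((as_of, value))  # offline: training history
    online_store[entity_id] = value                                 # online: real-time serving
```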
Experiment Tracker
Logs every training run: hyperparameters, code version, dataset version, metrics at each epoch, and final model artifact. Tools: MLflow, Weights & Biases, Neptune. Enables full reproducibility and comparison across experiments.
Audit: Confirm experiments are logged for all production model candidates; verify reproducibility from logged artifacts
Model Serving Infrastructure
Platforms (TensorFlow Serving, TorchServe, Triton, KServe) that host models as scalable API endpoints. Handle batching, versioning, autoscaling, and A/B traffic routing. Must support rollback to previous model version.
Audit: Review rollback capability, SLA monitoring, and change approval integration with serving infrastructure
Data & Model Lineage
End-to-end tracking linking: source data → preprocessing → features → training run → model artifact → deployed model → predictions. Enables root-cause analysis of model failures and supports regulatory "right to explanation" requirements.
Audit: Trace any production model back to its training data; verify lineage is automated and tamper-evident
Monitoring & Alerting Platform
Continuously tracks input data distributions (PSI), output distributions, performance metrics (where labels exist), infrastructure metrics (latency, throughput), and HITL override rates. Alerts trigger automated retraining or human review.
Audit: Verify alert thresholds are defined, documented, and actioned; review alert-to-response SLA compliance
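A minimal sketch of PSI-based input drift monitoring; the bin count and the 0.2 alert convention (cited elsewhere in this section) are common rules of thumb, and the baseline/production distributions are synthetic:

```python
# Sketch: Population Stability Index (PSI) between a training-time baseline and live data.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI = sum((actual% - expected%) * ln(actual% / expected%)) over shared bins."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    actual = np.clip(actual, edges[0], edges[-1])        # keep live values inside baseline bins
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)                   # avoid division by zero / log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.RandomState(0)
baseline = rng.normal(0, 1, 10_000)         # training-time feature distribution (synthetic)
production = rng.normal(0.5, 1.2, 10_000)   # shifted live distribution (synthetic)
print("PSI:", psi(baseline, production))    # > 0.2 would typically raise a drift alert
```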
Model Retraining Strategies
📅
Scheduled Retraining
Model is retrained on a fixed time schedule (weekly, monthly, quarterly) regardless of detected drift. Simple to govern — predictable, calendar-driven. Does not react to sudden data changes.
Trade-off: Predictable governance burden vs. slow reaction to sudden distributional shifts
🚨
Triggered (Event-Based) Retraining
Retraining is automatically initiated when a monitoring metric crosses a defined threshold (PSI >0.2, accuracy drops below baseline, override rate spikes). More responsive than scheduled but requires well-tuned thresholds to avoid unnecessary retraining.
Trade-off: Responsive to real drift vs. risk of trigger noise causing unnecessary retraining overhead
♾️
Continuous Training (CT)
Model trains continuously on a rolling window of the most recent data (or on new data as it arrives). Extremely responsive but requires robust governance: automated validation gates must be in place or a poorly performing model could auto-deploy.
Trade-off: Maximum freshness vs. highest governance complexity — automated testing gates are non-negotiable
🖐️
Manual Retraining
Data scientists initiate retraining in response to incidents, business changes, or scheduled reviews. Provides maximum human oversight but introduces latency between drift detection and response.
Trade-off: Maximum control and oversight vs. slowest response; suitable for high-stakes models requiring explicit approval at each retraining cycle
AI System Architecture — Operational Components
Component
Function
Key Operational Risk
Data Pipeline
Ingest, transform, and deliver features to training and serving
Pipeline failures causing stale or missing features (training-serving skew)
Model Serving Endpoint
Expose model as API for real-time inference
Latency SLA breaches; single point of failure without redundancy
Feedback Loop
Capture ground truth labels from real-world outcomes to evaluate production performance
Delayed labels (credit decisions take months to default) limit real-time performance tracking
Shadow Traffic Log
Record all predictions and inputs for offline analysis, audit, and retraining
Incomplete or unretained prediction logs leave no audit trail for production decisions
Human Review Queue
Route low-confidence or flagged decisions to human reviewers (HITL)
Queue overflow; reviewer fatigue; inconsistent human decisions introducing new bias
Rollback Mechanism
Revert production to a prior model version on detection of failure
Untested rollback procedures fail when needed most; registry entries must be preserved
Lifecycle Governance
AI systems require governance at every lifecycle phase — not just at deployment. Documentation, versioning, change management, and formal decommission processes are all auditable obligations.
Governance Gate Principle: A well-governed AI lifecycle has formal sign-off gates between phases: Problem Definition → Data Ready → Model Validated → Deployment Approved → Monitoring Live. Each gate requires documented evidence and named approvers. A model that "just appeared in production" with no approval trail is an audit finding.
Model Documentation Artifacts
📄
Model Card
Structured document (introduced by Google) summarizing: intended use, out-of-scope uses, training data, performance metrics across demographic groups, known limitations, ethical considerations, and contact information.
Audit use: Primary artifact for bias assessment and regulatory disclosure; required for high-risk AI under EU AI Act
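A minimal sketch of a model card captured as structured data so it can be versioned and reviewed alongside the model; field names follow the sections listed above and all values are illustrative:

```python
# Sketch: a machine-readable model card stub (all values are hypothetical examples).
model_card = {
    "model_name": "credit_risk_model",
    "version": "2.1.0",
    "intended_use": "Pre-screening of consumer credit applications with human review",
    "out_of_scope_uses": ["Employment decisions", "Insurance pricing"],
    "training_data": "Internal loan applications 2019-2023 (datasheet ref DS-114)",
    "performance_by_group": {"overall_f1": 0.86, "group_a_f1": 0.85, "group_b_f1": 0.87},
    "known_limitations": ["Degrades for applicants with thin credit files"],
    "ethical_considerations": "Proxy-variable review completed; appeals process required",
    "contact": "model-risk-office@example.com",
}
```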
📊
Datasheet for Datasets
Dataset-level documentation (Microsoft) covering: dataset purpose, composition, collection process, preprocessing, consent basis, recommended uses, and known biases. Provides provenance for training data.
Audit use: Supports data lineage auditing; confirms consent and legal basis for data use
🧪
Validation Report
Formal documentation of all pre-deployment testing: accuracy on holdout set, bias test results across demographic groups, robustness testing, explainability analysis, and go/no-go recommendation with sign-off.
Audit use: Evidence that the model was validated before deployment; must be signed by a risk-independent reviewer
📝
Change Management Record
Every deployment, update, or configuration change to a production AI system must have: change request, risk assessment, testing evidence, approvals, rollback plan, and post-deployment verification. Same discipline as IT change management.
Audit use: Tests operating effectiveness of change control; unauthorized changes are audit findings
📈
Model Risk Assessment
Structured risk evaluation covering: model complexity, data sensitivity, decision impact, regulatory scope, technical limitations, and residual risks after controls. Determines governance level required for the model.
Audit use: Basis for scoping the audit; high-risk classifications require more extensive testing
🏁
Decommission Record
Formal retirement documentation: reason for decommission, final performance snapshot, data retention/deletion actions, successor model details, and sign-off from business and risk owners.
Audit use: Confirms retirement was deliberate and authorized; verifies GDPR deletion obligations were met
AI Versioning — What Must Be Versioned
Reproducibility requires that every component of the AI pipeline is versioned independently. A production model's provenance must be fully traceable from model artifact back to code, data, and configuration.
Artifact
What to Version
Tool Examples
Training Code
Model architecture, preprocessing scripts, training loop
Git (commit hash tied to model artifact)
Training Data
Dataset snapshot used for each training run (hash or version tag)
Data versioning tools (e.g., DVC) or dataset hash/version tags
Model Artifact
Serialized model with its metadata, metrics, and lineage
Model registry (e.g., MLflow Model Registry)
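A minimal sketch of tying a model's provenance back to its code and data versions, here via MLflow run tags; the dataset and pipeline version strings are illustrative assumptions:

```python
# Sketch: record the exact code commit and data version with the training run.
import subprocess
import mlflow

commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

with mlflow.start_run():
    mlflow.set_tag("git_commit", commit)                         # training code version
    mlflow.set_tag("dataset_version", "v1.3")                    # dataset snapshot / hash tag (assumed)
    mlflow.set_tag("preprocessing_version", "feature_pipeline==0.4.2")  # assumed pipeline version
    # ... training and mlflow.log_model(...) would follow here
```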
Responsible AI Checkpoints by Lifecycle Phase
Problem Definition
Is AI appropriate here? Could it cause harm? Is the use case legal? Risk tier assigned?
Data Collection
Is consent obtained? Are protected attributes identified? Is data representative of the deployment population?
Feature Engineering
Are proxy variables for protected attributes identified and mitigated? Is data leakage prevented?
Model Training
Are bias mitigation techniques applied (re-weighting, adversarial debiasing)? Is fairness a training objective?
Validation
Are bias metrics tested across all demographic groups? Is explainability tested? Is human review threshold defined?
Deployment
Is an appeals/override process in place? Is user notification required? Is HITL scope defined?
Monitoring
Are fairness metrics monitored alongside accuracy? Are bias drift alerts configured?
Retirement
Is retirement triggered by fairness degradation as well as performance? Are data deletion obligations met?
Practice Quiz — Domain 2: Lifecycle & Operations
10 questions covering AI lifecycle phases, MLOps, deployment patterns, and lifecycle governance. Select an answer for each question, then click Submit.
Question 1 of 10
Which lifecycle phase should formally define success criteria, risk tier classification, and business sign-off BEFORE any data is collected or resources committed?
A Model validation — where performance benchmarks are confirmed
B Feature engineering — where data transformations are designed
C Problem definition and scoping — where the use case is formally assessed and approved
D Deployment — where rollback plans are documented
Problem definition and scoping is the mandatory first phase. It establishes: the business problem, whether AI is the appropriate solution, measurable success criteria, regulatory constraints, and a risk tier classification. Without formal approval at this gate, all subsequent work may be done on an unapproved or inappropriate use case — a governance failure that auditors frequently cite.
Question 2 of 10
In MLOps, "CT" (Continuous Training) refers to:
A Continuous testing of model code quality through automated unit test pipelines
B Automatically retraining models when drift thresholds or performance triggers are met
C Continuous transformation of input data through the feature pipeline
D A neural network training methodology using cyclic learning rates
MLOps extends DevOps with a third continuous practice: Continuous Training (CT). CI tests code changes; CD deploys validated models; CT automatically retrains models when monitoring thresholds are breached (drift, performance degradation) or new labeled data arrives. CT requires automated validation gates — without them, a degraded model could auto-deploy. It's this automated retraining loop that distinguishes MLOps from traditional DevOps.
Question 3 of 10
A model is deployed in "shadow mode." This means the model:
A Is concealed from end users for security and compliance reasons
B Runs in parallel with the production model receiving live inputs, but its outputs are NOT used for real decisions
C Is only accessible during off-peak hours to reduce infrastructure load
D Uses obfuscated model weights to protect intellectual property
Shadow mode (also called shadow deployment or dark launch) runs the new model alongside the production model. Both receive identical live inputs; only the production model's outputs are used for actual decisions. The shadow model's predictions are logged for offline comparison. This is the lowest-risk validation approach — zero impact on users, but provides real-world input distribution. Distinguishing it from A/B testing (where both models serve real decisions) is a common exam question.
Question 4 of 10
A feature store's PRIMARY purpose in an MLOps architecture is to:
A Store serialized model weights and deployment artifacts
B Provide a centralized repository of feature definitions ensuring consistency between training and serving
C Track model performance metrics and drift alerts over time
D Manage model deployment approvals and change records
A feature store solves the training-serving skew problem: it provides the same feature definitions and computed values to both the training pipeline (offline store, historical data) and the inference pipeline (online store, low-latency real-time values). Without a feature store, two separate code paths often compute features slightly differently, causing the model to behave differently in production than in testing. Model weights → model registry; metrics → monitoring platform; approvals → change management system.
Question 5 of 10
Which deployment strategy gradually shifts a small percentage of live traffic to a new model, increasing exposure only as confidence builds?
A Shadow deployment — model runs in parallel but outputs are not used
B A/B testing — traffic is split equally between two models for comparison
C Canary deployment — a small traffic slice goes to the new model with gradual increase
D Blue-green deployment — instantaneous full traffic switch between two environments
Canary deployment (named after the "canary in a coal mine" concept) exposes the new model to a small fraction of traffic (e.g., 5%) while the current model handles the rest. If no issues emerge, the canary slice grows incrementally toward 100%. Full rollback is available at any point. This limits the "blast radius" of a bad deployment. A/B testing splits traffic equally for experimental comparison; blue-green is an instant full-switch with a parallel environment on standby.
Question 6 of 10
Model cards are PRIMARILY used to:
A Store model weights and deployment metadata in the model registry
B Track training experiment hyperparameters and run metrics
C Document a model's intended use, performance characteristics, limitations, and ethical considerations
D Define feature pipeline versioning and data transformation rules
Model cards (introduced by Google) are structured disclosure documents covering: the model's intended use cases and out-of-scope uses, performance metrics across different demographic groups, known limitations, ethical considerations, and the training data used. They are the primary artifact for bias transparency and regulatory disclosure. Under the EU AI Act, high-risk AI systems are required to provide model card-like documentation. Hyperparameters → experiment tracker; weights → model registry; feature rules → feature store.
Question 7 of 10
"Data leakage" in model development refers to:
A Training data being exposed to unauthorized users through a security breach
B Future or test-set information contaminating the training process, artificially inflating model performance
C Loss of training data due to pipeline infrastructure failures
D A regulatory breach of data handling and retention procedures
Data leakage is a model development defect where information that would not legitimately be available at prediction time (future data, or the test set used for "validation") is accidentally included in training. The model learns to use this future information — producing unrealistically high performance during testing that collapses completely in production. Common leakage sources: fitting scalers on the full dataset before splitting; including post-event features; using future-date data for past predictions. This is a technical risk, not a security or regulatory breach.
Question 8 of 10
In MLOps, which component stores ALL versions of deployed models along with their metadata, training lineage, and lifecycle stage?
A Feature store — manages feature definitions for training and serving
B Data catalog — documents dataset metadata and lineage
C Experiment tracker — logs training runs and hyperparameters
D Model registry — central store for model artifacts with lifecycle governance
The model registry is the single source of truth for all model versions in an organization. It stores: serialized model artifacts, metadata (training data version, hyperparameters, performance metrics), lineage (which code + data produced this model), and lifecycle stage (Staging → Production → Archived). It acts as the deployment gate — only models in "Production" state in the registry can serve live traffic. It's essential for governance: without a registry, there is no audit trail for what model is in production or how it got there.
Question 9 of 10
Which retraining strategy initiates a new training run automatically when a model's performance metric drops below a defined threshold?
A Scheduled retraining — model retrains on a fixed calendar-based schedule
B Triggered (event-based) retraining — retraining fires when a metric threshold is crossed
C Continuous training — model trains on a rolling window of new data without stopping
D Manual retraining — data scientists initiate retraining based on their own review
Triggered (event-based) retraining is initiated automatically when a specific condition is met — a PSI score exceeds 0.2, model accuracy drops below a defined floor, or a bias metric breaches a threshold. It's more responsive than scheduled retraining (which may miss sudden shifts) but less complex than continuous training (which trains on every new data point). The trade-off: well-defined thresholds avoid unnecessary retraining while ensuring timely response to real degradation.
Question 10 of 10
The PRIMARY goal of champion-challenger testing in AI model deployment is to:
A Test a model's resilience against adversarial attacks and data poisoning
B Validate model code quality through peer review before merging to main branch
C Compare a new candidate model's performance against the current production model before full deployment
D Ensure model documentation meets compliance requirements before deployment approval
Champion-challenger testing is a model validation strategy where the new candidate model (challenger) is evaluated against the existing production model (champion) on the same held-out test data. The challenger must meet or exceed the champion's performance metrics before being promoted to production. This prevents deploying a "different but not better" model, and provides a documented rationale for every model update. It directly addresses the governance question: "Why was this model promoted?"
Memory Hooks
High-yield mnemonics and patterns to lock in AI Lifecycle & Operations for the AAIA.
🔄
8 Lifecycle Phases — In Order
Problem → Data → Features → Train → Validate → Deploy → Monitor → Retire. The lifecycle is a loop — Monitoring feeds back into Retrain (via CT triggers) and eventually Retire. Governance gates sit between each phase.
Mnemonic: "Pretty Data Flows Through Very Dense Machine Rooms" — Problem, Data, Features, Train, Validate, Deploy, Monitor, Retire
⚙️
MLOps CI/CD/CT
CI = test the code (unit tests, integration tests on code commits). CD = deploy the validated model (automated delivery pipeline). CT = retrain the model (automated when drift or performance triggers fire). CT is the MLOps addition — DevOps only has CI/CD. CT without automated validation gates is dangerous.
Mnemonic: "Code → Deploy → Train — gets you to production and keeps you there." CT = the loop that keeps models fresh.
🕵️
Shadow vs. Canary vs. A/B
Shadow: new model runs, zero decisions made → safest, no user impact. Canary: small % of real traffic to new model → graduated risk. A/B: full split traffic for experiment → both models make real decisions. Risk order: Shadow < Canary < A/B. Champion-Challenger is pre-deployment testing — not live.
Mnemonic: "Shadows don't decide, Canaries take a little, A/B takes half" — increasing risk and exposure left to right.
🏗️
Feature Store — Offline vs. Online
Offline store: historical feature values for model training (batch, high latency OK). Online store: real-time feature retrieval for inference (millisecond latency). Same feature definitions in both = training-serving consistency. Without this, skew between training and serving corrupts model behavior silently.
Mnemonic: "Offline feeds Training, Online feeds Serving — same definitions, different stores" = no skew.
📋
Model Registry Lifecycle Stages
Model registry stages: Staging (validated, awaiting approval) → Production (approved, live) → Archived (superseded, retained for audit). Only Production-stage models should serve live traffic. Every transition needs a documented approval. No model should reach production without a registry entry — that's an audit finding.
Mnemonic: "Stage it, Approve it, Archive it" — the model's career path from development to retirement.
💧
Data Leakage — Root Causes
Leakage occurs when the model sees future information during training: 1. Test set contamination (used for tuning). 2. Scaler fitted on full data before split. 3. Post-event features included (e.g., "claim paid" as a feature to predict "will claim"). 4. Temporal leakage in time-series data. Model appears great in testing, collapses in production.
Mnemonic: "Future data in training = future failure in production." Always split BEFORE fitting any transformers.
High-Yield AAIA Facts — AI Lifecycle & Operations
Fact
Answer
The third "C" that MLOps adds to DevOps's CI/CD
CT — Continuous Training (automated model retraining)
Deployment strategy with zero risk to live users
Shadow mode — new model runs in parallel, outputs not used for decisions
What a feature store prevents
Training-serving skew — ensures training and inference use identical feature definitions
Model card creator and its key sections
Google (2019); sections: intended use, performance by group, limitations, ethical considerations
Data leakage definition
Future or test-set information contaminating training, artificially inflating performance
Champion-challenger testing purpose
Validate that the new model outperforms the current production model before promotion
Model registry lifecycle stages
Staging → Production → Archived
Silent failure mode unique to AI vs. traditional software
Model drift — gradual performance degradation with no code changes or visible errors
Retraining strategy that fires when PSI > 0.2
Triggered (event-based) retraining
Deployment pattern requiring no network connectivity
Edge deployment — model runs on-device
Flashcards & Study Advisor
Click any card to flip it. Use the Study Advisor for targeted guidance by topic area.
Concept
What is Continuous Training (CT) in MLOps and what makes it different from CI/CD?
Tap to reveal →
Answer
CT (Continuous Training) automatically retrains models when drift thresholds or performance triggers are met. CI tests code changes; CD deploys validated models. CT = the MLOps-specific addition — it closes the loop between monitoring and retraining, keeping models current without manual intervention.
Distinction
What is the difference between shadow mode, canary, and A/B deployment?
Tap to reveal →
Answer
Shadow: new model receives live inputs but makes no decisions (zero risk). Canary: small % of real traffic goes to new model with gradual increase. A/B: traffic split equally — both models make real decisions for comparison. Risk: Shadow < Canary < A/B.
Infrastructure
What problem does a feature store solve, and what are its two components?
Tap to reveal →
Answer
Solves training-serving skew — ensuring model training and live inference use identical feature definitions. Two components: Offline store (historical features for training, batch) and Online store (real-time low-latency feature retrieval for inference).
Risk
What is data leakage in model development and why is it dangerous?
Tap to reveal →
Answer
Data leakage: future or test-set information contaminating training, artificially inflating performance. Dangerous because the model appears excellent in testing but fails in production — a silent defect invisible until live deployment. Common cause: fitting scalers on full data before train/test split.
Governance
What is a model card and when is it required?
Tap to reveal →
Answer
Model card (Google): structured document covering intended use, out-of-scope uses, performance across demographic groups, limitations, and ethical considerations. Required by EU AI Act for high-risk AI; best practice for all AI systems. Primary artifact for bias transparency and regulatory disclosure.
Infrastructure
What is a model registry and what are its lifecycle stages?
Tap to reveal →
Answer
Model registry: centralized store for all model versions with metadata, training lineage, and lifecycle stage. Three stages: Staging (validated, awaiting approval) → Production (approved, live) → Archived (superseded). Only Production models should serve live traffic.
Validation
What is champion-challenger testing and why is it a governance best practice?
Tap to reveal →
Answer
Champion-challenger: new candidate model (challenger) is validated against the current production model (champion) on the same test data. The challenger must meet or exceed champion performance to be promoted. Governance value: provides documented rationale for every model update decision.
MLOps
How does MLOps differ from DevOps in what must be versioned?
Tap to reveal →
Answer
DevOps versions: code only. MLOps versions: code + data + model artifacts (all three). Reproducibility requires that the same code version + data version + hyperparameters reproduce the same model. Missing any one makes the model's provenance unverifiable — an audit finding.
Exam Tips
Lifecycle phase questions: When asked about sequence, use "Problem → Data → Features → Train → Validate → Deploy → Monitor → Retire." Questions about "what comes first" almost always answer with Problem Definition or risk classification before any data work.
Deployment strategy distinctions: Shadow = no decisions; Canary = graduated real traffic; A/B = split real traffic. If a question asks about "lowest risk validation on live traffic" → Shadow. "Gradual rollout" → Canary. "Comparison experiment" → A/B.
Feature store vs. model registry: Feature store = features for training/serving consistency. Model registry = model artifacts + lifecycle. Never mix these — each question is testing whether you know which stores what.
CT trigger vs. scheduled: "Automatically when threshold crossed" → Triggered CT. "Every month regardless" → Scheduled. Both are valid; exam tests whether you know the tradeoff (responsiveness vs. governance simplicity).
Model card = bias disclosure: Any question about documenting bias, performance by demographic, ethical considerations, or intended use → Model Card. Not a datasheet (that's for datasets).
Common Mistakes to Avoid
Shadow ≠ A/B testing: Shadow mode makes NO real decisions. A/B testing makes real decisions for both model variants. A common distractor — know the difference cold.
Data leakage ≠ data breach: Leakage is a technical model development defect (future data in training), not a security incident. Exam questions test this distinction specifically.
CT without validation gates is dangerous: Continuous training doesn't mean blind auto-deployment. Every CT cycle must pass automated validation gates before the new model reaches production.
Model cards ≠ experiment tracking: Model cards are disclosure documents for intended use and bias. Experiment tracking logs hyperparameters and training metrics. Different artifacts, different purposes.
Champion-challenger is pre-deployment: It happens before the challenger goes live — not a live traffic split. Confusing it with A/B testing is a common error.
Quick Review — Key Facts
8 lifecycle phases: Problem → Data → Features → Train → Validate → Deploy → Monitor → Retire
MLOps = CI + CD + CT: CI tests code; CD deploys; CT retrains automatically
Feature store: Offline (training) + Online (serving) = same definitions = no skew
Model registry stages: Staging → Production → Archived
Data leakage: Future/test data in training → inflated metrics → production failure
Model card: Intended use, demographic performance, limitations, ethical notes
Champion-challenger: New model must beat current production model on same test set
Deep Dive — Advanced Concepts
Training-serving skew: The feature store solves the inconsistency between training and serving code paths. But skew can also occur from: version mismatches in preprocessing libraries, stale online store values, and different handling of null values between paths — all are auditable operational risks.
Temporal leakage in time-series: For time-series models (fraud, demand forecasting), train/test splits must be temporal — test set must always be chronologically after the training set. Random splitting on time-series data always introduces leakage.
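A minimal sketch of a leakage-safe temporal split; the dates, cutoff, and column names are illustrative:

```python
# Sketch: for time-series data, the test window must sit strictly after the training window.
import pandas as pd

df = pd.DataFrame({
    "event_time": pd.date_range("2023-01-01", periods=365, freq="D"),  # illustrative timeline
    "amount": range(365),
})

cutoff = pd.Timestamp("2023-10-01")
train = df[df["event_time"] < cutoff]    # everything before the cutoff
test = df[df["event_time"] >= cutoff]    # strictly after — never shuffle time-series rows

# Anti-pattern: a random shuffled split leaks future rows into training (temporal leakage).
```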
Feedback loop delay: Many AI systems suffer from "label delay" — outcomes only become known long after predictions (e.g., credit default takes months). This limits real-time accuracy monitoring; proxy metrics (override rate, confidence distribution) must substitute until ground truth arrives.
Transfer learning risk: Foundation models (LLMs, CLIP) are pre-trained on large datasets. Fine-tuning inherits the original training data's biases and provenance risks — even if the fine-tuning data is clean. Auditors must assess the full training lineage including the base model.
Model retirement triggers: Retirement should be triggered not only by performance degradation but also by: regulatory changes making the model non-compliant, discovery of systematic bias, business use case retirement, or replacement by a superior model. Each requires the same governance rigor as initial deployment.
Practice Tips
Draw the lifecycle: Sketch the 8-phase loop and label the governance artifact at each gate. Visualizing it builds exam intuition for "which comes first" questions.
Practice deployment scenarios: For any scenario question, identify: (1) is this pre-deployment or post-deployment? (2) does the model make real decisions? → narrow to the right strategy immediately.
Learn what each MLOps component stores: Feature store = features; model registry = model artifacts; experiment tracker = training run metadata; data catalog = dataset metadata. These are frequently tested as distractors.
Apply retraining logic: "Automatically when threshold breached" = Triggered. "On a schedule" = Scheduled. "Continuously on new data" = Continuous Training. "Data scientist decides" = Manual.
Connect lifecycle to Domain 3 auditing: Every phase has an audit artifact. Practice naming the artifact for each phase — this directly bridges Domain 2 operations to Domain 3 audit evidence collection.