The AI lifecycle is a continuous loop — not a one-time project. Operationalizing AI well requires disciplined engineering practices (MLOps) applied at every phase from data to decommission.
Auditor's Focus: At each lifecycle phase ask: Is there an accountable owner? Is there documented evidence this step was completed? Are risks addressed? The AI lifecycle is only governable if each phase produces artifacts — data cards, model cards, test reports, deployment approvals, monitoring dashboards — that auditors can inspect.
AI/ML Lifecycle — 8 Phases
1
Problem Definition & Scoping
Deliverable: Problem statement, success criteria, constraints, use-case risk classification
Define the business problem and whether AI is the right solution. Establish measurable success criteria (accuracy targets, latency SLAs), regulatory constraints, and a preliminary risk tier classification. Gate: business and risk sign-off before resources are committed.
Audit check: Is the AI use case formally approved? Is a risk tier assigned? Is the intended use documented?
2
Data Collection & Ingestion
Deliverable: Data catalog entry, consent/provenance documentation, ingestion pipeline
Identify, acquire, and ingest data from internal systems, third parties, or public sources. Document data provenance, consent basis (GDPR), and legal authority to use the data. Establish data lineage tracking from this phase forward.
Audit check: Is consent documented? Is data provenance traceable? Are third-party data agreements in place?
3
Data Preparation & Feature Engineering
Deliverable: Versioned transformation scripts, feature documentation, prepared training dataset
Clean, normalize, and transform raw data. Engineer features (derived variables) for model input. Critical risks: introducing data leakage (future information contaminating training) and creating proxy variables for protected attributes during feature engineering.
Audit check: Are transformations versioned and documented? Is data leakage testing performed? Are proxy variables identified?
4
Model Development & Training
Deliverable: Trained model artifact, experiment tracking logs, hyperparameter records
Select algorithm, define architecture, run training experiments. Track hyperparameters, metrics, and artifacts in an experiment tracker (MLflow, W&B). Versioning of code, data, and model outputs enables reproducibility — a core MLOps requirement.
Audit check: Are experiments logged? Are model artifacts versioned? Is training reproducible from versioned inputs?
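A minimal sketch of what phase-4 experiment tracking can look like with MLflow (one of the trackers named above); the dataset, model choice, hyperparameter values, and the dataset_version tag are illustrative assumptions, not a prescribed setup:

```python
# Sketch: log one training run so it is reproducible and auditable.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

params = {"n_estimators": 200, "max_depth": 5, "random_state": 42}  # illustrative

with mlflow.start_run(run_name="candidate-rf-v1"):
    mlflow.log_params(params)                     # hyperparameters
    mlflow.set_tag("dataset_version", "v1.3")     # tie the run to a data version (assumed tag)
    model = RandomForestClassifier(**params).fit(X_train, y_train)
    mlflow.log_metric("f1", f1_score(y_test, model.predict(X_test)))  # evaluation metric
    mlflow.sklearn.log_model(model, "model")      # versioned model artifact
```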
5
Model Validation & Testing
Deliverable: Validation report, bias test results, go/no-go sign-off
Evaluate the model on holdout test data. Run accuracy testing, bias testing (demographic parity), robustness testing, and explainability analysis, plus a challenger comparison against the current production model if one exists. Formal sign-off is required before deployment.
Audit check: Is the test set truly unseen? Are bias tests run across demographic groups? Is validation sign-off documented?
6
Deployment
Deliverable: Change approval record, model registry entry, rollback plan
Move the validated model to production via the approved change management process. Select a deployment pattern (batch, real-time, canary, A/B). Register the model in the model registry with version, metadata, and lineage. Maintain rollback capability to the prior production version.
Audit check: Is there a change approval record? Is a rollback procedure tested and documented? Is the model version in the registry?
7
Monitoring & Maintenance
Deliverable: Monitoring dashboard, drift alerts, performance reports, retraining records
Continuously monitor model performance (accuracy, PSI drift, prediction drift) and input data quality. Detect degradation early and trigger retraining or rollback when thresholds are breached. Document all maintenance actions with change records.
Audit check: Are monitoring thresholds defined and documented? Are drift alerts actioned within SLA? Are retraining events approved?
8
Retirement & Decommission
Deliverable: Decommission plan, data retention/deletion record, successor model handoff
Formally retire models that are superseded, no longer fit-for-purpose, or pose unacceptable risk. Document the retirement decision, preserve audit trails and model artifacts per retention policy, and ensure data deletion obligations (GDPR right to erasure) are met.
Audit check: Is model retirement formally approved? Are artifacts retained per policy? Is data deletion documented?
MLOps — Operationalizing AI
🗄️
MLOps Component
Data Pipeline Automation
Automated, reproducible pipelines for data ingestion, transformation, and feature computation. Triggers reprocessing when source data changes. Enables consistent data delivery to training and serving.
🔄
MLOps Component
CI/CD/CT Pipeline
CI = Continuous Integration (test code changes). CD = Continuous Delivery (deploy validated models). CT = Continuous Training (retrain models automatically when drift or performance triggers are met).
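A minimal sketch of the CT idea under stated assumptions — the monitoring metrics are assumed to be computed elsewhere, and the threshold values are illustrative, not prescribed:

```python
# Sketch: a Continuous Training trigger fires when drift or performance breaches a threshold.
def should_retrain(psi: float, f1: float, psi_threshold: float = 0.2, f1_floor: float = 0.80) -> bool:
    """Fire CT when input drift or performance degradation crosses a defined threshold."""
    return psi > psi_threshold or f1 < f1_floor

if should_retrain(psi=0.27, f1=0.84):
    # The retrained model must still pass automated validation gates before deployment.
    print("Trigger retraining pipeline")
```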
🧪
MLOps Component
Experiment Tracking
Tools (MLflow, W&B, Neptune) log every training run: hyperparameters, metrics, dataset versions, and model artifacts. Enables reproducibility, comparison across experiments, and audit trail for model development decisions.
📦
MLOps Component
Model Registry
Central repository storing all model versions with metadata, lineage, and lifecycle status (Staging → Production → Archived). Provides single source of truth for model governance and deployment approvals.
📡
MLOps Component
Model Monitoring
Ongoing measurement of data drift (PSI), prediction drift, performance metrics (Precision/Recall/F1), and infrastructure health. Triggers automated alerts and retraining when thresholds are breached.
📋
MLOps Component
Feature Store
Centralized repository for computed features. Offline store: historical features for model training. Online store: low-latency feature retrieval for real-time inference. Ensures training-serving consistency — preventing training-serving skew.
DevOps vs. MLOps — Key Differences
⚙️ DevOps
🤖 MLOps
Versioned artifacts
Application code
Code + Data + Model (all three must be versioned)
Testing focus
Unit tests, integration tests, functional correctness
+ Model accuracy, bias tests, data quality, drift tests
Pipeline triggers
Code commits, PRs
+ Data changes, performance degradation, drift alerts (CT)
Reproducibility
Same code + environment = same output
Same code + data + hyperparameters = same model
Failure mode
Build fails, tests fail — visible immediately
Silent degradation — model drifts slowly, no code changes
Monitoring concern
Uptime, error rates, latency
+ Data drift, model accuracy, fairness, prediction distributions
Model Development & Validation
Sound model development requires disciplined data preparation, reproducible training, and rigorous validation before any model touches production.
Key Audit Risk — Data Leakage: Data leakage occurs when information from the test set (or future data) contaminates the training process, artificially inflating performance metrics. A leaky model appears excellent in testing but fails catastrophically in production. Temporal separation (never using future data to predict past) is the primary safeguard.
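One common leakage source — fitting a scaler on the full dataset before splitting, discussed in the steps below — can be shown in a short leakage-safe sketch; the synthetic data is purely illustrative:

```python
# Sketch: split first, then fit preprocessing on the training data ONLY.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 4))          # illustrative features
y = (X[:, 0] > 0).astype(int)           # illustrative labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)   # fit on training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test) # reuse training statistics — no leakage

# Anti-pattern (leakage): StandardScaler().fit(X) on the full dataset before splitting.
```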
Data Preparation — Key Steps & Risks
Step 1
Data Profiling
Assess data quality: completeness, distributions, outliers, missing values, and class imbalance. Establishes a baseline before any transformations.
Risk: Undocumented profiling = unknown data quality entering the pipeline
Step 2
Data Cleaning
Handle missing values (imputation, deletion), remove duplicates, correct format errors, clip outliers. All decisions must be documented in transformation scripts.
Risk: Inappropriate imputation strategies introduce systematic bias into training data
Step 3
Feature Engineering
Create derived features (aggregations, interactions, embeddings). All feature transformations must be versioned and applied consistently to training and inference data.
Risk: Proxy variables — features that correlate with protected attributes (e.g., zip code → race)
Step 4
Train/Val/Test Split
Divide data into training (model learning), validation (hyperparameter tuning), and holdout test (final unbiased evaluation). Test set must remain unseen until final evaluation.
Risk: Data leakage if test data is seen during training; temporal leakage for time-series data
Step 5
Normalization & Encoding
Scale numeric features (min-max, z-score) and encode categoricals (one-hot, ordinal, embeddings). Scalers fitted on training data only — then applied to validation/test.
Risk: Fitting scalers on full dataset before splitting causes data leakage
Step 6
Class Imbalance Handling
Address skewed class distributions: oversampling (SMOTE), undersampling, class weights, or threshold tuning. Imbalanced data causes models to predict majority class and miss rare events.
Risk: Accuracy is misleading on imbalanced data — use Precision/Recall/F1 instead
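A small sketch of why accuracy misleads on imbalanced data, using a hypothetical 1%-positive label distribution and a naive majority-class predictor:

```python
# Sketch: a model that always predicts the majority class scores 99% accuracy but misses every rare event.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([0] * 990 + [1] * 10)   # 1% positive (rare event) class — illustrative
y_pred = np.zeros(1000, dtype=int)        # naive "always majority class" predictions

print("accuracy :", accuracy_score(y_true, y_pred))                    # 0.99 — looks great
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("recall   :", recall_score(y_true, y_pred))                      # 0.0 — misses every positive
print("f1       :", f1_score(y_true, y_pred))                          # 0.0
```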
Model Training — Key Concepts
Concept
Definition
Audit/Risk Relevance
Overfitting
Model memorizes training data; performs well on training, poorly on unseen data
Model appears performant in testing but fails in production — a silent risk
Underfitting
Model is too simple to capture patterns; poor performance on both training and test
Detectable — but model may still be deployed if business pressure overrides technical rigor
Hyperparameter Tuning
Optimizing model configuration parameters (learning rate, depth, regularization) that are set before training
Must use validation set only — using test set for tuning = data leakage
Reproducibility
Ability to re-run training with the same inputs (code, data, hyperparameters) and obtain the same model
Core MLOps requirement — without it, model lineage and audit trails are unverifiable
Experiment Tracking
Systematic logging of every training run: parameters, metrics, artifacts, dataset version
Provides audit trail for model development decisions; required for regulatory compliance
Transfer Learning
Adapting a pre-trained model (e.g., foundation model) to a new task using fine-tuning
Inherits risks from original training data — provenance and bias from source model must be assessed
Model Validation Approaches
Pre-Deployment
Holdout Testing
Model evaluated on a fixed, unseen test set that was set aside before training. The gold standard for unbiased performance estimation.
Audit requirement: Confirm the holdout set was not touched until final evaluation
Pre-Deployment
Cross-Validation
Data is split into k folds; model trains on k-1 folds and validates on the remaining fold, rotating through all combinations. More robust for small datasets.
Used during development — not for final holdout evaluation; each fold must maintain temporal integrity for time-series
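A minimal sketch of k-fold cross-validation during development, with the time-series variant noted; the dataset and model are illustrative:

```python
# Sketch: k-fold cross-validation for development-time robustness estimation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

cv = KFold(n_splits=5, shuffle=True, random_state=42)        # standard k-fold
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print("Mean F1 across folds:", scores.mean())

ts_cv = TimeSeriesSplit(n_splits=5)  # chronological folds preserve temporal integrity for time-series
```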
Pre-Deployment
Champion-Challenger
New candidate model (challenger) is validated against the current production model (champion) on the same test set. The challenger must meet or exceed the champion's metrics before being promoted.
Best practice for model updates — ensures new model is demonstrably better, not just different
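A minimal sketch of a champion-challenger promotion gate under stated assumptions — the metric values and the zero-uplift default are illustrative, not a required policy:

```python
# Sketch: promote the challenger only if it meets or exceeds the champion on the SAME holdout set.
def promote_challenger(champion_f1: float, challenger_f1: float, min_uplift: float = 0.0) -> bool:
    """Return True only when the challenger is demonstrably at least as good."""
    return challenger_f1 >= champion_f1 + min_uplift

champion_f1, challenger_f1 = 0.87, 0.89   # illustrative metrics, both from the same test set
if promote_challenger(champion_f1, challenger_f1):
    print("Challenger cleared the gate — record the rationale and route for approval")
else:
    print("Keep the champion in production")
```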
Live Testing
Shadow Mode Testing
New model runs in parallel with the production model. Receives live inputs and generates predictions, but its outputs are NOT used for real decisions. Results are logged for comparison.
Low risk — zero production impact. Ideal for validating model behavior on real traffic before exposure
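A minimal sketch of shadow-mode serving; the model objects and logger name are hypothetical placeholders, and only the champion's output is ever returned to the caller:

```python
# Sketch: the shadow model sees the same live input but its prediction is only logged.
import logging

shadow_log = logging.getLogger("shadow_comparison")  # hypothetical logger name

def predict(request_features, champion_model, shadow_model):
    decision = champion_model.predict([request_features])[0]    # used for the real decision
    shadow_pred = shadow_model.predict([request_features])[0]   # logged, never acted on
    shadow_log.info({"features": request_features,
                     "champion": int(decision),
                     "shadow": int(shadow_pred)})
    return decision
```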
Live Testing
A/B Testing
Live traffic is split between model A (current) and model B (new). Both serve real users. Statistical comparison of outcomes determines the winner after sufficient sample accumulates.
Both model variants make real decisions, so exposure is higher than shadow or canary testing
Live Testing
Canary Deployment
New model receives a small percentage of live traffic (e.g., 5%) while the current model handles the remainder. Traffic shifts gradually as confidence builds. Full rollback is available at any point.
Balances risk and learning: early exposure catches real-world failures while limiting blast radius
Deployment & Operations
How a model is deployed determines its performance profile, governance requirements, and failure modes. Operational decisions made at deployment shape the entire monitoring strategy.
Training-Serving Skew: One of the most common and damaging operational failures. Occurs when the features computed at inference time differ from those computed during training — different code paths, different data sources, or stale feature values. A feature store solves this by providing a single consistent feature computation layer for both training and serving.
AI Deployment Patterns
📦
Pattern
Batch Inference
Latency: Minutes to hours · Throughput: Very high
Model runs on large datasets on a schedule (nightly, weekly). Outputs stored for downstream consumption. Does not require low latency.
Use cases: Credit score updates, recommendation pre-computation, fraud batch screening
⚡
Pattern
Real-Time (Online) Inference
Latency: Milliseconds · Throughput: Moderate
Model serves individual predictions synchronously in response to API calls. Requires low latency infrastructure (model serving endpoints, feature caching).
Use cases: Real-time fraud detection, chatbot responses, real-time personalization
📱
Pattern
Edge Deployment
Latency: Sub-millisecond · Network: None required
Model runs on-device (IoT sensors, smartphones, edge servers). Requires model compression (quantization, pruning). No round-trip to cloud — critical for offline and privacy-sensitive use cases.
Use cases: Autonomous vehicles, medical devices, industrial IoT, on-device NLP
🌊
Pattern
Streaming Inference
Latency: Low · Throughput: Continuous
Model processes continuous event streams (Kafka, Kinesis) in near-real time. Each event triggers a prediction as it arrives. Combines low latency with high volume processing.
Use cases: Transaction fraud streams, clickstream personalization, network anomaly detection
MLOps Infrastructure Components
Model Registry
Centralized store for all model versions with metadata (training data version, hyperparameters, metrics, lineage) and lifecycle stages: Staging → Production → Archived. Acts as the deployment gate — only registry-approved models can go to production.
Audit: Verify every production model has a registry entry with approval records and lineage
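A minimal sketch of registry-gated promotion using the MLflow Model Registry as one concrete example; the model name, local SQLite tracking URI, and stage choice are illustrative assumptions:

```python
# Sketch: log, register, and stage a model version through the MLflow Model Registry.
import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# The registry needs a database-backed tracking store (illustrative local SQLite URI).
mlflow.set_tracking_uri("sqlite:///mlflow.db")

X, y = load_iris(return_X_y=True)
with mlflow.start_run():
    model = LogisticRegression(max_iter=1000).fit(X, y)
    # Log the artifact and register it in one step (model name is illustrative).
    mlflow.sklearn.log_model(model, "model", registered_model_name="demo_risk_model")

client = MlflowClient()
version = client.get_latest_versions("demo_risk_model", stages=["None"])[0]
# Stage transitions should only follow documented approval.
client.transition_model_version_stage(
    name="demo_risk_model", version=version.version, stage="Staging"
)
```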
Feature Store
Offline store: historical features for model training (batch processing). Online store: real-time low-latency feature retrieval for inference. Solves training-serving skew by using the same feature definitions in both environments.
Audit: Confirm training and serving use identical feature definitions; check feature freshness SLAs
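A conceptual sketch (not a real feature-store API) of the offline/online split: one shared feature definition is materialized into both stores so training and serving cannot diverge. Names and the in-memory dictionaries are illustrative:

```python
# Sketch: a single feature definition feeds both the offline (training) and online (serving) stores.
from datetime import datetime

def txn_count_7d(transactions: list[datetime], as_of: datetime) -> int:
    """Shared feature definition used by BOTH the training and serving paths."""
    return sum((as_of - t).days < 7 for t in transactions)

offline_store = {}   # entity_id -> historical feature rows for batch training
online_store = {}    # entity_id -> latest value for low-latency inference

def materialize(entity_id: str, transactions: list[datetime], as_of: datetime) -> None:
    value = txn_count_7d(transactions, as_of)                       # one code path, no skew
    offline_store.setdefault(entity_id, []).append((as_of, value))  # offline: training history
    online_store[entity_id] = value                                 # online: real-time serving
```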
Experiment Tracker
Logs every training run: hyperparameters, code version, dataset version, metrics at each epoch, and final model artifact. Tools: MLflow, Weights & Biases, Neptune. Enables full reproducibility and comparison across experiments.
Audit: Confirm experiments are logged for all production model candidates; verify reproducibility from logged artifacts
Model Serving Infrastructure
Platforms (TensorFlow Serving, TorchServe, Triton, KServe) that host models as scalable API endpoints. Handle batching, versioning, autoscaling, and A/B traffic routing. Must support rollback to previous model version.
Audit: Review rollback capability, SLA monitoring, and change approval integration with serving infrastructure
Data & Model Lineage
End-to-end tracking linking: source data → preprocessing → features → training run → model artifact → deployed model → predictions. Enables root-cause analysis of model failures and supports regulatory "right to explanation" requirements.
Audit: Trace any production model back to its training data; verify lineage is automated and tamper-evident
Monitoring & Alerting Platform
Continuously tracks input data distributions (PSI), output distributions, performance metrics (where labels exist), infrastructure metrics (latency, throughput), and HITL override rates. Alerts trigger automated retraining or human review.
Audit: Verify alert thresholds are defined, documented, and actioned; review alert-to-response SLA compliance
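A minimal sketch of PSI-based input drift monitoring; the bin count and the 0.2 alert convention (cited elsewhere in this section) are common rules of thumb, and the baseline/production distributions are synthetic:

```python
# Sketch: Population Stability Index (PSI) between a training-time baseline and live data.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI = sum((actual% - expected%) * ln(actual% / expected%)) over shared bins."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    actual = np.clip(actual, edges[0], edges[-1])        # keep live values inside baseline bins
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)                   # avoid division by zero / log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.RandomState(0)
baseline = rng.normal(0, 1, 10_000)         # training-time feature distribution (synthetic)
production = rng.normal(0.5, 1.2, 10_000)   # shifted live distribution (synthetic)
print("PSI:", psi(baseline, production))    # > 0.2 would typically raise a drift alert
```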
Model Retraining Strategies
📅
Scheduled Retraining
Model is retrained on a fixed time schedule (weekly, monthly, quarterly) regardless of detected drift. Simple to govern — predictable, calendar-driven. Does not react to sudden data changes.
Trade-off: Predictable governance burden vs. slow reaction to sudden distributional shifts
🚨
Triggered (Event-Based) Retraining
Retraining is automatically initiated when a monitoring metric crosses a defined threshold (PSI >0.2, accuracy drops below baseline, override rate spikes). More responsive than scheduled but requires well-tuned thresholds to avoid unnecessary retraining.
Trade-off: Responsive to real drift vs. risk of trigger noise causing unnecessary retraining overhead
♾️
Continuous Training (CT)
Model trains continuously on a rolling window of the most recent data (or on new data as it arrives). Extremely responsive but requires robust governance: automated validation gates must be in place or a poorly performing model could auto-deploy.
Trade-off: Maximum freshness vs. highest governance complexity — automated testing gates are non-negotiable
🖐️
Manual Retraining
Data scientists initiate retraining in response to incidents, business changes, or scheduled reviews. Provides maximum human oversight but introduces latency between drift detection and response.
Trade-off: Maximum control and oversight vs. slowest response; suitable for high-stakes models requiring explicit approval at each retraining cycle
AI System Architecture — Operational Components
Component
Function
Key Operational Risk
Data Pipeline
Ingest, transform, and deliver features to training and serving
Pipeline failures causing stale or missing features (training-serving skew)
Model Serving Endpoint
Expose model as API for real-time inference
Latency SLA breaches; single point of failure without redundancy
Feedback Loop
Capture ground truth labels from real-world outcomes to evaluate production performance
Delayed labels (credit decisions take months to default) limit real-time performance tracking
Shadow Traffic Log
Record all predictions and inputs for offline analysis, audit, and retraining
Incomplete or unretained prediction logs leave no audit trail for production decisions
Human Review Queue
Route low-confidence or flagged decisions to human reviewers (HITL)
Queue overflow; reviewer fatigue; inconsistent human decisions introducing new bias
Rollback Mechanism
Revert production to a prior model version on detection of failure
Untested rollback procedures fail when needed most; registry entries must be preserved
Lifecycle Governance
AI systems require governance at every lifecycle phase — not just at deployment. Documentation, versioning, change management, and formal decommission processes are all auditable obligations.
Governance Gate Principle: A well-governed AI lifecycle has formal sign-off gates between phases: Problem Definition → Data Ready → Model Validated → Deployment Approved → Monitoring Live. Each gate requires documented evidence and named approvers. A model that "just appeared in production" with no approval trail is an audit finding.
Model Documentation Artifacts
📄
Model Card
Structured document (introduced by Google) summarizing: intended use, out-of-scope uses, training data, performance metrics across demographic groups, known limitations, ethical considerations, and contact information.
Audit use: Primary artifact for bias assessment and regulatory disclosure; required for high-risk AI under EU AI Act
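A minimal sketch of a model card captured as structured data so it can be versioned and reviewed alongside the model; field names follow the sections listed above and all values are illustrative:

```python
# Sketch: a machine-readable model card stub (all values are hypothetical examples).
model_card = {
    "model_name": "credit_risk_model",
    "version": "2.1.0",
    "intended_use": "Pre-screening of consumer credit applications with human review",
    "out_of_scope_uses": ["Employment decisions", "Insurance pricing"],
    "training_data": "Internal loan applications 2019-2023 (datasheet ref DS-114)",
    "performance_by_group": {"overall_f1": 0.86, "group_a_f1": 0.85, "group_b_f1": 0.87},
    "known_limitations": ["Degrades for applicants with thin credit files"],
    "ethical_considerations": "Proxy-variable review completed; appeals process required",
    "contact": "model-risk-office@example.com",
}
```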
📊
Datasheet for Datasets
Dataset-level documentation (Microsoft) covering: dataset purpose, composition, collection process, preprocessing, consent basis, recommended uses, and known biases. Provides provenance for training data.
Audit use: Supports data lineage auditing; confirms consent and legal basis for data use
🧪
Validation Report
Formal documentation of all pre-deployment testing: accuracy on holdout set, bias test results across demographic groups, robustness testing, explainability analysis, and go/no-go recommendation with sign-off.
Audit use: Evidence that the model was validated before deployment; must be signed by a risk-independent reviewer
📝
Change Management Record
Every deployment, update, or configuration change to a production AI system must have: change request, risk assessment, testing evidence, approvals, rollback plan, and post-deployment verification. Same discipline as IT change management.
Audit use: Tests operating effectiveness of change control; unauthorized changes are audit findings
📈
Model Risk Assessment
Structured risk evaluation covering: model complexity, data sensitivity, decision impact, regulatory scope, technical limitations, and residual risks after controls. Determines governance level required for the model.
Audit use: Basis for scoping the audit; high-risk classifications require more extensive testing
🏁
Decommission Record
Formal retirement documentation: reason for decommission, final performance snapshot, data retention/deletion actions, successor model details, and sign-off from business and risk owners.
Audit use: Confirms retirement was deliberate and authorized; verifies GDPR deletion obligations were met
AI Versioning — What Must Be Versioned
Reproducibility requires that every component of the AI pipeline is versioned independently. A production model's provenance must be fully traceable from model artifact back to code, data, and configuration.
Artifact
What to Version
Tool Examples
Training Code
Model architecture, preprocessing scripts, training loop
Git (commit hash tied to model artifact)
Training Data
Dataset snapshot used for each training run (hash or version tag)
Data versioning tools (e.g., DVC) or dataset hash/version tags
Model Artifact
Serialized model with its metadata, metrics, and lineage
Model registry (e.g., MLflow Model Registry)
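A minimal sketch of tying a model's provenance back to its code and data versions, here via MLflow run tags; the dataset and pipeline version strings are illustrative assumptions:

```python
# Sketch: record the exact code commit and data version with the training run.
import subprocess
import mlflow

commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

with mlflow.start_run():
    mlflow.set_tag("git_commit", commit)                         # training code version
    mlflow.set_tag("dataset_version", "v1.3")                    # dataset snapshot / hash tag (assumed)
    mlflow.set_tag("preprocessing_version", "feature_pipeline==0.4.2")  # assumed pipeline version
    # ... training and mlflow.log_model(...) would follow here
```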
Responsible AI Checkpoints by Lifecycle Phase
Problem Definition
Is AI appropriate here? Could it cause harm? Is the use case legal? Risk tier assigned?
Data Collection
Is consent obtained? Are protected attributes identified? Is data representative of the deployment population?
Feature Engineering
Are proxy variables for protected attributes identified and mitigated? Is data leakage prevented?
Model Training
Are bias mitigation techniques applied (re-weighting, adversarial debiasing)? Is fairness a training objective?
Validation
Are bias metrics tested across all demographic groups? Is explainability tested? Is human review threshold defined?
Deployment
Is an appeals/override process in place? Is user notification required? Is HITL scope defined?
Monitoring
Are fairness metrics monitored alongside accuracy? Are bias drift alerts configured?
Retirement
Is retirement triggered by fairness degradation as well as performance? Are data deletion obligations met?
Practice Quiz — Domain 2: Lifecycle & Operations
10 questions covering AI lifecycle phases, MLOps, deployment patterns, and lifecycle governance. Select an answer for each question, then click Submit.
Question 1 of 10
Which lifecycle phase should formally define success criteria, risk tier classification, and business sign-off BEFORE any data is collected or resources committed?
A Model validation — where performance benchmarks are confirmed
B Feature engineering — where data transformations are designed
C Problem definition and scoping — where the use case is formally assessed and approved
D Deployment — where rollback plans are documented
Problem definition and scoping is the mandatory first phase. It establishes: the business problem, whether AI is the appropriate solution, measurable success criteria, regulatory constraints, and a risk tier classification. Without formal approval at this gate, all subsequent work may be done on an unapproved or inappropriate use case — a governance failure that auditors frequently cite.
Question 2 of 10
In MLOps, "CT" (Continuous Training) refers to:
A Continuous testing of model code quality through automated unit test pipelines
B Automatically retraining models when drift thresholds or performance triggers are met
C Continuous transformation of input data through the feature pipeline
D A neural network training methodology using cyclic learning rates
MLOps extends DevOps with a third continuous practice: Continuous Training (CT). CI tests code changes; CD deploys validated models; CT automatically retrains models when monitoring thresholds are breached (drift, performance degradation) or new labeled data arrives. CT requires automated validation gates — without them, a degraded model could auto-deploy. It's this automated retraining loop that distinguishes MLOps from traditional DevOps.
Question 3 of 10
A model is deployed in "shadow mode." This means the model:
A Is concealed from end users for security and compliance reasons
B Runs in parallel with the production model receiving live inputs, but its outputs are NOT used for real decisions
C Is only accessible during off-peak hours to reduce infrastructure load
D Uses obfuscated model weights to protect intellectual property
Shadow mode (also called shadow deployment or dark launch) runs the new model alongside the production model. Both receive identical live inputs; only the production model's outputs are used for actual decisions. The shadow model's predictions are logged for offline comparison. This is the lowest-risk validation approach — zero impact on users, but provides real-world input distribution. Distinguishing it from A/B testing (where both models serve real decisions) is a common exam question.
Question 4 of 10
A feature store's PRIMARY purpose in an MLOps architecture is to:
A Store serialized model weights and deployment artifacts
B Provide a centralized repository of feature definitions ensuring consistency between training and serving
C Track model performance metrics and drift alerts over time
D Manage model deployment approvals and change records
A feature store solves the training-serving skew problem: it provides the same feature definitions and computed values to both the training pipeline (offline store, historical data) and the inference pipeline (online store, low-latency real-time values). Without a feature store, two separate code paths often compute features slightly differently, causing the model to behave differently in production than in testing. Model weights → model registry; metrics → monitoring platform; approvals → change management system.
Question 5 of 10
Which deployment strategy gradually shifts a small percentage of live traffic to a new model, increasing exposure only as confidence builds?
A Shadow deployment — model runs in parallel but outputs are not used
B A/B testing — traffic is split equally between two models for comparison
C Canary deployment — a small traffic slice goes to the new model with gradual increase
D Blue-green deployment — instantaneous full traffic switch between two environments
Canary deployment (named after the "canary in a coal mine" concept) exposes the new model to a small fraction of traffic (e.g., 5%) while the current model handles the rest. If no issues emerge, the canary slice grows incrementally toward 100%. Full rollback is available at any point. This limits the "blast radius" of a bad deployment. A/B testing splits traffic equally for experimental comparison; blue-green is an instant full-switch with a parallel environment on standby.
Question 6 of 10
Model cards are PRIMARILY used to:
A Store model weights and deployment metadata in the model registry
B Track training experiment hyperparameters and run metrics
C Document a model's intended use, performance characteristics, limitations, and ethical considerations
D Define feature pipeline versioning and data transformation rules
Model cards (introduced by Google) are structured disclosure documents covering: the model's intended use cases and out-of-scope uses, performance metrics across different demographic groups, known limitations, ethical considerations, and the training data used. They are the primary artifact for bias transparency and regulatory disclosure. Under the EU AI Act, high-risk AI systems are required to provide model card-like documentation. Hyperparameters → experiment tracker; weights → model registry; feature rules → feature store.
Question 7 of 10
"Data leakage" in model development refers to:
A Training data being exposed to unauthorized users through a security breach
B Future or test-set information contaminating the training process, artificially inflating model performance
C Loss of training data due to pipeline infrastructure failures
D A regulatory breach of data handling and retention procedures
Data leakage is a model development defect where information that would not legitimately be available at prediction time (future data, or the test set used for "validation") is accidentally included in training. The model learns to use this future information — producing unrealistically high performance during testing that collapses completely in production. Common leakage sources: fitting scalers on the full dataset before splitting; including post-event features; using future-date data for past predictions. This is a technical risk, not a security or regulatory breach.
Question 8 of 10
In MLOps, which component stores ALL versions of deployed models along with their metadata, training lineage, and lifecycle stage?
A Feature store — manages feature definitions for training and serving
B Data catalog — documents dataset metadata and lineage
C Experiment tracker — logs training runs and hyperparameters
D Model registry — central store for model artifacts with lifecycle governance
The model registry is the single source of truth for all model versions in an organization. It stores: serialized model artifacts, metadata (training data version, hyperparameters, performance metrics), lineage (which code + data produced this model), and lifecycle stage (Staging → Production → Archived). It acts as the deployment gate — only models in "Production" state in the registry can serve live traffic. It's essential for governance: without a registry, there is no audit trail for what model is in production or how it got there.
Question 9 of 10
Which retraining strategy initiates a new training run automatically when a model's performance metric drops below a defined threshold?
A Scheduled retraining — model retrains on a fixed calendar-based schedule
B Triggered (event-based) retraining — retraining fires when a metric threshold is crossed
C Continuous training — model trains on a rolling window of new data without stopping
D Manual retraining — data scientists initiate retraining based on their own review
Triggered (event-based) retraining is initiated automatically when a specific condition is met — a PSI score exceeds 0.2, model accuracy drops below a defined floor, or a bias metric breaches a threshold. It's more responsive than scheduled retraining (which may miss sudden shifts) but less complex than continuous training (which trains on every new data point). The trade-off: well-defined thresholds avoid unnecessary retraining while ensuring timely response to real degradation.
Question 10 of 10
The PRIMARY goal of champion-challenger testing in AI model deployment is to:
A Test a model's resilience against adversarial attacks and data poisoning
B Validate model code quality through peer review before merging to main branch
C Compare a new candidate model's performance against the current production model before full deployment
D Ensure model documentation meets compliance requirements before deployment approval
Champion-challenger testing is a model validation strategy where the new candidate model (challenger) is evaluated against the existing production model (champion) on the same held-out test data. The challenger must meet or exceed the champion's performance metrics before being promoted to production. This prevents deploying a "different but not better" model, and provides a documented rationale for every model update. It directly addresses the governance question: "Why was this model promoted?"
Memory Hooks
High-yield mnemonics and patterns to lock in AI Lifecycle & Operations for the AAIA.
🔄
8 Lifecycle Phases — In Order
Problem → Data → Features → Train → Validate → Deploy → Monitor → Retire. The lifecycle is a loop — Monitoring feeds back into Retrain (via CT triggers) and eventually Retire. Governance gates sit between each phase.
Mnemonic: "Pretty Data Flows Through Very Dense Machine Rooms" — Problem, Data, Features, Train, Validate, Deploy, Monitor, Retire
⚙️
MLOps CI/CD/CT
CI = test the code (unit tests, integration tests on code commits). CD = deploy the validated model (automated delivery pipeline). CT = retrain the model (automated when drift or performance triggers fire). CT is the MLOps addition — DevOps only has CI/CD. CT without automated validation gates is dangerous.
Mnemonic: "Code → Deploy → Train — gets you to production and keeps you there." CT = the loop that keeps models fresh.
🕵️
Shadow vs. Canary vs. A/B
Shadow: new model runs, zero decisions made → safest, no user impact. Canary: small % of real traffic to new model → graduated risk. A/B: full split traffic for experiment → both models make real decisions. Risk order: Shadow < Canary < A/B. Champion-Challenger is pre-deployment testing — not live.
Mnemonic: "Shadows don't decide, Canaries take a little, A/B takes half" — increasing risk and exposure left to right.
🏗️
Feature Store — Offline vs. Online
Offline store: historical feature values for model training (batch, high latency OK). Online store: real-time feature retrieval for inference (millisecond latency). Same feature definitions in both = training-serving consistency. Without this, skew between training and serving corrupts model behavior silently.
Mnemonic: "Offline feeds Training, Online feeds Serving — same definitions, different stores" = no skew.
📋
Model Registry Lifecycle Stages
Model registry stages: Staging (validated, awaiting approval) → Production (approved, live) → Archived (superseded, retained for audit). Only Production-stage models should serve live traffic. Every transition needs a documented approval. No model should reach production without a registry entry — that's an audit finding.
Mnemonic: "Stage it, Approve it, Archive it" — the model's career path from development to retirement.
💧
Data Leakage — Root Causes
Leakage occurs when the model sees future information during training: 1. Test set contamination (used for tuning). 2. Scaler fitted on full data before split. 3. Post-event features included (e.g., "claim paid" as a feature to predict "will claim"). 4. Temporal leakage in time-series data. Model appears great in testing, collapses in production.
Mnemonic: "Future data in training = future failure in production." Always split BEFORE fitting any transformers.
High-Yield AAIA Facts — AI Lifecycle & Operations
Fact
Answer
The third "C" that MLOps adds to DevOps's CI/CD
CT — Continuous Training (automated model retraining)
Deployment strategy with zero risk to live users
Shadow mode — new model runs in parallel, outputs not used for decisions
What a feature store prevents
Training-serving skew — ensures training and inference use identical feature definitions
Model card creator and its key sections
Google (2019); sections: intended use, performance by group, limitations, ethical considerations
Data leakage definition
Future or test-set information contaminating training, artificially inflating performance
Champion-challenger testing purpose
Validate that the new model outperforms the current production model before promotion
Model registry lifecycle stages
Staging → Production → Archived
Silent failure mode unique to AI vs. traditional software
Model drift — gradual performance degradation with no code changes or visible errors
Retraining strategy that fires when PSI > 0.2
Triggered (event-based) retraining
Deployment pattern requiring no network connectivity
Edge deployment — model runs on-device
Flashcards & Study Advisor
Click any card to flip it. Use the Study Advisor for targeted guidance by topic area.
Concept
What is Continuous Training (CT) in MLOps and what makes it different from CI/CD?
Tap to reveal →
Answer
CT (Continuous Training) automatically retrains models when drift thresholds or performance triggers are met. CI tests code changes; CD deploys validated models. CT = the MLOps-specific addition — it closes the loop between monitoring and retraining, keeping models current without manual intervention.
Distinction
What is the difference between shadow mode, canary, and A/B deployment?
Tap to reveal →
Answer
Shadow: new model receives live inputs but makes no decisions (zero risk). Canary: small % of real traffic goes to new model with gradual increase. A/B: traffic split equally — both models make real decisions for comparison. Risk: Shadow < Canary < A/B.
Infrastructure
What problem does a feature store solve, and what are its two components?
Tap to reveal →
Answer
Solves training-serving skew — ensuring model training and live inference use identical feature definitions. Two components: Offline store (historical features for training, batch) and Online store (real-time low-latency feature retrieval for inference).
Risk
What is data leakage in model development and why is it dangerous?
Tap to reveal →
Answer
Data leakage: future or test-set information contaminating training, artificially inflating performance. Dangerous because the model appears excellent in testing but fails in production — a silent defect invisible until live deployment. Common cause: fitting scalers on full data before train/test split.
Governance
What is a model card and when is it required?
Tap to reveal →
Answer
Model card (Google): structured document covering intended use, out-of-scope uses, performance across demographic groups, limitations, and ethical considerations. Required by EU AI Act for high-risk AI; best practice for all AI systems. Primary artifact for bias transparency and regulatory disclosure.
Infrastructure
What is a model registry and what are its lifecycle stages?
Tap to reveal →
Answer
Model registry: centralized store for all model versions with metadata, training lineage, and lifecycle stage. Three stages: Staging (validated, awaiting approval) → Production (approved, live) → Archived (superseded). Only Production models should serve live traffic.
Validation
What is champion-challenger testing and why is it a governance best practice?
Tap to reveal →
Answer
Champion-challenger: new candidate model (challenger) is validated against the current production model (champion) on the same test data. The challenger must meet or exceed champion performance to be promoted. Governance value: provides documented rationale for every model update decision.
MLOps
How does MLOps differ from DevOps in what must be versioned?
Tap to reveal →
Answer
DevOps versions: code only. MLOps versions: code + data + model artifacts (all three). Reproducibility requires that the same code version + data version + hyperparameters reproduce the same model. Missing any one makes the model's provenance unverifiable — an audit finding.
Exam Tips
Lifecycle phase questions: When asked about sequence, use "Problem → Data → Features → Train → Validate → Deploy → Monitor → Retire." Questions about "what comes first" almost always answer with Problem Definition or risk classification before any data work.
Deployment strategy distinctions: Shadow = no decisions; Canary = graduated real traffic; A/B = split real traffic. If a question asks about "lowest risk validation on live traffic" → Shadow. "Gradual rollout" → Canary. "Comparison experiment" → A/B.
Feature store vs. model registry: Feature store = features for training/serving consistency. Model registry = model artifacts + lifecycle. Never mix these — each question is testing whether you know which stores what.
CT trigger vs. scheduled: "Automatically when threshold crossed" → Triggered CT. "Every month regardless" → Scheduled. Both are valid; exam tests whether you know the tradeoff (responsiveness vs. governance simplicity).
Model card = bias disclosure: Any question about documenting bias, performance by demographic, ethical considerations, or intended use → Model Card. Not a datasheet (that's for datasets).
Common Mistakes to Avoid
Shadow ≠ A/B testing: Shadow mode makes NO real decisions. A/B testing makes real decisions for both model variants. A common distractor — know the difference cold.
Data leakage ≠ data breach: Leakage is a technical model development defect (future data in training), not a security incident. Exam questions test this distinction specifically.
CT without validation gates is dangerous: Continuous training doesn't mean blind auto-deployment. Every CT cycle must pass automated validation gates before the new model reaches production.
Model cards ≠ experiment tracking: Model cards are disclosure documents for intended use and bias. Experiment tracking logs hyperparameters and training metrics. Different artifacts, different purposes.
Champion-challenger is pre-deployment: It happens before the challenger goes live — not a live traffic split. Confusing it with A/B testing is a common error.
Quick Review — Key Facts
8 lifecycle phases: Problem → Data → Features → Train → Validate → Deploy → Monitor → Retire
MLOps = CI + CD + CT: CI tests code; CD deploys; CT retrains automatically
Feature store: Offline (training) + Online (serving) = same definitions = no skew
Model registry stages: Staging → Production → Archived
Data leakage: Future/test data in training → inflated metrics → production failure
Model card: Intended use, demographic performance, limitations, ethical notes
Champion-challenger: New model must beat current production model on same test set
Deep Dive — Advanced Concepts
Training-serving skew: The feature store solves the inconsistency between training and serving code paths. But skew can also occur from: version mismatches in preprocessing libraries, stale online store values, and different handling of null values between paths — all are auditable operational risks.
Temporal leakage in time-series: For time-series models (fraud, demand forecasting), train/test splits must be temporal — test set must always be chronologically after the training set. Random splitting on time-series data always introduces leakage.
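A minimal sketch of a leakage-safe temporal split; the dates, cutoff, and column names are illustrative:

```python
# Sketch: for time-series data, the test window must sit strictly after the training window.
import pandas as pd

df = pd.DataFrame({
    "event_time": pd.date_range("2023-01-01", periods=365, freq="D"),  # illustrative timeline
    "amount": range(365),
})

cutoff = pd.Timestamp("2023-10-01")
train = df[df["event_time"] < cutoff]    # everything before the cutoff
test = df[df["event_time"] >= cutoff]    # strictly after — never shuffle time-series rows

# Anti-pattern: a random shuffled split leaks future rows into training (temporal leakage).
```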
Feedback loop delay: Many AI systems suffer from "label delay" — outcomes only become known long after predictions (e.g., credit default takes months). This limits real-time accuracy monitoring; proxy metrics (override rate, confidence distribution) must substitute until ground truth arrives.
Transfer learning risk: Foundation models (LLMs, CLIP) are pre-trained on large datasets. Fine-tuning inherits the original training data's biases and provenance risks — even if the fine-tuning data is clean. Auditors must assess the full training lineage including the base model.
Model retirement triggers: Retirement should be triggered not only by performance degradation but also by: regulatory changes making the model non-compliant, discovery of systematic bias, business use case retirement, or replacement by a superior model. Each requires the same governance rigor as initial deployment.
Practice Tips
Draw the lifecycle: Sketch the 8-phase loop and label the governance artifact at each gate. Visualizing it builds exam intuition for "which comes first" questions.
Practice deployment scenarios: For any scenario question, identify: (1) is this pre-deployment or post-deployment? (2) does the model make real decisions? → narrow to the right strategy immediately.
Learn what each MLOps component stores: Feature store = features; model registry = model artifacts; experiment tracker = training run metadata; data catalog = dataset metadata. These are frequently tested as distractors.
Apply retraining logic: "Automatically when threshold breached" = Triggered. "On a schedule" = Scheduled. "Continuously on new data" = Continuous Training. "Data scientist decides" = Manual.
Connect lifecycle to Domain 3 auditing: Every phase has an audit artifact. Practice naming the artifact for each phase — this directly bridges Domain 2 operations to Domain 3 audit evidence collection.