AWS SAA-C03 · Design Resilient Architectures

Design Resilient Architectures

Domain 2 of the AWS SAA-C03 exam — the second largest domain at 26%. Covers loosely coupled architectures, serverless, containers, load balancing, auto scaling, high availability, and disaster recovery strategies.

Domain 2 · 26% of Exam

Domain Breakdown — SAA-C03 Task Statements

Task Statement	Focus	Key Services
2.1 — Scalable & Loosely Coupled	Microservices, messaging, serverless, containers	SQS, SNS, EventBridge, Lambda, ECS, EKS, API Gateway, Step Functions
2.2 — HA & Fault Tolerant	Multi-AZ, DR strategies, Route 53, Auto Scaling	ALB, NLB, GWLB, EC2 Auto Scaling, Route 53, RDS Multi-AZ, RDS Proxy

What You'll Master

Loose Coupling Patterns

SQS (queue-based decoupling), SNS (fan-out pub/sub), EventBridge (event-driven routing with rules), API Gateway (managed API front door). Understand when each pattern applies and how they combine.

Serverless Compute

Lambda — event triggers, execution limits (15 min), concurrency (1,000 default), provisioned concurrency for cold starts. Fargate — serverless containers. Step Functions — multi-step workflow orchestration.

Containers

ECS (EC2 launch type vs Fargate launch type), EKS (managed Kubernetes), ECR (container registry). Know ECS task definitions, service auto scaling, and when to choose EKS over ECS.

HA & Fault Tolerance

Multi-AZ vs Multi-Region deployments. ALB (Layer 7, path routing), NLB (Layer 4, ultra-low latency, static IP), GWLB (virtual appliances). EC2 Auto Scaling — target tracking, step, scheduled, predictive policies.

Disaster Recovery

RPO and RTO definitions. 4 DR strategies: Backup & Restore → Pilot Light → Warm Standby → Active-Active. Cost increases as RTO/RPO decreases. Match business requirement to correct strategy.

Route 53 & DNS

Routing policies to know: Simple, Weighted, Latency, Failover, Geolocation, Geoproximity, Multivalue answer (Route 53 also offers IP-based routing). Failover routing requires a health check. Health checks: endpoint, calculated, CloudWatch alarm.

Exam Strategy Tips

DR Strategy Selection

The exam heavily tests matching RPO/RTO requirements to the correct DR strategy. Know the cost ladder: Backup & Restore (cheapest, hours RTO) → Pilot Light → Warm Standby → Active-Active (most expensive, near-zero RTO/RPO).

ELB Type Selection

ALB = HTTP/HTTPS, Layer 7, content-based routing, WAF integration. NLB = TCP/UDP, Layer 4, static IP, ultra-low latency, millions of RPS. GWLB = third-party virtual appliances (firewalls, IDS). Wrong layer = wrong answer.

Messaging Service Selection

SQS = decouple producers from a single consumer or a pool of competing consumers; pull-based, durable. SNS = fan-out to multiple subscribers at once, push-based. EventBridge = event-driven routing with filtering rules; it is the successor to CloudWatch Events.

Key Services — Domain 2

Amazon SQS · Amazon SNS · Amazon EventBridge · Amazon API Gateway · AWS Lambda · AWS Fargate · AWS Step Functions · Amazon ECS · Amazon EKS · Amazon ECR · ALB / NLB / GWLB · EC2 Auto Scaling · Amazon Route 53 · Amazon RDS Multi-AZ · RDS Read Replicas · RDS Proxy · AWS X-Ray

Core Concepts

Eight deep-dive concept cards covering every examinable topic in Domain 2. Study each section, then test yourself in the Quiz and Flashcards tabs.

1. Loose Coupling with SQS, SNS & EventBridge

SQS (Simple Queue Service) — decouples producers from consumers; messages persist until a consumer processes and deletes them (or the retention period expires); pull-based. Standard queue: best-effort ordering, at-least-once delivery. FIFO queue: strict ordering, exactly-once processing, 300 API calls per second per action without batching (up to 3,000 messages per second with batching).
SQS Visibility Timeout — when consumer picks up a message, it becomes invisible for the timeout period (default 30s, max 12h). If not deleted before timeout expires, message reappears and another consumer can pick it up. Increase timeout for long-processing jobs.
SQS Dead Letter Queue (DLQ) — receives messages that fail processing after maxReceiveCount retries. Use for debugging and isolating failed messages. Set a redrive policy on the source queue.
SNS (Simple Notification Service) — push-based pub/sub; one message → many subscribers (SQS, Lambda, HTTP, email, SMS, mobile push). Fan-out pattern = SNS → multiple SQS queues.
Amazon EventBridge — event bus for AWS service events, custom events, and SaaS events. Rules route events to targets (Lambda, SQS, Step Functions). Replaces CloudWatch Events. Schema registry for event discovery.
SQS vs SNS vs EventBridge — SQS = queue that decouples a producer from its consumer(s); SNS = fan-out/push to multiple subscribers; EventBridge = event-driven routing with rules and filtering.
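
The visibility timeout and DLQ behaviour described above can be sketched with a toy in-memory queue. This is only a model for study purposes: `ToyQueue`, `Message`, and the explicit `now` argument are hypothetical names, not the SQS API, and the real service tracks receive counts and redrive in its own way.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    body: str
    receive_count: int = 0
    invisible_until: float = 0.0  # hidden from receivers until this time


@dataclass
class ToyQueue:
    """In-memory stand-in for an SQS queue (hypothetical, not the SDK)."""
    visibility_timeout: float = 30.0
    max_receive_count: int = 3
    messages: list = field(default_factory=list)
    dead_letter: list = field(default_factory=list)

    def send(self, body: str) -> None:
        self.messages.append(Message(body))

    def receive(self, now: float):
        """Return the first visible message and hide it for the timeout."""
        for msg in list(self.messages):
            if msg.invisible_until > now:
                continue  # still in flight with another consumer
            if msg.receive_count >= self.max_receive_count:
                # Retries exhausted: move to the DLQ instead of redelivering.
                self.messages.remove(msg)
                self.dead_letter.append(msg)
                continue
            msg.receive_count += 1
            msg.invisible_until = now + self.visibility_timeout
            return msg
        return None

    def delete(self, msg: Message) -> None:
        """A consumer acknowledges success by deleting the message."""
        self.messages.remove(msg)


queue = ToyQueue(visibility_timeout=30.0, max_receive_count=2)
queue.send("order-42")

first = queue.receive(now=0.0)    # delivered, then hidden until t=30
hidden = queue.receive(now=10.0)  # None: the message is still invisible
second = queue.receive(now=31.0)  # timeout expired without a delete: redelivered
third = queue.receive(now=62.0)   # receive count exhausted: moved to the DLQ
```

Because the consumer never calls `delete`, the message reappears after each timeout and ends up in `dead_letter` once `max_receive_count` is reached, which is the failure path a redrive policy is meant to isolate.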

2. API Gateway & Microservices Patterns

Amazon API Gateway — managed REST, HTTP, and WebSocket APIs. Integrates with Lambda (serverless backends), HTTP backends, and AWS services. Handles throttling, authentication, caching, and SSL termination.
API Gateway types — REST API (full features, caching, usage plans); HTTP API (lower latency, lower cost, fewer features); WebSocket API (real-time two-way communication).
Throttling — default 10,000 RPS per account; per-stage and per-method throttling; 429 Too Many Requests when exceeded. Use SQS to absorb spikes behind API Gateway.
Microservices design — stateless services (no server-side session) scale horizontally. Each service owns its data store. Communicate via APIs or events. Containers are ideal for packaging.
Multi-tier architecture — presentation → application → data tiers. Use ELB between tiers. Each tier scales independently. Private subnets for app and data tiers.
AWS Step Functions — orchestrate multi-step workflows. Standard workflows: durable, exactly-once, up to 1 year. Express workflows: high-volume, at-least-once, up to 5 minutes. Visual state machine. Integrates with Lambda, ECS, SQS, SNS, DynamoDB.
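
To see why a traffic spike produces 429 responses, here is a minimal token-bucket sketch. It assumes a token-bucket style throttle with a steady rate and a burst size; the class name, the tiny numbers, and the explicit `now` argument are illustrative and not taken from the API Gateway API.

```python
class TokenBucket:
    """Toy throttle: refill at a steady rate, allow bursts up to a capacity."""

    def __init__(self, rate_per_second: float, burst: int):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0

    def allow(self, now: float) -> int:
        """Return an HTTP-style status: 200 if a token is available, else 429."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return 200
        return 429


bucket = TokenBucket(rate_per_second=2, burst=3)

# Five requests arrive at the same instant: the burst of 3 passes, the rest are throttled.
spike = [bucket.allow(now=0.0) for _ in range(5)]

# One second later, two tokens have been refilled.
later = [bucket.allow(now=1.0) for _ in range(3)]
```

Placing a queue such as SQS between the API and the workers lets the backend drain requests at its own pace instead of rejecting the overflow.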

3. Serverless: Lambda & Fargate

AWS Lambda — runs code without provisioning servers. Event-driven triggers: S3, SQS, API Gateway, DynamoDB Streams, Kinesis, EventBridge. Max 15 min execution. Memory 128 MB–10 GB. Max deployment package 50 MB (zip) / 10 GB (container image).
Lambda concurrency — default 1,000 concurrent executions per region. Reserved concurrency = guarantee + cap for a function. Provisioned concurrency = pre-warmed instances to eliminate cold starts.
Lambda cold starts — container initialization delay on first invocation or after idle period. Mitigate with provisioned concurrency. Smaller packages and fewer imports also reduce cold start time.
Lambda layers — share code/libraries across multiple functions. Up to 5 layers per function. Store dependencies separately from function code to reduce package size.
AWS Fargate — serverless container execution for ECS or EKS. No EC2 to manage. Pay per vCPU and memory per second. Ideal for variable or unpredictable container workloads. No cluster management overhead.
Lambda vs Fargate — Lambda: short tasks ≤15 min, event-driven, billed per request and execution duration. Fargate: longer running, container-based, billed per vCPU and memory per second, no execution time limit.
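
A quick way to reason about the concurrency limit is to estimate concurrent executions as request rate multiplied by average duration. The sketch below applies that estimate and the 15-minute limit from the text; the function names and the sample traffic figures are made up for illustration, and `pick_compute` deliberately ignores every factor except runtime.

```python
import math

DEFAULT_REGIONAL_CONCURRENCY = 1_000  # default per-Region limit quoted in the text
LAMBDA_MAX_MINUTES = 15               # maximum execution time quoted in the text


def required_concurrency(requests_per_second: int, avg_duration_ms: int) -> int:
    """Estimate concurrent executions as request rate times average duration."""
    return math.ceil(requests_per_second * avg_duration_ms / 1000)


def pick_compute(max_runtime_minutes: float) -> str:
    """Simplified rule of thumb: only the runtime limit is considered."""
    return "Lambda" if max_runtime_minutes <= LAMBDA_MAX_MINUTES else "Fargate"


steady = required_concurrency(requests_per_second=400, avg_duration_ms=500)
spike = required_concurrency(requests_per_second=5000, avg_duration_ms=300)
needs_limit_increase = spike > DEFAULT_REGIONAL_CONCURRENCY
```

With these sample numbers the steady load needs about 200 concurrent executions, while the spike needs about 1,500 and would exceed the default regional limit.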

4. Containers: ECS & EKS

Amazon ECS — AWS-native container orchestration. Task definition (JSON config for containers: image, CPU, memory, port mappings). Service (desired count, auto-scaling). Two launch types: EC2 (you manage instances) and Fargate (serverless).
ECS service auto scaling — scales tasks based on CPU%, memory%, or ALB request count per target. Integrates with Application Auto Scaling.
Amazon EKS — managed Kubernetes on AWS. Control plane managed by AWS. Worker nodes are EC2 instances or Fargate pods. Use when you need Kubernetes compatibility or have existing K8s workloads.
Amazon ECR — fully managed container registry. Integrates with ECS, EKS, and Lambda. Lifecycle policies auto-remove old images. Encryption at rest with KMS. Image scanning for vulnerabilities.
ECS vs EKS — ECS: AWS-native, simpler setup, tightly integrated with AWS services. EKS: Kubernetes standard, multi-cloud portability, steeper learning curve but more flexibility.
Container networking — in awsvpc network mode (the only mode Fargate tasks support), each ECS task gets its own ENI and IP address, so security groups apply at the task level. EKS uses the VPC CNI plugin.
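
As a rough illustration of what a task definition contains, the sketch below builds a minimal Fargate-style definition as a Python dictionary and renders it to JSON. The family name, container name, image, and port are placeholders chosen for the example; a real definition usually also needs roles, logging, and other settings.

```python
import json

# Minimal task definition sketch; all names and values are illustrative placeholders.
task_definition = {
    "family": "web-app",
    "networkMode": "awsvpc",               # each task gets its own ENI
    "requiresCompatibilities": ["FARGATE"],
    "cpu": "256",                          # task-level CPU units
    "memory": "512",                       # task-level memory in MiB
    "containerDefinitions": [
        {
            "name": "web",
            "image": "example/web:latest",  # hypothetical image reference
            "essential": True,
            "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
        }
    ],
}

rendered = json.dumps(task_definition, indent=2)
```

The task definition describes what to run; the ECS service (desired count, auto scaling) is a separate resource that decides how many copies run.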

5. Elastic Load Balancing

ALB (Application Load Balancer) — Layer 7; HTTP/HTTPS/gRPC; path-based and host-based routing; sticky sessions; target groups (EC2, Lambda, containers, IPs); WAF integration. Ideal for microservices and containerized apps.
NLB (Network Load Balancer) — Layer 4; TCP/UDP/TLS; ultra-low latency; static IP or Elastic IP per AZ; handles millions of RPS. Ideal for gaming, IoT, financial applications. Supports AWS PrivateLink.
GWLB (Gateway Load Balancer) — Layer 3; designed for third-party virtual network appliances (firewalls, IDS/IPS, deep packet inspection). Uses GENEVE protocol. Bump-in-the-wire traffic inspection pattern.
CLB (Classic Load Balancer) — legacy; Layer 4 and 7. Avoid for new architectures; migrate to ALB or NLB.
Cross-zone load balancing — ALB: enabled by default (no extra charge). NLB: disabled by default (charges apply when enabled). Distributes traffic evenly across all registered targets across all AZs.
ELB health checks — ALB pings target HTTP/HTTPS endpoint; NLB uses TCP, HTTP, or HTTPS. Unhealthy targets removed from rotation automatically. Configure healthy/unhealthy thresholds appropriately.
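
The healthy/unhealthy thresholds can be pictured as a small state machine: a target changes state only after enough consecutive check results disagree with its current state. The class below is a simplified sketch under that assumption; the threshold values are arbitrary, and it starts targets as healthy, which is a simplification of how a real load balancer registers new targets.

```python
class TargetHealth:
    """Toy model of consecutive-result health thresholds."""

    def __init__(self, healthy_threshold: int = 3, unhealthy_threshold: int = 2):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True  # simplification: assume the target starts healthy
        self._streak = 0     # consecutive results that disagree with current state

    def record(self, check_passed: bool) -> bool:
        """Record one health check result and return the resulting state."""
        if check_passed == self.healthy:
            self._streak = 0
        else:
            self._streak += 1
            needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
            if self._streak >= needed:
                self.healthy = check_passed
                self._streak = 0
        return self.healthy


target = TargetHealth(healthy_threshold=3, unhealthy_threshold=2)
results = [True, False, True, False, False, True, True, True]
history = [target.record(r) for r in results]
```

A single failed check does not remove the target; two in a row do, and it takes three consecutive passes to bring it back into rotation.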

6. EC2 Auto Scaling

Auto Scaling Group (ASG) — min/desired/max capacity. Launches or terminates EC2 instances based on policies. Uses Launch Template (preferred over deprecated Launch Configuration).
Scaling policies — Target Tracking (most common: maintain a metric, e.g., 50% CPU); Step Scaling (scale by different amounts at different thresholds); Scheduled (predictable patterns); Predictive (ML-based demand forecasting).
Warm-up period — time for a new instance to contribute to the metric before ASG considers it "live." Prevents over-provisioning due to metric spikes during boot initialization.
Cooldown period — default 300s after a scaling activity; prevents rapid back-to-back scaling actions (thrashing). Target tracking policies manage their own cooldown automatically.
Lifecycle hooks — pause instance launch or termination to perform custom actions (install software, drain connections, take snapshot). Integrates with EventBridge, SQS, and SNS for notifications.
Spot instances in ASG — use mixed instances policy. Combine On-Demand base capacity with Spot for cost savings. Use capacity-optimized allocation strategy. Handle Spot interruptions gracefully via lifecycle hooks.
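
Target tracking can be approximated as proportional scaling: scale capacity by the ratio of the observed metric to the target, then clamp to the group's min and max. The function below is a simplified sketch of that idea, not the service's actual algorithm, which also accounts for warm-up, cooldown, and alarm evaluation.

```python
import math


def target_tracking_desired(
    current_capacity: int,
    metric_value: int,
    target_value: int,
    min_size: int,
    max_size: int,
) -> int:
    """Approximate desired capacity for a target tracking policy."""
    raw = math.ceil(current_capacity * metric_value / target_value)
    return max(min_size, min(max_size, raw))


# 4 instances at 80% CPU with a 50% target: scale out.
scale_out = target_tracking_desired(4, 80, 50, min_size=2, max_size=12)

# 4 instances at 20% CPU: scale in, but never below the minimum.
scale_in = target_tracking_desired(4, 20, 50, min_size=2, max_size=12)

# 10 instances at 90% CPU: the raw answer is 18, capped at the maximum.
capped = target_tracking_desired(10, 90, 50, min_size=2, max_size=12)
```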

7. High Availability: Multi-AZ, Route 53 & HA Patterns

Multi-AZ design — deploy across ≥2 AZs. Use ELB to distribute traffic. Stateless app layer (no server-side session) scales easily. RDS Multi-AZ = synchronous standby replica in another AZ; automatic failover in ~60–120s. Standby does NOT serve read traffic.
Route 53 routing policies — Simple (one resource); Weighted (% split for A/B testing, migration); Latency (the Region with the lowest latency); Failover (primary/secondary, requires health check); Geolocation (user's country or continent); Geoproximity (bias toward a location); Multivalue (returns up to 8 healthy records).
Route 53 health checks — Endpoint (HTTP/HTTPS/TCP), Calculated (aggregate multiple child checks), CloudWatch Alarm. Used with Failover routing for automatic DNS failover. Health checkers come from multiple regions.
RDS Read Replicas vs Multi-AZ — Read Replicas: asynchronous replication, improve READ performance, can be cross-region, must manually promote to standalone. Multi-AZ: synchronous, HA and automatic failover only, no read traffic from standby.
RDS Proxy — connection pooler for RDS and Aurora. Reduces connection overhead, especially for Lambda (where each concurrent execution environment can open its own DB connection). Can cut failover time by up to 66%. Can enforce IAM authentication. Deployed within a VPC.
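
The link between failover routing and health checks can be sketched as a tiny resolver. The function name and the addresses (taken from documentation-reserved IP ranges) are illustrative; `None` stands for "no health check attached", in which case the primary is treated as always healthy.

```python
from typing import Optional


def failover_answer(primary: str, secondary: str, primary_health: Optional[bool]) -> str:
    """Return the record a failover policy would answer with.

    primary_health is True/False when a health check exists,
    or None when no health check is attached to the primary.
    """
    if primary_health is None or primary_health:
        return primary
    return secondary


PRIMARY = "192.0.2.10"      # placeholder address
SECONDARY = "198.51.100.20"  # placeholder address

healthy = failover_answer(PRIMARY, SECONDARY, primary_health=True)
failed = failover_answer(PRIMARY, SECONDARY, primary_health=False)
unchecked = failover_answer(PRIMARY, SECONDARY, primary_health=None)
```

The `unchecked` case is the exam trap: without a health check, DNS keeps answering with the primary even if the endpoint is down.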

8. Disaster Recovery Strategies

RPO (Recovery Point Objective) — maximum acceptable data loss measured as time since last backup. "How old can the restored data be?"
RTO (Recovery Time Objective) — maximum acceptable downtime before service is restored. "How long can the business tolerate the outage?"
Backup & Restore — lowest cost, highest RTO/RPO. Backup data to S3 or Glacier; restore when disaster occurs. RPO = hours; RTO = hours. Suitable for non-critical workloads.
Pilot Light — minimal core infrastructure always running in DR region (e.g., RDS read replica). Scale up quickly in disaster. RPO = minutes; RTO = tens of minutes. Medium cost.
Warm Standby — scaled-down but fully functional environment in DR region, running continuously. Scale up to production capacity on disaster. RPO = seconds; RTO = minutes. Higher cost.
Active-Active (Multi-Site) — full production load in two or more regions simultaneously. Route 53 distributes traffic. Instant failover. RPO ≈ 0; RTO ≈ 0. Highest cost.
Selection guide — cost priority → Backup & Restore. Balanced trade-off → Pilot Light or Warm Standby. Zero downtime required → Active-Active. Always match the business's RPO/RTO budget to the correct strategy.
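
The selection logic amounts to picking the cheapest strategy whose achievable RTO and RPO both fit the business requirement. The sketch below encodes that idea; the numeric cut-offs in seconds are assumptions chosen to mirror the rough ranges in this section (hours, tens of minutes, minutes, seconds), not official figures.

```python
# (strategy, achievable RTO in seconds, achievable RPO in seconds), cheapest first.
# The numbers are illustrative assumptions, not AWS-published values.
STRATEGIES = [
    ("Backup & Restore", 3600, 3600),
    ("Pilot Light", 600, 60),
    ("Warm Standby", 60, 1),
    ("Active-Active", 0, 0),
]


def choose_dr_strategy(required_rto_s: int, required_rpo_s: int) -> str:
    """Return the cheapest strategy that meets both the RTO and RPO requirement."""
    for name, achievable_rto, achievable_rpo in STRATEGIES:
        if achievable_rto <= required_rto_s and achievable_rpo <= required_rpo_s:
            return name
    return "Active-Active"


archive = choose_dr_strategy(required_rto_s=24 * 3600, required_rpo_s=24 * 3600)
internal_app = choose_dr_strategy(required_rto_s=30 * 60, required_rpo_s=5 * 60)
storefront = choose_dr_strategy(required_rto_s=5 * 60, required_rpo_s=10)
payments = choose_dr_strategy(required_rto_s=0, required_rpo_s=0)
```

Walking the list from cheapest to most expensive matches the cost ladder: a stricter requirement forces a move to a more expensive strategy.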

Quick Reference — DR Strategies Comparison

Strategy	RPO	RTO	Cost	Description
Backup & Restore	Hours	Hours	$	Backup to S3/Glacier; restore on disaster
Pilot Light	Minutes	Tens of minutes	$$	Minimal core infra running in DR region
Warm Standby	Seconds	Minutes	$$$	Scaled-down full env running continuously
Active-Active	≈ 0	≈ 0	$$$$	Full production in multiple regions

Quick Reference — ELB Type Selection

Load Balancer	Layer	Protocol	Use Case
ALB	7 (Application)	HTTP, HTTPS, gRPC	Microservices, containers, path/host routing, WAF
NLB	4 (Transport)	TCP, UDP, TLS	Ultra-low latency, static IP, millions RPS, PrivateLink
GWLB	3 (Network)	GENEVE	Virtual appliances: firewalls, IDS/IPS, DPI
CLB	4 & 7	TCP, HTTP, HTTPS	Legacy only — use ALB or NLB for new deployments

Memory Hooks

Six memorable hooks to lock in the hardest Domain 2 concepts. Each hook is designed to stick under exam pressure.

🪜

DR Cost Ladder

Backup → Pilot → Warm → Active

Cost goes UP the ladder. RTO/RPO goes DOWN. Match the business requirement to the right rung. "Can't afford downtime?" → climb to Active-Active. "Non-critical data?" → stay at Backup & Restore.

📬

SQS vs SNS vs EventBridge

Queue · Shout · Events with Rules

SQS = Queue (one consumer pulls, durable). SNS = Shout (pushes to many at once). EventBridge = Events with Rules (filter and route to targets). Fan-out pattern = SNS → multiple SQS queues.

🔀

ALB vs NLB

Application = Layer 7 · Network = Layer 4

ALB = Application: HTTP/HTTPS, path routing, WAF, content-based. NLB = Network: TCP/UDP, ultra-low latency, static IP. Wrong layer = wrong answer on the exam every time.

🗄️

RDS Multi-AZ vs Read Replicas

HA vs Performance

Multi-AZ = HA: synchronous standby, auto-failover, standby does NOT serve reads. Read Replicas = Performance: asynchronous, for reads only, can be cross-region, must manually promote to standalone.

⚡

Lambda Limits to Know

15 · 10GB · 1,000 · 50MB/10GB

Max 15 min execution. Max 10 GB memory. Default 1,000 concurrent executions per region. Package: 50 MB zip / 10 GB container. Cold starts → fix with provisioned concurrency.

🏥

Route 53 Failover

Failover = Health Check Required

Failover routing policy only works if you configure a health check on the primary endpoint. When the primary fails the health check, Route 53 automatically routes to secondary. No health check = no failover.

Flashcards

8 cards covering the most-tested Domain 2 concepts.

4 DR Strategies — Cheapest to Most Expensive

Backup & Restore → Pilot Light → Warm Standby → Active-Active. Cost increases as you go right. RTO and RPO shrink as cost increases. Match business downtime tolerance to the right strategy.

SQS Visibility Timeout — What Happens When It Expires?

The message becomes visible again in the queue and another consumer can pick it up. It means the original consumer failed to process and delete it within the timeout. Increase timeout for long-running processing jobs to prevent duplicate processing.

RDS Multi-AZ — Can the Standby Serve Read Traffic?

No — the standby is purely for automatic failover (HA). It uses synchronous replication and does not accept read traffic. For read scaling, use Read Replicas (separate read endpoint, async replication, can be cross-region).

Lambda Maximum Execution Timeout

15 minutes maximum. For tasks longer than 15 min, use Step Functions to orchestrate multiple Lambda calls in sequence, or use Fargate — containers have no execution time limit.

ALB Path-Based Routing — How Does It Work?

Listener rules evaluate incoming HTTP requests by path. /api/* routes to one target group, /images/* to another. Rules have priority order; requests that match no rule go to the default action. Each target group can have its own auto scaling and health check.
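
The rule evaluation described in this card can be sketched with glob-style matching. The rule list, priorities, and target group names below are hypothetical; rules are checked from the lowest priority number upward, and anything unmatched falls through to the default action.

```python
from fnmatch import fnmatchcase

# (priority, path pattern, target group); names are illustrative placeholders.
RULES = [
    (20, "/images/*", "image-targets"),
    (10, "/api/*", "api-targets"),
]
DEFAULT_TARGET_GROUP = "web-targets"


def route(path: str) -> str:
    """Return the target group for a request path."""
    for _priority, pattern, target_group in sorted(RULES):
        if fnmatchcase(path, pattern):
            return target_group
    return DEFAULT_TARGET_GROUP


api = route("/api/orders/7")
image = route("/images/logo.png")
fallback = route("/index.html")
bare_api = route("/api")  # no trailing segment, so "/api/*" does not match
```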

Route 53 Weighted Routing — Use Cases

A/B testing (10% new version, 90% old), blue-green deployment (gradually shift traffic), multi-region load distribution. Set weight to 0 to stop sending traffic to a record without deleting it.
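
Each record's share of traffic is its weight divided by the sum of all weights in the group. The helper below works through that arithmetic, assuming at least one record has a non-zero weight; the record names are made up.

```python
def traffic_share(weights: dict) -> dict:
    """Return each record's percentage of traffic from its weight.

    Assumes at least one weight is greater than zero.
    """
    total = sum(weights.values())
    return {name: weight * 100 / total for name, weight in weights.items()}


shares = traffic_share({"current": 90, "canary": 10, "retired": 0})
```

Here `current` receives 90% of queries, `canary` 10%, and `retired` none, even though its record still exists.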

SNS Fan-Out Pattern

One SNS topic → multiple SQS queue subscriptions. Each SQS queue has its own independent consumer. The publisher sends one message; SNS delivers copies to every subscribed SQS queue simultaneously. Decouples publisher from multiple independent processing pipelines.
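
A toy in-memory model makes the fan-out behaviour concrete: one publish call, one copy per subscribed queue, and each queue drained independently. `ToyTopic` and the queue names are hypothetical stand-ins, not the SNS or SQS APIs.

```python
from collections import deque


class ToyTopic:
    """In-memory stand-in for an SNS topic with SQS queue subscriptions."""

    def __init__(self):
        self.queues = {}

    def subscribe(self, name: str) -> deque:
        """Create a queue subscribed to this topic and return it."""
        self.queues[name] = deque()
        return self.queues[name]

    def publish(self, message: dict) -> None:
        """Deliver a separate copy of the message to every subscribed queue."""
        for queue in self.queues.values():
            queue.append(dict(message))


topic = ToyTopic()
billing = topic.subscribe("billing")
shipping = topic.subscribe("shipping")

topic.publish({"order_id": 42})

processed = billing.popleft()  # billing consumes its copy; shipping is unaffected
```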

Step Functions: Standard vs Express Workflows

Standard: up to 1 year duration, exactly-once execution, durable, $0.025 per 1K state transitions. Best for long-running critical workflows.
Express: up to 5 min, at-least-once execution, much cheaper for high-volume async workloads (streaming, IoT).

Study Advisor

Personalized study guidance based on where you are in your SAA-C03 exam preparation.

Beginners

Start with SQS vs SNS — create an SQS queue in the AWS console and configure a Lambda trigger to understand the pull-based model
Learn the difference between stateful and stateless application design; understand why stateless apps scale horizontally
Walk through the AWS Well-Architected Framework Reliability Pillar to understand design principles at a high level
Practice identifying what "decoupling" means: producer does not need to know anything about consumers
Read the SQS documentation to understand visibility timeout, DLQ, and message lifecycle hands-on

Resources

Official and recommended resources for mastering SAA-C03 Domain 2.

AWS Official Documentation

AWS Well-Architected Framework — Reliability Pillar — docs.aws.amazon.com
Amazon SQS Developer Guide — docs.aws.amazon.com/sqs/
AWS Lambda Developer Guide (concurrency, limits, triggers) — docs.aws.amazon.com
Amazon Route 53 — Routing Policies Documentation — docs.aws.amazon.com
AWS Disaster Recovery Whitepaper — docs.aws.amazon.com
Elastic Load Balancing Documentation (ALB, NLB, GWLB) — docs.aws.amazon.com
Amazon ECS Developer Guide — docs.aws.amazon.com

Exam Certification Page

AWS Certified Solutions Architect – Associate — aws.amazon.com/certification

Key Concepts to Review

Must-Know for Exam Day

DR strategy RTO/RPO ranges for all 4 strategies
ALB vs NLB vs GWLB layer and protocol
SQS DLQ setup and maxReceiveCount
Route 53 failover health check requirement
Lambda concurrency limits and provisioned concurrency
RDS Multi-AZ does NOT serve read traffic

Common Exam Traps

Route 53 failover without a health check = no failover
RDS standby replica looks tempting for reads — it's wrong
NLB does not support path-based routing (that's ALB)
Lambda max timeout is 15 min — Step Functions for longer
SNS is push-based, SQS is pull-based
GWLB is for virtual appliances, not general HTTP traffic

Not affiliated with Amazon Web Services. AWS® is a registered trademark of Amazon.com, Inc.
