FlashGenius · AWS SAA-C03

Design Resilient Architectures

SAA-C03 · Domain 2 · 26% of Exam

Domain 2 of the AWS SAA-C03 exam — the second largest domain at 26%. Covers loosely coupled architectures, serverless, containers, load balancing, auto scaling, high availability, and disaster recovery strategies.

Domain Breakdown — SAA-C03 Task Statements

Task Statement | Focus | Key Services
2.1 — Scalable & Loosely Coupled | Microservices, messaging, serverless, containers | SQS, SNS, EventBridge, Lambda, ECS, EKS, API Gateway, Step Functions
2.2 — HA & Fault Tolerant | Multi-AZ, DR strategies, Route 53, Auto Scaling | ALB, NLB, GWLB, EC2 Auto Scaling, Route 53, RDS Multi-AZ, RDS Proxy

What You'll Master

Loose Coupling Patterns

SQS (queue-based decoupling), SNS (fan-out pub/sub), EventBridge (event-driven routing with rules), API Gateway (managed API front door). Understand when each pattern applies and how they combine.

Serverless Compute

Lambda — event triggers, execution limits (15 min), concurrency (1,000 default), provisioned concurrency for cold starts. Fargate — serverless containers. Step Functions — multi-step workflow orchestration.

Containers

ECS (EC2 launch type vs Fargate launch type), EKS (managed Kubernetes), ECR (container registry). Know ECS task definitions, service auto scaling, and when to choose EKS over ECS.

HA & Fault Tolerance

Multi-AZ vs Multi-Region deployments. ALB (Layer 7, path routing), NLB (Layer 4, ultra-low latency, static IP), GWLB (virtual appliances). EC2 Auto Scaling — target tracking, step, scheduled, predictive policies.

Disaster Recovery

RPO and RTO definitions. 4 DR strategies: Backup & Restore → Pilot Light → Warm Standby → Active-Active. Cost increases as RTO/RPO decreases. Match business requirement to correct strategy.

Route 53 & DNS

7 routing policies: Simple, Weighted, Latency, Failover, Geolocation, Geoproximity, Multivalue. Failover routing requires a health check. Health checks: endpoint, calculated, CloudWatch alarm.

Exam Strategy Tips

DR Strategy Selection

The exam heavily tests matching RPO/RTO requirements to the correct DR strategy. Know the cost ladder: Backup & Restore (cheapest, hours RTO) → Pilot Light → Warm Standby → Active-Active (most expensive, near-zero RTO/RPO).

ELB Type Selection

ALB = HTTP/HTTPS, Layer 7, content-based routing, WAF integration. NLB = TCP/UDP, Layer 4, static IP, ultra-low latency, millions of RPS. GWLB = third-party virtual appliances (firewalls, IDS). Wrong layer = wrong answer.

Messaging Service Selection

SQS = decouple one producer from one (or competing) consumers, pull-based, durable. SNS = fan-out to multiple subscribers at once, push-based. EventBridge = event-driven routing with filtering rules, replaces CloudWatch Events.

Key Services — Domain 2

Amazon SQS · Amazon SNS · Amazon EventBridge · Amazon API Gateway · AWS Lambda · AWS Fargate · AWS Step Functions · Amazon ECS · Amazon EKS · Amazon ECR · ALB / NLB / GWLB · EC2 Auto Scaling · Amazon Route 53 · Amazon RDS Multi-AZ · RDS Read Replicas · RDS Proxy · AWS X-Ray

Core Concepts

Eight deep-dive concept cards covering every examinable topic in Domain 2. Study each section, then test yourself with the Practice Quiz and Flashcards sections.

1. Loose Coupling with SQS, SNS & EventBridge

  • SQS (Simple Queue Service) — decouples producers from consumers; messages persist until a consumer processes and deletes them; pull-based. Standard queue: best-effort ordering, at-least-once delivery. FIFO queue: strict ordering, exactly-once processing, 300 API calls per second per action by default (up to 3,000 messages per second with batching).
  • SQS Visibility Timeout — when consumer picks up a message, it becomes invisible for the timeout period (default 30s, max 12h). If not deleted before timeout expires, message reappears and another consumer can pick it up. Increase timeout for long-processing jobs.
  • SQS Dead Letter Queue (DLQ) — receives messages that fail processing after maxReceiveCount retries. Use for debugging and isolating failed messages. Set a redrive policy on the source queue.
  • SNS (Simple Notification Service) — push-based pub/sub; one message → many subscribers (SQS, Lambda, HTTP, email, SMS, mobile push). Fan-out pattern = SNS → multiple SQS queues.
  • Amazon EventBridge — event bus for AWS service events, custom events, and SaaS events. Rules route events to targets (Lambda, SQS, Step Functions). Replaces CloudWatch Events. Schema registry for event discovery.
  • SQS vs SNS vs EventBridge — SQS = queue/decouple one consumer; SNS = fan-out/push to multiple; EventBridge = event-driven routing with rules and filtering.
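
The visibility timeout and dead-letter queue behaviour described above can be sketched as a small in-memory simulation. This is a toy model, not the SQS API: `ToyQueue`, `Message`, and the simulated `now` clock are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    body: str
    receive_count: int = 0
    invisible_until: float = 0.0  # simulated time at which it reappears


@dataclass
class ToyQueue:
    visibility_timeout: float = 30.0
    max_receive_count: int = 3
    messages: list = field(default_factory=list)
    dead_letters: list = field(default_factory=list)

    def send(self, body):
        self.messages.append(Message(body))

    def receive(self, now):
        """Return the first visible message and hide it for the timeout."""
        for msg in list(self.messages):
            if msg.invisible_until > now:
                continue  # still in flight with another consumer
            if msg.receive_count >= self.max_receive_count:
                # Retries exhausted: move the message to the dead-letter queue.
                self.messages.remove(msg)
                self.dead_letters.append(msg)
                continue
            msg.receive_count += 1
            msg.invisible_until = now + self.visibility_timeout
            return msg
        return None

    def delete(self, msg):
        """A consumer deletes the message after successful processing."""
        self.messages.remove(msg)


queue = ToyQueue(visibility_timeout=30.0, max_receive_count=2)
queue.send("order-1")
first = queue.receive(now=0.0)    # consumer A picks the message up
hidden = queue.receive(now=10.0)  # invisible to other consumers
second = queue.receive(now=31.0)  # never deleted, so it reappears
third = queue.receive(now=62.0)   # receive count exhausted, goes to the DLQ
```

In this run the consumer never calls `delete`, so the message is delivered twice and then lands in `dead_letters`, which mirrors a redrive policy with `maxReceiveCount` set to 2.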

2. API Gateway & Microservices Patterns

  • Amazon API Gateway — managed REST, HTTP, and WebSocket APIs. Integrates with Lambda (serverless backends), HTTP backends, and AWS services. Handles throttling, authentication, caching, and SSL termination.
  • API Gateway types — REST API (full features, caching, usage plans); HTTP API (lower latency, lower cost, fewer features); WebSocket API (real-time two-way communication).
  • Throttling — default 10,000 RPS per account; per-stage and per-method throttling; 429 Too Many Requests when exceeded. Use SQS to absorb spikes behind API Gateway.
  • Microservices design — stateless services (no server-side session) scale horizontally. Each service owns its data store. Communicate via APIs or events. Containers are ideal for packaging.
  • Multi-tier architecture — presentation → application → data tiers. Use ELB between tiers. Each tier scales independently. Private subnets for app and data tiers.
  • AWS Step Functions — orchestrate multi-step workflows. Standard workflows: durable, exactly-once, up to 1 year. Express workflows: high-volume, at-least-once, up to 5 minutes. Visual state machine. Integrates with Lambda, ECS, SQS, SNS, DynamoDB.
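
AWS documents API Gateway throttling as a token bucket, with a steady-state rate and a burst capacity. The sketch below shows that idea in plain Python. The `TokenBucket` class, its parameters, and the simulated `now` clock are assumptions for illustration, not API Gateway internals.

```python
class TokenBucket:
    """Toy token bucket: `rate_per_second` refills, `burst` caps the bucket."""

    def __init__(self, rate_per_second, burst):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        """Return 200 if a token is available, else 429 Too Many Requests."""
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never above the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 200
        return 429


bucket = TokenBucket(rate_per_second=2, burst=3)
statuses = [bucket.allow(now=0.0) for _ in range(4)]  # four requests at t=0
later = bucket.allow(now=1.0)                         # one second later
```

The fourth simultaneous request is rejected because the burst of three is spent. Putting an SQS queue behind the API is one way to absorb such spikes instead of rejecting them.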

3. Serverless: Lambda & Fargate

  • AWS Lambda — runs code without provisioning servers. Event-driven triggers: S3, SQS, API Gateway, DynamoDB Streams, Kinesis, EventBridge. Max 15 min execution. Memory 128 MB–10 GB. Max deployment package 50 MB (zip) / 10 GB (container image).
  • Lambda concurrency — default 1,000 concurrent executions per region. Reserved concurrency = guarantee + cap for a function. Provisioned concurrency = pre-warmed instances to eliminate cold starts.
  • Lambda cold starts — container initialization delay on first invocation or after idle period. Mitigate with provisioned concurrency. Smaller packages and fewer imports also reduce cold start time.
  • Lambda layers — share code/libraries across multiple functions. Up to 5 layers per function. Store dependencies separately from function code to reduce package size.
  • AWS Fargate — serverless container execution for ECS or EKS. No EC2 to manage. Pay per vCPU and memory per second. Ideal for variable or unpredictable container workloads. No cluster management overhead.
  • Lambda vs Fargate — Lambda: short tasks ≤15 min, event-driven, pay-per-invocation. Fargate: longer running, container-based, pay per resource-time, no execution time limit.
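
To judge whether a workload fits within the default 1,000 concurrent executions, a common estimate is concurrency = requests per second × average duration in seconds. The helper names below are hypothetical and the traffic figures are made-up examples.

```python
import math

DEFAULT_REGIONAL_LIMIT = 1000  # default concurrent executions per region


def estimated_concurrency(requests_per_second, avg_duration_seconds):
    """Estimate how many executions run at the same time."""
    return math.ceil(requests_per_second * avg_duration_seconds)


def exceeds_default_limit(requests_per_second, avg_duration_seconds,
                          limit=DEFAULT_REGIONAL_LIMIT):
    """True if the estimate would be throttled at the given limit."""
    return estimated_concurrency(requests_per_second, avg_duration_seconds) > limit


steady = estimated_concurrency(200, 0.5)    # 200 rps, 500 ms each
spike = estimated_concurrency(5000, 0.25)   # 5,000 rps, 250 ms each
```

Here the steady load needs roughly 100 concurrent executions, while the spike needs about 1,250 and would be throttled unless the limit is raised.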

4. Containers: ECS & EKS

  • Amazon ECS — AWS-native container orchestration. Task definition (JSON config for containers: image, CPU, memory, port mappings). Service (desired count, auto-scaling). Two launch types: EC2 (you manage instances) and Fargate (serverless).
  • ECS service auto scaling — scales tasks based on CPU%, memory%, or ALB request count per target. Integrates with Application Auto Scaling.
  • Amazon EKS — managed Kubernetes on AWS. Control plane managed by AWS. Worker nodes are EC2 instances or Fargate pods. Use when you need Kubernetes compatibility or have existing K8s workloads.
  • Amazon ECR — fully managed container registry. Integrates with ECS, EKS, and Lambda. Lifecycle policies auto-remove old images. Encryption at rest with KMS. Image scanning for vulnerabilities.
  • ECS vs EKS — ECS: AWS-native, simpler setup, tightly integrated with AWS services. EKS: Kubernetes standard, multi-cloud portability, steeper learning curve but more flexibility.
  • Container networking — ECS task networking uses awsvpc mode: each task gets its own ENI and IP address. Security groups applied at task level. EKS uses VPC CNI plugin.
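
An ECS task definition is a JSON document. The sketch below builds a minimal one for the Fargate launch type as a Python dictionary. The family name, container name, account ID, and image URI are placeholder assumptions, and a real task definition would usually also set an execution role and logging.

```python
import json

# Minimal Fargate task definition (all names and the image URI are placeholders).
task_definition = {
    "family": "orders-api",
    "networkMode": "awsvpc",               # each task gets its own ENI and IP
    "requiresCompatibilities": ["FARGATE"],
    "cpu": "256",                          # task-level CPU units
    "memory": "512",                       # task-level memory in MiB
    "containerDefinitions": [
        {
            "name": "web",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/orders-api:latest",
            "essential": True,
            "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
        }
    ],
}

document = json.dumps(task_definition, indent=2)
```

The resulting `document` string is the kind of JSON that would be registered as a task definition, after which an ECS service keeps the desired count of tasks running.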

5. Elastic Load Balancing

  • ALB (Application Load Balancer) — Layer 7; HTTP/HTTPS/gRPC; path-based and host-based routing; sticky sessions; target groups (EC2, Lambda, containers, IPs); WAF integration. Ideal for microservices and containerized apps.
  • NLB (Network Load Balancer) — Layer 4; TCP/UDP/TLS; ultra-low latency; static IP or Elastic IP per AZ; handles millions of RPS. Ideal for gaming, IoT, financial applications. Supports AWS PrivateLink.
  • GWLB (Gateway Load Balancer) — Layer 3; designed for third-party virtual network appliances (firewalls, IDS/IPS, deep packet inspection). Uses GENEVE protocol. Bump-in-the-wire traffic inspection pattern.
  • CLB (Classic Load Balancer) — legacy; Layer 4 and 7. Avoid for new architectures; migrate to ALB or NLB.
  • Cross-zone load balancing — ALB: enabled by default (no extra charge). NLB: disabled by default (charges apply when enabled). Distributes traffic evenly across all registered targets across all AZs.
  • ELB health checks — ALB pings target HTTP/HTTPS endpoint; NLB uses TCP, HTTP, or HTTPS. Unhealthy targets removed from rotation automatically. Configure healthy/unhealthy thresholds appropriately.
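
ALB path-based routing can be pictured as an ordered rule list with a default action. The following is a simplified sketch that uses Python's `fnmatchcase` for wildcard matching. The rule priorities and target group names are invented, and real listener rules support more condition types than a path pattern.

```python
from fnmatch import fnmatchcase

# (priority, path pattern, target group); lower priority is evaluated first.
RULES = [
    (20, "/images/*", "image-targets"),
    (10, "/api/*", "api-targets"),
]
DEFAULT_TARGET_GROUP = "web-targets"


def route(path, rules=RULES, default=DEFAULT_TARGET_GROUP):
    """Return the target group of the first matching rule, else the default."""
    for _priority, pattern, target_group in sorted(rules):
        if fnmatchcase(path, pattern):
            return target_group
    return default


api_target = route("/api/orders")
image_target = route("/images/logo.png")
fallback_target = route("/index.html")
```

An NLB has no equivalent of this step, because it forwards at Layer 4 and never inspects the HTTP path.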

6. EC2 Auto Scaling

  • Auto Scaling Group (ASG) — min/desired/max capacity. Launches or terminates EC2 instances based on policies. Uses Launch Template (preferred over deprecated Launch Configuration).
  • Scaling policies — Target Tracking (most common: maintain a metric, e.g., 50% CPU); Step Scaling (scale by different amounts at different thresholds); Scheduled (predictable patterns); Predictive (ML-based demand forecasting).
  • Warm-up period — time for a new instance to contribute to the metric before ASG considers it "live." Prevents over-provisioning due to metric spikes during boot initialization.
  • Cooldown period — default 300s after a scaling activity; prevents rapid back-to-back scaling actions (thrashing). Target tracking policies manage their own cooldown automatically.
  • Lifecycle hooks — pause instance launch or termination to perform custom actions (install software, drain connections, take snapshot). Integrates with EventBridge, SQS, and SNS for notifications.
  • Spot instances in ASG — use mixed instances policy. Combine On-Demand base capacity with Spot for cost savings. Use capacity-optimized allocation strategy. Handle Spot interruptions gracefully via lifecycle hooks.
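
The arithmetic behind a target tracking policy can be approximated as a proportional adjustment clamped to the group's min and max. This is a simplified model for intuition only. The real policy is driven by CloudWatch alarms, and the function name and example figures are assumptions.

```python
import math


def target_tracking_desired(current_capacity, metric_value, target_value,
                            min_size, max_size):
    """Scale capacity in proportion to metric/target, then clamp to the ASG bounds."""
    proportional = math.ceil(current_capacity * metric_value / target_value)
    return max(min_size, min(max_size, proportional))


# 4 instances at 80% CPU with a 50% target: scale out.
scale_out = target_tracking_desired(4, 80, 50, min_size=2, max_size=12)
# 4 instances at 20% CPU: scale in, but never below the minimum.
scale_in = target_tracking_desired(4, 20, 50, min_size=2, max_size=12)
# 10 instances at 90% CPU: the maximum caps the result.
capped = target_tracking_desired(10, 90, 50, min_size=2, max_size=12)
```

Rounding up on the way out errs toward availability, which matches the general guidance that scale-in should be more conservative than scale-out.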

7. High Availability: Multi-AZ, Route 53 & HA Patterns

  • Multi-AZ design — deploy across ≥2 AZs. Use ELB to distribute traffic. Stateless app layer (no server-side session) scales easily. RDS Multi-AZ = synchronous standby replica in another AZ; automatic failover in ~60–120s. Standby does NOT serve read traffic.
  • Route 53 routing policies — Simple (one resource); Weighted (% split for A/B testing, migration); Latency (the Region with the lowest latency for the user); Failover (primary/secondary, requires health check); Geolocation (user's country or continent); Geoproximity (bias toward a location); Multivalue (multiple healthy IPs, up to 8).
  • Route 53 health checks — Endpoint (HTTP/HTTPS/TCP), Calculated (aggregate multiple child checks), CloudWatch Alarm. Used with Failover routing for automatic DNS failover. Health checkers come from multiple regions.
  • RDS Read Replicas vs Multi-AZ — Read Replicas: asynchronous replication, improve READ performance, can be cross-region, must manually promote to standalone. Multi-AZ: synchronous, HA and automatic failover only, no read traffic from standby.
  • RDS Proxy — connection pooler for RDS and Aurora. Reduces connection overhead especially for Lambda (which opens a new DB connection per invocation). Failover time up to 66% faster. Enforces IAM authentication. Deployed within VPC.
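
The failover routing decision reduces to a small piece of logic, sketched below. The function and endpoint names are hypothetical. The `None` case models a primary record with no health check attached, which Route 53 treats as always healthy.

```python
def resolve_failover(primary, secondary, primary_healthy):
    """Pick the endpoint a failover record set would answer with.

    primary_healthy: True/False from a health check, or None if the
    primary record has no health check attached.
    """
    if primary_healthy is None:
        return primary  # no health check: the primary is never marked down
    return primary if primary_healthy else secondary


normal = resolve_failover("us-east-1.example.com", "us-west-2.example.com", True)
failed_over = resolve_failover("us-east-1.example.com", "us-west-2.example.com", False)
stuck = resolve_failover("us-east-1.example.com", "us-west-2.example.com", None)
```

The `stuck` case is the classic exam trap: the primary is down, yet DNS keeps answering with it because nothing ever reports it as unhealthy.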

8. Disaster Recovery Strategies

  • RPO (Recovery Point Objective) — maximum acceptable data loss measured as time since last backup. "How old can the restored data be?"
  • RTO (Recovery Time Objective) — maximum acceptable downtime before service is restored. "How long can the business tolerate the outage?"
  • Backup & Restore — lowest cost, highest RTO/RPO. Backup data to S3 or Glacier; restore when disaster occurs. RPO = hours; RTO = hours. Suitable for non-critical workloads.
  • Pilot Light — minimal core infrastructure always running in DR region (e.g., RDS read replica). Scale up quickly in disaster. RPO = minutes; RTO = tens of minutes. Medium cost.
  • Warm Standby — scaled-down but fully functional environment in DR region, running continuously. Scale up to production capacity on disaster. RPO = seconds; RTO = minutes. Higher cost.
  • Active-Active (Multi-Site) — full production load in two or more regions simultaneously. Route 53 distributes traffic. Instant failover. RPO ≈ 0; RTO ≈ 0. Highest cost.
  • Selection guide — cost priority → Backup & Restore. Balanced trade-off → Pilot Light or Warm Standby. Zero downtime required → Active-Active. Always match the business's RPO/RTO budget to the correct strategy.
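
The selection guide above can be expressed as a lookup that walks the cost ladder and returns the cheapest strategy meeting both targets. The numeric thresholds are illustrative assumptions chosen to match the rough orders of magnitude in the list (hours, minutes, seconds). They are not AWS-published figures.

```python
# (strategy, achievable RPO in seconds, achievable RTO in seconds),
# ordered from cheapest to most expensive. Values are illustrative only.
STRATEGIES = [
    ("Backup & Restore", 3600, 3600),  # hours
    ("Pilot Light", 300, 1800),        # minutes / tens of minutes
    ("Warm Standby", 30, 300),         # seconds / minutes
    ("Active-Active", 0, 0),           # near zero
]


def cheapest_strategy(required_rpo_s, required_rto_s):
    """Return the first (cheapest) strategy that meets both requirements."""
    for name, rpo, rto in STRATEGIES:
        if rpo <= required_rpo_s and rto <= required_rto_s:
            return name
    return "Active-Active"  # fallback for requirements nothing else satisfies


archive = cheapest_strategy(24 * 3600, 24 * 3600)  # a day of loss and downtime is fine
reporting = cheapest_strategy(600, 3600)           # 10 min RPO, 1 h RTO
checkout = cheapest_strategy(60, 600)              # 1 min RPO, 10 min RTO
trading = cheapest_strategy(0, 0)                  # no loss, no downtime
```

With different threshold assumptions the boundaries shift, so treat the output as a way to practise the reasoning rather than as an answer key.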

Quick Reference — DR Strategies Comparison

Strategy | RPO | RTO | Cost | Description
Backup & Restore | Hours | Hours | $ | Backup to S3/Glacier; restore on disaster
Pilot Light | Minutes | Tens of minutes | $$ | Minimal core infra running in DR region
Warm Standby | Seconds | Minutes | $$$ | Scaled-down full env running continuously
Active-Active | ≈ 0 | ≈ 0 | $$$$ | Full production in multiple regions

Quick Reference — ELB Type Selection

Load Balancer | Layer | Protocol | Use Case
ALB | 7 (Application) | HTTP, HTTPS, gRPC | Microservices, containers, path/host routing, WAF
NLB | 4 (Transport) | TCP, UDP, TLS | Ultra-low latency, static IP, millions RPS, PrivateLink
GWLB | 3 (Network) | GENEVE | Virtual appliances: firewalls, IDS/IPS, DPI
CLB | 4 & 7 | TCP, HTTP, HTTPS | Legacy only — use ALB or NLB for new deployments

Memory Hooks

Six memorable hooks to lock in the hardest Domain 2 concepts. Each hook is designed to stick under exam pressure.

🪜
DR Cost Ladder
Backup → Pilot → Warm → Active
Cost goes UP the ladder. RTO/RPO goes DOWN. Match the business requirement to the right rung. "Can't afford downtime?" → climb to Active-Active. "Non-critical data?" → stay at Backup & Restore.
📬
SQS vs SNS vs EventBridge
Queue · Shout · Events with Rules
SQS = Queue (one consumer pulls, durable). SNS = Shout (pushes to many at once). EventBridge = Events with Rules (filter and route to targets). Fan-out pattern = SNS → multiple SQS queues.
🔀
ALB vs NLB
Application = Layer 7 · Network = Layer 4
ALB = Application: HTTP/HTTPS, path routing, WAF, content-based. NLB = Network: TCP/UDP, ultra-low latency, static IP. Wrong layer = wrong answer on the exam every time.
🗄️
RDS Multi-AZ vs Read Replicas
HA vs Performance
Multi-AZ = HA: synchronous standby, auto-failover, standby does NOT serve reads. Read Replicas = Performance: asynchronous, for reads only, can be cross-region, must manually promote to standalone.
Lambda Limits to Know
15 · 10GB · 1,000 · 50MB/10GB
Max 15 min execution. Max 10 GB memory. Default 1,000 concurrent executions per region. Package: 50 MB zip / 10 GB container. Cold starts → fix with provisioned concurrency.
🏥
Route 53 Failover
Failover = Health Check Required
Failover routing policy only works if you configure a health check on the primary endpoint. When the primary fails the health check, Route 53 automatically routes to secondary. No health check = no failover.

Practice Quiz

10 scenario-based questions modeled on SAA-C03 exam style.

Q1. An application processes orders and must ensure no orders are lost during failures, but consumers can process at different speeds. Which service provides the BEST decoupling?
Q2. A company runs a web application across three AZs and needs to distribute HTTPS traffic based on URL path (/api/* to one target group, /* to another). Which load balancer should they use?
Q3. A company needs a disaster recovery solution with an RPO of 1 minute and RTO of 10 minutes for its RDS database. Which strategy meets these requirements at the LOWEST cost?
Q4. A Lambda function processes messages from SQS. Some messages repeatedly fail processing and fill the queue. What should be implemented?
Q5. A company wants to implement a fan-out pattern where a single event triggers processing by multiple SQS queues simultaneously. What architecture should they use?
Q6. An EC2 Auto Scaling group needs to handle a predictable surge in traffic every Monday morning at 8 AM. Which scaling policy is MOST appropriate?
Q7. A company needs to orchestrate a multi-step order processing workflow where each step may take between 1 second and 30 minutes, with guaranteed exactly-once execution. Which service is BEST?
Q8. A Solutions Architect needs to deploy containers without managing EC2 instances, with per-second billing and support for ECS task networking (awsvpc). Which option should they choose?
Q9. An application uses Route 53 with a failover routing policy. The primary endpoint is in us-east-1 and the secondary is in us-west-2. The primary endpoint is down, but Route 53 is still routing traffic to it. What is the MOST likely cause?
Q10. A company requires that its RDS database connections from Lambda functions do not exceed database connection limits during traffic spikes. What should be implemented?

Flashcards

8 cards covering the most-tested Domain 2 concepts.

4 DR Strategies — Cheapest to Most Expensive
Backup & Restore → Pilot Light → Warm Standby → Active-Active. Cost increases as you go right. RTO and RPO shrink as cost increases. Match business downtime tolerance to the right strategy.
SQS Visibility Timeout — What Happens When It Expires?
The message becomes visible again in the queue and another consumer can pick it up. It means the original consumer failed to process and delete it within the timeout. Increase timeout for long-running processing jobs to prevent duplicate processing.
RDS Multi-AZ — Can the Standby Serve Read Traffic?
No — the standby is purely for automatic failover (HA). It uses synchronous replication and does not accept read traffic. For read scaling, use Read Replicas (separate read endpoint, async replication, can be cross-region).
Lambda Maximum Execution Timeout
15 minutes maximum. For tasks longer than 15 min, use Step Functions to orchestrate multiple Lambda calls in sequence, or use Fargate — containers have no execution time limit.
ALB Path-Based Routing — How Does It Work?
Listener rules evaluate incoming HTTP requests by path. /api/* routes to one target group, /images/* to another. Rules have priority order; requests that match no rule go to the default action. Each target group can have its own auto scaling and health check.
Route 53 Weighted Routing — Use Cases
A/B testing (10% new version, 90% old), blue-green deployment (gradually shift traffic), multi-region load distribution. Set weight to 0 to stop sending traffic to a record without deleting it.
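
The share of traffic each weighted record receives is its weight divided by the sum of all weights. A minimal sketch of that arithmetic follows. The record names are made up, and the all-zero case follows the documented Route 53 behaviour of treating every record as equally likely.

```python
def traffic_shares(weights):
    """Map each record name to its expected fraction of DNS responses."""
    total = sum(weights.values())
    if total == 0:
        # All weights zero: every record is returned with equal probability.
        return {name: 1 / len(weights) for name in weights}
    return {name: weight / total for name, weight in weights.items()}


canary = traffic_shares({"blue": 90, "green": 10})  # 90/10 split
drained = traffic_shares({"blue": 1, "green": 0})   # green kept but unused
```

Setting a single record's weight to 0, as in `drained`, stops new traffic to it while leaving the record in place for a quick rollback.
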
SNS Fan-Out Pattern
One SNS topic → multiple SQS queue subscriptions. Each SQS queue has its own independent consumer. The publisher sends one message; SNS delivers copies to every subscribed SQS queue simultaneously. Decouples publisher from multiple independent processing pipelines.
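
The fan-out pattern can be modelled in a few lines: one publish call, one copy per subscribed queue. `ToyTopic` and the queue names are hypothetical stand-ins, and `deque` plays the role of an SQS queue.

```python
from collections import deque


class ToyTopic:
    """Toy stand-in for an SNS topic with queue subscriptions."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, queue):
        self.subscribers.append(queue)

    def publish(self, message):
        """Push a copy of the message to every subscribed queue."""
        for queue in self.subscribers:
            queue.append(message)
        return len(self.subscribers)


billing_queue, shipping_queue = deque(), deque()
topic = ToyTopic()
topic.subscribe(billing_queue)
topic.subscribe(shipping_queue)

delivered = topic.publish("order-42")      # publisher sends once
billing_copy = billing_queue.popleft()     # billing consumes its copy
shipping_pending = list(shipping_queue)    # shipping's copy is unaffected
```

Each queue drains at its own pace, which is what lets a slow consumer fall behind without blocking the others or the publisher.
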
Step Functions: Standard vs Express Workflows
Standard: up to 1 year duration, exactly-once execution, durable, $0.025 per 1K state transitions. Best for long-running critical workflows.
Express: up to 5 min, at-least-once execution, much cheaper for high-volume async workloads (streaming, IoT).

Study Advisor

Personalized study guidance based on where you are in your SAA-C03 exam preparation.

Beginners

  • Start with SQS vs SNS — create an SQS queue in the AWS console and configure a Lambda trigger to understand the pull-based model
  • Learn the difference between stateful and stateless application design; understand why stateless apps scale horizontally
  • Walk through the AWS Well-Architected Framework Reliability Pillar to understand design principles at a high level
  • Practice identifying what "decoupling" means: producer does not need to know anything about consumers
  • Read the SQS documentation to understand visibility timeout, DLQ, and message lifecycle hands-on

Resources

Official and recommended resources for mastering SAA-C03 Domain 2.

Key Concepts to Review

Must-Know for Exam Day

  • DR strategy RTO/RPO ranges for all 4 strategies
  • ALB vs NLB vs GWLB layer and protocol
  • SQS DLQ setup and maxReceiveCount
  • Route 53 failover health check requirement
  • Lambda concurrency limits and provisioned concurrency
  • RDS Multi-AZ does NOT serve read traffic

Common Exam Traps

  • Route 53 failover without a health check = no failover
  • RDS standby replica looks tempting for reads — it's wrong
  • NLB does not support path-based routing (that's ALB)
  • Lambda max timeout is 15 min — Step Functions for longer
  • SNS is push-based, SQS is pull-based
  • GWLB is for virtual appliances, not general HTTP traffic

Not affiliated with Amazon Web Services. AWS® is a registered trademark of Amazon.com, Inc.
