Cloud+: Troubleshooting Cloud Issues CV0-004 Study Guide

Core Concepts

8 essential troubleshooting topics for Cloud+ CV0-004 Domain 5

01 Cloud Troubleshooting Methodology

Structured approach: Identify → Establish theory → Test theory → Establish action plan → Implement → Verify → Document
Start with: check recent changes (what changed before the problem started?), check logs, check metrics
Divide and conquer: narrow the problem domain (network, compute, application, storage, IAM)
Correlation: is it affecting one user or all? One region or all? One service or all?
Tools: AWS CloudTrail (API calls), CloudWatch (metrics/logs), VPC Flow Logs (network), AWS Config (resource config history)
Common mistake: assuming the most complex cause — check simple things first (security group rules, IAM permissions, service limits)
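The correlation question above (one user or all, one region or all, one service or all) can be sketched as a small helper. This is a minimal illustration, not a standard tool: the `common_factors` function and the report fields `user`, `region`, and `service` are hypothetical names chosen for the example.

```python
def common_factors(failures):
    """Return the dimensions on which every failing request agrees.

    A dimension shared by all failures is a good first place to look.
    """
    if not failures:
        return {}
    shared = {}
    for key in ("user", "region", "service"):
        values = {report[key] for report in failures}
        if len(values) == 1:
            shared[key] = values.pop()
    return shared


# Hypothetical failure reports collected during an incident.
failures = [
    {"user": "alice", "region": "us-east-1", "service": "checkout"},
    {"user": "bob", "region": "us-east-1", "service": "search"},
    {"user": "carol", "region": "us-east-1", "service": "checkout"},
]

print(common_factors(failures))
```

Here users and services vary but the region does not, which would point the investigation at something regional rather than at one user or one application.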

02 Troubleshooting Deployment Issues

Container fails to start: check image name/tag (typo?), registry credentials, resource limits (CPU/memory too low), missing ConfigMaps/Secrets
- kubectl describe pod: shows events and error messages
- kubectl logs pod-name: shows application stdout/stderr
- kubectl get events: shows cluster-level events
IaC apply fails: check error message first, verify credentials/permissions, check resource quota limits, look for dependency ordering issues
- Terraform: run terraform plan to see what's failing before apply
CloudFormation stack stuck in ROLLBACK: check stack events in console — find the FAILED event that triggered rollback
Auto Scaling fails to launch: check launch template (AMI valid in region?), IAM instance profile permissions, security group exists in VPC, subnet capacity
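The container start-up causes listed above map fairly directly onto the status reasons Kubernetes reports. The sketch below assumes a pod object shaped like the output of `kubectl get pod -o json`; the `container_reasons` and `triage` helpers and the wording of each hint are illustrative, not part of kubectl.

```python
# Reason strings reported by Kubernetes, mapped to the usual first suspect.
LIKELY_CAUSES = {
    "ErrImagePull": "image name/tag typo or registry credentials",
    "ImagePullBackOff": "image name/tag typo or registry credentials",
    "CreateContainerConfigError": "referenced ConfigMap or Secret is missing",
    "OOMKilled": "memory limit too low for the workload",
    "CrashLoopBackOff": "container starts then exits; read the container logs",
}


def container_reasons(pod):
    """Collect (container name, reason) pairs from current and last state."""
    reasons = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        for field in ("state", "lastState"):
            for phase in ("waiting", "terminated"):
                reason = status.get(field, {}).get(phase, {}).get("reason")
                if reason:
                    reasons.append((status["name"], reason))
    return reasons


def triage(reason):
    """Return a first-guess cause for a reason string."""
    return LIKELY_CAUSES.get(reason, "unrecognized: read the pod events")


# Hypothetical pod: restarting now, killed for memory on the previous run.
pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "app",
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled"}},
            }
        ]
    }
}

for name, reason in container_reasons(pod):
    print(f"{name}: {reason} -> {triage(reason)}")
```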

03 Troubleshooting Network Connectivity

Connectivity checklist (work through layers):
- 1. Security group: does it allow the traffic (port, protocol, source IP)?
- 2. NACL: are there both inbound and outbound rules? (NACLs are stateless, so return traffic needs its own rule)
- 3. Route table: is there a route to the destination (IGW for public, NAT for private)?
- 4. Internet gateway: is it attached to the VPC?
- 5. Elastic IP / Public IP: does the instance have one (for inbound from internet)?
- 6. Instance OS: is the service listening on the port? Is iptables blocking?
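Step 2 of the checklist is the one that most often surprises people, so here is a rough model of stateless NACL evaluation: rules are checked in ascending rule number, the first match wins, and anything unmatched is denied. The `NaclRule` class and `nacl_allows` function are hypothetical simplifications (no protocol field, IPv4 only), not an AWS API.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class NaclRule:
    number: int     # lower numbers are evaluated first
    cidr: str       # remote address range the rule applies to
    port_from: int
    port_to: int
    allow: bool


def nacl_allows(rules, remote_ip, port):
    """Evaluate rules in ascending order; first match wins, default deny."""
    addr = ipaddress.ip_address(remote_ip)
    for rule in sorted(rules, key=lambda r: r.number):
        in_range = addr in ipaddress.ip_network(rule.cidr)
        if in_range and rule.port_from <= port <= rule.port_to:
            return rule.allow
    return False  # the implicit final deny rule


client_ip, client_port = "203.0.113.10", 50000  # example client, ephemeral port

inbound = [NaclRule(100, "0.0.0.0/0", 443, 443, True)]
outbound_missing = []  # no rule for the reply traffic
outbound_fixed = [NaclRule(100, "0.0.0.0/0", 1024, 65535, True)]

request_ok = nacl_allows(inbound, client_ip, 443)
reply_blocked = nacl_allows(outbound_missing, client_ip, client_port)
reply_ok = nacl_allows(outbound_fixed, client_ip, client_port)
print(request_ok, reply_blocked, reply_ok)
```

The request is admitted inbound, but the reply to the client's ephemeral port is dropped until an outbound rule covers it. A stateful security group would have allowed the reply automatically.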
VPC Flow Logs: log accepted and rejected traffic — identify which layer is blocking
DNS issues: nslookup/dig to verify resolution — wrong IP? Missing record? TTL caching old value?
VPN/Direct Connect issues: check BGP session status, verify route propagation, check pre-shared keys
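To make the VPC Flow Logs step concrete, the sketch below parses records in the default (version 2) space-separated format and filters rejected traffic for a port. The sample records and the helper names `parse_flow_record` and `rejected` are made up for illustration.

```python
# Field order of the default (version 2) flow log record format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def parse_flow_record(line):
    """Split one default-format record into a field-name -> value dict."""
    return dict(zip(FIELDS, line.split()))


def rejected(lines, dstport):
    """Return parsed records that were REJECTed on the given destination port."""
    records = (parse_flow_record(line) for line in lines)
    return [
        r for r in records
        if r["action"] == "REJECT" and r["dstport"] == str(dstport)
    ]


# Illustrative records: SSH rejected, HTTPS accepted.
sample = [
    "2 123456789010 eni-1235b8ca123456789 203.0.113.12 172.31.16.139 "
    "49761 22 6 1 40 1418530010 1418530070 REJECT OK",
    "2 123456789010 eni-1235b8ca123456789 203.0.113.12 172.31.16.139 "
    "49762 443 6 20 4249 1418530010 1418530070 ACCEPT OK",
]

for record in rejected(sample, 22):
    print(record["srcaddr"], "->", record["dstaddr"], record["action"])
```

As a rule of thumb, an inbound ACCEPT paired with a REJECT on the reply suggests a NACL, because a security group is stateful and would have let the reply through.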

04 Troubleshooting Security Incidents

Leaked credentials: immediately rotate/deactivate compromised keys — check CloudTrail for actions with compromised credentials — look for: new IAM users, new access keys, data exfiltration (unusual S3 GetObject), EC2 launches in unusual regions
Privilege escalation: attacker gains higher permissions — check CloudTrail for iam:AttachUserPolicy, iam:CreateAccessKey, iam:PutUserPolicy
Cryptomining: unusual EC2 instance types (GPU), high CPU on unexpected instances, outbound connections to mining pools (GuardDuty: CryptoCurrency:EC2/BitcoinTool.B)
Compromised instance isolation: take snapshot, then isolate (remove from load balancer, change security group to deny all, do NOT terminate — preserve for forensics)
S3 data exfiltration: check S3 access logs, CloudTrail S3 events, check for public bucket ACLs, GetObject calls from unusual IPs
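The CloudTrail review described for leaked credentials and privilege escalation could look roughly like this once the log records are loaded as dictionaries. The field names follow the CloudTrail record layout (`eventSource`, `eventName`, `userIdentity.accessKeyId`); the `suspicious_events` helper, the watch list, and the sample records are assumptions for the example.

```python
# IAM actions worth a close look after a credential leak (illustrative list).
SUSPICIOUS_IAM_EVENTS = {
    "AttachUserPolicy",
    "PutUserPolicy",
    "CreateAccessKey",
    "CreateUser",
}


def suspicious_events(records, access_key_id):
    """Return (time, event, region) for watched IAM calls made with one key."""
    hits = []
    for record in records:
        identity = record.get("userIdentity", {})
        if identity.get("accessKeyId") != access_key_id:
            continue
        if record.get("eventSource") != "iam.amazonaws.com":
            continue
        if record.get("eventName") in SUSPICIOUS_IAM_EVENTS:
            hits.append(
                (record["eventTime"], record["eventName"], record.get("awsRegion"))
            )
    return sorted(hits)


LEAKED_KEY = "AKIAIOSFODNN7EXAMPLE"  # placeholder key ID, not a real credential

records = [
    {"eventTime": "2024-01-01T10:05:00Z", "eventSource": "iam.amazonaws.com",
     "eventName": "AttachUserPolicy", "awsRegion": "us-east-1",
     "userIdentity": {"accessKeyId": LEAKED_KEY}},
    {"eventTime": "2024-01-01T10:06:00Z", "eventSource": "s3.amazonaws.com",
     "eventName": "GetObject", "awsRegion": "us-east-1",
     "userIdentity": {"accessKeyId": LEAKED_KEY}},
    {"eventTime": "2024-01-01T10:07:00Z", "eventSource": "iam.amazonaws.com",
     "eventName": "CreateAccessKey", "awsRegion": "us-east-1",
     "userIdentity": {"accessKeyId": "AKIAI44QH8DHBEXAMPLE"}},
]

print(suspicious_events(records, LEAKED_KEY))
```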

05 Troubleshooting Service Disruptions

DNS failures: check Route 53 health checks, verify NS records, check TTL (cached old value?), test with nslookup/dig from multiple locations
DHCP issues: instances not getting IPs — check DHCP options set in VPC, subnet capacity (address exhaustion), check dhclient logs on Linux
NTP issues: clock skew causes TLS certificate errors and authentication failures (Kerberos; AWS Signature V4, which tolerates about 5 minutes of skew). Check chrony/ntpd status
Load balancer health check failures: check health check path (HTTP 200?), security group allows health check traffic from LB, instance responding on health check port
Certificate errors: TLS expired? Wrong domain? Self-signed? Check with: openssl s_client -connect host:443
Service limits/quotas: check Service Quotas console, request increases proactively

06 Troubleshooting Misconfigurations

S3 bucket public access: check bucket policy and ACL — use S3 Block Public Access settings at account level
IAM misconfiguration: overly permissive policies (Action: *, Resource: *) — use IAM Access Analyzer to find overly permissive policies, unused access
Security group misconfiguration: 0.0.0.0/0 on port 22 (SSH) or 3389 (RDP) — immediate remediation required
Missing tags: resources without required tags drop out of cost allocation reports. Detect them with the AWS Config managed rule required-tags
Wrong region: resource created in wrong region (common with console use) — check console region selector
Resource dependency issues: deleting resources with dependencies (VPC with EC2 still in it, SG attached to ENI) — must remove dependencies first
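The security group misconfiguration above (0.0.0.0/0 on port 22 or 3389) is simple to detect mechanically. This sketch assumes a group dictionary shaped like one entry of an EC2 DescribeSecurityGroups response (`IpPermissions`, `IpRanges`, `Ipv6Ranges`); the `open_admin_ports` helper and the sample group are illustrative.

```python
ADMIN_PORTS = {22: "SSH", 3389: "RDP"}


def open_admin_ports(group):
    """Return (group id, port, name) for admin ports open to the world."""
    findings = []
    for perm in group.get("IpPermissions", []):
        protocol = perm.get("IpProtocol")
        if protocol not in ("tcp", "-1"):  # "-1" means all protocols
            continue
        open_v4 = any(
            r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", [])
        )
        open_v6 = any(
            r.get("CidrIpv6") == "::/0" for r in perm.get("Ipv6Ranges", [])
        )
        if not (open_v4 or open_v6):
            continue
        low = perm.get("FromPort", 0)       # absent when all traffic is allowed
        high = perm.get("ToPort", 65535)
        for port, name in ADMIN_PORTS.items():
            if protocol == "-1" or low <= port <= high:
                findings.append((group["GroupId"], port, name))
    return findings


# Hypothetical group: SSH open to the world, RDP restricted to a private range.
group = {
    "GroupId": "sg-0123456789abcdef0",
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        {"IpProtocol": "tcp", "FromPort": 3389, "ToPort": 3389,
         "IpRanges": [{"CidrIp": "10.0.0.0/8"}]},
    ],
}

print(open_admin_ports(group))
```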

07 Performance Troubleshooting

High latency: check CloudWatch metrics (latency P99/P95/P50), check service limits, database connection pool exhaustion, memory pressure causing swap
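CPU throttling: burstable instances (T3/T2) earn and spend CPU credits. In standard mode, an instance with an exhausted balance is throttled to its baseline; in unlimited mode (the T3 default) it keeps bursting and accrues extra charges instead. Check the CPUCreditBalance metric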
Memory pressure: check CloudWatch agent metrics (mem_used_percent), OOM killer in logs (dmesg | grep -i oom)
Storage I/O: EBS IOPS limits — check VolumeReadOps/VolumeWriteOps, BurstBalance for gp2. Upgrade to gp3 or io2 for consistent IOPS
Network throughput: check NetworkIn/NetworkOut against instance network bandwidth limits
Database: slow query logs, check for missing indexes, connection pool limits, read replica lag
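The CPU credit behaviour described above comes down to simple arithmetic: one CPU credit is one vCPU running at 100% for one minute. The sketch below estimates how long a standard-mode instance can sustain a load before its balance runs out. The function name and the example figures (144 credits banked, 12 earned per hour, 2 vCPUs) are illustrative inputs, so check the real values for your instance type.

```python
def hours_until_throttled(balance, earned_per_hour, vcpus, utilization):
    """Estimate hours until the credit balance reaches zero.

    utilization is the average per-vCPU load as a fraction (0.5 = 50%).
    Returns None when the load is at or below what the earn rate sustains.
    """
    spent_per_hour = vcpus * utilization * 60  # credits are vCPU-minutes
    net_drain = spent_per_hour - earned_per_hour
    if net_drain <= 0:
        return None
    return balance / net_drain


# Illustrative inputs: 2 vCPUs at 50% spend 60 credits/hour against 12 earned.
print(hours_until_throttled(balance=144, earned_per_hour=12, vcpus=2, utilization=0.5))

# At 5% utilization the instance earns more than it spends.
print(hours_until_throttled(balance=144, earned_per_hour=12, vcpus=2, utilization=0.05))
```

With these inputs the net drain is 48 credits per hour, so a 144-credit balance lasts three hours before throttling begins.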

08 Common Cloud+ Exam Troubleshooting Scenarios

"Cannot SSH to EC2": SG rule missing port 22, key pair wrong, instance in private subnet without bastion/SSM, OS firewall blocking
"Application returns 502 Bad Gateway": load balancer cannot reach backend — health checks failing, backend down, security group blocks LB to instance
"S3 access denied": IAM policy missing s3:GetObject, bucket policy denying, object ACL blocking, KMS key policy not allowing decrypt
"Lambda timeout": function runs longer than configured timeout (max 15 min), increase timeout, optimize code, or break into smaller functions
"RDS connection refused": security group not allowing application subnet to DB port 5432/3306, wrong connection string (hostname/port), RDS instance stopped
"Terraform state lock": previous apply didn't release DynamoDB lock — delete lock item manually or use terraform force-unlock

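Several of the scenarios above ("Cannot SSH", "RDS connection refused") start with the same question: does a TCP connection to the port succeed, get refused, or time out? The probe below is a minimal sketch using only the standard library; the `probe` name and result strings are arbitrary. As a rough rule, a timeout points at something silently dropping packets (security group, NACL, routing), while a refusal means the host answered but nothing accepted the connection on that port.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify a TCP connection attempt as open, refused, timeout, or error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"   # host reachable, port closed or actively rejected
    except socket.timeout:
        return "timeout"   # packets dropped somewhere along the path
    except OSError as exc:
        return f"error: {exc}"  # DNS failure, unreachable network, etc.


if __name__ == "__main__":
    # Hypothetical target; replace with the host and port under investigation.
    print(probe("127.0.0.1", 22, timeout=1.0))
```
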
Study Advisor

Personalized study path based on your role

Study Path: Cloud Administrator

Master the Troubleshooting Methodology: Internalize the 7-step structured approach. Every exam scenario expects you to follow this flow: identify, theorize, test, plan, implement, verify, document.
Deep Dive, Network Connectivity: Drill the SG-NACL-Route-IGW-IP-OS checklist until it's automatic. Know that NACLs are stateless (both directions required) while SGs are stateful.
Practice Deployment Troubleshooting: Set up a test Kubernetes cluster. Run kubectl describe pod, kubectl logs, and kubectl get events until you can diagnose a failing pod from the output alone.
Study IAM and S3 Misconfiguration Patterns: Review the common IAM overpermission patterns. Practice reading bucket policies and identifying what makes an S3 bucket public. Know the Block Public Access hierarchy.
Learn Service Quotas and Limits: Know which quotas are adjustable and which are fixed. VPCs per Region and Elastic IPs per Region, for example, are adjustable defaults rather than hard limits. Know the process for requesting increases. This is frequently tested.
Practice with CloudTrail and CloudWatch: Enable CloudTrail in a test account. Generate various API calls and read the logs. Practice correlating events to problems. Set up CloudWatch alarms on CPUCreditBalance.
Review All 10 Quiz Questions: Focus on questions you got wrong. Re-read the explanations. Understand the "why", not just the answer; the exam presents novel scenarios requiring applied understanding.

Study Path: Site Reliability Engineer

Focus on Performance Troubleshooting First: As an SRE you already know performance patterns. Map your existing knowledge to cloud-specific metrics: CPUCreditBalance, BurstBalance, NetworkIn/Out limits, P99 latency in CloudWatch.
Master HTTP Error Code Semantics: Know 502/503/504 cold. The exam presents ALB scenarios where you must distinguish between bad gateway (backend broken), unavailable (no healthy targets), and timeout (too slow).
Study Security Incident Response: The RCI pattern (Rotate, Check CloudTrail, Isolate) is heavily tested. Know privilege escalation indicators in CloudTrail: iam:AttachUserPolicy, iam:CreateAccessKey, iam:PutUserPolicy.
Review VPC Flow Logs Analysis: Practice reading Flow Log entries. Know how to distinguish ACCEPT vs REJECT, and how to use those to determine whether SG or NACL is blocking traffic.
Study IaC Troubleshooting: Terraform state lock (force-unlock, DynamoDB), CloudFormation ROLLBACK (check stack events for first FAILED), and common error patterns. Run real terraform plan/apply failures in a sandbox.
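Review NTP and Clock Skew Scenarios: AWS Signature V4 has a 5-minute tolerance. Clock skew beyond that typically makes signed requests fail with a signature-expired or time-skew error (S3 reports RequestTimeTooSkewed), so check the clock before suspecting the keys. Know how Amazon Time Sync Service (169.254.169.123) works in private subnets.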

Study Path: Network Engineer

Map Familiar Concepts to AWS Networking: You know ACLs and stateful firewalls. In AWS: Security Groups = stateful firewall (instance level), NACLs = stateless ACL (subnet level, numbered rules, both directions required).
Master VPC Flow Logs: Flow logs are your network packet capture equivalent. Know the fields (src IP, dst IP, src port, dst port, protocol, action, bytes) and how to read ACCEPT/REJECT to diagnose connectivity issues.
Study Route Tables and Gateway Patterns: Private subnet: 0.0.0.0/0 → NAT Gateway (in public subnet). Public subnet: 0.0.0.0/0 → Internet Gateway. Know the difference and what breaks when either is missing.
Review DNS Troubleshooting Tools: Know nslookup, dig, dig @8.8.8.8, and dig +trace. Study Route 53 health checks and how DNS TTL caching causes stale resolutions after record changes.
Study VPN and Direct Connect Issues: BGP session status, route propagation configuration, pre-shared key mismatches. Know how to verify these in the AWS console and what symptoms each failure produces.
Study Load Balancer Health Check Troubleshooting: ALB health checks: correct path, correct port, SG allows health check traffic from ALB to instance. Know which HTTP codes indicate healthy vs unhealthy targets.
Practice Quiz Questions 1, 3, 8, and 10: These four questions directly test network troubleshooting: NAT Gateway, 502 ALB, NACL blocking, and NTP. Review each explanation carefully.
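The route table patterns in this study path (0.0.0.0/0 to a NAT Gateway for private subnets, to an Internet Gateway for public ones) can be explored with a small longest-prefix-match lookup, which is how a VPC route table picks a route. The `next_hop` function and the target labels are hypothetical stand-ins for real route targets.

```python
import ipaddress


def next_hop(routes, dest_ip):
    """Pick the most specific matching route; None means no route exists."""
    addr = ipaddress.ip_address(dest_ip)
    matches = []
    for cidr, target in routes:
        network = ipaddress.ip_network(cidr)
        if addr in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None
    return max(matches)[1]  # highest prefix length wins


# Illustrative route tables for a VPC using 10.0.0.0/16.
private_routes = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "nat-gateway")]
public_routes = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "internet-gateway")]
broken_routes = [("10.0.0.0/16", "local")]  # default route is missing

print(next_hop(private_routes, "10.0.3.25"))
print(next_hop(private_routes, "198.51.100.7"))
print(next_hop(public_routes, "198.51.100.7"))
print(next_hop(broken_routes, "198.51.100.7"))
```

The last lookup returns no route, which is the situation behind an instance that can reach its neighbours in the VPC but nothing on the internet.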
