FlashGenius
CompTIA Cloud+ · CV0-004 · V4 2024

Cloud+: Troubleshooting
Cloud Issues

Domain 5 of 6  |  12% of Exam  |  CV0-004
Weight: 12%
Duration: 90 min
Passing Score: 750
Questions: 90
Version: V4 2024
Core Concepts
8 essential troubleshooting topics for Cloud+ CV0-004 Domain 5

01 Cloud Troubleshooting Methodology

  • Structured approach: Identify → Establish theory → Test theory → Establish action plan → Implement → Verify → Document
  • Start with: check recent changes (what changed before the problem started?), check logs, check metrics
  • Divide and conquer: narrow the problem domain (network, compute, application, storage, IAM)
  • Correlation: is it affecting one user or all? One region or all? One service or all?
  • Tools: AWS CloudTrail (API calls), CloudWatch (metrics/logs), VPC Flow Logs (network), AWS Config (resource config history)
  • Common mistake: assuming the most complex cause — check simple things first (security group rules, IAM permissions, service limits)
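
The "check recent changes" step above can be sketched as a filter over audit events. This is a minimal illustration, not an AWS API call: the event dictionaries use a simplified, assumed shape (eventTime, eventName, readOnly), and the helper name recent_changes and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def recent_changes(events, incident_start, lookback=timedelta(hours=2)):
    """Return write (non read-only) events in the window before the incident.

    `events` is a list of dicts with assumed keys:
    eventTime (ISO 8601 string), eventName, readOnly (bool).
    """
    window_start = incident_start - lookback
    changes = []
    for event in events:
        if event.get("readOnly", False):
            continue  # read-only calls (Describe/List/Get) change nothing
        when = datetime.fromisoformat(event["eventTime"].replace("Z", "+00:00"))
        if window_start <= when <= incident_start:
            changes.append((when, event["eventName"]))
    return sorted(changes)  # oldest first, so the latest change is last


# Hypothetical sample data, for illustration only.
sample_events = [
    {"eventTime": "2024-05-01T09:10:00Z", "eventName": "DescribeInstances", "readOnly": True},
    {"eventTime": "2024-05-01T09:40:00Z", "eventName": "AuthorizeSecurityGroupIngress", "readOnly": False},
    {"eventTime": "2024-05-01T09:55:00Z", "eventName": "RevokeSecurityGroupIngress", "readOnly": False},
    {"eventTime": "2024-04-30T08:00:00Z", "eventName": "RunInstances", "readOnly": False},
]
incident = datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc)

for when, name in recent_changes(sample_events, incident):
    print(when.isoformat(), name)
```

Filtering out read-only calls first keeps the list short, which matches the "simple things first" advice: the last write before the incident is usually the first theory to test.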

02 Troubleshooting Deployment Issues

  • Container fails to start: check image name/tag (typo?), registry credentials, resource limits (CPU/memory too low), missing ConfigMaps/Secrets
    • kubectl describe pod: shows events and error messages
    • kubectl logs pod-name: shows application stdout/stderr
    • kubectl get events: shows cluster-level events
  • IaC apply fails: check error message first, verify credentials/permissions, check resource quota limits, look for dependency ordering issues
    • Terraform: run terraform plan to preview changes and catch configuration errors before apply
  • CloudFormation stack in a ROLLBACK state: check stack events in the console and find the first FAILED event, which is the one that triggered the rollback
  • Auto Scaling fails to launch: check launch template (AMI valid in region?), IAM instance profile permissions, security group exists in VPC, subnet capacity
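
The container checks above can be sketched as a lookup from a pod's reported state to the first thing worth checking. This is a rough sketch that assumes the pod has been exported as JSON (for example with kubectl get pod -o json) and loaded into a dict. The function name triage_pod, the hint wording, and the sample pod are hypothetical; the reason strings are ones Kubernetes commonly reports.

```python
# Maps a container state reason to the first thing worth checking.
HINTS = {
    "ErrImagePull": "image name/tag or registry credentials",
    "ImagePullBackOff": "image name/tag or registry credentials",
    "CreateContainerConfigError": "missing ConfigMap or Secret",
    "CrashLoopBackOff": "application logs (kubectl logs --previous)",
    "OOMKilled": "memory limit too low",
}


def triage_pod(pod):
    """Return (container, reason, hint) tuples for unhealthy containers."""
    findings = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        reasons = []
        waiting = status.get("state", {}).get("waiting")
        if waiting:
            reasons.append(waiting.get("reason", ""))
        terminated = status.get("lastState", {}).get("terminated")
        if terminated:
            reasons.append(terminated.get("reason", ""))
        for reason in reasons:
            if reason in HINTS:
                findings.append((status["name"], reason, HINTS[reason]))
    return findings


# Hypothetical pod: restarting in a loop after being killed for memory use.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "web",
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled"}},
            }
        ]
    }
}

for container, reason, hint in triage_pod(sample_pod):
    print(f"{container}: {reason} -> check {hint}")
```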

03 Troubleshooting Network Connectivity

  • Connectivity checklist (work through layers):
    • 1. Security group: does it allow the traffic (port, protocol, source IP)?
    • 2. NACL: are there both inbound and outbound rules? (NACLs are stateless)
    • 3. Route table: is there a route to the destination (IGW for public, NAT for private)?
    • 4. Internet gateway: is it attached to the VPC?
    • 5. Elastic IP / Public IP: does the instance have one (for inbound from internet)?
    • 6. Instance OS: is the service listening on the port? Is iptables blocking?
  • VPC Flow Logs: log accepted and rejected traffic — identify which layer is blocking
  • DNS issues: nslookup/dig to verify resolution — wrong IP? Missing record? TTL caching old value?
  • VPN/Direct Connect issues: check BGP session status, verify route propagation, check pre-shared keys
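
Step 1 of the checklist above (does the security group allow the port, protocol, and source IP?) can be sketched as a small rule matcher. It assumes the ingress rules are dicts shaped like the IpPermissions entries returned when describing a security group. The function name sg_allows and the sample rules are hypothetical, and the sketch ignores IPv6 ranges, prefix lists, and rules that reference other security groups.

```python
import ipaddress


def sg_allows(ip_permissions, protocol, port, source_ip):
    """Return True if any ingress rule permits the given IPv4 traffic."""
    source = ipaddress.ip_address(source_ip)
    for rule in ip_permissions:
        # "-1" means all protocols and all ports
        if rule["IpProtocol"] not in ("-1", protocol):
            continue
        if rule["IpProtocol"] != "-1":
            if not rule["FromPort"] <= port <= rule["ToPort"]:
                continue
        for ip_range in rule.get("IpRanges", []):
            if source in ipaddress.ip_network(ip_range["CidrIp"]):
                return True
    return False  # security groups only allow; no match means blocked


# Hypothetical rules: HTTPS from anywhere, SSH from one office range.
sample_rules = [
    {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
     "IpRanges": [{"CidrIp": "203.0.113.0/24"}]},
]

print(sg_allows(sample_rules, "tcp", 22, "203.0.113.10"))
print(sg_allows(sample_rules, "tcp", 22, "198.51.100.7"))
```

If the matcher says the traffic is allowed but the connection still fails, move on to the next layer in the checklist rather than re-reading the same rules.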

04 Troubleshooting Security Incidents

  • Leaked credentials: immediately rotate/deactivate compromised keys, then check CloudTrail for actions taken with the compromised credentials. Look for:
    • new IAM users or new access keys
    • data exfiltration (unusual S3 GetObject)
    • EC2 launches in unusual regions
  • Privilege escalation: attacker gains higher permissions — check CloudTrail for iam:AttachUserPolicy, iam:CreateAccessKey, iam:PutUserPolicy
  • Cryptomining: unusual EC2 instance types (GPU), high CPU on unexpected instances, outbound connections to mining pools (GuardDuty: CryptoCurrency:EC2/BitcoinTool.B)
  • Compromised instance isolation: take a snapshot, then isolate (remove from the load balancer, swap to a security group with no inbound or outbound rules). Do NOT terminate: preserve the instance for forensics
  • S3 data exfiltration: check S3 access logs, CloudTrail S3 events, check for public bucket ACLs, GetObject calls from unusual IPs
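
The CloudTrail review described above can be sketched as a scan for sensitive event names and unexpected regions tied to one access key. The records use a trimmed subset of CloudTrail record fields (eventName, awsRegion, userIdentity.accessKeyId). The function name suspicious_actions, the choice of flagged events, and the sample records are assumptions for illustration, and the key shown is the documentation-style example value, not a real credential.

```python
# Event names worth a closer look when a key is suspected compromised.
SENSITIVE_EVENTS = {
    "AttachUserPolicy", "PutUserPolicy", "CreateAccessKey",
    "CreateUser", "RunInstances",
}


def suspicious_actions(records, access_key_id, expected_regions):
    """Flag actions made with one access key that look like attacker activity."""
    findings = []
    for record in records:
        identity = record.get("userIdentity", {})
        if identity.get("accessKeyId") != access_key_id:
            continue  # only follow the compromised key
        flags = []
        if record["eventName"] in SENSITIVE_EVENTS:
            flags.append("sensitive-action")
        if record["awsRegion"] not in expected_regions:
            flags.append("unusual-region")
        if flags:
            findings.append((record["eventName"], record["awsRegion"], flags))
    return findings


# Hypothetical records, for illustration only.
KEY = "AKIAIOSFODNN7EXAMPLE"
sample_records = [
    {"eventName": "CreateAccessKey", "awsRegion": "us-east-1",
     "userIdentity": {"accessKeyId": KEY}},
    {"eventName": "RunInstances", "awsRegion": "ap-southeast-2",
     "userIdentity": {"accessKeyId": KEY}},
    {"eventName": "GetObject", "awsRegion": "us-east-1",
     "userIdentity": {"accessKeyId": KEY}},
    {"eventName": "CreateUser", "awsRegion": "us-east-1",
     "userIdentity": {"accessKeyId": "AKIAI44QH8DHBEXAMPLE"}},
]

for name, region, flags in suspicious_actions(sample_records, KEY, {"us-east-1"}):
    print(name, region, flags)
```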

05 Troubleshooting Service Disruptions

  • DNS failures: check Route 53 health checks, verify NS records, check TTL (cached old value?), test with nslookup/dig from multiple locations
  • DHCP issues: instances not getting IPs — check DHCP options set in VPC, subnet capacity (address exhaustion), check dhclient logs on Linux
  • NTP issues: clock skew causes TLS certificate errors and authentication failures (Kerberos; AWS Signature V4, which tolerates about 5 minutes of skew). Check chrony/ntpd status
  • Load balancer health check failures: check health check path (HTTP 200?), security group allows health check traffic from LB, instance responding on health check port
  • Certificate errors: TLS expired? Wrong domain? Self-signed? Check with: openssl s_client -connect host:443
  • Service limits/quotas: check Service Quotas console, request increases proactively
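
The clock-skew point above can be checked quickly by comparing the local clock with the Date header of any HTTP response. The following is a minimal sketch under stated assumptions: the header value is passed in as a string, the function names clock_skew and skew_is_risky are hypothetical, and the 5-minute threshold is taken from the tolerance mentioned above.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

TOLERANCE = timedelta(minutes=5)  # tolerance quoted in the NTP bullet above


def clock_skew(date_header, local_now):
    """Return local time minus server time as a timedelta."""
    server_time = parsedate_to_datetime(date_header)
    return local_now - server_time


def skew_is_risky(date_header, local_now, tolerance=TOLERANCE):
    """True when the clocks differ by more than the tolerance, either way."""
    return abs(clock_skew(date_header, local_now)) > tolerance


# Hypothetical values: the local clock runs 7 minutes ahead of the server.
header = "Wed, 01 May 2024 10:00:00 GMT"
local = datetime(2024, 5, 1, 10, 7, tzinfo=timezone.utc)

print(clock_skew(header, local), skew_is_risky(header, local))
```

In practice local_now would be datetime.now(timezone.utc); a risky result points to chrony/ntpd before it points to certificates or credentials.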

06 Troubleshooting Misconfigurations

  • S3 bucket public access: check bucket policy and ACL — use S3 Block Public Access settings at account level
  • IAM misconfiguration: overly permissive policies (Action: *, Resource: *) — use IAM Access Analyzer to find overly permissive policies, unused access
  • Security group misconfiguration: 0.0.0.0/0 on port 22 (SSH) or 3389 (RDP) — immediate remediation required
  • Missing tags: resources without required tags drop out of cost allocation reports. Detect them with the AWS Config rule required-tags
  • Wrong region: resource created in wrong region (common with console use) — check console region selector
  • Resource dependency issues: deleting resources with dependencies (VPC with EC2 still in it, SG attached to ENI) — must remove dependencies first
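
The IAM misconfiguration above (Action: *, Resource: *) can be sketched as a scan over a policy document. This is not a replacement for IAM Access Analyzer: it only catches the literal full wildcard in Allow statements, not narrower patterns such as s3:*. The function names as_list and wildcard_statements and the sample policy are hypothetical.

```python
def as_list(value):
    """IAM allows a single value or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]


def wildcard_statements(policy):
    """Return Allow statements granting every action on every resource."""
    risky = []
    for statement in as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        actions = as_list(statement.get("Action", []))
        resources = as_list(statement.get("Resource", []))
        if "*" in actions and "*" in resources:
            risky.append(statement)
    return risky


# Hypothetical policy: one scoped statement, one full-admin statement.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "ReadLogs", "Effect": "Allow",
         "Action": ["s3:GetObject"], "Resource": "arn:aws:s3:::example-logs/*"},
        {"Sid": "TooBroad", "Effect": "Allow", "Action": "*", "Resource": "*"},
    ],
}

for statement in wildcard_statements(sample_policy):
    print("Overly permissive:", statement.get("Sid", "(no Sid)"))
```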

07 Performance Troubleshooting

  • High latency: check CloudWatch metrics (latency P99/P95/P50), check service limits, database connection pool exhaustion, memory pressure causing swap
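  • CPU throttling: burstable instances (T2/T3) use CPU credits. In standard mode, once credits are exhausted the CPU is throttled to baseline (T3 defaults to unlimited mode, which bills for surplus credits instead of throttling). Check the CPUCreditBalance metric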
  • Memory pressure: check CloudWatch agent metrics (mem_used_percent), OOM killer in logs (dmesg | grep -i oom)
  • Storage I/O: EBS IOPS limits — check VolumeReadOps/VolumeWriteOps, BurstBalance for gp2. Upgrade to gp3 or io2 for consistent IOPS
  • Network throughput: check NetworkIn/NetworkOut against instance network bandwidth limits
  • Database: slow query logs, check for missing indexes, connection pool limits, read replica lag
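
The CPU credit behaviour described in this list can be made concrete with a little arithmetic. One CPU credit is one vCPU running at 100% for one minute, so an instance spends vcpus × utilization × 60 credits per hour while earning at a fixed rate. The sketch below estimates how long a balance lasts; the function name hours_until_throttled is hypothetical and the sample figures are illustrative, so check the published credit table for your instance type.

```python
def hours_until_throttled(balance, credits_per_hour, vcpus, utilization):
    """Estimate hours until CPUCreditBalance hits zero at steady load.

    utilization is the average per-vCPU load as a fraction (0.5 = 50%).
    Returns None when credits are earned at least as fast as they are spent.
    """
    spend_per_hour = vcpus * utilization * 60  # credits burned per hour
    net_drain = spend_per_hour - credits_per_hour
    if net_drain <= 0:
        return None  # at or below baseline: the balance does not shrink
    return balance / net_drain


# Illustrative figures: 288 banked credits, 12 earned per hour, 2 vCPUs.
print(hours_until_throttled(288, 12, 2, 0.50))
print(hours_until_throttled(288, 12, 2, 0.05))
```

A steady drain like this is why a burstable instance can run well for hours after a deployment and then slow down with no further change.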

08 Common Cloud+ Exam Troubleshooting Scenarios

  • "Cannot SSH to EC2": SG rule missing port 22, key pair wrong, instance in private subnet without bastion/SSM, OS firewall blocking
  • "Application returns 502 Bad Gateway": load balancer cannot reach backend — health checks failing, backend down, security group blocks LB to instance
  • "S3 access denied": IAM policy missing s3:GetObject, bucket policy denying, object ACL blocking, KMS key policy not allowing decrypt
  • "Lambda timeout": function runs longer than its configured timeout (max 15 min). Fix: increase the timeout, optimize the code, or break the work into smaller functions
  • "RDS connection refused": security group not allowing application subnet to DB port 5432/3306, wrong connection string (hostname/port), RDS instance stopped
  • "Terraform state lock": previous apply didn't release the DynamoDB lock. Confirm no other run is active, then use terraform force-unlock with the lock ID (or, as a last resort, delete the lock item manually)
Memory Hooks
6 mnemonics and patterns to lock in troubleshooting knowledge fast

Network Troubleshooting Order

SG-NACL-Route-IGW-IP-OS

Work through the layers in checklist order: Security Group → NACL → Route Table → Internet Gateway → Public IP → OS firewall

NACL vs SG Troubleshooting

SG forgets nothing, NACL forgets everything

SG is stateful (remembers the connection, auto-allows return traffic). NACL is stateless — you must add rules for both inbound AND outbound directions
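
The stateless behaviour above can be sketched with a simplified NACL model. Rules are evaluated in ascending rule-number order, the first match wins, and anything unmatched is denied. The rule tuples (rule_number, port_low, port_high, action), the function names, and the sample rules are assumptions for illustration; protocol and CIDR matching are left out to keep the example short.

```python
def nacl_decision(rules, port):
    """First matching rule by ascending rule number wins; default is deny."""
    for _number, low, high, action in sorted(rules):
        if low <= port <= high:
            return action
    return "deny"


def round_trip_allowed(inbound, outbound, service_port, client_port):
    """A stateless NACL must allow the request in AND the reply out."""
    request_ok = nacl_decision(inbound, service_port) == "allow"
    # The reply leaves toward the client's ephemeral source port.
    reply_ok = nacl_decision(outbound, client_port) == "allow"
    return request_ok and reply_ok


# Hypothetical rules: HTTPS allowed inbound in both cases.
inbound = [(100, 443, 443, "allow")]
outbound_missing_ephemeral = [(100, 443, 443, "allow")]
outbound_with_ephemeral = [(100, 443, 443, "allow"), (110, 1024, 65535, "allow")]

print(round_trip_allowed(inbound, outbound_missing_ephemeral, 443, 50000))
print(round_trip_allowed(inbound, outbound_with_ephemeral, 443, 50000))
```

The first case is the classic symptom: the inbound rule looks correct, yet connections hang because the reply to the client's ephemeral port is dropped on the way out.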

Security Incident Response

RCI

Rotate credentials immediately, Check CloudTrail for all actions taken with compromised creds, Isolate compromised resources (don't terminate — preserve for forensics)

Common HTTP Error Codes

502 / 503 / 504 / 403 / 404

502 = Backend broken. 503 = Backend busy. 504 = Backend too slow. 403 = You're not allowed. 404 = Not found.

CPU Credits (T-instances)

Bank idle, spend busy

T-type instances bank credits when idle, spend credits when busy. CPUCreditBalance near zero = throttled to baseline CPU — looks like "low CPU but slow app"

Troubleshooting Starting Point

CWCL

Check What Changed Last — most cloud problems follow a recent change. Always ask: what was deployed, modified, or updated recently?

Resources
Official documentation and reference materials
CompTIA Cloud+ Certification

Official exam objectives, study materials, and exam registration for CV0-004

AWS VPC Flow Logs Documentation

Official AWS guide to enabling, reading, and querying VPC Flow Logs for network troubleshooting

kubectl Cheat Sheet

Official Kubernetes reference for kubectl commands used in container and pod troubleshooting