Six pillars for building secure, reliable, efficient, and cost-optimized systems — and the trade-offs between them.
A Series C startup was acquired and their architecture needed to scale from 50,000 to 5 million users. Before the migration, the acquiring company ran an AWS Well-Architected Review.
The review produced three critical findings: a single point of failure in the database (no Multi-AZ), secrets stored in EC2 user data (a security risk), and no documented runbook for any incident (a reliability risk). All three were fixed before the migration.
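One common way to address the second finding is to move secrets out of EC2 user data and into AWS Secrets Manager, then grant the instance role read access. The sketch below is illustrative only and is not the startup's actual fix: the secret name, the policy name, and the `aws_iam_role.app` instance role are assumptions, and the role is assumed to be defined elsewhere.

```hcl
# Hypothetical secret; its value would be set outside this snippet.
resource "aws_secretsmanager_secret" "db_password" {
  name = "production/db/password"
}

# Least-privilege policy: read this one secret and nothing else.
data "aws_iam_policy_document" "read_db_secret" {
  statement {
    actions   = ["secretsmanager:GetSecretValue"]
    resources = [aws_secretsmanager_secret.db_password.arn]
  }
}

# Attach the policy to the EC2 instance role (assumed to exist as aws_iam_role.app).
resource "aws_iam_role_policy" "app_read_db_secret" {
  name   = "read-db-secret"
  role   = aws_iam_role.app.id
  policy = data.aws_iam_policy_document.read_db_secret.json
}
```

The application then fetches the secret at startup using the instance role, so nothing sensitive appears in the launch configuration.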
Seven months later, the primary database failed. Instead of a 6-hour outage, RDS Multi-AZ automatic failover restored service in 73 seconds. That single avoided outage more than justified the cost of the review.
What is the AWS Well-Architected Framework?
A set of architectural best practices organized into six pillars, developed by AWS from reviewing thousands of customer workloads. Each pillar has a set of design principles, questions, and best practices. The free AWS Well-Architected Tool in the console lets you review any workload against these pillars.
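The six pillars of the Well-Architected Framework
The six pillars are Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability.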
The hardest part of the Well-Architected Framework is that the pillars trade off against each other. There is no architecture that is simultaneously maximally reliable, maximally performant, and maximally cost-optimized.
| Trade-off scenario | Decision | What you gain | What you sacrifice |
|---|---|---|---|
| Multi-AZ vs single-AZ database | Multi-AZ | Reliability (73s failover vs 6h recovery) | Cost (2× database cost) |
| DynamoDB on-demand vs provisioned | On-demand | Cost efficiency at low/unpredictable load; Reliability (no capacity planning, less risk of throttling on spikes) | Cost at high, predictable load (provisioned capacity can be substantially cheaper at steady utilization) |
| Synchronous vs async processing | Async (SQS + Lambda) | Reliability (retries, dead letter queues); Cost (pay per invocation) | Complexity; Operational excellence (harder to debug) |
| Caching (ElastiCache vs no cache) | Add cache | Performance; Cost (fewer DB reads) | Reliability (cache invalidation bugs); Operational excellence (more components) |
| Spot instances vs On-Demand | Spot for batch jobs | Cost (70-90% cheaper) | Reliability (spot interruptions); must design for graceful shutdown |
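Two of the rows above map directly to a few lines of configuration. The following is a minimal sketch, assuming hypothetical resource names (`sessions`, `orders`, `orders-dlq`) and a Lambda function `aws_lambda_function.worker` defined elsewhere. It shows an on-demand DynamoDB table and an SQS queue with a dead letter queue feeding a Lambda consumer.

```hcl
# DynamoDB on-demand: pay per request, no capacity planning.
resource "aws_dynamodb_table" "sessions" {
  name         = "sessions"
  billing_mode = "PAY_PER_REQUEST" # switch to "PROVISIONED" for steady, predictable load
  hash_key     = "session_id"

  attribute {
    name = "session_id"
    type = "S"
  }
}

# Dead letter queue: holds messages that repeatedly fail processing.
resource "aws_sqs_queue" "orders_dlq" {
  name                      = "orders-dlq"
  message_retention_seconds = 1209600 # 14 days, the SQS maximum
}

# Main queue: after 5 failed receives, a message moves to the DLQ.
resource "aws_sqs_queue" "orders" {
  name                       = "orders"
  visibility_timeout_seconds = 180 # AWS guidance: at least 6x the Lambda timeout

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.orders_dlq.arn
    maxReceiveCount     = 5
  })
}

# Lambda polls the queue in batches (worker function assumed to exist).
resource "aws_lambda_event_source_mapping" "orders" {
  event_source_arn = aws_sqs_queue.orders.arn
  function_name    = aws_lambda_function.worker.arn
  batch_size       = 10
}
```

The extra queue and event source mapping are the "more components" cost of the async row: retries come for free, but there is now a DLQ that someone has to monitor.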
How to make pillar trade-off decisions
Match your reliability and cost investments to the business criticality of the workload. A payment processing service needs maximum reliability (Multi-AZ, read replicas, chaos testing). An internal analytics dashboard can trade reliability for cost (single-AZ, no HA, spot instances for batch processing).
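One way to make this decision explicit in code is to drive reliability settings from a single criticality input. This is a sketch under stated assumptions: the `criticality` variable and its two tiers are hypothetical, the sizing values are placeholders, and `manage_master_user_password` requires a reasonably recent AWS provider version.

```hcl
variable "criticality" {
  description = "Business criticality of the workload (hypothetical tiers)."
  type        = string
  default     = "low"

  validation {
    condition     = contains(["low", "high"], var.criticality)
    error_message = "The criticality value must be \"low\" or \"high\"."
  }
}

locals {
  is_critical = var.criticality == "high"
}

resource "aws_db_instance" "workload" {
  identifier        = "workload-db"
  engine            = "postgres"
  instance_class    = "db.t3.large"
  allocated_storage = 100
  username          = "app"

  # Let RDS generate the master password and store it in Secrets Manager.
  manage_master_user_password = true

  # Reliability spend scales with criticality.
  multi_az                = local.is_critical
  backup_retention_period = local.is_critical ? 7 : 1
  deletion_protection     = local.is_critical
  skip_final_snapshot     = !local.is_critical
}
```

A payment service would set `criticality = "high"`, while an internal dashboard keeps the cheaper default, and the trade-off is visible in code review rather than buried in per-resource settings.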
A company uses a single-AZ RDS instance to reduce database costs. Which Well-Architected pillar are they trading off?
Reliability is often the pillar where gaps carry the highest failure risk. Here are the key design patterns:
Key reliability patterns with AWS services
```hcl
# Well-Architected: Reliability pillar. RDS with Multi-AZ and automated backups.
# Required arguments such as allocated_storage and credentials are omitted for brevity.
resource "aws_db_instance" "main" {
  identifier     = "production-db"
  engine         = "postgres"
  engine_version = "15.4"
  instance_class = "db.t3.large"

  # Reliability: Multi-AZ for automatic failover (73s vs hours).
  # This is the most important reliability setting.
  multi_az = true

  # Reliability: automated backups with 7-day retention.
  # Backups enable point-in-time recovery, so your RPO is minutes, not days.
  backup_retention_period = 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "sun:05:00-sun:06:00"

  # Security: encryption at rest (Security pillar). Always enable for production data.
  storage_encrypted = true
  kms_key_id        = aws_kms_key.db.arn

  # Reliability: automated minor version upgrades
  auto_minor_version_upgrade = true

  # Reliability: deletion protection guards against an accidental
  # `terraform destroy` of the production database.
  deletion_protection = true

  # Cost: storage autoscaling prevents manual intervention
  max_allocated_storage = 1000
}
```
This material comes up in cloud architecture interviews, solutions architect roles, and technical leadership discussions. The AWS Solutions Architect certifications (SAA, SAP) test it extensively.
Common questions:
A startup wants to minimize AWS costs on their MVP. Should they use Multi-AZ RDS?
It depends on business criticality. For an MVP with no paying customers, single-AZ is an acceptable cost trade-off. Once customers are paying or the product is business-critical, Multi-AZ is required (Reliability pillar). This is the explicit trade-off the framework asks you to make consciously.
What is the purpose of a "game day" in the Reliability pillar?
A game day is a scheduled exercise where you simulate failures (AZ outage, database failover, instance termination) to measure your actual RTO and RPO, and to discover recovery procedures that do not work before a real incident does.
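A database failover game day can be defined as an AWS Fault Injection Service experiment. The sketch below is one possible setup, not a prescribed procedure: the template name and the `aws_iam_role.fis` role are assumptions, and the target refers to the `aws_db_instance.main` resource from the reliability example.

```hcl
resource "aws_fis_experiment_template" "rds_failover_game_day" {
  description = "Game day: force a Multi-AZ failover of the production database"
  role_arn    = aws_iam_role.fis.arn # assumed to exist, with FIS permissions for RDS

  # No automatic stop condition in this sketch. A real game day would
  # usually point this at a CloudWatch alarm instead.
  stop_condition {
    source = "none"
  }

  # Reboot the instance with a forced failover to the standby.
  action {
    name      = "force-failover"
    action_id = "aws:rds:reboot-db-instances"

    parameter {
      key   = "forceFailover"
      value = "true"
    }

    target {
      key   = "DBInstances"
      value = "production-db"
    }
  }

  # The database under test.
  target {
    name           = "production-db"
    resource_type  = "aws:rds:db"
    selection_mode = "ALL"
    resource_arns  = [aws_db_instance.main.arn]
  }
}
```

Running the experiment while timing application recovery gives a measured RTO to compare against the target, and it exercises the runbook under controlled conditions.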