Every architectural decision has costs and benefits. Learn the systematic process for making informed choices you can defend.
It is a familiar story. The system was running out of memory. It was 2 AM. The on-call engineer had to decide: restart the service (fast, unknown risk of data loss) or gracefully drain it (slow, safe). They chose to restart. Three days later, they discovered the restart had corrupted 40,000 user records. The fix took three months.
The decision was not wrong because of bad judgment. It was wrong because it was made without a framework. What are the options? What are the consequences of each? What is irreversible? Who needs to know?
Trade-off analysis is that framework — applied not just to incidents, but to every architectural decision you make.
What is trade-off analysis?
A systematic process for evaluating architectural options by making the costs and benefits of each explicit, understanding which trade-offs are reversible vs irreversible, and making a decision that you can defend to your team, your manager, and yourself at 3 AM six months later.
The 8 dimensions of architectural trade-offs
Jeff Bezos's two-door framework is the most useful mental model for trade-off analysis:
Type 1 vs Type 2 decisions
Type 1 decisions are irreversible or very costly to reverse — walk through a one-way door. These need deep analysis, senior sign-off, and extensive consideration of failure modes. Type 2 decisions are easily reversible — walk through a two-way door. These should be made quickly by the people closest to the problem. Most decisions are Type 2, but teams often treat them as Type 1, creating decision-making paralysis.
| Decision | Type | Why |
|---|---|---|
| Choice of primary database (Postgres vs DynamoDB) | Type 1 | Migrating between database paradigms at scale costs months of engineering |
| Cache TTL value (60s vs 300s) | Type 2 | One config change, deployed in minutes, instantly reversible |
| Microservices vs monolith architecture | Type 1 | Once distributed, organizational and data ownership patterns form around it |
| Feature flag: enable new UI for 5% of users | Type 2 | Toggle back off if issues arise; any impact is limited to the 5% cohort |
| Adopting a new cloud region (multi-region) | Type 1 | Data gravity, compliance, network architecture — all shift when you add a region |
| Switching from polling to WebSocket for real-time updates | Type 2 (probably) | New client code + server code, but both can coexist and be rolled back |
The rule: spend 80% of your analysis effort on Type 1 decisions. Move fast on Type 2. Most architecture velocity is lost by treating Type 2 decisions as Type 1.
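To make the feature-flag row concrete, here is a minimal sketch of a percentage rollout. It assumes users are assigned to stable buckets by hashing their ID; the flag name `new-ui`, the user IDs, and both function names are hypothetical, and real flag services may work differently. What makes this a Type 2 decision is that rolling back means changing one number.

```python
import hashlib


def rollout_bucket(user_id: str, flag_name: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, flag_name: str, rollout_percent: int) -> bool:
    """True if the user falls inside the first `rollout_percent` buckets."""
    return rollout_bucket(user_id, flag_name) < rollout_percent


# Hypothetical flag name and user IDs, for illustration only.
users = [f"user-{i}" for i in range(10_000)]
enabled_at_5 = sum(is_enabled(u, "new-ui", 5) for u in users)
enabled_at_0 = sum(is_enabled(u, "new-ui", 0) for u in users)

print(f"5% rollout: {enabled_at_5} of {len(users)} users see the new UI")
print(f"rolled back to 0%: {enabled_at_0} of {len(users)} users see the new UI")
```

Because the bucket depends only on the flag name and user ID, the same users stay in the cohort between requests, and setting the percentage to 0 turns the feature off for everyone.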
Your team is choosing between PostgreSQL and DynamoDB for a new service. According to the reversibility test, how should this decision be treated?
Here is a practical template for documenting and communicating architectural trade-offs:
Architecture Decision Record (ADR) template
# ADR-042: Use DynamoDB for user session storage

## Context
We need session storage for 50M active sessions. Current Redis cluster is hitting memory limits at $8k/month.

## Options considered
1. Scale Redis vertically (r6g.4xlarge): $16k/month, single point of failure
2. Redis Cluster (3-node): $12k/month, complex failover
3. DynamoDB with TTL: $1.2k/month, fully managed, auto-scaling

## Decision
DynamoDB with 24-hour TTL per session.

## Trade-offs accepted
- Slightly higher read latency (~5ms vs <1ms for Redis)
- NoSQL data model (sessions are simple key-value, acceptable)
- No in-process caching (acceptable: sessions accessed once per request)

## Trade-offs rejected
- Redis complexity at scale outweighs latency benefit for this use case
- Latency SLA is 50ms end-to-end; 5ms session lookup is acceptable

## Reversibility
Type 2: can migrate back to Redis if latency proves problematic. Session data is simple key-value; migration script is trivial.
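The numbers in ADR-042 can be checked with a few lines of arithmetic. The sketch below only restates the figures quoted in the ADR. The variable names are illustrative, and the 5 ms lookup is the ADR's own approximation, not a measurement.

```python
# Figures copied from ADR-042; variable names are illustrative.
current_redis_cost = 8_000  # USD per month for the existing Redis cluster
option_costs = {
    "Scale Redis vertically": 16_000,
    "Redis Cluster (3-node)": 12_000,
    "DynamoDB with TTL": 1_200,
}
latency_sla_ms = 50      # end-to-end latency SLA from the ADR
dynamodb_lookup_ms = 5   # approximate session lookup latency from the ADR

# Compare each option's monthly cost against the current spend.
for name, cost in option_costs.items():
    delta = cost - current_redis_cost
    print(f"{name}: ${cost:,}/month ({delta:+,} vs current)")

# Annualize the saving for the chosen option.
annual_saving = (current_redis_cost - option_costs["DynamoDB with TTL"]) * 12

# Share of the latency budget consumed by the session lookup.
budget_share = dynamodb_lookup_ms / latency_sla_ms

print(f"Annual saving with DynamoDB: ${annual_saving:,}")
print(f"Session lookup uses {budget_share:.0%} of the latency budget")
```

Under these figures the chosen option saves $81,600 per year compared with the current cluster and spends about a tenth of the end-to-end latency budget on the session lookup, which is the quantitative core of the "trade-offs accepted" section.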
ADRs are the most valuable engineering artifact that most teams do not write. They capture not just what you decided, but why — including the options you rejected and the trade-offs you accepted. Six months later when someone asks "why are we using DynamoDB for sessions?", the ADR answers it.
| Trade-off | When to prioritize A | When to prioritize B |
|---|---|---|
| SQL (A) vs NoSQL (B) | Complex relationships, ACID needed, evolving schema via migrations | Simple access patterns, massive scale, flexible schema, cost optimization |
| Synchronous (A) vs Asynchronous (B) | User needs immediate response (checkout confirmation, auth) | Background work (email, image processing, analytics events) |
| Monolith (A) vs Microservices (B) | <5 engineers, MVP, single domain, fast iteration needed | >20 engineers, independent scaling needed, multiple deployment cadences |
| Consistency (A) vs Availability (B) | Financial data, inventory counts, user auth state | Social feeds, product recommendations, search indexes, analytics |
| Build (A) vs Buy (B) | Core differentiator, unique requirements, team has deep expertise | Commodity infrastructure, standard compliance, no strategic differentiation |
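The synchronous vs asynchronous row can be illustrated with a small sketch. It assumes an in-process queue and a single worker thread standing in for a real message broker; `checkout`, `email_worker`, and the order ID are hypothetical names chosen for the example.

```python
import queue
import threading

email_queue = queue.Queue()  # stand-in for a real message broker
sent_emails = []             # records what the background worker processed


def email_worker() -> None:
    """Background consumer: handles work the user does not wait for."""
    while True:
        order_id = email_queue.get()
        if order_id is None:  # sentinel value: shut the worker down
            email_queue.task_done()
            break
        sent_emails.append(f"receipt for {order_id}")
        email_queue.task_done()


def checkout(order_id: str) -> str:
    """Synchronous path: the user needs this answer immediately."""
    confirmation = f"order {order_id} confirmed"
    # Asynchronous path: enqueue the receipt email and return without waiting.
    email_queue.put(order_id)
    return confirmation


worker = threading.Thread(target=email_worker, daemon=True)
worker.start()

result = checkout("A-1001")
print(result)

email_queue.join()     # wait here only so the demo can show the outcome
email_queue.put(None)  # ask the worker to stop
worker.join()
print(sent_emails)
```

The confirmation is returned before the email is processed, which is the point of the split: the user-facing response stays fast, while the background work can be retried or delayed without affecting it.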
The one question that clarifies every trade-off
Ask: "What is the worst case if this decision is wrong?" If the worst case is a performance regression you can diagnose and fix in a week — it is probably Type 2. If the worst case is a 3-month migration project or a security breach that leaks user data — it is Type 1. Match your analysis depth to the worst-case consequences.
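The worst-case question can be sketched as a small classifier. This is an illustration under stated assumptions, not a definitive rule: the `WorstCase` fields and the seven-day cutoff are hypothetical, chosen to match the "fix in a week" example above, and the text does not say where recoveries between a week and three months should fall.

```python
from dataclasses import dataclass


@dataclass
class WorstCase:
    description: str
    recovery_days: int               # estimated time to diagnose and undo the damage
    exposes_user_data: bool = False  # e.g. a breach that leaks user data


# Assumed cutoff based on the "fix in a week" example; tune for your team.
TYPE_2_MAX_RECOVERY_DAYS = 7


def decision_type(worst: WorstCase) -> str:
    """Classify a decision by the worst-case outcome if it turns out wrong."""
    if worst.exposes_user_data or worst.recovery_days > TYPE_2_MAX_RECOVERY_DAYS:
        return "Type 1"
    return "Type 2"


examples = [
    WorstCase("performance regression", recovery_days=7),
    WorstCase("database migration project", recovery_days=90),
    WorstCase("security breach", recovery_days=2, exposes_user_data=True),
]
for case in examples:
    print(f"{case.description}: {decision_type(case)}")
```

The value of writing it down this way is less the code itself than the forced estimate: you cannot call the function without stating how long recovery would take and whether user data is at risk.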
This skill is tested in senior engineering and architect interviews, where nearly every system design question is implicitly a trade-off analysis question. "Design X" means "design X and justify the trade-offs you made."
Key takeaways
What is a Type 1 vs Type 2 architectural decision?
Type 1: one-way door — irreversible or very costly to undo (database choice, microservices vs monolith). Type 2: two-way door — easily reversible (feature flags, cache TTL, logging format). Invest deeply in Type 1; move fast on Type 2.
When should you choose eventual consistency over strong consistency?
When the data can tolerate brief staleness without business impact — social feeds, product recommendations, search indexes, analytics. NOT for financial balances, inventory counts, or auth state where incorrect reads cause real harm.