The Simplified Tech

Engineering, with opinions

Teardowns, hard truths, and deep-dives that make you better at the job, not another tutorial dump.

Engineering Shorts: one idea, 15 seconds.

Shorts, quick hits

Gotcha

Your retries are turning a blip into a full outage

When a service slows down, naive retries multiply the load exactly when the system is weakest, creating a retry storm that finishes the job the original failure started. Always pair retries with exponential backoff AND jitter, cap the total attempts, and add a circuit breaker so you stop hammering a dependency that is already down.
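
A minimal Python sketch of that advice, under stated assumptions: `call_with_retries`, `CircuitBreaker`, and `CircuitOpenError` are hypothetical names written for this example (not a library API), and the delays and thresholds are placeholders, not recommendations. It uses "full jitter", meaning each wait is a random time between zero and the exponential ceiling.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped entirely."""


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures, then rejects
    calls until `reset_after` seconds have passed."""

    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial call through (half-open state).
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_with_retries(fn, breaker, max_attempts=4, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn with capped attempts, exponential backoff, and full jitter."""
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise CircuitOpenError("dependency marked down, not retrying")
        try:
            result = fn()
        except Exception:
            breaker.record_failure()
            if attempt == max_attempts - 1:
                raise  # attempts are capped: give up and surface the error
            # Full jitter: wait a random time in [0, min(max_delay, base * 2^attempt)).
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * ceiling)
        else:
            breaker.record_success()
            return result
```

Share one breaker per dependency rather than creating one per call, otherwise it never sees enough consecutive failures to open.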

Did you know

99.9% uptime is 43 minutes of downtime every month

A common rule of thumb says each extra nine costs roughly 10x more engineering effort. Over a 30-day month, 99.9% allows ~43 minutes of downtime, 99.99% only ~4 minutes, and 99.999% just ~26 seconds. Pick the target your users actually feel, not the most nines you can brag about, because the gap between them is where your whole roadmap goes to die.
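
The arithmetic behind those numbers is short enough to check yourself. This sketch assumes a 30-day month; `allowed_downtime_seconds` is a name made up for the example.

```python
def allowed_downtime_seconds(availability_pct, period_days=30):
    """Downtime allowed in a period at the given availability percentage."""
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1 - availability_pct / 100)


for target in (99.9, 99.99, 99.999):
    seconds = allowed_downtime_seconds(target)
    print(f"{target}% -> {seconds / 60:.1f} min ({seconds:.0f} s) per 30-day month")
```

For 99.9% that is 2,592,000 seconds times 0.001, or about 2,592 seconds: the 43 minutes in the headline.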

Rule of thumb

100% reliability is the wrong target on purpose

If your SLO is 99.9%, you have a 0.1% error budget, and that budget is permission to ship fast. Budget left over means you are being too cautious; budget burned means you freeze features and pay down reliability. It turns the dev-vs-ops fight into one shared number instead of a vibe.
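
One way to turn that shared number into something you can check, sketched here for a request-based SLO. The function name, the dictionary keys, and the "freeze when the budget is fully consumed" rule are assumptions for illustration; your policy may trigger earlier.

```python
def error_budget(slo_pct, total_requests, failed_requests):
    """Summarise error-budget usage for a request-based SLO."""
    allowed_failures = total_requests * (1 - slo_pct / 100)
    if allowed_failures > 0:
        consumed = failed_requests / allowed_failures
    else:
        # No traffic (or a 100% SLO): any failure blows the budget.
        consumed = 0.0 if failed_requests == 0 else float("inf")
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": consumed,        # 1.0 means the budget is gone
        "freeze_features": consumed >= 1.0,
    }


# 1,000,000 requests at a 99.9% SLO allow about 1,000 failures.
report = error_budget(99.9, total_requests=1_000_000, failed_requests=400)
print(report)
```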

Hard truth

Errors-over-1% alerts page you for nothing and miss the slow bleed

A static threshold fires on every harmless spike and sleeps through a slow leak that quietly drains your whole error budget by Friday. Multi-window, multi-burn-rate alerts fix both: a fast window catches sudden disasters, a slow window catches the gradual bleed, and both tie directly to budget burn instead of an arbitrary percentage.
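
A rough sketch of the idea. Burn rate is the observed error ratio divided by the budget ratio, so a burn rate of 1 spends exactly the whole budget over the SLO period. The window pairs and the 14.4 and 6 thresholds below are the values commonly cited for a 30-day SLO in Google's SRE Workbook; treat them, and every name in this block, as assumptions to tune rather than a prescription.

```python
def burn_rate(error_ratio, slo_pct):
    """How many times faster than 'sustainable' the budget is being spent."""
    budget_ratio = 1 - slo_pct / 100
    return error_ratio / budget_ratio


def both_windows_burning(long_ratio, short_ratio, slo_pct, threshold):
    # The long window proves the burn is significant; the short window
    # proves it is still happening, so the alert resets quickly once fixed.
    return (burn_rate(long_ratio, slo_pct) >= threshold
            and burn_rate(short_ratio, slo_pct) >= threshold)


def evaluate_alerts(error_ratios, slo_pct=99.9):
    """error_ratios maps a window name ('5m', '30m', '1h', '6h') to the
    fraction of failed requests observed in that window."""
    fired = []
    if both_windows_burning(error_ratios["1h"], error_ratios["5m"], slo_pct, 14.4):
        fired.append("fast-burn")   # sudden disaster
    if both_windows_burning(error_ratios["6h"], error_ratios["30m"], slo_pct, 6.0):
        fired.append("slow-burn")   # gradual bleed
    return fired
```

A five-minute spike that does not move the one-hour ratio fires nothing here, while a steady 0.7% error rate against a 99.9% SLO trips the slow-burn alert even though it never crosses a static 1% line.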

Rule of thumb

Four metrics catch almost every user-facing problem

Latency, traffic, errors, saturation. If you only instrument these four golden signals you will catch the vast majority of incidents before users complain. Everything else is detail you add once these tell you where to look, so start here before you drown in a thousand dashboards.
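
To make the four signals concrete, here is a toy in-memory tracker for one measurement window. It is an illustration only: in practice you would export these through your metrics library, and `GoldenSignals`, its fields, and the choice of "in-flight requests over capacity" as the saturation measure are all assumptions of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenSignals:
    window_seconds: float          # length of the measurement window
    capacity: int                  # max concurrent requests (assumed known)
    latencies_ms: list = field(default_factory=list)
    errors: int = 0
    in_flight: int = 0

    def record(self, latency_ms, ok):
        """Record one finished request."""
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1

    def snapshot(self):
        n = len(self.latencies_ms)
        if n:
            ordered = sorted(self.latencies_ms)
            rank = (99 * n + 99) // 100      # nearest-rank p99: ceil(0.99 * n)
            p99 = ordered[rank - 1]
        else:
            p99 = 0.0
        return {
            "latency_p99_ms": p99,                          # latency
            "traffic_rps": n / self.window_seconds,         # traffic
            "error_ratio": self.errors / n if n else 0.0,   # errors
            "saturation": self.in_flight / self.capacity,   # saturation
        }
```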

Rule of thumb

If toil passes 50% of your week, reliability is already losing

Toil is manual, repetitive work that scales with your service but never improves it. Google's SRE teams aim to keep it under 50% of each engineer's time for a reason: past that, there is no time left to build the automation that would kill the toil, and you spiral. Measure it, cap it, then automate the most frequent offender first.
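
"Measure it" can start as a spreadsheet, or as a few lines over a time log. This sketch assumes a hypothetical log of `(task, hours, is_toil)` entries for one week and picks the toil task that shows up most often as the first automation candidate.

```python
from collections import Counter


def toil_report(entries, cap=0.5):
    """entries: iterable of (task_name, hours, is_toil) tuples for one week."""
    entries = list(entries)
    total_hours = sum(hours for _, hours, _ in entries)
    toil_hours = sum(hours for _, hours, is_toil in entries if is_toil)
    occurrences = Counter(name for name, _, is_toil in entries if is_toil)
    fraction = toil_hours / total_hours if total_hours else 0.0
    return {
        "toil_fraction": fraction,
        "over_cap": fraction > cap,
        # Most frequent offender, not necessarily the most time-consuming one.
        "automate_first": occurrences.most_common(1)[0][0] if occurrences else None,
    }


week = [
    ("restart stuck worker", 1.0, True),
    ("restart stuck worker", 1.0, True),
    ("restart stuck worker", 1.5, True),
    ("manual cert renewal", 4.0, True),
    ("build deploy pipeline", 12.0, False),
]
print(toil_report(week))
```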

Hot take

Resilience you never tested is just a hope

You do not know your failover works until you kill the primary on purpose. Chaos engineering injects real failure (kill a node, add 200ms latency, drop a region) in controlled conditions so you find the broken assumption during business hours instead of at 3am. Start with a small blast radius and a hypothesis, not a free-for-all.
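
The smallest possible version of that experiment, as a sketch: wrap the primary so it always fails, state the hypothesis ("every read is still served by the replica"), and measure. All names here are hypothetical, and a real experiment would target real infrastructure with a limited blast radius rather than in-process functions.

```python
import random


def inject_faults(fn, failure_rate, rng=random.random):
    """Wrap fn so that a fraction of calls fail with an injected error."""
    def wrapped(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("injected fault")
        return fn(*args, **kwargs)
    return wrapped


def read_with_failover(primary, replica):
    """The code under test: fall back to the replica if the primary fails."""
    try:
        return primary()
    except ConnectionError:
        return replica()


def run_experiment(primary, replica, trials=100):
    """Hypothesis: with the primary killed, every read is still served.
    Returns the fraction of reads that succeeded."""
    dead_primary = inject_faults(primary, failure_rate=1.0)
    served = 0
    for _ in range(trials):
        try:
            read_with_failover(dead_primary, replica)
            served += 1
        except ConnectionError:
            pass
    return served / trials
```

Anything below 1.0 disproves the hypothesis, and you found out during business hours.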

Hard truth

Blame the engineer and you guarantee the next outage

If people fear punishment they hide the messy details that actually explain the failure, so you fix nothing. A blameless postmortem assumes everyone acted reasonably given what they knew, then fixes the system that let one human mistake cause an outage. The output is action items on the system, never a name.

Rule of thumb

Overload doesn't have to mean outage

When demand exceeds capacity, serving everyone slowly means serving no one. Shed load early: reject a fraction of requests fast with a clean 429 (or a 503 with a Retry-After header), protect the critical path, and let non-essential features degrade on purpose. A checkout that works while recommendations are down beats a site that is fully down.
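
A sketch of one way to shed load: cap in-flight requests and reserve part of that capacity for the critical path, so non-essential traffic gets rejected first. `LoadShedder`, `handle`, the paths, and the limits are invented for the example; real servers usually do this in middleware or at the load balancer.

```python
import threading


class LoadShedder:
    """Caps concurrent requests; the last `critical_reserve` slots are
    only available to critical requests."""

    def __init__(self, max_in_flight, critical_reserve):
        self.max_in_flight = max_in_flight
        self.critical_reserve = critical_reserve
        self.in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self, critical):
        limit = self.max_in_flight if critical else self.max_in_flight - self.critical_reserve
        with self._lock:
            if self.in_flight >= limit:
                return False
            self.in_flight += 1
            return True

    def release(self):
        with self._lock:
            self.in_flight -= 1


def handle(shedder, path, handler, critical_paths=frozenset({"/checkout"})):
    """Reject fast with 429 when overloaded instead of queueing the request."""
    if not shedder.try_acquire(critical=path in critical_paths):
        return 429, "overloaded, retry later"
    try:
        return 200, handler()
    finally:
        shedder.release()
```

With `max_in_flight=100` and `critical_reserve=20`, recommendations start getting 429s at 80 concurrent requests while checkout keeps working up to 100.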

Gotcha

The N+1 query that's fast in dev and dead in prod

Your ORM lazily fires one query per row, so a list of 20 items runs 21 queries: one for the list, then one per item. With 20 rows of seed data it is instant; with 50,000 production rows it melts the database. Eager-load the relation (a JOIN or a single batched IN query) and the endpoint can go from seconds to milliseconds.
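
You can see the query count without an ORM. This sketch uses Python's built-in `sqlite3` with a made-up authors/books schema and a small `CountingDB` wrapper (both are assumptions for the demo) to compare the lazy pattern against a single batched IN query.

```python
import sqlite3


def make_db(n_authors=20):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)")
    for i in range(1, n_authors + 1):
        conn.execute("INSERT INTO authors (id, name) VALUES (?, ?)", (i, f"author-{i}"))
        conn.execute("INSERT INTO books (author_id, title) VALUES (?, ?)", (i, f"book-{i}"))
    return conn


class CountingDB:
    """Thin wrapper that counts how many queries were sent."""

    def __init__(self, conn):
        self.conn = conn
        self.queries = 0

    def query(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def titles_n_plus_one(db):
    # 1 query for the list, then 1 more per author: N + 1 in total.
    authors = db.query("SELECT id, name FROM authors")
    return {
        name: [title for (title,) in
               db.query("SELECT title FROM books WHERE author_id = ?", (author_id,))]
        for author_id, name in authors
    }


def titles_batched(db):
    # 1 query for the list, 1 batched IN query for every relation.
    authors = db.query("SELECT id, name FROM authors")
    ids = [author_id for author_id, _ in authors]
    placeholders = ",".join("?" * len(ids))
    rows = db.query(
        f"SELECT author_id, title FROM books WHERE author_id IN ({placeholders})", ids
    )
    by_author = {}
    for author_id, title in rows:
        by_author.setdefault(author_id, []).append(title)
    return {name: by_author.get(author_id, []) for author_id, name in authors}
```

For very large ID lists, chunk the IN query or use a JOIN, since databases limit the number of bound parameters per statement.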

Hard truth

Exactly-once delivery doesn't exist, stop designing for it

Over an unreliable network you can have at-most-once or at-least-once delivery, never exactly-once, so any sender that retries will eventually replay your payment request. The fix is not magic delivery, it is idempotency: attach a client-generated key, store the result the first time, and return that same result on every replay. A retried charge then becomes a no-op instead of a furious customer.
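
A minimal in-memory sketch of the idempotency-key pattern. `PaymentService` and its fields are hypothetical; a real implementation needs a durable store and an atomic check-and-insert (for example, a unique constraint on the key) so replays survive restarts and races across processes.

```python
import threading
import uuid


class PaymentService:
    def __init__(self):
        self._results = {}            # idempotency key -> stored result
        self._lock = threading.Lock()
        self.charges_made = 0         # how many real charges happened

    def charge(self, idempotency_key, amount_cents):
        with self._lock:
            # Replay: return the stored result, do not charge again.
            if idempotency_key in self._results:
                return self._results[idempotency_key]
            self.charges_made += 1
            result = {
                "charge_id": f"ch_{self.charges_made}",
                "amount_cents": amount_cents,
                "status": "succeeded",
            }
            self._results[idempotency_key] = result
            return result


service = PaymentService()
key = str(uuid.uuid4())                 # generated once by the client, reused on retry
first = service.charge(key, 4999)
retry = service.charge(key, 4999)       # network retry of the same request
```

The key has to be generated by the client before the first attempt; a key minted per attempt defeats the whole point.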

Gotcha

One expired cache key can take down your database

When a hot key expires, every concurrent request misses at once and stampedes the database with the same expensive query. Defend with a short lock so only one request recomputes, serve slightly stale data while it does, and add jitter to TTLs so keys never all expire in the same second.
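
All three defenses fit in one small class. This is a single-process sketch with hypothetical names (`StampedeSafeCache`, `compute`); with a shared cache such as Redis or Memcached the lock would have to be a distributed one, which is not shown here.

```python
import random
import threading
import time


class StampedeSafeCache:
    def __init__(self, ttl, jitter=0.1, clock=time.monotonic, rng=random.random):
        self.ttl = ttl
        self.jitter = jitter          # 0.1 spreads expiry over +/- 10% of ttl
        self.clock = clock
        self.rng = rng
        self._data = {}               # key -> (value, expires_at)
        self._locks = {}              # key -> lock guarding recomputation
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key, compute):
        entry = self._data.get(key)
        if entry is not None and self.clock() < entry[1]:
            return entry[0]                       # fresh hit
        lock = self._lock_for(key)
        # Cold miss: nothing to serve, so wait for the lock.
        # Stale entry: try the lock once, otherwise serve the stale value.
        if lock.acquire(blocking=entry is None):
            try:
                current = self._data.get(key)
                if current is None or self.clock() >= current[1]:
                    value = compute()             # only one caller gets here
                    ttl = self.ttl * (1 + self.jitter * (2 * self.rng() - 1))
                    self._data[key] = (value, self.clock() + ttl)
                return self._data[key][0]
            finally:
                lock.release()
        return entry[0]                           # someone else is recomputing
```

The re-check after acquiring the lock matters: a request that waited on a cold miss finds the value already filled in and skips the expensive query.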


Prefer a path? Read by role.

Every article, sequenced from fundamentals to senior for your track.

From what the cloud actually is to IAM, networking, high availability, multi-region, and cost, the path from zero to senior cloud engineer.

Foundations: 7
Junior: 16
Senior: 15
Cross-Cutting / NFR: 8