The Simplified Tech

Engineering, with opinions

Teardowns, hard truths, and deep dives that make you better at the job, not another tutorial dump.

Engineering Shorts: One idea, 15 seconds. Scroll the feed →

Cloud Networking Fundamentals: How a VPC Actually Works

Networking is the wall most beginners hit. This is the from-scratch mental model: VPCs, subnets, route tables, gateways, and security groups, explained so clearly you could build one today. It comes with a diagram, real Terraform, and the mistakes that cost people hours.

Read the story (14 min read)

Shorts, quick hits

Open the feed

Gotcha

Your retries are turning a blip into a full outage

When a service slows down, naive retries multiply the load exactly when the system is weakest, creating a retry storm that finishes the job the original failure started. Always pair retries with exponential backoff AND jitter, cap the total attempts, and add a circuit breaker so you stop hammering a dependency that is already down.
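A minimal sketch of the retry half of that advice, assuming capped exponential backoff with "full jitter" (each wait is drawn uniformly between zero and the current ceiling). The names `backoff_delays` and `call_with_retries` and the default numbers are illustrative assumptions, and the circuit breaker is left out to keep the example short.

```python
import random
import time


def backoff_delays(max_attempts=5, base=0.1, cap=5.0, rng=random.random):
    """Yield one wait (in seconds) per retry: capped exponential backoff with full jitter."""
    for attempt in range(max_attempts - 1):
        ceiling = min(cap, base * (2 ** attempt))  # 0.1, 0.2, 0.4, ... up to cap
        yield rng() * ceiling                      # jitter spreads clients apart


def call_with_retries(fn, max_attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call fn(); on any exception wait and retry, for at most max_attempts calls in total."""
    delays = backoff_delays(max_attempts, base, cap)
    while True:
        try:
            return fn()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # attempts exhausted: surface the last error to the caller
            sleep(delay)
```

In real code you would likely retry only on errors known to be transient (timeouts, 503s) rather than on every exception, and wrap the call in a circuit breaker so a dead dependency stops receiving traffic at all.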

Deep dive

Did you know

99.9% uptime is 43 minutes of downtime every month

Each extra nine is roughly a 10x cost in engineering effort. 99.9% buys you ~43 min/month, 99.99% only ~4 min, 99.999% just 26 seconds. Pick the target your users actually feel, not the most nines you can brag about, because the gap between them is where your whole roadmap goes to die.
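The arithmetic behind those figures, as a small sketch. It assumes a 30-day month; the function name `allowed_downtime_seconds` is illustrative.

```python
def allowed_downtime_seconds(availability_percent, period_days=30):
    """Seconds of downtime permitted in the period at the given availability target."""
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        seconds = allowed_downtime_seconds(target)
        print(f"{target}% -> {seconds / 60:.2f} min ({seconds:.1f} s)")
```

At 99.9% over 30 days this works out to about 43.2 minutes, which is where the headline number comes from.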

Deep dive

Rule of thumb

100% reliability is the wrong target on purpose

If your SLO is 99.9%, you have a 0.1% error budget, and that budget is permission to ship fast. Budget left over means you are being too cautious; budget burned means you freeze features and pay down reliability. It turns the dev-vs-ops fight into one shared number instead of a vibe.
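One way to turn that shared number into something you can compute, sketched under the assumption of a request-based SLO (good requests over total requests). The function name and the example counts are hypothetical.

```python
def error_budget_remaining(slo_percent, total_requests, failed_requests):
    """Fraction of the error budget still unspent; negative means the budget is blown."""
    allowed_failures = total_requests * (1 - slo_percent / 100)
    if allowed_failures == 0:
        raise ValueError("no error budget to spend (100% SLO or zero traffic)")
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests at a 99.9% SLO allows about 1,000 failures.
    remaining = error_budget_remaining(99.9, 1_000_000, 250)
    print(f"{remaining:.0%} of the budget left")
```

A policy can then hang off the result: a healthy positive remainder means keep shipping, zero or below means freeze features and spend the time on reliability.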

Deep dive

Hard truth

Errors-over-1% alerts page you for nothing and miss the slow bleed

A static threshold fires on every harmless spike and sleeps through a slow leak that quietly drains your whole error budget by Friday. Multi-window, multi-burn-rate alerts fix both: a fast window catches sudden disasters, a slow window catches the gradual bleed, and both tie directly to budget burn instead of an arbitrary percentage.
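A rough sketch of the decision logic. Burn rate here means the observed error ratio divided by the budgeted error ratio, so a burn rate of 1 spends the budget exactly over the SLO period. The window pairs and the 14.4 and 6 thresholds are commonly cited starting points for a 30-day SLO, not values from this article, and the function names are illustrative.

```python
def burn_rate(error_ratio, slo_percent):
    """How many times faster than 'exactly on budget' the budget is being spent."""
    budget = 1 - slo_percent / 100
    return error_ratio / budget


# (long window, short window, burn-rate threshold): assumed values, tune for your SLO.
RULES = [
    ("1h", "5m", 14.4),  # fast burn: sudden disaster
    ("6h", "30m", 6.0),  # slow burn: gradual bleed
]


def should_page(error_ratio_by_window, slo_percent=99.9):
    """Page only if both the long and the short window of some rule exceed its threshold.

    error_ratio_by_window maps window names ('5m', '30m', '1h', '6h') to error ratios.
    """
    for long_w, short_w, threshold in RULES:
        long_burn = burn_rate(error_ratio_by_window[long_w], slo_percent)
        short_burn = burn_rate(error_ratio_by_window[short_w], slo_percent)
        if long_burn > threshold and short_burn > threshold:
            return True
    return False
```

Requiring the short window to agree with the long one is what keeps a brief spike from paging anyone, and it also lets the alert clear quickly once the problem stops.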

Deep dive

Rule of thumb

Four metrics catch almost every user-facing problem

Latency, traffic, errors, saturation. If you only instrument these four golden signals you will catch the vast majority of incidents before users complain. Everything else is detail you add once these tell you where to look, so start here before you drown in a thousand dashboards.

Deep dive

Rule of thumb

If toil passes 50% of your week, reliability is already losing

Toil is manual, repetitive work that scales with your service but never improves it. Google's SRE teams aim to keep it below 50% of each engineer's time for a reason: past that, there is no time left to build the automation that would kill the toil, and you spiral. Measure it, cap it, then automate the most frequent offender first.

Deep dive

Hot take

Resilience you never tested is just a hope

You do not know your failover works until you kill the primary on purpose. Chaos engineering injects real failure (kill a node, add 200ms latency, drop a region) in controlled conditions so you find the broken assumption during business hours instead of at 3am. Start with a small blast radius and a hypothesis, not a free-for-all.

Deep dive

Hard truth

Blame the engineer and you guarantee the next outage

If people fear punishment they hide the messy details that actually explain the failure, so you fix nothing. A blameless postmortem assumes everyone acted reasonably given what they knew, then fixes the system that let one human mistake cause an outage. The output is action items on the system, never a name.

Deep dive

Rule of thumb

Overload doesn't have to mean outage

When demand exceeds capacity, serving everyone slowly means serving no one. Shed load early: reject a fraction of requests fast with a clean 429 or 503, protect the critical path, and let non-essential features degrade on purpose. A checkout that works while recommendations are down beats a site that is fully down.
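A minimal admission-control sketch of that idea, assuming a fixed concurrency limit with a few slots reserved for the critical path. The class name `LoadShedder` and its fields are hypothetical, and the sketch is not thread-safe as written.

```python
from dataclasses import dataclass


@dataclass
class LoadShedder:
    max_in_flight: int      # assumed capacity of the service
    critical_reserve: int   # slots that only critical requests may use
    in_flight: int = 0

    def try_admit(self, critical: bool) -> bool:
        """Return True if the request may proceed, False if it should be rejected at once."""
        limit = self.max_in_flight if critical else self.max_in_flight - self.critical_reserve
        if self.in_flight >= limit:
            return False  # caller answers immediately, e.g. with an HTTP 429
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Call when an admitted request finishes."""
        self.in_flight = max(0, self.in_flight - 1)
```

With `LoadShedder(max_in_flight=100, critical_reserve=20)`, recommendation calls start being rejected at 80 requests in flight while checkout still has headroom up to 100.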

Deep dive

Gotcha

The N+1 query that's fast in dev and dead in prod

Your ORM lazily fires one query per row, so a list of 20 items runs 21 queries. With 20 rows of seed data it is instant; with 50,000 production rows it melts the database. Eager-load the relation (a JOIN or a single batched IN query) and watch your endpoint go from seconds to milliseconds.
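To make the 21-versus-1 count concrete, here is a small sketch using the standard-library `sqlite3` module in place of an ORM. The `authors` and `posts` tables and the query counter are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
""")
conn.executemany("INSERT INTO authors VALUES (?, ?)", [(i, f"author{i}") for i in range(1, 21)])
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)", [(i, i, f"post{i}") for i in range(1, 21)])

counter = {"queries": 0}  # counts every SELECT issued through run()


def run(sql, params=()):
    counter["queries"] += 1
    return conn.execute(sql, params).fetchall()


def titles_with_authors_n_plus_one():
    """One query for the list, then one more per row: what lazy loading does."""
    posts = run("SELECT title, author_id FROM posts ORDER BY id")
    return [(title, run("SELECT name FROM authors WHERE id = ?", (author_id,))[0][0])
            for title, author_id in posts]


def titles_with_authors_joined():
    """A single JOIN returns the same rows in one round trip."""
    return run("SELECT p.title, a.name FROM posts p "
               "JOIN authors a ON a.id = p.author_id ORDER BY p.id")
```

Calling the first function on the 20 seeded rows issues 21 queries; the second issues one and returns the same data. Most ORMs expose the joined or batched form through an eager-loading option.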

Deep dive

Hard truth

Exactly-once delivery doesn't exist, stop designing for it

A network with retries gives you at-least-once delivery, never exactly-once, so a retry can replay your payment request. The fix is not magic delivery; it is idempotency: attach a client-generated key, store the result the first time, and return that same result on every replay. A retried charge then becomes a no-op instead of a furious customer.
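A bare-bones sketch of the key-and-stored-result pattern. `PaymentService` is a hypothetical in-memory stand-in: a real service would keep the keys in a durable store and guard against two requests with the same key arriving concurrently.

```python
import uuid


class PaymentService:
    """In-memory illustration only; not safe for concurrent use."""

    def __init__(self):
        self._results = {}     # idempotency key -> response stored on first success
        self.charges_made = 0  # how many times money actually moved

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # replay: return the stored result
        self.charges_made += 1
        result = {
            "charge_id": str(uuid.uuid4()),
            "amount_cents": amount_cents,
            "status": "succeeded",
        }
        self._results[idempotency_key] = result
        return result
```

The client generates the key once per logical payment (for example with `uuid.uuid4()`) and reuses it on every retry, so the second and third deliveries of the same request return the first charge instead of creating a new one.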

Deep dive

Gotcha

One expired cache key can take down your database

When a hot key expires, every concurrent request misses at once and stampedes the database with the same expensive query. Defend with a short lock so only one request recomputes, serve slightly stale data while it does, and add jitter to TTLs so keys never all expire in the same second.
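The three defenses can be sketched together in a small in-process cache. This is an illustration under simplifying assumptions (one process, threads, a per-key lock); the class name `StampedeSafeCache` and its parameters are hypothetical, and a distributed cache would need a distributed lock instead.

```python
import random
import threading
import time


class StampedeSafeCache:
    def __init__(self, ttl=60.0, jitter=0.1, clock=time.monotonic):
        self._ttl, self._jitter, self._clock = ttl, jitter, clock
        self._data = {}    # key -> (value, expires_at)
        self._locks = {}   # key -> lock guarding recomputation of that key
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key, compute):
        entry = self._data.get(key)
        if entry and entry[1] > self._clock():
            return entry[0]                       # fresh hit
        lock = self._lock_for(key)
        # With a stale value in hand, do not wait for the lock; with nothing, wait.
        if not lock.acquire(blocking=entry is None):
            return entry[0]                       # someone else is recomputing: serve stale
        try:
            entry = self._data.get(key)
            if entry and entry[1] > self._clock():
                return entry[0]                   # refreshed by another thread while we waited
            value = compute()                     # only one thread per key gets here at a time
            spread = self._ttl * self._jitter
            expires_at = self._clock() + self._ttl + random.uniform(-spread, spread)
            self._data[key] = (value, expires_at)
            return value
        finally:
            lock.release()
```

The per-key lock means concurrent misses on a hot key trigger one recomputation, and the jittered expiry keeps keys written at the same moment from all expiring together.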

Deep dive

See all 68

Cloud & Infra

What the cloud actually is, once you strip the marketing.

Cloud Networking Fundamentals: How a VPC Actually Works

How the Cloud Actually Works: Regions, AZs & the Edge

IaaS vs PaaS vs SaaS, What You Actually Manage

Ship It

Cloud Networking Fundamentals: How a VPC Actually Works

What DevOps Actually Is (It's Not a Job Title)

CI/CD Fundamentals: What a Pipeline Really Does

Keep It Up

What is Site Reliability Engineering?

SLIs, SLOs & SLAs: Defining Reliability

Error Budgets Explained: Reliability You Can Actually Spend

AI Engineering

What Is an AI Engineer?

How LLMs Actually Work (for Engineers)

Prompt Engineering Fundamentals

Security Desk

What Is Application Security?

Authentication vs Authorization: The Two Most-Confused Words in Security

The OWASP Top 10, Explained

Server Side

What Is a Backend Engineer?

How the Web Works: HTTP Requests

REST API Design: Clean, Predictable HTTP APIs

The Browser

What Is a Frontend Engineer?

How the Browser Renders a Page

CSS Fundamentals: The Box Model & Layout

Browse everything

Prefer a path? Read by role.

Foundations

Cloud Networking Fundamentals: How a VPC Actually Works

How the Cloud Actually Works: Regions, AZs & the Edge

IaaS vs PaaS vs SaaS, What You Actually Manage

Junior

Deploying Your First Production App the Right Way

Infrastructure as Code with Terraform, Start Here

Load Balancing & Auto-Scaling Explained

Senior

Designing for High Availability & Disaster Recovery (RTO/RPO)

Multi-Region Architecture: When You Actually Need It

The Well-Architected Framework, Decoded

Cross-Cutting / NFR

Observability: Metrics, Logs & Traces (The Three Pillars)

Reliability & Resilience: Designing for Failure

Scalability Principles: Stateless, Horizontal & Decoupled
