Interactive Explainer

🎯Key Takeaways

Choose PostgreSQL as your default — it handles more scale than most companies ever reach.

Design schemas based on your query patterns, not just the data model.

Store money as integer cents. Use TIMESTAMPTZ. Use UUID primary keys.

The scaling ladder: optimize → vertical scale → read replicas → caching → sharding.

Sharding is a last resort. It eliminates cross-shard JOINs and makes transactions between shards impossible.

ACID guarantees (especially Atomicity and Durability) are why relational databases power banking, e-commerce, and healthcare.

Database Fundamentals: SQL, Scaling, and Data Modeling

The database is the hardest part of your system to change. Get it right first.

~7 min read

Be the first to complete!

What you'll learn

Choose PostgreSQL as your default — it handles more scale than most companies ever reach.
Design schemas based on your query patterns, not just the data model.
Store money as integer cents. Use TIMESTAMPTZ. Use UUID primary keys.
The scaling ladder: optimize → vertical scale → read replicas → caching → sharding.
Sharding is a last resort. It eliminates cross-shard JOINs and makes transactions between shards impossible.
ACID guarantees (especially Atomicity and Durability) are why relational databases power banking, e-commerce, and healthcare.

Lesson outline

Your Database Is the One Thing You Can't Easily Replace

In 2012, Instagram was acquired by Facebook for $1 billion. At the time, they had 13 employees and 30 million users. Their entire backend ran on PostgreSQL. They never needed to switch.

The Migration Nightmare

Twitter spent 3 years migrating from MySQL to Manhattan (their custom storage system). Uber migrated from PostgreSQL to MySQL in 2016 and wrote a 6,000-word blog post explaining why. Database migrations are the most expensive engineering projects most companies undertake.

The question isn't "which database is best." It's "which database fits my data model, access patterns, and scale requirements."

Database Type	Best For	Examples	Avoid When
Relational (SQL)	Structured data, complex queries, transactions	PostgreSQL, MySQL, SQLite	Massive scale with simple lookups, unstructured data
Document (NoSQL)	Flexible schemas, nested data, rapid iteration	MongoDB, Firestore, DynamoDB	Complex JOIN-heavy queries, strong consistency required
Key-Value	Session storage, caching, simple lookups	Redis, DynamoDB, etcd	Complex queries, data that needs relationships
Wide-Column	Time-series, high write throughput, append-heavy	Cassandra, HBase, BigTable	Complex queries, strong consistency, small datasets
Graph	Relationship-heavy data, social graphs, recommendations	Neo4j, Amazon Neptune	Non-graph data, simple queries
Search	Full-text search, faceted filtering	Elasticsearch, Solr, Typesense	Authoritative data source, complex transactions

The Relational Model: Still the Default for Good Reason

PostgreSQL and MySQL power the majority of the world's business data. The relational model — tables, rows, columns, foreign keys — has survived 50 years because it works.

ACID: The Guarantee That Makes Banks Trust Databases

🔵Atomicity — A transaction either fully completes or fully rolls back. No partial writes. "Transfer $100: debit A AND credit B — or neither."
🟢Consistency — Every transaction brings the database from one valid state to another. Foreign key constraints, check constraints — all enforced.
🟡Isolation — Concurrent transactions don't see each other's intermediate state. Four isolation levels: Read Uncommitted → Read Committed → Repeatable Read → Serializable.
🔴Durability — Committed transactions survive crashes. PostgreSQL writes a WAL (Write-Ahead Log) before confirming success.

Isolation Levels in Production

Most databases default to Read Committed. PostgreSQL uses it. MySQL InnoDB defaults to Repeatable Read. Serializable is the safest but slowest. Pick the lowest isolation level that still prevents your specific concurrency bugs.

database-schema.sql

1-- PostgreSQL schema with production-grade patterns
2 
3-- Users table: always use UUID primary keys for distributed systems
4CREATE TABLE users (
UUID over auto-increment: no coordination needed across database replicas
5  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
6  email       TEXT UNIQUE NOT NULL,
7  name        TEXT NOT NULL,
TIMESTAMPTZ stores timezone — use it. TIMESTAMP without TZ causes daylight saving bugs
8  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),  -- timezone-aware!
9  deleted_at  TIMESTAMPTZ                           -- soft delete, not hard delete
10);
11 
12-- Orders table: proper foreign keys with cascading
13CREATE TABLE orders (
14  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
ON DELETE RESTRICT prevents orphaned orders. ON DELETE CASCADE would delete orders with users
15  user_id     UUID NOT NULL REFERENCES users(id) ON DELETE RESTRICT,
Store money as integer cents, never as FLOAT. 0.1 + 0.2 = 0.30000000000000004 in floating point
16  status      TEXT NOT NULL CHECK (status IN ('pending', 'confirmed', 'shipped', 'delivered', 'cancelled')),
17  total_cents INTEGER NOT NULL CHECK (total_cents >= 0),  -- store money as integers!
18  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
19);
20 
Composite index on (user_id, created_at) supports "get user orders sorted by date" in one index scan
21-- Index for common query pattern: user's orders
22CREATE INDEX idx_orders_user_id_created ON orders(user_id, created_at DESC);
23 
24-- ACID transaction: transfer money atomically
25BEGIN;
26  UPDATE accounts SET balance = balance - 10000 WHERE id = 'account-A';
27  UPDATE accounts SET balance = balance + 10000 WHERE id = 'account-B';
28  -- If any error here, entire transaction rolls back
29COMMIT;
30 
31-- Window functions: FAANG interview favorite
32SELECT
33  user_id,
34  order_id,
35  total_cents,
36  SUM(total_cents) OVER (PARTITION BY user_id ORDER BY created_at) AS running_total,
37  RANK() OVER (PARTITION BY user_id ORDER BY total_cents DESC) AS order_rank
38FROM orders;

Data Modeling: Design for Queries, Not Just for Structure

Most junior engineers design schemas based on the real world. Senior engineers design schemas based on the queries that will run against them.

The Query-First Design Rule

Before designing any table, write the 5 most common queries your application needs. Then design the schema to answer those queries efficiently. Your indexes flow from your query patterns, not from your data model.

Normal Form	Eliminates	Trade-Off	Use When
1NF	Repeating groups, multi-valued columns	More tables	Always (minimum standard)
2NF	Partial functional dependencies	More joins	Always for relational data
3NF	Transitive dependencies	Even more joins	Standard for OLTP systems
Denormalized	Joins (intentionally)	Data duplication, update anomalies	Read-heavy, analytics, search indexes

When to Denormalize Intentionally

Instagram stores the follower count directly on the user row (denormalized) instead of counting follower rows every time. The count is eventually consistent but the read is instant. This is the right trade-off for 10B reads/day vs occasional inconsistency.

Data Modeling Process at FAANG

→

Write the 5–10 most critical queries your system needs to answer

→

Identify the entities (User, Order, Product) and their relationships

→

Normalize to 3NF as a starting point

→

Identify which queries would require expensive JOINs or full table scans

→

Selectively denormalize those specific hot paths (add materialized counts, embed frequently co-read data)

→

Add indexes for every WHERE clause and JOIN column in your critical queries

Load test with production-scale data before shipping

Write the 5–10 most critical queries your system needs to answer

Identify the entities (User, Order, Product) and their relationships

Normalize to 3NF as a starting point

Identify which queries would require expensive JOINs or full table scans

Selectively denormalize those specific hot paths (add materialized counts, embed frequently co-read data)

Add indexes for every WHERE clause and JOIN column in your critical queries

Load test with production-scale data before shipping

Scaling Databases: When One Machine Isn't Enough

The Database Scaling Ladder (climb before you jump)

Optimize queries first — Add indexes, rewrite slow queries, use EXPLAIN ANALYZE. 80% of "we need to scale our database" problems are solved here.
Vertical scaling — Bigger machine: more RAM, faster disk. A $5,000/month RDS instance handles enormous load. Often cheaper than engineering time.
Read replicas — Route read queries to replicas. Instagram ran on one primary + multiple read replicas for years. Replication lag is the tradeoff.
Caching layer — Add Redis/Memcached in front of hot data. The database never sees cache-hit queries.
Horizontal sharding — Split data across multiple databases by key (user_id % N). Only do this when you've exhausted everything else. Huge operational complexity.

Sharding Is a Last Resort

Sharding eliminates cross-shard JOINs, makes transactions spanning shards impossible, and requires a routing layer. Pinterest moved away from sharding to Vitess (MySQL with transparent sharding). Stripe and Shopify still run on single-region PostgreSQL with clever partitioning. Don't shard prematurely.

Technique	Addresses	Trade-Off	When to Apply
Index optimization	Slow queries	Slightly slower writes	Always — first step
Connection pooling (PgBouncer)	Connection exhaustion	Config complexity	>100 concurrent connections
Read replicas	Read throughput	Replication lag, eventual consistency	>70% reads in workload
Partitioning	Table bloat, time-series data	Partition pruning must be planned	>100M rows in one table
Sharding	Write throughput, data size	No cross-shard JOINs, complex ops	After everything else fails
NewSQL (CockroachDB, Spanner)	Global scale with SQL	Latency, cost, operational complexity	Global consistency required

read-replica-routing.ts

1// Read replica routing with Prisma
2import { PrismaClient } from '@prisma/client';
3 
4// Primary: all writes go here
5const primary = new PrismaClient({
Never mix connection pools — primary is for writes, replica for reads
6  datasources: { db: { url: process.env.DATABASE_PRIMARY_URL } }
7});
8 
9// Replica: reads only
10const replica = new PrismaClient({
11  datasources: { db: { url: process.env.DATABASE_REPLICA_URL } }
12});
13 
14// Route based on operation type
15class DatabaseRouter {
Replica reads are eventually consistent. Replication lag is typically 10-100ms
16  async findUser(id: string) {
17    // READ → replica (eventual consistency acceptable)
18    return replica.user.findUnique({ where: { id } });
19  }
20 
21  async createOrder(data: OrderCreateInput) {
22    // WRITE → primary (strong consistency required)
23    return primary.order.create({ data });
Post-write reads should go to primary to avoid read-after-write inconsistency
24  }
25 
26  async getOrderForDisplay(id: string) {
27    // Reads that MUST be fresh (after a write) → primary
28    return primary.order.findUnique({
29      where: { id },
30      include: { user: true, items: true }
31    });
32  }
33}
34 
35// ⚠️ Read-after-write problem:
36// User creates order → replica may not have it yet
37// Solution: For 1 second after writes, route reads to primary
38// Or: use a cache to store "just created" IDs

How this might come up in interviews

Database questions at FAANG assess whether you understand trade-offs, not just syntax. They want to see you reason about consistency vs availability, normalization vs performance, and when NOT to scale.

Common questions:

How would you design the database schema for a social media platform?
What is the difference between OLTP and OLAP? When would you use a separate analytics database?
Explain when you would denormalize a database schema
How do read replicas work and what consistency guarantees do they provide?

Strong answers include:

Mentions query-first design unprompted
Knows ACID and can explain isolation levels
Understands read-after-write consistency problem with replicas

Red flags:

Says "use NoSQL for scale" without understanding why
Doesn't know what an index is or when to add one
Designs schema without asking about query patterns

Quick check · Database Fundamentals: SQL, Scaling, and Data Modeling

1 / 3

You need to store monetary values in a database. Which column type should you use?

Key takeaways

Choose PostgreSQL as your default — it handles more scale than most companies ever reach.
Design schemas based on your query patterns, not just the data model.
Store money as integer cents. Use TIMESTAMPTZ. Use UUID primary keys.
The scaling ladder: optimize → vertical scale → read replicas → caching → sharding.
Sharding is a last resort. It eliminates cross-shard JOINs and makes transactions between shards impossible.
ACID guarantees (especially Atomicity and Durability) are why relational databases power banking, e-commerce, and healthcare.

From the books

Designing Data-Intensive Applications — Martin Kleppmann (2017)

Part I: Foundations of Data Systems

The fundamentals of data systems (replication, partitioning, transactions) are more important than any specific database product. These principles apply across all systems.

The Art of PostgreSQL — Dimitri Fontaine (2019)

Chapter on Indexing and Query Optimization

PostgreSQL is more powerful than most engineers realize. Window functions, CTEs, and partial indexes can often replace application-layer complexity.

Ready to see how this works in the cloud?

Switch to Career Paths for structured paths (e.g. Developer, DevOps) and provider-specific lessons.

View role-based paths

Discussion

Questions? Discuss in the community or start a thread below.

Join Discord

In-app Q&A

Your Database Is the One Thing You Can't Easily Replace

In 2012, Instagram was acquired by Facebook for $1 billion. At the time, they had 13 employees and 30 million users. Their entire backend ran on PostgreSQL. They never needed to switch.

The Migration Nightmare

The question isn't "which database is best." It's "which database fits my data model, access patterns, and scale requirements."

Database Type	Best For	Examples	Avoid When
Relational (SQL)	Structured data, complex queries, transactions	PostgreSQL, MySQL, SQLite	Massive scale with simple lookups, unstructured data
Document (NoSQL)	Flexible schemas, nested data, rapid iteration	MongoDB, Firestore, DynamoDB	Complex JOIN-heavy queries, strong consistency required
Key-Value	Session storage, caching, simple lookups	Redis, DynamoDB, etcd	Complex queries, data that needs relationships
Wide-Column	Time-series, high write throughput, append-heavy	Cassandra, HBase, BigTable	Complex queries, strong consistency, small datasets
Graph	Relationship-heavy data, social graphs, recommendations	Neo4j, Amazon Neptune	Non-graph data, simple queries
Search	Full-text search, faceted filtering	Elasticsearch, Solr, Typesense	Authoritative data source, complex transactions

The Relational Model: Still the Default for Good Reason

PostgreSQL and MySQL power the majority of the world's business data. The relational model — tables, rows, columns, foreign keys — has survived 50 years because it works.

ACID: The Guarantee That Makes Banks Trust Databases

🔵Atomicity — A transaction either fully completes or fully rolls back. No partial writes. "Transfer $100: debit A AND credit B — or neither."
🟢Consistency — Every transaction brings the database from one valid state to another. Foreign key constraints, check constraints — all enforced.
🟡Isolation — Concurrent transactions don't see each other's intermediate state. Four isolation levels: Read Uncommitted → Read Committed → Repeatable Read → Serializable.
🔴Durability — Committed transactions survive crashes. PostgreSQL writes a WAL (Write-Ahead Log) before confirming success.

Isolation Levels in Production

database-schema.sql

1-- PostgreSQL schema with production-grade patterns
2 
3-- Users table: always use UUID primary keys for distributed systems
4CREATE TABLE users (
UUID over auto-increment: no coordination needed across database replicas
5  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
6  email       TEXT UNIQUE NOT NULL,
7  name        TEXT NOT NULL,
TIMESTAMPTZ stores timezone — use it. TIMESTAMP without TZ causes daylight saving bugs
8  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),  -- timezone-aware!
9  deleted_at  TIMESTAMPTZ                           -- soft delete, not hard delete
10);
11 
12-- Orders table: proper foreign keys with cascading
13CREATE TABLE orders (
14  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
ON DELETE RESTRICT prevents orphaned orders. ON DELETE CASCADE would delete orders with users
15  user_id     UUID NOT NULL REFERENCES users(id) ON DELETE RESTRICT,
Store money as integer cents, never as FLOAT. 0.1 + 0.2 = 0.30000000000000004 in floating point
16  status      TEXT NOT NULL CHECK (status IN ('pending', 'confirmed', 'shipped', 'delivered', 'cancelled')),
17  total_cents INTEGER NOT NULL CHECK (total_cents >= 0),  -- store money as integers!
18  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
19);
20 
Composite index on (user_id, created_at) supports "get user orders sorted by date" in one index scan
21-- Index for common query pattern: user's orders
22CREATE INDEX idx_orders_user_id_created ON orders(user_id, created_at DESC);
23 
24-- ACID transaction: transfer money atomically
25BEGIN;
26  UPDATE accounts SET balance = balance - 10000 WHERE id = 'account-A';
27  UPDATE accounts SET balance = balance + 10000 WHERE id = 'account-B';
28  -- If any error here, entire transaction rolls back
29COMMIT;
30 
31-- Window functions: FAANG interview favorite
32SELECT
33  user_id,
34  order_id,
35  total_cents,
36  SUM(total_cents) OVER (PARTITION BY user_id ORDER BY created_at) AS running_total,
37  RANK() OVER (PARTITION BY user_id ORDER BY total_cents DESC) AS order_rank
38FROM orders;

Data Modeling: Design for Queries, Not Just for Structure

Most junior engineers design schemas based on the real world. Senior engineers design schemas based on the queries that will run against them.

The Query-First Design Rule

Normal Form	Eliminates	Trade-Off	Use When
1NF	Repeating groups, multi-valued columns	More tables	Always (minimum standard)
2NF	Partial functional dependencies	More joins	Always for relational data
3NF	Transitive dependencies	Even more joins	Standard for OLTP systems
Denormalized	Joins (intentionally)	Data duplication, update anomalies	Read-heavy, analytics, search indexes

When to Denormalize Intentionally

Data Modeling Process at FAANG

→

Write the 5–10 most critical queries your system needs to answer

→

Identify the entities (User, Order, Product) and their relationships

→

Normalize to 3NF as a starting point

→

Identify which queries would require expensive JOINs or full table scans

→

Selectively denormalize those specific hot paths (add materialized counts, embed frequently co-read data)

→

Add indexes for every WHERE clause and JOIN column in your critical queries

Load test with production-scale data before shipping

Write the 5–10 most critical queries your system needs to answer

Identify the entities (User, Order, Product) and their relationships

Normalize to 3NF as a starting point

Identify which queries would require expensive JOINs or full table scans

Selectively denormalize those specific hot paths (add materialized counts, embed frequently co-read data)

Add indexes for every WHERE clause and JOIN column in your critical queries

Load test with production-scale data before shipping

Scaling Databases: When One Machine Isn't Enough

The Database Scaling Ladder (climb before you jump)

Optimize queries first — Add indexes, rewrite slow queries, use EXPLAIN ANALYZE. 80% of "we need to scale our database" problems are solved here.
Vertical scaling — Bigger machine: more RAM, faster disk. A $5,000/month RDS instance handles enormous load. Often cheaper than engineering time.
Read replicas — Route read queries to replicas. Instagram ran on one primary + multiple read replicas for years. Replication lag is the tradeoff.
Caching layer — Add Redis/Memcached in front of hot data. The database never sees cache-hit queries.
Horizontal sharding — Split data across multiple databases by key (user_id % N). Only do this when you've exhausted everything else. Huge operational complexity.

Sharding Is a Last Resort

Technique	Addresses	Trade-Off	When to Apply
Index optimization	Slow queries	Slightly slower writes	Always — first step
Connection pooling (PgBouncer)	Connection exhaustion	Config complexity	>100 concurrent connections
Read replicas	Read throughput	Replication lag, eventual consistency	>70% reads in workload
Partitioning	Table bloat, time-series data	Partition pruning must be planned	>100M rows in one table
Sharding	Write throughput, data size	No cross-shard JOINs, complex ops	After everything else fails
NewSQL (CockroachDB, Spanner)	Global scale with SQL	Latency, cost, operational complexity	Global consistency required

read-replica-routing.ts

1// Read replica routing with Prisma
2import { PrismaClient } from '@prisma/client';
3 
4// Primary: all writes go here
5const primary = new PrismaClient({
Never mix connection pools — primary is for writes, replica for reads
6  datasources: { db: { url: process.env.DATABASE_PRIMARY_URL } }
7});
8 
9// Replica: reads only
10const replica = new PrismaClient({
11  datasources: { db: { url: process.env.DATABASE_REPLICA_URL } }
12});
13 
14// Route based on operation type
15class DatabaseRouter {
Replica reads are eventually consistent. Replication lag is typically 10-100ms
16  async findUser(id: string) {
17    // READ → replica (eventual consistency acceptable)
18    return replica.user.findUnique({ where: { id } });
19  }
20 
21  async createOrder(data: OrderCreateInput) {
22    // WRITE → primary (strong consistency required)
23    return primary.order.create({ data });
Post-write reads should go to primary to avoid read-after-write inconsistency
24  }
25 
26  async getOrderForDisplay(id: string) {
27    // Reads that MUST be fresh (after a write) → primary
28    return primary.order.findUnique({
29      where: { id },
30      include: { user: true, items: true }
31    });
32  }
33}
34 
35// ⚠️ Read-after-write problem:
36// User creates order → replica may not have it yet
37// Solution: For 1 second after writes, route reads to primary
38// Or: use a cache to store "just created" IDs