Reliability of Stateful & Data Systems

TL;DR

Stateless services fail over easily; databases, queues, and caches are where reliability is won or lost. Master replication, failover, backups with PITR, and split-brain avoidance, including for stateful workloads on Kubernetes.

The stateless services failed over fine. The database didn't.

It's 3am. An availability zone drops. Your load balancer notices in seconds and pulls the bad targets. Your stateless API pods reschedule onto healthy nodes, pass their health checks, and start serving traffic again. The dashboards go green. You almost go back to sleep.

Then the error rate spikes again. The API is up, but every write is failing. The primary database lived in the zone that just died. The replica is there, ready, but nothing promoted it, because nobody had tested the failover in eight months and the automation referenced a security group that no longer existed. You spend the next 40 minutes promoting a replica by hand, praying its replication lag was small, and wondering exactly how many seconds of writes you just lost.

This is the recurring lesson of on-call: stateless tiers heal themselves; stateful tiers do not. A web server has no memory, so any copy is as good as any other. A database *is* the memory. You can't just spin up another one: it has to be the right one, with the right data, promoted in the right order. The cost of getting that wrong is measured in lost transactions, not lost requests.

Who this is for

Engineers who can already keep a stateless service alive (autoscaling, health checks, rolling deploys) and now own a database, queue, or cache where the data itself is the thing that can't be lost. You know what a replica is; you want to know how replication, backups, and failover actually behave when a node dies at 3am. Some Kubernetes familiarity helps for the last section but isn't required.

Stateless is easy. State is where reliability is won or lost.

Stateless services are easy to make reliable because any instance is interchangeable. State is hard, because every copy is a claim about the truth, and reliability is the discipline of keeping those claims in agreement while machines fail underneath them.
The core idea of this article

Reliability for stateless workloads is a numbers game: run more replicas, spread them across zones, let the orchestrator replace dead ones. There is no "wrong" instance. Reliability for stateful workloads is a *truth* game. The instant you have two copies of data, you have a question (which one is authoritative?), and every failure mode is some version of that question being answered wrong: a stale replica promoted, two nodes both believing they're the primary, a backup that restores to a moment that never actually existed.

Analogy	What it maps to
A team of identical cashiers: if one quits mid-shift, the next rings you up and nobody notices.	Stateless API pods: any instance serves any request; losing one is invisible.
The single accounting ledger the cashiers write into: there can only be one source of truth, and a torn page is permanent.	The primary database: one authoritative copy; corruption or loss is not recoverable from the cashiers.
A scribe copying the ledger a few seconds behind: useful if the original burns, but only as current as the last line copied.	An async replica: survives primary loss, but anything written after the last replicated transaction is gone.
A locked nightly photocopy in a fireproof safe across town.	A backup: an independent point-in-time copy that survives bugs and operator mistakes the replica would faithfully copy.

Why state changes the rules

The picture: primary, replicas, failover, and backups

A reliable data store has two completely separate safety systems, and conflating them is the root of most data-loss incidents. Replication keeps live copies for availability: if the primary dies, a replica takes over fast. Backups + point-in-time recovery (PITR) keep independent historical copies for durability: if data is corrupted, deleted, or a bad migration runs, you can rewind to a moment before the damage. Replication protects against a machine dying. Backups protect against the *truth itself* being wrong.

The primary streams its write-ahead log to replicas (live, for failover) and ships it to object storage (durable, for PITR). The two paths protect against different failures.

1. The application reads and writes the primary
All writes go to exactly one node, the primary, which is the single source of truth. Reads may be routed to replicas to spread load, but every write funnels through one place so there's an unambiguous order of events.
2. The primary streams its log to replicas
Every change is recorded in the write-ahead log (WAL). Replicas replay that stream to stay current. A synchronous replica acknowledges before the primary confirms the write (zero loss, higher latency); an async replica trails behind (lower latency, a small window of loss).
3. The same log is archived to durable storage
The WAL is also shipped to object storage and paired with a periodic base backup. Together they let you replay the database forward to any second in time. That's PITR. This path is independent of the replicas.
4. A controller watches health and promotes on failure
When the primary stops responding, the failover controller picks the most up-to-date replica, promotes it to primary, and repoints the application. The whole game is doing this quickly, exactly once, and never to a stale node.
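
To make "picks the most up-to-date replica" concrete, here is a minimal Python sketch. It assumes each replica reports the WAL position it has replayed (in Postgres, the value of `pg_last_wal_replay_lsn()`), and that health is already known. The `Replica` class, the node names, and the LSN values are hypothetical; a real controller weighs more than this (timelines, lag limits, operator tags).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


def parse_lsn(lsn: str) -> int:
    """Convert a Postgres LSN such as '0/3000060' into a comparable integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


@dataclass(frozen=True)
class Replica:
    name: str         # hypothetical node name
    replay_lsn: str   # e.g. pg_last_wal_replay_lsn() as reported by that node
    healthy: bool


def pick_promotion_candidate(replicas: list[Replica]) -> Optional[Replica]:
    """Return the healthy replica that has replayed the most WAL, or None."""
    healthy = [r for r in replicas if r.healthy]
    if not healthy:
        return None
    return max(healthy, key=lambda r: parse_lsn(r.replay_lsn))


replicas = [
    Replica("db-1", "0/3000060", healthy=True),
    Replica("db-2", "0/3000148", healthy=True),
    Replica("db-3", "1/0", healthy=False),  # furthest ahead, but unreachable
]
best = pick_promotion_candidate(replicas)
print(best.name if best else "no safe candidate")
```

LSNs are compared numerically rather than as strings because the two hexadecimal halves are not zero-padded, so a string comparison can order them incorrectly.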

Steps to make a data store reliable

There's a natural order here. Each step assumes the one before it actually works, and "actually works" means tested, not configured.

1. Define your RPO and RTO first
RPO (recovery point objective) is how much data you can afford to lose; RTO (recovery time objective) is how long you can be down. "Zero" for either is extremely expensive, so pick honest numbers per dataset. Everything below is just the machinery to hit them. See Disaster Recovery: RTO, RPO & DR Strategies.
2. Add at least one replica in another failure domain
Another availability zone at minimum. A replica in the same rack as the primary protects against a disk failure but not a zone outage. Match the replica's resources to the primary: an undersized replica falls behind and is useless as a failover target.
3. Set up backups + PITR, separate from replication
A base backup plus continuous WAL archiving to storage that is isolated from the primary's account/region. A replica that faithfully copies a DROP TABLE is not a backup.
4. Automate failover, with a guard against split-brain
Use a consensus-based controller (Patroni, or a managed provider's built-in HA such as RDS Multi-AZ) that fences the old primary so it can't keep accepting writes after a new one is elected.
5. Test failover and restore on a schedule
Run a real failover and a real restore-to-PITR in a non-prod environment regularly. An untested backup is a hope, not a recovery plan. Measure the actual RTO/RPO you achieve and compare it to your target.
6. Watch replication lag as a first-class signal
Replica lag is the size of your data-loss window if the primary dies *right now*. Alert on it. Lag silently growing is one of the most common ways a "safe" setup turns into a 30-second data loss during failover.
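
Step 5 asks you to measure the RTO and RPO a drill actually achieved. One way to turn that into numbers, sketched in Python under the assumption that the drill records four timestamps: the `DrillResult` fields and the sample values below are hypothetical, not output from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillResult:
    primary_failed_at: datetime    # when the drill killed the primary
    writes_restored_at: datetime   # first successful write on the new primary
    last_commit_on_old: datetime   # newest commit the old primary acknowledged
    last_commit_on_new: datetime   # newest commit present after promotion


def score_drill(result: DrillResult, rpo: timedelta, rto: timedelta) -> dict:
    """Compare the RPO/RTO a drill achieved with the targets."""
    achieved_rto = result.writes_restored_at - result.primary_failed_at
    # Data loss is the span of acknowledged commits missing on the new primary.
    achieved_rpo = max(
        result.last_commit_on_old - result.last_commit_on_new, timedelta(0)
    )
    return {
        "achieved_rto": achieved_rto,
        "achieved_rpo": achieved_rpo,
        "rto_met": achieved_rto <= rto,
        "rpo_met": achieved_rpo <= rpo,
    }


t0 = datetime(2024, 1, 1, 3, 0, 0)
drill = DrillResult(
    primary_failed_at=t0,
    writes_restored_at=t0 + timedelta(seconds=95),
    last_commit_on_old=t0 - timedelta(seconds=1),
    last_commit_on_new=t0 - timedelta(seconds=4),
)
report = score_drill(drill, rpo=timedelta(seconds=5), rto=timedelta(seconds=60))
print(report)
```

In this made-up drill the data loss (3 seconds) is inside a 5-second RPO, but the 95-second recovery misses a 60-second RTO, which is the kind of gap only a real drill reveals.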

Replication vs backup vs snapshot: they protect against different things

These three are constantly confused, and the confusion is what causes data loss. They are not redundant; you need all three. The key question for any copy is: *what failure does it protect against, and how much data does it lose?*

Mechanism	Protects against	Does NOT protect against	Typical RPO
Replication (sync)	Node/zone failure of the primary	Bad migrations, deletes, corruption (replicas copy them instantly)	~0 (no committed writes lost)
Replication (async)	Node/zone failure, cross-region disaster	Logical errors; the recent-write window during failover	Seconds to minutes (lag)
Snapshot	Volume loss; fast rollback to a recent point	Writes made since the last snapshot; storage-account compromise if co-located	Minutes to hours (snapshot interval)
Backup + PITR	Corruption, bad deploys, ransomware, operator error	Nothing it can't rewind from; the cost is restore time (RTO)	Seconds (replay WAL to any point)

Three different safety mechanisms. Replication is for availability; backups and snapshots are for durability against logical errors.
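
The "Typical RPO" column is mostly arithmetic on how often a recoverable copy is produced. The Python sketch below works that out for the case where the primary's data is lost outright and you fall back to the newest copy. The schedule and the 60-second WAL archiving interval are assumptions chosen for illustration (in Postgres, `archive_timeout` is one setting that bounds how long a WAL segment can go unarchived).

```python
from datetime import datetime, timedelta


def loss_window(
    incident_at: datetime, first_copy_at: datetime, interval: timedelta
) -> timedelta:
    """Data lost when recovering to the newest copy taken at or before the incident.

    Copies are assumed to be taken at first_copy_at and every `interval` after it.
    """
    if incident_at < first_copy_at:
        raise ValueError("no copy exists before the incident")
    return (incident_at - first_copy_at) % interval


midnight = datetime(2024, 1, 1, 0, 0, 0)
incident = datetime(2024, 1, 1, 16, 20, 30)

for label, interval in [
    ("nightly snapshot", timedelta(days=1)),
    ("hourly snapshot", timedelta(hours=1)),
    ("WAL archived every 60s", timedelta(seconds=60)),
]:
    print(f"{label:>24}: lose {loss_window(incident, midnight, interval)}")
```

For a logical error such as a bad migration, PITR does better than this bound suggests, because you replay the archived log to just before the damaging statement rather than to the last archived segment.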

Replication is not a backup

This is the single most expensive misconception in stateful systems. Replication copies every change, including the catastrophic ones, to every replica in milliseconds. The day someone runs DELETE FROM orders without a WHERE, all your replicas obediently delete every order too. Only an independent, point-in-time backup saves you.
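
A toy model makes the difference visible. In this Python sketch the "WAL" is just a list of numbered operations and the "database" is a dict; none of it is real Postgres behaviour, and the operation names are made up. A replica replays everything it receives, while a point-in-time restore starts from the base backup and stops just before the damage.

```python
def replay(base: dict, log: list, up_to: int) -> dict:
    """Apply log entries with sequence number <= up_to on top of a base copy.

    The log is assumed to be ordered by sequence number.
    """
    state = dict(base)
    for seq, op, key, value in log:
        if seq > up_to:
            break
        if op == "put":
            state[key] = value
        elif op == "delete_all":
            state.clear()
    return state


base_backup = {"order-1": "paid"}       # base backup taken at sequence 0
wal = [
    (1, "put", "order-2", "paid"),
    (2, "put", "order-3", "pending"),
    (3, "delete_all", None, None),      # the DELETE without a WHERE
    (4, "put", "order-4", "paid"),
]

# A replica replays everything it receives, including the bad statement.
replica_state = replay(base_backup, wal, up_to=4)

# PITR replays the archived log only up to just before the damage.
restored_state = replay(base_backup, wal, up_to=2)

print("replica :", replica_state)
print("restored:", restored_state)
```

Note what the restore costs: `order-4`, written after the bad statement, is not in the restored state and has to be reconciled separately. Rewinding is a trade, not a free undo.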

Verify it: failover health and backup restore

Two things must be continuously true: the failover target is healthy and not lagging, and the backups can actually be restored. Here's a health check a controller can poll to decide whether a replica is safe to promote. It refuses any replica whose lag exceeds your RPO.

replica-failover-check.sh

bash

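#!/usr/bin/env bash
# Decide whether a Postgres replica is safe to promote.
# Exit 0 = healthy & within RPO; non-zero = do NOT promote.
set -euo pipefail

REPLICA_HOST="${1:?usage: $0 <replica-host>}"
MAX_LAG_SECONDS="${MAX_LAG_SECONDS:-5}"   # your RPO budget
export PGCONNECT_TIMEOUT="${PGCONNECT_TIMEOUT:-3}"   # don't hang on a dead host

# -X skips ~/.psqlrc so local settings can't alter the output we parse.
query() { psql -X -h "$REPLICA_HOST" -v ON_ERROR_STOP=1 -tAc "$1"; }

# 1. Is it actually in recovery (a real replica, not a stale ex-primary)?
in_recovery=$(query "SELECT pg_is_in_recovery();") \
  || { echo "FAIL: cannot query $REPLICA_HOST"; exit 1; }
if [ "$in_recovery" != "t" ]; then
  echo "FAIL: $REPLICA_HOST is not a replica (split-brain risk)"; exit 2
fi

# 2. How far behind is it, in seconds of wall-clock time?
# Caveat: this is the time since the last replayed transaction, so it also
# grows while the primary is idle or already down. Poll it while the primary
# is alive and judge a promotion on the last value seen before the failure.
lag=$(query "SELECT COALESCE(EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())), 1e9);") \
  || { echo "FAIL: cannot read replay lag from $REPLICA_HOST"; exit 1; }
lag=${lag%.*}   # truncate to whole seconds for the integer comparison

if [ "$lag" -gt "$MAX_LAG_SECONDS" ]; then
  echo "FAIL: lag ${lag}s exceeds RPO ${MAX_LAG_SECONDS}s, promoting loses data"; exit 3
fi

echo "OK: $REPLICA_HOST in recovery, lag ${lag}s (<= ${MAX_LAG_SECONDS}s), safe to promote"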

Backups are worthless until you've restored one. Wire a restore drill into a scheduled job that restores to a throwaway instance, runs an integrity query, and fails loudly if the data isn't there. This is the SQL the drill runs after restoring to a target point in time:

restore-verification.sql

sql

-- Run against a database freshly restored from backup + PITR.
-- Check 1 raises an error on an empty restore; checks 2 and 3 return rows
-- that the drill must assert on (stale or structurally broken data).

-- 1. Core tables exist and are non-empty.
DO $$
DECLARE n bigint;
BEGIN
  SELECT count(*) INTO n FROM orders;
  IF n = 0 THEN RAISE EXCEPTION 'restore check failed: orders table is empty'; END IF;
END $$;

-- 2. The restore reached the expected point in time (data is fresh, not days old).
SELECT
  max(created_at)                         AS newest_row,
  now() - max(created_at)                 AS staleness
FROM orders
HAVING now() - max(created_at) < interval '24 hours';
-- If this returns zero rows, the restore is too old, alert.

-- 3. Referential integrity survived the restore (no orphaned children).
SELECT count(*) AS orphaned_line_items
FROM order_items oi
LEFT JOIN orders o ON o.id = oi.order_id
WHERE o.id IS NULL;
-- Expect 0. Anything else means a torn / partial restore.
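
Checks 2 and 3 only return rows, so something has to turn their results into a pass or fail. A minimal Python sketch of that last step follows. It assumes the drill has already run the SQL and extracted two numbers: how many rows the freshness query returned, and the orphan count. The function names are hypothetical, and fetching the values from the database is deliberately left out.

```python
import sys


def restore_failures(freshness_rows: int, orphaned_line_items: int) -> list:
    """Translate the results of checks 2 and 3 into a list of failure messages."""
    failures = []
    if freshness_rows == 0:
        # The HAVING clause filtered the row out: newest order is too old.
        failures.append("restore is stale: newest order is outside the freshness window")
    if orphaned_line_items != 0:
        failures.append(
            f"{orphaned_line_items} orphaned line items: torn or partial restore"
        )
    return failures


def drill_exit_code(freshness_rows: int, orphaned_line_items: int) -> int:
    """Print every failure to stderr and return a non-zero code if any exist."""
    failures = restore_failures(freshness_rows, orphaned_line_items)
    for message in failures:
        print(f"FAIL: {message}", file=sys.stderr)
    return 1 if failures else 0


print(drill_exit_code(freshness_rows=1, orphaned_line_items=0))   # healthy restore
print(drill_exit_code(freshness_rows=0, orphaned_line_items=12))  # stale and torn
```

Returning a non-zero exit code is what lets a scheduler mark the drill as failed and page someone, rather than leaving a bad result in a log nobody reads.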

Consistency under failover, and the split-brain trap

Failover is where consistency guarantees go to die. The promotion itself is a distributed-systems problem: you're changing the source of truth while machines may be partitioned, slow, or briefly back from the dead. Two failure modes dominate.

Lost writes (the async gap)

If you promote an async replica, every transaction the primary committed but hadn't yet streamed is simply gone. The replica never saw it, so the new primary's history has a hole. This is exactly your replication lag, realized as data loss. Synchronous replication closes this gap: the primary waits for the replica to acknowledge before confirming a write, at the cost of write latency and an availability trade-off (if the sync replica is down, writes block). Choosing sync vs async per dataset is choosing your RPO. For the deeper rules on isolation and what "committed" even means, see Database Transactions & Consistency.
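
The async gap is easy to see in a toy model. The Python class below is a deliberately simplified illustration, not a replication protocol: `ToyPrimary` and its methods are hypothetical, and "streaming" is just copying a list.

```python
class ToyPrimary:
    """Toy model of a primary with one replica."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.acknowledged = []  # writes confirmed to the client
        self.on_replica = []    # writes the replica has received
        self._unsent = []       # async only: confirmed but not yet streamed

    def write(self, value: str) -> None:
        if self.synchronous:
            # Sync: the replica has the write before the client hears "committed".
            self.on_replica.append(value)
        else:
            # Async: confirm immediately, stream later.
            self._unsent.append(value)
        self.acknowledged.append(value)

    def stream(self) -> None:
        """Ship everything not yet sent to the replica."""
        self.on_replica.extend(self._unsent)
        self._unsent.clear()

    def lost_if_primary_dies_now(self) -> list:
        """Acknowledged writes that a promoted replica would not have."""
        return [w for w in self.acknowledged if w not in self.on_replica]


async_primary = ToyPrimary(synchronous=False)
async_primary.write("a")
async_primary.write("b")
async_primary.stream()
async_primary.write("c")  # confirmed to the client, never streamed

sync_primary = ToyPrimary(synchronous=True)
for value in ("a", "b", "c"):
    sync_primary.write(value)

print("async loses:", async_primary.lost_if_primary_dies_now())
print("sync loses :", sync_primary.lost_if_primary_dies_now())
```

The model leaves out the cost side of synchronous mode: in a real system the `write` call would block for a network round trip, and would block indefinitely if the replica were unreachable.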

Split-brain (two primaries)

Split-brain is the nightmare: a network partition makes the controller think the primary is dead, so it promotes a replica, but the old primary is actually alive and still taking writes from clients that can still reach it. Now you have two primaries accepting conflicting writes to the same logical dataset. When the partition heals, there is no correct way to merge them; someone's data is getting overwritten or thrown away. This is silent, permanent corruption, and it's far worse than an outage.

Fencing is what prevents split-brain

Before promoting a new primary, the controller must guarantee the old one cannot accept writes, by killing it, revoking its network access, or requiring a quorum (majority) to elect a leader so a minority partition can never promote. "Fencing" / STONITH (shoot the other node in the head) is not optional; it's the entire point of a real HA controller versus a script that just promotes a replica.
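
The ordering is the important part: fence first, promote second, and refuse to promote if fencing cannot be confirmed. Here is a minimal Python sketch of that rule. The two callables are hypothetical stand-ins for whatever your environment uses to fence (powering off the node, revoking its network access) and to promote; nothing here calls a real API.

```python
from typing import Callable


class FencingFailed(Exception):
    """Raised when the old primary could not be confirmed fenced."""


def fail_over(
    fence_old_primary: Callable[[], bool],
    promote_candidate: Callable[[], None],
) -> None:
    """Promote only after the old primary is confirmed unable to accept writes."""
    if not fence_old_primary():
        raise FencingFailed("old primary not confirmed fenced; refusing to promote")
    promote_candidate()


events = []


def fence_ok() -> bool:
    events.append("fenced old primary")
    return True


def fence_unreachable() -> bool:
    events.append("fencing attempt failed")
    return False


def promote() -> None:
    events.append("promoted replica")


fail_over(fence_ok, promote)
try:
    fail_over(fence_unreachable, promote)
except FencingFailed as exc:
    events.append(f"aborted: {exc}")

print(events)
```

Refusing to promote when fencing fails means choosing a longer outage over the risk of two primaries, which matches the article's ranking of those two outcomes.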

This is why consensus-based systems (Raft/Paxos under the hood of etcd, Patroni's DCS, managed Multi-AZ) require a *majority* to elect a primary. A partition that can't reach the majority is forbidden from promoting, and an old primary cut off from the majority must step down once its leader lease expires, so two nodes can never both hold a valid claim to be primary.
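
The majority rule is small enough to check by hand. This Python sketch shows the arithmetic only; the function names are illustrative and not taken from any consensus library.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of members that forms a strict majority."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def may_elect_primary(reachable_members: int, cluster_size: int) -> bool:
    """A partition may elect a primary only if it can see a majority."""
    return reachable_members >= quorum_size(cluster_size)


# A 5-member cluster split 3/2: only the larger side may elect.
print(may_elect_primary(3, 5), may_elect_primary(2, 5))

# A 4-member cluster split 2/2: neither side may elect, so writes stop
# rather than diverge.
print(may_elect_primary(2, 4), may_elect_primary(2, 4))
```

Because two disjoint groups cannot both contain more than half the members, at most one side of any partition can pass this check. The even split also shows why odd cluster sizes are usually preferred: a fourth member adds cost without letting the cluster survive any additional failure.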

Operating stateful workloads on Kubernetes

Kubernetes was built for the easy case: stateless pods you can kill and reschedule anywhere. Running databases on it is doable, but many defaults work against you because they assume interchangeability.

Use a StatefulSet, never a Deployment. StatefulSets give each pod a stable identity (db-0, db-1), stable storage that follows the pod, and ordered startup, all of which a database needs and a Deployment deliberately doesn't provide.
Pin storage with the right reclaim policy. A PersistentVolumeClaim must bind to a volume in the same zone as the pod, and the StorageClass reclaim policy must be Retain so a deleted PVC doesn't delete your data with it.
Prefer a battle-tested Operator over hand-rolling. Operators (CloudNativePG, Zalando/Crunchy Postgres, Vitess) encode failover, fencing, backup, and PITR as controllers that understand the database. Reimplementing that with raw manifests is how you build split-brain by hand.
Set PodDisruptionBudgets and anti-affinity. Without a PDB, a routine node drain can evict your primary and a replica at once. Anti-affinity keeps replicas on different nodes and zones so one node loss isn't two-thirds of your cluster.
Mind that the scheduler will happily reschedule your primary. A pod eviction that's invisible for a stateless service is a failover event for a database. Treat node maintenance as a controlled failover, not a background detail.
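
The storage, disruption-budget, and anti-affinity points above map onto a handful of manifest fields. The sketch below builds them as Python dicts and prints JSON, which `kubectl apply -f` accepts alongside YAML. The resource names, the `app: db` label, and the provisioner are assumptions for illustration; substitute your cluster's CSI driver and your StatefulSet's labels. If you run an Operator, it typically manages most of this for you.

```python
import json

LABELS = {"app": "db"}  # hypothetical; must match the StatefulSet's pod labels

storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "db-retain"},
    "provisioner": "ebs.csi.aws.com",  # assumption: AWS EBS CSI driver
    # Keep the underlying volume when the PVC is deleted.
    "reclaimPolicy": "Retain",
    # Delay binding until the pod is scheduled, so the volume is created
    # in the same zone as the pod that will use it.
    "volumeBindingMode": "WaitForFirstConsumer",
}

pod_disruption_budget = {
    "apiVersion": "policy/v1",
    "kind": "PodDisruptionBudget",
    "metadata": {"name": "db-pdb"},
    # Voluntary disruptions (such as node drains) may take at most one
    # database pod down at a time.
    "spec": {"maxUnavailable": 1, "selector": {"matchLabels": LABELS}},
}

# Belongs under spec.template.spec.affinity in the StatefulSet.
# A hard zone-level rule needs at least as many zones as replicas.
anti_affinity = {
    "podAntiAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": [
            {
                "labelSelector": {"matchLabels": LABELS},
                "topologyKey": "topology.kubernetes.io/zone",
            }
        ]
    }
}

manifests = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [storage_class, pod_disruption_budget],
}
print(json.dumps(manifests, indent=2))
```

With `Retain`, a released volume is not reused or deleted automatically, so cleaning up old volumes becomes a manual step. That is the intended trade: a little operational chore in exchange for never losing data to a deleted claim.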

The honest take: a managed database (RDS, Cloud SQL, Aurora) hands you tested failover, automated backups, and PITR for a price. Run databases on Kubernetes when you have a real reason (multi-cloud portability, scale, or cost at volume) and the operational maturity to back it, not by default.

Common mistakes that cost hours (and data)

1. Untested failover. The automation looks fine in the console but references a security group, IAM role, or DNS record that quietly drifted away months ago. You discover this at 3am during the only failover that matters. Run game days.
2. Treating replication as a backup. Replicas copy DROP TABLE and bad migrations instantly and faithfully. Without an independent PITR backup, a single bad command erases every copy you have.
3. Ignoring replication lag. Lag is your data-loss window made visible. A replica that's secretly 4 minutes behind turns a "zero-loss" failover into a four-minute hole. Alert on lag against your RPO.
4. No PITR, only nightly snapshots. A snapshot loses everything since the last snapshot. If a corruption lands at 4pm and your last snapshot was midnight, you've lost 16 hours. WAL archiving lets you rewind to the second before the damage.
5. No fencing, so split-brain on every partition. A promote-on-failure script with no quorum or fencing will eventually run two primaries during a network blip and silently corrupt data, which is worse than the outage it was meant to prevent.
6. Running stateful pods like stateless ones. Deployments instead of StatefulSets, no PDB, Delete reclaim policy, no anti-affinity: each one turns a routine Kubernetes operation into a data incident.

Takeaways

The whole article in nine lines

Stateless tiers self-heal; stateful tiers must be promoted correctly, once, to the right node.
Replication is for availability (a node died). Backups + PITR are for durability (the truth is wrong).
Replication is not a backup: replicas faithfully copy your catastrophic mistakes.
Define RPO (data you can lose) and RTO (time you can be down) first; everything else is machinery to hit them.
Replica lag is your data-loss window. Alert on it against your RPO.
Sync replication = ~0 RPO at the cost of write latency; async = lower latency, a loss window.
Split-brain (two primaries) is worse than an outage; fencing and quorum prevent it.
An untested backup or failover is a hope, not a plan. Drill both on a schedule.
On Kubernetes, use StatefulSets, an Operator, PDBs, and Retain storage, or use a managed DB.

Where to go next

Stateful reliability sits at the intersection of distributed systems, transactions, and disaster recovery. These pair naturally with this article:

Database Transactions & Consistency: what "committed" really means, isolation levels, and why consistency is the thing failover threatens.
Disaster Recovery: RTO, RPO & DR Strategies: turning the RPO/RTO numbers in this article into a full DR strategy and runbook.
Reliability & Resilience: Designing for Failure: the broader patterns (redundancy, blast radius, graceful degradation) that this is the stateful chapter of.
Follow the full SRE career path to build these into a coherent operational practice.

Check your understanding

1. Why does the article say stateless tiers heal themselves but stateful tiers do not?

2. How do replication and backup differ in what they protect against?

Frequently asked questions

Why do stateless services recover from a zone failure but databases do not?

A stateless service has no memory, so any copy is as good as any other and the orchestrator just reschedules it. A database is the memory, so you cannot spin up another one: it has to be the right one, with the right data, promoted in the right order.

Are replication and backups interchangeable?

No, they are two completely separate safety systems that protect against different things, and conflating them is the root of most data disasters. Replication keeps a live copy for failover, while backups and point-in-time recovery protect against data loss and corruption.

Why did a replica fail to take over during an outage?

In the article's example the replica was ready but nothing promoted it, because the failover had not been tested in eight months and the automation referenced a security group that no longer existed. Untested failover is how a few seconds of downtime becomes 40 minutes of manual promotion.

What is split-brain in a stateful system?

Split-brain is when two nodes both believe they are the primary, so both accept writes and the copies of data disagree. It is one version of the central stateful failure mode: the question of which copy is authoritative being answered wrong.
