Designing a Real-Time Chat System at Scale

A comprehensive deep-dive into designing a production-grade real-time chat system capable of handling billions of messages daily across platforms like WhatsApp, Slack, Discord, and Facebook Messenger. Covers WebSocket connection management and lifecycle, message flow from sender through gateway and storage to receiver, message storage schema design with time-series optimization using Cassandra and MySQL, group chat fan-out architecture for small groups versus large channels, online presence and typing indicator services with heartbeat protocols, delivery semantics including sent-delivered-read receipt tracking with at-least-once guarantees, and scaling strategies including sharding by user and conversation, cross-region replication, message sync protocols, and end-to-end encryption considerations. Chat systems are among the most frequently asked system design questions at FAANG companies because they test real-time communication, persistent storage, delivery guarantees, presence management, and horizontal scaling — all in a single 45-minute session.

🎯 Key Takeaways
The message flow follows a write-ahead pattern: persist the message to durable storage BEFORE attempting delivery. This guarantees zero message loss even if the delivery pipeline fails. The sender receives a "sent" acknowledgment as soon as storage confirms the write, and a "delivered" acknowledgment when the recipient device sends a client-level ACK.
WebSocket is the primary transport for all production chat systems, providing sub-10ms frame delivery with minimal overhead. Long Polling serves as an automatic fallback for restricted networks. The connection server fleet is the most stateful component — each server maintains tens of thousands of persistent TCP connections with heartbeat-based liveness monitoring.
Message storage must be partitioned by conversation_id (not user_id) because the primary access pattern is loading recent messages for a conversation. Use Snowflake-style message IDs as the clustering key in descending order to make "get last 50 messages" a single-partition sequential read. Time-based bucketing prevents unbounded partition growth in active conversations.
Group chat fan-out strategy must vary by group size: direct fan-out for small groups (up to 200 members) where messages are delivered individually to each member, and pub/sub fan-out for large channels (200+ members) where messages are published to a topic and each connection server fans out locally. This reduces fan-out cost from O(members) to O(connection_servers).
Presence, typing indicators, and read receipts are secondary features that generate surprisingly high traffic volumes. Implement them on separate infrastructure from the message delivery pipeline (bulkhead pattern) so failures in presence do not affect message delivery. Use lazy presence (only push updates to users actively viewing the relevant UI), throttle typing events to one every 3-5 seconds, and aggregate read receipts using watermark-based tracking rather than per-message events.


Why Chat Systems Are Among the Hardest Designs

Chat systems are deceptively simple on the surface — users send text messages to each other, and messages appear on screens. But underneath that simplicity lies one of the most demanding distributed systems problems in all of software engineering. A chat system must simultaneously guarantee real-time delivery (sub-second latency), persistent storage (messages must never be lost), ordering guarantees (messages appear in the correct sequence), presence tracking (who is online right now), and delivery confirmation (was the message received and read). No other system design question requires you to solve all five of these constraints at once.

Consider the scale of production chat systems. WhatsApp handles over 100 billion messages per day from 2 billion monthly active users. Slack serves over 30 million daily active users across millions of workspaces with rich threading, search, and file sharing. Discord supports servers with hundreds of thousands of concurrent users in a single channel, all receiving messages in real time. Facebook Messenger processes 100 billion messages per day across 1.3 billion monthly users. Each of these platforms has made fundamentally different architectural choices, yet they all grapple with the same core engineering challenges.

Why Interviewers Love This Question

A chat system touches every distributed systems concept: real-time networking (WebSockets, long polling, heartbeats), persistent storage (message tables, time-series partitioning), delivery guarantees (at-least-once semantics, idempotency, client ACKs), presence management (heartbeat protocols, pub/sub), group messaging (fan-out, membership services, ordering), offline handling (queuing, push notifications), and horizontal scaling (sharding, replication, sync protocols). It can be discussed at L4 depth (basic WebSocket + message store) or L7 depth (cross-region replication with causal ordering and end-to-end encryption). This range makes it one of the most versatile system design questions at FAANG companies.

The fundamental challenge is that chat combines two conflicting paradigms: real-time streaming and durable persistence. A real-time system optimizes for latency — get the message to the recipient as fast as possible, ideally under 100 milliseconds. A durable storage system optimizes for reliability — never lose a message, even if servers crash. In most systems, you pick one paradigm or the other. Chat demands both simultaneously, and the interaction between them creates cascading complexity.

Scale numbers across major chat platforms

  • WhatsApp — 2B+ MAU, 100B+ messages/day, supports 1-on-1 and groups up to 1024 members. End-to-end encrypted. Runs on remarkably few servers due to Erlang/BEAM efficiency — famously handled 900M users with only 50 engineers.
  • Slack — 30M+ DAU across millions of workspaces, 1.5B+ messages sent per week. Supports channels with thousands of members, rich threading, file sharing, search across all messages, and extensive API/bot integration.
  • Discord — 200M+ MAU, supports servers with 800K+ members. Single channels can have tens of thousands of concurrent viewers. Real-time voice, video, and text in the same infrastructure. Message history is persistent and searchable.
  • Facebook Messenger — 1.3B+ MAU, 100B+ messages/day including text, images, video, payments, and AI chatbots. Supports both 1-on-1 and group conversations with up to 250 members. Deeply integrated with Facebook and Instagram social graphs.

Scale Numbers to Remember for Interviews

A chat system at WhatsApp scale processes 100B messages/day, which is roughly 1.15M messages/second average and 5-10M messages/second at peak. Each message is approximately 100 bytes (text) to 100KB (with media metadata). The system maintains 500M+ concurrent WebSocket connections globally. Message delivery latency target: p99 under 500ms for online recipients. Offline message queue must handle storing messages for users who are offline for up to 30 days. These numbers drive every architectural decision from connection server fleet size to storage partitioning strategy.

The Interview Opening

When asked to design a chat system, strong candidates immediately clarify two things: (1) Is this 1-on-1 messaging, group chat, or both? This determines the fan-out strategy. (2) What are the delivery guarantees — is it acceptable to occasionally duplicate a message, or must delivery be exactly once? Start by saying: "The central architectural decisions are the transport mechanism for real-time delivery, the message storage model for persistence, and the delivery guarantee semantics. Let me walk through each." This shows the interviewer you understand the problem is deeper than just WebSockets.

This entry walks through the complete design from requirements to production architecture. We cover connection management with WebSockets, the end-to-end message flow from sender to recipient, message storage schema design, group chat fan-out architecture, presence and typing indicators, delivery semantics with receipt tracking, and scaling strategies for billions of daily messages. Each section is written at the depth expected in a FAANG L5-L6 system design interview.

Requirements & Scale Estimation

Before drawing any architecture diagrams, a strong system design answer begins with clarifying requirements and running back-of-the-envelope calculations. For a chat system, the functional requirements define the messaging capabilities, while the non-functional requirements establish the latency, durability, and scale targets that drive every architectural decision.

Functional requirements

  • 1-on-1 messaging — Users can send text messages to other users in real time. Messages are persistent and available on any device the user logs into. Messages support text, emoji, and references to media (images, files, video).
  • Group chat — Users can create group conversations with multiple participants. Messages sent to a group are delivered to all members. Groups range from small (2-10 members, like family chats) to large (1000+ members, like community channels).
  • Online presence — Users can see whether their contacts are currently online, offline, or idle. Presence updates propagate within a few seconds of a user changing state.
  • Read receipts — Senders can see when their message has been delivered to the recipient device and when the recipient has actually read it. In group chats, the sender can see delivery and read counts.
  • Message history & sync — Users can scroll back through past messages. When a user logs in on a new device, messages sync from the server. Offline users receive all missed messages when they reconnect.
  • Typing indicators — When a user is typing in a conversation, other participants see a real-time typing indicator. This is ephemeral — never persisted, only pushed to active participants.

Non-functional requirements

  • Latency — Message delivery under 500ms at p99 for online recipients on the same continent. Cross-region delivery under 1 second. Typing indicators under 300ms. Presence updates within 5-10 seconds of actual state change.
  • Durability — Zero message loss. Once a message is accepted by the server, it must be persisted and eventually delivered. Messages must survive server crashes, data center failures, and network partitions.
  • Availability — 99.99% uptime for message sending and receiving. Graceful degradation: if presence service is down, messaging still works. If read receipts fail, messages still deliver.
  • Ordering — Messages within a single conversation must appear in a consistent order for all participants. Causal ordering: if user A sends message M1, then sees message M2 from user B, and then sends M3, all participants must see M1 before M3.
  • Scale — Support 500M DAU, 50B messages/day, 200M concurrent WebSocket connections. Handle burst loads during events (New Year messages, breaking news) with 5-10x traffic spikes.

Now let us run the estimation math. These numbers determine whether a single server, a handful of servers, or a massive distributed fleet is required for each component.

  • Average messages/sec: 50B messages / 86,400 seconds ≈ 578K msg/sec
  • Peak messages/sec (5x): 578K × 5 ≈ 2.9M msg/sec
  • Average message size: ~100 bytes of text, ~500 bytes with metadata → assume 500 bytes
  • Daily storage (text only): 50B × 500 bytes ≈ 25 TB/day
  • Annual storage (text only): 25 TB × 365 ≈ 9.1 PB/year
  • Concurrent connections: 500M DAU × 40% concurrent ≈ 200M connections
  • Connection servers needed (at 100K connections/server): 200M / 100K ≈ 2,000 servers
  • Peak inbound bandwidth: 2.9M msg/sec × 500 bytes ≈ 1.45 GB/sec

Estimation Tips for the Interview

Always show the interviewer your math explicitly. State your assumptions (500M DAU, 100 messages per user per day, 500 bytes per message) and derive each number. Interviewers care more about your reasoning process than the exact number. Round aggressively — 86,400 seconds per day can be rounded to 100K for mental math. The key insight: 50B messages/day means you need a distributed message store (single-server databases top out at thousands of writes/sec), and 200M concurrent connections means you need a massive fleet of connection servers (each handling 50-100K connections).
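The estimation table above is easy to sanity-check; a quick Python sketch of the same back-of-the-envelope arithmetic (the constants are the assumptions stated in this section, not measured values):

```python
# Back-of-the-envelope capacity estimation for the chat system above.
DAU = 500_000_000
MESSAGES_PER_DAY = 50_000_000_000
AVG_MESSAGE_BYTES = 500          # text plus metadata (assumed)
PEAK_MULTIPLIER = 5
CONCURRENCY_RATIO = 0.40         # fraction of DAU connected at once (assumed)
CONNS_PER_SERVER = 100_000

avg_msg_per_sec = MESSAGES_PER_DAY / 86_400
peak_msg_per_sec = avg_msg_per_sec * PEAK_MULTIPLIER
daily_storage_tb = MESSAGES_PER_DAY * AVG_MESSAGE_BYTES / 1e12
annual_storage_pb = daily_storage_tb * 365 / 1000
concurrent_conns = DAU * CONCURRENCY_RATIO
connection_servers = concurrent_conns / CONNS_PER_SERVER
peak_inbound_gb_per_sec = peak_msg_per_sec * AVG_MESSAGE_BYTES / 1e9

print(f"avg msg/sec:        {avg_msg_per_sec:,.0f}")      # ~578,704
print(f"peak msg/sec:       {peak_msg_per_sec:,.0f}")     # ~2.9M
print(f"daily storage:      {daily_storage_tb:.0f} TB")   # 25 TB
print(f"annual storage:     {annual_storage_pb:.1f} PB")  # 9.1 PB
print(f"connection servers: {connection_servers:,.0f}")   # 2,000
print(f"peak inbound:       {peak_inbound_gb_per_sec:.2f} GB/sec")
```

Changing one assumption (say, 100 bytes per message instead of 500) immediately reprices the storage and bandwidth lines, which is exactly the kind of sensitivity check interviewers like to see.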

Do Not Forget Media Messages

Text messages are small (100-500 bytes), but chat systems also handle images (100KB-5MB), videos (1-100MB), and files (up to 100MB+). Media is never sent through the message pipeline. Instead, the client uploads media to object storage (S3) and sends a message containing the media URL. The recipient downloads the media separately. This keeps the real-time message pipeline lightweight and prevents large files from blocking text message delivery.

These estimation numbers reveal the architecture. We need a fleet of 2,000+ stateful WebSocket connection servers, a distributed message store capable of sustained millions of writes per second, a message routing layer that can look up which connection server each recipient is connected to, and an offline queue for users who are not currently connected. Each of these components has its own scaling challenges, which we address in the following sections.

Connection Management — WebSocket vs Long Polling

The transport layer is the foundation of a real-time chat system. Unlike traditional HTTP request-response APIs, chat requires the server to push messages to clients at any time without the client asking. This demands a persistent, bidirectional communication channel between every online client and the server infrastructure. The choice of transport protocol has profound implications for server resource consumption, latency, battery life on mobile devices, and operational complexity.

WebSocket is the dominant transport for production chat systems. A WebSocket connection starts as an HTTP request (the handshake) and then upgrades to a persistent, full-duplex TCP connection. Once established, both the client and server can send frames at any time with minimal overhead — just 2-14 bytes of framing per message, compared to hundreds of bytes of HTTP headers for each request. WhatsApp, Slack, Discord, and Messenger all use WebSockets as their primary transport.

graph TD
    A[Client] -->|"1. HTTP Upgrade Request"| B[Load Balancer L4/L7]
    B -->|"2. Route to Connection Server"| C[Connection Server]
    C -->|"3. Upgrade to WebSocket"| A
    C -->|"4. Register connection"| D[Connection Registry / Redis]
    A <-->|"5. Bidirectional messages"| C
    C -->|"6. Heartbeat every 30s"| A
    A -->|"7. Pong response"| C
    C -->|"8. No pong in 90s = disconnect"| E[Cleanup: Remove from Registry]

    style A fill:#e1f5fe
    style C fill:#f3e5f5
    style D fill:#fff3e0

WebSocket connection lifecycle: from HTTP upgrade through heartbeat maintenance to cleanup on disconnect

The connection server fleet is the most stateful component in the entire chat architecture. Each connection server maintains tens of thousands of open TCP connections, each representing an active user session. When a connection server receives a message destined for a user connected to that server, it writes the message frame directly to the appropriate TCP socket. This statefulness creates operational challenges: you cannot simply restart a connection server without disconnecting all its users, and load balancing must be sticky (route the same client to the same server for reconnections).

Connection server responsibilities

  • Connection lifecycle — Accept WebSocket upgrade requests, authenticate the user (validate JWT or session token during handshake), register the connection in the connection registry (Redis), and clean up on disconnect.
  • Heartbeat management — Send WebSocket ping frames every 30 seconds. If the client does not respond with a pong within 90 seconds, consider the connection dead and clean up. This detects silent TCP disconnections (mobile network switches, laptop sleep).
  • Message routing — Receive inbound messages from clients and forward them to the message service. Receive outbound messages from the message service and push them to the appropriate WebSocket connection.
  • Presence reporting — Report connection and disconnection events to the presence service. The presence service uses these events plus heartbeats to determine online/offline status.
  • Graceful shutdown — During deployments or scaling events, drain connections gradually: stop accepting new connections, notify connected clients to reconnect to another server, wait for clients to migrate, then shut down.
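The heartbeat responsibility above boils down to timeout bookkeeping. A minimal sketch of that logic, assuming the WebSocket send/receive plumbing exists elsewhere (the class and method names here are illustrative, not from any specific library):

```python
import time

# Heartbeat liveness tracking as described above: ping every 30s, declare
# a connection dead if no pong arrives within 90s. Only the timeout
# bookkeeping is shown; actual frame I/O is assumed to live elsewhere.
PING_INTERVAL = 30.0
PONG_TIMEOUT = 90.0

class ConnectionLiveness:
    def __init__(self):
        self.last_ping = {}   # connection_id -> timestamp of last ping sent
        self.last_pong = {}   # connection_id -> timestamp of last pong seen

    def register(self, conn_id, now=None):
        now = time.time() if now is None else now
        self.last_ping[conn_id] = now
        self.last_pong[conn_id] = now

    def on_pong(self, conn_id, now=None):
        self.last_pong[conn_id] = time.time() if now is None else now

    def connections_to_ping(self, now):
        # Connections whose last ping is older than the ping interval.
        return [c for c, t in self.last_ping.items() if now - t >= PING_INTERVAL]

    def dead_connections(self, now):
        # Silent TCP drops (mobile network switch, laptop sleep) surface here:
        # the socket looks open, but no pong has arrived for 90+ seconds.
        return [c for c, t in self.last_pong.items() if now - t >= PONG_TIMEOUT]
```

A periodic sweep calls `connections_to_ping` to send pings and `dead_connections` to close sockets and remove registry entries.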

Connection Registry Design

The connection registry maps user_id to the connection_server_id handling that user. This is stored in Redis for fast lookup: when the message service needs to deliver a message to user X, it queries the registry to find which connection server holds user X's socket, then routes the message to that server. Use Redis with replication for high availability. The registry must handle 200M+ entries at WhatsApp scale. Each entry is small (user_id: 8 bytes, server_id: 4 bytes, metadata: 20 bytes), totaling roughly 6.4 GB — easily fits in a single Redis cluster.
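A sketch of the registry semantics described above. In production this would be Redis `SET` with a TTL so entries self-expire if a connection server dies without cleaning up; a plain dict stands in for Redis here so the logic is runnable on its own:

```python
import time

# In-memory stand-in for the Redis connection registry described above:
# user_id -> (connection_server_id, expiry). The TTL models Redis SET EX,
# so stale entries disappear if a connection server crashes uncleanly.
class ConnectionRegistry:
    def __init__(self):
        self._entries = {}

    def register(self, user_id, server_id, ttl=120.0, now=None):
        # TTL slightly above the heartbeat window; refreshed on each heartbeat.
        now = time.time() if now is None else now
        self._entries[user_id] = (server_id, now + ttl)

    def lookup(self, user_id, now=None):
        # Returns the connection server holding this user's socket, or None
        # if the user is offline (caller routes to the offline queue instead).
        now = time.time() if now is None else now
        entry = self._entries.get(user_id)
        if entry is None or entry[1] <= now:
            return None
        return entry[0]

    def unregister(self, user_id, server_id):
        # Delete only if we still own the entry, so a stale disconnect event
        # cannot evict a newer connection on a different server.
        entry = self._entries.get(user_id)
        if entry is not None and entry[0] == server_id:
            del self._entries[user_id]
```

With Redis, the conditional delete in `unregister` would be a small Lua script to make the check-and-delete atomic.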

Long polling is the fallback transport for environments where WebSockets are blocked (some corporate firewalls, older proxy servers). In long polling, the client sends an HTTP request that the server holds open for up to 30 seconds. If a message arrives during that window, the server responds immediately with the message. If no message arrives, the server responds with an empty body after the timeout, and the client immediately sends a new request. The disadvantages are higher latency (messages wait for the next poll cycle), higher server resource usage (held-open HTTP connections consume more memory than WebSocket frames), and higher battery drain on mobile (frequent HTTP round trips).
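The server side of the hold-open behavior can be sketched with a blocking queue, where a `queue.Queue` per user stands in for the real per-connection mailbox (the handler shape and status codes are illustrative, not a specific framework's API):

```python
import queue

# Sketch of the server side of long polling: hold the request open for up
# to 30 seconds, respond immediately if a message arrives, else return an
# empty response so the client immediately re-polls.
POLL_TIMEOUT = 30.0

def handle_long_poll(mailbox: queue.Queue, timeout: float = POLL_TIMEOUT):
    try:
        # Blocks until a message arrives or the hold window expires.
        message = mailbox.get(timeout=timeout)
        return {"status": 200, "messages": [message]}
    except queue.Empty:
        # Nothing arrived: empty body; client sends a new request at once.
        return {"status": 204, "messages": []}
```

The same mailbox abstraction lets one server process serve both transports: a WebSocket handler drains the mailbox continuously, while a long-poll handler drains it one hold window at a time.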

Interview Tip: Always Mention the Fallback

When discussing WebSockets in an interview, always mention the long-polling fallback. Real production systems cannot assume WebSockets work everywhere. Corporate firewalls, certain mobile carriers, and legacy proxy servers may block WebSocket upgrades. A robust chat system detects WebSocket failure during the initial handshake and transparently falls back to long polling. Slack does exactly this — their client library attempts WebSocket first and degrades to long polling if the upgrade fails.

Reconnection logic is critical for mobile clients. Mobile devices constantly switch between WiFi and cellular networks, enter tunnels, or go through periods of poor connectivity. The client must detect disconnection (via heartbeat failure or TCP reset), exponentially back off reconnection attempts (1s, 2s, 4s, 8s, capped at 30s), and upon reconnection, sync any messages missed during the disconnect window. The sync protocol sends the client's last-seen message ID, and the server responds with all messages after that ID. This is the foundation of the message sync protocol discussed in Section 9.

Common Mistake: Ignoring Connection Storms

When a data center or connection server fleet experiences a brief outage, all affected clients disconnect simultaneously. When the outage resolves, all clients attempt to reconnect at once, creating a "thundering herd" or connection storm. Without mitigation, this reconnection flood can overwhelm the connection servers and cause a cascading failure loop. Prevention: implement jittered exponential backoff on the client (random delay between 0 and backoff_interval), and implement connection rate limiting on the server (accept at most N new connections per second per server, returning a "retry later" response to excess attempts).
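The client-side mitigation above (exponential backoff with full jitter) fits in a few lines. A minimal sketch using the 1s base and 30s cap stated earlier:

```python
import random

# Jittered exponential backoff for client reconnects: the backoff ceiling
# doubles each attempt (1s, 2s, 4s, 8s, ...) capped at 30s, and the actual
# delay is drawn uniformly from [0, ceiling] ("full jitter") so a fleet of
# clients that disconnected together does not reconnect in lockstep.
BASE_DELAY = 1.0
MAX_DELAY = 30.0

def reconnect_delay(attempt: int) -> float:
    ceiling = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
    return random.uniform(0, ceiling)
```

Attempt 0 yields a delay of up to 1s, attempt 3 up to 8s, and attempt 5 onward is capped at 30s; pairing this with the server-side connection rate limit spreads a reconnection storm over tens of seconds instead of one.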

Message Flow — Send, Store, Deliver

The end-to-end message flow is the heart of any chat system. When a user types a message and taps send, that message must traverse a pipeline of services, be persisted durably, and be delivered to the intended recipient — all within a few hundred milliseconds. Understanding this flow in detail is critical for a system design interview because it reveals the interaction between stateless services, stateful connection servers, persistent storage, and asynchronous delivery mechanisms.

Let us trace the complete lifecycle of a single 1-on-1 message from sender to recipient. This is the critical path that must be optimized for latency.

sequenceDiagram
    participant S as Sender Client
    participant CS1 as Connection Server A
    participant GW as API Gateway
    participant MS as Message Service
    participant DB as Message Store
    participant MQ as Message Queue
    participant CS2 as Connection Server B
    participant R as Recipient Client

    S->>CS1: 1. Send message via WebSocket
    CS1->>GW: 2. Forward to API Gateway
    GW->>MS: 3. Route to Message Service
    MS->>DB: 4. Persist message (write-ahead)
    DB-->>MS: 5. Write acknowledged
    MS-->>CS1: 6. ACK to sender (message stored)
    CS1-->>S: 7. Show sent checkmark
    MS->>MQ: 8. Publish delivery event
    MQ->>CS2: 9. Route to recipient connection server
    CS2->>R: 10. Push message via WebSocket
    R-->>CS2: 11. Client ACK (delivered)
    CS2-->>MS: 12. Update status: delivered
    MS-->>CS1: 13. Notify sender
    CS1-->>S: 14. Show delivered checkmark

Complete message lifecycle from send to delivery confirmation in a 1-on-1 chat

Step by step, here is what happens. The sender client writes the message to its local WebSocket connection (step 1). The connection server receives the WebSocket frame and forwards it to the API gateway (step 2), which routes to the message service (step 3). The message service assigns a globally unique message ID (using a Snowflake-style ID generator for chronological sortability), validates the message (sender is a participant in this conversation, message size is within limits), and writes the message to the durable message store (step 4). Once the write is acknowledged by the store (step 5), the message service sends an acknowledgment back to the sender via the connection server (steps 6-7). The sender sees a single checkmark indicating the message has been stored safely.

Message ID Generation with Snowflake IDs

Use a Snowflake-style ID generator for message_id: 41 bits for timestamp (69 years), 10 bits for machine ID (1024 machines), 12 bits for sequence number (4096 IDs per millisecond per machine). This produces message IDs that are globally unique without coordination, chronologically sortable (critical for message ordering), and compact (64-bit integer). The timestamp component means messages can be sorted by ID alone, eliminating the need for a separate created_at index. WhatsApp and Discord both use Snowflake-style message IDs.
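A runnable sketch of the 41/10/12 bit layout described above (the epoch constant is Twitter's original Snowflake epoch, used here purely for illustration):

```python
import threading
import time

# Snowflake-style 64-bit ID generator: 41 bits of milliseconds since a
# custom epoch, 10 bits of machine ID, 12 bits of per-millisecond sequence.
EPOCH_MS = 1_288_834_974_657   # arbitrary custom epoch (Twitter's, for illustration)

class SnowflakeGenerator:
    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024      # must fit in 10 bits
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now_ms = int(time.time() * 1000)
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF   # 12-bit sequence
                if self.sequence == 0:
                    # 4096 IDs already issued this millisecond: wait for the
                    # clock to advance before issuing more.
                    while now_ms <= self.last_ms:
                        now_ms = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now_ms
            # Timestamp in the high bits makes IDs sort chronologically.
            return ((now_ms - EPOCH_MS) << 22) | (self.machine_id << 12) | self.sequence
```

Because the timestamp occupies the high bits, sorting by `message_id` is sorting by send time, which is why the clustering key in the storage section needs no separate timestamp index.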

After persisting, the message service publishes a delivery event to a message queue or event bus (step 8). A delivery worker consumes this event, looks up the recipient in the connection registry to find which connection server holds their WebSocket, and routes the message to that connection server (step 9). The connection server pushes the message to the recipient via their WebSocket connection (step 10). The recipient client receives the message, displays it, and sends a client-level ACK back through the WebSocket (step 11). This ACK propagates back to the message service (step 12), which updates the message status to "delivered" and notifies the sender (steps 13-14). The sender sees a double checkmark.

What happens when the recipient is offline

  1. The delivery worker looks up the recipient in the connection registry and finds no active connection.

  2. The message is placed in an offline queue (a per-user queue in Redis or a persistent queue in Kafka, keyed by recipient_id).

  3. A push notification is sent to the recipient device via APNs (iOS) or FCM (Android) with a truncated message preview.

  4. When the recipient comes online and establishes a WebSocket connection, the connection server triggers a sync: it reads all queued messages from the offline queue and pushes them to the client.

  5. The client processes each message, displays them in order, and sends ACKs for each one. The offline queue is drained.

  6. If the recipient does not come online within 30 days, the offline queue entries may expire, but the messages remain in the persistent message store and will be synced when the user eventually reconnects.

The Write-Ahead Pattern Is Non-Negotiable

The message must be written to durable storage BEFORE the delivery attempt. If you deliver first and then try to store, a crash between delivery and storage means the message is lost from the server perspective. The recipient has it, but the sender and the server do not. This violates durability. Always persist first (write-ahead), then attempt delivery. If delivery fails (recipient offline), the message is safe in storage and can be delivered later. This is the write-ahead log pattern applied to messaging.
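The ordering constraint above (persist, ACK, then deliver or queue) can be captured in a short sketch. All services here are in-memory stand-ins for the real store, registry, and queue, so the control flow is runnable on its own:

```python
# Runnable sketch of the write-ahead send path: persist the message BEFORE
# attempting delivery, so a crash in the delivery pipeline never loses data.
class MessageStore:
    def __init__(self):
        self.rows = []
    def append(self, msg):
        self.rows.append(msg)          # a durable, replicated write in production

class OfflineQueue:
    def __init__(self):
        self.queues = {}
    def enqueue(self, user_id, msg):
        self.queues.setdefault(user_id, []).append(msg)

def send_message(msg, store, registry, offline_queue, delivered):
    store.append(msg)                           # 1. write-ahead persist
    status = "sent"                             # 2. sender gets the "sent" ACK now
    server = registry.get(msg["to"])            # 3. connection registry lookup
    if server is not None:
        delivered.append((server, msg))         # 4. push over the recipient's WebSocket
        status = "delivered"
    else:
        offline_queue.enqueue(msg["to"], msg)   # 5. queue for reconnect sync
    return status
```

Note that the store write happens unconditionally: whether the recipient is online or offline only changes the delivery branch, never the durability of the message.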

The entire critical path from send to delivered confirmation has a latency budget of roughly 300-500ms. The breakdown is approximately: client to connection server (10-30ms network), connection server to message service (1-5ms internal), message service to database write (5-20ms depending on storage), message service to queue publish (1-5ms), queue to delivery worker (1-5ms), delivery worker to recipient connection server (1-5ms), and connection server to recipient client (10-30ms network). The dominant factor is the two network hops to/from the clients, especially on mobile networks with higher latency.

Interview Signal: Discuss Idempotency

Mobile networks are unreliable. A client may send a message, lose connectivity before receiving the server ACK, and then resend the same message after reconnecting. Without idempotency, this creates duplicate messages. Solution: the client generates a client-side unique message ID (UUID) before sending. The server uses this client_message_id as an idempotency key — if a message with the same client_message_id already exists, the server returns the existing message instead of creating a duplicate. Mention this in your interview to show you understand real-world reliability challenges.
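The dedup logic above fits in a small sketch. An in-memory map stands in for what would be a unique index on client_message_id in the real message store (the class shape is illustrative):

```python
import uuid

# Idempotent message insert keyed by the client-generated UUID: a resend
# after a lost ACK returns the original stored message instead of creating
# a duplicate row.
class MessageService:
    def __init__(self):
        self.by_client_id = {}      # client_message_id -> stored message
        self.next_server_id = 1     # stand-in for a Snowflake ID generator

    def insert(self, client_message_id, conversation_id, text):
        existing = self.by_client_id.get(client_message_id)
        if existing is not None:
            return existing          # duplicate send: return the original
        msg = {
            "message_id": self.next_server_id,
            "client_message_id": client_message_id,
            "conversation_id": conversation_id,
            "text": text,
        }
        self.next_server_id += 1
        self.by_client_id[client_message_id] = msg
        return msg
```

In production the lookup and insert must be atomic, which is what a unique constraint on client_message_id (or a Cassandra lightweight transaction) provides for free.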

Message Storage Design

Message storage is the most write-intensive component of a chat system. At WhatsApp scale, the storage layer must handle 1-3 million writes per second sustained, with daily ingestion of 25+ TB of new message data. The choice of storage technology, table schema, partition strategy, and index design determines whether the system can scale to billions of daily messages or collapses under write pressure.

The message table schema must support two primary access patterns efficiently: (1) insert a new message for a conversation, and (2) retrieve the most recent N messages for a conversation (the "load chat history" query). Both operations must be fast — writes in single-digit milliseconds and reads returning the last 50 messages in under 10ms.

  • message_id (BIGINT, Snowflake ID): globally unique, chronologically sortable primary key
  • conversation_id (BIGINT): partition key; all messages in a conversation are co-located
  • sender_id (BIGINT): the user who sent this message
  • message_type (TINYINT): Text=1, Image=2, Video=3, File=4, System=5
  • content (TEXT/BLOB): message body for text, or media URL for media messages
  • status (TINYINT): Sent=1, Delivered=2, Read=3
  • created_at (TIMESTAMP): server-side timestamp (also embedded in the Snowflake ID)
  • client_message_id (UUID): client-generated idempotency key for deduplication

The critical insight for message storage is the partition key. Messages must be partitioned by conversation_id, not by user_id. The reason: the most frequent query is "get the last 50 messages in conversation X." If messages are partitioned by conversation_id, all messages for a conversation are stored on the same partition (same node or same SSTable in Cassandra), and the query is a simple sequential read of the last 50 rows. If messages were partitioned by user_id, loading a conversation would require reading from multiple partitions (one for each participant), then merging and sorting — much slower.

Partition Key Strategy

Use conversation_id as the partition key and message_id (Snowflake) as the clustering key with descending order. In Cassandra: PRIMARY KEY (conversation_id, message_id) WITH CLUSTERING ORDER BY (message_id DESC). This means "get the last 50 messages" is a single-partition query that reads the first 50 rows — the absolute fastest query pattern in any distributed database. In MySQL/PostgreSQL, the equivalent is an index on (conversation_id, message_id DESC) with the query: SELECT * FROM messages WHERE conversation_id = ? ORDER BY message_id DESC LIMIT 50.
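The relational variant can be demonstrated end to end with SQLite as an illustrative stand-in for sharded MySQL/PostgreSQL; the table and index names here are invented for the sketch.

```python
import sqlite3

# Illustrative only: SQLite standing in for a sharded MySQL/PostgreSQL store.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE messages (
        message_id      INTEGER,
        conversation_id INTEGER,
        content         TEXT,
        PRIMARY KEY (conversation_id, message_id)
    )
""")
# Composite index so "last 50 in conversation X" is an index-order scan.
db.execute("CREATE INDEX idx_conv_msg ON messages (conversation_id, message_id DESC)")

# Insert 200 messages split across two conversations; message_id is
# chronologically sortable, like a Snowflake ID.
db.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [(i, 1 if i % 2 else 2, f"msg {i}") for i in range(1, 201)],
)

rows = db.execute(
    "SELECT message_id FROM messages "
    "WHERE conversation_id = ? ORDER BY message_id DESC LIMIT 50",
    (1,),
).fetchall()

assert len(rows) == 50          # exactly one page of history
assert rows[0][0] == 199        # newest message first
```

Because the index order matches the query's ORDER BY, the database reads 50 contiguous index entries and stops, regardless of how large the conversation grows.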

Choosing between Cassandra and MySQL for message storage is a critical architectural decision that depends on scale, operational expertise, and access patterns.

| Criteria | Cassandra | MySQL (with sharding) |
| --- | --- | --- |
| Write throughput | Millions of writes/sec with linear horizontal scaling. Writes go to any node. | Thousands of writes/sec per shard. Requires application-level sharding to scale beyond a single node. |
| Read pattern | Excellent for partition-key lookups; single-partition reads are consistently low-latency. | Excellent with proper indexes. Complex queries are easier with SQL joins. |
| Consistency | Tunable: ONE, QUORUM, ALL. Typically LOCAL_QUORUM for chat messages. | Strong consistency per shard. Cross-shard transactions require 2PC or Saga. |
| Operational complexity | Higher — requires understanding of compaction, repair, tombstones, and the gossip protocol. | Lower — well understood operationally, with a massive ecosystem of tools and expertise. |
| Data model flexibility | Denormalized, query-driven design. Schema changes require new tables. | Normalized relational model. Schema changes via ALTER TABLE (can be slow on large tables). |
| Production users | Discord (trillions of messages), Netflix, Apple. | Facebook Messenger (with custom sharding), Slack (with Vitess). |

Discord's Cassandra Journey

Discord stores trillions of messages in Cassandra. They partition by (channel_id, bucket) where bucket is a time-based bucket (one bucket per 10 days for active channels). The clustering key is message_id (Snowflake) in descending order. This design means loading recent messages hits a single partition. However, Discord later migrated from Cassandra to ScyllaDB (a Cassandra-compatible database written in C++) because Cassandra's garbage collection pauses caused latency spikes during compaction. The lesson: Cassandra's data model is excellent for chat, but its JVM-based implementation can cause operational headaches at extreme scale.

Time-series optimization is essential for chat storage. Messages are inherently time-ordered — users almost always read recent messages, and older messages are accessed exponentially less frequently. This creates a natural hot/cold data pattern: the last 30 days of messages are "hot" (frequently accessed), while older messages are "cold" (rarely accessed). Production systems exploit this pattern with tiered storage: hot messages live in fast storage (SSD-backed Cassandra or MySQL on NVMe), while cold messages are archived to cheaper storage (S3 or HDD-backed clusters). A background archival process moves messages older than 30-90 days from the hot tier to the cold tier, keeping the hot tier small and fast.

Interview Signal: Mention Time-Based Bucketing

In Cassandra, partitions grow unbounded if the partition key is just conversation_id — a very active group chat could have millions of messages in a single partition, causing "wide partition" performance problems. Solution: bucket by time. Use a composite partition key of (conversation_id, time_bucket) where time_bucket is calculated as message_timestamp / bucket_size (e.g., 10 days). When loading messages, query the most recent bucket first. If the user scrolls further back, query the previous bucket. This bounds partition size and keeps reads fast. Discord uses exactly this technique.
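The bucket arithmetic is simple enough to sketch directly; the 10-day bucket size and the helper names below are assumptions for illustration, not a fixed standard.

```python
# Sketch of time-based bucketing for a composite partition key
# (conversation_id, time_bucket). Bucket size assumed to be 10 days.
BUCKET_MS = 10 * 24 * 60 * 60 * 1000   # 10 days in milliseconds

def time_bucket(timestamp_ms: int) -> int:
    """Bucket number for a message timestamp; joins the partition key."""
    return timestamp_ms // BUCKET_MS

def buckets_to_scan(now_ms: int, pages: int) -> list:
    """Most-recent-first bucket order for paging backward through history."""
    current = time_bucket(now_ms)
    return [current - i for i in range(pages)]

now = 1_704_067_200_000   # an epoch-millis timestamp
# Adjacent buckets differ by exactly one bucket width.
assert time_bucket(now + BUCKET_MS) == time_bucket(now) + 1
# Scrolling back loads the current bucket first, then older ones.
assert buckets_to_scan(now, 3) == [
    time_bucket(now), time_bucket(now) - 1, time_bucket(now) - 2,
]
```

A query for recent history hits only the latest bucket's partition; each scroll-back step targets the next older bucket, so no single partition ever grows without bound.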

For search functionality (searching message content within a conversation or across all conversations), the primary message store is not suitable — Cassandra does not support full-text search, and SQL LIKE queries are too slow at scale. Instead, message content is asynchronously indexed into Elasticsearch or a similar search engine. A Kafka consumer reads new messages from the message pipeline and writes them to the search index. Search queries hit Elasticsearch, which returns message IDs, and the chat client then fetches the full messages from the primary store. This separation keeps the write-optimized message store simple while providing rich search capabilities.

Group Chat Architecture

Group chat introduces a fan-out problem that fundamentally changes the message delivery architecture. In a 1-on-1 conversation, sending a message means delivering it to exactly one recipient. In a group conversation with N members, sending a message means delivering it to N-1 recipients. The complexity and cost of this fan-out varies dramatically depending on group size, and production systems handle small groups and large channels with entirely different mechanisms.

Consider the fan-out math. A message to a group of 10 requires 9 deliveries. A message to a group of 100 requires 99 deliveries. A message to a Discord server channel with 100,000 members viewing the channel requires 100,000 deliveries. If each delivery takes 1ms of processing, a 10-person group adds 9ms of fan-out latency, while a 100K-member channel adds 100 seconds — obviously unacceptable. This is why the fan-out strategy must differ by group size.

graph TD
    A[Sender sends message to Group] --> B{Group Size?}
    B -->|"Small Group ≤ 200"| C[Direct Fan-Out]
    B -->|"Large Channel > 200"| D[Pub/Sub Fan-Out]

    C --> C1[Lookup all member connection servers]
    C1 --> C2[Send message to each member individually]
    C2 --> C3[Each member gets push via WebSocket]

    D --> D1[Publish to channel topic]
    D1 --> D2[Connection servers subscribe to channels]
    D2 --> D3[Each server pushes to its local members]

    style C fill:#e8f5e9
    style D fill:#fff3e0
    style B fill:#e1f5fe

Fan-out strategy differs by group size: direct delivery for small groups, pub/sub for large channels

For small groups (up to about 200 members), direct fan-out works well. When a message arrives at the message service, it looks up the group membership list, queries the connection registry for each member, and sends the message individually to each member's connection server. The message is stored once in the messages table with the group's conversation_id, but delivered N-1 times. Each recipient sees the same message_id. This approach is simple, provides consistent ordering (all members get messages in the same order), and allows per-member delivery tracking (you know exactly which members received and read the message).

Small Group Optimization: Recipient List in Message

For small groups, the message service can include the full recipient list in the delivery event. The delivery worker then fans out to all recipients in parallel without needing to query the membership service separately. This saves one network hop per message. WhatsApp includes the recipient list directly in the message metadata for groups up to 1024 members.

For large channels (hundreds or thousands of concurrent viewers), direct fan-out is too slow and too expensive. Instead, use a pub/sub model. Each large channel is represented as a topic in a pub/sub system (Redis Pub/Sub, or a dedicated message broker). Connection servers that have users subscribed to that channel subscribe to the channel topic. When a message is sent to the channel, it is published to the topic once. Each subscribed connection server receives the message and fans it out locally to the users on that server who are in the channel. This reduces the fan-out from N messages to M messages, where M is the number of connection servers with active subscribers (typically orders of magnitude smaller than N).
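The N-to-M reduction can be shown with a toy in-memory pub/sub model. The `ChannelBus` and `ConnectionServer` classes are invented for this sketch; a production system would use Redis Pub/Sub or a broker in place of the bus.

```python
from collections import defaultdict

class ChannelBus:
    """Toy pub/sub bus: one publish per message, one delivery per
    subscribed connection server (not per channel member)."""

    def __init__(self):
        self.subscribers = defaultdict(set)   # channel -> set of servers

    def subscribe(self, channel, server):
        self.subscribers[channel].add(server)

    def publish(self, channel, message):
        for server in self.subscribers[channel]:
            server.deliver_locally(channel, message)

class ConnectionServer:
    def __init__(self, name):
        self.name = name
        self.local_members = defaultdict(set)  # channel -> local user_ids
        self.pushed = []                       # (user_id, message) pairs

    def join(self, bus, channel, user_id):
        self.local_members[channel].add(user_id)
        bus.subscribe(channel, self)           # idempotent: once per server

    def deliver_locally(self, channel, message):
        # Local fan-out over in-process WebSocket connections.
        for user_id in self.local_members[channel]:
            self.pushed.append((user_id, message))

bus = ChannelBus()
servers = [ConnectionServer(f"ws-{i}") for i in range(3)]
# Nine members of one channel spread across three connection servers.
for uid in range(9):
    servers[uid % 3].join(bus, "general", uid)

bus.publish("general", "hello channel")            # ONE publish ...
assert sum(len(s.pushed) for s in servers) == 9    # ... reaches all 9 members
```

The bus does 3 deliveries (one per server) while the servers do the remaining fan-out locally, which is the M-instead-of-N saving the text describes.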

Group membership service responsibilities

  • Membership storage — Maintains the list of members for each group. Stored in a relational database with a group_members table: (group_id, user_id, role, joined_at). Cached in Redis for fast lookups during message fan-out.
  • Membership changes — Handles join, leave, add, remove, and role change operations. Each change is versioned — the group has a membership_version counter that increments on every change. Clients track the membership_version to detect stale state.
  • Permission enforcement — Validates that the sender is a member of the group and has permission to send messages (some groups restrict messaging to admins only). This check must be fast — it is on the critical path for every group message.
  • Fan-out coordination — Provides the list of recipients for a group message. For small groups, returns the full member list. For large channels, returns the list of subscribed connection servers from the pub/sub subscription registry.

Message ordering in group chats is more complex than in 1-on-1 conversations. In a 1-on-1 chat, only two users are sending, and the server assigns monotonically increasing Snowflake IDs, so ordering is straightforward. In a group chat, multiple users may send messages simultaneously. The server must establish a total order for all messages in the group. The simplest approach (and what most systems use) is server-side ordering: the message service assigns Snowflake IDs sequentially, and all clients display messages ordered by message_id. Since Snowflake IDs are timestamp-based, this gives a rough chronological order with tie-breaking by machine_id and sequence number.
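A Snowflake-style generator can be sketched as below. The 41/10/12-bit layout and custom epoch follow the common convention; real deployments vary, and clock-skew handling is omitted for brevity.

```python
import threading
import time

class SnowflakeGenerator:
    """Sketch of a Snowflake-style ID:
    41-bit timestamp | 10-bit machine_id | 12-bit sequence.
    Clock-skew handling and sequence-exhaustion waiting are omitted."""

    EPOCH_MS = 1_288_834_974_657   # Twitter's custom epoch, for illustration

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit rollover
            else:
                self.sequence = 0
                self.last_ms = now
            return (
                ((now - self.EPOCH_MS) << 22)    # timestamp in high bits
                | (self.machine_id << 12)        # tie-break by machine
                | self.sequence                  # tie-break within one ms
            )

gen = SnowflakeGenerator(machine_id=42)
a, b = gen.next_id(), gen.next_id()
assert a < b                      # IDs from one generator are increasing
assert (a >> 12) & 0x3FF == 42    # machine_id is embedded in the ID
```

Because the timestamp occupies the high bits, sorting by message_id yields rough chronological order across machines, with machine_id and sequence breaking ties exactly as the text describes.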

The Large Group Read Receipt Problem

Read receipts work well for small groups: when user A reads a message, a read event is sent to the server and fanned out to the sender. But in a group with 10,000 members, a single message generates up to 10,000 read receipt events. This creates a quadratic problem: 100 messages x 10,000 members = 1,000,000 read receipt events. Solutions: (1) aggregate read receipts — instead of individual events, periodically report "X members have read messages up to message_id Y," (2) disable per-message read receipts for large groups and show only a read count, (3) use a separate, eventually consistent read receipt service that batches updates. Slack disables read receipts for large channels entirely.

Interview Depth: Mention the Discord Approach

Discord handles channels with hundreds of thousands of members by separating "channel messages" from "notification delivery." Every message is stored once and is available when a user scrolls to that channel. But notifications (push notifications, unread badges) are only generated for users who have enabled notifications for that channel. This means a message to a 100K-member channel does NOT generate 100K push notifications — only the subset of members with notifications enabled (typically a small fraction) receive push delivery. The rest see the message when they actively visit the channel. This insight dramatically reduces fan-out cost for large channels.

Online Presence & Typing Indicators

Online presence (showing whether a user is online, offline, or idle) and typing indicators (showing when someone is composing a message) are secondary features that seem simple but create surprisingly complex engineering challenges at scale. These features are inherently real-time, ephemeral, and high-frequency — the exact combination that stresses distributed systems the most.

The presence service tracks the online/offline status of every user in the system. At 500M DAU with 200M concurrent connections, the presence service must manage 200 million active presence entries and handle millions of status changes per minute (users connecting, disconnecting, and timing out). The naive approach — broadcasting every status change to every user who cares — creates an explosion of messages that can dwarf the actual chat traffic.

graph TD
    A[Client connects] -->|"Register: user online"| B[Presence Service]
    B -->|"Store in Redis"| C["Presence Store<br/>user_id → {status, last_seen, server_id}"]
    D[Client heartbeat every 30s] -->|"Update last_seen"| B
    B -->|"Publish status change"| E[Pub/Sub Channel per User]
    E -->|"Notify subscribed friends"| F[Connection Servers]
    F -->|"Push presence update"| G[Online Friends]

    H[No heartbeat for 90s] -->|"Mark offline"| B
    B -->|"Publish offline event"| E

    style B fill:#f3e5f5
    style C fill:#fff3e0
    style E fill:#e8f5e9

Presence service architecture: heartbeat-driven status tracking with pub/sub notification to friends

The heartbeat protocol is the foundation of presence detection. When a user connects, the connection server reports the event to the presence service, which stores the entry in Redis: user_id maps to a hash containing status (online/idle/offline), last_seen timestamp, and connection_server_id. The client sends a heartbeat every 30 seconds through the WebSocket connection. The connection server forwards these heartbeats to the presence service, which updates the last_seen timestamp. If no heartbeat is received for 90 seconds (3 missed heartbeats), the presence service transitions the user to offline status.

Presence Fan-Out Optimization

The naive presence fan-out is expensive: when user A comes online, broadcast this to all of user A's friends who are currently online. If user A has 500 friends and 200 are online, that is 200 presence update messages. Multiply by millions of users connecting and disconnecting, and presence updates overwhelm the system. Optimization: use lazy presence. Instead of pushing presence changes to all friends, only push to users who are currently viewing a screen that shows user A's presence (an open chat window with user A, or a contact list that is visible). The client subscribes to presence updates for specific users when the UI needs it, and unsubscribes when navigating away. This reduces fan-out by 90% or more.

The presence store in Redis uses a simple key-value structure with TTL-based expiry as a safety net. Each entry is a Redis hash: HSET presence:{user_id} status "online" last_seen 1704067200 server_id "ws-server-42". The key has a TTL of 120 seconds, which is reset on every heartbeat. If the presence service crashes or the heartbeat pipeline fails, the TTL ensures stale presence entries are automatically cleaned up. This is a common pattern: use active updates for normal operation and TTL as a passive safety net.
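The heartbeat-plus-TTL pattern can be sketched without Redis; the `PresenceStore` class below is an invented in-memory stand-in, with the 120s TTL and 90s offline cutoff taken from the text and a fake clock so the example is deterministic.

```python
import time

class PresenceStore:
    """Toy presence store mimicking the Redis pattern: active heartbeat
    updates for normal operation, plus a TTL as a passive safety net."""

    TTL_S = 120            # reset on every heartbeat
    OFFLINE_AFTER_S = 90   # three missed 30s heartbeats

    def __init__(self, clock=time.time):
        self.clock = clock
        self.entries = {}  # user_id -> {status, last_seen, server_id, expires_at}

    def heartbeat(self, user_id, server_id):
        now = self.clock()
        self.entries[user_id] = {
            "status": "online",
            "last_seen": now,
            "server_id": server_id,
            "expires_at": now + self.TTL_S,   # TTL reset, like Redis EXPIRE
        }

    def status(self, user_id) -> str:
        entry = self.entries.get(user_id)
        now = self.clock()
        if entry is None or now >= entry["expires_at"]:
            return "offline"                  # TTL expiry cleans up stale state
        if now - entry["last_seen"] > self.OFFLINE_AFTER_S:
            return "offline"                  # heartbeat timeout
        return entry["status"]

# Fake clock so the sketch is deterministic.
t = [1000.0]
store = PresenceStore(clock=lambda: t[0])
store.heartbeat("alice", "ws-server-42")
assert store.status("alice") == "online"
t[0] += 91                                    # 91s with no heartbeat
assert store.status("alice") == "offline"
```

Note the two independent paths to "offline": the active 90-second timeout for normal operation, and the TTL for the failure case where the heartbeat pipeline itself stops updating entries.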

Presence states and transitions

  • Online — User has an active WebSocket connection and has sent a heartbeat within the last 90 seconds. Shown as a green dot in the UI.
  • Idle — User has an active WebSocket connection but has not interacted with the app for 5-15 minutes (no keystrokes, taps, or scrolls). The client sends an "idle" signal, and the presence service transitions the user. Shown as a yellow dot.
  • Offline — No active WebSocket connection, or heartbeat has not been received for 90+ seconds. Shown as a gray dot with "last seen X minutes ago" timestamp.
  • Do Not Disturb — User has explicitly set this status. Presence is "online" internally, but notifications are suppressed and the UI shows a DND indicator. This is a user-controlled override, not a system-detected state.

Typing indicators are even more ephemeral than presence. When a user starts typing in a conversation, the client sends a "typing" event to the server. The server forwards this event to the other participants in the conversation (for 1-on-1) or to the conversation's pub/sub topic (for groups). Critically, typing events are never persisted — they are purely real-time, fire-and-forget messages. If the recipient is offline, the typing event is simply dropped.

Typing Indicator Throttling

Without throttling, a fast typist generates dozens of typing events per second. This wastes bandwidth and server resources. Solution: throttle typing events on the client side to at most one event every 3-5 seconds. The typing indicator UI shows "User is typing..." for 5 seconds after the last received typing event, then disappears. If another typing event arrives within that 5-second window, the timer resets. If the user sends a message, the typing indicator is immediately cleared. This simple client-side logic reduces typing event traffic by 90%+ while maintaining a responsive UI.
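The client-side logic is a few lines; this sketch assumes a 4-second window (the text's 3-5 second range) and an invented `TypingThrottle` class name.

```python
class TypingThrottle:
    """Client-side throttle: at most one typing event per window."""

    def __init__(self, window_s: float = 4.0):
        self.window_s = window_s
        self.last_sent = float("-inf")

    def on_keystroke(self, now: float) -> bool:
        """Return True if a typing event should go over the wire."""
        if now - self.last_sent >= self.window_s:
            self.last_sent = now
            return True
        return False   # suppressed: recipient's 5s display timer still covers it

throttle = TypingThrottle()
# Ten keystrokes over ~9 seconds produce only three network events.
sent = [throttle.on_keystroke(t) for t in [0, 0.5, 1, 2, 3, 4, 5, 6, 8, 9]]
assert sent.count(True) == 3   # events at t=0, t=4, t=8
```

Pairing a 4-second send window with a 5-second display timeout on the receiving side means the indicator stays lit continuously while the sender keeps typing.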

Interview Signal: Separate Presence from Messaging

Strong candidates explicitly separate the presence service from the message delivery pipeline. Presence is best-effort — if a presence update is delayed by a few seconds or lost entirely, it is a minor UI inconvenience. Messages are durable — losing a message is unacceptable. By keeping these on separate infrastructure, a presence service outage does not affect message delivery, and message delivery spikes do not overwhelm the presence system. This is the "bulkhead" pattern: isolate non-critical from critical paths so failures do not cascade.

At scale, presence is one of the most expensive features in a chat system relative to its perceived importance. WhatsApp intentionally limits presence visibility: you only see last-seen times, not real-time online status, and users can disable even that. This dramatically reduces the presence fan-out volume. Slack shows real-time presence but limits it to workspace members (much smaller than a global friend graph). Discord shows presence but only within servers you share with the user. Each platform makes different trade-offs between presence richness and system cost.

Delivery Semantics — Sent, Delivered, Read

Delivery semantics define the contract between the chat system and its users: what does it mean for a message to be "sent," "delivered," and "read"? Getting these semantics right is critical for user trust — users need confidence that their messages are reaching the intended recipient. The engineering challenge is implementing these guarantees reliably across unreliable networks, intermittent connectivity, and millions of concurrent sessions.

The three delivery states correspond to specific system events. "Sent" (single checkmark) means the server has persisted the message and acknowledged receipt to the sender. "Delivered" (double checkmark) means the recipient device has received the message and sent a client-level ACK back to the server. "Read" (blue double checkmark or equivalent) means the recipient has actually viewed the message in the chat window. Each state transition requires a round-trip confirmation, and each can fail independently.

Delivery state machine transitions

1. CREATED: Message exists only on the sender device. Not yet sent to the server.

2. SENT: Server has received and persisted the message. Server ACK received by sender. Sender sees a single checkmark.

3. DELIVERED: Recipient device has received the message via WebSocket push or offline sync. Client ACK sent from recipient to server. Sender sees a double checkmark.

4. READ: Recipient has scrolled the message into view in the chat window. Read receipt sent from recipient to server. Sender sees blue checkmarks (WhatsApp) or a "Seen" label (Messenger).

5. FAILED: Server rejected the message (e.g., sender is not a conversation participant, message too large) or the message could not be persisted. Sender sees an error indicator with a retry option.


The system uses at-least-once delivery semantics. This means the system guarantees that every message will be delivered at least once to the recipient, but in rare cases (network retries, server failovers), a message might be delivered twice. The alternative, exactly-once delivery, is essentially impossible in a distributed system without prohibitive coordination overhead. At-least-once is the pragmatic choice made by WhatsApp, Slack, Discord, and every other production chat system.

Client-Side Deduplication

Since the system uses at-least-once delivery, duplicate messages can arrive at the recipient. The client must deduplicate using the message_id (or client_message_id). Before displaying a received message, the client checks its local message cache: if a message with this ID already exists, it is a duplicate and is silently dropped. This is a simple O(1) lookup in a local hash set. The server can also deduplicate during the sync process by ensuring the sync response does not contain messages the client already has (based on the client's last known message_id).
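The receive-side check is a constant-time set lookup; the `ChatClient` class below is an invented sketch of that logic.

```python
class ChatClient:
    """Sketch of receive-side deduplication under at-least-once delivery."""

    def __init__(self):
        self.seen_ids = set()   # local cache of received message_ids
        self.timeline = []      # messages actually rendered in the UI

    def on_receive(self, message) -> bool:
        """Return True if the message is new and was displayed."""
        if message["message_id"] in self.seen_ids:
            return False                       # duplicate: silently drop
        self.seen_ids.add(message["message_id"])
        self.timeline.append(message)
        return True

client = ChatClient()
msg = {"message_id": 101, "content": "hi"}
assert client.on_receive(msg) is True
assert client.on_receive(msg) is False   # redelivered after a server retry
assert len(client.timeline) == 1         # shown exactly once
```

In practice the client would still ACK the duplicate so the server stops retrying; only the UI insertion is skipped.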

The client ACK mechanism is what enables the transition from "sent" to "delivered." When the recipient device receives a message via WebSocket, it must send an explicit acknowledgment back to the server. This ACK is a small message containing the message_id and a timestamp. The server updates the message status in the database and notifies the sender. If the server does not receive a client ACK within a timeout window (30 seconds), it retries delivery by pushing the message again. After 3 failed retry attempts, the server falls back to push notification delivery.

The Offline Queue Problem

When a user is offline, messages accumulate in an offline queue. The user might be offline for hours, days, or even weeks. When they reconnect, all queued messages must be delivered in the correct order. At scale, the offline queue for a single user could contain thousands of messages across hundreds of conversations. The sync process must be efficient: send messages in chronological batches, allow the client to ACK in batches (not one by one), and prioritize recent conversations over old ones. WhatsApp uses a per-user offline queue in a persistent store, with batch sync delivering up to 5000 messages per sync request.

Read receipts require careful handling of privacy and performance. On the privacy side, many users prefer not to send read receipts (WhatsApp allows disabling them). On the performance side, read receipts generate a high volume of status update messages. When a user opens a conversation and scrolls through 50 unread messages, the client should NOT send 50 individual read receipt events. Instead, it sends a single "read up to message_id X" event, which marks all messages up to that ID as read in a single operation.

Read Receipt Aggregation

Efficient read receipt implementation: the client tracks the highest message_id visible in the chat viewport. Every 2 seconds (debounced), if the highest visible message_id has changed, the client sends a single read receipt: {conversation_id, last_read_message_id}. The server updates the read watermark for this user in this conversation. To check whether a specific message has been read, compare message_id against the read watermark. This reduces read receipt traffic from one event per message to one event per 2-second reading session, regardless of how many messages are visible.
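The server side of the watermark scheme reduces to one cursor per (conversation, user); the `ReadWatermarks` class is an invented sketch of that bookkeeping.

```python
class ReadWatermarks:
    """Sketch of watermark-based read receipts: one cursor per
    (conversation, user) instead of one event per message."""

    def __init__(self):
        self.watermarks = {}   # (conversation_id, user_id) -> last_read_message_id

    def mark_read(self, conversation_id, user_id, last_read_message_id):
        key = (conversation_id, user_id)
        # Watermarks only move forward; stale or out-of-order receipts are no-ops.
        if last_read_message_id > self.watermarks.get(key, 0):
            self.watermarks[key] = last_read_message_id

    def is_read(self, conversation_id, user_id, message_id) -> bool:
        return message_id <= self.watermarks.get((conversation_id, user_id), 0)

wm = ReadWatermarks()
wm.mark_read("conv-1", "bob", 150)       # one event covers messages 1..150
assert wm.is_read("conv-1", "bob", 149)
assert not wm.is_read("conv-1", "bob", 151)
wm.mark_read("conv-1", "bob", 120)       # stale receipt arriving late: ignored
assert wm.is_read("conv-1", "bob", 150)
```

The forward-only rule matters because receipts can arrive out of order over the network; without it, a delayed older receipt would mark messages unread again.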

Push notifications serve as the delivery fallback for offline users. When the delivery system cannot reach the recipient via WebSocket (no active connection in the registry) and the message has been sitting in the offline queue for more than a configurable delay (typically 5-30 seconds), a push notification is sent via APNs (iOS) or FCM (Android). The push notification contains a truncated message preview and metadata to identify the conversation. When the user taps the notification, the app opens, establishes a WebSocket connection, syncs all pending messages, and the user sees the full conversation. Push notifications are rate-limited per user to prevent notification flooding — at most one push notification per conversation per minute.

Common Mistake: Trusting Push Notification Delivery

Push notifications (APNs/FCM) are best-effort — they are not guaranteed to be delivered. The device might have notifications disabled, the push token might be expired, or the notification service might be throttling. Never use push notifications as the primary delivery mechanism. They are a notification mechanism only — the actual message content is always delivered through the WebSocket sync pipeline when the user opens the app. If you design push notifications as the delivery mechanism, you will lose messages for users with notifications disabled.

Scaling to Billions of Messages

Scaling a chat system from millions to billions of daily messages requires careful attention to sharding, replication, cross-region deployment, message synchronization, and encryption. Each of these topics could fill an entire system design session on its own, but an L5-L6 candidate is expected to discuss the key strategies and trade-offs at a high level with specific architectural choices.

Sharding the message store is the first scaling challenge. As discussed in the storage section, messages are partitioned by conversation_id. But how are conversations distributed across database nodes? The standard approach is consistent hashing: conversation_id is hashed, and the hash determines which shard (database partition) stores that conversation's messages. Consistent hashing ensures that adding or removing shards only redistributes a small fraction of conversations, avoiding a full data migration.
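A minimal consistent-hash ring, with virtual nodes for even load spread, can be sketched as follows; the class name, MD5 choice, and vnode count are assumptions for illustration.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Sketch of consistent hashing for conversation -> shard placement,
    using virtual nodes so load spreads evenly across shards."""

    def __init__(self, shards, vnodes=100):
        self.ring = []   # sorted list of (hash, shard)
        for shard in shards:
            for v in range(vnodes):
                self.ring.append((self._hash(f"{shard}:{v}"), shard))
        self.ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, conversation_id: str) -> str:
        h = self._hash(conversation_id)
        # First vnode clockwise from the key's hash owns it (wrap at the end).
        idx = bisect.bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
before = {cid: ring.shard_for(cid) for cid in (f"conv-{i}" for i in range(1000))}

# Adding a shard moves only the conversations the new shard takes over.
bigger = ConsistentHashRing(["shard-a", "shard-b", "shard-c", "shard-d"])
moved = sum(1 for cid, shard in before.items() if bigger.shard_for(cid) != shard)
assert moved < 500   # far fewer than a full remap (expect roughly 1/4)
```

With naive modulo hashing (`hash % num_shards`), going from 3 to 4 shards would remap about three quarters of conversations; here only the keys claimed by the new shard's vnodes move.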

Sharding strategies and trade-offs

  • Shard by conversation_id (recommended) — All messages in a conversation are on the same shard. Queries for chat history are single-shard (fast). Adding members to a group does not move data. Downside: a very active group conversation creates a hot shard. Mitigation: rate-limit message sending per conversation and spread hot conversations across multiple sub-shards with a secondary hash.
  • Shard by user_id — All of a user's messages are on the same shard. Useful for "search all my messages" queries. Downside: loading a conversation requires reading from multiple shards (one per participant) and merging results. This is significantly slower for the primary access pattern (load chat history) and is generally not recommended for chat.
  • Shard by message_id range — Messages are distributed by time range across shards (shard 1 has January messages, shard 2 has February, etc.). Useful for time-based archival. Downside: recent messages are all on the same shard, creating a write hotspot. Not recommended as a primary sharding strategy.

Connection Server Sharding

Connection servers are sharded by user: each user connects to one specific connection server (determined by consistent hashing of user_id or by a load-balanced assignment stored in the connection registry). This means the connection server fleet scales linearly with concurrent users. At 200M concurrent connections with 100K connections per server, you need 2,000 connection servers. These servers are stateful (each holds open TCP connections) and must be treated with care during deployments: use graceful drain (stop accepting new connections, wait for existing connections to migrate) rather than hard restarts.

Cross-region replication is essential for a global chat service. Users in Asia should not experience 200ms+ latency because the chat servers are in North America. The standard approach is to deploy chat infrastructure in multiple regions (e.g., US-East, EU-West, Asia-Pacific) and route each user to the nearest region. Messages between users in the same region stay local. Messages between users in different regions require cross-region routing.

graph LR
    subgraph "US-East Region"
        A1[Connection Servers] --> B1[Message Service]
        B1 --> C1[Message Store Shard]
    end

    subgraph "EU-West Region"
        A2[Connection Servers] --> B2[Message Service]
        B2 --> C2[Message Store Shard]
    end

    subgraph "Cross-Region Bus"
        D[Kafka / Event Bridge]
    end

    B1 -->|"Cross-region message"| D
    D -->|"Replicate to EU"| B2
    B2 -->|"Cross-region message"| D
    D -->|"Replicate to US"| B1

    style D fill:#fff3e0

Multi-region chat architecture with cross-region event bus for inter-region message routing

The cross-region message flow works as follows. User A in US-East sends a message to User B in EU-West. User A's message is received by a US-East connection server, processed by the US-East message service, and stored in a US-East message shard. The message service also publishes the message to a cross-region event bus (Kafka with cross-region replication, or a dedicated message relay). The EU-West message service consumes the event, stores a replica in an EU-West shard (for fast local reads when User B loads the conversation), and delivers the message to User B through an EU-West connection server. Cross-region latency adds 100-200ms, but this is a one-time cost per message, and subsequent reads by User B are served from the local EU-West replica.

The message sync protocol ensures clients always have a consistent view of their conversations, even after periods of disconnection. When a client reconnects, it sends its sync state to the server: the last message_id it has seen for each active conversation (or a single global sync cursor if using a sequential event log). The server computes the delta — all messages after the client's last known position — and sends them in batches. This protocol must be efficient: a user who was offline for a week in 100 active group chats could have thousands of unread messages, and the sync must complete within seconds.

Sync Protocol Design

The client maintains a per-conversation sync cursor (the highest message_id received). On reconnection, the client sends a sync request: {conversations: [{conversation_id: "abc", last_message_id: 12345}, ...]}. The server queries each conversation for messages after the given last_message_id, batches the results (up to 100 messages per conversation, 5000 messages total per sync request), and returns them ordered by conversation then by message_id. If more messages exist beyond the batch limit, the response includes a "has_more" flag, and the client sends additional sync requests. Priority: conversations with the most recent activity are synced first.
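The server-side delta computation can be sketched as a pure function over an in-memory store; the function name, request shape, and limits follow the description above but are otherwise invented.

```python
def compute_sync_response(store, sync_request, per_conv_limit=100, total_limit=5000):
    """Sketch of the server side of the sync protocol.

    `store` maps conversation_id -> messages sorted by message_id;
    `sync_request` is [{"conversation_id": ..., "last_message_id": ...}, ...],
    assumed pre-sorted with the most recently active conversations first.
    """
    response, remaining = [], total_limit
    for cursor in sync_request:
        if remaining <= 0:
            # Global budget exhausted: tell the client to sync again.
            response.append({"conversation_id": cursor["conversation_id"],
                             "messages": [], "has_more": True})
            continue
        all_msgs = store.get(cursor["conversation_id"], [])
        delta = [m for m in all_msgs if m["message_id"] > cursor["last_message_id"]]
        batch = delta[: min(per_conv_limit, remaining)]
        remaining -= len(batch)
        response.append({
            "conversation_id": cursor["conversation_id"],
            "messages": batch,
            "has_more": len(batch) < len(delta),   # client should request more
        })
    return response

store = {"abc": [{"message_id": i} for i in range(1, 301)]}
resp = compute_sync_response(store, [{"conversation_id": "abc", "last_message_id": 12}])
assert len(resp[0]["messages"]) == 100            # capped at the per-conversation limit
assert resp[0]["messages"][0]["message_id"] == 13
assert resp[0]["has_more"] is True                # newer messages remain
```

On the next round the client simply re-sends the request with `last_message_id` advanced to the highest ID it received, so a crash mid-sync costs nothing but a retry.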

End-to-end encryption (E2EE) adds a layer of security where messages are encrypted on the sender device and can only be decrypted on the recipient device. The server stores ciphertext and cannot read message content. WhatsApp, Signal, and iMessage all implement E2EE. The key insight for a system design interview is that E2EE fundamentally changes the architecture: the server cannot index, search, filter, or moderate message content, because it only sees encrypted blobs. This means server-side search, spam filtering, and content moderation must either operate on metadata only or be implemented on the client device.

E2EE Architectural Impact

End-to-end encryption has major architectural implications. Server-side message search is impossible (the server cannot read message content), so search must happen on-device with a local index. Spam and abuse detection must rely on metadata (message frequency, recipient patterns) rather than content analysis. Group chat E2EE is harder than pairwise E2EE: the naive approach encrypts each message separately for every member's key, so ciphertext size scales linearly with group size; the Signal Protocol's group messaging extension (sender keys) encrypts each message once under a shared sender key and pays the per-member cost only during key distribution. Backup and multi-device sync require additional key management: the user's encryption keys must be securely stored or derived on each device.

Interview Depth: Mention E2EE Trade-offs

In an interview, you do not need to explain the cryptographic details of E2EE (Signal Protocol, Diffie-Hellman key exchange, etc.). But you should mention it as an architectural consideration and discuss the trade-offs: E2EE provides strong privacy guarantees but eliminates server-side search, complicates multi-device sync (each device needs its own key pair), makes server-side spam detection harder, and increases message size for group chats. Stating these trade-offs shows architectural maturity.

Finally, operational scaling concerns at billions of messages per day include: monitoring message delivery latency at every hop (sender to server, server to storage, storage ACK, queue to delivery, delivery to recipient), alerting on delivery failure rates per region, capacity planning for storage growth (25 TB/day means roughly 9 PB/year — plan for multi-year retention), and automated scaling of connection server fleets based on concurrent connection counts. Chat systems at this scale require dedicated SRE teams for each major subsystem: connection management, message storage, delivery pipeline, and presence.

How this might come up in interviews

Chat system design appears directly in FAANG interviews ("Design WhatsApp," "Design Slack," "Design Facebook Messenger," "Design Discord") and indirectly in questions about real-time messaging, notification systems, and collaboration tools. It is one of the top 5 most asked system design questions because it tests real-time networking (WebSockets), durable storage (message persistence), delivery guarantees (at-least-once with deduplication), presence management (heartbeat protocols), group messaging (fan-out strategies), and horizontal scaling (sharding, replication) — all in a single question. At L4-L5, candidates are expected to describe WebSocket basics, a simple message flow, and basic storage. At L6+, candidates must demonstrate deep knowledge of delivery semantics, connection storm mitigation, cross-region replication, and sync protocols.

Common questions:

  • L4: Design a basic 1-on-1 chat application. Walk through how a message gets from sender to recipient. [Tests: basic understanding of WebSocket transport, message storage, simple delivery flow, ability to draw a sequence diagram]
  • L4-L5: Explain the difference between WebSocket and Long Polling for real-time messaging. When would you use each, and what are the trade-offs? [Tests: understanding of transport protocols, latency vs compatibility trade-offs, ability to discuss fallback strategies]
  • L5: How would you design the message storage layer for a chat system handling 10 billion messages per day? What database would you choose and how would you partition the data? [Tests: storage estimation, schema design, partition key selection, understanding of Cassandra vs MySQL trade-offs]
  • L5-L6: Design the group chat message delivery system. How does fan-out work differently for a 10-person group versus a 100,000-member channel? [Tests: fan-out strategies, direct delivery vs pub/sub, membership service design, understanding of the quadratic read-receipt problem]
  • L6: Walk through the complete delivery guarantee system — how do you ensure a message is never lost and never delivered twice? [Tests: write-ahead persistence, client ACK protocol, idempotency with client_message_id, offline queue management, push notification fallback]
  • L6-L7: How would you scale a chat system to support 500 million concurrent connections across multiple geographic regions? Discuss connection management, cross-region message routing, and the sync protocol for offline users. [Tests: connection server fleet sizing, consistent hashing, cross-region replication, sync protocol design, connection storm mitigation, E2EE architectural impact]

Try this question: Ask the interviewer: Is this 1-on-1 messaging, group chat, or both? What is the maximum group size we need to support? Do we need end-to-end encryption? What are the latency targets for message delivery? Do we need to support message search? Is this a mobile-first or desktop-first application (affects connection management and battery optimization)? Are there compliance requirements for message retention or deletion?

Strong answer: Drawing a complete message sequence diagram from sender to recipient with numbered steps. Discussing WebSocket heartbeat, reconnection with jittered backoff, and connection storm mitigation. Explaining the write-ahead pattern with specific latency budgets per hop. Calculating storage and connection server fleet sizing with real numbers. Discussing Cassandra partition key design with time-based bucketing. Mentioning the bulkhead pattern for separating presence from messaging. Bringing up E2EE trade-offs without being asked.

Red flags: Using HTTP polling as the primary transport without mentioning WebSocket. Delivering messages before persisting them (violating durability). Not mentioning offline message handling or push notification fallback. Using a single database without sharding for billions of messages. Not differentiating between small group and large channel fan-out. Ignoring message ordering concerns. Not discussing any form of delivery acknowledgment or receipt tracking. Proposing exactly-once delivery without acknowledging it is impractical.

Key takeaways

  • The message flow follows a write-ahead pattern: persist the message to durable storage BEFORE attempting delivery. This guarantees zero message loss even if the delivery pipeline fails. The sender receives a "sent" acknowledgment as soon as storage confirms the write, and a "delivered" acknowledgment when the recipient device sends a client-level ACK.
  • WebSocket is the primary transport for all production chat systems, providing sub-10ms frame delivery with minimal overhead. Long Polling serves as an automatic fallback for restricted networks. The connection server fleet is the most stateful component — each server maintains tens of thousands of persistent TCP connections with heartbeat-based liveness monitoring.
  • Message storage must be partitioned by conversation_id (not user_id) because the primary access pattern is loading recent messages for a conversation. Use Snowflake-style message IDs as the clustering key in descending order to make "get last 50 messages" a single-partition sequential read. Time-based bucketing prevents unbounded partition growth in active conversations.
  • Group chat fan-out strategy must vary by group size: direct fan-out for small groups (up to 200 members) where messages are delivered individually to each member, and pub/sub fan-out for large channels (200+ members) where messages are published to a topic and each connection server fans out locally. This reduces fan-out cost from O(members) to O(connection_servers).
  • Presence, typing indicators, and read receipts are secondary features that generate surprisingly high traffic volumes. Implement them on separate infrastructure from the message delivery pipeline (bulkhead pattern) so failures in presence do not affect message delivery. Use lazy presence (only push updates to users actively viewing the relevant UI), throttle typing events to one every 3-5 seconds, and aggregate read receipts using watermark-based tracking rather than per-message events.
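The typing-event throttle mentioned in the last takeaway is a one-screen piece of client code. This is a sketch under an assumed 4-second interval, with an injectable clock so it can be tested deterministically:

```python
# Client-side typing-indicator throttle: emit at most one typing event
# per interval, no matter how fast the user types.
import time

class TypingThrottle:
    def __init__(self, interval_s=4.0, clock=time.monotonic):
        self.interval = interval_s
        self.clock = clock
        self.last_sent = float("-inf")

    def should_send(self):
        """Call on every keystroke; returns True only when an event is due."""
        now = self.clock()
        if now - self.last_sent >= self.interval:
            self.last_sent = now
            return True
        return False
```

The server side can apply the same idea to read receipts: instead of one event per message, send a periodic "read up to message_id N" watermark.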
Before you move on: can you answer these?

A user sends a message while on a flaky mobile connection. The message reaches the server but the ACK is lost due to a network drop. The user's app retries. Walk through exactly what happens and how you prevent duplicate messages.

When the user composes the message, the client generates a client_message_id (UUID) before sending. The message is sent to the server with this client_message_id. The server receives the message, checks if a message with this client_message_id already exists in the message store. Since this is the first attempt, no duplicate exists, so the server persists the message and sends an ACK back to the client. The ACK is lost due to the network drop. The client does not receive the ACK within its timeout window (5 seconds), so it retries by sending the exact same message with the same client_message_id. The server receives the retry, checks the message store, finds a message with this client_message_id already exists, and returns the existing message (with its server-assigned message_id and status) as the ACK response instead of creating a duplicate. The client receives the ACK and shows the sent checkmark. On the recipient side, even if the delivery pipeline somehow sends the message twice, the recipient client deduplicates by checking message_id against its local cache before displaying.
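The retry-safe path above reduces to an upsert keyed by client_message_id. This sketch uses an in-memory dict and a counter in place of a real message store and Snowflake ID generator (both are illustrative assumptions):

```python
# Idempotent send: the client attaches a client-generated UUID; a retry after
# a lost ACK re-returns the original record instead of inserting a duplicate.
import uuid

class MessageStore:
    def __init__(self):
        self.by_client_id = {}   # client_message_id -> stored record
        self.next_server_id = 1  # stand-in for a Snowflake-style ID generator

    def handle_send(self, client_message_id, conversation_id, body):
        """Persist-if-absent; always return the ACK payload."""
        existing = self.by_client_id.get(client_message_id)
        if existing is not None:
            return existing       # duplicate attempt: re-ACK, do not re-insert
        record = {
            "message_id": self.next_server_id,
            "client_message_id": client_message_id,
            "conversation_id": conversation_id,
            "body": body,
            "status": "sent",     # write-ahead: persisted before delivery
        }
        self.next_server_id += 1
        self.by_client_id[client_message_id] = record
        return record

store = MessageStore()
cmid = str(uuid.uuid4())          # generated on the client before first send

ack1 = store.handle_send(cmid, "conv-ab", "hello")  # first attempt; ACK lost
ack2 = store.handle_send(cmid, "conv-ab", "hello")  # retry with the same cmid
assert ack1["message_id"] == ack2["message_id"]     # no duplicate created
```

In a real system the check-then-insert must be atomic (a unique constraint or conditional write), since two retries can race.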

You need to design the storage for a chat system handling 50 billion messages per day. Calculate the annual storage requirement and explain your sharding strategy with specific numbers.

Average message size with metadata is approximately 500 bytes. Daily storage: 50B messages x 500 bytes = 25 TB/day. Annual storage: 25 TB x 365 = 9.125 PB/year. With 3x replication factor, total disk needed: approximately 27.4 PB/year. Sharding strategy: shard by conversation_id using consistent hashing across 1024 logical partitions. With 500M active conversations, each partition holds roughly 500K conversations. Map logical partitions to physical shards (Cassandra nodes or MySQL instances): at 10TB per physical node, you need approximately 2,740 nodes to hold one year of data with replication. Use time-based bucketing within each partition (conversation_id, time_bucket) to bound partition size. Implement tiered storage: messages older than 90 days are moved from hot storage (NVMe SSD) to cold storage (HDD or S3), reducing the hot tier to approximately 2.25 PB (90 days). The hot tier needs approximately 675 SSD-backed nodes with replication.
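The capacity math above, spelled out in decimal units (1 TB = 10^12 bytes), so the numbers can be sanity-checked:

```python
# Back-of-envelope storage sizing for 50B messages/day at ~500 bytes each.
MSGS_PER_DAY = 50e9
AVG_MSG_BYTES = 500
REPLICATION = 3
NODE_CAPACITY_TB = 10            # assumed usable capacity per physical node

daily_tb = MSGS_PER_DAY * AVG_MSG_BYTES / 1e12           # 25 TB/day
annual_pb = daily_tb * 365 / 1000                        # 9.125 PB/year raw
replicated_pb = annual_pb * REPLICATION                  # ~27.4 PB on disk
nodes_for_year = replicated_pb * 1000 / NODE_CAPACITY_TB # ~2,740 nodes

hot_days = 90                                            # hot tier retention
hot_pb = daily_tb * hot_days / 1000                      # 2.25 PB raw
hot_nodes = hot_pb * REPLICATION * 1000 / NODE_CAPACITY_TB  # 675 SSD nodes
```

Being able to re-derive these on a whiteboard matters more than memorizing them: change one input (message size, retention, replication factor) and the rest follows mechanically.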

Explain the difference between how you would handle message fan-out for a 10-person family group chat versus a 50,000-member Discord channel. Why can't the same mechanism work for both?

For the 10-person group: use direct fan-out. When a message arrives, the message service looks up the 10 members from the membership cache, queries the connection registry for each member's connection server, and sends the message individually to each member's connection server. Total work: 10 registry lookups and 10 message deliveries. This completes in under 10ms. Read receipts are tracked per-member. For the 50,000-member Discord channel: direct fan-out requires 50,000 registry lookups and 50,000 individual deliveries per message — this takes seconds and overwhelms the delivery pipeline if messages arrive frequently. Instead, use pub/sub fan-out: the channel is a pub/sub topic. Each connection server that has active subscribers in the channel subscribes to the topic. When a message is published to the topic, each subscribed connection server receives it once and fans out locally to its connected members. If members are spread across 500 connection servers, the fan-out is 500 messages instead of 50,000. Read receipts are aggregated into counts rather than tracked per-member.
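The cost difference can be made concrete by counting network-level sends under each strategy. The `registry` mapping (user to connection server) is an assumed stand-in for the connection registry:

```python
# Network fan-out cost per message under the two strategies.
def direct_fanout_cost(members, registry):
    """Small groups: one registry lookup and one network delivery per member."""
    return len({(registry[u], u) for u in members})   # O(members)

def pubsub_fanout_cost(members, registry):
    """Large channels: publish once to each connection server that has at
    least one subscriber; that server fans out locally over open sockets."""
    return len({registry[u] for u in members})        # O(connection_servers)

# 50,000 members spread across 500 connection servers:
registry = {f"user{i}": f"cs{i % 500}" for i in range(50_000)}
members = list(registry)
# direct: 50,000 network deliveries; pub/sub: 500 publishes + local fan-out
```

The local fan-out inside each connection server still touches every connected member, but those writes go over sockets the server already holds, avoiding 50,000 registry lookups and cross-server hops per message.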

🧠Mental Model

💡 Analogy

A chat system works like a postal system with mailboxes. Each user has a personal mailbox (their message queue and offline storage). Postal carriers (connection servers) deliver letters (messages) when the recipient is home (online) — they walk up to the door and hand the letter directly. When the recipient is not home (offline), the carrier leaves the letter at the door (offline queue) and sticks a notification in the mailbox (push notification). Group chats work like mailing lists: one letter gets photocopied and placed into every member's mailbox. The post office (message service) keeps a carbon copy of every letter in its archive (message store) so it can be retrieved later. The postal system also tracks delivery status: a sent stamp when the letter leaves the post office, a delivery confirmation when it reaches the mailbox, and a read receipt when the recipient opens it. For presence, think of each house having a light in the window — if the light is on, the person is home. Neighbors (friends) can glance at the light to check, but the postal system does not send a telegram every time someone turns their light on or off.

⚡ Core Idea

A chat system is fundamentally a message routing and persistence system. Messages flow from sender through a gateway to a storage layer (for durability) and then to the recipient (for real-time delivery). The architectural complexity arises from handling the mismatch between online and offline users, maintaining message ordering across distributed servers, and scaling the connection management layer to millions of concurrent WebSocket connections.

🎯 Why It Matters

Chat system design tests every core distributed systems skill in a single question: real-time networking, durable storage, delivery guarantees, presence management, and horizontal scaling. Understanding the complete message flow — from WebSocket frame to persistent storage to recipient delivery — demonstrates architectural competence that applies to any real-time system, not just chat. The patterns you learn here (write-ahead persistence, client ACKs, offline queuing, heartbeat-based presence, fan-out strategies) appear in notification systems, real-time collaboration tools, IoT platforms, and gaming backends.
