
Designing a File Storage & Sync Service (Dropbox / Google Drive)

A comprehensive deep-dive into designing a production-grade cloud file storage and synchronization service capable of handling hundreds of millions of users, petabytes of data, and real-time multi-device sync. Covers file chunking and content-addressable storage, metadata service design with SQL-backed folder hierarchies, block storage vs object storage architecture, the sync protocol with long polling and conflict detection, conflict resolution strategies from last-writer-wins to operational transform, sharing and permission models with ACL inheritance, and scaling with erasure coding, tiered storage, and cross-region replication. This is one of the most frequently asked system design interview questions at FAANG companies because it tests storage architecture, distributed synchronization, consistency guarantees, and cost optimization in a single 45-minute session.

🎯Key Takeaways
Separate metadata (SQL, strongly consistent) from file data (object storage, high durability). This two-plane architecture is the foundation of every production file storage system and allows each plane to scale independently.
File chunking with content-addressable storage (SHA-256 hashing) enables three critical capabilities simultaneously: resumable uploads, cross-user deduplication (60-75% storage savings), and efficient sync (only changed chunks are transferred).
The sync protocol uses a monotonic cursor per device and long-poll notifications to achieve near-real-time propagation. When a file changes, only the devices with access are notified, and they fetch only the new or modified chunks.
Conflict resolution is the hardest problem. Use version vectors for detection and conflicted copies as the safe default for binary files. Three-way merge and OT/CRDTs are only practical for structured text data in real-time collaboration scenarios.
Cost optimization through tiered storage (hot, warm, cold, archive) and garbage collection of orphaned chunks is essential at scale. Moving 85% of data from hot to cold storage can save hundreds of millions of dollars annually at exabyte scale.


Why File Storage & Sync Is a Top-Tier Interview Question

Cloud file storage and synchronization is one of the most deceptively complex problems in distributed systems. On the surface, it seems simple: let users upload files, download them from another device, and share them with others. But beneath this simplicity lies a web of engineering challenges that span storage architecture, distributed consensus, conflict resolution, permission management, and cost optimization at massive scale.

Dropbox serves over 700 million registered users, with over 15 million paying customers, storing more than 2 exabytes of data. Google Drive has over 1 billion users. Microsoft OneDrive is embedded in every Windows installation and Office 365 subscription, reaching hundreds of millions of daily active users. iCloud Drive syncs data across the entire Apple ecosystem. These services collectively handle billions of file operations per day, and their architectures represent some of the most sophisticated distributed storage systems ever built.

Why Interviewers Love This Question

A file storage and sync service touches every layer of the distributed systems stack: chunked upload protocols, content-addressable storage, metadata consistency, real-time sync with conflict detection, permission inheritance in tree structures, and cost optimization across storage tiers. It can be discussed at L4 depth (basic upload/download with S3) or L7 depth (delta sync algorithms, vector clocks for conflict resolution, erasure coding for durability). This range makes it a top-tier calibration question at Google, Meta, Dropbox, and Microsoft.

The core technical challenge that makes this problem interesting is synchronization. When a user edits a file on their laptop while offline and a colleague edits the same file on their desktop, what happens when both devices come back online? How do you detect that a conflict occurred? How do you resolve it without losing either person's work? How do you propagate the resolution to all other devices that have a copy? These questions require deep understanding of distributed systems fundamentals: consistency models, version vectors, event ordering, and conflict-free replicated data types.

Core engineering challenges in file storage and sync

  • Chunking and deduplication — Breaking files into fixed or variable-size chunks, hashing each chunk for deduplication, and storing only unique chunks across all users. Dropbox reported 75% storage savings from cross-user deduplication.
  • Multi-device synchronization — Propagating changes across laptops, phones, tablets, and web clients in near-real-time. Each device has its own local state, connection quality, and sync cadence.
  • Conflict detection and resolution — Detecting when two devices modified the same file concurrently and resolving the conflict without data loss. This is fundamentally a distributed consensus problem.
  • Permission management at scale — Enforcing read/write/share permissions across deeply nested folder hierarchies, shared links with expiration, and organizational policies — all while maintaining sub-100ms access checks.
  • Storage cost optimization — Balancing durability (eleven 9s), availability, and cost across hot, warm, and cold storage tiers. A single percentage point of storage efficiency at exabyte scale saves millions of dollars annually.
  • Bandwidth efficiency — Minimizing data transfer through delta sync, compression, and deduplication. Users on mobile networks or slow connections need sync to work with minimal bandwidth consumption.

Scale Numbers to Remember

Dropbox: 700M registered users, 2+ exabytes stored, 1.2 billion files uploaded per day. Google Drive: 1B+ users, deeply integrated with Google Workspace. Average file size varies enormously: documents are 50KB-5MB, photos 2-10MB, videos 100MB-10GB. A typical active user syncs 5-20 files per day across 2-3 devices. Peak sync load often occurs Monday morning as weekend changes propagate.

Understanding the scale is critical because it drives every architectural decision. A file storage service for 1,000 users can use a single database with files stored on local disk. A service for 700 million users requires sharded metadata databases, distributed object storage spanning multiple data centers, a real-time sync protocol handling millions of concurrent long-poll connections, and a permission system that can evaluate access checks in microseconds. The interview question is specifically testing whether you can navigate these scaling thresholds and explain which architectural components become necessary at each stage.

Service | Users | Storage | Key Technical Innovation
Dropbox | 700M registered | 2+ exabytes | Block-level sync, cross-user dedup, Magic Pocket storage
Google Drive | 1B+ | Not disclosed (multi-exabyte) | Deep Workspace integration, real-time collaboration
OneDrive | 400M+ (est.) | Not disclosed | OS-level integration, Known Folder Move, differential sync
iCloud Drive | 850M+ (Apple IDs) | Not disclosed | End-to-end encryption option, seamless Apple ecosystem sync

Requirements & Scale Estimation

Before designing any architecture, a strong system design answer begins with requirements gathering and back-of-the-envelope estimation. This phase establishes the scope of the problem and surfaces the constraints that drive every subsequent decision. Interviewers specifically look for candidates who quantify before they design.

Functional requirements

  • File upload and download — Users can upload files from any device and download them from any other device. Support resumable uploads for large files (multi-GB videos). Target: support individual files up to 50GB.
  • Automatic sync across devices — When a file is created, modified, or deleted on one device, all other devices with access should reflect the change within seconds (when online) or upon reconnection.
  • File versioning — Maintain version history for files so users can revert to previous versions. Store at least 30 days of version history for paying users.
  • Sharing and collaboration — Users can share files and folders with specific people (email) or via links (view-only, edit). Support organizational sharing with inherited permissions.
  • Offline access — Users can mark files for offline access. Edits made offline sync when connectivity is restored, with conflict detection.
  • Folder hierarchy — Support arbitrarily nested folder structures with move, rename, and delete operations that cascade correctly to children.

Non-functional requirements

  • Durability — Files must never be lost. Target: 99.999999999% (eleven 9s) durability — the same standard as AWS S3. This means losing less than 1 file per 100 billion files per year.
  • Availability — The service should be available 99.9% of the time (8.7 hours of downtime per year maximum). Reads should be prioritized over writes during degraded states.
  • Low sync latency — Changes should propagate to online devices within 5 seconds for small files. Large files should begin syncing immediately with progress indication.
  • Bandwidth efficiency — Minimize data transfer. If a user changes 1KB in a 100MB file, do not re-upload the entire file. Delta sync and deduplication are essential.
  • Security — Encryption at rest (AES-256) and in transit (TLS 1.3). Optional end-to-end encryption for sensitive files. Access control enforcement at every layer.

Always Show Your Estimation Math

In a FAANG interview, walking through the estimation math demonstrates quantitative thinking. Do not skip this step. The numbers you derive here will drive your architectural decisions: whether you need sharding, how large your cache should be, and what storage tier strategy to use.

Scale estimation walkthrough for a Dropbox-scale service

1. Users and activity: 500M total users, 100M DAU. Each active user syncs an average of 10 files per day. That is 1 billion file sync operations per day.

2. QPS calculation: 1B file operations / 86,400 seconds = ~11,574 operations/sec average. With a 3x peak multiplier (Monday morning sync storms): ~35,000 operations/sec peak.

3. Storage per day: Average file size is 500KB (weighted across documents, photos, and occasional videos). New unique data per day: 1B files x 500KB = 500TB/day of gross uploads. With deduplication (Dropbox reports ~75% dedup ratio): 125TB/day of net new storage. Annual: 125TB x 365 = ~45PB/year.

4. Metadata size: Each file needs a metadata record: file_id (8B), user_id (8B), filename (256B), path (512B), chunk_list (variable, ~200B avg), timestamps (24B), permissions (64B), version (8B) = ~1KB per file metadata record. For 10 billion total files: 10B x 1KB = 10TB of metadata. This fits in a sharded relational database.

5. Bandwidth: 1B file operations/day x 500KB average = 500TB/day = ~46 Gbps sustained. Peak at 3x: ~138 Gbps. This requires a distributed CDN and multiple data center ingress points.

6. Sync connections: 100M DAU with an average of 2 devices each = 200M concurrent long-poll connections. Each connection consumes a socket and minimal memory (~10KB). Total memory for connections: 200M x 10KB = 2TB. This requires a fleet of lightweight notification servers.

These numbers reveal the architectural requirements immediately. 35K operations/sec means you need distributed processing, not a single server. 45PB/year of storage means object storage with tiered lifecycle policies. 10TB of metadata means a sharded SQL database, not a single instance. 200M concurrent connections means a dedicated notification service, not HTTP polling. Every number maps to an architectural decision.
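The walkthrough above is easy to reproduce in a few lines of Python. This sanity check uses exactly the assumptions stated in the text (100M DAU, 10 syncs/day, 500KB average file, 75% dedup, decimal units to match the prose):

```python
# Sanity-check of the estimation walkthrough above; all inputs are the
# assumptions stated in the text.
DAU = 100_000_000
SYNCS_PER_DAY = 10
AVG_FILE_BYTES = 500_000          # 500KB
DEDUP_FRACTION = 0.75             # share of uploads removed by dedup
PEAK = 3
DEVICES_PER_USER = 2
CONN_BYTES = 10_000               # ~10KB per long-poll connection

ops_per_day = DAU * SYNCS_PER_DAY                              # 1 billion
avg_qps = ops_per_day / 86_400                                 # ~11,574
peak_qps = avg_qps * PEAK                                      # ~35,000
gross_tb_per_day = ops_per_day * AVG_FILE_BYTES / 1e12         # 500 TB
net_tb_per_day = gross_tb_per_day * (1 - DEDUP_FRACTION)       # 125 TB
annual_pb = net_tb_per_day * 365 / 1000                        # ~45 PB
bandwidth_gbps = ops_per_day * AVG_FILE_BYTES * 8 / 86_400 / 1e9  # ~46
conn_memory_tb = DAU * DEVICES_PER_USER * CONN_BYTES / 1e12       # 2 TB

print(f"{avg_qps:,.0f} avg QPS, {peak_qps:,.0f} peak QPS")
print(f"{net_tb_per_day:.0f} TB/day net, {annual_pb:.1f} PB/year")
print(f"{bandwidth_gbps:.0f} Gbps sustained, {conn_memory_tb:.0f} TB for connections")
```

Writing the calculation out like this in an interview makes it trivial to show how a changed assumption (say, 1MB average file size) ripples through every downstream number.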

Do Not Forget Write Amplification

Each user upload triggers multiple downstream writes: chunk storage (1-N chunks), metadata database update, sync notification to all linked devices, version history entry, and potentially a search index update. A single file upload can generate 5-10x write amplification. Factor this into your capacity planning.

File Chunking & Deduplication

File chunking is the foundational technique that makes cloud file storage efficient, resumable, and deduplicated. Instead of treating files as opaque blobs, the system breaks each file into fixed-size or variable-size chunks, typically 4MB each, and stores each chunk independently. This seemingly simple decision enables three critical capabilities: resumable uploads (if a 1GB upload fails at 600MB, you resume from the last successful chunk, not from the beginning), deduplication (if two users upload the same photo, the system stores only one copy of each chunk), and efficient sync (if a user modifies a small portion of a large file, only the changed chunks are re-uploaded).

Dropbox pioneered the use of content-addressable storage for consumer file sync. Each chunk is identified not by a sequential ID or filename, but by the SHA-256 hash of its content. This means the chunk's address IS its content — if two chunks have the same hash, they are guaranteed to contain the same bytes (with astronomically low collision probability: 1 in 2^256). This is called content-addressable storage (CAS), and it is the same principle used by Git for object storage.
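The CAS idea is small enough to sketch directly. This is a minimal in-memory illustration (not Dropbox's implementation, and the `ChunkStore` name is invented for the example): files are split into fixed 4MB chunks, each chunk is keyed by its SHA-256 digest, and identical chunks are stored only once.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB, as discussed above

class ChunkStore:
    """Toy content-addressable store: chunk hash -> chunk bytes."""
    def __init__(self):
        self.chunks: dict[str, bytes] = {}

    def put_file(self, data: bytes) -> list[str]:
        """Split into fixed-size chunks, store each under its SHA-256 hash,
        and return the ordered hash list (the file's 'recipe')."""
        recipe = []
        for off in range(0, len(data), CHUNK_SIZE):
            chunk = data[off:off + CHUNK_SIZE]
            h = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(h, chunk)  # identical chunks stored once
            recipe.append(h)
        return recipe

    def get_file(self, recipe: list[str]) -> bytes:
        return b"".join(self.chunks[h] for h in recipe)

store = ChunkStore()
a = bytes(10_000_000)        # a 10MB file of zero bytes
r1 = store.put_file(a)
r2 = store.put_file(a)       # a second "user" uploads the same file
# 3 chunks in the recipe, but only 2 unique chunks stored:
# the first two 4MB chunks are byte-identical, so they share one entry.
assert r1 == r2 and store.get_file(r1) == a
```

Note that the file is represented purely as an ordered list of hashes; this "recipe" is what lives in the metadata database, while the bytes live in the block store.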

graph TD
    A[Original File<br/>16MB document] --> B[Chunking Engine]
    B --> C1[Chunk 1<br/>4MB<br/>SHA-256: a3f7...]
    B --> C2[Chunk 2<br/>4MB<br/>SHA-256: b8e2...]
    B --> C3[Chunk 3<br/>4MB<br/>SHA-256: c5d1...]
    B --> C4[Chunk 4<br/>4MB<br/>SHA-256: a3f7...]

    C1 --> D{Hash Lookup<br/>in Block Store}
    C2 --> D
    C3 --> D
    C4 --> D

    D -->|New Hash| E[Store Chunk<br/>in Object Storage]
    D -->|Existing Hash| F[Skip Upload<br/>Reference Existing]

    E --> G[Metadata DB:<br/>file_id -> ordered list of chunk hashes]
    F --> G

    style C1 fill:#4ade80,stroke:#166534
    style C4 fill:#4ade80,stroke:#166534
    style F fill:#fbbf24,stroke:#92400e

File chunking and deduplication flow: identical chunks (C1 and C4 share the same hash) are stored only once

Why 4MB Chunks?

The chunk size is a trade-off. Too small (e.g., 64KB) means excessive metadata overhead: a 1GB file would produce 16,384 chunk records. Too large (e.g., 64MB) means poor deduplication: a single byte change in a 64MB chunk forces re-uploading the entire chunk. 4MB is the sweet spot used by Dropbox and many other services: a 1GB file produces 256 chunks (manageable metadata), and changes localized to one section only re-upload that 4MB chunk.

The chunking process works as follows on the client side. When a user saves a file, the desktop client divides the file into 4MB blocks. For each block, it computes the SHA-256 hash. It then sends the list of hashes to the server, asking: "which of these chunks do you already have?" The server checks each hash against its chunk index and responds with the list of missing chunks. The client uploads only the missing chunks. This is called the "has-list" protocol, and it is what makes deduplication work at the upload level rather than just the storage level — the client never wastes bandwidth uploading data the server already has.

Deduplication levels in a file storage system

  • Intra-file deduplication — Within a single file, identical chunks are stored once. Common in files with repeated data patterns or when appending to a file that shares chunks with its previous version.
  • Cross-user deduplication — If User A and User B both upload the same PDF, the system stores only one copy of each unique chunk. Dropbox reported this saves approximately 75% of raw storage — a massive cost reduction at exabyte scale.
  • Version deduplication — When a file is edited, most chunks remain unchanged between versions. Only the modified chunks are stored as new entries. Version history shares unchanged chunks with the current version, making versioning nearly free for small edits.

Security Implications of Cross-User Dedup

Cross-user deduplication can leak information. If the server responds "I already have this chunk" to a client upload, an attacker can infer that someone else has uploaded a file containing that chunk. This is called a "confirmation of file" attack. Mitigations include: encrypting chunks with per-user keys (which prevents cross-user dedup), using convergent encryption (encrypt with a key derived from the content hash — same content produces same ciphertext), or limiting dedup confirmations to trusted enterprise accounts.

Client-side upload protocol with deduplication

1. Client detects file change via filesystem watcher (inotify on Linux, FSEvents on macOS, ReadDirectoryChangesW on Windows).

2. Client splits the file into 4MB chunks and computes SHA-256 hash for each chunk.

3. Client sends the ordered list of chunk hashes to the metadata service: "I have file X, version N+1, consisting of chunks [hash_a, hash_b, hash_c, hash_d]."

4. Metadata service checks the block index and responds: "I need chunks hash_b and hash_d. I already have hash_a and hash_c."

5. Client uploads only the missing chunks (hash_b and hash_d) directly to the block storage service.

6. Block storage confirms receipt. Metadata service updates the file record to point to the new ordered chunk list. The previous version's chunk list is preserved for version history.

7. Metadata service sends a sync notification to all other devices linked to this user (or shared folder) indicating that file X has a new version.

For variable-size chunking (used in more advanced implementations), algorithms like Rabin fingerprinting or FastCDC determine chunk boundaries based on content rather than fixed offsets. This means that inserting 10 bytes at the beginning of a file does not shift all subsequent chunk boundaries, which would invalidate every chunk hash and force re-uploading the entire file. Instead, only the chunk containing the insertion point changes. This is particularly important for delta sync efficiency.
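A gear-hash chunker in the spirit of FastCDC fits in a few lines. This is a simplified sketch, not the published FastCDC algorithm (no normalized chunking, and the size parameters are shrunk to keep the demo fast; a real deployment would use megabyte-scale min/avg/max sizes):

```python
import random

# Gear table: one pseudo-random 64-bit value per possible byte,
# seeded so the chunker is deterministic across runs.
_rng = random.Random(42)
GEAR = [_rng.getrandbits(64) for _ in range(256)]

def cdc_split(data: bytes, min_size=256, max_size=4096, mask=(1 << 10) - 1):
    """Content-defined chunking: declare a boundary wherever the rolling
    gear hash's low bits are all zero, so boundaries follow content,
    not byte offsets. min_size/max_size bound the chunk length."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])   # short tail chunk
    return chunks

payload = random.Random(7).randbytes(50_000)
chunks = cdc_split(payload)
assert b"".join(chunks) == payload    # chunking is lossless
```

Because the shift-left gear hash keeps only the most recent bytes in its low bits, the boundary condition is local to recent content: after an insertion near the start of a file, the chunk boundaries downstream quickly resynchronize with the original, which is exactly the delta-sync property the paragraph describes.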

Metadata Service Design

The metadata service is the brain of the file storage system. It maintains the mapping between logical file paths (what the user sees) and physical storage locations (where chunks live). It tracks folder hierarchies, file versions, sharing permissions, and sync state for every device. While block storage holds the actual data, the metadata service holds the truth about what data exists, who owns it, and how it is organized.

Metadata must be strongly consistent. If a user renames a folder on their laptop, every subsequent access from any device must see the new name — there is no acceptable window where one device sees the old name and another sees the new name. This consistency requirement makes a relational database (PostgreSQL or MySQL) the natural choice for metadata storage, not an eventually consistent NoSQL system. Dropbox uses MySQL (sharded) for its metadata tier; Google Drive uses a proprietary strongly consistent database (Spanner for some layers).

Table | Key Columns | Purpose
users | user_id (PK), email, quota_bytes, plan_tier, created_at | User accounts and storage quotas
files | file_id (PK), user_id (FK), parent_folder_id (FK), filename, size_bytes, content_hash, is_deleted, latest_version | File metadata with soft delete support
file_versions | version_id (PK), file_id (FK), version_number, chunk_list (JSON array of hashes), modified_by, modified_at, size_bytes | Version history — each version stores its ordered chunk list
chunks | chunk_hash (PK), size_bytes, ref_count, storage_tier, created_at | Global chunk index — one row per unique chunk across all users
folders | folder_id (PK), user_id (FK), parent_folder_id (FK, nullable for root), folder_name, created_at | Folder hierarchy — self-referential for nesting
sharing | share_id (PK), resource_id (FK), resource_type (file or folder), shared_with (user_id or email), permission (view, edit, or owner), shared_by, expires_at | Sharing permissions for files and folders
device_sync_state | device_id (PK), user_id (FK), last_sync_cursor, last_seen_at | Per-device sync checkpoint for resumable sync

File Tree as Adjacency List

The folder hierarchy is stored as an adjacency list: each row in the folders table has a parent_folder_id pointing to its parent. This makes single-level listing fast (SELECT * FROM folders WHERE parent_folder_id = X) but recursive operations (move a deeply nested tree, calculate total folder size) require recursive queries or application-level traversal. For deep hierarchies, consider a materialized path column (e.g., "/root/docs/work/2024/") that enables prefix-based queries.

Versioning is implemented through the file_versions table. Every time a file is modified, a new row is inserted with the updated chunk_list and an incremented version_number. The files table always points to the latest version via latest_version. This append-only versioning pattern means reverting to a previous version is simply updating the latest_version pointer — no data needs to be copied or moved.
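The append-only pattern is easy to demonstrate with sqlite3 (schema trimmed to the relevant columns; a real deployment would be the sharded MySQL/PostgreSQL tier described above):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE files (file_id INTEGER PRIMARY KEY, filename TEXT,
                    latest_version INTEGER);
CREATE TABLE file_versions (
    version_id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_id INTEGER, version_number INTEGER, chunk_list TEXT);
""")

def save_version(file_id: int, chunks: list[str]) -> int:
    """Insert a new immutable version row and advance the latest pointer."""
    with db:  # transaction: both writes commit together or not at all
        n = db.execute("SELECT COALESCE(MAX(version_number), 0) + 1 "
                       "FROM file_versions WHERE file_id = ?",
                       (file_id,)).fetchone()[0]
        db.execute("INSERT INTO file_versions (file_id, version_number, "
                   "chunk_list) VALUES (?, ?, ?)",
                   (file_id, n, json.dumps(chunks)))
        db.execute("UPDATE files SET latest_version = ? WHERE file_id = ?",
                   (n, file_id))
    return n

def revert(file_id: int, version_number: int) -> None:
    """Reverting just moves the pointer; no chunk data is copied."""
    with db:
        db.execute("UPDATE files SET latest_version = ? WHERE file_id = ?",
                   (version_number, file_id))

db.execute("INSERT INTO files VALUES (1, 'report.docx', 0)")
save_version(1, ["a3f7", "b8e2"])
save_version(1, ["a3f7", "c5d1"])   # one chunk changed; 'a3f7' is shared
revert(1, 1)
assert db.execute("SELECT latest_version FROM files "
                  "WHERE file_id = 1").fetchone()[0] == 1
```

Note that version 2's row still exists after the revert: history is never destroyed, only the pointer moves, which is what makes undo and audit trails cheap.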

erDiagram
    USERS ||--o{ FILES : owns
    USERS ||--o{ FOLDERS : owns
    FOLDERS ||--o{ FOLDERS : contains
    FOLDERS ||--o{ FILES : contains
    FILES ||--o{ FILE_VERSIONS : has
    FILE_VERSIONS }o--o{ CHUNKS : references
    FILES ||--o{ SHARING : shared_via
    FOLDERS ||--o{ SHARING : shared_via
    USERS ||--o{ DEVICE_SYNC_STATE : has

    USERS {
        bigint user_id PK
        string email
        bigint quota_bytes
        string plan_tier
    }
    FILES {
        bigint file_id PK
        bigint user_id FK
        bigint parent_folder_id FK
        string filename
        bigint size_bytes
        string content_hash
        boolean is_deleted
        int latest_version
    }
    FILE_VERSIONS {
        bigint version_id PK
        bigint file_id FK
        int version_number
        json chunk_list
        bigint modified_by
        timestamp modified_at
    }
    CHUNKS {
        string chunk_hash PK
        int size_bytes
        int ref_count
        string storage_tier
    }
    FOLDERS {
        bigint folder_id PK
        bigint user_id FK
        bigint parent_folder_id FK
        string folder_name
    }

Entity-relationship diagram for the metadata service showing the core tables and their relationships

Consistency requirements for the metadata service

  • File operations must be atomic — Creating a file involves inserting into files, file_versions, and potentially chunks tables. These must be wrapped in a database transaction. A partial write (file record exists but version record does not) would leave the system in an inconsistent state.
  • Folder moves must be serialized — Moving a folder with 10,000 children requires updating parent_folder_id for all descendants. Concurrent moves of overlapping subtrees can create cycles (folder A moved into folder B while folder B is moved into folder A). Use advisory locks or optimistic concurrency with version checks.
  • Chunk reference counting must be accurate — When a file version is deleted and its chunks are no longer referenced by any other file, the ref_count must be decremented. If ref_count reaches zero, the chunk is eligible for garbage collection. Inaccurate ref counts lead to either orphaned chunks (wasted storage) or premature deletion (data loss).
  • Sync cursors must be monotonically increasing — Each device tracks its sync position with a cursor (a logical timestamp or sequence number). The cursor must never go backward, even during metadata database failovers, or the device will miss changes and drift out of sync.
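An append-only change journal with a monotonic sequence number is the usual way to implement the cursor described above. The sketch below is illustrative (class and method names are invented for the example):

```python
import itertools

class ChangeJournal:
    """Append-only journal of metadata changes; a device's cursor is the
    sequence number of the last change it has applied."""
    def __init__(self):
        self._seq = itertools.count(1)
        self.entries: list[tuple[int, str]] = []  # (seq, change), ascending

    def record(self, change: str) -> int:
        seq = next(self._seq)
        self.entries.append((seq, change))
        return seq

    def changes_since(self, cursor: int):
        """Return everything the device has not yet seen, plus the new
        cursor. The cursor only ever moves forward."""
        new = [(s, c) for s, c in self.entries if s > cursor]
        new_cursor = new[-1][0] if new else cursor
        return new, new_cursor

journal = ChangeJournal()
journal.record("file 42: new version 7")
journal.record("folder 9: renamed")

cursor = 0                                     # a fresh device starts at 0
changes, cursor = journal.changes_since(cursor)
assert len(changes) == 2 and cursor == 2
changes, cursor = journal.changes_since(cursor)
assert changes == [] and cursor == 2           # idempotent, never backward
```

In production the journal is a durable per-shard table (or Kafka topic) and the cursor is persisted in device_sync_state, but the contract is the same: pulling with an old cursor is always safe, and replaying changes is idempotent.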

Metadata Is the Bottleneck, Not Storage

At scale, the metadata service is almost always the bottleneck, not the block storage. Uploading a 1GB file requires 1 metadata transaction but 256 chunk uploads that can be parallelized across many storage nodes. Listing a folder with 10,000 files requires a metadata query that scans 10,000 rows. Shard the metadata database by user_id early, and ensure the most common queries (file listing, sync cursor reads) hit indexes.

Sharding the metadata database by user_id is the standard approach. All of a user's files, folders, versions, and sync state live on the same shard, which means most operations are single-shard transactions (fast and consistent). The exception is sharing: when User A shares a folder with User B, the sharing record may live on a different shard. This cross-shard reference is typically handled by storing the share record on both shards or by using a dedicated sharing service with its own database.
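Routing by user_id keeps all of a user's metadata on one shard, and a deterministic hash keeps the mapping stable across application servers. A minimal sketch (shard count and hash choice are illustrative):

```python
import hashlib

NUM_SHARDS = 64  # logical shards; many are reserved up front

def shard_for_user(user_id: int) -> int:
    """Stable shard assignment: hashing the id spreads sequential ids
    evenly. Changing NUM_SHARDS would remap users, which is why real
    systems allocate many logical shards early and rebalance by moving
    whole shards between database hosts."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Every query for one user deterministically hits the same shard:
assert shard_for_user(12345) == shard_for_user(12345)

# And the distribution across shards is roughly uniform
# (expected ~1562 users per shard for 100,000 users over 64 shards):
counts = [0] * NUM_SHARDS
for uid in range(100_000):
    counts[shard_for_user(uid)] += 1
```

Single-shard transactions cover file listing, versioning, and sync cursor reads; only cross-user sharing escapes this scheme, which is why it often gets its own service, as noted above.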

Block Storage & Object Storage Architecture

The storage architecture of a file sync service is split into two fundamentally different planes: the metadata plane and the data plane. The metadata plane (the SQL database described in the previous section) stores lightweight records about files, folders, and permissions. The data plane stores the actual file content — the raw bytes — in an object storage system. This separation is one of the most important architectural decisions in the entire system because it allows each plane to scale independently, use different storage technologies, and have different consistency and durability characteristics.

The data plane stores chunks (the 4MB blocks from the chunking step) in an object storage system. For most teams, this means Amazon S3, Google Cloud Storage, or Azure Blob Storage. These services provide eleven 9s of durability (99.999999999%) by automatically replicating data across multiple availability zones. Dropbox originally used S3 but eventually built their own storage system called Magic Pocket when the cost and performance requirements exceeded what S3 could offer at their scale. For an interview, S3 is the right default answer unless the interviewer specifically asks about building custom storage.

graph TB
    subgraph Client["Client Devices"]
        L[Laptop]
        M[Mobile]
        W[Web Browser]
    end

    subgraph API["API Gateway Layer"]
        AG[API Gateway / Load Balancer]
    end

    subgraph MetadataPlane["Metadata Plane"]
        MS[Metadata Service]
        MDB[(MySQL Sharded<br/>Files, Folders, Versions<br/>Permissions, Sync State)]
        MC[Metadata Cache<br/>Redis Cluster]
    end

    subgraph DataPlane["Data Plane / Block Storage"]
        BS[Block Storage Service]
        OS[(Object Storage<br/>S3 / GCS / Magic Pocket)]
        CI[Chunk Index<br/>Hash → Storage Location]
    end

    subgraph SyncPlane["Sync & Notification"]
        NS[Notification Service<br/>Long Poll / WebSocket]
        SQ[Sync Queue<br/>Kafka]
    end

    L --> AG
    M --> AG
    W --> AG
    AG --> MS
    AG --> BS
    AG --> NS
    MS --> MDB
    MS --> MC
    MS --> SQ
    BS --> OS
    BS --> CI
    SQ --> NS

    style MetadataPlane fill:#dbeafe,stroke:#1e40af
    style DataPlane fill:#dcfce7,stroke:#166534
    style SyncPlane fill:#fef3c7,stroke:#92400e

High-level architecture showing the separation of metadata plane (SQL), data plane (object storage), and sync plane (notification service)

The Two-Plane Rule

Always separate metadata from data in storage system design. Metadata is small, relational, and requires strong consistency (SQL). Data is large, opaque, and requires high durability and throughput (object storage). Mixing them — for example, storing file content as BLOBs in the SQL database — creates a system that is bad at both: too slow for metadata queries (table scans over huge BLOBs) and too expensive for bulk storage (SQL storage costs 10-50x more than object storage per GB).

The block storage service provides a simple API: PUT(chunk_hash, chunk_data) and GET(chunk_hash). When a client uploads a chunk, the block storage service writes it to object storage using the chunk hash as the key. When a client downloads a file, the metadata service returns the ordered list of chunk hashes, and the client requests each chunk from the block storage service. In practice, the block storage service often generates pre-signed URLs that allow clients to upload and download directly from S3/GCS, bypassing the block service for data transfer and reducing load on the application tier.

Key design decisions for the data plane

  • Pre-signed URLs for direct transfer — Generate time-limited, pre-signed S3/GCS URLs so clients upload chunks directly to object storage without routing data through the application servers. This cuts application-server bandwidth requirements by roughly two orders of magnitude and removes a major bottleneck.
  • Chunk naming convention — Chunks are stored with their SHA-256 hash as the object key: s3://chunks-bucket/{first-2-hex}/{next-2-hex}/{full-hash}. The prefix structure distributes chunks across S3 partitions for even load distribution.
  • Storage class selection — Frequently accessed chunks go to S3 Standard. Chunks that have not been accessed in 90 days transition to S3 Infrequent Access (50% cheaper). Version history chunks older than 1 year move to S3 Glacier (90% cheaper). Lifecycle policies automate this tiering.
  • Replication and durability — S3 automatically replicates across 3+ availability zones within a region, providing eleven 9s of durability. For cross-region protection against regional disasters, enable S3 Cross-Region Replication to a secondary region.
  • Encryption at rest — All chunks are encrypted using AES-256 with server-side encryption (SSE-S3 or SSE-KMS). For enterprise customers requiring end-to-end encryption, chunks are encrypted on the client before upload using customer-managed keys.
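The chunk naming convention above is a pure function of the chunk's content. A minimal sketch (the bucket name `chunks-bucket` is taken from the example above; in practice the block storage service would pair this key with a pre-signed upload URL):

```python
import hashlib

def chunk_object_key(chunk: bytes, bucket: str = "chunks-bucket") -> str:
    """Derive the content-addressed object key for a chunk.

    Layout from the naming convention above:
      s3://{bucket}/{first-2-hex}/{next-2-hex}/{full-hash}
    The two 2-hex prefixes fan chunks out across S3 key prefixes
    for even load distribution.
    """
    h = hashlib.sha256(chunk).hexdigest()
    return f"s3://{bucket}/{h[:2]}/{h[2:4]}/{h}"

key = chunk_object_key(b"example 4MB chunk payload")
```

Because the key is derived from the hash, two users uploading identical chunks produce identical keys, which is exactly what makes cross-user deduplication a simple existence check.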

Why Dropbox Built Magic Pocket

At exabyte scale, even small inefficiencies in S3 pricing compound into hundreds of millions of dollars annually. Dropbox built Magic Pocket to optimize for their specific access patterns: write-once-read-many chunks with high deduplication ratios. Magic Pocket achieves better storage density through custom erasure coding, reduces latency by co-locating metadata and data, and eliminates the per-request API costs of S3. For most companies, S3 is the right choice. Only at Dropbox/Google/Microsoft scale does custom storage make economic sense.

The upload flow ties the metadata and data planes together. The client sends the chunk hash list to the metadata service, which identifies missing chunks. The metadata service generates pre-signed upload URLs for each missing chunk. The client uploads chunks in parallel directly to object storage. Once all chunks are confirmed, the metadata service commits the new file version. This two-phase approach (upload data, then commit metadata) ensures that incomplete uploads do not leave the system in an inconsistent state — if the client disconnects mid-upload, the uncommitted chunks are eventually garbage-collected.
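The two-phase flow can be modeled in a few lines. This is a toy in-memory sketch (class and method names are illustrative, not a real API): phase one is the dedup check, phase two is the metadata commit, which refuses to proceed while any referenced chunk is missing.

```python
import hashlib

class BlockServer:
    """Toy model of the two-phase upload: store chunks by hash,
    commit a file version only once every referenced chunk is present."""

    def __init__(self):
        self.chunks = {}   # hash -> bytes (stand-in for object storage)
        self.files = {}    # path -> ordered chunk-hash list (metadata plane)

    def missing_chunks(self, hashes):
        # Phase 1: client sends hashes, server replies with the gaps.
        return [h for h in hashes if h not in self.chunks]

    def put_chunk(self, data):
        self.chunks[hashlib.sha256(data).hexdigest()] = data

    def commit(self, path, hashes):
        # Phase 2: atomic metadata commit, refused while chunks are missing.
        if self.missing_chunks(hashes):
            raise ValueError("upload incomplete; retry missing chunks")
        self.files[path] = list(hashes)

server = BlockServer()
chunks = [b"chunk-a", b"chunk-b"]
hashes = [hashlib.sha256(c).hexdigest() for c in chunks]

to_upload = server.missing_chunks(hashes)   # both missing on first sync
for c in chunks:
    server.put_chunk(c)
server.commit("report.docx", hashes)

# A second device with identical content has nothing to upload (dedup).
assert server.missing_chunks(hashes) == []
```

Note the failure mode the design avoids: if the client disconnects after `put_chunk` but before `commit`, the metadata plane never references the orphaned chunks, and garbage collection reclaims them later.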

| Aspect | Metadata Plane | Data Plane |
| --- | --- | --- |
| Storage technology | MySQL / PostgreSQL (sharded) | S3 / GCS / custom object store |
| Data size | ~1KB per file record | ~4MB per chunk |
| Consistency model | Strong consistency (ACID transactions) | Strong read-after-write (S3, since December 2020; historically eventual) |
| Scaling approach | Shard by user_id | Partition by chunk hash prefix |
| Cost per GB | $0.10 - $0.50/GB/month (RDS) | $0.023/GB/month (S3 Standard) |
| Bottleneck risk | Query latency under high metadata load | Network bandwidth for bulk transfers |

Sync Protocol — How Changes Propagate

The sync protocol is the heartbeat of a file storage service. It determines how quickly changes made on one device appear on all other devices. The core challenge is achieving near-real-time sync (under 5 seconds for online devices) while handling millions of concurrent connected clients, intermittent connectivity, and the possibility that multiple devices may modify the same file simultaneously.

The industry-standard approach for real-time sync notification is long polling (or WebSockets for web clients). Each client opens a persistent connection to the notification service and waits for the server to push change events. When a file is modified, the metadata service enqueues a notification, and the notification service delivers it to all connected devices that have access to the changed file. The client then fetches the updated metadata and downloads any new or modified chunks.

sequenceDiagram
    participant D1 as Device 1 (Editor)
    participant API as API / Metadata Service
    participant BS as Block Storage (S3)
    participant NS as Notification Service
    participant D2 as Device 2 (Syncer)

    Note over D2,NS: Device 2 has an open long-poll connection

    D1->>API: 1. Upload chunk hashes for modified file
    API-->>D1: 2. Return list of missing chunks + pre-signed URLs
    D1->>BS: 3. Upload missing chunks directly to S3
    BS-->>D1: 4. Confirm chunk upload
    D1->>API: 5. Commit new file version (atomic metadata update)
    API->>API: 6. Update files table, insert file_versions row
    API->>NS: 7. Enqueue sync notification for all devices
    NS-->>D2: 8. Push notification: "file X updated, new cursor: 42"
    D2->>API: 9. Fetch changes since cursor 41
    API-->>D2: 10. Return changed file metadata + chunk list
    D2->>BS: 11. Download new/modified chunks
    D2->>D2: 12. Reassemble file from chunks, update local state
    D2->>API: 13. Acknowledge sync, advance cursor to 42

Complete sync protocol sequence: from file edit on Device 1 to sync completion on Device 2

Long Polling vs WebSockets vs Server-Sent Events

Long polling: client sends an HTTP request, server holds it open until there is a change (or timeout after 60 seconds), then responds. Client immediately opens a new long-poll. Simple, works through all proxies and firewalls, but wastes some bandwidth on reconnection. WebSockets: persistent bidirectional connection. Lower overhead per message, but harder to load-balance and some corporate firewalls block them. Server-Sent Events (SSE): server pushes events over a persistent HTTP connection. Simpler than WebSockets but unidirectional. Dropbox uses long polling for desktop clients. Google Drive uses a mix of long polling and WebSockets.

The sync cursor is the critical concept that makes incremental sync efficient. Each device maintains a cursor — a monotonically increasing number (or timestamp) that represents the last change it has processed. When a device reconnects after being offline, it sends its cursor to the metadata service: "Give me all changes since cursor 37." The metadata service returns all change events (file created, file modified, file deleted, file moved) with cursor values greater than 37. The device processes these changes in order and advances its cursor. This is identical to the offset-based consumption pattern used in Kafka and other event streaming systems.
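The cursor pattern described above can be sketched as a small in-memory change feed (a stand-in for the metadata service's change table; names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class ChangeLog:
    """Minimal cursor-based change feed: 'give me all changes since N'."""
    events: list = field(default_factory=list)
    cursor: int = 0

    def record(self, event: str) -> int:
        # Each committed change gets the next monotonically increasing cursor.
        self.cursor += 1
        self.events.append((self.cursor, event))
        return self.cursor

    def changes_since(self, cursor: int):
        # A reconnecting device replays only events newer than its cursor.
        return [(c, e) for c, e in self.events if c > cursor]

log = ChangeLog()
log.record("file created: /docs/a.txt")
log.record("file modified: /docs/a.txt")
log.record("file moved: /docs/a.txt -> /docs/b.txt")

# A device that last synced at cursor 1 fetches only the two newer events.
pending = log.changes_since(1)
```

This is the same contract as consuming a Kafka partition from a saved offset: the server keeps an ordered log, and each consumer tracks its own position.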

Sync notification architecture

  • Notification service cluster — A fleet of lightweight servers (Go or Rust for low memory per connection) that maintain open long-poll connections with all online clients. Each server handles 100K-500K concurrent connections. The cluster is horizontally scalable — add more servers to handle more connections.
  • Change event queue (Kafka) — When the metadata service commits a file change, it publishes an event to a Kafka topic partitioned by user_id. The notification service consumes from Kafka and routes events to the correct connected clients. Kafka provides ordering guarantees per partition and durability.
  • Device routing table — The notification service maintains an in-memory mapping of device_id to connection. When a change event arrives for user_id=123, the service looks up all devices for that user and pushes the notification to each. If a device is offline, the event is not lost — the device will catch up via cursor-based sync on reconnection.
  • Batching and debouncing — If a user saves a document 10 times in 30 seconds (e.g., auto-save), the notification service batches these into a single sync event rather than pushing 10 separate notifications. This reduces device battery consumption and network usage.
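The batching-and-debouncing idea reduces to coalescing a burst of events down to the latest event per file. A minimal sketch (real implementations key the window on time, e.g. "within 30 seconds"; here the window is simply the batch passed in):

```python
def coalesce(events):
    """Collapse bursts of events on the same file into one notification,
    keeping only the latest event per file. A stand-in for the
    time-windowed debouncing described above."""
    latest = {}
    for path, event in events:
        latest[path] = event   # later events overwrite earlier ones
    return list(latest.items())

# 10 auto-saves of one document plus one edit to another file...
burst = [("report.docx", f"auto-save #{i}") for i in range(10)]
burst.append(("notes.txt", "modified"))

# ...become just two push notifications.
notifications = coalesce(burst)
```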

Sync State Machine on the Client

The client maintains a local database (SQLite) with the state of every file: synced, modified-locally, modified-remotely, conflicting. A background sync engine continuously processes the sync queue: upload local changes, download remote changes, detect conflicts. The sync engine runs independently from the user-facing file system — users interact with their files normally while sync happens in the background.

Handling offline-to-online transitions is particularly important. When a device comes online after an extended offline period, it may have hundreds of pending local changes and hundreds of remote changes to process. The sync engine must process these efficiently: upload all local changes first (to avoid conflicts where the server has newer data), then download remote changes, then run conflict detection on any files that were modified both locally and remotely during the offline period. The order matters — uploading first establishes the client's changes on the server, giving them a timestamp, which simplifies conflict resolution.

Offline-to-online sync recovery process

1. Device connects to the notification service and sends its last-known sync cursor.
2. Notification service responds with the total number of pending remote changes (e.g., "4,217 changes since your last sync").
3. Client begins uploading all locally-modified files in parallel (prioritizing files the user recently accessed).
4. For each uploaded file, the metadata service checks for conflicts: if the file was also modified remotely since the client's last sync, it flags a conflict.
5. After all local changes are uploaded, the client downloads remote changes in cursor order, applying them to the local filesystem.
6. Conflicting files are handled per the conflict resolution strategy (next section): either auto-merged, duplicated with conflict markers, or presented to the user.


At Dropbox scale with 200 million concurrent connections, the notification service is a separate infrastructure tier with its own scaling characteristics. It is designed to be extremely lightweight — each connection consumes minimal memory (just a socket and a cursor value). The actual data transfer happens directly between the client and block storage via pre-signed URLs, so the notification service never handles file content, only small event messages.

Conflict Resolution Strategies

Conflict resolution is the hardest problem in file synchronization. It occurs when two or more devices modify the same file concurrently — that is, both devices make changes between sync cycles, and when they sync, the system must decide what to do with both sets of changes. There is no universally correct answer; every resolution strategy involves trade-offs between simplicity, correctness, and user experience.

The fundamental challenge is that a distributed file system is a replicated state machine, and concurrent modifications create divergent replicas. The CAP theorem tells us we cannot have both strong consistency and availability during network partitions (which include devices being offline). File sync services choose availability — they allow offline edits — which means they must handle the resulting conflicts when partitions heal.

| Strategy | How It Works | Pros | Cons | Used By |
| --- | --- | --- | --- | --- |
| Last-Writer-Wins (LWW) | Compare timestamps; most recent edit overwrites the other | Simple to implement, no user intervention | Silently loses the older edit; clock skew can cause wrong winner | OneDrive (default), many S3-based systems |
| Duplicate with conflict marker | Keep both versions as separate files with a conflict suffix | No data loss; user decides which version to keep | Clutters filesystem; users may ignore conflict files indefinitely | Dropbox (primary strategy) |
| Three-way merge | Compare both versions against the common ancestor to auto-merge non-overlapping changes | Best user experience when it works; handles independent edits well | Complex to implement; not possible for binary files; may still produce conflicts for overlapping edits | Google Docs (for structured documents) |
| Version vectors / vector clocks | Track a logical clock per device; detect concurrent modifications without relying on wall-clock time | Correctly identifies concurrency without clock synchronization | Additional metadata overhead; does not resolve conflicts, only detects them | Detection layer in many systems, combined with another strategy for resolution |
| Operational Transform (OT) | Transform concurrent operations so they can be applied in any order and converge to the same result | Real-time collaboration without conflicts; the gold standard for text editing | Extremely complex to implement correctly; limited to structured data types | Google Docs, Google Wave (historical) |
| CRDT (Conflict-free Replicated Data Types) | Data structures mathematically guaranteed to converge when merged | No coordination needed; always converges; works offline | Limited data types; complex implementation; space overhead | Apple Notes, Figma, some collaborative tools |

Dropbox's Pragmatic Approach

Dropbox uses the "conflicted copy" strategy because it prioritizes zero data loss above all else. When Device A and Device B both modify report.docx between syncs, Device A's version becomes the primary (it synced first), and Device B's version is saved as "report (User B's conflicted copy 2024-01-15).docx". This is not elegant, but it has a critical property: no user data is ever silently overwritten. The user can then manually review and merge the two versions. For a consumer file sync product, this pragmatism is the right call.
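Generating the conflicted-copy filename is itself a one-liner. A sketch (the exact format string is illustrative, modeled on the example above, not Dropbox's actual naming code):

```python
import os

def conflicted_copy_name(path: str, device_owner: str, date: str) -> str:
    """Name a conflicted copy the way the example above describes:
    'report.docx' -> "report (User B's conflicted copy 2024-01-15).docx".
    Format is illustrative."""
    stem, ext = os.path.splitext(path)
    return f"{stem} ({device_owner}'s conflicted copy {date}){ext}"

name = conflicted_copy_name("report.docx", "User B", "2024-01-15")
```

Keeping the original extension matters: the conflicted copy must still open in the same application so the user can actually compare the two versions.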

Conflict detection requires knowing whether two edits happened concurrently (neither device saw the other's edit) versus sequentially (Device B saw Device A's edit and then made its own change). The standard technique is version vectors: each device increments its own counter in a vector (a map of device_id to version_number) whenever it makes an edit. When syncing, the server compares the vectors. If one vector dominates the other (every entry is greater-than-or-equal), the dominant version wins — it is a sequential edit. If neither vector dominates (Device A has a higher value for its own entry, Device B has a higher value for its own entry), the edits are concurrent and a conflict exists.

graph TD
    A["File: report.docx<br/>Version: {D1:3, D2:2}"] --> B{"New edit received"}

    B -->|"From D1: {D1:4, D2:2}"| C{"Compare vectors"}
    B -->|"From D2: {D1:3, D2:3}"| D{"Compare vectors"}

    C -->|"D1:4 > D1:3 and D2:2 = D2:2<br/>D1 dominates"| E["Sequential update<br/>Accept D1's version<br/>No conflict"]

    D -->|"D2:3 > D2:2 and D1:3 = D1:3<br/>D2 dominates"| F["Sequential update<br/>Accept D2's version<br/>No conflict"]

    B -->|"Both arrive:<br/>D1: {D1:4, D2:2}<br/>D2: {D1:3, D2:3}"| G{"Compare vectors"}
    G -->|"D1:4 > D1:3 BUT D2:2 < D2:3<br/>Neither dominates"| H["CONCURRENT edits<br/>CONFLICT detected!"]

    H --> I["Resolution Strategy"]
    I --> J["LWW: Pick by timestamp"]
    I --> K["Dropbox: Create conflicted copy"]
    I --> L["Merge: 3-way diff against ancestor"]

    style H fill:#fca5a5,stroke:#991b1b
    style E fill:#bbf7d0,stroke:#166534
    style F fill:#bbf7d0,stroke:#166534

Conflict detection using version vectors: sequential edits are safe, concurrent edits trigger the resolution strategy
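The dominance comparison illustrated above is a short function. A sketch: one vector dominates if every entry is greater than or equal to the other's (missing entries count as zero); if neither dominates, the edits were concurrent.

```python
def compare(a: dict, b: dict) -> str:
    """Compare two version vectors (device_id -> counter).
    Returns 'a' or 'b' when one dominates (sequential edit),
    'equal' when identical, 'concurrent' when neither dominates."""
    devices = set(a) | set(b)
    a_ge = all(a.get(d, 0) >= b.get(d, 0) for d in devices)
    b_ge = all(b.get(d, 0) >= a.get(d, 0) for d in devices)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a"            # a dominates: a is a sequential successor
    if b_ge:
        return "b"
    return "concurrent"       # neither dominates: conflict detected

# The three cases from the diagram above:
assert compare({"D1": 4, "D2": 2}, {"D1": 3, "D2": 2}) == "a"
assert compare({"D1": 3, "D2": 2}, {"D1": 3, "D2": 3}) == "b"
assert compare({"D1": 4, "D2": 2}, {"D1": 3, "D2": 3}) == "concurrent"
```

Note that `compare` only detects concurrency; as the table above says, resolution (LWW, conflicted copy, or merge) is a separate policy layered on top.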

Choosing the right conflict resolution strategy

  • Consumer file sync (Dropbox, OneDrive) — Use conflicted copies or LWW. Users expect simplicity and data safety. The cost of a lost edit (LWW) or a duplicate file (Dropbox) is low compared to the complexity of auto-merge for arbitrary binary files like photos, PDFs, and zip archives.
  • Real-time collaborative editing (Google Docs) — Use OT or CRDTs. These structured data types (text documents, spreadsheets) can be decomposed into operations (insert character at position X, delete range Y) that can be transformed and merged automatically. The user experience is seamless — both users see each other's changes in real time.
  • Enterprise document management — Use pessimistic locking (check-out/check-in). When a user opens a document for editing, it is locked; no one else can edit until the lock is released. This eliminates conflicts entirely but blocks collaboration. Used by SharePoint and traditional document management systems.
  • Code repositories (Git) — Use three-way merge with manual conflict resolution for overlapping changes. Developers are accustomed to resolving merge conflicts. The structured nature of code (line-based text) makes three-way merge practical.

Binary Files Cannot Be Auto-Merged

Three-way merge and OT work for structured text, but most files on a file sync service are binary: photos (JPG, PNG), PDFs, compiled documents (DOCX), videos, and archives. There is no meaningful way to auto-merge two versions of a JPEG. For binary files, the only practical options are LWW (lose one version) or conflicted copy (keep both). This is why Dropbox's conflicted copy approach is the pragmatic default — it handles all file types safely.

For the interview, the recommended approach is: use version vectors for conflict detection, use conflicted copies as the default resolution for binary files, and offer optional three-way merge for text files when the client application supports it. Mention that real-time collaboration (Google Docs-style) requires a fundamentally different architecture (OT/CRDT with a collaboration server) that is separate from the file sync system.

Sharing & Permissions

Sharing and permissions transform a personal file storage service into a collaboration platform. At its core, the sharing system must answer one question with sub-millisecond latency: "Does user X have permission Y on resource Z?" This question is asked on every API call — every file download, every folder listing, every sync operation — so it must be extremely fast. At Dropbox scale with 35,000 file operations per second, the permission check is on the hot path of every request.

The standard approach is an Access Control List (ACL) model. Each file and folder has an ACL — a list of (principal, permission) pairs that specify who can do what. Principals can be individual users, groups, or "anyone with the link." Permissions are typically: owner (full control, can delete and share), editor (can read and write), viewer (read-only), and commenter (can read and add comments but not modify).

Sharing mechanisms

  • Direct sharing (email) — User A shares a file with User B by email address. The system creates a sharing record linking the resource to User B's account with a specified permission level. User B sees the shared file in their "Shared with me" view.
  • Link sharing — User A generates a shareable link for a file or folder. Anyone with the link can access the resource at the specified permission level (view or edit). Links can be password-protected, set to expire, or restricted to specific email domains.
  • Team/organization sharing — Enterprise accounts define teams and departments. Sharing with a team grants access to all current and future members. This requires a group membership service that the permission system queries.
  • Public sharing — The resource is accessible to anyone on the internet. Used for published documents, shared portfolios, and public downloads. Requires rate limiting and abuse prevention.

Permission Inheritance in Folder Trees

When a folder is shared with User B as editor, all files and subfolders within that folder inherit the same permission. This inheritance must cascade correctly through arbitrarily deep folder hierarchies. Moving a file from a shared folder to a private folder should revoke the inherited permission. Moving a file into a shared folder should grant the inherited permission. This tree-based inheritance is the source of many permission bugs and is a favorite topic in system design interviews.

Permission inheritance creates a performance challenge. To check if User B can access a deeply nested file, the system must walk up the folder tree checking permissions at each level: does User B have direct permission on this file? If not, check the parent folder. If not, check the grandparent. Continue until the root. This recursive lookup is expensive for deep hierarchies. The solution is to materialize effective permissions: when a sharing record is created on a folder, eagerly compute and cache the effective permissions for all descendants. This trades write-time computation for read-time speed.
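The recursive tree walk is the baseline every other approach optimizes. A minimal sketch with dictionaries standing in for the folders and sharing tables (the data shapes are illustrative):

```python
def effective_permission(node, user, parents, acls):
    """Walk up the folder tree until an ACL entry for `user` is found.

    parents: node -> parent node (root maps to None)
    acls:    node -> {user: permission}
    O(depth) per check, which is the cost materialization avoids.
    """
    while node is not None:
        perm = acls.get(node, {}).get(user)
        if perm is not None:
            return perm            # direct or inherited grant found
        node = parents.get(node)   # no entry here; try the parent
    return None                    # walked to the root: deny

parents = {"/a/b/file.txt": "/a/b", "/a/b": "/a", "/a": None}
acls = {"/a": {"bob": "editor"}}   # bob was granted editor on the top folder

perm = effective_permission("/a/b/file.txt", "bob", parents, acls)
```

Materializing effective permissions amounts to running this walk eagerly at share time for every descendant and caching the results, turning the O(depth) read into an O(1) lookup.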

Permission ModelHow It WorksPerformance Characteristic
Recursive tree walkCheck each ancestor folder for matching ACL entry at query timeO(depth) per access check; slow for deep hierarchies; no caching benefit
Materialized permissionsPre-compute and store effective permissions for every file-user pairO(1) per access check; expensive writes when sharing a folder with many children
Hybrid with cachingMaterialize permissions on first access, cache in Redis with TTLFast reads after first access; eventual consistency (permission changes take seconds to propagate)
Google Zanzibar modelGraph-based ACL with relationship tuples and recursive expansionHandles complex relationships (teams, nested groups); highly scalable with caching at each level

Google Zanzibar: The State of the Art

Google built Zanzibar as a unified authorization system that handles permissions for Google Drive, YouTube, Google Cloud, and other services. It models permissions as a graph of relationship tuples (user:alice, viewer, document:123) and evaluates access by traversing the graph. Zanzibar handles millions of authorization checks per second with sub-10ms latency using aggressive caching and denormalization. The open-source implementations (SpiceDB, OpenFGA, Ory Keto) bring this pattern to non-Google teams.

Link sharing introduces additional complexity. A share link is essentially a bearer token — anyone who possesses the link can access the resource. This means link security depends on link secrecy. Best practices include: generating cryptographically random link tokens (not sequential IDs), supporting password protection, expiration dates, and download-count limits, logging all link accesses for audit trails, and allowing the owner to revoke a link at any time (which invalidates the token immediately).
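Generating the token is the easy part, as long as it comes from a cryptographic source. A sketch using Python's standard `secrets` module (the URL is a placeholder; 24 random bytes gives roughly 192 bits of entropy):

```python
import secrets

def make_share_link(base_url: str = "https://example.com/s/") -> str:
    """Build a share link whose path is a cryptographically random
    bearer token, never a sequential or guessable ID."""
    return base_url + secrets.token_urlsafe(24)

link = make_share_link()
```

Revocation then works by deleting the token's row in the sharing table; because the token carries no embedded claims, the server-side lookup is the single source of truth and invalidation is immediate.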

Permission check flow for a file access request

1. Request arrives: GET /files/{file_id}. Extract user_id from the authentication token.
2. Check the permissions cache (Redis): key = "perm:{user_id}:{file_id}". If a cached result exists and has not expired, return it immediately (cache hit path: < 1ms).
3. Cache miss: query the sharing table for a direct permission on this file for this user. If found, cache the result and return.
4. No direct permission: look up the file's parent_folder_id and recursively check the sharing table for each ancestor folder.
5. If an inherited permission is found, cache the effective permission for the original file and return.
6. If no permission is found after walking to the root, deny access with a 403 Forbidden response.


For the interview, emphasize that the sharing and permission system is a cross-cutting concern that impacts every other component: the sync service must filter notifications to only devices with access, the search index must enforce permissions on query results, the storage layer must validate upload permissions, and the sharing UI must display accurate permission states. A centralized permission service (like Zanzibar) that all other services call is the correct architectural pattern.

Scaling, Reliability & Cost Optimization

At exabyte scale, the operational concerns of a file storage service dominate the engineering effort. Storing data is the easy part; keeping it durable, available, and affordable over decades is the hard part. This section covers the techniques that production file storage services use to achieve eleven 9s of durability, four 9s of availability, and cost-effective storage at petabyte-to-exabyte scale.

Data durability — the guarantee that stored data will not be lost — is the single most important property of a file storage service. Users trust these services with irreplaceable files: family photos, legal documents, medical records, and business-critical data. Losing even a single file due to a storage system failure would be catastrophic for user trust. The industry standard is eleven 9s of durability (99.999999999%), which means losing less than 1 object per 100 billion objects per year.

Durability and reliability techniques

  • Replication (3-way) — Store three copies of each chunk across three different storage nodes in different failure domains (different racks, different power supplies). If one copy is lost (disk failure), the system automatically re-replicates from the remaining two copies. Simple but expensive: 3x storage overhead.
  • Erasure coding — Encode each chunk into N data fragments and M parity fragments (e.g., 6+3 Reed-Solomon coding). Any 6 of the 9 fragments can reconstruct the original chunk. Storage overhead: 1.5x (compared to 3x for replication). Used by S3, Azure, and Dropbox Magic Pocket for cold data.
  • Checksumming and scrubbing — Compute a CRC32 or SHA-256 checksum for every stored chunk. Periodically read back stored chunks and verify checksums to detect silent bit rot (data corruption without disk failure). If a corrupted chunk is detected, reconstruct it from replicas or erasure-coded fragments.
  • Cross-region replication — Replicate critical data to a geographically distant region to survive regional disasters (earthquake, flooding, extended power outage). Adds latency to writes but provides the ultimate durability guarantee.

Erasure Coding vs Replication: The Cost Trade-off

For hot data (frequently accessed), 3-way replication is preferred because any single replica can serve a read request, maximizing read throughput. For warm and cold data (infrequently accessed), erasure coding is preferred because the 1.5x overhead (vs 3x for replication) saves enormous amounts of storage at petabyte scale. The trade-off: erasure-coded reads require reading from multiple nodes and reconstructing the data, which is slower than reading a single replica. Most production systems use replication for the hot tier and erasure coding for warm and cold tiers.

| Storage Tier | Access Pattern | Durability Technique | Cost (per GB/month) | Use Case |
| --- | --- | --- | --- | --- |
| Hot (S3 Standard) | Accessed daily | 3-way replication | $0.023 | Recently uploaded files, frequently accessed documents |
| Warm (S3 IA) | Accessed monthly | Erasure coding (6+3) | $0.0125 | Files not accessed in 90 days |
| Cold (S3 Glacier) | Accessed yearly | Erasure coding (10+4) | $0.004 | Version history, old backups, archived accounts |
| Deep Archive | Accessed rarely | Erasure coding (10+4), tape backup | $0.00099 | Legal hold, compliance archives, 7+ year retention |

Tiered storage is the primary cost optimization lever. A file storage service with 2 exabytes of data where 90% of chunks have not been accessed in 90 days can move that 1.8 exabytes from S3 Standard ($0.023/GB/month) to S3 Glacier ($0.004/GB/month), saving approximately $34 million per month. Lifecycle policies automatically transition chunks between tiers based on last-access time. When a user accesses a cold file, it is restored to the hot tier with a brief delay (minutes for Glacier, hours for Deep Archive).
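The savings figure checks out with a back-of-envelope calculation (decimal units, as cloud pricing uses, and the per-GB prices listed above):

```python
# Moving 1.8 EB of cold chunks from S3 Standard to S3 Glacier.
GB_PER_EB = 1_000_000_000
cold_gb = 1.8 * GB_PER_EB
standard, glacier = 0.023, 0.004        # $/GB/month, from the tier table

monthly_savings = cold_gb * (standard - glacier)
assert round(monthly_savings / 1e6, 1) == 34.2   # ~ $34M/month
```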

Garbage Collection of Orphaned Chunks

When files are deleted or versions expire, the associated chunks may become orphaned — no metadata record references them. A garbage collection (GC) process must periodically scan the chunk index, identify chunks with ref_count = 0, and delete them from object storage. This is a safety-critical operation: deleting a chunk that is still referenced by a live file causes data loss. GC should use a tombstone period (mark as deletable, wait 30 days, then delete) and cross-reference against multiple metadata shards before final deletion. Dropbox runs GC as a batch job with extensive safety checks, not as a real-time process.
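The tombstone discipline in the callout above can be sketched as a single GC pass over the chunk index (data shapes and the day-granularity clock are illustrative simplifications; a real GC would also cross-check metadata shards before deleting):

```python
def collect(chunk_refs: dict, tombstones: dict, now: int, grace_days: int = 30):
    """One pass of tombstoned garbage collection.

    chunk_refs: hash -> ref_count
    tombstones: hash -> day the chunk was first seen with ref_count 0
    Returns the hashes that are safe to delete this pass.
    """
    deletable = []
    for h, refs in list(chunk_refs.items()):
        if refs > 0:
            tombstones.pop(h, None)   # resurrected: a live file references it again
        elif h not in tombstones:
            tombstones[h] = now       # newly orphaned: start the grace period
        elif now - tombstones[h] >= grace_days:
            deletable.append(h)       # grace period elapsed: safe to delete
    return deletable

refs = {"aaa": 0, "bbb": 2, "ccc": 0}
tombs = {"ccc": 0}                    # "ccc" was orphaned 40 days ago

doomed = collect(refs, tombs, now=40)
assert doomed == ["ccc"]              # only the chunk past its grace period
assert "aaa" in tombs                 # "aaa" just entered its grace period
```

The grace period is what makes the race benign: a chunk that gets re-referenced mid-window (for example, by an in-flight upload deduplicating against it) is simply un-tombstoned on the next pass.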

Cross-region sync architecture for global availability

1. Primary region (e.g., US-East) handles all writes. The metadata service and block storage master copies live here.
2. Metadata changes are replicated to secondary regions (EU-West, AP-Southeast) via asynchronous MySQL replication or Kafka-based change data capture (CDC).
3. Block storage chunks are replicated to secondary regions via S3 Cross-Region Replication (CRR) with a typical lag of 15 minutes to a few hours for large files.
4. Clients connect to the nearest region for reads (low latency). Read replicas of the metadata database serve folder listings and file downloads from local storage.
5. Writes are routed to the primary region. For users far from the primary region, write latency is higher (cross-region round trip: 100-200ms). Optimistic local-first writes mitigate this: the client commits locally and syncs to the primary asynchronously.
6. In a disaster scenario (primary region goes down), the secondary region is promoted to primary. This involves a DNS failover (TTL: 60 seconds), promoting the MySQL replica to primary, and accepting that any unreplicated changes in the last few minutes may be lost (RPO: typically < 5 minutes).


Bandwidth cost optimization is another critical concern. Data transfer out of cloud providers (egress) is expensive: $0.09/GB for AWS. At 500TB/day of downloads (Dropbox scale), that is $45,000 per day in egress costs alone. Strategies to reduce this include: aggressive client-side caching (chunks already on the device are never re-downloaded), delta sync (only transfer the changed portions of a file), CDN integration for frequently-shared files, and peering agreements with major ISPs to reduce transit costs.
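The egress figure checks out with back-of-envelope arithmetic (assuming 1 TB = 1,000 GB, and using the rates stated above):

```python
# Egress cost estimate, using the figures from the text.
EGRESS_PER_GB = 0.09        # AWS egress, $/GB
DAILY_DOWNLOADS_TB = 500    # Dropbox-scale daily download volume

daily_cost = DAILY_DOWNLOADS_TB * 1000 * EGRESS_PER_GB
print(f"${daily_cost:,.0f}/day")           # $45,000/day
print(f"${daily_cost * 365:,.0f}/year")    # ~$16.4M/year before any optimization
```

At that annual run rate, even a 10% reduction from client-side caching or delta sync pays for significant engineering effort.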

Delta Sync: The Ultimate Bandwidth Optimization

Instead of re-uploading entire changed chunks, advanced sync clients compute a binary delta between the old and new version of a chunk using algorithms like rsync or xdelta3. If a user modifies 100 bytes in a 4MB chunk, only a ~200-byte delta is transmitted instead of the full 4MB chunk. Dropbox implemented this optimization and reported a 50-80% reduction in sync bandwidth for document-type files. The trade-off is CPU cost on the client to compute deltas, which is acceptable for modern devices.
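A minimal sketch of the idea: hash fixed-size blocks of the old chunk, then describe the new chunk as "copy" references plus literal bytes for changed regions. Real rsync/xdelta3 use rolling hashes so matches survive insertions that shift block boundaries; this sketch assumes in-place edits and fixed 1 KB blocks, which is enough to show the wire savings:

```python
import hashlib

BLOCK = 1024  # match granularity within a chunk (rsync uses a rolling hash instead)

def make_signature(data: bytes) -> dict:
    """Hash each fixed-size block of the old chunk: hash -> block offset."""
    return {hashlib.sha256(data[i:i + BLOCK]).digest(): i
            for i in range(0, len(data), BLOCK)}

def make_delta(sig: dict, new: bytes) -> list:
    """('copy', offset) for blocks the server already has, ('literal', bytes) otherwise."""
    delta = []
    for i in range(0, len(new), BLOCK):
        block = new[i:i + BLOCK]
        if hashlib.sha256(block).digest() in sig:
            delta.append(("copy", sig[hashlib.sha256(block).digest()]))  # a few bytes on the wire
        else:
            delta.append(("literal", block))  # raw data only for changed regions
    return delta

old = bytes(4 * 1024 * 1024)                       # a 4 MB chunk
new = old[:100_000] + b"x" * 100 + old[100_100:]   # 100 bytes modified in place
delta = make_delta(make_signature(old), new)
literals = sum(len(d[1]) for d in delta if d[0] == "literal")
print(literals)  # ~1 KB of literal data instead of the full 4 MB chunk
```

The coarser the match granularity, the cheaper the signature but the more literal bytes a small edit drags along; production tools tune this block size per file type.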

How this might come up in interviews

File storage and sync is one of the most frequently asked system design questions at Dropbox (obviously), Google, Meta, Microsoft, and Amazon. It appears as "Design Dropbox" or "Design Google Drive" and is used to calibrate candidates from L4 (basic upload/download architecture) to L7 (delta sync algorithms, erasure coding trade-offs, cross-region conflict resolution). Interviewers expect you to start with requirements and estimation, proceed to the chunk-based storage architecture, discuss metadata design, and then deep-dive into sync and conflict resolution. The question reveals whether you understand data vs metadata separation, can reason about distributed state synchronization, and appreciate the cost optimization challenges at scale.

Common questions:

  • L4: Design a basic file upload and download service. How would you store files, support multiple users, and handle concurrent uploads? [Tests: basic API design, S3 usage, metadata schema, authentication]
  • L4-L5: Explain how file chunking works and why it is important. How do you decide on chunk size? Walk me through the upload protocol with deduplication. [Tests: chunking trade-offs, SHA-256 hashing, has-list protocol, content-addressable storage understanding]
  • L5: Design the sync protocol for a Dropbox-like service. How does a change on one device propagate to another? What happens when a device is offline for a week and then reconnects? [Tests: long polling vs WebSockets, sync cursors, offline-to-online recovery, notification fanout]
  • L5-L6: Two users edit the same file simultaneously while offline. How does your system detect and resolve this conflict? Compare at least three conflict resolution strategies. [Tests: version vectors, conflict detection vs resolution, LWW vs conflicted copies vs OT, trade-off analysis]
  • L6: Design the permission system for shared folders with nested inheritance. How do you check permissions efficiently when a file is 10 levels deep in a shared folder hierarchy? [Tests: ACL models, permission inheritance, materialized permissions, caching strategies, Zanzibar awareness]
  • L6-L7: Your service stores 2 exabytes of data. Design the storage tier architecture, garbage collection system, and cross-region replication strategy. Calculate the cost savings from erasure coding vs 3-way replication. [Tests: erasure coding math, lifecycle policies, GC safety, cross-region consistency, cost analysis with real numbers]

Key takeaways

  • Separate metadata (SQL, strongly consistent) from file data (object storage, high durability). This two-plane architecture is the foundation of every production file storage system and allows each plane to scale independently.
  • File chunking with content-addressable storage (SHA-256 hashing) enables three critical capabilities simultaneously: resumable uploads, cross-user deduplication (60-75% storage savings), and efficient sync (only changed chunks are transferred).
  • The sync protocol uses a monotonic cursor per device and long-poll notifications to achieve near-real-time propagation. When a file changes, only the devices with access are notified, and they fetch only the new or modified chunks.
  • Conflict resolution is the hardest problem. Use version vectors for detection and conflicted copies as the safe default for binary files. Three-way merge and OT/CRDTs are only practical for structured text data in real-time collaboration scenarios.
  • Cost optimization through tiered storage (hot, warm, cold, archive) and garbage collection of orphaned chunks is essential at scale. Moving 85% of data from hot to cold storage can save hundreds of millions of dollars annually at exabyte scale.
Before you move on: can you answer these?

A user modifies 500 bytes in the middle of a 200MB file. Walk through what happens from the client detecting the change to all devices being in sync. How many bytes are actually transferred?

The client detects the change via filesystem watcher, re-chunks the file into 50 chunks of 4MB each, and recomputes SHA-256 hashes. Only the one chunk containing the 500-byte modification has a different hash. The client sends all 50 hashes to the server, which responds that 49 are already stored. The client uploads only the 1 modified chunk (4MB). With delta sync, only the binary diff (~1KB) is transmitted instead of the full 4MB chunk. The metadata service updates the file version record, publishes a notification to Kafka, and the notification service pushes the change to all other connected devices, which download the single new chunk.
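The has-list exchange in that walkthrough can be sketched as follows. The "server" here is just an in-memory set of hashes, the file contents are all zeros, and the file is scaled down to 20 MB (5 chunks) to keep the example light; the logic is identical at 200 MB:

```python
import hashlib

CHUNK = 4 * 1024 * 1024  # 4 MB chunks, as in the walkthrough

def chunk_hashes(data: bytes) -> list:
    """SHA-256 of each fixed-size chunk, in order."""
    return [hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]

# Scaled-down stand-in for the 200 MB file: 20 MB, with 500 bytes
# modified in the middle.
old_file = bytes(20 * 1024 * 1024)
mid = len(old_file) // 2
new_file = old_file[:mid] + b"y" * 500 + old_file[mid + 500:]

server_has = set(chunk_hashes(old_file))  # hashes the server already stores
to_upload = [h for h in chunk_hashes(new_file) if h not in server_has]
print(len(to_upload))  # 1 -- only the chunk containing the edit is uploaded
```

The client sends the full hash list; the server's reply ("I already have these") is what turns a 20 MB (or 200 MB) re-upload into a single-chunk transfer.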

Why does Dropbox use a SQL database for metadata instead of a NoSQL database like DynamoDB or Cassandra? Under what circumstances might NoSQL be acceptable?

Metadata operations require strong consistency and transactional guarantees: creating a file requires atomic inserts across files, file_versions, and chunks tables. Folder moves require serialized updates to prevent cycles. SQL databases provide ACID transactions that make these operations safe. NoSQL databases offer eventual consistency, which would allow users on different devices to see inconsistent folder structures. NoSQL could be acceptable for immutable, non-relational data like activity logs or analytics, where eventual consistency is tolerable. The chunk index (hash to storage location) could potentially use DynamoDB since lookups are simple key-value reads with no transactions needed.
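A minimal sqlite3 sketch (hypothetical schema, with table names matching the answer) shows why the three inserts must share one transaction: either the file, its version, and its chunk list all become visible together, or none do:

```python
import sqlite3

# Hypothetical minimal schema for the metadata plane.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE files (file_id INTEGER PRIMARY KEY, path TEXT UNIQUE);
CREATE TABLE file_versions (version_id INTEGER PRIMARY KEY, file_id INTEGER, version INTEGER);
CREATE TABLE chunks (version_id INTEGER, idx INTEGER, chunk_hash TEXT);
""")

def create_file(path: str, chunk_hashes: list) -> None:
    """All three inserts commit or roll back together (ACID)."""
    with db:  # sqlite3 connection as context manager: one transaction
        file_id = db.execute(
            "INSERT INTO files (path) VALUES (?)", (path,)).lastrowid
        version_id = db.execute(
            "INSERT INTO file_versions (file_id, version) VALUES (?, 1)",
            (file_id,)).lastrowid
        db.executemany(
            "INSERT INTO chunks VALUES (?, ?, ?)",
            [(version_id, i, h) for i, h in enumerate(chunk_hashes)])

create_file("/docs/report.pdf", ["abc123", "def456"])
print(db.execute("SELECT COUNT(*) FROM chunks").fetchone()[0])  # 2
```

With an eventually consistent store, a crash between the second and third insert could leave a version record pointing at a chunk list that readers cannot yet see, which is exactly the class of anomaly the transaction rules out.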

Your file storage service has 2 exabytes of data. 85% of chunks have not been accessed in 6 months. Estimate the monthly cost savings from implementing tiered storage with lifecycle policies.

Total data: 2 EB = 2,000 PB = 2,000,000 TB. Hot data (15%): 300,000 TB at S3 Standard ($0.023/GB/month) = $6,900,000/month. Cold data (85%): 1,700,000 TB. Without tiering: 1,700,000 TB at $0.023/GB = $39,100,000/month. With tiering (S3 Glacier at $0.004/GB): 1,700,000 TB at $0.004/GB = $6,800,000/month. Monthly savings: $39,100,000 - $6,800,000 = $32,300,000/month (~$387M/year). This is why Dropbox built Magic Pocket — even small improvements in storage cost at this scale translate to hundreds of millions in annual savings.
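The estimate above reproduces directly (prices and fractions as stated in the answer, with 1 TB = 1,000 GB):

```python
# Tiered-storage savings estimate, using the figures from the answer.
TOTAL_TB = 2_000_000      # 2 EB
HOT_FRACTION = 0.15
S3_STANDARD = 0.023       # $/GB/month
S3_GLACIER = 0.004        # $/GB/month

hot_gb = TOTAL_TB * HOT_FRACTION * 1000
cold_gb = TOTAL_TB * (1 - HOT_FRACTION) * 1000

without_tiering = (hot_gb + cold_gb) * S3_STANDARD
with_tiering = hot_gb * S3_STANDARD + cold_gb * S3_GLACIER
savings = without_tiering - with_tiering

print(f"${savings:,.0f}/month")      # $32,300,000/month
print(f"${savings * 12:,.0f}/year")  # ~$387,600,000/year
```

Note the caveat hiding in the model: Glacier retrieval fees and delays are ignored, which is why lifecycle policies only move chunks that have genuinely gone cold.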

🧠Mental Model

💡 Analogy

A file storage and sync service works like a LEGO brick warehouse. Files are broken into standardized bricks (chunks), each stamped with a unique code (SHA-256 hash). The warehouse only stores one copy of each unique brick — if two customers bring identical bricks, only one is shelved. Your instruction manual (metadata) tells you which bricks in which order reconstruct your file. When you update your model, only the changed bricks are swapped out; the unchanged ones stay on the shelf. Sync is like two people building from the same manual — when one adds a brick, the warehouse notifies the other so they can update their local copy. If both people add different bricks to the same spot at the same time, the warehouse flags a conflict and keeps both versions until the builders sort it out.

⚡ Core Idea

A file storage and sync service separates file content from file metadata. Content is chunked, hashed, deduplicated, and stored in object storage. Metadata — file paths, version history, permissions, and sync state — lives in a strongly consistent SQL database. Sync works by tracking a monotonic cursor per device and pushing change notifications via long polling. Conflicts are detected using version vectors and resolved by creating conflicted copies (preserving all data) or auto-merging where possible. The entire system is optimized for bandwidth efficiency (delta sync, dedup) and storage cost (tiered storage, erasure coding).

🎯 Why It Matters

Understanding file storage and sync architecture is essential because it combines nearly every distributed systems concept into one design: content-addressable storage, metadata consistency, real-time event propagation, conflict resolution, permission management at scale, and storage cost optimization. These same patterns appear in databases, version control systems, backup services, and any system that must synchronize state across distributed nodes. In interviews, this question reveals whether you can reason about data vs metadata separation, handle concurrent modifications, and optimize for real-world constraints like bandwidth and cost.
