The Simplified Tech

© 2026 TheSimplifiedTech. All rights reserved.
Designing a Video Streaming Platform

Design a video platform that handles 500 hours of uploads per minute and serves 1 billion hours of video per day to a global audience. Cover the full pipeline from upload through transcoding, CDN delivery, adaptive bitrate streaming, metadata search, content moderation, and cost optimization at YouTube/Netflix scale.

🎯 Key Takeaways

  • A video platform is a write-once, read-millions system shaped by extreme read-write asymmetry. The upload path is asynchronous and compute-intensive (transcoding), while the viewing path is synchronous, latency-sensitive, and bandwidth-intensive (CDN delivery). Design these paths as independent systems connected by a message queue.
  • The codec ladder (encoding each video into multiple resolution/bitrate variants) and adaptive bitrate streaming (HLS/DASH) work together to deliver the best possible quality for each viewer given their device and network conditions. Segment-based streaming with 4-6 second segments enables seamless quality switching without rebuffering.
  • CDN architecture is the core of the viewing experience. A three-tier cache hierarchy (edge POP, regional shield, origin) with popularity-based pre-positioning ensures that 90%+ of requests are served from edge caches under 20ms away from the viewer. Request collapsing at the shield tier prevents thundering herd on viral content.
  • Codec selection (H.264 vs H.265 vs AV1) is a strategic decision trading off compression efficiency against encoding cost, device support, and licensing. The practical approach is dual-codec: H.264 as universal baseline plus AV1 for high-traffic content where bandwidth savings justify the encoding cost.
  • Cost optimization at video scale requires tiered storage (hot/warm/cold/archive lifecycle), spot instances for transcoding, selective encoding (full codec ladder only for popular content), and per-title encoding. Bandwidth is the dominant cost, making CDN efficiency and codec optimization the highest-leverage investments.


The Scale of Video — Why YouTube Is the Hardest System to Design

Video is the single most demanding workload in all of computing. YouTube alone accounts for roughly 15% of all internet downstream traffic globally, placing it alongside Netflix as the largest consumer of bandwidth on the planet. Users upload over 500 hours of video every single minute, which translates to roughly 720,000 hours of new content every day. On the consumption side, viewers watch over 1 billion hours of video per day across YouTube. These numbers make a video platform one of the most challenging systems to design at scale, touching every layer of the stack from storage to networking to compute.

To put the storage challenge in perspective, a single hour of raw 4K video at 60fps produces approximately 300-500 GB of data before compression. After encoding at reasonable quality, one hour of 4K content still requires roughly 7-10 GB per resolution. When you factor in the codec ladder (encoding each video into multiple resolutions and bitrates for adaptive streaming), a single one-hour upload might produce 20-40 GB of transcoded output across all variants. At 500 hours uploaded per minute, the platform must ingest and process roughly 150-300 TB of raw video per day and produce 1-3 PB of transcoded output daily.

The bandwidth dominance of video

Video streaming constitutes over 65% of all global internet traffic. A single 4K stream at 20 Mbps consumes as much bandwidth as 200 concurrent web browsing sessions. This is why CDN architecture is not an afterthought in video platform design — it is the core of the system. Without an efficient global CDN, no video platform can exist at scale.

The compute requirements are equally staggering. Transcoding a single minute of video from raw upload to the full codec ladder (6-8 resolution/bitrate variants) takes approximately 2-10 minutes of CPU time depending on codec and hardware acceleration. At 500 hours uploaded per minute, you need thousands of transcoding workers running in parallel just to keep up with the ingest rate. During viral events or live streaming spikes, this demand can surge 5-10x, requiring elastic compute that can scale up within minutes.

Why video platform design is uniquely hard

  • Massive storage scale — Petabytes of new transcoded content per day, with historical archive measured in exabytes. YouTube stores an estimated 1 exabyte or more of video data.
  • Compute-intensive processing — Transcoding is CPU/GPU-bound. A single 4K video may require hours of CPU time across the codec ladder, and the platform processes hundreds of thousands of videos daily.
  • Global low-latency delivery — Viewers expect video to start playing within 2 seconds anywhere in the world. This requires edge caches on every continent and intelligent routing to the nearest point of presence.
  • Adaptive quality in real time — The player must continuously adjust video quality based on network conditions, switching between resolutions seamlessly without buffering or visible artifacts.
  • Content moderation at scale — Every uploaded video must be scanned for copyright violations, harmful content, and policy violations before or shortly after it becomes publicly available.
  • Cost optimization pressure — Storage, bandwidth, and compute for video cost orders of magnitude more than text-based applications. A 1% efficiency improvement can save millions of dollars annually.

What makes video platform design a favorite system design interview question is that it forces candidates to reason across every dimension of distributed systems simultaneously. You cannot design a video platform without understanding object storage, message queues, worker pools, CDNs, streaming protocols, database design, and cost optimization. It is a true end-to-end system design problem that separates candidates who can reason about system interactions from those who only understand individual components.

Interview framing strategy

When asked to design a video platform, immediately clarify scope: is this YouTube (user-generated content with upload, search, recommendations) or Netflix (licensed content with curated catalog)? The upload pipeline, moderation requirements, and content discovery systems differ dramatically. For most interviews, YouTube-style UGC platforms are the expected scope.

The system must handle an enormous asymmetry between upload and viewing patterns. A typical video platform has a read-to-write ratio of roughly 200:1 to 1000:1 in terms of bandwidth. A single popular video might be uploaded once but viewed millions of times. This asymmetry fundamentally shapes the architecture: the upload path can be somewhat slow and asynchronous, but the viewing path must be blazing fast, globally distributed, and highly available. The upload pipeline is a background processing system; the viewing pipeline is a real-time content delivery network.

| Metric | Approximate Value | Design Implication |
|---|---|---|
| Video uploads per minute | 500 hours | Asynchronous processing pipeline with large worker pools |
| Daily video watched | 1 billion hours | Massive CDN with thousands of edge locations globally |
| Storage per hour (4K transcoded) | 7-10 GB per resolution | Exabyte-scale object storage with tiered lifecycle policies |
| Transcoding time per minute of video | 2-10 CPU-minutes per variant | Elastic compute fleet (spot instances) for transcoding workers |
| Read-to-write bandwidth ratio | 200:1 to 1000:1 | Optimize heavily for read path; CDN caching is critical |
| Time to first byte (viewer expectation) | Under 2 seconds | Pre-positioned edge caches and segment pre-fetching |

Requirements & Capacity Estimation

Before drawing architecture diagrams, you must establish the requirements and run the numbers. Capacity estimation is not just a formality in video platform design — the numbers are so large that they directly determine which architectural patterns are feasible. A wrong assumption about upload volume or storage requirements can lead to an architecture that collapses under real load.

Functional requirements

  • Video upload — Users upload videos up to 1 hour long in common formats (MP4, MOV, AVI, MKV). The system accepts the raw file, validates it, and queues it for processing.
  • Video transcoding — Each uploaded video is transcoded into multiple resolution/bitrate pairs (the codec ladder) to support adaptive bitrate streaming across devices.
  • Video streaming — Viewers watch videos via adaptive bitrate streaming (HLS or DASH). The player selects the appropriate quality based on network conditions.
  • Video search and discovery — Users search by title, tags, and description. The system supports recommendation feeds based on viewing history and engagement signals.
  • Content moderation — Automated scanning for copyright (fingerprinting), harmful content, and policy violations. Human review pipeline for edge cases.
  • Thumbnail generation — Automatic extraction of representative frames. Optional: AI-powered selection of the most engaging thumbnail.

Non-functional requirements

  • Availability — 99.99% for the viewing path. Upload processing can tolerate brief delays (99.9% SLA) since it is asynchronous.
  • Latency — Time to first video frame under 2 seconds globally. Upload acknowledgment within 5 seconds. Transcoding completion within 30 minutes for standard videos.
  • Durability — 11 nines (99.999999999%) for stored video. Losing a video is unacceptable — creators depend on the platform for their livelihood.
  • Scalability — Handle 10x traffic spikes during viral events or live streams without degradation of the viewing experience.
  • Cost efficiency — Storage and bandwidth are the dominant costs. Architecture must support tiered storage and intelligent caching to control expenses.

Now let us work through the back-of-the-envelope numbers. These calculations will drive every architectural decision in the sections that follow.

Storage estimation

1. Daily uploads: 500 hours/minute x 60 minutes x 24 hours = 720,000 hours of video per day.

2. Average video length: Assume most uploads are short (5-15 minutes), with an average of 10 minutes. That gives us 720,000 hours x 6 videos per hour = 4,320,000 individual video uploads per day.

3. Raw upload size: Average upload is 10 minutes at 1080p, approximately 1-2 GB. Total raw ingest: roughly 4-8 PB per day.

4. Transcoded output per video: Each video is encoded into 6-8 variants (240p through 4K). A 10-minute video produces roughly 500 MB to 2 GB of total transcoded output across all variants. Average: ~1 GB per video.

5. Daily transcoded storage: 4,320,000 videos x 1 GB = approximately 4.3 PB of new transcoded content per day.

6. Annual storage growth: 4.3 PB/day x 365 = approximately 1.6 EB per year of transcoded content alone. After 10 years, the system holds 10+ exabytes.
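The storage arithmetic above is easy to reproduce as a short script. The per-video sizes are the assumed averages from the steps, not measured values:

```python
# Back-of-the-envelope storage estimation for the upload pipeline.
# All per-video sizes are the assumed averages from the steps above.
HOURS_PER_MINUTE_UPLOADED = 500      # platform-wide ingest rate
AVG_VIDEO_MINUTES = 10               # assumed average upload length
RAW_GB_PER_VIDEO = 1.5               # midpoint of the 1-2 GB assumption
TRANSCODED_GB_PER_VIDEO = 1.0        # all codec-ladder variants combined

daily_hours = HOURS_PER_MINUTE_UPLOADED * 60 * 24        # 720,000 hours/day
daily_videos = daily_hours * (60 // AVG_VIDEO_MINUTES)   # 4,320,000 uploads/day

raw_ingest_pb = daily_videos * RAW_GB_PER_VIDEO / 1e6    # GB -> PB
transcoded_pb = daily_videos * TRANSCODED_GB_PER_VIDEO / 1e6
annual_eb = transcoded_pb * 365 / 1000                   # PB -> EB

print(f"{daily_hours:,} hours/day, {daily_videos:,} uploads/day")
print(f"raw ingest: {raw_ingest_pb:.2f} PB/day")
print(f"transcoded: {transcoded_pb:.2f} PB/day, {annual_eb:.2f} EB/year")
```

Changing a single assumption (say, average video length) cascades through every downstream number, which is exactly why interviewers want the assumptions stated explicitly.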

Bandwidth estimation

1. Daily viewing: 1 billion hours of video watched per day.

2. Average bitrate served: Assume a mix of resolutions. Average effective bitrate is approximately 5 Mbps (weighted average across mobile 480p and desktop 1080p).

3. Total egress bandwidth: 1 billion hours x 3600 seconds/hour x 5 Mbps = 18 exabits per day = 208 Tbps average sustained throughput.

4. Peak bandwidth (assume 3x average): approximately 624 Tbps. This is why YouTube operates its own CDN backbone (Google Global Cache) rather than relying on third-party CDNs.

5. Cost implication: At commercial CDN rates of $0.01-0.02 per GB, 1 billion hours per day would cost over $200 million per month in bandwidth alone. This is why large platforms build their own CDN infrastructure.
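The same unit conversions in code, using the low end of the quoted CDN rate (the 3x peak factor and $0.01/GB blended rate are the assumptions from the steps above):

```python
# Egress bandwidth estimate from daily watch time and average bitrate.
DAILY_WATCH_HOURS = 1e9        # 1 billion hours watched per day
AVG_BITRATE_MBPS = 5           # assumed weighted average across resolutions
CDN_RATE_PER_GB = 0.01         # low end of the commercial rate quoted above

total_megabits = DAILY_WATCH_HOURS * 3600 * AVG_BITRATE_MBPS
exabits_per_day = total_megabits / 1e12        # Mb -> Eb
avg_tbps = total_megabits / 86400 / 1e6        # sustained Mb/s -> Tbps
peak_tbps = avg_tbps * 3                       # assumed 3x peak factor

daily_gb = total_megabits / 8 / 1000           # Mb -> MB -> GB
monthly_cost = daily_gb * CDN_RATE_PER_GB * 30

print(f"{exabits_per_day:.0f} exabits/day, {avg_tbps:.0f} Tbps avg, "
      f"{peak_tbps:.0f} Tbps peak")
print(f"commercial CDN cost: ${monthly_cost / 1e6:.0f}M/month")
```

Even at the cheapest commercial rate the monthly egress bill lands in the hundreds of millions, which is the quantitative case for building in-house CDN capacity.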

Let the numbers drive codec selection

The bandwidth numbers reveal why codec efficiency matters enormously. AV1 achieves 30-50% better compression than H.264 at equivalent quality. At 208 Tbps of sustained throughput, a 30% bandwidth reduction saves over 60 Tbps of capacity and tens of millions of dollars per month. This is why YouTube invested heavily in AV1 adoption despite its higher encoding cost.

| Resource | Daily Volume | Yearly Volume | Cost Driver |
|---|---|---|---|
| Raw uploads ingested | 4-8 PB | 1.5-3 EB | Temporary storage (deleted after transcoding) |
| Transcoded output stored | ~4.3 PB | ~1.6 EB | Permanent object storage with tiered lifecycle |
| Viewing bandwidth | 208 Tbps average | ~820 EB transferred | CDN egress — the single largest cost |
| Transcoding compute | ~72M CPU-hours/day | ~26B CPU-hours/year | Spot/preemptible instances for cost savings |
| Metadata records | ~4.3M new videos/day | ~1.6B videos/year | Database storage and indexing |

These numbers make it clear that the three pillars of video platform economics are bandwidth (CDN costs), storage (object store lifecycle management), and compute (transcoding fleet). Every architectural decision in the following sections is ultimately about optimizing one of these three cost centers while maintaining user experience.

Video Upload Pipeline

The upload pipeline is the entry point for all content entering the platform. Despite being asynchronous (users do not expect their video to be available instantly), the upload path must be robust, resumable, and capable of handling files that range from 10 MB phone clips to 100+ GB professional 4K productions. A dropped or corrupted upload wastes user time, bandwidth, and trust.

The upload flow begins at the client. The client application (web, mobile, or API) splits the video file into chunks (typically 5-10 MB each) and uploads them in parallel over HTTPS. Chunked upload is essential for large files because it enables resumability: if the network drops during a 50 GB upload, the client can resume from the last successful chunk rather than restarting from zero. The upload service reassembles the chunks in the correct order and writes the complete file to object storage.

```mermaid
graph LR
    A["Client App"] -->|"Chunked Upload<br/>HTTPS"| B["Upload Service<br/>(API Gateway)"]
    B -->|"Reassemble & Write"| C["Object Storage<br/>(Raw Video)"]
    B -->|"Create Record"| D["Metadata DB<br/>(PostgreSQL)"]
    C -->|"Upload Complete Event"| E["Message Queue<br/>(Kafka)"]
    E -->|"Dequeue"| F["Transcoding<br/>Orchestrator"]
    F -->|"Spawn Workers"| G["Transcoding<br/>Worker Pool"]
    G -->|"Write Variants"| H["Object Storage<br/>(Transcoded)"]
    G -->|"Update Status"| D
    H -->|"Propagate to CDN"| I["CDN Edge<br/>Caches"]

    style A fill:#e1f5fe
    style C fill:#fff3e0
    style H fill:#fff3e0
    style E fill:#f3e5f5
    style I fill:#e8f5e9
```

End-to-end upload pipeline: from client chunked upload through object storage, message queue, transcoding workers, and CDN propagation.

The upload service sits behind an API gateway and handles authentication, rate limiting, and file validation. Before accepting the upload, it verifies the user is authenticated, checks upload quotas (free users might be limited to 15-minute videos, premium users to 12 hours), validates the file header to confirm it is a supported video container format, and reserves a unique video ID. The service then orchestrates the chunked upload protocol, tracking which chunks have been received and requesting retransmission of any missing or corrupted chunks.

Upload processing sequence

1. Client requests an upload session from the Upload Service, providing video metadata (title, description, tags). The service returns a unique upload_id and a list of pre-signed URLs for each chunk.

2. Client splits the video file into 5-10 MB chunks and uploads each chunk in parallel (typically 4-8 concurrent uploads) to the pre-signed URLs, writing directly to object storage.

3. Upload Service tracks chunk completion. Once all chunks are received, it triggers a server-side assembly process that concatenates chunks into the final raw video file in object storage.

4. Upload Service validates the assembled file: checks container integrity (is the MP4 valid?), scans for malware, and extracts basic metadata (duration, resolution, codec, framerate).

5. Upload Service writes a metadata record to PostgreSQL (video_id, owner_id, title, status=PROCESSING, upload_timestamp, raw_file_path) and publishes a VIDEO_UPLOADED event to Kafka.

6. The Transcoding Orchestrator consumes the event and begins the transcoding pipeline (covered in the next section).

Pre-signed URLs for direct-to-storage upload

Use pre-signed URLs (S3 pre-signed or GCS signed URLs) so that clients upload chunks directly to object storage, bypassing your upload service for the actual data transfer. This prevents your upload service from becoming a bandwidth bottleneck and eliminates double-copying (client to service, service to storage). The upload service only handles metadata and orchestration, not the multi-gigabyte video data itself.

Resumable uploads are a non-negotiable requirement. The Google Upload Protocol (used by YouTube) and the TUS protocol are industry standards for resumable uploads. The client tracks which chunks have been successfully uploaded (confirmed by the storage service), and on network failure, it queries the server for the upload state and resumes from the last confirmed chunk offset. For very large files (50+ GB), uploads may span hours or even be paused and resumed across sessions.
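The chunk-bookkeeping side of a resumable upload is straightforward to sketch. This is an illustrative client-side helper, not a real protocol implementation; the server is assumed to report the set of confirmed chunk ids:

```python
# Sketch of resumable-chunk bookkeeping for an upload client.
# The 8 MB chunk size sits inside the 5-10 MB range discussed above;
# the "confirmed chunk ids" input is assumed to come from the server.
CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB

def chunk_ranges(file_size: int, chunk_size: int = CHUNK_SIZE):
    """Yield (index, start_offset, length) for each chunk of the file."""
    index = 0
    for start in range(0, file_size, chunk_size):
        yield index, start, min(chunk_size, file_size - start)
        index += 1

def resume_plan(file_size: int, confirmed: set[int]):
    """Return the chunks still to upload after a network failure,
    given the chunk ids the server has already confirmed."""
    return [c for c in chunk_ranges(file_size) if c[0] not in confirmed]

# A 50 GB upload interrupted after 3,000 confirmed chunks resumes
# from chunk 3,000 instead of restarting from zero.
```

Each remaining `(index, offset, length)` triple maps to one `Content-Range` PUT against a pre-signed URL; only the orchestration state lives in the upload service.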

Handling duplicate uploads

Users sometimes upload the same video twice accidentally. Compute a content hash (SHA-256) of the raw file during assembly. If the hash matches an existing video owned by the same user, prompt them rather than creating a duplicate. However, do not deduplicate across users — two different creators may legitimately upload the same clip, and each owns their copy.
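Computing the content hash in streaming fashion matters at these file sizes; a multi-gigabyte file should never be loaded into memory. A minimal sketch, with the per-user lookup shown as a plain dict standing in for a database index:

```python
# Duplicate detection via content hash, as described above: hash the
# assembled file block-by-block, then look up (owner_id, sha256)
# before creating a new record. The dict stands in for a DB index.
import hashlib

def content_hash(path: str, block_size: int = 4 * 1024 * 1024) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(block_size):
            h.update(block)
    return h.hexdigest()

def is_duplicate(owner_id: str, digest: str, existing: dict) -> bool:
    """Per-user only: two creators uploading the same clip each keep
    their own copy, so the key includes the owner."""
    return (owner_id, digest) in existing
```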

Error handling in the upload pipeline must be thorough. If the assembled file fails validation (corrupted container, unsupported codec, zero-duration), the system must notify the user with a clear error message and clean up the partial upload from object storage. If the video passes validation but transcoding later fails, the raw file should be retained for a retry window (typically 72 hours) before garbage collection deletes it. The metadata record tracks the video through every state: UPLOADING, VALIDATING, PROCESSING, READY, FAILED.

| Video State | Trigger | Visible to Viewer? | Next Action |
|---|---|---|---|
| UPLOADING | Client begins chunk upload | No | Track chunk progress, wait for completion |
| VALIDATING | All chunks received | No | Assemble file, validate container and codec |
| PROCESSING | Validation passed, Kafka event published | No | Transcoding workers encode all variants |
| READY | All transcoded variants complete | Yes | Video available for viewing via CDN |
| FAILED | Validation or transcoding error | No (creator sees error) | Notify creator, retain raw file for retry |
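The lifecycle table can be made executable as an explicit transition map. This is a simplified sketch of one way to enforce it; rejecting illegal transitions keeps a crashed or replayed worker from, say, flipping a FAILED video straight to READY:

```python
# Video lifecycle from the table above as an explicit state machine.
# READY is treated as terminal in this simplified model.
TRANSITIONS = {
    "UPLOADING":  {"VALIDATING", "FAILED"},
    "VALIDATING": {"PROCESSING", "FAILED"},
    "PROCESSING": {"READY", "FAILED"},
    "READY":      set(),           # terminal here
    "FAILED":     {"PROCESSING"},  # retry within the retention window
}

def advance(current: str, target: str) -> str:
    """Validate and apply a state transition; raise on illegal moves."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```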

Video Transcoding & Encoding

Transcoding is the computational heart of a video platform. It converts the raw uploaded video — which may be in any format, resolution, or codec — into a standardized set of output variants optimized for streaming. Each variant targets a specific combination of resolution, bitrate, and codec, forming what is known as the codec ladder or encoding ladder. The goal is to produce a version of the video for every combination of device capability and network condition a viewer might have.

A typical codec ladder for a YouTube-scale platform includes 6-8 variants per video. For a 1080p upload, the output ladder might include: 2160p (4K upscale only if source is 4K), 1080p at 4-6 Mbps, 720p at 2.5-4 Mbps, 480p at 1-2 Mbps, 360p at 0.5-1 Mbps, and 240p at 0.3-0.5 Mbps. For each resolution, the video is split into small segments (typically 2-6 seconds each) for adaptive bitrate streaming. Each segment is independently decodable, allowing the player to switch between quality levels at segment boundaries without any glitch or buffering.

| Resolution | Typical Bitrate (H.264) | Typical Bitrate (AV1) | Target Device |
|---|---|---|---|
| 2160p (4K) | 15-20 Mbps | 8-12 Mbps | Smart TVs, high-end desktops |
| 1080p | 4-6 Mbps | 2-4 Mbps | Desktops, tablets, flagship phones |
| 720p | 2.5-4 Mbps | 1.5-2.5 Mbps | Mid-range phones, average broadband |
| 480p | 1-2 Mbps | 0.5-1 Mbps | Mobile on cellular, slow connections |
| 360p | 0.5-1 Mbps | 0.3-0.5 Mbps | Very slow connections, emerging markets |
| 240p | 0.3-0.5 Mbps | 0.15-0.3 Mbps | Ultra-low bandwidth, feature phones |

The choice of codec has enormous implications for quality, bandwidth, and compute cost. H.264 (AVC) is the universal baseline: every device supports it, encoding is fast, but compression efficiency is the worst of the modern codecs. H.265 (HEVC) offers 30-40% better compression than H.264 at equivalent quality, but licensing costs are complex and fragmented (multiple patent pools), and not all browsers support it natively. VP9 (developed by Google) provides similar compression gains to H.265 without licensing fees and is the primary codec used by YouTube. AV1 (developed by the Alliance for Open Media, including Google, Netflix, Meta, and Amazon) achieves 30-50% better compression than H.264 and is royalty-free, but encoding is 10-100x slower than H.264, making it extremely compute-expensive.

The practical codec strategy for 2024+

Encode everything in H.264 as the baseline (universal playback support). Additionally encode popular or high-traffic videos in AV1 for bandwidth savings on supporting devices. Use VP9 as a middle ground for content that does not justify AV1 encoding cost. This tiered approach balances compute cost against bandwidth savings — AV1 encoding is expensive, but for a viral video watched millions of times, the bandwidth savings vastly outweigh the one-time encoding cost.
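The selective-AV1 decision reduces to comparing projected bandwidth savings against the extra encoding cost. A sketch with illustrative numbers; every constant below is an assumption, not a published figure:

```python
# Selective-codec decision: encode AV1 only when projected bandwidth
# savings exceed the extra encoding cost. All rates are illustrative
# assumptions for the sketch, not real pricing.
H264_GB_PER_HOUR = 2.25         # ~5 Mbps average delivery bitrate
AV1_SAVINGS = 0.35              # assume 35% bitrate reduction vs H.264
EGRESS_COST_PER_GB = 0.01       # blended CDN cost assumption
AV1_ENCODE_COST_PER_HOUR = 5.0  # assumed extra compute per source hour

def should_encode_av1(expected_view_hours: float,
                      duration_hours: float) -> bool:
    savings = (expected_view_hours * H264_GB_PER_HOUR
               * AV1_SAVINGS * EGRESS_COST_PER_GB)
    cost = duration_hours * AV1_ENCODE_COST_PER_HOUR
    return savings > cost

# A 10-minute video projected at 1M watch-hours easily clears the bar;
# one projected at 20 watch-hours does not.
```

In practice the view-count prediction would come from creator history and early engagement signals, and the threshold can be re-evaluated as a video trends.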

The transcoding architecture must handle massive parallelism. Rather than processing each video serially, the system uses a DAG (Directed Acyclic Graph) task pipeline. When a new video arrives for transcoding, the orchestrator breaks it into a graph of dependent tasks: first, extract audio and perform audio transcoding; in parallel, probe the video to determine source resolution and codec; then, for each target resolution, create an encoding task that depends on the probe result. Each encoding task is further parallelized by splitting the video into temporal chunks (e.g., 10-second segments) and encoding them in parallel across multiple workers, then concatenating the results.

```mermaid
graph TD
    A["VIDEO_UPLOADED Event"] --> B["Transcoding Orchestrator"]
    B --> C["Probe: Extract Source<br/>Metadata"]
    B --> D["Extract & Transcode<br/>Audio (AAC)"]
    C --> E{"Source Resolution<br/>≥ 1080p?"}
    E -->|Yes| F["Encode 1080p<br/>(H.264 + AV1)"]
    E -->|Yes| G["Encode 720p<br/>(H.264)"]
    E -->|Yes| H["Encode 480p<br/>(H.264)"]
    E -->|No| G
    E -->|No| H
    E -->|Always| I["Encode 360p<br/>(H.264)"]
    E -->|Always| J["Encode 240p<br/>(H.264)"]
    F --> K["Generate Segments<br/>& Manifest"]
    G --> K
    H --> K
    I --> K
    J --> K
    D --> K
    K --> L["Thumbnail<br/>Extraction"]
    L --> M["Mark Video READY<br/>Update Metadata"]

    style A fill:#f3e5f5
    style B fill:#e1f5fe
    style M fill:#e8f5e9
```

DAG-based transcoding pipeline. The orchestrator creates dependent tasks: probe first, then parallel encoding at each resolution, followed by segment packaging and thumbnail extraction.

FFmpeg is the industry-standard tool at the core of nearly every transcoding pipeline. It handles container parsing, codec decoding/encoding, resolution scaling, audio extraction, and segment packaging. Production systems wrap FFmpeg in a worker service that receives tasks from the message queue, executes the FFmpeg command, uploads the output to object storage, and reports completion back to the orchestrator. Hardware acceleration via NVIDIA GPUs (NVENC) or Intel Quick Sync can speed up H.264/H.265 encoding by 5-10x compared to CPU-only encoding.
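A worker's FFmpeg invocation for a single HLS variant looks roughly like the sketch below. The FFmpeg flags are standard; the file paths and bitrate values are illustrative, and a real worker would execute the command via subprocess and then upload the resulting segments to object storage:

```python
# Build an FFmpeg command for one HLS variant (standard FFmpeg flags;
# paths and bitrates are illustrative). The worker would run this with
# subprocess.run(cmd, check=True) and upload the output afterwards.
def hls_encode_cmd(src: str, out_dir: str,
                   height: int, bitrate_k: int) -> list[str]:
    return [
        "ffmpeg", "-i", src,
        "-vf", f"scale=-2:{height}",     # scale, preserve aspect ratio
        "-c:v", "libx264", "-b:v", f"{bitrate_k}k",
        "-c:a", "aac", "-b:a", "128k",
        "-f", "hls",
        "-hls_time", "4",                # 4-second segments
        "-hls_playlist_type", "vod",
        f"{out_dir}/index.m3u8",
    ]

cmd = hls_encode_cmd("raw/video123.mp4", "out/720p", 720, 3000)
# subprocess.run(cmd, check=True)  # executed on the worker, not here
```

The orchestrator issues one such command per rung of the ladder (or per temporal chunk of each rung), which is what makes the pipeline embarrassingly parallel.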

Transcoding optimization techniques

  • Temporal chunking — Split the video into 10-30 second chunks, encode each chunk on a separate worker in parallel, and concatenate the results. This reduces wall-clock transcoding time from hours to minutes for long videos.
  • Two-pass encoding — First pass analyzes the video to build a bitrate map. Second pass uses this map to allocate bits efficiently (more bits to complex scenes, fewer to static ones). Produces 15-20% better quality at the same bitrate versus single-pass.
  • Per-title encoding — Netflix pioneered this: analyze each video individually to determine the optimal bitrate for each resolution, rather than using a fixed codec ladder. An animated film needs fewer bits than a fast-action sports clip at the same resolution.
  • Hardware acceleration — NVIDIA NVENC GPUs encode H.264 at 10-20x CPU speed. Use GPU instances for H.264/H.265 baseline encoding, CPU instances for AV1 (GPU AV1 encoders are still maturing).
  • Priority queues — High-subscriber-count creators and trending videos get priority transcoding. A video from a creator with 10M subscribers should be ready in 5 minutes, not 30.
  • Spot instance fleet — Transcoding is embarrassingly parallel and fault-tolerant (a failed chunk is simply re-queued). Use spot/preemptible instances at 60-80% cost savings. Design workers to checkpoint progress and handle preemption gracefully.
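Temporal chunking from the list above reduces to splitting an encode job into fixed-length time windows that independent workers process in parallel, and that the orchestrator can re-queue individually on spot-instance preemption. A minimal sketch (the 20-second default is an arbitrary choice within the 10-30 second range mentioned):

```python
# Split a video's timeline into independent encode windows.
# Chunk length of 20 s is an assumption within the 10-30 s range above.
def temporal_chunks(duration_s: float, chunk_s: float = 20.0):
    """Return (start, end) windows covering the whole video."""
    chunks, start = [], 0.0
    while start < duration_s:
        chunks.append((start, min(start + chunk_s, duration_s)))
        start += chunk_s
    return chunks

# A 10-minute video becomes 30 independent 20-second encode tasks;
# a preempted task re-queues just its own (start, end) window.
```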

The AV1 encoding cost trap

AV1 encoding is 10-100x slower than H.264. Encoding every video in AV1 would require an enormous compute fleet and dramatically increase transcoding costs. The practical approach is selective AV1: encode in AV1 only when the expected view count justifies the compute cost. A video viewed 10 million times saves far more in bandwidth than the one-time encoding cost. Videos with low expected viewership stay H.264-only.

The orchestrator tracks each video through the transcoding DAG and handles failures gracefully. If a single encoding task fails (worker crash, spot instance preemption), the orchestrator retries that specific task on a new worker without restarting the entire pipeline. Once all variants for a video are complete, the orchestrator triggers manifest generation (creating the HLS/DASH playlist files that reference all segments across all quality levels), updates the metadata database to mark the video as READY, and the video becomes available for viewing.
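The manifest the orchestrator generates is, for HLS, a master playlist referencing one media playlist per variant. A minimal illustrative example for three rungs of the ladder (the relative paths and exact codec strings are examples, not output from a specific encoder):

```m3u8
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080,CODECS="avc1.640028,mp4a.40.2"
1080p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=3000000,RESOLUTION=1280x720,CODECS="avc1.64001f,mp4a.40.2"
720p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1500000,RESOLUTION=854x480,CODECS="avc1.64001e,mp4a.40.2"
480p/index.m3u8
```

The player fetches this master playlist once, then picks a variant playlist based on measured throughput and switches between them at segment boundaries.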

Content Delivery — CDN Architecture

The CDN is the delivery backbone of a video platform. Without it, every viewer request would travel back to the origin data center, creating unbearable latency for distant users and overwhelming the origin with traffic that no single datacenter could handle. A well-designed CDN for video serves over 90% of requests from edge caches located physically close to the viewer, reducing round-trip latency to under 20ms and shielding the origin from the vast majority of traffic.

A video CDN operates as a hierarchical caching system with three tiers. The first tier is the edge Points of Presence (POPs), deployed in hundreds of locations worldwide, often co-located in ISP datacenters or Internet Exchange Points (IXPs). These are the caches closest to the viewer and handle the vast majority of requests. The second tier is the regional shield (also called mid-tier cache or origin shield), which sits between the edge POPs and the origin. When an edge POP has a cache miss, it requests the content from the regional shield rather than going directly to the origin. This aggregates demand from multiple edge POPs in the same region and dramatically reduces origin load. The third tier is the origin storage, the authoritative source of all transcoded video segments.

graph TD
    A["Viewer in Tokyo"] -->|"Request Segment"| B["Edge POP<br/>Tokyo"]
    C["Viewer in Osaka"] -->|"Request Segment"| D["Edge POP<br/>Osaka"]
    B -->|"Cache Miss"| E["Regional Shield<br/>Asia-Pacific"]
    D -->|"Cache Miss"| E
    E -->|"Cache Miss"| F["Origin Storage<br/>(us-east-1)"]
    F -->|"Segment Data"| E
    E -->|"Cache & Forward"| B
    E -->|"Cache & Forward"| D
    B -->|"Stream"| A
    D -->|"Stream"| C

    G["Viewer in London"] -->|"Request Segment"| H["Edge POP<br/>London"]
    H -->|"Cache HIT"| G

    style A fill:#e1f5fe
    style C fill:#e1f5fe
    style G fill:#e1f5fe
    style B fill:#e8f5e9
    style D fill:#e8f5e9
    style H fill:#e8f5e9
    style E fill:#fff3e0
    style F fill:#fce4ec

Three-tier CDN: viewers hit edge POPs first, cache misses go to regional shield, and only cold content reaches the origin. The London viewer gets a cache hit — the most common case for popular content.

Cache-fill optimization is one of the most important aspects of video CDN design. When a new video goes viral, hundreds of edge POPs worldwide will simultaneously experience cache misses for the same segments. Without coordination, they would all send requests to the origin simultaneously, creating a thundering herd that could overwhelm origin bandwidth. The regional shield solves this by collapsing duplicate requests: if the shield is already fetching segment X from origin, additional requests for segment X from other edge POPs in the region wait for the first fetch to complete and then receive the cached copy. This is called request collapsing or request coalescing.
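Request collapsing can be sketched with a lock and a per-segment event: the first requester becomes the leader and fetches from origin; everyone else waits on the same fetch. This is a simplified in-process model (real CDNs implement coalescing inside the cache server itself), and the class and method names are illustrative:

```python
import threading

class Shield:
    """Regional shield that collapses concurrent fetches for the
    same segment into a single origin request."""
    def __init__(self, fetch_from_origin):
        self.fetch_from_origin = fetch_from_origin
        self.cache = {}
        self.inflight = {}            # segment -> Event for in-progress fetch
        self.lock = threading.Lock()

    def get(self, segment):
        with self.lock:
            if segment in self.cache:
                return self.cache[segment]          # shield cache hit
            event = self.inflight.get(segment)
            leader = event is None
            if leader:
                event = self.inflight[segment] = threading.Event()
        if leader:
            data = self.fetch_from_origin(segment)  # the only origin fetch
            with self.lock:
                self.cache[segment] = data
                del self.inflight[segment]
            event.set()
            return data
        event.wait()                                # follower: wait for leader
        with self.lock:
            return self.cache[segment]
```

However many edge POPs miss simultaneously, the origin sees exactly one request per segment.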

Pre-warming the CDN for predictable viral events

For scheduled live events (sports finals, music premieres), pre-push content to edge caches before the event starts. For live streams, push the first few segments to all major POPs proactively rather than waiting for viewer-driven cache fills. This eliminates the cold-start latency spike when millions of viewers tune in simultaneously.

Geographic routing determines which edge POP serves each viewer. DNS-based routing (GeoDNS) resolves the CDN hostname to the IP address of the nearest POP based on the viewer's DNS resolver location. Anycast routing announces the same IP address from multiple POPs and relies on BGP routing to direct traffic to the closest one. In practice, large video platforms use a combination: GeoDNS for initial connection routing and anycast for failover. The routing decision also considers POP health and load: if the nearest POP is at capacity or experiencing issues, traffic is redirected to the next closest healthy POP.

CDN cache management strategies for video

  • Segment-level caching — Cache individual 2-6 second video segments rather than entire videos. This allows partial caching of long videos and fine-grained eviction based on which segments are actually watched (most viewers do not watch the entire video).
  • LRU with frequency boosting — Standard LRU eviction augmented with a frequency counter. Segments from popular videos are retained longer. A segment watched 1000 times per hour is far more valuable in cache than one watched once per day.
  • Popularity-based pre-positioning — The top 1% of videos by view count generate 80%+ of traffic (power-law distribution). Pre-position these segments at all edge POPs globally, even before viewers request them.
  • Tiered resolution caching — Cache all resolutions for popular content but only 480p and 720p for long-tail content at edge POPs. Higher resolutions for unpopular videos are served from the regional shield or origin on demand.
  • TTL and invalidation — Video segments are immutable once transcoded: the same segment file never changes. This means cache TTLs can be very long (30-90 days). Invalidation is needed only if a video is deleted or re-transcoded.
  • ISP-embedded caches — YouTube deploys Google Global Cache (GGC) servers directly inside ISP networks. This serves popular content without the video traffic ever leaving the ISP network, reducing bandwidth costs for both the ISP and the platform.
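The LRU-with-frequency-boosting idea from the list above can be sketched as an LRU that gives hot segments one "second chance" before eviction. This is a toy policy for illustration (the class name and the single-boost rule are assumptions, not a production eviction algorithm):

```python
from collections import OrderedDict

class FreqBoostedLRU:
    """LRU cache where segments with many hits get moved back to the
    MRU end once before being evicted."""
    def __init__(self, capacity, hot_threshold=3):
        self.capacity = capacity
        self.hot_threshold = hot_threshold
        self.data = OrderedDict()      # key -> (value, hits, boosted)

    def get(self, key):
        if key not in self.data:
            return None
        value, hits, boosted = self.data.pop(key)
        self.data[key] = (value, hits + 1, boosted)   # move to MRU end
        return value

    def put(self, key, value):
        if key in self.data:
            del self.data[key]
        while len(self.data) >= self.capacity:
            old_key, (v, hits, boosted) = self.data.popitem(last=False)
            if hits >= self.hot_threshold and not boosted:
                self.data[old_key] = (v, hits, True)  # one second chance
            # else: evicted for good
        self.data[key] = (value, 0, False)
```

A frequently watched segment survives eviction pressure that would flush a once-watched one, which is exactly the asymmetry the bullet describes.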

The economics of video CDN are dominated by bandwidth costs. At YouTube scale, building your own CDN (private backbone, ISP-embedded caches) is far cheaper than using commercial CDN providers. For smaller platforms, a multi-CDN strategy (using Cloudflare, Akamai, and Fastly simultaneously) provides geographic redundancy and cost optimization through competitive pricing. The CDN selection layer routes each viewer request to the CDN with the best combination of cost and performance for that specific geographic region.

The 90/10 rule of video CDN

In a well-optimized video CDN, 90% of viewer requests should be served from edge caches with under 20ms latency. Only 10% of requests (for unpopular or brand-new content) should reach the regional shield, and less than 1% should reach the origin. If your origin is serving more than 5% of total traffic, your caching strategy needs improvement.

Adaptive Bitrate Streaming — HLS & DASH

Adaptive Bitrate (ABR) streaming is the technology that allows a video player to seamlessly switch between quality levels in real time based on the viewer's network conditions. Without ABR, a viewer on a variable mobile connection would experience constant buffering as the player attempts to download segments at a fixed bitrate that exceeds available bandwidth. ABR solves this by breaking the video into small segments and allowing the player to choose a different quality level for each segment independently.

The two dominant ABR protocols are HLS (HTTP Live Streaming, developed by Apple) and DASH (Dynamic Adaptive Streaming over HTTP, an open ISO standard). Both work on the same fundamental principle: the server provides a manifest file that lists all available quality levels and the URLs of every segment at each quality level. The client player downloads the manifest, monitors network conditions, and selects the appropriate quality for each subsequent segment. The key difference is format: HLS uses M3U8 playlist files and MPEG-TS or fMP4 segments, while DASH uses an XML-based MPD (Media Presentation Description) file and fMP4 segments. In practice, modern platforms produce both HLS and DASH manifests from the same underlying fMP4 segments.

| Feature | HLS | DASH |
| --- | --- | --- |
| Developer | Apple | MPEG (ISO standard) |
| Manifest format | M3U8 (plaintext playlist) | MPD (XML) |
| Segment format | MPEG-TS or fMP4 | fMP4 |
| DRM support | FairPlay | Widevine, PlayReady |
| Browser support | Safari native, others via hls.js | Via dash.js, Shaka Player |
| Typical segment duration | 4-6 seconds | 2-4 seconds |
| Live streaming | Excellent (Apple-native) | Excellent (low-latency extensions) |
| Industry adoption | Dominant for Apple devices | Dominant for Android and web |

The manifest file is the control plane of adaptive streaming. A master HLS manifest (multi-variant playlist) lists all available quality levels with their resolution, bitrate, and codec. For each quality level, there is a media playlist that lists the URLs of individual segments. When the player starts, it downloads the master manifest, selects an initial quality level (often the lowest for fast startup), and begins downloading segments. As it accumulates data on available bandwidth, it switches to higher quality levels.
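A master manifest for a three-rung ladder might look like the following. This is a hand-written illustration: the bitrates, resolutions, and relative playlist paths are representative values, not taken from any real platform.

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360,CODECS="avc1.4d401e,mp4a.40.2"
360p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720,CODECS="avc1.4d401f,mp4a.40.2"
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080,CODECS="avc1.640028,mp4a.40.2"
1080p/playlist.m3u8
```

Each referenced media playlist then lists the individual segment URLs for that quality level; the player's ABR logic only ever chooses among the rungs declared here.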

The ABR algorithm in the client player is where the intelligence lives. The algorithm must balance three competing objectives: maximize video quality (use the highest bitrate the network can sustain), minimize rebuffering (never let the playback buffer run dry), and minimize quality oscillation (avoid rapidly switching between low and high quality, which is visually distracting). The most common approach is buffer-based ABR: the algorithm maintains a target buffer level (typically 30-60 seconds of video) and adjusts quality based on how full the buffer is. When the buffer is high (well above target), the algorithm selects higher quality. When the buffer is draining (below target), it switches to lower quality aggressively to prevent rebuffering.

ABR algorithm components

  • Throughput estimation — Measure the download speed of each segment. Use a weighted moving average (recent segments weighted more heavily) to estimate available bandwidth. Be conservative: use the 70th percentile of recent measurements rather than the average to account for variability.
  • Buffer-level monitoring — Track how many seconds of video are buffered ahead of the playback position. Define thresholds: panic zone (under 5 seconds — switch to lowest quality immediately), low zone (5-15 seconds — prefer lower quality), target zone (15-40 seconds — maintain current quality), high zone (above 40 seconds — can increase quality).
  • Quality selection logic — Choose the highest bitrate that is sustainable given current throughput estimate AND is appropriate for the current buffer level. Never select a bitrate higher than 80% of estimated throughput to maintain a safety margin.
  • Startup optimization — During initial playback, start at low quality for fast time-to-first-frame, then ramp up quality aggressively over the first 3-5 segments as bandwidth estimates stabilize. Users tolerate 3-5 seconds of low quality at startup far better than 3-5 seconds of buffering.
  • Quality lock-in — Once the algorithm selects a quality level, require it to sustain that level for at least 3-4 segments before switching again. This prevents rapid oscillation that is visually jarring.
  • Viewer bandwidth prediction — For returning viewers, use historical bandwidth data for their network to select a better initial quality level rather than always starting at the lowest.
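The throughput-estimation and buffer-zone rules above can be combined into a small selection function. A simplified sketch: it uses the harmonic mean as the conservative estimator (one of the two options the text mentions), and the zone thresholds and `pick_bitrate` name are illustrative:

```python
def pick_bitrate(ladder_kbps, throughput_samples_kbps, buffer_s, safety=0.8):
    """Pick the next segment's bitrate from the ladder.

    ladder_kbps: available variant bitrates, ascending.
    throughput_samples_kbps: recent per-segment download speeds.
    buffer_s: seconds of video buffered ahead of playback.
    """
    if buffer_s < 5:                       # panic zone: lowest quality now
        return ladder_kbps[0]
    n = len(throughput_samples_kbps)
    est = n / sum(1.0 / s for s in throughput_samples_kbps)  # harmonic mean
    cap = est * safety                     # never exceed 80% of the estimate
    if buffer_s < 15:                      # low zone: bias further downward
        cap = min(cap, est * 0.5)
    candidates = [b for b in ladder_kbps if b <= cap]
    return candidates[-1] if candidates else ladder_kbps[0]
```

The harmonic mean is dragged down by slow samples, so a single bad segment download immediately makes the estimate, and therefore the chosen rung, more cautious.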

Using average throughput for ABR decisions

A common mistake is using the arithmetic mean of recent throughput measurements for quality selection. Network throughput is bursty and often has a heavy tail of slow measurements. Using the average leads to over-estimating available bandwidth, selecting too-high quality, and causing rebuffering. Use the harmonic mean or a conservative percentile (e.g., 70th percentile) instead.

Segment duration is a critical design parameter with cascading effects. Shorter segments (2 seconds) enable faster quality switching and lower latency for live streams, but they increase the number of HTTP requests (higher overhead), increase manifest file size, and slightly reduce encoding efficiency (each segment must start with a keyframe, and shorter segments have a higher percentage of keyframes). Longer segments (6-10 seconds) improve encoding efficiency and reduce request overhead but make quality switching sluggish and increase live-stream latency. The industry consensus for on-demand video is 4-6 second segments, with 2-4 seconds for live streams where latency matters.

Low-latency live streaming extensions

Standard HLS/DASH with 6-second segments introduces 15-30 seconds of end-to-end latency for live streams (3-5 segments of buffer). Low-Latency HLS (LL-HLS) and Low-Latency DASH use partial segments (sub-second chunks pushed to the player before the full segment is complete) to reduce live latency to 2-5 seconds. This requires the CDN to support chunked transfer encoding and the player to handle partial segment downloads.

In an interview, discussing ABR demonstrates understanding of client-server interaction, the tension between quality and reliability, and how user experience metrics (rebuffer rate, startup time, average quality) drive system design decisions. The best candidates discuss how ABR algorithms are tuned using A/B testing with real viewer experience metrics.

Video Metadata & Search

Every video on the platform is more than just a media file. It is accompanied by a rich metadata record that powers search, recommendations, content moderation, and the viewing experience. The metadata system is the connective tissue of the platform, linking the video content in object storage to the user-facing features in the application. Designing this layer correctly is essential because metadata operations happen at much higher frequency than video operations: every page load, every search query, and every recommendation request reads metadata, while video files are written once and read through the CDN.

The metadata record for each video includes core fields (video_id, owner_id, title, description, tags, category, upload_timestamp, duration, status), technical fields (source_resolution, source_codec, source_framerate, transcoded_variants with paths to each quality level in object storage), engagement fields (view_count, like_count, comment_count, share_count), and derived fields (thumbnail_urls, auto_generated_captions, content_fingerprint_hash). This record lives in a relational database (PostgreSQL) for transactional integrity, with denormalized copies in Elasticsearch for full-text search and in Redis for low-latency serving.

| Metadata Category | Fields | Storage | Access Pattern |
| --- | --- | --- | --- |
| Core metadata | video_id, title, description, tags, owner, status | PostgreSQL (primary) | Read on every page view, written on upload |
| Technical metadata | resolution, codec, duration, segment paths | PostgreSQL + object store manifest | Read by player at stream start |
| Engagement counters | views, likes, comments, shares | Redis (hot) + PostgreSQL (durable) | Incremented on every interaction, read on every page view |
| Search index | title, description, tags, captions | Elasticsearch | Queried on every search, updated on upload/edit |
| Content fingerprint | Perceptual hash, audio fingerprint | Dedicated fingerprint DB | Queried during upload for copyright detection |
| Thumbnails | auto_thumbnails[], custom_thumbnail_url | Object storage + CDN | Read on every video card render in feeds/search |

Thumbnail generation deserves special attention because thumbnails are the primary visual element driving click-through rates. The system generates thumbnails at multiple stages. During transcoding, the pipeline extracts frames at regular intervals (every 5-10 seconds) and selects 3-5 candidate thumbnails based on visual quality heuristics: prefer frames with good contrast, clear subjects, and no motion blur. These candidates are stored in object storage and presented to the creator for selection. Advanced systems use ML models trained on click-through rate data to rank thumbnails by predicted engagement, automatically selecting the thumbnail most likely to attract views.

Thumbnail optimization is a business-critical feature

YouTube's internal data shows that the thumbnail drives roughly 90% of the click decision for video recommendations. A better thumbnail can increase a video's click-through rate by 30-50%. This is why YouTube invests heavily in thumbnail analysis ML models and provides creators with detailed thumbnail performance analytics. In your system design, thumbnails should be treated as first-class objects with their own CDN caching and A/B testing infrastructure.

Search is powered by Elasticsearch indexing the video metadata. When a user searches for "how to cook pasta", the query hits an Elasticsearch cluster that matches against the title, description, tags, and auto-generated captions (produced by speech-to-text during transcoding). Results are ranked by a combination of text relevance (BM25 score), engagement signals (videos with higher view counts and engagement rates rank higher), recency (newer videos get a freshness boost), and creator authority (channels with more subscribers rank higher for their topic area). The search pipeline also handles query understanding: spelling correction, query expansion (adding synonyms), and intent classification (is the user looking for a tutorial, a review, or entertainment?).

Search and discovery pipeline components

  • Full-text search index — Elasticsearch cluster indexing title, description, tags, and auto-captions. Supports fuzzy matching, phrase matching, and field-weighted scoring.
  • Query understanding — NLP layer that processes the raw search query: spelling correction, entity recognition (is this a channel name or a topic?), query expansion with synonyms, and language detection.
  • Ranking model — ML ranking model that combines text relevance score with engagement signals, creator authority, freshness, and personalization signals to produce the final result ordering.
  • Recommendation engine — Collaborative filtering (users who watched X also watched Y) combined with content-based features (video embeddings from visual and audio analysis). Powers the homepage feed and "up next" suggestions.
  • Content-based features — Video embedding vectors generated by analyzing visual frames and audio content during transcoding. Enable "more like this" recommendations even for new videos with no engagement data.
  • Real-time engagement updates — Engagement counters (views, likes) are updated in near-real-time via a Kafka pipeline. The search index is refreshed periodically (every 5-15 minutes) to reflect updated engagement scores.
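As a concrete illustration, the field-weighted full-text query for the pasta example might look like the following Elasticsearch query DSL. The index schema is an assumption mirroring the fields listed above (in particular the `captions` field and the boost values), not a documented real-world mapping:

```json
{
  "query": {
    "multi_match": {
      "query": "how to cook pasta",
      "fields": ["title^3", "tags^2", "description", "captions"],
      "type": "best_fields",
      "fuzziness": "AUTO"
    }
  }
}
```

The BM25 score from this query is only the text-relevance input; the ranking model then blends it with engagement, freshness, authority, and personalization signals to produce the final ordering.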

Engagement counters (view count, like count) present a classic distributed systems challenge. View counts must be accurate enough for creator analytics and monetization but are updated at extreme rates (a viral video may receive thousands of views per second). The solution is a write-behind pattern: increment a per-video counter in Redis (or an in-memory counter on the application server), and flush the accumulated count to PostgreSQL periodically (every 30-60 seconds) in a batch write. This reduces database write load by orders of magnitude while keeping the displayed count accurate to within a minute of real-time.
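The write-behind pattern can be sketched in a few lines. A minimal in-process model: the in-memory `Counter` stands in for Redis, `persist` stands in for the batched PostgreSQL write, and in a real system `flush` would run on a 30-60 second timer:

```python
import threading
from collections import Counter

class ViewCounter:
    """Write-behind view counter: increments accumulate in memory
    and are flushed to the durable store in one batch."""
    def __init__(self, persist):
        self.persist = persist          # callable taking {video_id: delta}
        self.pending = Counter()
        self.lock = threading.Lock()

    def increment(self, video_id, n=1):
        with self.lock:
            self.pending[video_id] += n   # O(1), no database write

    def flush(self):
        with self.lock:                   # swap out the batch atomically
            batch, self.pending = dict(self.pending), Counter()
        if batch:
            self.persist(batch)           # one batched durable write
```

A viral video receiving thousands of views per second generates one database write per flush interval instead of thousands per second.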

Separating the hot path from the cold path

Video metadata has two distinct access patterns. The hot path serves real-time page views and needs sub-10ms latency: use Redis with the most frequently accessed fields. The cold path serves search, analytics, and admin operations: use PostgreSQL and Elasticsearch. Never block the hot path on a cold-path query. Denormalize aggressively for the hot path and accept eventual consistency between the two.

Content Moderation & Copyright

Content moderation is both a legal requirement and an existential concern for any video platform. Without effective moderation, the platform becomes a vector for copyright infringement, hate speech, child exploitation, terrorism content, and misinformation. Governments worldwide have enacted laws requiring platforms to remove certain categories of content within hours of notification, with severe penalties for non-compliance. At the same time, over-moderation (removing legitimate content) alienates creators and harms free expression. The moderation system must balance accuracy, speed, and fairness at massive scale.

Copyright detection is the most technically mature aspect of content moderation. YouTube's Content ID system is the gold standard: it maintains a database of reference files (audio and video fingerprints) submitted by copyright holders. Every uploaded video is fingerprinted and compared against this reference database. If a match is found, the copyright holder's policy is applied automatically: block the video, monetize it (run ads and share revenue with the rights holder), or track its viewership. Content ID processes over 500 hours of video per minute and makes a match decision within the transcoding pipeline, before the video becomes publicly available.

graph TD
    A["New Video Upload"] --> B["Transcoding Pipeline"]
    B --> C["Content Fingerprinting<br/>(Audio + Visual Hash)"]
    C --> D{"Match in Copyright<br/>Reference DB?"}
    D -->|"Match Found"| E{"Copyright Holder<br/>Policy?"}
    D -->|"No Match"| F["Automated Safety<br/>Scanning (ML)"]
    E -->|"Block"| G["Video Blocked<br/>Creator Notified"]
    E -->|"Monetize"| H["Video Published<br/>Ads Enabled for Rights Holder"]
    E -->|"Track"| I["Video Published<br/>Analytics Shared"]
    F --> J{"Safety Flags<br/>Detected?"}
    J -->|"High Confidence Violation"| K["Auto-Remove<br/>+ Creator Strike"]
    J -->|"Low Confidence / Edge Case"| L["Human Review Queue"]
    J -->|"Clean"| M["Video Published"]
    L --> N{"Human Reviewer<br/>Decision"}
    N -->|"Violation Confirmed"| K
    N -->|"False Positive"| M

    style G fill:#fce4ec
    style K fill:#fce4ec
    style M fill:#e8f5e9
    style H fill:#fff3e0
    style L fill:#fff9c4

Content moderation flow: copyright fingerprinting runs first, then automated safety scanning, with human review for edge cases. Multiple outcomes are possible depending on policy and confidence level.

Audio fingerprinting works by extracting a compact perceptual hash from the audio track. The algorithm analyzes spectral features (frequency components over time) and produces a fingerprint that is robust to common transformations: re-encoding, volume changes, slight pitch shifts, and even background noise. The reference database stores fingerprints for millions of songs, movies, TV shows, and other copyrighted content. The matching algorithm compares the upload's fingerprint against the reference database using locality-sensitive hashing (LSH) to find approximate nearest neighbors efficiently, even at the scale of millions of reference tracks.

Visual fingerprinting complements audio matching by analyzing the video frames themselves. It computes perceptual hashes of keyframes that are invariant to resolution changes, color adjustments, cropping, and mild geometric transformations. This catches cases where the audio is replaced or muted to evade audio fingerprinting. The combination of audio and visual fingerprinting catches over 98% of exact and near-exact copies of copyrighted content.
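The matching step can be illustrated with Hamming distance between perceptual hashes. This is a toy linear scan for clarity; as the text notes, at millions of reference tracks the real system uses LSH so only candidates sharing a hash band are compared. The function names and the distance threshold are illustrative:

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fixed-width perceptual hashes."""
    return bin(a ^ b).count("1")

def find_match(upload_hash, reference_db, max_distance=8):
    """Return the (ref_id, distance) of the closest reference fingerprint
    within max_distance, or None if nothing is close enough."""
    best = None
    for ref_id, ref_hash in reference_db.items():
        d = hamming(upload_hash, ref_hash)
        if d <= max_distance and (best is None or d < best[1]):
            best = (ref_id, d)
    return best
```

Because perceptual hashes change only a few bits under re-encoding or volume changes, a small distance threshold catches near-exact copies while unrelated content lands far away.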

The DMCA notice and counter-notice workflow

When automated systems miss a copyright violation, the Digital Millennium Copyright Act (DMCA) provides a legal framework. The copyright holder sends a takedown notice. The platform must remove the content expeditiously (typically within 24-48 hours). The uploader can file a counter-notice if they believe the claim is invalid. The platform then restores the content after 10-14 business days unless the claimant files a lawsuit. This workflow must be implemented as a first-class feature with tracking, notifications, and audit trails.

Safety moderation components

  • Nudity and explicit content detection — CNN-based image classifier running on sampled frames. Flag videos exceeding threshold for human review or automatic age-restriction.
  • Violence and graphic content — ML models trained on labeled examples of violent imagery. Different thresholds for news/documentary context versus gratuitous violence.
  • Hate speech detection — NLP models analyzing auto-generated captions and video metadata for hateful language. Multilingual support is critical and challenging.
  • Terrorist content identification — Hash-sharing consortium (GIFCT) provides shared database of known terrorist content hashes. Platform checks uploads against this database in addition to internal models.
  • Child safety (CSAM) — Highest-priority moderation category. Uses PhotoDNA hash matching and ML classifiers. Any match is immediately blocked, reported to NCMEC (legal requirement in the US), and the account is flagged for investigation.
  • Misinformation detection — Most challenging category due to subjectivity. Typically handled through information panels (linking to authoritative sources) rather than removal, except for health misinformation during pandemics.

The human review pipeline handles the cases where automated systems are not confident. Reviewers are presented with flagged content alongside the ML confidence score and the reason for flagging. They make a binary decision (violation/not-violation) with a category label. Reviewer decisions are used as training data to improve the ML models over time. The human review queue must be prioritized: child safety content is reviewed within 1 hour, terrorist content within 4 hours, and general policy violations within 24 hours. Large platforms employ thousands of content reviewers, often through third-party contractors, with strict guidelines for reviewer well-being (mandatory breaks, counseling access) given the disturbing nature of much flagged content.
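The SLA-ordered queue can be sketched with a heap keyed on category priority, with a FIFO tie-break inside each category. This is a single-process sketch (class and category names are illustrative); a real system would also factor in wait time and shard across reviewer pools:

```python
import heapq
import itertools

# Lower number = reviewed first, per the SLAs above
PRIORITY = {"child_safety": 0, "terrorism": 1, "general": 2}

class ReviewQueue:
    """Priority queue for the human review pipeline."""
    def __init__(self):
        self.heap = []
        self.counter = itertools.count()   # FIFO tie-break within a class

    def add(self, video_id, category):
        heapq.heappush(self.heap,
                       (PRIORITY[category], next(self.counter), video_id))

    def next_case(self):
        return heapq.heappop(self.heap)[2] if self.heap else None
```

A child-safety flag filed last still jumps ahead of every general policy case already in the queue.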

Moderation should not block the upload pipeline

Run copyright and safety scanning in parallel with transcoding, not sequentially. If moderation scanning takes longer than transcoding, the video enters a LIMITED state where it is technically ready to stream but visible only to the uploader. Once moderation clears, it transitions to PUBLIC. This prevents moderation from adding latency to the creator experience while still preventing violating content from reaching a broad audience.

Appeals and creator communication complete the moderation system. Creators whose content is removed must be able to appeal the decision, understand why it was removed, and receive a timely response. A creator who receives three copyright strikes within 90 days has their channel terminated. This strike system, while imperfect, provides a framework for proportional enforcement. The appeals process should include both automated re-review (re-running the ML classifier with a lower threshold) and human escalation for disputed cases.

Scaling Storage, Compute & Cost Optimization

At video platform scale, cost optimization is not a nice-to-have — it is an existential requirement. YouTube reportedly operates at near break-even despite generating over $30 billion in annual revenue, because the infrastructure costs of storing, transcoding, and delivering billions of hours of video are staggering. Every architectural decision must be evaluated through the lens of cost per hour of video stored, transcoded, and delivered.

Storage costs are managed through a tiered lifecycle policy. Not all videos are equal: the top 1% of videos by view count generate roughly 80% of all views (power-law distribution). Yet the long tail of rarely-watched videos accounts for the vast majority of storage volume. The solution is to move videos between storage tiers based on access frequency. Hot storage (SSD-backed or high-IOPS object storage) holds videos with recent views (last 30 days). Warm storage (standard object storage like S3 Standard) holds videos with occasional views (30-180 days since last view). Cold storage (S3 Glacier, S3 Glacier Deep Archive) holds videos that have not been viewed in over 6 months. The transition between tiers is automatic, driven by a lifecycle policy that monitors per-video access patterns.

| Storage Tier | Cost per GB/Month | Access Latency | Use Case | Percentage of Videos |
| --- | --- | --- | --- | --- |
| Hot (SSD / S3 Standard) | $0.023 | Milliseconds | Actively watched videos, recently uploaded | 5-10% |
| Warm (S3 Standard-IA) | $0.0125 | Milliseconds | Videos with occasional views (weekly/monthly) | 15-25% |
| Cold (S3 Glacier Instant) | $0.004 | Milliseconds | Rarely watched but still accessible instantly | 30-40% |
| Archive (S3 Glacier Deep Archive) | $0.00099 | Hours (retrieval required) | Videos with zero views in 6+ months | 30-40% |

The cost difference between tiers is enormous: hot storage costs roughly 23x more than deep archive per GB. For a platform storing 1 exabyte of video, moving 30% of content from hot to cold storage saves approximately $5-6 million per month. The key challenge is predicting which videos will be accessed. A video dormant for a year might suddenly go viral due to a cultural event or social media trend. The system must handle this gracefully: Glacier Instant content remains playable immediately (at a higher per-GB retrieval cost), but when a viewer requests a video in Deep Archive, the system initiates an asynchronous retrieval, serves a "video loading" message, and the video becomes playable within hours. Simultaneously, the video is promoted back to warm or hot storage based on the access pattern.
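The lifecycle policy itself is a small decision function over access signals. A sketch using the tier boundaries from the text (the exact thresholds and the `choose_tier` name are illustrative):

```python
def choose_tier(days_since_last_view, view_rate_per_day):
    """Map a video's access pattern to a storage tier."""
    if days_since_last_view <= 30 or view_rate_per_day >= 1:
        return "hot"        # recent or sustained viewership
    if days_since_last_view <= 180:
        return "warm"       # occasional views in the last 6 months
    if view_rate_per_day > 0:
        return "cold"       # rare views, still instant (Glacier Instant)
    return "archive"        # zero views in 6+ months -> Deep Archive
```

The same function, run in reverse on a cold-storage access event, drives promotion back up the tiers.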

Keep one variant hot for cold-stored videos

Even for deeply archived videos, keep the lowest quality variant (240p or 360p) in warm storage. When a viewer requests a cold-stored video, serve the low-quality variant immediately while retrieving the higher-quality variants from cold storage in the background. This provides an acceptable viewing experience within seconds rather than making the viewer wait for the full retrieval.

Transcoding compute costs are the second-largest expense. The key optimization is using spot instances (AWS) or preemptible VMs (GCP) for the transcoding worker fleet. Spot instances offer 60-90% cost savings compared to on-demand pricing, with the trade-off that they can be terminated with two minutes' notice when the cloud provider needs the capacity. Transcoding workloads are ideal for spot instances because they are embarrassingly parallel, stateless, and fault-tolerant. If a spot instance is reclaimed while encoding a segment, the orchestrator simply re-queues that segment to another worker. The worker writes completed segments to object storage incrementally, so no work is lost beyond the current segment in progress.

Cost per hour of video: a worked example

  1. Transcoding cost: A 1-hour video requires approximately 6-10 CPU-hours across all variants. At spot pricing of $0.01/CPU-hour, transcoding costs roughly $0.06-0.10 per hour of video.

  2. Storage cost: The transcoded output for a 1-hour video across all variants is approximately 3-5 GB. At blended storage cost of $0.008/GB/month (mix of tiers), annual storage cost is approximately $0.29-0.48 per hour of video per year.

  3. Bandwidth cost (viewing): If the average video is watched 100 times at an average bitrate of 3 Mbps, total egress per hour of video is roughly 135 GB. At CDN cost of $0.01/GB, viewing bandwidth costs $1.35 per hour of video over its lifetime.

  4. Moderation cost: Automated scanning costs approximately $0.001-0.01 per video. Human review (for flagged content) costs approximately $0.50-1.00 per video but applies to only 1-5% of uploads.

  5. Total cost per hour of video: Approximately $0.10 (transcode) + $0.40/year (storage) + $1.35 (bandwidth over lifetime) = roughly $1.85 in the first year. Bandwidth dominates, which is why CDN optimization is the highest-leverage cost reduction effort.
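The arithmetic above can be checked in a few lines. The defaults below are mid-range figures taken from the example (4 GB stored, 100 views at 3 Mbps, $0.01/GB CDN, blended $0.008/GB/month storage):

```python
def first_year_cost(transcode=0.10, storage_gb=4, storage_per_gb_month=0.008,
                    views=100, bitrate_mbps=3, cdn_per_gb=0.01,
                    moderation=0.005):
    """First-year cost of one 1-hour video, in dollars."""
    storage = storage_gb * storage_per_gb_month * 12          # ~$0.38/year
    # 3 Mbps -> 0.375 MB/s -> 1.35 GB per hour-long view
    egress_gb = views * bitrate_mbps / 8 * 3600 / 1000
    bandwidth = egress_gb * cdn_per_gb                        # ~$1.35
    return round(transcode + storage + bandwidth + moderation, 2)
```

With these inputs the total lands at about $1.84, matching the example's "roughly $1.85", and the bandwidth term is over 70% of it.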


Advanced cost optimization strategies

  • Selective encoding — Do not produce 4K or AV1 variants for videos unlikely to get significant views. Encode the full ladder only when a video crosses a view threshold (e.g., 1000 views). Most videos never reach this threshold.
  • Content-aware encoding (per-title) — Netflix-style per-title encoding analyzes each video to determine the minimum bitrate needed at each resolution. Animated content needs 40-60% fewer bits than live action. This saves 20-30% bandwidth on average.
  • Deduplication at segment level — Identical video segments (common in re-uploads and compilations) can be deduplicated at the storage level using content-hash indexing, saving 10-15% of total storage.
  • Regional encoding priority — Encode videos primarily for the geographic region of the creator first (if a video is uploaded from India, prioritize codec/resolution combinations popular in India). Encode for other regions only if the video gains traction there.
  • Off-peak transcoding — Queue non-urgent transcoding jobs for off-peak hours when spot instance prices are lowest. A video uploaded at 3 PM can be transcoded during the 2-6 AM window when compute demand (and cost) drops significantly.
  • CDN traffic shaping — During peak hours, slightly reduce the maximum served bitrate (e.g., cap at 1080p instead of 4K) to reduce bandwidth costs. Netflix famously throttled bitrate during COVID-19 peak to manage bandwidth pressure across the entire internet.
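Several of these strategies reduce to simple policy decisions. The selective-encoding rule, for example, can be sketched as below — the variant names, ladder contents, and view threshold are illustrative assumptions, not a real platform's configuration:

```python
# Hypothetical selective-encoding policy: every upload gets a cheap H.264
# baseline; the expensive variants (4K, AV1) are queued only after the video
# crosses a popularity threshold. Most videos never trigger the full ladder.

BASE_LADDER = ["h264_360p", "h264_720p"]
FULL_LADDER = BASE_LADDER + ["h264_1080p", "h264_2160p",
                             "av1_720p", "av1_1080p", "av1_2160p"]
VIEW_THRESHOLD = 1000  # from the text: most videos never reach this

def variants_to_encode(view_count: int, already_encoded: set) -> list:
    """Return the variants still missing for this video's current tier."""
    target = FULL_LADDER if view_count >= VIEW_THRESHOLD else BASE_LADDER
    return [v for v in target if v not in already_encoded]

# A fresh upload gets only the baseline variants:
print(variants_to_encode(0, set()))
# Once it crosses 1000 views, queue the remaining expensive variants:
print(variants_to_encode(1500, {"h264_360p", "h264_720p"}))
```

In practice this check would run as a periodic job over view counters, emitting transcode tasks into the same priority queue as new uploads.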

Do not over-optimize storage at the expense of retrieval latency

Moving too much content to deep archive creates a poor experience when those videos are accessed. A viewer who clicks on a link to a year-old video and sees "this video is being retrieved, please wait several hours" will leave and never return. Set aggressive lifecycle policies only for videos with truly zero traffic, and always keep at least a low-quality variant instantly accessible.

Monitoring and attribution are essential for cost optimization. The system must track cost per video, cost per view, and cost per creator. This enables business decisions like: which creators generate positive unit economics? Which content categories are most expensive to serve? Are there videos consuming disproportionate resources (very long, very high resolution) that should be subject to different upload policies? Cost attribution data feeds into capacity planning, pricing decisions, and the design of creator incentive programs.

How this might come up in interviews

Designing a video platform is one of the most comprehensive system design interview questions at Google, Netflix, Meta, and Amazon. It appears as "Design YouTube", "Design Netflix", "Design a Video Streaming Service", or "Design a Live Streaming Platform". Interviewers love this question because it tests breadth across storage, compute, networking, streaming protocols, content moderation, and cost optimization simultaneously. At L4-L5, candidates should demonstrate the upload-transcode-store-deliver pipeline, adaptive bitrate streaming basics, and capacity estimation. At L6+, candidates must go deeper into codec selection trade-offs, CDN cache hierarchy design, transcoding DAG orchestration, content moderation architecture, and cost optimization strategies with real numbers.

Common questions:

  • L4-L5: Design the core video upload and viewing pipeline for a platform with 10 million daily active users. Walk through how a video gets from the creator's device to a viewer's screen, including upload, storage, transcoding, and delivery. [Tests: end-to-end pipeline understanding, object storage, message queues, basic CDN concepts, ability to estimate storage and bandwidth requirements]
  • L5: Explain how adaptive bitrate streaming works. What happens when a viewer's network degrades from 10 Mbps to 2 Mbps while watching a 1080p video? Walk through the manifest file, segment structure, and quality switching logic. [Tests: HLS/DASH protocol understanding, segment-based streaming, ABR algorithm basics, buffer management, user experience trade-offs]
  • L5-L6: Your video platform is spending $5M per month on CDN bandwidth. Design an optimization strategy to reduce this by 30% without degrading viewer experience. [Tests: CDN cache hierarchy (edge, shield, origin), popularity-based caching, codec efficiency (AV1 migration), tiered resolution caching, ISP-embedded cache strategies]
  • L6: Design the transcoding pipeline for a platform processing 500 hours of video per minute. How do you parallelize transcoding, handle failures, prioritize creator tiers, and manage compute costs? [Tests: DAG-based task orchestration, temporal chunking, spot instance strategy, priority queue design, codec ladder decisions]
  • L6: A live 4K stream event is scheduled to have 10 million concurrent viewers. How do you ensure the CDN can handle the traffic without degradation? What pre-event preparation is needed? [Tests: CDN pre-warming, capacity reservation, live vs on-demand infrastructure isolation, regional shield design, graceful degradation strategies]
  • L6-L7: Design the complete YouTube system end-to-end: upload, transcoding, storage, CDN, adaptive streaming, metadata and search, content moderation (copyright + safety), recommendation engine integration, and cost optimization. Walk through capacity planning for 500 hours uploaded per minute and 1 billion hours watched per day. [Tests: holistic system architecture, ability to make trade-offs across all components, cost analysis with real numbers, content moderation pipeline, codec strategy, storage lifecycle management]

Key takeaways

  • A video platform is a write-once, read-millions system shaped by extreme read-write asymmetry. The upload path is asynchronous and compute-intensive (transcoding), while the viewing path is synchronous, latency-sensitive, and bandwidth-intensive (CDN delivery). Design these paths as independent systems connected by a message queue.
  • The codec ladder (encoding each video into multiple resolution/bitrate variants) and adaptive bitrate streaming (HLS/DASH) work together to deliver the best possible quality for each viewer given their device and network conditions. Segment-based streaming with 4-6 second segments enables seamless quality switching without rebuffering.
  • CDN architecture is the core of the viewing experience. A three-tier cache hierarchy (edge POP, regional shield, origin) with popularity-based pre-positioning ensures that 90%+ of requests are served from edge caches under 20ms away from the viewer. Request collapsing at the shield tier prevents thundering herd on viral content.
  • Codec selection (H.264 vs H.265 vs AV1) is a strategic decision trading off compression efficiency against encoding cost, device support, and licensing. The practical approach is dual-codec: H.264 as universal baseline plus AV1 for high-traffic content where bandwidth savings justify the encoding cost.
  • Cost optimization at video scale requires tiered storage (hot/warm/cold/archive lifecycle), spot instances for transcoding, selective encoding (full codec ladder only for popular content), and per-title encoding. Bandwidth is the dominant cost, making CDN efficiency and codec optimization the highest-leverage investments.

Before you move on: can you answer these?

Your video platform serves 100 million daily active viewers. A new viral trend causes upload volume to spike to 3x normal for 48 hours. How does your system handle this without impacting viewer experience?

The upload pipeline is decoupled from the viewing pipeline via the message queue, so upload spikes do not directly impact viewers. The transcoding worker fleet auto-scales based on queue growth rate, provisioning additional spot instances within 2-5 minutes of detecting the spike. If spot capacity is insufficient, the system falls back to on-demand instances at higher cost. Priority queues ensure high-subscriber-count creator videos are processed first. If the backlog exceeds the SLA threshold (30 minutes), non-critical transcoding tasks (AV1 re-encoding, thumbnail regeneration) are paused to free capacity for primary encoding. The viewing path is unaffected because it is served entirely from CDN edge caches and does not depend on the transcoding pipeline.
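The auto-scaling decision in this answer is a queue-drain calculation. A minimal sketch, where the 30-minute SLA matches the text but the per-worker throughput is an assumed figure for illustration:

```python
import math

# Queue-driven autoscaling sketch: size the transcoding fleet so it absorbs
# new arrivals AND drains the existing backlog within the SLA window.
# Per-worker throughput (0.1 video-hours/min) is an assumption for the example.

def workers_needed(backlog_video_hours: float,
                   arrival_rate_video_hours_per_min: float,
                   sla_minutes: float = 30,
                   worker_video_hours_per_min: float = 0.1) -> int:
    required_rate = (arrival_rate_video_hours_per_min
                     + backlog_video_hours / sla_minutes)
    return math.ceil(required_rate / worker_video_hours_per_min)

# Steady state: 500 video-hours/min uploaded, no backlog
print(workers_needed(0, 500))        # 5000
# 3x viral spike with a 2000 video-hour backlog already queued
print(workers_needed(2000, 1500))
```

The gap between the two numbers is what the spot-instance fleet must cover within minutes — and why falling back to on-demand capacity is the pressure valve when spot supply runs out.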

Explain the trade-offs of segment duration in adaptive bitrate streaming. Why does YouTube use approximately 5-second segments for on-demand video?

Shorter segments (2 seconds) enable faster quality adaptation and lower live-stream latency because the player can switch quality at each segment boundary. However, shorter segments increase HTTP request overhead (more requests per video), increase manifest file size, and reduce encoding efficiency because each segment must begin with an IDR keyframe. Longer segments (10 seconds) improve encoding efficiency and reduce request overhead but make quality switching sluggish and increase live-stream latency. The 5-second sweet spot balances quick adaptation (quality switch within 5 seconds of network change) with acceptable encoding overhead (keyframe overhead is roughly 2-3% of segment size at 5 seconds). For live streaming where latency matters more, 2-3 second segments or partial segment delivery (LL-HLS) is preferred despite the overhead.
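The request-overhead side of this trade-off is simple arithmetic, sketched below (figures are illustrative):

```python
# Segment-duration trade-off: shorter segments mean more HTTP requests per
# hour of video but faster worst-case quality adaptation, since the player
# can only switch quality at segment boundaries.

def segment_tradeoff(segment_seconds: float, video_minutes: float = 60):
    total_seconds = video_minutes * 60
    requests = total_seconds / segment_seconds  # one HTTP fetch per segment
    worst_case_switch = segment_seconds         # switch only at boundaries
    return requests, worst_case_switch

for s in (2, 5, 10):
    reqs, switch = segment_tradeoff(s)
    print(f"{s}s segments: {reqs:.0f} requests/hour, <= {switch}s to adapt")
```

At 2-second segments a one-hour video costs 1800 requests versus 720 at 5 seconds — and each extra segment also carries its own IDR keyframe, which is where the encoding-efficiency penalty comes from.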

How would you design the storage tier lifecycle policy for a platform with 500 million videos where 80% have not been viewed in the past 6 months?

Implement a four-tier lifecycle: Hot (viewed in last 30 days) on S3 Standard, Warm (viewed in 30-180 days) on S3 Standard-IA, Cold (no views in 180 days) on S3 Glacier Instant Retrieval, and Archive (no views in 365+ days) on Glacier Deep Archive. For cold and archive videos, keep the lowest quality variant (240p) on Standard-IA for instant playback. When a cold video is accessed, serve the 240p variant immediately while retrieving higher-quality variants asynchronously. Promote the video back to warm tier on access. Monitor for seasonal access patterns — some videos spike annually (holiday content, tax tutorials) and should be pre-promoted before their expected access period. The 80% figure means roughly 400 million videos on cold or archive storage, saving approximately 85-90% on storage costs for that content compared to keeping everything on hot storage.
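The tier-assignment rule described in this answer can be sketched as a small policy function. The day thresholds and the always-keep-240p exception follow the answer above; the function and tier names are illustrative, not a real lifecycle API:

```python
# Four-tier lifecycle policy sketch. Tier names map to the S3 classes named
# in the text; thresholds are days since the video was last viewed.

def storage_tier(days_since_last_view: int) -> str:
    if days_since_last_view <= 30:
        return "hot"        # S3 Standard
    if days_since_last_view <= 180:
        return "warm"       # S3 Standard-IA
    if days_since_last_view <= 365:
        return "cold"       # S3 Glacier Instant Retrieval
    return "archive"        # Glacier Deep Archive

def tier_for_variant(variant: str, days_since_last_view: int) -> str:
    """Per-variant override: the 240p variant stays instantly playable."""
    tier = storage_tier(days_since_last_view)
    if variant == "240p" and tier in ("cold", "archive"):
        return "warm"       # keep a low-quality copy on Standard-IA
    return tier

print(tier_for_variant("1080p", 400))  # archive
print(tier_for_variant("240p", 400))   # warm
```

A nightly batch job would evaluate this policy per variant and emit storage-class transition requests, plus the promotion back to warm whenever a cold video is accessed.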

🧠Mental Model

💡 Analogy

A video platform is like a television broadcast station. The upload is like filming a show — the raw footage arrives at the studio in whatever format the camera crew used. Transcoding is like the production department converting that footage into multiple broadcast formats: SD for old TVs, HD for modern screens, and 4K for premium viewers. Each format is cut into short segments like commercial breaks, so the broadcast can be assembled on the fly. The CDN is the network of transmission towers across the country, each caching popular shows locally so that a viewer in any city gets crystal-clear reception without signal traveling back to the central studio. Adaptive bitrate streaming is like the television automatically adjusting picture quality based on your antenna signal strength — when the signal is strong, you get HD; when a storm rolls in, it gracefully degrades to SD rather than losing the picture entirely. The metadata service is the TV guide that lets you search for shows, and content moderation is the standards and practices department that reviews every show before it airs.

⚡ Core Idea

A video platform is fundamentally a write-once, read-millions pipeline. Videos are uploaded once, transcoded into multiple quality variants (the codec ladder), stored in tiered object storage, and delivered globally through a hierarchical CDN. The player uses adaptive bitrate streaming to select the optimal quality for each segment based on real-time network conditions. The architecture is shaped by extreme asymmetry: the upload path is asynchronous and compute-heavy (transcoding), while the viewing path is synchronous and bandwidth-heavy (CDN delivery). Every design decision — codec selection, segment duration, caching strategy, storage tiering — is a trade-off between user experience and cost at exabyte scale.

🎯 Why It Matters

Video platform design is the ultimate system design interview question because it touches every major distributed systems concept simultaneously: object storage, message queues, worker pools, DAG-based task orchestration, CDN caching hierarchies, streaming protocols, database design, ML-powered content moderation, and cost optimization. Understanding how these components fit together to serve a billion hours of video per day teaches you to reason about systems holistically — not just individual components, but the interactions, trade-offs, and economic pressures that shape real-world architecture at the largest scale in computing.
