Design a search autocomplete system that returns relevant suggestions within 100 milliseconds as users type, handling billions of queries per day with personalization, trending detection, multi-language support, and real-time content filtering at Google/Bing scale.
Search autocomplete looks simple on the surface: the user types a few characters, and a dropdown appears with suggestions. But behind that dropdown lies one of the most demanding real-time systems in software engineering. Google processes roughly 8.5 billion searches per day, which averages out to about 98,000 searches per second. Each keystroke a user types triggers a new autocomplete request, meaning the actual autocomplete QPS is often 3-5x the search QPS itself, since users type multiple characters before selecting a suggestion. That puts autocomplete request volume in the range of 300,000-500,000 requests per second at peak.
The latency budget is brutally tight. Users expect suggestions to appear within 100 milliseconds of pressing a key. If your autocomplete takes 200ms, users perceive it as laggy. At 500ms, they ignore the suggestions entirely and just hit Enter. Research from Google and Microsoft consistently shows that autocomplete adoption drops by 30-40% when latency exceeds 200ms. This means your entire pipeline from keystroke to rendered suggestions must fit within 100ms, including network round-trip time (typically 20-50ms), server processing (target under 10ms), and client-side rendering (10-20ms).
The 100ms Latency Budget
The 100ms end-to-end budget breaks down roughly as: 20-50ms network round trip, 5-10ms server-side lookup and ranking, 5-10ms serialization and transfer, 10-20ms client-side rendering. This leaves almost zero room for database queries. Autocomplete must be served entirely from in-memory data structures or fast caches.
Beyond raw speed, the quality bar is extraordinarily high. Suggestions must be relevant to the user's intent, not just their prefix. If a user types "app", they might mean "apple stock price", "apple pie recipe", or "application form". The system must rank these intelligently based on global popularity, trending topics, the user's search history, geographic location, and time of day. A user searching at 8 AM on a weekday likely wants different suggestions than the same user at 10 PM on a Saturday.
Content safety adds another dimension of complexity. Autocomplete systems must filter out offensive, defamatory, sexually explicit, and dangerous suggestions in real time. This is not just a nice-to-have; Google has faced lawsuits in multiple countries over autocomplete suggestions that were deemed defamatory. The filtering must happen at serving time with near-zero latency overhead, which rules out heavyweight ML inference for every request.
Why autocomplete is harder than it looks
The deceptive simplicity of autocomplete is what makes it such a popular system design interview question. It tests a candidate's ability to reason about latency budgets, choose appropriate data structures, design data pipelines, and handle the tension between batch processing and real-time updates. A weak answer describes a trie and stops. A strong answer addresses ranking, personalization, content safety, and the operational challenges of running the system at scale.
Interview framing
When an interviewer asks you to design autocomplete, they are testing whether you can reason about latency-sensitive systems, choose the right data structure for prefix matching, design a data pipeline for continuous updates, and handle the ranking/filtering layer. Start by clarifying scale and constraints before jumping into data structures.
Before drawing a single architecture box, you need to nail down the requirements and run the numbers. This is where most candidates either rush through (and design an under-powered system) or over-engineer (and waste interview time on unnecessary components). The goal is to establish clear functional requirements, non-functional requirements, and back-of-the-envelope estimates that drive architectural decisions.
Functional requirements

- Return 5-10 ranked suggestions for the current prefix on every keystroke.
- Rank suggestions using global popularity, recency, trending status, and, where available, the user's history, location, and language.
- Support multiple languages and scripts.
- Filter offensive, defamatory, and dangerous suggestions before they are shown.
- Reflect trending queries within minutes of an event breaking.

Non-functional requirements

- Latency: under 100ms end-to-end per keystroke, with under 10ms server-side at P99.
- Throughput: hundreds of thousands of autocomplete requests per second at peak.
- Availability: degrade gracefully; search must keep working even if autocomplete is down.
- Freshness: the base suggestion set may lag by hours, but trending queries must appear within minutes.
Now let us run the numbers. These estimates will directly determine whether we need sharding, how much memory our trie requires, and what kind of data pipeline we need.
Back-of-the-envelope estimation
1. Daily search volume: Assume 5 billion searches per day (Google scale). Average QPS = 5B / 86,400 = ~58,000 search QPS. Peak QPS (3x) = ~174,000 search QPS.
2. Autocomplete QPS: Each search involves the user typing an average of 4-5 characters before selecting a suggestion. Each keystroke triggers an autocomplete request (with debouncing, roughly 2-3 actual requests per search). Autocomplete QPS = 174,000 x 3 = ~520,000 requests/sec at peak.
3. Unique query volume: Of 5 billion daily searches, approximately 15-20% are unique queries never seen before (the "long tail"). That means roughly 1 billion unique queries per day, and the trie must handle tens of billions of known prefixes.
4. Trie size estimation: If we store the top 10 million most popular queries (covering 90%+ of traffic via the power-law distribution), with an average query length of 20 characters and 8 bytes of metadata per node, a compressed trie requires roughly 200-500 MB of memory. This fits comfortably in a single server's RAM.
5. Suggestion payload size: Each autocomplete response returns 5-10 suggestions, each a string of 20-50 characters. With metadata, each response is roughly 1-2 KB. At 500K QPS, that is 500-1000 MB/sec of egress bandwidth.
6. Storage for query logs: 5 billion search queries per day at an average of 50 bytes per query = 250 GB/day of raw log data. Over a year, that is roughly 90 TB. This feeds the offline trie-building pipeline.
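The estimates above can be verified with a few lines of arithmetic. A quick sketch, using only the assumptions stated in the steps (5B searches/day, 3x peak factor, 3 requests per search after debouncing, 50 bytes per logged query):

```python
# Sanity-checking the back-of-the-envelope estimates above.
DAILY_SEARCHES = 5_000_000_000
SECONDS_PER_DAY = 86_400
PEAK_MULTIPLIER = 3          # peak traffic vs daily average
REQUESTS_PER_SEARCH = 3      # autocomplete calls per search, after debouncing

avg_search_qps = DAILY_SEARCHES / SECONDS_PER_DAY               # ~58,000
peak_search_qps = avg_search_qps * PEAK_MULTIPLIER              # ~174,000
peak_autocomplete_qps = peak_search_qps * REQUESTS_PER_SEARCH   # ~520,000

BYTES_PER_LOGGED_QUERY = 50
log_gb_per_day = DAILY_SEARCHES * BYTES_PER_LOGGED_QUERY / 1e9  # 250 GB/day
log_tb_per_year = log_gb_per_day * 365 / 1000                   # ~91 TB/year
```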
Let the numbers drive your architecture
The key insight from estimation: the top 10 million queries fit in under 1 GB of memory, meaning a single server can hold the entire trie. This means sharding is needed for throughput (handling 500K QPS), not for data size. This fundamentally changes the architecture: you replicate the full trie to many servers rather than partitioning it.
| Metric | Value | Architectural Implication |
|---|---|---|
| Peak autocomplete QPS | ~500,000 req/sec | Need 50-100 serving nodes (each handling ~5-10K QPS) |
| Trie memory footprint | 200-500 MB | Fits in single server RAM; replicate rather than shard |
| P99 latency target | < 10ms server-side | Must serve from in-memory data structures, no disk I/O |
| Unique queries per day | ~1 billion | Long tail requires robust handling of unknown prefixes |
| Query log storage | 250 GB/day | Batch pipeline processes logs to rebuild trie periodically |
| Suggestion payload | 1-2 KB per response | 500-1000 MB/sec egress bandwidth at peak |
These numbers reveal a critical architectural insight: autocomplete is a read-heavy, compute-light workload. The bottleneck is not CPU but rather memory and network throughput. Each request does a simple prefix traversal (microseconds of CPU) but must be served from RAM with minimal serialization overhead. This makes autocomplete an ideal candidate for in-memory data structures replicated across many read-only serving nodes.
The trie (pronounced "try", from "retrieval") is the foundational data structure for autocomplete. It is a tree where each node represents a character, and paths from root to leaf represent complete strings. The key property that makes tries ideal for autocomplete is that all strings sharing a common prefix share the same path in the tree, enabling prefix lookups in O(L) time where L is the length of the prefix, regardless of how many strings are stored.
Consider storing the queries "tree", "trie", "try", "trip", and "top". In a basic trie, the root has a child "t", which has children "r" and "o". The "r" node has children "e" and "i" and "y". Each path from root to a terminal node represents a complete query. To find all completions for the prefix "tr", you traverse root -> t -> r, then enumerate all descendants: "tree", "trie", "trip", "try".
graph TD
ROOT["(root)"] --> T["t"]
T --> R["r"]
T --> O["o"]
R --> E["e"]
R --> I["i"]
R --> Y["y ★"]
E --> E2["e ★ (tree)"]
I --> E3["e ★ (trie)"]
I --> P["p ★ (trip)"]
O --> P2["p ★ (top)"]
style ROOT fill:#f9f,stroke:#333
style Y fill:#bfb,stroke:#333
style E2 fill:#bfb,stroke:#333
style E3 fill:#bfb,stroke:#333
style P fill:#bfb,stroke:#333
style P2 fill:#bfb,stroke:#333

Basic trie storing "tree", "trie", "try", "trip", "top". Star nodes mark complete words. Searching prefix "tr" traverses root->t->r and finds all descendants.
A basic trie has a significant memory overhead problem. Each node stores a character plus an array of child pointers. With a 26-character English alphabet, each node might use 26 x 8 = 208 bytes just for child pointers, even if most are null. At scale with millions of queries, this wastes gigabytes of memory. The solution is a compressed trie (also called a Patricia trie or radix tree), which collapses chains of single-child nodes into single edges with multi-character labels.
Compressed trie memory savings
In a compressed trie, the path r->e->c->o->m->m->e->n->d becomes a single edge labeled "recommend". This reduces node count by 70-80% in practice, because most natural language queries share long common prefixes and have long unique suffixes. A basic trie for 10M queries might use 5-8 GB; a compressed trie fits the same data in 200-500 MB.
Each node in our autocomplete trie stores more than just characters. The node structure includes: the character or compressed label (string edge), an array of child pointers, a boolean flag indicating whether this node is a terminal (represents a complete query), and crucially, the top-K suggestions for this prefix along with their scores. Pre-computing the top-K at each node is the key optimization that makes autocomplete fast. Instead of traversing all descendants to find the best suggestions at query time, we store them at build time.
| Field | Type | Size | Purpose |
|---|---|---|---|
| edge_label | string | 1-20 bytes | Compressed character sequence for this edge |
| children | pointer array | 8 bytes per child | Pointers to child nodes (sparse) |
| is_terminal | boolean | 1 byte | True if this node completes a valid query |
| top_suggestions | array of (string, score) | ~200 bytes for top-10 | Pre-computed top-K completions and their ranking scores |
| frequency | uint32 | 4 bytes | Search frequency count for this exact query |
| last_updated | uint32 | 4 bytes | Timestamp for recency decay calculation |
The prefix traversal algorithm for serving autocomplete is straightforward once the trie is built with pre-computed suggestions. Given an input prefix, traverse the trie character by character from the root. If at any point a character has no matching child, return empty results (the prefix matches nothing). If you reach the end of the prefix and land on a node, return that node's pre-computed top_suggestions array. This is an O(L) operation where L is the prefix length, typically 5-15 characters. The lookup itself takes microseconds.
Trie lookup algorithm
1. Start at the root node. Set current_node = root.
2. For each character c in the input prefix: find the child of current_node whose edge starts with c. If no such child exists, return empty suggestions (prefix not found). If the edge label is longer than one character, verify the remaining characters match the prefix. Advance current_node to the child.
3. After consuming all prefix characters, current_node points to the prefix node.
4. Return current_node.top_suggestions (the pre-computed top-K results with scores).
5. Total time complexity: O(L) where L is the prefix length. No traversal of descendants is needed at query time.
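The node structure and lookup algorithm above can be sketched in a few dozen lines. This is a minimal illustration, not a production implementation: it uses a plain per-character trie (no edge compression) and maintains each node's top-K list at insert time; all names are illustrative.

```python
import heapq

class TrieNode:
    def __init__(self):
        self.children = {}         # char -> TrieNode (dict instead of a sparse array)
        self.is_terminal = False
        self.top_suggestions = []  # pre-computed (score, query) pairs, best first

def insert(root, query, score, k=10):
    """Build-time insertion: update the top-K list at every node on the path."""
    node = root
    for ch in query:
        node = node.children.setdefault(ch, TrieNode())
        node.top_suggestions.append((score, query))
        node.top_suggestions = heapq.nlargest(k, node.top_suggestions)
    node.is_terminal = True

def lookup(root, prefix):
    """Serve-time lookup: O(L) walk, then a single read of the pre-computed list."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []              # prefix matches nothing
    return [q for _, q in node.top_suggestions]

root = TrieNode()
for q, s in [("tree", 0.9), ("trie", 0.8), ("try", 0.7), ("trip", 0.6), ("top", 0.5)]:
    insert(root, q, s)
print(lookup(root, "tr"))   # ['tree', 'trie', 'try', 'trip']
```

Note that lookup never visits descendants: the cost of ranking was paid once at build time, which is exactly the pre-computation trade-off described above.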
Why pre-computing top-K matters
Without pre-computed suggestions, a prefix like "a" would require traversing millions of descendants to find the top-10 results. With pre-computation, it is a single pointer dereference. The trade-off is build time: maintaining top-K lists while inserting N queries of average length L costs roughly O(N * L * log K) overall, but this runs in a batch pipeline, not on the serving path.
Memory analysis for a production trie: with 10 million unique queries, an average query length of 20 characters, and a compression ratio of 3-4x (compressed trie vs basic trie), the total node count is approximately 50 million. At an average of 40 bytes per node (compressed edge + child pointers + metadata), the bare trie is about 2 GB. Adding pre-computed top-10 suggestions at each node (roughly 200 bytes per node) brings the total to approximately 12 GB; the suggestion lists, not the trie skeleton, dominate memory. With further optimization such as string interning (storing each suggestion string once and referencing it by pointer), the total can be reduced to 3-5 GB. This fits comfortably in the RAM of a modern server.
Having the right data structure is only half the battle. The other half is ranking: given a prefix that matches thousands or millions of possible completions, which 5-10 do you show? Naive frequency-based ranking (show the most searched completions) is a reasonable starting point but falls short in practice. A system that only ranks by frequency would always suggest "apple" before "apple stock price" for the prefix "app", even when the user clearly intends a financial query based on their history.
The ranking score for each suggestion is a weighted combination of multiple signals. At its simplest, the score formula looks like: score = w1 * frequency + w2 * recency + w3 * personalization + w4 * trending + w5 * quality. Each component captures a different dimension of relevance.
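The score formula can be made concrete with a small sketch. The weights, half-life, and signal ranges below are illustrative assumptions, not tuned values; real systems fit these against click data offline.

```python
import math
import time

# Illustrative weights (w1..w5 in the formula above); real systems tune these.
WEIGHTS = {"frequency": 0.5, "recency": 0.2, "personalization": 0.15,
           "trending": 0.1, "quality": 0.05}

def suggestion_score(freq, last_seen_ts, personal_affinity, trending_boost,
                     quality, now=None, half_life_days=7.0):
    """Weighted combination of ranking signals for one (prefix, suggestion) pair."""
    now = now if now is not None else time.time()
    age_days = (now - last_seen_ts) / 86_400
    recency = 0.5 ** (age_days / half_life_days)  # exponential recency decay
    freq_component = math.log1p(freq)             # dampen the power-law head
    return (WEIGHTS["frequency"] * freq_component
            + WEIGHTS["recency"] * recency
            + WEIGHTS["personalization"] * personal_affinity
            + WEIGHTS["trending"] * trending_boost
            + WEIGHTS["quality"] * quality)
```

The log on frequency matters: without it, the single most popular query would swamp every other signal for every prefix it matches.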
Ranking signals breakdown
Scoring at build time vs serve time
Pre-compute the combined score for each (prefix, suggestion) pair during the offline trie build. At serve time, the trie node already contains the top-K suggestions sorted by score. Only the personalization and trending components need real-time adjustment. This keeps serve-time computation to a minimum: look up the pre-ranked list, apply a lightweight personalization re-rank, and return.
Personalization deserves special attention because it introduces state into an otherwise stateless serving path. There are three common approaches, each with different trade-offs. First, server-side personalization: store the user's recent queries in a fast key-value store (Redis or Memcached) and blend them with global suggestions at serve time. This adds 1-3ms of latency per request for the Redis lookup. Second, client-side personalization: the client caches the user's recent queries locally and blends them with server suggestions before rendering. This adds zero server latency but requires client logic. Third, segment-based personalization: assign users to behavioral segments (tech enthusiast, sports fan, news reader) and pre-compute per-segment suggestion rankings. This avoids per-user state entirely but provides coarser personalization.
| Approach | Latency Impact | Personalization Quality | Privacy | Complexity |
|---|---|---|---|---|
| Server-side (per-user Redis) | +1-3ms | High: uses full search history | Lower: server stores user data | High: requires user-keyed cache |
| Client-side blending | +0ms server | Medium: limited to recent queries on device | Higher: data stays on device | Medium: client logic needed |
| Segment-based | +0ms | Low-Medium: coarse behavioral segments | Higher: no individual tracking | Low: pre-computed in batch |
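The server-side weighted merge mentioned above can be sketched as follows. The boost constant and function names are assumptions for illustration; the user's recent queries are whatever the Redis lookup returned.

```python
def blend_suggestions(global_ranked, user_recent, prefix, k=10, personal_boost=0.3):
    """Blend globally ranked suggestions with a user's recent queries.

    global_ranked: list of (query, score) pairs from the trie, best first.
    user_recent: set of the user's recent queries (e.g. fetched from Redis).
    """
    scored = {}
    for query, score in global_ranked:
        # A global suggestion the user has searched before gets a boost.
        scored[query] = score + (personal_boost if query in user_recent else 0.0)
    # A recent personal query matching the prefix is surfaced even if it
    # never made the global top-K.
    for query in user_recent:
        if query.startswith(prefix) and query not in scored:
            scored[query] = personal_boost
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return [q for q, _ in ranked[:k]]
```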
For ML re-ranking, large search engines deploy lightweight neural models that run in the serving path. These models take the top 50-100 candidates from the trie and re-rank them based on contextual features: time of day, device type, previous query in the session, geographic location, and the partial query itself. The model must be extremely fast (under 2ms inference) so it fits within the latency budget. Common architectures include small feed-forward networks or gradient-boosted trees (XGBoost/LightGBM) that can score 100 candidates in under 1ms. Heavy transformer models are too slow for the autocomplete serving path.
ML re-ranking latency trap
A common mistake is proposing a BERT-based re-ranker in the autocomplete serving path. BERT inference takes 10-50ms per batch even on a GPU, which consumes the entire latency budget. Use lightweight models (GBDT, small MLPs with under 100K parameters) for serve-time re-ranking. Save heavy models for offline scoring during trie building.
The trending boost system requires a separate real-time pipeline. Query volumes are monitored in a sliding window (typically 5-minute buckets). When the current window volume exceeds 5-10x the baseline (7-day moving average), the query is flagged as trending. Trending queries are injected into the serving layer via a lightweight overlay that augments the batch-built trie. This overlay is small (typically hundreds to low thousands of entries) and can be pushed to serving nodes in seconds.
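The spike-detection logic can be sketched as a toy in-process class. In production this runs as a streaming job (Flink/Spark Streaming) over the query log; the class below is a single-machine illustration with assumed names.

```python
from collections import defaultdict, deque

class TrendingDetector:
    """Flags queries whose current 5-minute bucket spikes above baseline."""

    def __init__(self, spike_factor=5.0, window_buckets=2016):
        # 2016 five-minute buckets ~= a 7-day baseline window
        self.spike_factor = spike_factor
        self.history = defaultdict(lambda: deque(maxlen=window_buckets))
        self.current = defaultdict(int)

    def record(self, query):
        self.current[query] += 1

    def close_bucket(self):
        """Called every 5 minutes: returns queries that trended in this bucket."""
        trending = []
        for query, count in self.current.items():
            hist = self.history[query]
            baseline = (sum(hist) / len(hist)) if hist else 0.0
            if baseline > 0 and count >= self.spike_factor * baseline:
                trending.append(query)
            hist.append(count)
        self.current.clear()
        return trending
```

The returned list is what gets pushed into the trending overlay on the serving nodes.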
The autocomplete trie is only as good as the data feeding it. The data collection pipeline is the backbone of the system: it ingests billions of search queries, aggregates them into frequency counts, applies content filtering, computes ranking scores, and builds the trie that serving nodes load. This pipeline runs continuously in batch mode (every few hours) with a separate real-time stream for trending queries.
The pipeline starts with search query logs. Every search query executed on the platform is logged with metadata: the query string, timestamp, user ID (anonymized), geographic region, device type, and whether the user clicked a result. These logs are written to a distributed log system (Kafka, Kinesis, or Google Pub/Sub) for both real-time processing and batch archival to object storage (S3, GCS).
graph LR
A["Search Frontend"] -->|query logs| B["Kafka / Kinesis"]
B -->|real-time stream| C["Flink / Spark Streaming"]
C -->|trending queries| D["Trending Overlay Store"]
B -->|batch archive| E["S3 / GCS Data Lake"]
E -->|hourly/daily| F["Spark / MapReduce
Aggregation Job"]
F -->|frequency counts| G["Query Frequency Store"]
G --> H["Trie Builder"]
H -->|serialized trie| I["Trie Artifact Store
(S3/GCS)"]
I -->|pull on startup| J["Autocomplete
Serving Nodes"]
D -->|push updates| J
style A fill:#e1f5fe
style J fill:#e8f5e9
style H fill:#fff3e0

Data pipeline: search logs flow through batch aggregation to build the trie, while a parallel streaming path handles trending queries in real time.
The batch aggregation job is the workhorse of the pipeline. Running on Spark or MapReduce, it processes the last N days of query logs (typically 30-90 days) and produces a table of (query, frequency, recency_score, quality_score). The job involves several stages.
Batch pipeline stages
1. Ingestion: Read raw query logs from the data lake (S3/GCS). Deduplicate by (user_id, query, timestamp_bucket) to prevent one user spamming a query from inflating counts. Parse and normalize query strings (lowercase, strip extra whitespace, normalize Unicode).
2. Aggregation: Group by normalized query string. Count total occurrences, compute time-weighted frequency (recent queries weighted more), and calculate click-through rate (the percentage of searches for this query that ended in a result click).
3. Content filtering: Run each unique query through content safety classifiers (keyword blocklists, ML toxicity classifiers, named entity recognition for public figures). Flag or remove queries that violate content policies. This step became critical after the defamation lawsuits Google faced in multiple countries over autocomplete suggestions.
4. Scoring: Compute the composite ranking score for each query using the formula: score = log(frequency) * recency_decay * quality_multiplier. Output a sorted list of (query, score) pairs.
5. Trie construction: Build a compressed trie from the scored query list. At each node, propagate the top-K suggestions from descendants. Serialize the trie into a compact binary format (Protocol Buffers or FlatBuffers) for efficient loading by serving nodes.
6. Artifact publication: Upload the serialized trie to artifact storage (S3/GCS) with a version number and checksum. Notify serving nodes that a new trie version is available.
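The normalization and deduplication rules from the ingestion stage can be sketched directly. The one-hour dedup bucket is an assumed parameter; function names are illustrative.

```python
import unicodedata

def normalize_query(raw):
    """Ingestion-stage normalization: Unicode folding, whitespace, case."""
    q = unicodedata.normalize("NFKC", raw)   # fold Unicode variants
    q = " ".join(q.split())                  # collapse runs of whitespace
    return q.lower()

def dedup_key(user_id, query, timestamp, bucket_seconds=3600):
    """One user repeating the same query within a time bucket counts once."""
    return (user_id, normalize_query(query), timestamp // bucket_seconds)
```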
Incremental trie building
Rather than rebuilding the entire trie from scratch every cycle, use incremental updates. The Spark job computes deltas (new queries, changed frequencies) and applies them to the existing trie. This reduces build time from hours to minutes for the common case where 95% of queries are unchanged.
The content filtering stage deserves emphasis because it is where many real-world autocomplete systems have failed catastrophically. Blocklists are the first line of defense: a curated list of terms and patterns that must never appear in suggestions. But blocklists are inherently reactive, covering known bad terms but missing novel offensive combinations. ML classifiers (trained on labeled examples of toxic, defamatory, and dangerous content) provide a second layer. Named entity recognition (NER) identifies public figures and applies stricter filtering rules for completions involving real people. Even with these layers, new edge cases constantly emerge, which is why a real-time feedback mechanism (users can report bad suggestions) feeds back into the pipeline.
The trie build process itself is a classic MapReduce/Spark problem. The Map phase emits every prefix of every query as a key, with the full query and its score as the value. For example, for the query "machine learning" with score 0.95, the Map emits: ("m", "machine learning", 0.95), ("ma", "machine learning", 0.95), ("mac", "machine learning", 0.95), and so on. The Reduce phase groups by prefix and keeps only the top-K suggestions per prefix. This is essentially a distributed top-K aggregation. The output is then assembled into a trie structure in a final single-node step.
Optimizing the MapReduce trie build
Emitting every prefix of every query creates massive intermediate data (10M queries x 20 avg length = 200M key-value pairs). Optimize with: (1) Use a combiner to keep only top-K per prefix in the Map phase, reducing shuffle data by 99%. (2) Process in frequency-descending order so the combiner can prune early. (3) Use a custom partitioner that routes prefixes sharing the first 2 characters to the same reducer, enabling local trie construction.
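The map-with-combiner and reduce logic can be sketched in-process. This is a single-machine illustration of the distributed top-K aggregation, with assumed names; a real job would express the same logic as Spark transformations.

```python
import heapq
from collections import defaultdict

def map_phase(queries, k=10):
    """Map + combiner: emit (prefix -> local top-K) instead of every pair.

    Keeping only K entries per prefix inside the mapper is what cuts the
    shuffle volume by ~99% compared with emitting every (prefix, query) pair.
    """
    local_topk = defaultdict(list)   # prefix -> min-heap of (score, query)
    for query, score in queries:
        for i in range(1, len(query) + 1):
            heap = local_topk[query[:i]]
            if len(heap) < k:
                heapq.heappush(heap, (score, query))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, query))
    return local_topk

def reduce_phase(partials, k=10):
    """Reduce: merge the per-mapper top-K lists into a global top-K per prefix."""
    merged = defaultdict(list)
    for partial in partials:
        for prefix, heap in partial.items():
            merged[prefix].extend(heap)
    return {prefix: [q for _, q in heapq.nlargest(k, items)]
            for prefix, items in merged.items()}
```

The reducer output is exactly the (prefix -> top-K suggestions) table that the final trie-assembly step consumes.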
The serving architecture is where the latency budget is won or lost. Every component on the serving path must be optimized for minimal latency: the client sends a request, it hits the API gateway, routes to an autocomplete service node, which does an in-memory trie lookup, optionally applies a personalization re-rank, and returns the response. The entire server-side path must complete in under 10 milliseconds at P99.
graph LR
Client["Client
(Browser/App)"] -->|"prefix: 'mach'"| GW["API Gateway
/ Load Balancer"]
GW -->|route| AS1["Autocomplete Service
Node 1"]
GW -->|route| AS2["Autocomplete Service
Node 2"]
GW -->|route| AS3["Autocomplete Service
Node N"]
AS1 -->|lookup| T1["In-Memory Trie
(~3-5 GB)"]
AS2 -->|lookup| T2["In-Memory Trie
(~3-5 GB)"]
AS3 -->|lookup| T3["In-Memory Trie
(~3-5 GB)"]
AS1 -.->|optional| PS["Personalization
Cache (Redis)"]
AS1 -.->|optional| TO["Trending Overlay
Cache"]
style Client fill:#e1f5fe
style T1 fill:#e8f5e9
style T2 fill:#e8f5e9
style T3 fill:#e8f5e9
style PS fill:#fff3e0
style TO fill:#fff3e0

Serving architecture: each node holds a full copy of the trie in memory. The load balancer distributes requests across nodes. Personalization and trending overlays are optional fast lookups.
A critical architectural decision is replication vs sharding. Since the compressed trie fits in 3-5 GB of memory, every serving node holds a complete copy. This is a replication strategy, not sharding. The advantages are enormous: any node can handle any request (no routing logic), load balancing is trivial (round-robin), and losing a node has zero data impact. If the trie were too large for a single server (say, 100+ GB), you would shard by prefix range (a-f on shard 1, g-m on shard 2, etc.), but this adds routing complexity and cross-shard query headaches.
| Aspect | Full Replication | Prefix-Based Sharding |
|---|---|---|
| When to use | Trie fits in single server RAM (< 50 GB) | Trie exceeds single server RAM (> 50 GB) |
| Routing complexity | None: any node handles any prefix | Requires prefix-to-shard mapping at the gateway |
| Fault tolerance | Excellent: lose any node, others serve all traffic | Must replicate each shard for fault tolerance |
| Scaling | Add more identical nodes for throughput | Add replicas per shard or split shards further |
| Trie updates | Push new trie to all nodes | Push shard-specific updates to relevant nodes |
| Operational simplicity | High: homogeneous fleet | Lower: heterogeneous shard management |
The trie loading process is carefully orchestrated to avoid downtime. When a new trie version is published by the batch pipeline, serving nodes do not restart. Instead, each node downloads the new trie artifact in the background, deserializes it into a secondary memory buffer, and then atomically swaps the pointer from the old trie to the new one. The old trie is garbage-collected after all in-flight requests drain. This blue-green trie deployment ensures zero-downtime updates with no latency impact.
Blue-green trie deployment
Each serving node maintains two trie slots: active and standby. New trie loads into standby. An atomic pointer swap makes standby the new active. In-flight requests on the old trie complete normally. Memory for the old trie is freed after a grace period (typically 30 seconds). This pattern is borrowed from database page management and ensures zero-downtime, zero-latency-impact updates.
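The atomic pointer swap can be sketched as follows. A plain dict stands in for the deserialized trie, and the class name is illustrative; the point is that readers never take a lock and always see either the complete old trie or the complete new one.

```python
import threading

class TrieSlot:
    """Two-slot trie holder for blue-green trie deployment (sketch)."""

    def __init__(self, initial_trie):
        self._active = initial_trie    # attribute rebinding is atomic in CPython
        self._lock = threading.Lock()  # serializes writers only; readers never lock

    def lookup(self, prefix):
        trie = self._active            # grab a stable reference for this request
        return trie.get(prefix, [])

    def swap_in(self, new_trie):
        """Deserialize into standby off-path, then atomically publish.

        The old trie is reclaimed by the garbage collector once the last
        in-flight request releases its reference (the grace period)."""
        with self._lock:
            self._active = new_trie
```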
The API gateway layer handles rate limiting, authentication (if personalization requires user identity), and geographic routing. For a global service, autocomplete requests are routed to the nearest datacenter using anycast DNS or a global load balancer. Each datacenter runs its own fleet of autocomplete serving nodes with a region-specific trie (a global trie augmented with region-specific trending and popular queries). This ensures low latency worldwide while allowing regional customization.
Read replicas and caching layers sit in front of the trie for additional performance. A CDN or edge cache can cache autocomplete responses for the most popular prefixes (the top 1,000 prefixes account for a disproportionate share of traffic). A 1-minute TTL on cached responses is acceptable because autocomplete freshness within 1 minute is indistinguishable from real-time for most users. This CDN layer can absorb 50-70% of autocomplete traffic, dramatically reducing load on serving nodes.
Request flow from keystroke to suggestions
1. User types a character in the search box. Client-side debounce waits 100-150ms for the next keystroke. If no keystroke arrives, the client sends an HTTP GET request: GET /autocomplete?q=mach&lang=en&region=us.
2. The request hits the CDN/edge cache. If a cached response exists for this exact prefix+region combination and the TTL has not expired, it is returned immediately (cache hit, ~5ms end-to-end).
3. On a cache miss, the request reaches the API gateway, which routes it to an autocomplete serving node via round-robin load balancing.
4. The serving node performs an in-memory trie lookup for the prefix "mach" in four character comparisons (O(L) with L=4) and retrieves the pre-computed top-10 suggestions with scores.
5. If personalization is enabled and a user ID is present, the node fetches the user's recent queries from Redis (1-2ms) and blends them with the global suggestions using a weighted merge.
6. The node checks the trending overlay for any trending queries matching the prefix. If found, they are inserted into the result set with a boosted score.
7. The final ranked list of 5-10 suggestions is serialized as JSON and returned. Total server-side time: 2-5ms.
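Steps 04-06 of the flow above — trie lookup against pre-computed top-K lists, a personalization blend, and a trending-overlay boost — can be sketched in Python. Class names, boost weights, and the merge heuristics are illustrative assumptions, not a production API:

```python
from dataclasses import dataclass, field

@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)
    top_k: list = field(default_factory=list)  # pre-computed [(query, score), ...]

class AutocompleteIndex:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, query, score, k=10):
        # Maintain a top-k list at every node along the query's path,
        # so lookups never have to walk subtrees.
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            node.top_k.append((query, score))
            node.top_k.sort(key=lambda p: -p[1])
            del node.top_k[k:]

    def lookup(self, prefix):
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return list(node.top_k)

def serve(index, prefix, recent_user_queries=(), trending=(), k=5,
          personal_boost=1.5, trend_boost=3.0):
    """Blend global trie results with personal history and the trending overlay."""
    scores = dict(index.lookup(prefix))
    for q in recent_user_queries:            # step 05: personalization blend
        if q.startswith(prefix):
            scores[q] = scores.get(q, 1.0) * personal_boost
    for q, s in trending:                    # step 06: trending overlay injection
        if q.startswith(prefix):
            scores[q] = max(scores.get(q, 0.0), s * trend_boost)
    return [q for q, _ in sorted(scores.items(), key=lambda p: -p[1])[:k]]
```

Because every node stores its own top-K, the lookup cost is independent of how many queries share the prefix; the merge touches at most a few dozen candidates.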
Capacity planning for the serving fleet: if each node can handle 10,000 QPS and peak autocomplete QPS is 500,000, you need 50 nodes minimum. Add 50% headroom for traffic spikes and rolling deployments: 75 nodes. With the CDN absorbing 60% of traffic, the actual serving fleet only needs to handle 200,000 QPS, bringing the required node count to 20-30. This is a modest fleet for a Google-scale service.
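The capacity arithmetic above is worth writing down explicitly; the inputs below are the estimates from this lesson, not measured figures:

```python
import math

peak_qps = 500_000        # peak autocomplete requests per second (estimate)
node_capacity = 10_000    # sustainable QPS per serving node (estimate)
headroom = 1.5            # 50% buffer for spikes and rolling deployments
cdn_hit_rate = 0.60       # fraction of traffic absorbed at the edge

nodes_no_cdn = math.ceil(peak_qps / node_capacity * headroom)
origin_qps = peak_qps * (1 - cdn_hit_rate)
nodes_with_cdn = math.ceil(origin_qps / node_capacity * headroom)
print(nodes_no_cdn, origin_qps, nodes_with_cdn)  # → 75 200000.0 30
```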
The batch pipeline rebuilds the trie every few hours, which is sufficient for the long tail of stable queries. But trending topics change by the minute. When a major event breaks (an earthquake, a celebrity announcement, an election result), users immediately start searching for it. If autocomplete suggestions take hours to reflect these events, the system feels stale and unhelpful. This is where the real-time trending layer comes in.
The architecture uses a dual-layer approach: a batch-built trie (refreshed every 4-6 hours) serves as the stable foundation, and a lightweight real-time overlay handles trending queries. The overlay is a small, fast data structure (a hash map or mini-trie) containing only the queries that have recently spiked in volume. At serve time, the autocomplete node merges results from both layers: the batch trie provides the stable base, and the overlay injects trending items with boosted scores.
Why not just rebuild the trie in real-time?
Rebuilding a trie from scratch takes minutes to hours depending on data volume. Even incremental rebuilds take tens of seconds and consume significant CPU and memory during the build phase. The overlay approach avoids this entirely: trending queries are injected into a simple key-value structure that is merged at query time. The overlay is tiny (hundreds to thousands of entries) compared to the main trie (millions of entries), so the merge cost is negligible.
The trending detection system works as follows. A stream processor (Flink, Spark Streaming, or Kafka Streams) consumes the real-time query log from Kafka. It maintains sliding-window counters for every query observed in the last hour, using 5-minute tumbling windows. For each query, it compares the current window count to the historical baseline (7-day moving average for the same time-of-day and day-of-week). When the ratio exceeds a threshold (typically 5-10x), the query is flagged as trending.
Trending detection pipeline
01
Stream processor reads query events from Kafka in real-time. Each event contains: query_string, timestamp, region.
02
Maintain a sliding window counter per query using a Count-Min Sketch (probabilistic data structure) for memory efficiency. Update counts every 5 minutes.
03
For each query with count > minimum_threshold (e.g., 100 queries in the window), compute: trending_ratio = current_window_count / historical_baseline. Historical baseline is the 7-day average for this time slot.
04
If trending_ratio > 5x, flag the query as trending. Compute a trending_score = base_frequency * trending_ratio * decay_factor. The decay factor ensures that queries that have been trending for hours get less boost than queries that just started spiking.
05
Push the trending query with its score to the Trending Overlay Store (Redis or an in-memory broadcast). The overlay store is replicated to all serving nodes within seconds via pub/sub.
06
Serving nodes merge trending overlay results with batch trie results at query time. Trending queries with higher scores can displace batch suggestions in the final top-K.
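The detection loop above (steps 02-04) can be sketched with a minimal Count-Min Sketch and a ratio check. Sketch dimensions, thresholds, and the baseline format are illustrative assumptions; a real pipeline would run this inside Flink or Kafka Streams windows:

```python
import hashlib

class CountMinSketch:
    """Approximate per-query counters for the current window (memory-bounded)."""
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, key):
        for i in range(self.depth):
            h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.width

    def add(self, key, count=1):
        for row, idx in enumerate(self._hashes(key)):
            self.table[row][idx] += count

    def estimate(self, key):
        # Min across rows bounds the overestimate from hash collisions.
        return min(self.table[row][idx] for row, idx in enumerate(self._hashes(key)))

def detect_trending(window, baselines, min_count=100, ratio_threshold=5.0,
                    decay=1.0):
    """Flag queries whose current-window count spikes vs. their historical baseline.

    `baselines` maps query -> 7-day average count for this time slot.
    """
    trending = {}
    for query, baseline in baselines.items():
        count = window.estimate(query)
        if count < min_count:
            continue
        ratio = count / max(baseline, 1.0)
        if ratio > ratio_threshold:
            trending[query] = count * ratio * decay  # trending_score
    return trending
```

The sketch keeps memory constant regardless of query cardinality, at the cost of occasional overestimates, which is acceptable here because a slightly inflated count only makes a query marginally more likely to be flagged.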
| Aspect | Batch Trie | Trending Overlay |
|---|---|---|
| Update frequency | Every 4-6 hours | Every 5-30 seconds |
| Data volume | Millions of queries | Hundreds to thousands of queries |
| Build cost | Spark job, minutes to hours | Stream processor, continuous |
| Memory footprint | 3-5 GB per node | < 10 MB per node |
| Query coverage | 90%+ of autocomplete traffic | Only recently spiking queries |
| Ranking quality | High: uses full historical data | Medium: limited context, raw volume signal |
Handling hot queries requires special attention. When a single query goes massively viral (think a celebrity death or a major disaster), it can account for 1-5% of all autocomplete traffic in a short window. This creates a "thundering herd" problem where millions of requests for the same prefix arrive simultaneously. The mitigation is straightforward: the CDN/edge cache absorbs the bulk of identical requests (since the same prefix returns the same trending suggestions for all users), and the trending overlay ensures these suggestions are promoted without waiting for the batch trie rebuild.
Trending system abuse vector
Adversaries can attempt to game the trending system by coordinating automated searches to artificially spike a query. Defenses include: (1) deduplicate by anonymized user ID before counting, (2) require a minimum number of distinct users (not just queries), (3) apply velocity checks that flag suspiciously fast ramps, and (4) run content safety filters on trending queries before they enter the overlay.
The lifecycle of a trending query follows a predictable pattern: spike, plateau, decay. When a query is first detected as trending, it enters the overlay with a high boost. As it remains trending for hours, the boost gradually decays (because the query is now being picked up by more users organically and its frequency in the batch data is rising). Eventually, the next batch trie rebuild incorporates the query at its new natural frequency, and the overlay entry is removed. This handoff from real-time overlay to batch trie is seamless and invisible to the user.
For the interview, the key point to convey is that you understand the tension between data freshness and system complexity. A pure batch system is simple but stale. A pure real-time system is complex and expensive. The overlay approach gives you 95% of the freshness benefit at 5% of the complexity cost. This is a pattern that appears across many systems beyond autocomplete: batch processing for the stable majority, real-time processing for the volatile minority.
The client is not a passive consumer of autocomplete results. Smart client-side engineering can dramatically reduce server load, improve perceived latency, and create a smoother user experience. In fact, client-side optimizations often have more impact on perceived performance than server-side optimizations, because they eliminate network round trips entirely for many keystrokes.
Debouncing is the first and most impactful optimization. Without debouncing, every keystroke triggers an API call. A user typing "machine learning" at 6 characters per second would fire 16 requests in under 3 seconds. With a 100-150ms debounce, you wait for a pause in typing before sending the request. If the user types steadily, only 3-4 requests fire instead of 16. This reduces server load by 75% with virtually no perceptible delay to the user, since suggestions would not have rendered before the next keystroke anyway.
Adaptive debounce timing
A fixed 150ms debounce works well for average typists. But fast typists (200ms between keystrokes) experience noticeable delay, while slow typists (500ms between keystrokes) waste time waiting. Adaptive debouncing measures the user's typing speed and adjusts the debounce window: for fast typists, debounce at 50-80ms; for slow typists, debounce at 150-200ms. This is a small optimization that significantly improves perceived responsiveness.
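The adaptive policy above reduces to a small lookup keyed by the user's observed typing speed. The thresholds mirror the bands described in the callout and are illustrative; the exponential moving average is one reasonable way to track the inter-keystroke gap:

```python
def update_avg_gap(prev_avg_ms, new_gap_ms, alpha=0.3):
    """Exponential moving average of the gap between consecutive keystrokes."""
    return alpha * new_gap_ms + (1 - alpha) * prev_avg_ms

def adaptive_debounce_ms(avg_gap_ms):
    """Map observed typing speed to a debounce window (bands per the callout)."""
    if avg_gap_ms < 200:   # fast typist: shorter wait, suggestions feel instant
        return 70
    if avg_gap_ms > 500:   # slow typist: longer wait avoids wasted requests
        return 180
    return 150             # average typist: the standard fixed debounce
```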
Client-side caching is the second major optimization. If the user types "mac", gets suggestions, then backspaces to "ma" and retypes "mac", there is no reason to hit the server again. The client maintains a local cache (an in-memory map or LRU cache) of recent prefix-to-suggestions mappings. With a TTL of 5-10 minutes, this cache absorbs 30-50% of autocomplete requests. The cache key is the prefix string; the cache value is the suggestion list. For personalized results, include the user ID in the cache key.
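The LRU-plus-TTL cache described above is a few lines in any language; here is a sketch in Python (an injectable `clock` makes expiry testable; in a browser this would live in JavaScript with the same structure):

```python
import time
from collections import OrderedDict

class PrefixCache:
    """LRU cache of prefix -> suggestions with TTL, as a client might keep."""
    def __init__(self, max_entries=200, ttl_seconds=300, clock=time.monotonic):
        self.max_entries, self.ttl, self.clock = max_entries, ttl_seconds, clock
        self.entries = OrderedDict()  # prefix -> (expires_at, suggestions)

    def get(self, prefix):
        item = self.entries.get(prefix)
        if item is None:
            return None
        expires_at, suggestions = item
        if self.clock() >= expires_at:
            del self.entries[prefix]          # lazily evict expired entries
            return None
        self.entries.move_to_end(prefix)      # mark as most recently used
        return suggestions

    def put(self, prefix, suggestions):
        self.entries[prefix] = (self.clock() + self.ttl, suggestions)
        self.entries.move_to_end(prefix)
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict least recently used
```

For personalized results, the cache key would be `(user_id, prefix)` instead of the bare prefix, as noted above.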
| Optimization | Request Reduction | Latency Impact | Implementation Complexity |
|---|---|---|---|
| Debouncing (100-150ms) | 60-75% fewer requests | Adds 100-150ms to first suggestion, saves time on subsequent | Low: simple timer in keystroke handler |
| Client-side cache (LRU) | 30-50% of remaining requests | Instant response on cache hit (0ms network) | Low: in-memory map with TTL eviction |
| Prefetching next character | 20-30% of remaining requests | Suggestions appear to load instantly | Medium: predict likely next characters and pre-fetch |
| Request cancellation | No reduction, but saves server resources | Prevents stale results from overwriting fresh ones | Low: AbortController on each new keystroke |
| Progressive rendering | No reduction | Suggestions appear as they arrive, not all at once | Medium: streaming JSON parsing or chunked responses |
Prefetching is a more advanced optimization that borders on predictive behavior. When the user types "mac" and the server returns suggestions including "machine learning", the client can speculatively pre-fetch the suggestions for "mach", "machi", "machin" in the background. If the user continues typing in that direction, the suggestions are already cached and appear instantly. The risk is wasted bandwidth for prefetches the user never needs, so prefetching is typically limited to the top 2-3 most likely next prefixes based on the current suggestions.
Request cancellation is essential for correctness, not just performance. When the user types quickly, multiple autocomplete requests may be in flight simultaneously. Without cancellation, an older request (for prefix "ma") might return after a newer request (for prefix "mac"), causing the suggestions to flicker or show stale results. The solution is to cancel the previous in-flight request when a new keystroke arrives. In modern browsers, the AbortController API provides a clean cancellation mechanism for fetch requests.
Client-side autocomplete implementation pattern
01
On keystroke: clear the debounce timer and start a new one (100-150ms).
02
When the debounce timer fires: check the client-side cache for this prefix. If a cache hit, render suggestions immediately from cache.
03
On cache miss: cancel any in-flight request using AbortController.abort(). Send a new fetch request with the current prefix.
04
On response: store the result in the client-side cache with a TTL of 5 minutes. Render suggestions only if the response corresponds to the current input value (guard against race conditions where the user has typed further while the request was in flight).
05
On prefetch trigger: if the response includes suggestions, speculatively fetch the top 2-3 likely next prefixes in the background using low-priority requests (or requestIdleCallback).
06
On suggestion selection: log the selection event for quality feedback (which suggestion was clicked at which position). Clear the autocomplete UI and submit the search.
Common client-side bug: race conditions
Without proper request cancellation and response validation, fast typists see flickering suggestions. The classic bug: user types "m", "ma", "mac" in quick succession. The "m" response (slow, large result set) arrives last and overwrites the "mac" response. Fix: tag each request with a sequence number and only render responses whose sequence number matches the latest request.
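The sequence-number fix is small enough to show in full. This is a language-agnostic sketch in Python; in a browser the same guard wraps the fetch callback, alongside AbortController cancellation:

```python
class SuggestionBox:
    """Guard against out-of-order responses with a monotonic sequence number."""
    def __init__(self):
        self.latest_seq = 0
        self.rendered = None

    def send_request(self):
        self.latest_seq += 1
        return self.latest_seq          # tag the outgoing request with this seq

    def on_response(self, seq, suggestions):
        if seq != self.latest_seq:      # stale response for an older prefix: drop
            return False
        self.rendered = suggestions
        return True
```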
Progressive rendering improves perceived performance for slow connections. Instead of waiting for all 10 suggestions to arrive before rendering, show suggestions incrementally as they arrive. With HTTP/2 streaming or chunked transfer encoding, the server can push suggestions one at a time, and the client renders each one as it arrives. The first suggestion appears in 2-3ms, even if the full response takes 10ms. This technique is particularly effective on mobile networks where latency is higher and more variable.
The cumulative impact of these client-side optimizations is substantial. With debouncing reducing requests by 70%, client caching absorbing 40% of the remainder, and the CDN handling 60% of what reaches the network, the actual server load from autocomplete is roughly 7-8% of the naive "one request per keystroke" approach. For a system that would otherwise see 500K QPS, optimized client behavior reduces server-side load to under 50K QPS.
The final frontier of autocomplete complexity is handling the real world: multiple languages with different character sets, users who make typos, and personalization that respects privacy. Each of these adds a layer of complexity on top of the core trie-based system, and each is frequently probed in senior-level interviews to test depth of thinking.
Multi-language support is fundamentally more complex than it appears. English autocomplete operates on a 26-character alphabet with clear word boundaries (spaces). Chinese, Japanese, and Korean (CJK) have thousands of characters, no spaces between words, and users frequently input using phonetic systems (Pinyin for Chinese, Romaji for Japanese) that must be converted to the target script. Arabic and Hebrew are right-to-left with contextual character forms. A global autocomplete system must handle all of these simultaneously.
Multi-language challenges
Fuzzy matching addresses the reality that users make typos. If a user types "machne" (missing the "i" in "machine"), strict prefix matching returns zero results. Fuzzy matching uses edit distance (Levenshtein distance) to find suggestions within 1-2 edits of the input. The challenge is computational cost: computing edit distance between the input and millions of trie entries is too expensive for the latency budget.
Fuzzy matching within latency budget
Three approaches work in practice: (1) BK-trees alongside the trie, which enable edit-distance-bounded searches in O(log N) time. (2) Pre-computed spelling corrections: maintain a map of common misspellings to their corrections and rewrite the query before trie lookup. (3) N-gram indexing: break queries into character n-grams (e.g., "machine" becomes "mac", "ach", "chi", "hin", "ine") and use an inverted index to find candidates sharing n-grams with the input. Approach 2 is the most common in production because it adds zero latency (a single hash map lookup before trie traversal).
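Approaches 2 and 3 from the callout can be sketched directly. The correction map entries and vocabulary are illustrative, and the Jaccard scoring is one simple way to rank n-gram candidates:

```python
def ngrams(s, n=3):
    """Character trigrams of a string, e.g. 'machine' -> {mac, ach, chi, hin, ine}."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

# Approach 2: pre-computed corrections — one hash lookup before trie traversal.
CORRECTIONS = {"machne": "machine", "teh": "the"}  # illustrative entries

def rewrite(prefix):
    return CORRECTIONS.get(prefix, prefix)

# Approach 3: n-gram fallback — rank vocabulary by trigram Jaccard overlap.
def ngram_candidates(user_input, vocabulary, top=3):
    q = ngrams(user_input)
    scored = []
    for word in vocabulary:
        w = ngrams(word)
        if not q or not w:
            continue
        overlap = len(q & w) / len(q | w)
        if overlap > 0:
            scored.append((overlap, word))
    return [w for _, w in sorted(scored, reverse=True)[:top]]
```

In production the n-gram path uses an inverted index (trigram -> queries) rather than scanning the vocabulary, so it only scores candidates that share at least one trigram with the input.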
Phonetic matching goes beyond edit distance to handle queries that sound alike but are spelled differently. "Shawn" vs "Sean", "colour" vs "color", or transliterations like "Tchaikovsky" vs "Chaikovsky". Algorithms like Soundex, Metaphone, and Double Metaphone convert words to phonetic codes, and these codes are stored alongside the standard trie entries. When a prefix yields few or no results, the system falls back to phonetic matching. This is particularly important for name searches and music/movie queries where spelling varies widely.
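As a concrete example, classic Soundex (the simplest of the algorithms named above) fits in a few lines. This is a sketch of the textbook algorithm, not a production phonetic matcher; Metaphone variants are more accurate but considerably longer:

```python
def soundex(name):
    """Classic Soundex: first letter + 3 digits, e.g. 'Robert' -> 'R163'."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"),
             "r": "6"}
    name = name.lower()
    first = name[0].upper()
    result = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if ch in "hw":
            continue               # h/w do not separate identical codes
        if code and code != prev:
            result.append(code)
        prev = code                # vowels (empty code) reset prev
    return (first + "".join(result) + "000")[:4]
```

Both "Sean" and "Shawn" encode to S500, so storing the code alongside each trie entry lets the system fall back to phonetic neighbors when a prefix yields few results.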
| Technique | Handles | Latency Cost | Coverage |
|---|---|---|---|
| Exact prefix match | Correctly typed prefixes | 0ms (baseline) | 80-90% of queries |
| Pre-computed spelling correction | Common typos (teh->the, machne->machine) | +0ms (hash lookup) | 5-10% of queries |
| Edit distance (BK-tree) | Any single-character typo | +1-3ms | 3-5% of queries |
| Phonetic matching (Metaphone) | Phonetically similar queries | +1-2ms | 1-3% of queries |
| N-gram index fallback | Heavily misspelled queries | +2-5ms | 1-2% of queries |
Personalization must be balanced against privacy. The most effective personalization uses the user's full search history to boost suggestions matching their interests. However, storing and accessing individual search histories raises significant privacy concerns, especially under regulations like GDPR and CCPA. Production systems use a layered approach to personalization that progressively trades off effectiveness for privacy.
Privacy-aware personalization tiers
01
Tier 1 — Session context (no stored data): Use the current session's queries to personalize subsequent suggestions. If the user searched "Python tutorial" earlier in the session, boost Python-related suggestions for the rest of the session. No data is stored beyond the session. This is the most privacy-friendly approach.
02
Tier 2 — On-device personalization: Store the user's recent search history on their device only. The client blends on-device history with server-provided global suggestions locally. The server never sees the user's history. Apple uses this approach extensively.
03
Tier 3 — Anonymized segment-based: Assign users to behavioral cohorts (e.g., "frequently searches tech topics") without storing individual histories. Pre-compute per-cohort suggestion rankings. Google's FLoC/Topics API follows this pattern.
04
Tier 4 — Server-side per-user (most effective, least private): Store the user's search history server-side and use it for real-time personalization. This provides the best suggestions but requires user consent and robust data protection. This approach is common behind logged-in experiences.
GDPR and autocomplete personalization
Under GDPR, search history is personal data. If you store it server-side for personalization, you need explicit user consent, the right to erasure (user can delete their history), and data portability. Design the personalization system with a clear data deletion path from day one. The trie itself should not contain per-user data; personalization should be a separate layer that can be wiped without affecting the global trie.
In the interview, multi-language and privacy-aware personalization separate L5 from L6+ answers. Discussing CJK input methods, phonetic matching, and the privacy spectrum of personalization demonstrates senior-level breadth.
Search autocomplete is a top-10 system design interview question at Google, Meta, Amazon, and Microsoft. It appears directly as "Design Search Autocomplete" or "Design Typeahead Suggestions" and indirectly in broader "Design Google Search" or "Design an E-commerce Search" questions. Interviewers love it because it tests data structure knowledge (tries), latency-sensitive system design, data pipeline thinking (batch plus streaming), and the ability to reason about ranking and personalization trade-offs. At L4-L5, candidates are expected to describe the trie data structure, basic serving architecture, and back-of-the-envelope estimation. At L6+, candidates must demonstrate deep knowledge of ranking models, content safety pipelines, multi-language challenges, client-side optimizations, and the batch-plus-streaming architecture for real-time freshness.
Common questions:
Your autocomplete system serves 100K QPS with a P99 latency of 8ms. The product team wants to add personalized suggestions based on user search history. What is your approach, and how do you prevent latency regression?
Use client-side personalization or a fast Redis lookup. Client-side: store the user's recent 50 queries on-device and blend with server suggestions locally, adding zero server latency. Server-side: keep user profiles in a Redis cluster co-located with autocomplete nodes, adding 1-2ms per request. Set a strict timeout of 3ms on the Redis call with a fallback to unpersonalized results if Redis is slow. Monitor P99 latency continuously and disable personalization if it causes regression beyond 10ms.
Explain why autocomplete systems use batch-built tries with real-time overlays rather than a single real-time-updated trie. What are the trade-offs?
A real-time-updated trie requires write locks or copy-on-write mechanics during updates, which adds latency jitter to reads and complexity to the serving path. Batch-built tries are immutable during serving, enabling lock-free reads at maximum speed. The overlay handles the 0.1% of queries that need freshness (trending topics) as a small, separate structure merged at query time. The trade-off is a 4-6 hour delay for non-trending query updates, which is acceptable because 99%+ of suggestion relevance comes from long-term query patterns, not minute-by-minute changes.
How would you handle autocomplete for a language like Chinese where there are no word boundaries and users input via Pinyin (romanized phonetics)?
Maintain two parallel tries: one indexed by Pinyin prefixes and one by Chinese character prefixes. When the user types Pinyin (e.g., "ji qi xue xi" for machine learning), match against the Pinyin trie, which maps to Chinese query suggestions. When the user types Chinese characters directly (common on mobile), match against the character trie. The Pinyin trie is larger because one Chinese phrase can have multiple Pinyin representations (homophones). Use a disambiguation layer that ranks Pinyin-to-Chinese mappings by frequency. The character trie operates on individual characters as prefix elements rather than words, since Chinese has no spaces.
💡 Analogy
Search autocomplete is like a library card catalog with a helpful librarian. As you say each letter of the book title you are looking for, the librarian immediately narrows the drawer of index cards and shows you the top matches before you finish speaking. The librarian remembers which books are popular this week (trending) and has noted what types of books you personally enjoy reading (personalization). If you mispronounce a word, the librarian still understands what you mean (fuzzy matching). Behind the scenes, a team of catalogers updates the card system every night with new books and removes inappropriate entries (batch pipeline with content filtering).
⚡ Core Idea
Autocomplete is a latency-critical prefix-matching system powered by an in-memory trie data structure. The trie is built offline from aggregated search logs, pre-computes top-K suggestions at every node for O(prefix-length) lookups, and is replicated across a fleet of serving nodes. A real-time trending overlay handles breaking queries, while client-side debouncing and caching reduce server load by an order of magnitude.
🎯 Why It Matters
Autocomplete is one of the most visible features in any search product: every user interacts with it on every search. It drives 30-50% of search query volume (users select autocomplete suggestions instead of typing full queries), directly impacts user engagement metrics, and is a frequent target for adversarial manipulation. Understanding autocomplete architecture teaches you how to design latency-sensitive systems, choose appropriate data structures, build offline-to-online data pipelines, and balance personalization with privacy — skills that transfer to any real-time serving system.