How load balancers actually work inside. L4 vs L7. Round-robin, least connections, IP hash — when each algorithm is right. Health checks, sticky sessions, SSL termination. Real AWS ALB vs NLB configurations.
Lesson outline
A load balancer sits between clients and a pool of backend servers, distributing incoming requests across available servers. It sounds simple, but load balancers are responsible for three critical functions that, if misconfigured, cause major outages: traffic distribution, health monitoring, and transparent failover.
Without a load balancer, adding more servers accomplishes nothing: clients have no single entry point that spreads their requests across the pool, so traffic keeps hitting whatever one address they know. The load balancer is what makes horizontal scaling possible.
graph LR
C1[Client 1] --> LB[Load Balancer]
C2[Client 2] --> LB
C3[Client N] --> LB
LB -->|Round Robin| S1[Server 1
✓ Healthy]
LB -->|Skip| S2[Server 2
✗ Unhealthy]
LB -->|Round Robin| S3[Server 3
✓ Healthy]
S1 --> DB[(Database)]
S3 --> DB
note[LB continuously health-checks all servers.
Unhealthy servers removed automatically.]

Load balancer distributes requests across healthy servers, automatically removing failed ones.
There are two fundamentally different types of load balancers: Layer 4 (operates at the transport layer, TCP/UDP) and Layer 7 (operates at the application layer, HTTP/HTTPS). The difference matters enormously for latency, feature set, and use cases.
Three Things Load Balancers Do Simultaneously
1. Distribute load: spread requests across servers using an algorithm (round-robin, least-connections, etc.). 2. Health monitoring: continuously probe backend servers, automatically remove failed ones. 3. Session management: optionally maintain "sticky" connections from a client to the same server. Understanding all three prevents misconfiguration.
Load balancers also act as a security boundary: they terminate SSL/TLS, absorb DDoS attacks, and hide the number and identity of backend servers. The backend servers see only the load balancer's IP address, not the original client IP (unless the LB adds X-Forwarded-For headers).
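That X-Forwarded-For header is easy to mishandle on the backend. A minimal sketch of recovering the client IP behind an L7 load balancer (the function name and sample addresses are illustrative):

```python
def client_ip_from_headers(headers: dict, peer_ip: str) -> str:
    """Recover the original client IP behind an L7 load balancer.

    X-Forwarded-For is a comma-separated chain: the left-most entry
    is the original client; each proxy appends the address it saw.
    Falls back to the TCP peer (the LB itself) if the header is absent.
    """
    xff = headers.get("X-Forwarded-For", "")
    if xff:
        return xff.split(",")[0].strip()
    return peer_ip  # no header: all the backend can see is the LB

# The backend's TCP peer is the LB node, but the header preserves the client:
assert client_ip_from_headers(
    {"X-Forwarded-For": "203.0.113.7, 10.0.0.12"}, "10.0.1.5"
) == "203.0.113.7"
assert client_ip_from_headers({}, "10.0.1.5") == "10.0.1.5"
```

Note that the left-most entry is client-supplied and spoofable; only the entries appended by proxies you control are trustworthy.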
The most important choice in load balancer configuration is L4 vs. L7. This single decision affects performance, routing capability, and cost.
An L4 (Layer 4) load balancer operates at the TCP/UDP transport layer. It routes based on IP address, port, and protocol only. It doesn't look at the HTTP content at all. It opens a TCP connection to a backend server and forwards packets. Because it doesn't interpret the content, it's extremely fast — typically handling millions of connections per second with sub-millisecond overhead.
An L7 (Layer 7) load balancer operates at the HTTP/HTTPS application layer. It terminates the client connection, reads the full HTTP request (method, path, headers, body), makes routing decisions based on that content, then forwards the request to a backend. It can do things L4 cannot: route /api requests to one server pool and /static to another, add/modify headers, perform authentication, block malicious requests.
| Dimension | L4 (TCP/UDP) | L7 (HTTP/HTTPS) |
|---|---|---|
| Routing basis | IP address + port only | URL path, hostname, headers, cookies |
| Content inspection | None — packets are opaque | Full HTTP request visibility |
| SSL termination | TCP passthrough (SSL handled by backend) | Yes — terminates SSL at LB, backends use HTTP |
| Connection handling | Persistent TCP tunnel to one backend | New connection per request (can use different backends) |
| Throughput | Millions of connections/sec (hardware LBs) | Hundreds of thousands req/sec (software overhead) |
| Latency added | < 0.1ms | 1-5ms per request (HTTP parsing, routing decision) |
| Stickiness | IP hash based | Cookie-based (more reliable) |
| Health checks | TCP connect | HTTP health check endpoint (/health) |
| AWS equivalent | NLB (Network Load Balancer) | ALB (Application Load Balancer) |
| Use case | TCP services, game servers, raw TCP, lowest latency | HTTP APIs, web apps, microservices routing |
Rule of Thumb: L7 for Web, L4 for Everything Else
Use L7 (ALB) for: REST APIs, web applications, microservices routing, WebSocket connections (ALB handles long-lived L7 connections well), GraphQL. Use L4 (NLB) for: TCP services that aren't HTTP (databases, game servers, custom protocols), workloads needing maximum throughput (NLB handles millions of requests per second), preserving the client IP without X-Forwarded-For headers, and non-HTTP protocols such as SMTP, FTP, and MQTT.
graph TB
C[Client] --> ALB[AWS ALB
Layer 7]
ALB -->|/api/*| API[API Servers
Node.js cluster]
ALB -->|/ws/*| WS[WebSocket Servers]
ALB -->|/static/*| S3[S3 + CloudFront
Static assets]
C2[Game Client] --> NLB[AWS NLB
Layer 4]
NLB -->|TCP:3000| GS1[Game Server 1
Stateful session]
NLB -->|TCP:3001| GS2[Game Server 2
Stateful session]

L7 (ALB) for HTTP routing by path; L4 (NLB) for non-HTTP protocols needing TCP passthrough.
Once the load balancer decides which server pool to send a request to, it must pick a specific server from that pool. The algorithm for this selection has a significant impact on performance and resource utilization.
The 5 Load Balancing Algorithms
| Algorithm | Best For | Worst For | AWS ALB Support |
|---|---|---|---|
| Round Robin | Short, uniform requests (REST APIs) | Variable duration requests (streaming) | Yes (default) |
| Least Requests | Mixed or long-duration requests | Uniform short requests (adds overhead) | Yes (Least Outstanding Requests) |
| IP Hash | Stateful protocols needing same server | NAT environments, clients behind proxies | No (use cookie-based sticky sessions instead) |
| Weighted | Canary deployments, mixed instance types | Identical server pools | Yes (via target group weights) |
| Random | Very high throughput (simple, fast) | Any workload where fairness matters | No (use round robin) |
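Weighted routing, the canary row above, is simple enough to sketch in a few lines. This is an illustrative model of how ALB weighted target groups split traffic, not their implementation:

```python
import random

def pick_weighted(target_groups: dict) -> str:
    """Pick a target group in proportion to its weight, modeling
    ALB weighted target groups (e.g. a 95/5 canary split)."""
    names = list(target_groups)
    weights = list(target_groups.values())
    return random.choices(names, weights=weights, k=1)[0]

random.seed(42)  # deterministic demo
counts = {"v1": 0, "v2-canary": 0}
for _ in range(10_000):
    counts[pick_weighted({"v1": 95, "v2-canary": 5})] += 1
print(counts)  # roughly 95% of requests land on v1, ~5% on the canary
```

Each request is routed independently, so the canary sees a statistically fair 5% sample of real traffic without any client-side changes.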
Round Robin + Variable Duration = Uneven Load
Imagine 10 servers and requests that take 1ms or 10 seconds (API calls vs. file uploads). Round-robin sends the 10-second upload to Server 1. The next 9,999 short requests are round-robined — but Server 1 still has that upload running and its CPU is at 100%. Least-connections would have avoided this by not sending more requests to a busy server. For any workload with variable request duration, use least-connections instead of round-robin.
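The scenario in the warning above can be made concrete with a toy simulation (illustrative numbers: one request per tick, every 30th request occupies its server for 50 ticks, the rest finish in 1 tick):

```python
import itertools

def simulate(pick, n_servers=3, ticks=300):
    """One new request per tick; every 30th request is 'slow' and
    occupies its server for 50 ticks. Returns how many requests
    were routed to a server that already had work in flight."""
    inflight = [[] for _ in range(n_servers)]  # remaining ticks per request
    rr = itertools.cycle(range(n_servers))
    sent_to_busy = 0
    for t in range(ticks):
        for q in inflight:                     # age out finished requests
            q[:] = [d - 1 for d in q if d > 1]
        i = pick(inflight, rr)
        if inflight[i]:                        # server already had work
            sent_to_busy += 1
        inflight[i].append(50 if t % 30 == 0 else 1)
    return sent_to_busy

def round_robin(inflight, rr):
    return next(rr)                            # blind rotation

def least_connections(inflight, rr):
    return min(range(len(inflight)), key=lambda i: len(inflight[i]))

print("round robin:", simulate(round_robin))             # many requests hit a busy server
print("least connections:", simulate(least_connections)) # routes around busy servers
```

Round-robin keeps feeding the server stuck with long requests; least-connections sees the in-flight count and avoids it.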
Health checks are how load balancers know whether a backend server is capable of serving requests. This sounds trivial but is one of the most commonly misconfigured aspects of load balancer setup. Wrong health check configuration causes two opposite failures: healthy servers being removed from rotation (false negatives), or failed servers continuing to receive traffic (false positives).
There are two types of health checks: active and passive.
Active vs. Passive Health Checks

Active checks: the load balancer sends its own periodic probe requests (e.g., GET /health every 15 seconds) and counts consecutive passes and failures. Passive checks: the load balancer observes real client traffic and marks a target degraded when its responses start failing or timing out. ALB and NLB health checks are active; passive detection appears in proxies such as HAProxy and Envoy (outlier detection).
The Health Check Misconfiguration That Causes Cascading Failures
Common mistake: health check endpoint /health returns 200 even when the database connection pool is exhausted. From the load balancer's perspective, the server is healthy. In reality, all requests are failing because the server can't reach the DB. The fix: deep health checks. Instead of return 200, check all dependencies: try db.ping(); try redis.ping(); only return 200 if all pass. With deep health checks, a server that can't reach the DB is correctly marked unhealthy and removed from rotation, directing traffic to healthy servers.
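A framework-agnostic sketch of the deep health check described above. The db.ping/redis.ping probes are assumed client methods; wire the probes to whatever clients your application actually uses:

```python
def deep_health(checks: dict) -> tuple:
    """Return (status_code, results) for a /health endpoint.

    A shallow check returns 200 unconditionally; this deep check runs
    one probe per critical dependency and returns 503 if any fails.
    `checks` maps a dependency name to a zero-argument probe that
    raises on failure (e.g. db.ping, redis.ping -- assumed clients).
    """
    results = {}
    for name, probe in checks.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"fail: {exc}"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results

def db_down():                       # simulated exhausted connection pool
    raise ConnectionError("pool exhausted")

assert deep_health({"db": lambda: None, "redis": lambda: None})[0] == 200
assert deep_health({"db": db_down, "redis": lambda: None})[0] == 503
```

Mount the function behind your framework's /health route; the 503 is what tells the load balancer to pull the server from rotation.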
AWS ALB Health Check Best Practice Configuration
1. Health check path: /health (not / — the root path often requires auth or redirects). Your /health endpoint should verify DB connectivity, Redis connectivity, and any other critical dependency. Return 200 if healthy, 503 if any dependency is down.
2. Healthy threshold: 2 consecutive checks must pass before a server is added back to rotation. Prevents flapping (server briefly healthy, then unhealthy again).
3. Unhealthy threshold: 3 consecutive checks must fail before removing a server. Prevents false positives from temporary network hiccups.
4. Interval: 15-30 seconds. Shorter intervals detect failures faster but add overhead. For critical APIs, use 10 seconds.
5. Timeout: 5 seconds. A health check must respond within 5 seconds or it counts as a failure.
6. Grace period (slow start): 120 seconds after instance launch before health checks begin. New JVM instances need time to warm up (JIT compilation); Node.js needs time to establish DB connections. Without a grace period, new instances fail health checks immediately and are never added to rotation.
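The healthy/unhealthy thresholds above amount to a small state machine per target. A sketch of how the consecutive-check counting behaves (modeled on ALB's behavior, not copied from it):

```python
class TargetHealth:
    """Per-target consecutive-check thresholds: 3 straight failures
    mark the target unhealthy; once out of rotation, 2 straight passes
    are required before it re-enters (anti-flapping)."""
    def __init__(self, healthy_threshold=2, unhealthy_threshold=3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.in_rotation = True
        self.streak = 0  # consecutive results in the current direction

    def record(self, passed: bool) -> bool:
        if self.in_rotation:
            self.streak = self.streak + 1 if not passed else 0
            if self.streak >= self.unhealthy_threshold:
                self.in_rotation, self.streak = False, 0
        else:
            self.streak = self.streak + 1 if passed else 0
            if self.streak >= self.healthy_threshold:
                self.in_rotation, self.streak = True, 0
        return self.in_rotation

t = TargetHealth()
t.record(False); t.record(False)   # two failures: still in rotation
assert t.in_rotation
t.record(False)                    # third consecutive failure: removed
assert not t.in_rotation
t.record(True)                     # one pass is not enough to re-add
assert not t.in_rotation
assert t.record(True)              # second consecutive pass: back in
```

The asymmetry is deliberate: removal tolerates brief hiccups, and re-admission demands sustained health.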
sequenceDiagram
participant LB as Load Balancer
participant S1 as Server 1 (healthy)
participant S2 as Server 2 (failing)
loop Every 15 seconds
LB->>S1: GET /health
S1-->>LB: 200 OK (DB=OK, Redis=OK)
LB->>S2: GET /health
S2-->>LB: 503 (DB connection failed)
end
Note over LB,S2: After 3 consecutive failures...
LB->>LB: Mark S2 unhealthy
LB->>LB: Remove S2 from rotation
Note over LB,S2: All traffic goes to healthy servers
Note over LB,S2: S2 DB connection recovered...
LB->>S2: GET /health
S2-->>LB: 200 OK
LB->>S2: GET /health (check 2)
S2-->>LB: 200 OK
LB->>LB: Mark S2 healthy, add back to rotation

Health check lifecycle: detect failure, remove from rotation, detect recovery, add back.
Sticky sessions (also called session persistence or session affinity) ensure that a client's requests always go to the same backend server. This is needed when the server holds client state that isn't shared (in-memory sessions, local files, WebSocket connections).
But sticky sessions have serious problems. If the designated server fails, all those clients lose their sessions and must re-authenticate. Sticky sessions also cause uneven load: if some users do very long operations, their server accumulates work while other servers sit idle. And IP-based stickiness breaks behind NAT (many users share one IP, all get routed to the same server).
Shopify Black Friday 2013: The Sticky Session Disaster
Shopify used sticky sessions based on a cookie that pinned users to specific server groups. During Black Friday 2013, one server group received a disproportionate share of high-traffic merchants. That group's servers hit 100% CPU while other groups sat at 20%. Because of sticky sessions, the load balancer couldn't rebalance — moving a user to a different server would break their session. 70% of traffic was effectively stranded on overloaded servers for 90 minutes while other servers were idle. The root fix: move sessions to Redis so any server can serve any user, eliminating the need for stickiness entirely.
| Sticky Session Type | Mechanism | Failure Mode | Use Case |
|---|---|---|---|
| IP-Based Sticky | Hash client IP → assign to same server | All users behind same NAT go to same server. Server fail → session lost. | Legacy systems, non-HTTP protocols, last resort |
| Cookie-Based Sticky (ALB) | LB sets AWSALB cookie with server ID. Client sends cookie on every request. | Server fail → cookie becomes invalid → session lost (still stateful server problem) | Short-term: stateful server that can't be fixed quickly |
| No Stickiness (Redis) | All servers share session state via Redis. Any server handles any request. | None: server fail → next request goes to different server with same Redis data. | Correct long-term solution for any stateless-able application |
| Application-Level (WebSocket) | WS connection sticky via NLB IP hash or ALB connection draining | Server fail → WS connection drops → client reconnects (acceptable for WS) | WebSocket specifically — long-lived connections that must be maintained |
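The NAT failure mode in the first row is easy to demonstrate: IP-hash stickiness is deterministic, so one shared public IP pins every user behind it to a single server. A sketch (addresses are illustrative):

```python
import hashlib

def ip_hash_pick(client_ip: str, n_servers: int) -> int:
    """IP-based stickiness: the same client IP always maps to
    the same server index."""
    digest = hashlib.sha256(client_ip.encode()).hexdigest()
    return int(digest, 16) % n_servers

# Same IP always lands on the same server...
assert ip_hash_pick("203.0.113.7", 4) == ip_hash_pick("203.0.113.7", 4)

# ...which is exactly the NAT problem: a whole office behind one
# public IP is pinned to one server, however many users it hides.
office = ["198.51.100.1"] * 500          # 500 users, one NAT address
assert len({ip_hash_pick(ip, 4) for ip in office}) == 1
```

Cookie-based stickiness sidesteps NAT because each browser carries its own cookie, but it still inherits the stateful-server failure mode from the table above.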
Connection Draining: Graceful Server Removal
When a server is removed from rotation (scale-in, deployment, failure), in-flight requests shouldn't be dropped. Connection draining (ALB calls it "deregistration delay") keeps the server in rotation for existing connections for up to 300 seconds while not sending new requests. This allows long-running requests (file uploads, streaming) to complete naturally. Set deregistration delay to 30-300 seconds depending on your longest acceptable request duration.
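Connection draining reduces to a simple rule per target: accept no new requests, and terminate only when in-flight work hits zero or the delay expires. A minimal sketch of that rule:

```python
import time

class DrainingTarget:
    """Deregistration-delay sketch (ALB default 300s, max 3600s):
    a draining target takes no new requests, and may terminate once
    in-flight work is done or the delay has expired."""
    def __init__(self, delay_seconds=300):
        self.delay = delay_seconds
        self.in_flight = 0
        self.draining_since = None

    def accepts_new(self) -> bool:
        return self.draining_since is None

    def start_draining(self) -> None:
        self.draining_since = time.monotonic()

    def may_terminate(self) -> bool:
        if self.draining_since is None:
            return False                      # not deregistering yet
        expired = time.monotonic() - self.draining_since >= self.delay
        return self.in_flight == 0 or expired

t = DrainingTarget()
t.in_flight = 2                               # e.g. two uploads in progress
t.start_draining()                            # instance marked for removal
assert not t.accepts_new() and not t.may_terminate()
t.in_flight = 0                               # uploads completed
assert t.may_terminate()                      # safe to shut down
```

The delay is a ceiling, not a wait: a target whose last request finishes early can be reclaimed immediately.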
SSL/TLS termination means the load balancer decrypts incoming HTTPS traffic and forwards unencrypted HTTP to backend servers. This offloads the CPU-intensive cryptographic operations from application servers, which is significant: TLS handshake costs ~2-10ms and 1-5% CPU overhead for modern ciphers.
Three SSL Termination Patterns

1. Terminate at the LB: decrypt HTTPS at the load balancer, forward plain HTTP to backends. Simplest, offloads crypto, and is the standard ALB setup.
2. TLS passthrough: the LB forwards encrypted TCP without decrypting (L4); each backend holds the certificate and terminates TLS itself. Preserves end-to-end encryption but gives up L7 routing.
3. TLS bridging (re-encryption): the LB terminates TLS, inspects and routes the request, then re-encrypts to the backend. Used where traffic must be encrypted in transit everywhere (compliance), at the cost of a second handshake.
AWS Certificate Manager: Free SSL at Scale
AWS ACM (Certificate Manager) provides free SSL certificates for any domain verified via DNS or email. Certificates auto-renew. Attach to ALB in seconds. This is why SSL termination at ALB is almost universal on AWS — zero cert management cost, auto-renewal, and CPU offload from app servers. One certificate can include up to 100 Subject Alternative Names (SANs) — domain names — so one cert covers api.example.com, app.example.com, static.example.com.
| Metric | Without SSL at LB (backend handles SSL) | With SSL Termination at ALB |
|---|---|---|
| TLS handshake CPU | Shared with application logic on every server | Handled entirely by ALB hardware accelerators |
| Certificate management | Deploy cert to every backend server | One cert on ALB, auto-renewed by ACM |
| L7 routing | Available (HTTP visible after decrypt) | Available + LB can see all headers for routing |
| Connection reuse | TLS sessions must be per-server | ALB maintains session resumption tickets centrally |
| Latency | +2-10ms per new TLS connection on each backend | ALB handles TLS, backends get HTTP (near-zero overhead) |
A load balancer that is itself a single point of failure defeats the purpose of having a load balancer. At production scale, the load balancer tier itself must be highly available. There are two patterns.
Active-Active vs. Active-Passive LB Pairs
Active-Active: two or more load balancers both handle traffic simultaneously. DNS round-robins between them. If one fails, DNS quickly routes all traffic to the surviving one. Higher throughput. More complex failover. Active-Passive: one primary LB handles all traffic. One standby LB is ready but idle. If primary fails, standby takes over the virtual IP (using keepalived/VRRP). Simpler failover but wasted standby capacity. AWS ALB is automatically active-active within a region: it runs multiple LB nodes across availability zones simultaneously. You don't manage this.
graph TB
DNS[DNS
example.com] -->|IP: 52.0.0.1| LB1[Load Balancer 1
AZ us-east-1a]
DNS -->|IP: 52.0.0.2| LB2[Load Balancer 2
AZ us-east-1b]
DNS -->|IP: 52.0.0.3| LB3[Load Balancer 3
AZ us-east-1c]
LB1 --> S1[Server 1
AZ 1a]
LB1 --> S2[Server 2
AZ 1b]
LB2 --> S1
LB2 --> S3[Server 3
AZ 1c]
LB3 --> S2
LB3 --> S3
note[AWS ALB automatically runs
active-active across AZs.
You get this free.]

AWS ALB is inherently active-active across AZs — multiple LB nodes in each availability zone.
Load Balancer HA: Key Configuration Choices

The main choices: active-active vs. active-passive, how failover is triggered (DNS rerouting vs. virtual IP takeover via keepalived/VRRP), and whether to run this tier yourself or use a managed load balancer such as ALB, which is active-active across AZs by default.
In practice, engineers choose between managed cloud load balancers (AWS ALB, NLB) and self-managed software load balancers (HAProxy, Nginx). Each has specific strengths.
| Load Balancer | Type | Max Throughput | Pricing Model | Best For |
|---|---|---|---|---|
| AWS ALB | L7 (HTTP/HTTPS/WebSocket) | ~1 million req/sec per region | Per hour + per LCU ($0.008/hr + $0.008/LCU-hr) | Most web apps, REST APIs, microservices |
| AWS NLB | L4 (TCP/UDP/TLS) | Millions of req/sec per AZ | Per hour + per NLCU ($0.006/hr + $0.006/NLCU-hr) | High-throughput non-HTTP, gaming, IoT, lowest latency |
| HAProxy | L4 + L7 (both modes) | Limited by hardware | Free (open source) | On-prem, Kubernetes ingress, custom routing logic |
| Nginx | L7 primarily | Limited by hardware | Free (or Nginx Plus: $3,500/yr) | Combined web server + reverse proxy, simple on-prem L7 routing |
| Cloudflare LB | L7 + Global Anycast | Unlimited (CDN-scale) | Per month ($50/mo+) | Global traffic, Anycast DNS, DDoS protection built in |
Real ALB Configuration for a Typical Microservices Setup
Listener rules (in priority order): 1. IF host = api.example.com AND path = /v2/* → Forward to target group: api-v2-servers (new version). 2. IF host = api.example.com AND path = /v1/* → Forward to target group: api-v1-servers (old version). 3. IF host = ws.example.com → Forward to target group: websocket-servers. 4. IF path = /health → Return 200 directly (fixed response — no backend needed). 5. Default → Forward to target group: main-api-servers. This single ALB handles versioning, WebSocket routing, and health checks — replacing what would require complex Nginx config.
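ALB evaluates listener rules in priority order, first match wins. A toy matcher mirroring the rule table above (the patterns and target-group names are the hypothetical ones from the example, not real AWS resources):

```python
from fnmatch import fnmatch

# (priority, host pattern, path pattern, target group)
RULES = [
    (1, "api.example.com", "/v2/*",   "api-v2-servers"),
    (2, "api.example.com", "/v1/*",   "api-v1-servers"),
    (3, "ws.example.com",  "*",       "websocket-servers"),
    (4, "*",               "/health", "fixed-200"),
    (5, "*",               "*",       "main-api-servers"),  # default rule
]

def route(host: str, path: str) -> str:
    """Evaluate rules in ascending priority; first match wins,
    like ALB listener rule evaluation."""
    for _prio, host_pat, path_pat, target in sorted(RULES):
        if fnmatch(host, host_pat) and fnmatch(path, path_pat):
            return target
    raise LookupError("no rule matched")  # unreachable with a default rule

assert route("api.example.com", "/v2/users") == "api-v2-servers"
assert route("ws.example.com", "/ws/game") == "websocket-servers"
assert route("app.example.com", "/index.html") == "main-api-servers"
```

Priority ordering matters: if the /v1 rule came before a broader api.example.com rule, requests would never reach the broader one.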
The choice between ALB and self-managed (HAProxy/Nginx) comes down to two factors: operational cost and feature requirements. ALB eliminates operational overhead (no server management, auto-scaling, multi-AZ HA is built-in) but costs money per request. HAProxy/Nginx are free but require dedicated infrastructure, HA configuration, and ongoing maintenance. For most teams on AWS, ALB is the right default — the operational savings outweigh the per-request cost at all but the highest traffic levels.
Practical ALB Setup Checklist
1. Create a target group: configure target type (instances, IPs, or Lambda), port, protocol, and health check settings (path, interval, thresholds).
2. Configure the health check: path = /health, interval = 15 sec, healthy threshold = 2, unhealthy threshold = 3, timeout = 5 sec. Create a /health endpoint in your application that verifies DB + Redis connectivity.
3. Create the ALB: select VPC, subnets (at least 2 AZs), security group (allow 443 inbound), and enable access logging to S3.
4. Configure the HTTPS listener on port 443: attach the ACM certificate, set the security policy to TLSv1.2 minimum (ELBSecurityPolicy-TLS-1-2-2017-01), and add a listener rule forwarding to the target group.
5. Add an HTTP → HTTPS redirect: port 80 listener with a redirect action to HTTPS. Never serve production traffic over plain HTTP.
6. Enable WAF: attach AWS WAF with AWS Managed Rules (AWSManagedRulesCommonRuleSet) as a minimum. Adds rate limiting and common attack protection with zero configuration.
7. Test: run curl -I https://yourdomain.com/health and verify a 200. Run a load test at 2x expected peak to confirm health checks and auto-scaling work together.
Load balancers fail in non-obvious ways. The failure mode isn't "LB stopped working" — it's "LB is working but routing traffic incorrectly" or "LB health checks are passing but backends are failing." These require specific debugging approaches.
The 7 Most Common Load Balancer Failure Modes

As covered throughout this lesson: (1) shallow health checks keep passing while the backend's dependencies are down; (2) a missing grace period makes new instances fail checks before warming up, so they never enter rotation; (3) sticky sessions strand traffic on overloaded servers; (4) the idle timeout kills long-running requests such as uploads; (5) 503 when every target in a group is unhealthy; (6) 504 when a backend exceeds the LB timeout; (7) 502 when a backend crashes or returns an invalid response mid-request.
The 503 "No Healthy Targets" Debugging Process
503 Service Unavailable from ALB usually means all targets in a target group are unhealthy. Debugging checklist: 1. Check target group in console — are targets "Healthy", "Unhealthy", or "Initial"? 2. If "Initial": grace period hasn't elapsed yet — wait 2 minutes. 3. If "Unhealthy": click the target to see the health check failure reason — timeout, bad HTTP code, connection refused? 4. If "connection refused": check security group on instances (does it allow TCP inbound from ALB security group on the app port?). 5. If HTTP 5xx from /health: check application logs for startup failure.
graph TD
Error[User sees 5xx error] --> Type{What error code?}
Type -->|503| T503[503: No Healthy Targets]
Type -->|504| T504[504: Backend Timeout]
Type -->|502| T502[502: Backend Invalid Response]
T503 --> Check1[Check Target Group health]
Check1 -->|Initial state| Wait[Wait for grace period
2-3 minutes]
Check1 -->|Unhealthy| Debug1[Check health check failure reason
in ALB console]
Debug1 -->|Connection refused| SG[Check security group:
Allow LB SG → app port]
Debug1 -->|HTTP 5xx from /health| AppLog[Check application logs
for startup errors]
T504 --> Check2[Check ALB idle timeout
vs request duration]
Check2 -->|Request > timeout| IncreaseTimeout[Increase ALB idle timeout
from 60s to 300s]
Check2 -->|Timeout OK| BEDown[Backend process crashed?
Check application logs]
T502 --> AppCrash[Backend crashed mid-response
Check for OOM, panics, exceptions]

Load balancer error debugging decision tree: 503 vs 504 vs 502.
Load balancers appear in virtually every system design interview because every scaled system needs one. Interviewers specifically test: (1) Do you know L4 vs L7 and when to use each? (2) Can you explain health check configuration and why shallow checks are dangerous? (3) Do you understand why sticky sessions are an anti-pattern? (4) Can you describe a failure mode (504, 503) and debug it? Strong candidates mention connection draining, health check grace periods, and at least one algorithm beyond round-robin.
Common questions:
Your API has a /upload endpoint that accepts large file uploads (average 500MB, takes 3-10 minutes). Users are reporting that uploads fail partway through. The rest of the API works fine. What's causing this and how do you fix it?
The failure is almost certainly an ALB idle timeout. AWS ALB has a default idle timeout of 60 seconds. A 3-10 minute upload with no traffic flowing during processing (the upload is in-progress) will be killed by the LB after 60 seconds of inactivity. Fix: increase ALB idle timeout to 600-900 seconds (10-15 minutes) for this use case. Also: ensure the upload endpoint uses multipart uploads or chunked transfer encoding so the LB sees continuous traffic, preventing idle timeout. For very large files, the better architecture is to generate a pre-signed S3 URL and have the client upload directly to S3 (bypassing the LB entirely) — then the LB only needs to handle the short URL generation request, not the multi-minute upload.
You have 10 backend servers behind an ALB. One server is running slow (p99 latency = 5 seconds vs. 100ms normal). The round-robin LB doesn't know this. What happens to your users and how do you fix it?
With round-robin, 10% of requests will be routed to the slow server and experience 5-second latency. Users unlucky enough to hit this server will see degraded performance. The health check (if it's just a TCP check or a trivially fast /health endpoint) won't detect this because the server is still responding — just slowly. Fix #1: switch from round-robin to least-outstanding-requests (LOR) algorithm on ALB. LOR tracks in-flight requests per target — the slow server will quickly accumulate more in-flight requests than healthy servers, causing the LB to prefer the fast servers. Fix #2: implement a deep health check at /health that measures response time of a representative DB query — if > 500ms, return 503, which removes the slow server from rotation. Fix #3: configure ALB anomaly mitigation (enabled by default on newer ALBs) which automatically routes fewer requests to targets with elevated error rates or latency.
An interviewer asks you to design a load balancing layer for a real-time multiplayer game server. Players connect via WebSocket and must be routed to the same server for the duration of their game session. Sessions last 20-60 minutes. There are 10,000 concurrent game sessions. How do you design this?
This requires L4 (NLB) with IP-hash-based routing, not L7 (ALB). Reasons: (1) WebSocket connections are long-lived TCP connections — once established, all data flows over that connection. An L7 LB terminates HTTP connections per-request, which doesn't map to long-lived game sessions cleanly. NLB at L4 maintains persistent TCP connections. (2) The "same server" requirement: use IP-hash at L4, which consistently routes each client IP to the same game server. (3) Health check: NLB should use TCP health checks with a game-specific protocol echo. If the server doesn't respond, TCP connection fails and NLB stops routing new connections. (4) Connection draining: when a game server is being removed (deployment, failure), NLB should drain connections with a 3600-second (1 hour) timeout to allow in-progress games to finish. (5) Scale: 10,000 concurrent sessions with 60-minute duration = 2.78 sessions started/second. Much lower write throughput than average API — this is manageable on 5-10 game server instances. The real scale concern is memory per session on each game server, not LB throughput.
💡 Analogy
Think of a load balancer as a smart GPS router, not a traffic cop. A traffic cop just waves cars in one direction. A smart GPS router knows the current state of every road (health checks), chooses the fastest route to your destination right now (least connections), handles road closures automatically by rerouting (failover), and verifies you're authorized to be on the road before letting you through (WAF, authentication). The GPS router is constantly aware of the entire network state — not just the current intersection.
⚡ Core Idea
Load balancers do three things simultaneously: distribute load using an algorithm (round-robin, least-connections), monitor health of backends (active health checks), and handle transparent failover when backends fail. L4 is fast and protocol-agnostic. L7 is slower but can route by HTTP content. For most web applications, L7 (ALB) is the right choice. Sticky sessions are a symptom of stateful servers — the cure is externalizing state to Redis, not better stickiness.
🎯 Why It Matters
A misconfigured load balancer is one of the most common causes of production outages at companies that have otherwise correct architectures. Shallow health checks mean unhealthy servers keep receiving traffic. Wrong cooldowns on auto-scaling mean your LB routes to instances before they're ready. Sticky sessions without connection draining mean deployments log users out. Every load balancer configuration decision has a failure mode you must understand.