Circuit breaker, retries, timeouts, and bulkhead to avoid cascading failure and improve reliability.
When one dependency (DB, external API) slows down or fails, your app might hold connections or threads waiting. If many requests pile up, the app runs out of resources and fails too. Failure cascades. Resilience patterns limit the blast radius and give the system a chance to recover.
Key ideas: fail fast (timeout), retry with care (backoff, limit), stop calling when the dependency is down (circuit breaker), isolate resources (bulkhead).
Every outbound call (DB, HTTP, queue) should have a timeout. If the dependency does not respond in time, release the connection and return an error. Without timeouts, one slow dependency can exhaust your connection pool or threads and take down the app. Set timeouts to a value that allows normal success but fails before the client gives up (e.g. 2–10 seconds for APIs).
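As a rough illustration of the timeout rule above, here is a minimal sketch using Python's standard socket module. The function name fetch_with_timeout, the DependencyTimeout exception, and the 2-second default are assumptions made for this example. Note that the timeout here applies to each socket operation (connect, send, receive), not to the call as a whole.

```python
import socket


class DependencyTimeout(Exception):
    """Hypothetical error type: the dependency did not answer in time."""


def fetch_with_timeout(host: str, port: int, payload: bytes,
                       timeout_s: float = 2.0) -> bytes:
    """Send payload and read one reply, giving up after timeout_s per operation."""
    try:
        # create_connection sets the timeout on the socket before connecting,
        # so the same limit also applies to sendall() and recv() below.
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.sendall(payload)
            return sock.recv(4096)
        # Leaving the "with" block closes the socket, releasing the connection.
    except socket.timeout as exc:
        # Fail fast with a clear error instead of holding the caller indefinitely.
        raise DependencyTimeout(
            f"no response from {host}:{port} within {timeout_s}s"
        ) from exc
```

A caller would typically catch DependencyTimeout and return an error (or a fallback) to its own client rather than waiting longer.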
Retry only transient failures (e.g. 503, connection reset). Use exponential backoff: wait 1s, then 2s, then 4s, with random jitter so that many clients do not retry in lockstep (the thundering herd problem). Cap the number of retries (e.g. 3). Do not retry non-transient errors such as validation failures and most 4xx responses; 408 and 429 are the usual exceptions. Make operations idempotent so that retries do not duplicate side effects.
Libraries such as Resilience4j (Java), Polly (.NET), and retry packages for Go can wrap calls with configurable retry and backoff.
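The retry rules above can be sketched in a few lines of Python. This is an illustration, not a replacement for a library: TransientError and retry_with_backoff are hypothetical names, and the "full jitter" variant used here (a random wait between 0 and the backoff ceiling of 1s, 2s, 4s) is one of several reasonable jitter strategies.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. 503, connection reset)."""


def retry_with_backoff(call, max_retries=3, base_delay_s=1.0, sleep=time.sleep):
    """Run call(); retry on TransientError with exponential backoff and jitter."""
    attempt = 0
    while True:
        try:
            return call()
        except TransientError:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last failure
            # Backoff ceiling doubles each attempt: 1s, 2s, 4s for base 1s.
            ceiling = base_delay_s * (2 ** attempt)
            # Full jitter: wait a random time in [0, ceiling] so that
            # many clients do not retry at the same instant.
            sleep(random.uniform(0, ceiling))
            attempt += 1
        # Any other exception type is treated as non-transient and
        # propagates immediately without a retry.
```

The sleep parameter is injectable only so the function can be exercised without real waiting; in normal use the default time.sleep applies.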
A circuit breaker has states: closed (calls go through), open (calls fail immediately; do not call the dependency), half-open (after a cooldown, try one call; on success close, on failure reopen). When the dependency is failing, opening the circuit stops hammering it and fails fast for callers. After a period, one probe checks if the dependency recovered.
Use a circuit breaker around external APIs and optionally around the DB if you see connection exhaustion. Tune threshold (e.g. 5 failures in 10s) and cooldown (e.g. 30s) to your context.
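To make the three states concrete, here is a minimal single-threaded sketch of a circuit breaker in Python. The class and exception names are hypothetical, the defaults mirror the example values above (5 failures in 10s, 30s cooldown), and a production version would need locking so that only one probe runs in the half-open state under concurrency.

```python
import time


class CircuitOpenError(Exception):
    """Hypothetical error raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, window_s=10.0, cooldown_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.state = "closed"
        self.failures = []   # timestamps of recent failures
        self.opened_at = 0.0

    def call(self, fn):
        now = self.clock()
        if self.state == "open":
            if now - self.opened_at < self.cooldown_s:
                # Fail fast: do not touch the dependency during the cooldown.
                raise CircuitOpenError("circuit open; failing fast")
            self.state = "half-open"  # cooldown elapsed: let one probe through
        try:
            result = fn()
        except Exception:
            self._record_failure(now)
            raise
        if self.state == "half-open":
            # The probe succeeded: resume normal traffic.
            self.state = "closed"
            self.failures.clear()
        return result

    def _record_failure(self, now):
        if self.state == "half-open":
            self._open(now)  # the probe failed: reopen and restart the cooldown
            return
        # Keep only failures inside the sliding window, then add this one.
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self.opened_at = now
        self.failures.clear()
```

Callers wrap each outbound call as breaker.call(lambda: client.get(...)), where client is whatever HTTP or DB client the app already uses, and treat CircuitOpenError like any other fast failure.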
A bulkhead isolates resources: for example, a thread pool or connection pool dedicated to one dependency. If that dependency slows down, it can exhaust only its own pool; the rest of the app uses separate pools and keeps working. Without a bulkhead, one slow dependency can consume all threads and starve other work.
Apply bulkheads to: HTTP client pools per backend, DB connection pools per service, or bounded queues per task type.
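One lightweight way to approximate a bulkhead, sketched here in Python, is a semaphore per dependency that caps how many calls may be in flight at once. The Bulkhead class, the BulkheadFullError exception, and the dependency names in the usage note are hypothetical; a dedicated thread pool or connection pool per dependency achieves the same isolation.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Hypothetical error raised when a dependency's slots are all in use."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        # Each dependency gets its own semaphore, so a slow dependency
        # can only exhaust its own slots.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: reject immediately rather than queueing
        # callers behind a dependency that is already saturated.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            yield
        finally:
            self._slots.release()
```

In use, the app would create one instance per dependency, for example Bulkhead("payments", 10) and Bulkhead("search", 20), and wrap each outbound call in "with payments.slot():". When payments is saturated, its callers get BulkheadFullError while search calls proceed normally.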