© 2026 TheSimplifiedTech. All rights reserved.

High availability (HA) & resilience

Keep systems up when things fail: redundancy, failover, and simple targets.

🎯 Key Takeaways

  • HA = system stays up when parts fail; use redundancy and failover.
  • RTO = how long you can be down; RPO = how much data loss is acceptable.
  • Spread across AZs/regions and use health checks + automatic failover.

What is high availability?

High availability (HA) = your system stays up even when parts fail. One server dies? Traffic goes to another. One datacenter has a blackout? Another AZ or region takes over.

Resilience = the system bends instead of breaking. It recovers from failures without you having to fix it by hand every time.

Redundancy

More than one of something. One fails, the other keeps serving.

Failover

Primary fails → traffic switches to backup automatically. No manual flip.

RTO & RPO

RTO = how long you can be down. RPO = how much data loss you accept.

Multi-AZ / multi-region

Run or replicate in more than one place. One AZ down? Another takes over.

HA = system stays up when parts fail. Set RTO and RPO, then design redundancy and failover to meet them.

Redundancy and failover

Redundancy = you have more than one of something. Two servers, two regions, two copies of data. If one fails, the other keeps serving.
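To see why a second copy helps so much, here is a minimal sketch of the arithmetic. It assumes copies fail independently of each other, which real systems only approximate (two servers in the same rack can fail together). The function name and the 99% figure are illustrative, not from a specific provider.

```python
def combined_availability(single: float, copies: int) -> float:
    """Chance that at least one of `copies` replicas is up.

    Assumption: replicas fail independently. Shared power, network,
    or software bugs break this assumption in practice.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # All copies must be down at once for the system to be down.
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # Illustrative figure: each copy is up 99% of the time.
    for n in (1, 2, 3):
        print(f"{n} copies: {combined_availability(0.99, n):.4%}")
```

Under that assumption, going from one 99% server to two brings the combined figure to roughly 99.99%. This is also why spreading copies across AZs matters: it keeps their failures closer to independent.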

Failover = when the primary fails, traffic or work automatically switches to the backup. No manual flip. Health checks detect failure; the system reroutes.
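The health-check-then-reroute loop can be sketched in a few lines. This is a toy model, not how any particular load balancer is implemented: `Backend`, `FailoverRouter`, and `failure_threshold` are hypothetical names, and the health probe is just a function you pass in.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Backend:
    name: str
    is_healthy: Callable[[], bool]  # hypothetical probe, e.g. an HTTP ping


class FailoverRouter:
    """Routes to the first backend (in priority order) that is not marked failed."""

    def __init__(self, backends: list[Backend], failure_threshold: int = 3) -> None:
        self.backends = backends
        self.failure_threshold = failure_threshold
        # Consecutive failed checks per backend.
        self.failures = {b.name: 0 for b in backends}

    def run_health_checks(self) -> None:
        for b in self.backends:
            if b.is_healthy():
                self.failures[b.name] = 0   # one good check clears the count
            else:
                self.failures[b.name] += 1

    def active(self) -> Optional[Backend]:
        for b in self.backends:
            if self.failures[b.name] < self.failure_threshold:
                return b
        return None  # everything is down


if __name__ == "__main__":
    state = {"primary_up": True}
    router = FailoverRouter(
        [Backend("primary", lambda: state["primary_up"]),
         Backend("standby", lambda: True)],
        failure_threshold=2,
    )
    router.run_health_checks()
    print(router.active().name)      # primary is healthy

    state["primary_up"] = False      # simulate the primary failing
    router.run_health_checks()
    router.run_health_checks()       # second failed check crosses the threshold
    print(router.active().name)      # traffic now goes to standby
```

The threshold is a design choice: requiring several failed checks in a row avoids flipping on a single slow response, but it also adds to the time before failover starts. This sketch also fails back as soon as the primary passes one check, which some systems deliberately avoid.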

Cloud platforms make this easier: you can spread resources across multiple availability zones (AZs), managed load balancers route traffic away from unhealthy instances, and auto-scaling groups replace failed instances automatically.

RTO and RPO: simple targets

RTO (Recovery Time Objective) = how long you can afford to be down. "We need to be back within 1 hour." That drives how fast failover and restore must be.

RPO (Recovery Point Objective) = how much data loss you can accept. "We can lose at most 5 minutes of data." That drives how often you replicate or back up.

Set RTO and RPO based on business impact. Then design redundancy, backups, and failover to meet them.
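One way to check a design against its targets is to compare worst-case numbers. The sketch below is a simplification with hypothetical names (`RecoveryPlan` and its fields): it treats worst-case data loss as one full replication or backup interval, and worst-case downtime as detection time plus recovery time.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryPlan:
    """Hypothetical description of how one system replicates and recovers."""
    replication_interval: timedelta  # how often data is copied or backed up
    detection_time: timedelta        # time for health checks to notice the failure
    recovery_time: timedelta         # time to fail over or restore once detected

    def worst_case_data_loss(self) -> timedelta:
        # A failure just before the next copy loses about one full interval.
        return self.replication_interval

    def worst_case_downtime(self) -> timedelta:
        return self.detection_time + self.recovery_time

    def meets(self, rto: timedelta, rpo: timedelta) -> bool:
        return (self.worst_case_downtime() <= rto
                and self.worst_case_data_loss() <= rpo)


if __name__ == "__main__":
    # Targets from the examples above: back within 1 hour, lose at most 5 minutes.
    rto = timedelta(hours=1)
    rpo = timedelta(minutes=5)

    # Illustrative plans; the timings are made up for the example.
    nightly = RecoveryPlan(timedelta(hours=24), timedelta(minutes=5), timedelta(hours=3))
    replica = RecoveryPlan(timedelta(minutes=1), timedelta(minutes=1), timedelta(minutes=2))

    print("nightly backups meet targets:", nightly.meets(rto, rpo))
    print("standby replica meets targets:", replica.meets(rto, rpo))
```

With these made-up timings, nightly backups miss both targets while a frequently updated standby meets them. Tighter targets generally call for more frequent replication and faster, more automated failover.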

Real-world scenario: database in one AZ

Expert scenario

Scenario: Your app runs in two AZs, but the database lives in only one. That AZ goes down.

Decision: The app tier fails over to the other AZ, but the database is unreachable. You are down until that AZ recovers or you restore the database from backup somewhere else, and a restore loses any data written since the last backup. To get HA, you need a multi-AZ database (replica in another AZ with automatic failover), or you accept a longer RTO and restore from backups.
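A quick way to spot this kind of gap is to list where each tier runs and ask what is left if one AZ disappears. The sketch below uses a plain dictionary as a stand-in for a real inventory; the tier names and AZ names ("az-a", "az-b") are hypothetical.

```python
def single_az_tiers(deployment: dict[str, set[str]]) -> list[str]:
    """Tiers that run in fewer than two AZs (single points of failure)."""
    return sorted(tier for tier, azs in deployment.items() if len(azs) < 2)


def tiers_down_if_az_fails(deployment: dict[str, set[str]], failed_az: str) -> list[str]:
    """Tiers with no remaining AZ once `failed_az` is gone."""
    return sorted(
        tier for tier, azs in deployment.items()
        if not (azs - {failed_az})
    )


if __name__ == "__main__":
    # The scenario above: app in two AZs, database in one.
    deployment = {
        "app": {"az-a", "az-b"},
        "database": {"az-a"},
    }
    print("single-AZ tiers:", single_az_tiers(deployment))
    print("down if az-a fails:", tiers_down_if_az_fails(deployment, "az-a"))
    print("down if az-b fails:", tiers_down_if_az_fails(deployment, "az-b"))
```

The system is only as available as its least redundant tier: here the app survives either AZ failing, but losing az-a still takes the whole service down because the database has nowhere else to run.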

How this might come up in interviews

System design and SRE interviews: expect to discuss redundancy, failover, and how you would meet given RTO/RPO targets.

Common questions:

  • How would you design a highly available system?
  • What is the difference between RTO and RPO?
  • How do you achieve failover for a database?

Key takeaways

  • HA = system stays up when parts fail; use redundancy and failover.
  • RTO = how long you can be down; RPO = how much data loss is acceptable.
  • Spread across AZs/regions and use health checks + automatic failover.
Before you move on: can you answer these?

What is the difference between RTO and RPO?

RTO is recovery time (how long until you are back up); RPO is recovery point (how much data loss you can accept).
