The Simplified Tech

© 2026 TheSimplifiedTech. All rights reserved.

Interactive Explainer

Traffic Splitting & Canary Deployments

How to do safe progressive delivery with Istio weight-based routing -- and the 99/1 typo that sent 99% of production traffic to the canary.

Relevant for: Mid-level, Senior, Staff
Why this matters at your level
Mid-level

Write VirtualService canary configs. Know the weight sum requirement. Understand header-based vs weight-based canary. Know how to roll back by setting canary weight to 0.

Senior

Set up Flagger with Prometheus metrics gates. Design the canary promotion criteria (error rate threshold, latency SLO). Implement dark launch (header-based) as the pre-production validation step.

Staff

Define the progressive delivery standard for the platform. Integrate Flagger into CI/CD as the deployment mechanism. Own the rollback SLA (time from alert to 0% canary). Design session-affinity handling for stateful canary scenarios.


~4 min read
Incident timeline: Data Plane Misroute -- Weight Typo -- 100% Traffic to Buggy Canary
  • T+0 (CRITICAL): VirtualService applied -- weights: v1: 1, v2: 99 (typo -- should be 90/10)
  • T+5m (WARNING): SRE reviews dashboard -- misreads axis, approves canary
  • T+40m (CRITICAL): Error rate alert fires -- v2 has a bug affecting 99% of requests
  • T+45m: VirtualService weight set to v1:100, v2:0 -- rollback complete
  • T+1h: Postmortem: weight validation + automated error-rate gate added to deploy pipeline

Incident at a glance:

  • Traffic hitting buggy v2 (intended: 10%)
  • User impact duration
  • Digit typo causing the incident
  • Automated gates that would have caught it

The question this raises

How does weight-based canary routing work in Istio, what are the failure modes, and how do you build an automated promotion/rollback gate so a human misreading a dashboard cannot cause an outage?

Test your assumption first

Your team is deploying a canary with Istio. You apply a VirtualService with v1: 90%, v2: 10% weights. After 10 minutes, monitoring shows v2 receiving 0% of requests. The VirtualService is applied, and istioctl proxy-status shows the proxies as SYNCED. What is the most likely cause?

Lesson outline

How canary deployment works with Istio

Istio canary is weight-based at the L7 proxy layer -- not at the pod level

Traditional Kubernetes canary (rolling update) controls traffic percentage by changing pod count ratios. Istio canary is independent of pod count -- you can send 1% of traffic to a single v2 pod while 99 v1 pods run. The weight is in the VirtualService and is applied by every Envoy proxy in the mesh, not by kube-proxy.


Traditional K8s canary (pod count ratio):
  10% canary = 1 v2 pod : 9 v1 pods
  Problem: you must scale both versions, cannot decouple traffic % from pod count

Istio canary (VirtualService weight):
  1 v2 pod + 99 v1 pods
  VirtualService: v1 weight: 99, v2 weight: 1
  -> 1% of all requests from all pods go to that single v2 pod
  Regardless of pod count -- weights are enforced at proxy layer

Canary progression:
  Stage 0: v2 deployed but not in VS  (0% traffic -- warm up)
  Stage 1: VS weight v1:99 v2:1       (1% -- verify basics)
  Stage 2: VS weight v1:90 v2:10      (10% -- watch error rate, latency)
  Stage 3: VS weight v1:50 v2:50      (50% -- confidence building)
  Stage 4: VS weight v1:0 v2:100      (100% -- canary becomes stable)
  Stage 5: Remove VS, rename v2 pods to v1, remove old v1 Deployment
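The decoupling can be seen in a quick simulation. This is an illustrative Python sketch (not Istio code): route each request by weighted random choice, the way every Envoy sidecar applies the VirtualService weights, and the canary share tracks the weights no matter how many pods back each subset.

```python
import random

def pick_version(weights, rng):
    """Route one request by weighted random choice, as a proxy would:
    probability is proportional to VirtualService weight, never pod count."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions])[0]

rng = random.Random(42)
weights = {"v1": 99, "v2": 1}  # VirtualService weights, not pod counts

n = 100_000
v2_hits = sum(pick_version(weights, rng) == "v2" for _ in range(n))
print(f"v2 share: {v2_hits / n:.2%}")  # ~1% of traffic, whatever the pod ratio
```

Whether one pod or a hundred backs the v2 subset, the share of requests stays pinned to the configured weight.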

Header-based canary for internal testing

Use for: routing traffic to the canary only when a specific header is present (x-canary: true). This lets internal QA exercise v2 while 100% of production traffic stays on v1 -- a safe first step before any weight-based split.
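One detail worth noting: Istio evaluates a VirtualService's http routes in order and uses the first match, so the header rule must appear above the weighted fallback. A minimal Python sketch of that first-match-wins selection (illustrative only; `select_route` is a hypothetical name, not an Istio API):

```python
def select_route(headers):
    """First-match-wins, like Istio's ordered http routes: the
    x-canary header rule is checked before the weighted fallback."""
    if headers.get("x-canary") == "true":
        return {"v2": 100}           # dark launch: internal testers only
    return {"v1": 100, "v2": 0}      # everyone else stays on stable

print(select_route({"x-canary": "true"}))  # {'v2': 100}
print(select_route({}))                    # {'v1': 100, 'v2': 0}
```

If the rules were reversed, the match clause would never be reached and testers would see stable like everyone else.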

Canary automation with Flagger

Automated canary promotion flow with Flagger

1. Flagger watches a Deployment for new images -- detects when spec.template changes
2. Flagger creates the canary Deployment (v2 pods), sets VirtualService to 0% canary weight
3. Flagger starts analysis: increments weight by 10% every interval (e.g. 5 minutes)
4. At each step: checks Prometheus metrics -- error rate < threshold? latency < threshold?
5. If metrics pass: increment weight toward 100%
6. If metrics fail: set weight to 0%, scale down canary, alert (automated rollback)
7. At 100%: promote canary to stable -- update original Deployment, remove canary Deployment

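The steps above can be condensed into a small state machine. A Python sketch of the control loop (illustrative only -- `run_analysis` and `metrics_ok` are hypothetical names, not Flagger's API; Flagger's real loop evaluates Prometheus queries):

```python
def run_analysis(metrics_ok, step_weight=10, max_weight=50, threshold=5):
    """Model of the analysis loop: step the canary weight up while metrics
    pass; roll back to 0% after `threshold` consecutive failed checks."""
    weight, failures = 0, 0
    while weight < max_weight:
        if metrics_ok(weight):
            failures = 0
            weight = min(weight + step_weight, max_weight)
        else:
            failures += 1
            if failures >= threshold:
                return ("rolled_back", 0)   # weight reset, canary scaled down
    return ("promoted", 100)                # maxWeight reached -> full promotion

print(run_analysis(lambda w: True))                # healthy canary -> promoted
print(run_analysis(lambda w: False, threshold=5))  # failing canary -> rolled back
```

The key property is that both outcomes are reached without a human editing weights: promotion and rollback are decided by the metric checks alone.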

Use Flagger or Argo Rollouts to automate canary -- never manual weight changes

Manual weight editing is error-prone (see the 99/1 incident above). Automated progressive delivery tools add metric-based promotion gates that verify error-rate and latency SLOs at each step before advancing, and they roll back automatically on SLO violation -- taking manual weight edits, and the typos that come with them, out of the deployment path.

canary-virtual-service.yaml

# Manual canary VirtualService (validate weights before applying)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-svc
  namespace: production
spec:
  hosts:
  - payment-svc
  http:
  # Step 1: header-based routing for internal testing (0% to production)
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: payment-svc
        subset: v2
  # Step 2: weight-based canary (increment over time)
  - route:
    - destination:
        host: payment-svc
        subset: v1
      weight: 90   # ALWAYS verify: v1 + v2 weights = 100
    - destination:
        host: payment-svc
        subset: v2
      weight: 10   # weights must sum to 100
flagger-canary.yaml

# Flagger Canary resource -- automated progressive delivery
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment-svc
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-svc
  progressDeadlineSeconds: 600
  service:
    port: 8080
    targetPort: 8080
  analysis:
    interval: 5m       # check metrics every 5 minutes
    threshold: 5       # fail after 5 consecutive bad metric readings
    maxWeight: 50      # max canary weight during analysis (50%)
    stepWeight: 10     # increment by 10% each step
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99        # must have >= 99% success rate
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500       # P99 must be < 500ms
      interval: 30s
    webhooks:
    - name: load-test  # run load test during analysis
      url: http://flagger-loadtester.test/
      timeout: 5s
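With the analysis settings shown (stepWeight: 10, maxWeight: 50, interval: 5m), the happy path takes five increments, about 25 minutes, before promotion. A small helper to do that arithmetic -- `analysis_schedule` is a hypothetical name, and this assumes exactly one weight step per interval, the simple case:

```python
def analysis_schedule(step_weight, max_weight, interval_min):
    """Canary weight at each interval until maxWeight (happy path only)."""
    steps = []
    weight, t = 0, 0
    while weight < max_weight:
        t += interval_min
        weight = min(weight + step_weight, max_weight)
        steps.append((t, weight))
    return steps

for t, w in analysis_schedule(step_weight=10, max_weight=50, interval_min=5):
    print(f"T+{t}m: canary weight {w}%")
# Reaches 50% at T+25m; promotion follows if every check passed
```

This is worth computing up front: interval and stepWeight together set how long a bad release can run before the gates catch it.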
kubectl

# Check canary status
kubectl describe canary payment-svc -n production
# Look for: Status: Progressing -> Succeeded or Failed

# Watch canary progress
kubectl get canary payment-svc -n production -w

# Manual rollback (override Flagger)
kubectl annotate canary payment-svc -n production flagger.app/rollback=true

# Validate weights manually before applying
# Weights must sum to 100 -- this is NOT enforced by Istio validation
# Always double-check: v1 weight + v2 weight = 100

What breaks in production

Blast radius of canary configuration errors

  • Weights do not sum to 100 — Istio accepts it -- behavior is proportional distribution but may not match intent. Always validate.
  • VirtualService before DestinationRule subsets — Subset reference invalid until DR applied -- 503 for the canary percentage during the gap
  • Canary pod count insufficient — 1 canary pod handling 10% of 10,000 RPS = 1,000 RPS. If v2 has a memory leak, the single pod crashes fast.
  • No automated rollback gate — Manual weight changes + human dashboard review = typo incidents and delayed detection
  • Sticky sessions and canary — Session-sticky traffic bypasses weight rules -- users are locked to one version regardless of weight changes
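Two of these failure modes are plain arithmetic. A quick sketch (hypothetical helper names, illustrative numbers): when weights do not sum to 100, traffic splits in proportion to the weights, which may not match intent; and a canary's per-pod load is the weight fraction of total RPS divided by the canary pod count.

```python
def traffic_shares(weights):
    """Proportional split when weights do not sum to 100."""
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}

def canary_rps_per_pod(total_rps, canary_pct, canary_pods):
    """Load landing on each canary pod at a given canary weight."""
    return total_rps * canary_pct / 100 / canary_pods

# Weights summing to 120 instead of 100: v2 gets 25%, not an intended 20%
print(traffic_shares({"v1": 90, "v2": 30}))

# 10% of 10,000 RPS on a single canary pod = 1,000 RPS on one pod
print(canary_rps_per_pod(10_000, 10, 1))
```

The second calculation is the capacity check to run before every weight increase: will the canary pods survive the traffic you are about to send them?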

Wrong canary weight direction -- 1% stable, 99% canary

Bug
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
spec:
  http:
  - route:
    - destination:
        subset: v1
      weight: 1    # TYPO: v1 (stable) gets 1%
    - destination:
        subset: v2
      weight: 99   # TYPO: v2 (canary) gets 99%
# Intended: v1: 90, v2: 10
# Impact: 99% of users on the unvalidated canary
Fix
# Correct canary weights -- always verify sum = 100 and direction
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
spec:
  http:
  - route:
    - destination:
        subset: v1    # stable version -- majority traffic
      weight: 90
    - destination:
        subset: v2    # canary version -- small slice
      weight: 10
# Validation checklist:
# 1. v1 weight (stable) >> v2 weight (canary)
# 2. weights sum to 100
# 3. verify with: kubectl get vs <name> -o jsonpath='{.spec.http[0].route}'
# Better: use Flagger to manage weights automatically
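The validation checklist above is easy to automate before kubectl apply. A sketch of such a pre-apply check in Python -- `validate_canary_routes` is a hypothetical helper operating on the `.spec.http[*].route` entries, and it assumes the v1 subset is stable and v2 is the canary:

```python
def validate_canary_routes(routes, stable="v1", canary="v2"):
    """Pre-apply checks for a weighted canary route list
    (the .spec.http[*].route entries of a VirtualService)."""
    weights = {r["destination"]["subset"]: r.get("weight", 0) for r in routes}
    errors = []
    if sum(weights.values()) != 100:
        errors.append(f"weights sum to {sum(weights.values())}, not 100")
    if weights.get(stable, 0) < weights.get(canary, 0):
        errors.append(f"{canary} weight exceeds {stable} -- direction looks flipped")
    return errors

good = [{"destination": {"subset": "v1"}, "weight": 90},
        {"destination": {"subset": "v2"}, "weight": 10}]
typo = [{"destination": {"subset": "v1"}, "weight": 1},
        {"destination": {"subset": "v2"}, "weight": 99}]

print(validate_canary_routes(good))  # []
print(validate_canary_routes(typo))  # catches the flipped direction
```

Wired into CI as a pre-apply step, a check like this would have rejected the 99/1 manifest before it ever reached the mesh.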

Always verify VirtualService weights immediately after applying: kubectl get virtualservice <name> -o jsonpath='{.spec.http[0].route[*].weight}'. The numbers should reflect your intent. Adopt automated progressive delivery (Flagger/Argo Rollouts) to eliminate human error from weight management entirely.

Decision guide: canary strategy

Do you have automated metric-based promotion gates (Flagger, Argo Rollouts)?
  • Yes: Use automated canary with error rate and latency SLO gates -- eliminates manual weight management and human error
  • No: Use header-based routing first (x-canary: true) for internal validation before any weight-based split to production users

Canary strategies compared

Strategy | How | Risk | Rollback speed | Best for
K8s rolling update only | Gradual pod replacement | All users exposed during rollout | Minutes (scale back old) | Stateless services with safe changes
Istio weight-based | VirtualService weight split | Manual typo risk, slow detection | Seconds (change weight) | Small % exposure before wider rollout
Flagger automated | Automatic weight + metric gates | Low -- gates prevent bad promotion | Automatic (SLO violation) | Production-grade continuous delivery
Header-based (dark launch) | Only testers see new version | None for production users | Instant (remove header rule) | Pre-production validation with real backend
Traffic mirroring | Copy production traffic to canary | None -- canary responses not served | N/A | Load testing canary before any real traffic

Exam Answer vs. Production Reality


Canary deploy with Istio

📖 What the exam expects

Istio enables weight-based traffic splitting via VirtualService. You can route a percentage of traffic to a canary version without changing pod counts. Weights are enforced at the Envoy proxy layer.


How this might come up in interviews

Asked in any senior platform, reliability, or DevOps interview. "How would you do a safe canary deploy?" is standard. Follow-up: "What could go wrong with manual weight changes?"

Common questions:

  • What is the difference between a Kubernetes rolling update and an Istio canary?
  • How would you automate canary promotion with Istio?
  • What metric thresholds would you use to decide whether to promote a canary?
  • What happens if VirtualService weights do not sum to 100?
  • How would you handle session-sticky traffic in a canary deploy?

Strong answer: Immediately mentions automated promotion gates, knows the weight-sum-to-100 requirement, has a story about a canary incident, understands session affinity as a complication.

Red flags: Only knows rolling update as canary strategy, has not heard of Flagger or Argo Rollouts, does not mention automated rollback.

Related concepts

Explore topics that connect to this one.

  • VirtualServices & Traffic Routing
  • DestinationRules & Load Balancing
  • Deployment strategies

Suggested next

Often learned after this topic.

Circuit Breakers, Retries & Timeouts

