How to do safe progressive delivery with Istio weight-based routing -- and the 99/1 typo that sent all traffic to the canary.
Write VirtualService canary configs. Know the weight sum requirement. Understand header-based vs weight-based canary. Know how to roll back by setting canary weight to 0.
Set up Flagger with Prometheus metrics gates. Design the canary promotion criteria (error rate threshold, latency SLO). Implement dark launch (header-based) as the pre-production validation step.
Define the progressive delivery standard for the platform. Integrate Flagger into CI/CD as the deployment mechanism. Own the rollback SLA (time from alert to 0% canary). Design session-affinity handling for stateful canary scenarios.
VirtualService applied -- weights: v1: 1, v2: 99 (typo -- should be 90/10)
CRITICAL: SRE reviews dashboard -- misreads axis, approves canary
WARNING: Error rate alert fires -- v2 has a bug affecting 99% of requests
CRITICAL: VirtualService weight set to v1:100, v2:0 -- rollback complete
Postmortem: weight validation + automated error-rate gate added to deploy pipeline
The question this raises
How does weight-based canary routing work in Istio, what are the failure modes, and how do you build an automated promotion/rollback gate so a human misreading a dashboard cannot cause an outage?
Your team is deploying a canary with Istio. You apply a VirtualService with v1: 90%, v2: 10% weights. After 10 minutes, monitoring shows v2 receiving 0% of requests. The VirtualService is applied, and istioctl proxy-status shows every proxy SYNCED. What is the most likely cause?
Lesson outline
Istio canary is weight-based at the L7 proxy layer -- not at the pod level
Traditional Kubernetes canary (rolling update) controls traffic percentage by changing pod count ratios. Istio canary is independent of pod count -- you can send 1% of traffic to a single v2 pod while 99 v1 pods run. The weight is in the VirtualService and is applied by every Envoy proxy in the mesh, not by kube-proxy.
Traditional K8s canary (pod count ratio): 10% canary = 1 v2 pod : 9 v1 pods. Problem: you must scale both versions; you cannot decouple traffic % from pod count.

Istio canary (VirtualService weight): 1 v2 pod + 99 v1 pods with VirtualService weights v1: 99, v2: 1 -> 1% of all requests from all pods go to that single v2 pod, regardless of pod count. Weights are enforced at the proxy layer.

Canary progression:
- Stage 0: v2 deployed but not in VS (0% traffic -- warm up)
- Stage 1: VS weight v1:99 v2:1 (1% -- verify basics)
- Stage 2: VS weight v1:90 v2:10 (10% -- watch error rate, latency)
- Stage 3: VS weight v1:50 v2:50 (50% -- confidence building)
- Stage 4: VS weight v1:0 v2:100 (100% -- canary becomes stable)
- Stage 5: Remove VS, rename v2 pods to v1, remove old v1 Deployment
Header-based canary for internal testing
Use for: Route traffic to canary only when a specific header is present (x-canary: true). This allows internal QA to test v2 while 100% of production traffic goes to v1. Safe first step before any weight-based split.
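A minimal header-match rule might look like the fragment below (the `payment-svc` host and subset names are assumptions, matching the full example later in this lesson). The matched route carries no weights, so 100% of requests with the header go to v2 and everything else falls through to v1:

```yaml
# Sketch of a header-based (dark launch) routing rule.
# Requests carrying "x-canary: true" reach the v2 subset;
# all other traffic stays on v1.
http:
- match:
  - headers:
      x-canary:
        exact: "true"
  route:
  - destination:
      host: payment-svc
      subset: v2
- route:
  - destination:
      host: payment-svc
      subset: v1
# Internal QA can then exercise v2 with, e.g.:
#   curl -H "x-canary: true" http://payment-svc/
```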
Automated canary promotion flow with Flagger
1. Flagger watches a Deployment for new images -- detects when spec.template changes
2. Flagger creates the canary Deployment (v2 pods), sets VirtualService to 0% canary weight
3. Flagger starts analysis: increments weight by 10% every interval (e.g. 5 minutes)
4. At each step: checks Prometheus metrics -- error rate < threshold? latency < threshold?
5. If metrics pass: increment weight toward 100%
6. If metrics fail: set weight to 0%, scale down canary, alert (automated rollback)
7. At 100%: promote canary to stable -- update original Deployment, remove canary Deployment
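The metric-gate decision in the middle of that loop can be sketched in a few lines of plain code. This is an illustrative sketch, not Flagger's implementation; the thresholds (99% success rate, 500ms P99, 5 consecutive failures, 10% steps capped at 50%) mirror the example values used elsewhere in this lesson:

```python
# Sketch of Flagger-style promotion logic (illustrative, not Flagger's code):
# each analysis tick checks the metric gates and either advances the canary
# weight, holds, or rolls back after `threshold` consecutive failures.

def next_weight(current, step=10, max_weight=50):
    """Advance the canary weight by one step, capped at max_weight."""
    return min(current + step, max_weight)

def evaluate(metrics, failures, threshold=5):
    """Return ('promote'|'hold'|'rollback', updated failure count).

    metrics: dict with 'success_rate' (%) and 'p99_ms' (latency).
    failures: consecutive failed checks so far.
    """
    passed = metrics["success_rate"] >= 99 and metrics["p99_ms"] < 500
    if passed:
        return "promote", 0          # gates pass: advance and reset failures
    failures += 1
    if failures >= threshold:
        return "rollback", failures  # too many bad readings: automated rollback
    return "hold", failures          # transient blip: keep weight, keep watching

decision, fails = evaluate({"success_rate": 99.7, "p99_ms": 210}, failures=0)
print(decision)          # promote
print(next_weight(10))   # 20
```

The key design point is that a human never approves a step: promotion is gated on measured SLOs, which is exactly what the misread dashboard in the incident above could not guarantee.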
Use Flagger or Argo Rollouts to automate canary -- never manual weight changes
Manual weight editing is error-prone (see the 99/1 incident above). Automated progressive-delivery tools add metric-based promotion gates that verify error-rate and latency SLOs at each step before advancing, and they roll back automatically on SLO violation, taking human error out of the critical path.
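Even before adopting such a tool, the pre-apply check the postmortem calls for is only a few lines. A hypothetical sketch, where the route dicts stand in for the parsed `route` section of the VirtualService YAML:

```python
# Sketch: pre-apply weight validation that would have caught the 99/1 typo.
# Checks both failure modes: weights not summing to 100, and the stable
# subset holding a minority of traffic (the wrong-direction typo).

def validate_weights(routes, stable_subset="v1"):
    """routes: list of {'subset': str, 'weight': int} entries."""
    total = sum(r["weight"] for r in routes)
    if total != 100:
        return f"FAIL: weights sum to {total}, not 100"
    by_subset = {r["subset"]: r["weight"] for r in routes}
    if by_subset.get(stable_subset, 0) < 50:
        return f"FAIL: stable subset {stable_subset} has minority traffic"
    return "OK"

print(validate_weights([{"subset": "v1", "weight": 90},
                        {"subset": "v2", "weight": 10}]))  # OK
print(validate_weights([{"subset": "v1", "weight": 1},
                        {"subset": "v2", "weight": 99}]))  # FAIL: stable subset v1 has minority traffic
```

Wired into CI as a pipeline step, this turns the incident's human review into a hard gate.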
```yaml
# Manual canary VirtualService (validate weights before applying)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-svc
  namespace: production
spec:
  hosts:
  - payment-svc
  http:
  # Step 1: header-based routing for internal testing (0% to production)
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: payment-svc
        subset: v2
  # Step 2: weight-based canary (increment over time)
  - route:
    - destination:
        host: payment-svc
        subset: v1
      weight: 90   # ALWAYS verify: v1 + v2 weights = 100
    - destination:
        host: payment-svc
        subset: v2
      weight: 10   # weights must sum to 100
```
```yaml
# Flagger Canary resource -- automated progressive delivery
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment-svc
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-svc
  progressDeadlineSeconds: 600
  service:
    port: 8080
    targetPort: 8080
  analysis:
    interval: 5m     # check metrics every 5 minutes
    threshold: 5     # fail after 5 consecutive bad metric readings
    maxWeight: 50    # max canary weight during analysis (50%)
    stepWeight: 10   # increment by 10% each step
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99      # must have >= 99% success rate
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500     # P99 must be < 500ms
      interval: 30s
    webhooks:
    - name: load-test   # run load test during analysis
      url: http://flagger-loadtester.test/
      timeout: 5s
```
```shell
# Check canary status
kubectl describe canary payment-svc -n production
# Look for: Status: Progressing -> Succeeded or Failed

# Watch canary progress
kubectl get canary payment-svc -n production -w

# Manual rollback (override Flagger)
kubectl annotate canary payment-svc -n production flagger.app/rollback=true

# Validate weights manually before applying
# Weights must sum to 100 -- this is NOT enforced by Istio validation
# Always double-check: v1 weight + v2 weight = 100
```
Blast radius of canary configuration errors
Wrong canary weight direction -- 1% stable, 99% canary:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
spec:
  http:
  - route:
    - destination:
        subset: v1
      weight: 1    # TYPO: v1 (stable) gets 1%
    - destination:
        subset: v2
      weight: 99   # TYPO: v2 (canary) gets 99%
# Intended: v1: 90, v2: 10
# Impact: 99% of users on the unvalidated canary
```

Correct canary weights -- always verify sum = 100 and direction:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
spec:
  http:
  - route:
    - destination:
        subset: v1   # stable version -- majority traffic
      weight: 90
    - destination:
        subset: v2   # canary version -- small slice
      weight: 10
# Validation checklist:
# 1. v1 weight (stable) >> v2 weight (canary)
# 2. weights sum to 100
# 3. verify with: kubectl get vs <name> -o jsonpath='{.spec.http[0].route}'
# Better: use Flagger to manage weights automatically
```

Always verify VirtualService weights immediately after applying: `kubectl get virtualservice <name> -o jsonpath='{.spec.http[0].route[*].weight}'`. The numbers should reflect your intent. Adopt automated progressive delivery (Flagger/Argo Rollouts) to eliminate human error from weight management entirely.
| Strategy | How | Risk | Rollback speed | Best for |
|---|---|---|---|---|
| K8s rolling update only | Gradual pod replacement | All users exposed during rollout | Minutes (scale back old) | Stateless services with safe changes |
| Istio weight-based | VirtualService weight split | Manual typo risk, slow detection | Seconds (change weight) | Small % exposure before wider rollout |
| Flagger automated | Automatic weight + metric gates | Low -- gates prevent bad promotion | Automatic (SLO violation) | Production-grade continuous delivery |
| Header-based (dark launch) | Only testers see new version | None for production users | Instant (remove header rule) | Pre-production validation with real backend |
| Traffic mirroring | Copy production traffic to canary | None -- canary responses not served | N/A | Load testing canary before any real traffic |
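The traffic-mirroring row in the table deserves a concrete shape. A minimal sketch, reusing this lesson's `payment-svc` names as assumptions: v1 serves 100% of real traffic while Envoy sends a fire-and-forget copy of each request to v2, whose responses are discarded:

```yaml
# Sketch: mirror production traffic to the v2 subset while v1
# continues to serve every real response.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-svc
spec:
  hosts:
  - payment-svc
  http:
  - route:
    - destination:
        host: payment-svc
        subset: v1
      weight: 100
    mirror:
      host: payment-svc
      subset: v2
    mirrorPercentage:
      value: 100.0   # mirror 100% of requests to the canary
```

Because mirrored responses are never returned to users, this is a zero-risk way to load-test v2 against real traffic before the first weight-based stage.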
Canary deploy with Istio
📖 What the exam expects
Istio enables weight-based traffic splitting via VirtualService. You can route a percentage of traffic to a canary version without changing pod counts. Weights are enforced at the Envoy proxy layer.
Asked in any senior platform, reliability, or DevOps interview. "How would you do a safe canary deploy?" is standard. Follow-up: "What could go wrong with manual weight changes?"
Strong answer: Immediately mentions automated promotion gates, knows the weight-sum-to-100 requirement, has a story about a canary incident, understands session affinity as a complication.
Red flags: Only knows rolling update as canary strategy, has not heard of Flagger or Argo Rollouts, does not mention automated rollback.