
Interactive Explainer

Istio Performance Tuning

How to right-size Envoy sidecar resources, tune istiod at scale, and avoid the access log disk-fill that crashes hosts at high RPS.

Relevant for: Senior, Staff
Why this matters at your level
Senior

Right-size sidecar resources for different service profiles. Know access log impact. Tune istiod for clusters with 200+ services. Monitor per-sidecar memory with kubectl top.

Staff

Define platform-level performance budgets for sidecar resources. Build cost allocation models including sidecar overhead. Own the observability overhead budget (tracing, logging, metrics cardinality).

Incident timeline: Data Plane Overload -- Debug Access Logs at 500 RPS -- Disk Fills, Sidecar OOM

  • T+0 — DEBUG access logging enabled across all 200 pods for routing debug
  • T+2h — WARNING: Node disk at 95% -- Envoy buffering logs in memory
  • T+3h — CRITICAL: Sidecar memory OOM kills begin -- pods restarting
  • T+4h — Access logging disabled -- disk fills stop -- system stabilizes
  • T+4h — Postmortem: access log level must be governed by policy, not ad-hoc

Questions this incident should prompt you to quantify:

  • Log lines per second at 500 RPS x 200 pods
  • Time to disk fill
  • Alerts fired before the disk filled
  • The config setting that caused the incident

The question this raises

What are the performance characteristics of Envoy at scale, how do you right-size sidecar resources, and what settings have the most impact on overhead?

Test your assumption first

Your cluster has 500 pods. After enabling mesh-wide access logging to debug an issue, node disk utilization goes from 30% to 95% in 4 hours. The cluster handles 200 RPS. What is the primary cause?

Lesson outline

Envoy performance characteristics

Envoy adds < 1ms p99 latency and ~50MB memory at baseline

A well-tuned Envoy sidecar adds < 1ms p99 latency to request processing, uses ~50MB memory at baseline, and about 50m CPU cores at 1000 RPS. These numbers grow with configuration complexity: each additional VirtualService, DestinationRule, and AuthorizationPolicy adds to Envoy's memory footprint and evaluation overhead.
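These baseline figures translate directly into a fleet-wide budget. A minimal sketch, assuming the ~50MB memory and ~50m CPU (0.05 cores) per-sidecar baselines quoted above; the helper name and the 1000-pod fleet size are illustrative:

```python
def sidecar_fleet_budget(pods: int, mem_mb_per_sidecar: float = 50.0,
                         cpu_cores_per_sidecar: float = 0.05) -> dict:
    """Estimate cluster resources consumed by Envoy sidecars alone.

    The CPU figure is valid around 1000 RPS per sidecar; memory is the
    baseline only and grows with config size (routes, policies).
    """
    return {
        "memory_gb": pods * mem_mb_per_sidecar / 1024,
        "cpu_cores": pods * cpu_cores_per_sidecar,
    }

# A 1000-pod mesh reserves roughly 49GB of memory and ~50 cores for
# sidecars before any application work is done.
print(sidecar_fleet_budget(pods=1000))
```

This is why the Staff-level guidance above calls for cost allocation models that include sidecar overhead: at fleet scale, the proxy tax is a line item of its own.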

Key factors that drive Envoy overhead

  • Access log verbosity — DISABLED: 0 overhead. ERROR: minimal. INFO: moderate. DEBUG: catastrophic at high RPS
  • Number of routes/clusters in xDS config — Each service adds to Envoy's route table. 1000 services = large xDS config = more memory, slower route lookup
  • ext_authz filter (OPA/Authorino) — Each request makes an HTTP call to the policy engine -- adds 2-10ms per request when in hot path
  • Tracing at high sample rate — 100% trace sampling at 1000 RPS = 1000 spans/second -- Envoy CPU and Jaeger storage both increase significantly
  • Stats collection cardinality — High-cardinality metrics labels (URL per path, user ID) create millions of Prometheus time series -- cardinality explosion
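To see why DEBUG access logging is rated catastrophic, estimate raw log volume. A rough sketch using the incident's cluster-wide 500 RPS; the lines-per-request and bytes-per-line figures are illustrative assumptions, not measured values:

```python
# Estimate access-log disk growth. DEBUG-level logging emits several
# lines per request (connection, route, filter, response details),
# not the single line that INFO-style access logs produce.

def log_growth_gb_per_hour(total_rps: float, lines_per_request: float,
                           bytes_per_line: float) -> float:
    bytes_per_sec = total_rps * lines_per_request * bytes_per_line
    return bytes_per_sec * 3600 / 1e9

# INFO-style: one ~300-byte structured line per request
info = log_growth_gb_per_hour(500, 1, 300)
# DEBUG-style: assume ~20 lines of ~500 bytes per request
debug = log_growth_gb_per_hour(500, 20, 500)
print(f"INFO: {info:.2f} GB/h, DEBUG: {debug:.2f} GB/h")
```

Under these assumptions DEBUG produces ~18 GB/hour, so a node disk fills within hours, consistent with the incident timeline above.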

Right-sizing sidecar resources

How this concept changes your thinking

Situation: Default sidecar resource configuration

  • Before: 64MB memory request, no limit -- at 1000 pods, 64GB is reserved for sidecars; under debug logging, no limit means OOM kills the node.
  • After: Profile each service: low-traffic services get 32MB/64MB (request/limit), high-traffic services get 128MB/256MB. Monitor with kubectl top pod --containers to tune.

Situation: istiod at scale (500+ pods)

  • Before: Default istiod: 500m CPU / 2Gi memory -- at 500 services, memory exceeds 2Gi; config pushes take 20+ seconds.
  • After: Scale istiod memory to 4-8Gi, set PILOT_PUSH_THROTTLE and PILOT_DEBOUNCE_AFTER. Run 2 replicas for HA.
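PILOT_DEBOUNCE_AFTER and PILOT_DEBOUNCE_MAX exist because config churn arrives in bursts. A toy model of the batching behavior (this sketches only the debounce logic, not istiod's actual implementation):

```python
# Toy model of istiod debouncing: events arriving within `debounce_after`
# seconds of the previous one merge into the same push, capped at
# `debounce_max` seconds measured from the first event in the batch.

def count_pushes(event_times, debounce_after=0.5, debounce_max=5.0):
    pushes = 0
    i, n = 0, len(event_times)
    while i < n:
        batch_start = event_times[i]
        last = batch_start
        i += 1
        # Extend the batch while events keep arriving quickly
        while i < n and (event_times[i] - last) < debounce_after \
                and (event_times[i] - batch_start) < debounce_max:
            last = event_times[i]
            i += 1
        pushes += 1
    return pushes

# A storm of 20 config changes 100ms apart collapses into ONE xDS push
storm = [t * 0.1 for t in range(20)]
print(count_pushes(storm))  # 1
```

Without debouncing, each of those 20 changes would trigger a full push to every connected sidecar, which is what makes config storms expensive at 500+ pods.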

sidecar-resource-tuning.yaml

# Per-workload sidecar resource tuning via annotations
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-svc
spec:
  template:
    metadata:
      annotations:
        # Small service (< 100 RPS)
        # sidecar.istio.io/proxyCPU: "50m"
        # sidecar.istio.io/proxyMemory: "64Mi"
        # sidecar.istio.io/proxyCPULimit: "200m"
        # sidecar.istio.io/proxyMemoryLimit: "128Mi"

        # High-traffic service (> 1000 RPS)
        sidecar.istio.io/proxyCPU: "200m"
        sidecar.istio.io/proxyMemory: "128Mi"
        sidecar.istio.io/proxyCPULimit: "1000m"
        sidecar.istio.io/proxyMemoryLimit: "512Mi"

        # Concurrency (Envoy worker threads -- default: 0 = num CPUs)
        # For services with high req/s, increase concurrency
        proxy.istio.io/config: |
          concurrency: 4
performance-settings.yaml

# Mesh-level performance settings
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio
  namespace: istio-system
data:
  mesh: |
    # Access logging: disabled in production, or service-specific via Telemetry API
    accessLogFile: ""          # empty = disabled (default since Istio 1.14)
    accessLogEncoding: JSON

    # Stats: reduce cardinality by removing high-cardinality default labels
    # Consider disabling the request_path label if you have many unique paths
    defaultConfig:
      tracing:
        sampling: 1.0          # sampling is a percentage: 1.0 = 1% -- NOT 100%

# istiod performance tuning for large clusters
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istiod
  namespace: istio-system
spec:
  template:
    spec:
      containers:
      - name: discovery
        env:
        - name: PILOT_PUSH_THROTTLE
          value: "100"         # max concurrent xDS pushes
        - name: PILOT_DEBOUNCE_AFTER
          value: "500ms"       # batch rapid changes -- wait 500ms before push
        - name: PILOT_DEBOUNCE_MAX
          value: "5s"          # max debounce wait
        resources:
          requests:
            cpu: 1000m
            memory: 4Gi
          limits:
            cpu: 4000m
            memory: 8Gi
kubectl

# Profile sidecar resource usage for each service
kubectl top pod --all-namespaces --containers | grep istio-proxy | sort -k5 -rn

# Find sidecars with no resource limits (memory risk)
kubectl get pods --all-namespaces -o json | \
  python3 -c "
import json, sys
data = json.load(sys.stdin)
for pod in data['items']:
    for c in pod['spec'].get('initContainers', []) + pod['spec']['containers']:
        if c['name'] == 'istio-proxy':
            limits = c.get('resources', {}).get('limits', {})
            if not limits.get('memory'):
                print(pod['metadata']['namespace'] + '/' + pod['metadata']['name']
                      + ': no memory limit on istio-proxy')
"

# Check istiod push performance
kubectl exec -n istio-system deploy/istiod -- \
  curl -s localhost:15014/metrics | grep "pilot_push_duration"

# Enable access logging for a single workload (for debugging)
kubectl apply -f - <<EOF
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: debug-logs
  namespace: production
spec:
  selector:
    matchLabels:
      app: payment-svc    # ONE service only
  accessLogging:
  - providers:
    - name: envoy
EOF
# Remove the Telemetry resource after debugging is done

What breaks in production

Blast radius of performance misconfiguration

  • DEBUG access logs at high RPS — 500 RPS x 200 pods = disk fill in 2 hours -- sidecar OOM, node instability
  • No memory limit on sidecars — Memory leak in Envoy config (many routes) grows unchecked -- eventually OOM kills node
  • istiod with default resources at 1000+ services — istiod OOM kills during config storms -- all mesh management stops
  • 100% trace sampling at high RPS — Envoy CPU +20%, Jaeger storage fills, network overhead from span reporting
  • ext_authz on every request — Each request calls OPA -- 5ms per call at 1000 RPS means ~5 request-seconds of added in-flight work every second (roughly 5 cores if the check is CPU-bound)
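The ext_authz figure is an application of Little's law: added per-request latency times request rate gives the extra in-flight work carried every second. A quick sanity check with the numbers from the list above:

```python
# Little's law: in-flight work L = arrival rate (RPS) x added latency (s).
# 5ms of ext_authz latency at 1000 RPS keeps ~5 requests permanently
# waiting on the policy engine -- roughly 5 cores' worth if the policy
# check is CPU-bound on the OPA side.

def inflight_from_authz(rps: float, added_latency_ms: float) -> float:
    """Extra concurrent in-flight requests created by the authz hop."""
    return rps * added_latency_ms / 1000.0

print(inflight_from_authz(1000, 5))  # 5.0
```

The same arithmetic applies to any synchronous hot-path dependency: halving the policy engine's latency halves the concurrency it pins down.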

Mesh-wide DEBUG logging in production

Bug
# Telemetry resource enabling DEBUG access logging mesh-wide
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: debug-all
  namespace: istio-system  # mesh-wide!
spec:
  accessLogging:
  - providers:
    - name: envoy
    filter:
      expression: 'response.code >= 200'  # all responses
# At 500 RPS cluster-wide: ~1M log lines/minute
# Disk fills within hours
# No memory limit = sidecar OOM when log buffer grows
Fix
# Production: access logging DISABLED by default
# Enable only for specific services when debugging
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: production-logging
  namespace: istio-system
spec:
  accessLogging:
  - providers:
    - name: envoy
    disabled: true           # disable access logging globally

# When debugging: enable for ONE workload with error filter
---
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: debug-service-x     # scoped to one service
  namespace: production
spec:
  selector:
    matchLabels:
      app: service-x
  accessLogging:
  - providers:
    - name: envoy
    filter:
      expression: 'response.code >= 500'  # errors only
# Remove this Telemetry resource after debugging

In production, access logging should be DISABLED by default and enabled selectively for debugging via namespace-scoped or workload-scoped Telemetry resources. Always filter to errors only (response.code >= 500) to reduce volume. Set a calendar reminder to remove debug logging resources -- they are routinely forgotten and cause disk issues days later.
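The postmortem action -- access log level governed by policy, not ad-hoc -- can be partly automated. A hedged sketch of such a check (the function name and flagging rules are illustrative; feed it the items array from kubectl get telemetries.telemetry.istio.io -A -o json):

```python
# Flag Telemetry resources that enable access logging mesh-wide
# (istio-system namespace, no workload selector) without a filter.

def risky_telemetries(items):
    risky = []
    for t in items:
        ns = t["metadata"]["namespace"]
        has_selector = "selector" in t.get("spec", {})
        for al in t.get("spec", {}).get("accessLogging", []):
            if al.get("disabled"):
                continue                      # explicitly disabled: safe
            if not al.get("providers"):
                continue                      # no provider: nothing emitted
            unfiltered = "filter" not in al
            if ns == "istio-system" and not has_selector and unfiltered:
                risky.append(t["metadata"]["name"])
    return risky

mesh_wide = {"metadata": {"name": "debug-all", "namespace": "istio-system"},
             "spec": {"accessLogging": [{"providers": [{"name": "envoy"}]}]}}
scoped = {"metadata": {"name": "debug-svc", "namespace": "production"},
          "spec": {"selector": {"matchLabels": {"app": "x"}},
                   "accessLogging": [{"providers": [{"name": "envoy"}]}]}}
print(risky_telemetries([mesh_wide, scoped]))  # ['debug-all']
```

Run as a CI gate or admission check, this turns the calendar-reminder advice above into an enforced policy.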

Decision guide: sidecar sizing

Does the service handle > 1000 RPS or have many (> 100) upstream dependencies?

  • Yes — Set sidecar: 200m CPU / 128Mi memory request, 1000m / 512Mi limits. Increase concurrency to 4. Monitor with kubectl top pod --containers.
  • No — Ask next: is the service < 50 RPS with few upstream dependencies?
      • Yes — Set sidecar: 50m CPU / 32Mi memory request, 200m / 128Mi limits. Keep default concurrency (auto).
      • No — Use defaults: 100m CPU / 64Mi memory request, 500m / 256Mi limits. Profile under peak load after 1 week.
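If you template sidecar annotations for many services, the decision guide above can be encoded directly. A sketch using the guide's thresholds and sizes as starting points (the cutoff of 10 for "few" upstream dependencies is an assumption the guide leaves unspecified):

```python
def sidecar_sizing(rps: float, upstream_deps: int) -> dict:
    """Map a service traffic profile to sidecar sizes per the guide."""
    if rps > 1000 or upstream_deps > 100:
        # High-traffic or fan-out-heavy service
        return {"cpu_req": "200m", "mem_req": "128Mi",
                "cpu_lim": "1000m", "mem_lim": "512Mi", "concurrency": 4}
    if rps < 50 and upstream_deps <= 10:
        # Low-traffic service with few dependencies
        return {"cpu_req": "50m", "mem_req": "32Mi",
                "cpu_lim": "200m", "mem_lim": "128Mi", "concurrency": None}
    # Everything else: defaults, then profile under peak load
    return {"cpu_req": "100m", "mem_req": "64Mi",
            "cpu_lim": "500m", "mem_lim": "256Mi", "concurrency": None}

print(sidecar_sizing(2000, 5)["mem_lim"])  # 512Mi
print(sidecar_sizing(10, 3)["mem_req"])    # 32Mi
```

The output maps directly onto the sidecar.istio.io/proxy* annotations shown in sidecar-resource-tuning.yaml above.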

Performance budget reference

Config | Memory overhead | CPU overhead at 1k RPS | p99 latency added | Risk
Baseline (no policies) | ~50MB | ~50m | < 0.5ms | Low -- just proxy overhead
+ Access logging (INFO) | +5MB | +20m | < 0.5ms | Low -- low-volume structured logs
+ Access logging (DEBUG) | +20MB+ | +100m | < 1ms | HIGH -- disk fill at high RPS
+ ext_authz (OPA) | +10MB | +200m | +2-5ms per call | Medium -- OPA in request hot path
+ Tracing (100% sample) | +5MB | +150m | < 1ms | Medium -- Jaeger storage and CPU
+ 1000 routes in xDS | +30-50MB | +30m | < 0.5ms | Low, but memory grows with fleet size
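The table rows compose: a sidecar's expected footprint is the baseline plus each enabled feature. A small sketch that sums the table's figures (midpoints are assumed where the table gives a range, e.g. 40MB for the 30-50MB xDS row):

```python
# Per-feature sidecar overhead from the budget table above.
# Memory in MB, CPU in millicores at ~1000 RPS.
OVERHEAD = {
    "baseline":       (50, 50),
    "logging_info":   (5, 20),
    "logging_debug":  (20, 100),
    "ext_authz":      (10, 200),
    "tracing_100pct": (5, 150),
    "routes_1000":    (40, 30),   # midpoint of the 30-50MB range
}

def sidecar_footprint(features):
    """Sum memory (MB) and CPU (m) overheads for the enabled features."""
    mem = sum(OVERHEAD[f][0] for f in features)
    cpu = sum(OVERHEAD[f][1] for f in features)
    return mem, cpu

mem, cpu = sidecar_footprint(["baseline", "ext_authz", "routes_1000"])
print(f"{mem}MB memory, {cpu}m CPU")  # 100MB memory, 280m CPU
```

Comparing this sum against your sidecar memory limit is a quick way to spot workloads sized below their configured feature set.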

Exam Answer vs. Production Reality


Envoy sidecar overhead

📖 What the exam expects

Envoy adds approximately 50MB memory and 50m CPU at baseline for 1000 RPS. p99 latency overhead is typically < 1ms.


How this might come up in interviews

This topic comes up in performance-focused platform interviews. "What is the overhead of running Istio at 1000 pods?" and "How would you debug high memory usage in sidecars?" are common entry points.

Common questions:

  • What is the memory overhead of running an Envoy sidecar?
  • How does access logging affect Envoy performance?
  • How would you scale istiod for a 1000-service cluster?
  • What is the risk of running 100% trace sampling in production?
  • How would you investigate high memory usage in an Envoy sidecar?

Strong answer: Knows specific overhead numbers, has right-sized sidecars per service profile, has a story about access log disk fill or tracing storage overflow.

Red flags: Cannot estimate sidecar overhead, thinks 100% trace sampling is fine, has not considered disk impact of access logging.

Related concepts

Explore topics that connect to this one.

  • Istio Advanced Routing
  • Istio Troubleshooting
  • Resource Requests & Limits: CPU Throttle and OOM Kill

Suggested next

Often learned after this topic.

Istio Troubleshooting
