How to right-size Envoy sidecar resources, tune istiod at scale, and avoid the access log disk-fill that crashes hosts at high RPS.
Right-size sidecar resources for different service profiles. Know access log impact. Tune istiod for clusters with 200+ services. Monitor per-sidecar memory with kubectl top.
Define platform-level performance budgets for sidecar resources. Build cost allocation models including sidecar overhead. Own the observability overhead budget (tracing, logging, metrics cardinality).
- DEBUG access logging enabled across all 200 pods for routing debug
- Node disk at 95% -- Envoy buffering logs in memory
- WARNING: sidecar memory OOM kills begin -- pods restarting
- CRITICAL: access logging disabled -- disk fill stops -- system stabilizes
- Postmortem: access log level must be governed by policy, not ad hoc
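The postmortem's "policy, not ad-hoc" conclusion can be enforced mechanically. Below is a hypothetical validator (the function name and the rule itself are ours, not from Istio) that could run in CI or an admission webhook: it rejects any Telemetry resource in the mesh root namespace that enables access-logging providers without a workload selector.

```python
# Hypothetical CI/admission check: reject mesh-wide access-logging Telemetry.
# Rule (ours, per the postmortem): logging config in istio-system must not
# enable providers without scoping to a workload via spec.selector.
def validate_telemetry(manifest: dict) -> list:
    """Return a list of policy violations for a Telemetry manifest."""
    errors = []
    if manifest.get("kind") != "Telemetry":
        return errors
    ns = manifest.get("metadata", {}).get("namespace", "")
    spec = manifest.get("spec", {})
    has_selector = "selector" in spec
    enables_logging = any(
        entry.get("providers") for entry in spec.get("accessLogging", [])
    )
    if ns == "istio-system" and enables_logging and not has_selector:
        errors.append(
            "mesh-wide access logging is forbidden: scope it with "
            "spec.selector or move it to the workload's namespace"
        )
    return errors

bad = {
    "kind": "Telemetry",
    "metadata": {"name": "debug-all", "namespace": "istio-system"},
    "spec": {"accessLogging": [{"providers": [{"name": "envoy"}]}]},
}
print(validate_telemetry(bad))  # one violation
```

Wiring this into CI catches the "debug-all in istio-system" pattern before it reaches the cluster, which is cheaper than catching it on a node disk alert.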
The question this raises
What are the performance characteristics of Envoy at scale, how do you right-size sidecar resources, and what settings have the most impact on overhead?
Your cluster has 500 pods. After enabling mesh-wide access logging to debug an issue, node disk utilization goes from 30% to 95% in 4 hours. The cluster handles 200 RPS. What is the primary cause?
Lesson outline
Envoy adds < 1ms p99 latency and ~50MB memory at baseline
A well-tuned Envoy sidecar adds < 1ms of p99 latency to request processing, uses ~50MB of memory at baseline, and consumes about 50m CPU (5% of one core) at 1000 RPS. These numbers grow with configuration complexity: each additional VirtualService, DestinationRule, and AuthorizationPolicy adds to Envoy's memory footprint and evaluation overhead.
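These per-sidecar numbers compound across a fleet, which is why they belong in a platform budget. A back-of-envelope sketch using the baseline figures above (the per-pod numbers are the illustrative defaults from this lesson, not measurements from your cluster):

```python
# Fleet-wide sidecar overhead from the baseline figures above.
# Illustrative assumptions: ~50 MB memory and ~50m CPU per sidecar.
def sidecar_fleet_overhead(pods, mem_mb_per_sidecar=50, cpu_m_per_sidecar=50):
    """Return total (memory_gb, cpu_cores) consumed by sidecars alone."""
    total_mem_gb = pods * mem_mb_per_sidecar / 1024
    total_cpu_cores = pods * cpu_m_per_sidecar / 1000
    return total_mem_gb, total_cpu_cores

mem_gb, cores = sidecar_fleet_overhead(1000)
print(f"1000 pods: ~{mem_gb:.0f} GB memory, ~{cores:.0f} CPU cores for proxies alone")
# → 1000 pods: ~49 GB memory, ~50 CPU cores for proxies alone
```

At 1000 pods that is roughly 49 GB of memory and 50 cores of capacity that do no application work, which is the number to put in your cost allocation model.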
Key factors that drive Envoy overhead
How this concept changes your thinking
Default sidecar resource configuration
“64MB memory request, no limit -- at 1000 pods, 64GB reserved for sidecars; under debug logging, no limit means OOM kills the node”
“Profile each service: low-traffic services get 32MB/64MB (request/limit), high-traffic services get 128MB/256MB. Monitor with kubectl top pod --containers to tune.”
istiod at scale (500+ pods)
“Default istiod: 500m CPU / 2Gi memory -- at 500 services, memory exceeds 2Gi; config pushes take 20+ seconds”
“Scale istiod memory to 4-8Gi, set PILOT_PUSH_THROTTLE and PILOT_DEBOUNCE_AFTER. Run 2 replicas for HA.”
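PILOT_DEBOUNCE_AFTER and PILOT_DEBOUNCE_MAX trade push latency for push volume: rapid config changes are batched into one xDS push instead of triggering one push each. The sketch below is a hypothetical simulation of that batching logic (not istiod's actual implementation), just to show why a burst of changes produces far fewer pushes:

```python
# Sketch of debounce batching (hypothetical -- not istiod's real code).
# Events closer together than `after` seconds join the current batch; a
# batch is flushed once a quiet period passes or it reaches `max_wait` age.
def debounced_pushes(event_times, after=0.5, max_wait=5.0):
    """Given sorted config-change timestamps, return the number of xDS pushes."""
    pushes = 0
    batch_start = None
    last_event = None
    for t in event_times:
        if batch_start is None:
            batch_start, last_event = t, t
            continue
        # Flush when the quiet period elapsed or the batch hit its max age.
        if t - last_event >= after or t - batch_start >= max_wait:
            pushes += 1
            batch_start = t
        last_event = t
    if batch_start is not None:
        pushes += 1  # flush the final batch
    return pushes

# 10 config changes arriving 100ms apart collapse into a single push:
print(debounced_pushes([i * 0.1 for i in range(10)]))  # → 1
```

With 500+ services, a rolling deploy can generate hundreds of endpoint changes per second; debouncing is the difference between hundreds of full pushes and a handful.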
```yaml
# Per-workload sidecar resource tuning via annotations
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-svc
spec:
  template:
    metadata:
      annotations:
        # Small service (< 100 RPS)
        # sidecar.istio.io/proxyCPU: "50m"
        # sidecar.istio.io/proxyMemory: "64Mi"
        # sidecar.istio.io/proxyCPULimit: "200m"
        # sidecar.istio.io/proxyMemoryLimit: "128Mi"

        # High-traffic service (> 1000 RPS)
        sidecar.istio.io/proxyCPU: "200m"
        sidecar.istio.io/proxyMemory: "128Mi"
        sidecar.istio.io/proxyCPULimit: "1000m"
        sidecar.istio.io/proxyMemoryLimit: "512Mi"

        # Concurrency (Envoy worker threads -- default: 0 = num CPUs)
        # For services with high req/s, increase concurrency
        proxy.istio.io/config: |
          concurrency: 4
```
```yaml
# Mesh-level performance settings
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio
  namespace: istio-system
data:
  mesh: |
    # Access logging: NONE (production) or service-specific via Telemetry API
    accessLogFile: ""  # empty = disabled (default since Istio 1.14)
    accessLogEncoding: JSON

    # Stats: reduce cardinality by removing high-cardinality default labels
    # Consider disabling the request_path label if you have many unique paths
    defaultConfig:
      tracing:
        sampling: 1.0  # this is a percentage: 1% sampling -- NOT 100%
---
# istiod performance tuning for large clusters
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istiod
  namespace: istio-system
spec:
  template:
    spec:
      containers:
      - name: discovery
        env:
        - name: PILOT_PUSH_THROTTLE
          value: "100"    # max xDS pushes per second
        - name: PILOT_DEBOUNCE_AFTER
          value: "500ms"  # batch rapid changes -- wait 500ms before push
        - name: PILOT_DEBOUNCE_MAX
          value: "5s"     # max debounce wait
        resources:
          requests:
            cpu: 1000m
            memory: 4Gi
          limits:
            cpu: 4000m
            memory: 8Gi
```
```bash
# Profile sidecar resource usage for each service
kubectl top pod --all-namespaces --containers | grep istio-proxy | sort -k5 -rn

# Find sidecars with no resource limits (memory risk)
kubectl get pods --all-namespaces -o json | \
python3 -c "
import json, sys
data = json.load(sys.stdin)
for pod in data['items']:
    for c in pod['spec'].get('initContainers', []) + pod['spec']['containers']:
        if c['name'] == 'istio-proxy':
            limits = c.get('resources', {}).get('limits', {})
            if not limits.get('memory'):
                print(pod['metadata']['namespace'] + '/' + pod['metadata']['name'] + ': no memory limit on istio-proxy')
"

# Check istiod push performance
kubectl exec -n istio-system deploy/istiod -- \
  curl -s localhost:15014/metrics | grep "pilot_push_duration"

# Enable access logging for a single workload (for debugging)
kubectl apply -f - <<EOF
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: debug-logs
  namespace: production
spec:
  selector:
    matchLabels:
      app: payment-svc  # ONE service only
  accessLogging:
  - providers:
    - name: envoy
EOF
# Remove after debugging is done
```
Blast radius of performance misconfiguration
Mesh-wide DEBUG logging in production
```yaml
# Telemetry resource enabling DEBUG access logging mesh-wide
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: debug-all
  namespace: istio-system  # mesh-wide!
spec:
  accessLogging:
  - providers:
    - name: envoy
    filter:
      expression: 'response.code >= 200'  # matches all responses
# At 500 RPS cluster-wide: ~1M log lines/minute
# Disk fills within hours
# No memory limit = sidecar OOM when the log buffer grows
```

```yaml
# Production: access logging DISABLED by default
# Enable only for specific services when debugging
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: production-logging
  namespace: istio-system
spec:
  accessLogging:
  - disabled: true  # disable access logging globally
# When debugging: enable for ONE workload with an error filter
---
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: debug-service-x  # scoped to one service
  namespace: production
spec:
  selector:
    matchLabels:
      app: service-x
  accessLogging:
  - providers:
    - name: envoy
    filter:
      expression: 'response.code >= 500'  # errors only
# Remove this Telemetry resource after debugging
```

In production, access logging should be DISABLED by default and enabled selectively for debugging via namespace-scoped or workload-scoped Telemetry resources. Always filter to errors only (response.code >= 500) to reduce volume. Set a calendar reminder to remove debug logging resources -- they are routinely forgotten and cause disk issues days later.
| Config | Memory overhead | CPU overhead at 1k RPS | p99 latency added | Risk |
|---|---|---|---|---|
| Baseline (no policies) | ~50MB | ~50m | < 0.5ms | Low -- just proxy overhead |
| + Access logging (INFO) | +5MB | +20m | < 0.5ms | Low -- low volume structured logs |
| + Access logging (DEBUG) | +20MB+ | +100m | < 1ms | HIGH -- disk fill at high RPS |
| + ext_authz (OPA) | +10MB | +200m | +2-5ms per call | Medium -- OPA in request hot path |
| + Tracing (100% sample) | +5MB | +150m | < 1ms | Medium -- Jaeger storage and CPU |
| + 1000 routes in xDS | +30-50MB | +30m | < 0.5ms | Low but memory grows with fleet |
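The HIGH-risk row in the table is simple arithmetic, not a subtle failure mode. A sketch of the disk-fill timeline under stated assumptions (~500 bytes per JSON access log line, each request logged by both the client and server sidecar -- both numbers are illustrative):

```python
# Hours until a node disk fills from access logs alone.
# Illustrative assumptions: ~500 bytes per JSON log line, and each
# request is logged by 2 sidecars (client-side and server-side).
def hours_to_fill(rps, free_disk_gb, bytes_per_line=500, sidecars_per_request=2):
    """Return hours of headroom before access logs consume free_disk_gb."""
    bytes_per_sec = rps * sidecars_per_request * bytes_per_line
    return free_disk_gb * 1024**3 / bytes_per_sec / 3600

# 500 RPS against 20 GB of free node disk:
print(f"{hours_to_fill(500, 20):.1f} hours")  # → 11.9 hours
```

Under these assumptions a 20 GB disk buys you about half a day at 500 RPS with plain access logging, and far less at DEBUG verbosity where each request emits many lines; this matches the "disk fills within hours" timeline in the incident above.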
Envoy sidecar overhead
📖 What the exam expects
Envoy adds approximately 50MB memory and 50m CPU at baseline for 1000 RPS. p99 latency overhead is typically < 1ms.
Performance-focused platform interviews use questions like "What is the overhead of running Istio at 1000 pods?" and "How would you debug high memory usage in sidecars?" as entry points.
Strong answer: Knows specific overhead numbers, has right-sized sidecars per service profile, has a story about access log disk fill or tracing storage overflow.
Red flags: Cannot estimate sidecar overhead, thinks 100% trace sampling is fine, has not considered disk impact of access logging.