The Simplified Tech
© 2026 TheSimplifiedTech. All rights reserved.
Interactive Explainer

Resource Requests & Limits: CPU Throttle and OOM Kill

Resource requests guide scheduling. Resource limits enforce runtime caps. Misconfiguring them leads to CPU throttling under load, OOM kills that bypass graceful shutdown, and bin-packing failures that leave nodes half-empty.

Relevant for: Mid-level · Senior · Staff
Why this matters at your level
Mid-level

Understand requests (scheduling) vs limits (enforcement). Know CPU throttle vs OOM Kill behavior and exit codes.

Senior

Tune requests and limits based on profiled p95/p99 usage. Configure QoS classes. Use LimitRange for namespace defaults and ResourceQuota for namespace caps.

Staff

Design resource quota strategies for multi-tenant clusters. Use VPA for automated right-sizing. Understand CPU manager policies for latency-sensitive workloads requiring exclusive CPUs.

~3 min read
Incident Timeline: CPU Throttle Cascade -- JVM Services -- 2021
T-6mo

CPU limits set to 250m uniformly for all Java services to "limit noisy neighbors"

T+0

Market open; request rate 10x; GC frequency increases; throttle cascade begins

T+15m

p99 latency spikes to 800ms (SLO: 100ms); on-call paged

T+1h

Profiling identifies CPU throttle rate at 42% for all Java services

T+2h

CPU limits raised to 2000m; throttle rate drops below 1%; latency restored

42% — CPU throttle rate during GC bursts
85ms — Forced suspension per GC cycle
8x — Limit increase needed to restore SLO (250m to 2000m)

The question this raises

How does the Linux CFS CPU throttle mechanism work, and why does setting CPU limits too low punish bursty workloads like JVM GC even when the node has free cores?

Test your assumption first

A pod has cpu request=100m, limit=200m. The node has 2 free cores. Under high load, CPU usage hits 200m consistently. A JVM GC burst needs 500m for 50ms. What happens?

Lesson outline

What Resource Requests and Limits Solve

Scheduling vs Runtime Enforcement

Requests answer the scheduling question: "where can this pod fit?" Limits answer the runtime question: "how much resource can this pod consume before we intervene?" They serve different purposes and should be set independently based on profiled usage patterns -- not set equal by default.

Guaranteed QoS

requests == limits for both CPU and memory. The pod is the last to be evicted under node pressure. Use for databases, Kafka brokers, and any workload where eviction causes data loss or long recovery time.

Burstable QoS

requests < limits. The pod can burst above its request when the node has spare capacity, but is evicted before Guaranteed pods under pressure. Use for web servers and APIs that need burst headroom for traffic spikes.

No CPU limit (with request)

Set a CPU request for scheduling and omit the limit so GC can burst without throttling. Use for latency-sensitive JVM or Go services with bursty GC. Requires monitoring to catch runaway consumption.
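The three configurations above can be sketched as container resource blocks. A minimal illustration; the container names and values are hypothetical, chosen only to make each QoS class concrete:

```yaml
# Guaranteed QoS: requests == limits for both CPU and memory
- name: postgres                 # hypothetical stateful service
  resources:
    requests: { cpu: "1000m", memory: "4Gi" }
    limits:   { cpu: "1000m", memory: "4Gi" }

# Burstable QoS: requests < limits; can burst, evicted before Guaranteed
- name: web-api                  # hypothetical web service
  resources:
    requests: { cpu: "250m", memory: "256Mi" }
    limits:   { cpu: "1000m", memory: "512Mi" }

# No CPU limit: request for scheduling, memory limit only
- name: jvm-service              # hypothetical latency-sensitive JVM service
  resources:
    requests: { cpu: "500m", memory: "1Gi" }
    limits:   { memory: "2Gi" } # CPU deliberately uncapped so GC can burst
```

Note that omitting only the CPU limit still yields Burstable QoS; the difference is purely in CFS enforcement, not eviction class.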

The System View: CFS Throttle Mechanism

CPU Limit = 250m (0.25 cores)
CFS period   = 100ms
CFS quota    = 25ms (250m * 100ms / 1000m)

Timeline for a JVM GC burst (~50ms of CPU time, GC threads running at full speed):
[  0ms] Container starts GC; kernel tracks CPU usage
[ 25ms] CPU quota exhausted (25ms of CPU time used)
[ 25ms] CFS SUSPENDS container <- all threads frozen
[100ms] New 100ms period; quota refilled to 25ms
[100ms] Container resumes GC
[125ms] Remaining 25ms of CPU time consumed; GC completes
        (50ms of CPU work took 125ms elapsed; 75ms spent frozen)

Node has 2 free cores? Irrelevant.
CPU limit is enforced unconditionally by cgroup cpu.cfs_quota_us.
The kernel does not check node utilization when throttling.

Detect: container_cpu_cfs_throttled_seconds_total (Prometheus)
Alert when throttle_rate > 5% for latency-sensitive services

CFS enforces CPU limits per 100ms period; bursty workloads pay a suspension tax regardless of available node capacity
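The detection advice above can be encoded as an alerting rule. A sketch assuming the Prometheus Operator's PrometheusRule CRD and cAdvisor's per-period throttle counters; the rule name and label routing are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-throttle-alerts          # hypothetical name
spec:
  groups:
  - name: cpu-throttling
    rules:
    - alert: HighCPUThrottleRate
      # Fraction of CFS periods in which the container was throttled
      expr: |
        sum by (namespace, pod) (rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m]))
          /
        sum by (namespace, pod) (rate(container_cpu_cfs_periods_total{container!=""}[5m]))
          > 0.05
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "CPU throttle rate above 5% for {{ $labels.pod }}"
```

The periods-based ratio is easier to threshold than container_cpu_cfs_throttled_seconds_total alone, which needs to be compared against elapsed time to become a rate.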

Right-Sizing Resource Configuration

Situation: Java service with 250m CPU limit and frequent GC
Before: “42% CPU throttle rate; 85ms forced suspension per GC; p99 latency 800ms during market open”
After: “CPU limit raised to 2000m (matching GC burst need); throttle rate below 1%; p99 latency 12ms”

Situation: Memory limit set below observed peak
Before: “Pod OOM killed (exit code 137) during traffic spike; no graceful shutdown; in-flight requests dropped”
After: “Memory limit set to p99 usage + 20% headroom; OOMKilled events drop to zero; limit based on profiling, not guessing”

How Resource Enforcement Works

From resource spec to cgroup enforcement

1. Pod spec declares cpu request=100m limit=500m, memory request=256Mi limit=512Mi
2. Scheduler sums all pod requests on each node; places the pod on a node with sufficient allocatable capacity
3. kubelet creates a cgroup for the container: cpu.cfs_quota_us = 50000 (500m * 100ms)
4. Container runs; the kernel tracks CPU usage against the quota; throttles when the quota is exhausted within a period
5. If container memory exceeds its limit: the kernel OOM killer sends SIGKILL -- no SIGTERM, no grace period
6. kubelet reports OOMKilled; the pod restarts per restartPolicy; kubectl describe shows OOMKilled
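Step 5 is worth internalizing: every piece of graceful-shutdown machinery configured on the pod is simply skipped on OOM. A sketch of the config that SIGKILL bypasses; the drain script path is hypothetical:

```yaml
spec:
  terminationGracePeriodSeconds: 30   # honored on eviction/deletion (SIGTERM path)
  containers:
  - name: java-api
    lifecycle:
      preStop:
        exec:
          # hypothetical drain script; runs before SIGTERM on normal termination
          command: ["/bin/sh", "-c", "/app/drain-connections.sh"]
# On an OOM kill the kernel sends SIGKILL directly to the container process:
# neither the preStop hook nor the 30s grace period runs.
```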

resource-config.yaml

spec:
  containers:
  - name: java-api
    resources:
      requests:
        cpu: "500m"       # scheduling: reserve 0.5 core
        memory: "1Gi"     # scheduling: reserve 1Gi
      limits:
        # CPU limit 4x request: JVM GC needs burst headroom; prevents throttle cascade
        cpu: "2000m"      # allows 2-core GC bursts without throttle
        # Memory limit: p99 peak + 20%; OOM kill has no grace period -- be generous
        memory: "2Gi"     # OOM kill if exceeded; 2x request = headroom
  - name: sidecar
    resources:
      requests:
        cpu: "50m"
        memory: "64Mi"
      limits:
        cpu: "100m"
        memory: "128Mi"

What Breaks in Production: Blast Radius

Resource misconfiguration failure modes

  • CPU throttle -- hidden latency — Container exhausts CPU quota; suspended for rest of 100ms window. Invisible in app logs. Only visible via container_cpu_cfs_throttled_seconds_total metric. Fix: raise CPU limit or remove it for latency-critical services.
  • OOM kill bypasses graceful shutdown — SIGKILL from OOM killer -- no preStop hook, no SIGTERM, no 30s grace period. In-flight requests dropped. Connection pools left inconsistent. Set memory limit with realistic headroom based on load testing at peak.
  • No requests set (BestEffort QoS) — Pod evicted first under node memory pressure with no warning. Stateful workloads evicted mid-write. Always set requests for any pod with persistent state or active connections.
  • Requests too high (over-provisioning) — Node appears full for scheduling but CPU/memory sit mostly idle. Wasted cluster capacity. Use VPA in recommendation mode to get data-driven suggestions. Kubernetes bin-packing only works when requests reflect actual usage.
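The over-provisioning point above is where VPA's recommendation mode helps: it observes actual usage and publishes suggested requests without ever restarting pods. A sketch assuming the VPA CRD is installed in the cluster; the target workload name is hypothetical:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: java-api-vpa               # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: java-api                 # hypothetical workload
  updatePolicy:
    updateMode: "Off"              # recommend only; never evict or restart pods
# Read the recommendations with:
#   kubectl describe vpa java-api-vpa   (see status.recommendation)
```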

CPU limit equal to request throttles bursty JVM workloads

Bug
resources:
  requests:
    cpu: "250m"
  limits:
    cpu: "250m"   # limit == request: no burst allowed
    memory: "512Mi"
# Result: JVM GC needs 1-2 core burst; throttled to 250m
# 85ms forced suspension per GC cycle; p99 latency spikes
Fix
resources:
  requests:
    cpu: "250m"      # scheduling reservation
    memory: "512Mi"
  limits:
    # cpu limit omitted -- allow GC to burst unrestricted
    # OR: cpu: "2000m" for explicit high ceiling
    memory: "768Mi"  # 50% headroom above request
# Monitor: container_cpu_cfs_throttled_seconds_total

For JVM services, CPU limit == request is the worst configuration. GC bursts are throttled unconditionally. Either remove the CPU limit (monitor for runaway) or set it 4-8x higher than the request to allow burst headroom.

Decision Guide: Setting Requests and Limits

Does the workload have bursty CPU (JVM GC, Go runtime, batch processing)?
  Yes -> Set the CPU limit 4-8x the request, or remove it; monitor throttle rate in Prometheus
  No  -> CPU limit = 2x request is safe for steady-state workloads (proxies, web servers)

Is this a stateful workload (database, message broker)?
  Yes -> Guaranteed QoS: requests == limits for memory; prevents eviction under node pressure
  No  -> Burstable is fine: requests = p50 usage, limits = p99 + headroom

Is this a shared multi-tenant namespace?
  Yes -> Use LimitRange to enforce default requests/limits; use ResourceQuota to cap total namespace consumption
  No  -> Per-pod tuning is sufficient; VPA in recommendation mode helps right-size without manual profiling
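The multi-tenant branch can be sketched as two namespace-scoped objects. The names, namespace, and values are illustrative, not prescriptive:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: team-defaults              # hypothetical name
  namespace: team-a                # hypothetical namespace
spec:
  limits:
  - type: Container
    defaultRequest:                # applied when a container sets no requests
      cpu: "100m"
      memory: "128Mi"
    default:                       # applied when a container sets no limits
      cpu: "500m"
      memory: "512Mi"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota                 # hypothetical name
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"             # total CPU requests across the namespace
    requests.memory: "64Gi"
    limits.memory: "96Gi"
```

LimitRange also prevents BestEffort pods from slipping into the namespace, since every container gets default requests injected at admission.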

Cost and Complexity: QoS Class Comparison

QoS Class  | How to achieve                      | Eviction order                   | Best for                              | Risk
Guaranteed | requests == limits (CPU + memory)   | Last evicted                     | Databases, critical stateful services | Over-provisioning wastes cluster capacity
Burstable  | requests < limits (or only one set) | Middle -- by usage/request ratio | APIs, web servers with traffic bursts | Evicted under node memory pressure
BestEffort | No requests or limits               | First evicted (no warning)       | Batch jobs on idle cluster capacity   | Evicted instantly under any pressure

Exam Answer vs. Production Reality


Request vs Limit

📖 What the exam expects

Request: amount of resource guaranteed to the container (used for scheduling). Limit: maximum the container can use (enforced at runtime by cgroups).


How this might come up in interviews

Performance debugging questions, capacity planning, and multi-tenant cluster design.

Common questions:

  • What is the difference between CPU requests and CPU limits?
  • What happens when a container exceeds its memory limit vs CPU limit?
  • What are Kubernetes QoS classes and how do they affect eviction?
  • Why might you remove CPU limits from a production service?

Strong answer: Mentions VPA for automated right-sizing, LimitRange for namespace defaults, and removing CPU limits for latency-sensitive services that need to burst without throttle.

Red flags: Setting CPU limits == requests uniformly for all services, or not knowing what BestEffort QoS means for eviction behavior.

Related concepts

Explore topics that connect to this one.

  • Kubernetes Autoscaling: HPA, VPA, and Cluster Autoscaler
  • Kubernetes Pods: The Atomic Unit
  • Linux cgroups: Resource Governance for Every Container

