© 2026 TheSimplifiedTech. All rights reserved.

Interactive Explainer

kube-scheduler & controller-manager: The Decision Engines

The kube-scheduler decides WHERE a pod runs. The kube-controller-manager decides WHAT should exist. Together they form the decision layer of the Kubernetes control plane. A scheduler bug leaves pods permanently Pending. A controller memory leak stops all reconciliation. Understanding both is essential for diagnosing "why is my pod not running?" and "why are my deployments not scaling?".

Relevant for: Mid-level, Senior, Staff
Why this matters at your level
Junior

Know that the scheduler assigns pods to nodes. Know that the controller manager runs reconciliation loops for Deployments, ReplicaSets, and Nodes. Know the difference between a pod being Pending (not yet scheduled) and Unschedulable (scheduled but no node meets constraints).

Mid-level

Debug Pending pods by reading scheduler events (kubectl describe pod, look for "FailedScheduling"). Know the common scheduling failure reasons: insufficient resources, node affinity mismatch, taint tolerance missing, topology constraint violation. Know which controllers are in controller-manager.

Senior

Design node affinity, pod anti-affinity, and topology spread constraints for high-availability workloads. Tune scheduler parameters (percentageOfNodesToScore) for large clusters. Monitor controller-manager queue depth and reconciliation latency as health signals.

Staff

Design multi-cluster scheduling strategies. Evaluate scheduler plugins for custom placement policies. Define operational runbooks for scheduler failure scenarios. Design controller-manager HA -- understand why only one instance is active at a time despite multiple replicas.


~5 min read
Incident timeline: Control Plane Failure — Scheduler Resource Leak — 200 Pods Pending for 45 Minutes
T+0

kube-scheduler memory leak begins -- goroutine accumulation from PodTopologySpread plugin

T+8h

Scheduler memory reaches 4GB -- OOM killed by kubelet

T+8h01m

Backup scheduler instance begins leader election (default: 15s lease duration + retry)

T+8h02m

New scheduler leader elected -- begins processing Pending pod queue

T+8h03m

Scheduler works through 200 queued Pending pods -- but scheduling is sequential per cycle

T+8h45m

All pods scheduled -- but 3 batch windows already missed their SLA


The question this raises

The scheduler is stateless -- restarting it should be instant. Why did 200 pods stay Pending for 45 minutes after the scheduler restarted?
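A rough drain-time calculation shows why. The numbers below are assumed for illustration (the incident's real per-pod scheduling latency is not given), but they make the linear relationship between queue depth and recovery time concrete:

```python
# Estimate how long a restarted scheduler takes to drain a Pending backlog.
# Assumed numbers for illustration: real per-pod latency depends on cluster
# size and which plugins (e.g. PodTopologySpread) run in filter/score.

def drain_minutes(pending_pods: int, seconds_per_pod: float) -> float:
    """Scheduling is one pod per cycle, so drain time is linear in queue depth."""
    return pending_pods * seconds_per_pod / 60.0

# 200 pods at ~13s per scheduling cycle (slow: large cluster, heavy plugins)
slow = drain_minutes(200, 13.0)    # ~43 minutes -- matches the timeline above
# The same backlog at 50ms per pod (healthy scheduler) drains in seconds
fast = drain_minutes(200, 0.05)

print(f"slow: {slow:.1f} min, fast: {fast:.2f} min")
```

Restarting the process is fast; working through the accumulated queue one pod per cycle is not. That is why queue-depth alerts matter even when the scheduler itself is healthy.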

Test your assumption first

A pod has been Pending for 20 minutes. kubectl describe pod shows event: "0/5 nodes are available: 5 node(s) didn't match pod topology spread constraints." What is the most likely fix?

Lesson outline

What Problem the Scheduler and Controllers Solve

Scheduling: finding the best node. Control loops: maintaining desired state.

Without the scheduler, every pod would need a manual node assignment. Without controllers, every component failure would need manual detection and replacement. The scheduler automates placement decisions based on resources and constraints. The controller manager automates the endless loop of "desired state differs from actual state -- fix it". Together they make Kubernetes self-healing.
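The reconciliation idea fits in a few lines. This is a toy sketch, not the real controller-manager code (real controllers react to API server watch events rather than being called directly):

```python
# Toy reconciliation loop: converge actual replica count toward desired.
# Real controllers react to watch events; this sketch is called explicitly.

def reconcile(desired: int, actual: list) -> list:
    """One reconciliation pass: create or delete pods to match desired count."""
    actual = list(actual)
    while len(actual) < desired:
        actual.append(f"pod-{len(actual)}")   # create missing replica
    while len(actual) > desired:
        actual.pop()                           # delete surplus replica
    return actual

state = ["pod-0"]                # two replicas were lost to a node failure
state = reconcile(3, state)      # controller restores desired state
print(state)                     # ['pod-0', 'pod-1', 'pod-2']
```

Every controller in the controller-manager is some variation of this loop: observe actual state, compare to desired state, issue create/update/delete calls to close the gap.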

Filter phase (scheduling)

For each unscheduled pod, the scheduler filters all nodes to find feasible nodes: resource requests fit available capacity, no unmatched taints, node/pod affinity satisfied, topology spread constraints met. Nodes failing any filter are eliminated.

Score phase (scheduling)

Feasible nodes are scored by multiple scoring plugins: LeastRequestedPriority (prefer less loaded nodes), NodeAffinityPriority (prefer nodes matching soft affinity), PodTopologySpreadPriority (prefer balanced spread). The highest-scoring node wins.
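The two phases can be illustrated with a toy model. The node and pod data below are hypothetical, and the real scheduler runs pluggable filter/score plugins with normalization and weights rather than a single comparison:

```python
# Minimal filter-then-score sketch (illustrative only).

nodes = [
    {"name": "node-a", "free_cpu": 2.0, "taints": set()},
    {"name": "node-b", "free_cpu": 0.2, "taints": set()},
    {"name": "node-c", "free_cpu": 4.0, "taints": {"gpu-only"}},
]
pod = {"cpu_request": 0.5, "tolerations": set()}

# FILTER: eliminate nodes that cannot run the pod at all
feasible = [n for n in nodes
            if n["free_cpu"] >= pod["cpu_request"]      # resources fit?
            and n["taints"] <= pod["tolerations"]]      # all taints tolerated?

# SCORE: prefer the least-loaded feasible node (LeastRequested-style)
best = max(feasible, key=lambda n: n["free_cpu"])
print(best["name"])  # node-a: node-b fails resources, node-c fails taints
```

Note the key property: a filter failure is absolute (the node is gone from consideration), while a score difference is only a preference among survivors.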

Deployment controller

Watches Deployment objects and creates/updates ReplicaSets. Manages the rolling update strategy (maxUnavailable, maxSurge), ensuring the correct ReplicaSet is scaled up and old ones scaled down. One of the most critical reconciliation loops.

Node controller

Detects failed nodes by monitoring kubelet heartbeats. After node-monitor-grace-period (default 40s) with no heartbeat it marks the node NotReady; after pod-eviction-timeout (default 5m) it evicts pods from the node so they can be rescheduled elsewhere.
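Those two timeouts compose into a worst-case detection-to-eviction window. A small sketch using the quoted defaults (newer clusters implement the eviction delay via taint-based eviction with a 300s default toleration, which has the same effective timing):

```python
# Node-failure timeline sketch using the defaults quoted above.

NODE_MONITOR_GRACE = 40    # seconds without heartbeat -> node marked NotReady
EVICTION_DELAY = 300       # seconds NotReady -> pods evicted (5 minutes)

def seconds_until_rescheduled(last_heartbeat_age: int) -> int:
    """Worst-case seconds from 'now' until pods are evicted from a dead node."""
    remaining = (NODE_MONITOR_GRACE + EVICTION_DELAY) - last_heartbeat_age
    return max(0, remaining)

# A node that just went silent: full 340s before pods are evicted
print(seconds_until_rescheduled(0))    # 340
# A node silent for 2 minutes: already NotReady, 220s left to eviction
print(seconds_until_rescheduled(120))  # 220
```

This is why workloads on a dead node can appear "stuck" for over five minutes even when everything in the control plane is healthy: the delay is deliberate, to avoid evicting pods on transient network blips.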

The System View: Scheduling and Reconciliation Flow

SCHEDULING DECISION PATH:
kubectl apply -> API server -> etcd: Pod created with nodeName=""
                                          |
                          kube-scheduler WATCHES for unscheduled pods
                                          |
                    For each unscheduled pod:
                    +-------------------------------------+
                    | FILTER PHASE                        |
                    | Check each node:                    |
                    | - Resources fit? (CPU/mem requests) |
                    | - No unmatched Taints?              |
                    | - nodeAffinity satisfied?           |
                    | - podAntiAffinity satisfied?        |
                    | - TopologySpread max skew met?      |
                    | -> Set of FEASIBLE nodes            |
                    +-------------------------------------+
                    | SCORE PHASE                         |
                    | Score each feasible node:           |
                    | - LeastRequested (prefer empty)     |
                    | - NodeAffinity soft match bonus     |
                    | - TopologySpread balance score      |
                    | -> Highest score node selected      |
                    +-------------------------------------+
                    | BIND                                |
                    | Write pod.spec.nodeName = node-X   |
                    | kubelet on node-X sees the pod     |
                    | and starts the container           |
                    +-------------------------------------+

CONTROLLER RECONCILIATION PATH:
API server watch event (ADDED/MODIFIED/DELETED)
        |
        v
Controller (Deployment, ReplicaSet, Node, etc.)
        |-- Compute diff: desired vs actual
        |-- Create/update/delete objects via API server
        |-- Watch for next change event

The scheduler runs filter then score for each pod independently. A topology constraint failure in the filter phase means 0 feasible nodes -- the pod stays Pending. Read the filter failure reason in kubectl describe pod events to understand which constraint failed.

Scheduling failure diagnosis

Situation: A pod is stuck in Pending with 5 nodes in the cluster.

Before: “The cluster must be out of resources. Scale up nodes to fix the scheduling problem.”

After: “Resource shortage is only ONE of many reasons for Pending pods. kubectl describe pod events list the exact filter failure: it might be a missing toleration for a node taint, an unsatisfied node affinity, a topology spread constraint requiring more zones, or a PVC that has not bound yet. Read the events FIRST -- scaling nodes solves resource issues but not constraint issues.”
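When triaging many Pending pods, the event text itself can be split into its per-reason parts. A hypothetical helper (parse_failed_scheduling is not a kubectl feature, just a sketch over the typical "0/N nodes are available: ..." message format):

```python
# Hypothetical helper: split a FailedScheduling event message into its
# individual filter-failure reasons so each constraint can be triaged.

def parse_failed_scheduling(message: str) -> list:
    """Return the individual filter-failure reasons from the event text."""
    _, _, reasons = message.partition(":")          # drop the "0/N nodes" prefix
    return [r.strip().rstrip(".") for r in reasons.split(",") if r.strip()]

event = ("0/5 nodes are available: 2 Insufficient memory, "
         "2 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, "
         "1 node(s) didn't match pod topology spread constraints.")
for reason in parse_failed_scheduling(event):
    print(reason)
```

Each reason maps to a different fix (requests, tolerations, spread constraints), which is exactly why reading the event beats guessing.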

How It Actually Works: From Pending to Running

The full scheduling and reconciliation lifecycle

1. Pod enters the scheduling queue -- the scheduler maintains a priority queue of unscheduled pods. Pods with higher PriorityClass values are scheduled first. New pods are added to the queue when the scheduler's watch detects nodeName=="" on a pod.

2. Filter phase runs against the nodes -- the scheduler evaluates nodes against every filter plugin for this pod. In large clusters, percentageOfNodesToScore caps how many nodes are considered for performance (the default is adaptive: around 50%, shrinking as the cluster grows). If all nodes fail filtering, a "0/N nodes available: X reason, Y reason..." event is created on the pod.

3. Score phase ranks feasible nodes -- each feasible node receives a score from each scoring plugin. Scores are normalized and weighted. The node with the highest total score is selected. Ties are broken randomly.

4. Scheduler binds the pod -- the scheduler writes pod.spec.nodeName = "selected-node". The kubelet on that node sees the pod via its watch and begins the container creation sequence.

5. Controller reconciliation: Deployment -> ReplicaSet -> Pod -- when you apply a Deployment, the Deployment controller creates a ReplicaSet. The ReplicaSet controller creates Pods. Each Pod gets scheduled. The Deployment controller monitors rollout progress (ready pods vs desired) and manages the rolling update strategy.
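Step 1's priority ordering can be sketched with a heap. This is illustrative only -- the real active queue also considers enqueue timestamps and backoff, and the PriorityClass values below are examples:

```python
# Sketch of the active queue: higher PriorityClass value dequeues first.
import heapq

def pop_order(pods):
    """Return pod names in the order the scheduler would dequeue them."""
    # heapq is a min-heap, so push negative priority for highest-first;
    # the sequence number keeps FIFO order among equal priorities.
    heap = [(-prio, seq, name) for seq, (name, prio) in enumerate(pods)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

pods = [("batch-job", 0), ("system-critical", 2000001000), ("web-api", 1000)]
print(pop_order(pods))  # ['system-critical', 'web-api', 'batch-job']
```

This ordering is also what makes priority-class misconfiguration dangerous (see the blast radius section): a workload given a too-high value jumps ahead of, and can preempt, infrastructure pods.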


debug-pending-pods.sh
# Find all Pending pods across the cluster
$ kubectl get pods -A --field-selector=status.phase=Pending

# Get the scheduling failure reason (ALWAYS check events first).
# The Events section in kubectl describe pod contains the exact scheduling
# failure reason -- the most important diagnostic for Pending pods.
$ kubectl describe pod <pending-pod-name>
# Look for the Events section at the bottom:
# Events:
#   Warning  FailedScheduling  <time>  default-scheduler
#   0/5 nodes available:
#   1 node(s) didn't match pod affinity rules,
#   2 node(s) had untolerated taint {node-role.kubernetes.io/control-plane:NoSchedule},
#   2 node(s) didn't have enough memory.

# Common failure reasons and fixes:
# "Insufficient cpu/memory"          -> add nodes or reduce requests
# "didn't match node affinity"       -> check nodeAffinity in pod spec
# "had untolerated taint"            -> add toleration for the taint
# "didn't match pod topology spread" -> add nodes in missing zones,
#                                       OR change whenUnsatisfiable: ScheduleAnyway
# Note: topology spread constraints with whenUnsatisfiable: DoNotSchedule
# keep pods Pending indefinitely if the constraint cannot be met.

# Monitor scheduling queue depth (alert on Pending > 5 minutes).
# scheduler_pending_pods is exposed on kube-scheduler's own metrics endpoint
# (secure port 10259), not on the API server's /metrics -- scrape or
# port-forward the scheduler to read it.
# scheduler_pending_pods{queue="active"}        = pods being scheduled
# scheduler_pending_pods{queue="unschedulable"} = pods that cannot be scheduled
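The "Pending > 5 minutes" alert logic looks like this in sketch form. Field names below mirror pod status fields, but treat it as pseudocode for your monitoring stack rather than a drop-in check:

```python
# Sketch of the alert rule: flag pods that have been Pending past a threshold.
from datetime import datetime, timedelta, timezone

def stuck_pending(pods: list, now: datetime,
                  threshold: timedelta = timedelta(minutes=5)) -> list:
    """Return names of pods that have been Pending longer than threshold."""
    return [p["name"] for p in pods
            if p["phase"] == "Pending" and now - p["created"] > threshold]

now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
pods = [
    {"name": "ok-pod",    "phase": "Running", "created": now - timedelta(hours=1)},
    {"name": "new-pod",   "phase": "Pending", "created": now - timedelta(minutes=1)},
    {"name": "stuck-pod", "phase": "Pending", "created": now - timedelta(minutes=20)},
]
print(stuck_pending(pods, now))  # ['stuck-pod']
```

The threshold matters: brief Pending is normal during scale-ups and rollouts, so alerting on any Pending pod would be pure noise.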

What Breaks in Production: Blast Radius

Blast radius: scheduler and controller failures

  • Scheduler OOM kill during high pod churn — All new pod scheduling pauses during leader re-election (~15-30s). Accumulated Pending pods scheduled sequentially on recovery -- can cause 10-45 minute backlogs for batch workloads.
  • PodTopologySpread with DoNotSchedule when a zone runs out of schedulable capacity — Pods stuck Pending indefinitely. No error is thrown -- pods just never schedule. Alert on kube_pod_status_phase{phase="Pending"} > 5 minutes.
  • Controller-manager leader loss during node failures — Node controller stops detecting NotReady nodes during leader re-election. Pod eviction from failed nodes is delayed -- extends the time workloads are stuck on dead nodes.
  • Deployment controller crash during rolling update — Rolling update pauses mid-way. Old and new ReplicaSets both have replicas. Traffic goes to mixed versions until controller recovers and continues the rollout.
  • Priority class misconfiguration — High-priority pods preempt critical infrastructure pods (CoreDNS, metrics-server). Cluster destabilizes as preemption cascades. Use PodDisruptionBudgets to protect critical system pods from preemption.
  • Scheduler plugin bug with memory leak — Scheduler memory grows unbounded over hours until the OOM kill, then scheduling pauses during leader election -- the pattern in the incident timeline above.
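The leader re-election pause in the first bullet follows from the client-go leader-election defaults quoted in the timeline (lease duration 15s, retry period 2s). A rough upper-bound estimate, not an exact model of the election protocol:

```python
# Worst-case failover estimate from the leader-election defaults above.

LEASE_DURATION = 15   # seconds a dead leader's lease remains valid
RETRY_PERIOD = 2      # seconds between acquisition attempts by standbys

def worst_case_failover() -> int:
    """A standby may check just before lease expiry, then wait one retry period."""
    return LEASE_DURATION + RETRY_PERIOD

print(worst_case_failover())  # 17 -- within the ~15-30s pause cited above
```

The same mechanism explains why only one controller-manager instance is active at a time: the standbys exist purely to win this election when the leader dies, not to share reconciliation work.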

Topology spread constraint too strict -- pods Pending indefinitely

Bug
# Topology spread across zones with a hard constraint (DoNotSchedule)
# maxSkew=1 means: difference between most and least loaded zone <= 1
# If one zone runs out of schedulable capacity (full or tainted nodes),
# the remaining zone can only get 1 pod ahead -- further replicas are blocked
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 6
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule  # BLOCKS scheduling
        labelSelector:
          matchLabels:
            app: my-api
      containers:
      - name: api
        image: my-api:v1
# Result: if one zone cannot accept pods, the remaining replicas stay Pending FOREVER
Fix
# Option 1: Allow scheduling even if spread is not perfect
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway    # Schedule even if constraint unmet
  labelSelector:
    matchLabels:
      app: my-api

# Option 2: Use pod anti-affinity for zone spread (preferred)
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:  # Soft: prefer but don't require
    - weight: 100
      podAffinityTerm:
        topologyKey: topology.kubernetes.io/zone
        labelSelector:
          matchLabels:
            app: my-api

# Note: preferredDuring... = soft constraint (won't block scheduling)
#       requiredDuring...  = hard constraint (blocks if unsatisfiable)

topologySpreadConstraints with whenUnsatisfiable:DoNotSchedule creates a hard constraint that blocks a pod whenever placing it would push the topology skew past maxSkew. In a 2-zone cluster where one zone has no free capacity, the reachable zone can only get maxSkew pods ahead -- every replica beyond that stays Pending forever. Use ScheduleAnyway for non-critical spread requirements, or preferredDuringSchedulingIgnoredDuringExecution affinity for soft zone preferences.
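A small simulation makes the failure mode concrete: 2 zones, zone-b with capacity for only 1 pod, maxSkew=1 with DoNotSchedule. This is a toy model of the skew check, not scheduler code, and the zone names and capacities are invented:

```python
# Toy model: place replicas under a DoNotSchedule topology spread constraint.

def place(replicas: int, capacity: dict, max_skew: int = 1):
    counts = {z: 0 for z in capacity}
    pending = 0
    for _ in range(replicas):
        feasible = []
        for z in counts:
            if counts[z] >= capacity[z]:
                continue                      # zone has no schedulable capacity
            trial = dict(counts)
            trial[z] += 1
            if max(trial.values()) - min(trial.values()) <= max_skew:
                feasible.append(z)            # placement keeps skew acceptable
        if not feasible:
            pending += 1                      # DoNotSchedule: pod stays Pending
            continue
        counts[min(feasible, key=lambda z: counts[z])] += 1
    return counts, pending

counts, pending = place(6, {"zone-a": 10, "zone-b": 1})
print(counts, pending)  # {'zone-a': 2, 'zone-b': 1} 3
```

With both zones healthy the same 6 replicas all schedule (3 per zone, 0 pending); the constraint only bites when capacity is uneven -- which is exactly when it does so silently.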

Decision Guide: Scheduling Constraints

Must this workload NEVER run two replicas on the same node or zone (high-availability requirement)?
Yes: Use requiredDuringSchedulingIgnoredDuringExecution pod anti-affinity with topologyKey=kubernetes.io/hostname (per-node) or topology.kubernetes.io/zone (per-zone). This is a hard constraint -- if no node satisfies it, pods stay Pending.
No: Continue -- use soft constraints or topology spread.
Should this workload be spread across zones for resilience but can tolerate imperfect spread?
Yes: Use topologySpreadConstraints with whenUnsatisfiable: ScheduleAnyway or preferredDuringSchedulingIgnoredDuringExecution pod anti-affinity with topologyKey=topology.kubernetes.io/zone.
No: Continue.
Does this workload need specific hardware (GPU nodes, high-memory nodes)?
Yes: Use node affinity (requiredDuringScheduling if mandatory, preferredDuringScheduling if preferred) or node taints with tolerations to direct pods to specific node types.
No: Use standard scheduling with only resource requests -- the scheduler will balance automatically.

Cost and Complexity: Scheduling Constraint Trade-offs

Constraint Type                         | Effect if Unsatisfied                   | Use Case                                  | Risk
requiredDuringScheduling anti-affinity  | Pod stays Pending forever               | True HA -- must separate replicas         | High -- can block all scheduling
preferredDuringScheduling anti-affinity | Pod schedules on any node               | Prefer separation -- tolerate co-location | Low -- never blocks scheduling
topologySpread DoNotSchedule            | Pod stays Pending if spread impossible  | Strict zone distribution                  | High -- blocks on zone count mismatch
topologySpread ScheduleAnyway           | Pod schedules but spread penalised      | Prefer zone distribution                  | Low -- never blocks scheduling
nodeAffinity required                   | Pod stays Pending if no matching nodes  | GPU/specialized hardware                  | Medium -- ensure matching nodes exist
nodeAffinity preferred                  | Pod schedules on any node               | Prefer specific hardware                  | Low -- never blocks scheduling

Exam Answer vs. Production Reality


How the scheduler makes placement decisions

📖 What the exam expects

The scheduler watches for unscheduled pods (nodeName is empty). For each pod, it runs filtering (which nodes can run this pod? -- resource availability, taints, affinity, topology constraints) then scoring (which of the feasible nodes is the best? -- spreading, resource balance, custom scores). It assigns the highest-scoring node by writing pod.spec.nodeName.


How this might come up in interviews

Appears as "why is my pod Pending?" in troubleshooting interviews, and "how does Kubernetes scheduling work?" in architecture discussions. Controller-manager questions often appear as "what is a reconciliation loop?" or "how does Kubernetes detect and replace failed nodes?".

Common questions:

  • Why is a pod stuck in Pending and how do you diagnose the cause?
  • What is the difference between pod anti-affinity and topology spread constraints?
  • How does Kubernetes detect that a node has failed and when does it evict pods?
  • Why does controller-manager use leader election even in an HA configuration?
  • How would you configure pod scheduling to ensure high availability across zones?

Strong answer: Mentioning kube_pod_status_phase{phase="Pending"} as the alert metric for scheduling backlogs. Knowing that the Node controller has a --pod-eviction-timeout (default 5 minutes) before evicting pods from NotReady nodes. Understanding that scheduler plugins can be extended for custom placement logic.

Red flags: Thinking the scheduler can override resource limits or node taints. Not knowing that kubectl describe pod events contain the scheduling failure reason. Believing multiple controller-manager replicas all process reconciliation simultaneously.

Related concepts

Explore topics that connect to this one.

  • kube-apiserver Internals: The Single Control Point for All Kubernetes Operations
  • Resource Requests & Limits: CPU Throttle and OOM Kill
  • Kubernetes Autoscaling: HPA, VPA, and Cluster Autoscaler
