The kube-scheduler decides WHERE a pod runs. The kube-controller-manager decides WHAT should exist. Together they form the decision layer of the Kubernetes control plane. A scheduler bug leaves pods permanently Pending. A controller memory leak stops all reconciliation. Understanding both is essential for diagnosing "why is my pod not running?" and "why are my deployments not scaling?".
Know that the scheduler assigns pods to nodes. Know that the controller manager runs reconciliation loops for Deployments, ReplicaSets, and Nodes. Know the difference between a pod that is Pending because it has not yet been through a scheduling cycle, and one that is unschedulable because no node satisfies its constraints -- both show phase Pending.
Debug Pending pods by reading scheduler events (kubectl describe pod, look for "FailedScheduling"). Know the common scheduling failure reasons: insufficient resources, node affinity mismatch, missing taint toleration, topology constraint violation. Know which controllers run inside the controller-manager.
Design node affinity, pod anti-affinity, and topology spread constraints for high-availability workloads. Tune scheduler parameters (percentageOfNodesToScore) for large clusters. Monitor controller-manager queue depth and reconciliation latency as health signals.
Design multi-cluster scheduling strategies. Evaluate scheduler plugins for custom placement policies. Define operational runbooks for scheduler failure scenarios. Design controller-manager HA -- understand why only one instance is active at a time despite multiple replicas.
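The reason only one controller-manager (or scheduler) replica is ever active is lease-based leader election: each replica tries to acquire and renew a lease, and standbys take over only after the lease expires. A minimal sketch of that mechanism in Python, with hypothetical names -- the real implementation is client-go's leaderelection package working against a coordination.k8s.io Lease object:

```python
LEASE_DURATION = 15  # seconds a lease stays valid without renewal (the default quoted below)

class Lease:
    """Toy stand-in for a Kubernetes Lease object."""
    def __init__(self):
        self.holder = None
        self.renew_time = 0.0

    def try_acquire(self, candidate: str, now: float) -> bool:
        # A candidate may take the lease only if it is unheld or expired.
        if self.holder is None or now - self.renew_time > LEASE_DURATION:
            self.holder = candidate
            self.renew_time = now
            return True
        if self.holder == candidate:      # current leader renews
            self.renew_time = now
            return True
        return False                      # someone else holds a live lease

lease = Lease()
assert lease.try_acquire("replica-a", now=0.0)       # first replica becomes leader
assert not lease.try_acquire("replica-b", now=5.0)   # standby stays passive
# replica-a crashes and stops renewing; after 15s the lease expires:
assert lease.try_acquire("replica-b", now=16.0)      # standby takes over
```

This is why running three controller-manager replicas gives you fast failover, not three times the reconciliation throughput.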
kube-scheduler memory leak begins -- goroutine accumulation from PodTopologySpread plugin
Scheduler memory reaches 4GB -- the container is OOM-killed for exceeding its memory limit
Backup scheduler instance begins leader election (default: 15s lease duration + retry)
New scheduler leader elected -- begins processing Pending pod queue
Scheduler works through 200 queued Pending pods -- but scheduling is sequential per cycle
All pods scheduled -- but 3 batch windows already missed their SLA
The question this raises
The scheduler is stateless -- restarting it should be instant. Why did 200 pods stay Pending for 45 minutes after the scheduler restarted?
A pod has been Pending for 20 minutes. kubectl describe pod shows event: "0/5 nodes are available: 5 node(s) didn't match pod topology spread constraints." What is the most likely fix?
Lesson outline
Scheduling: finding the best node. Control loops: maintaining desired state.
Without the scheduler, every pod would need a manual node assignment. Without controllers, every component failure would need manual detection and replacement. The scheduler automates placement decisions based on resources and constraints. The controller manager automates the endless loop of "desired state differs from actual state -- fix it". Together they make Kubernetes self-healing.
Filter phase (scheduling)
For each unscheduled pod, the scheduler filters all nodes to find feasible ones: resource requests fit available capacity, no untolerated taints, node/pod affinity satisfied, topology spread constraints met. Nodes failing any filter are eliminated.
Score phase (scheduling)
Feasible nodes are scored by multiple scoring plugins: LeastRequestedPriority (prefer less loaded nodes), NodeAffinityPriority (prefer nodes matching soft affinity), PodTopologySpreadPriority (prefer balanced spread). The highest score wins.
Deployment controller
Watches Deployment objects. Creates/updates ReplicaSets. Manages the rolling update strategy (maxUnavailable, maxSurge). Ensures the correct ReplicaSet is scaled up and old ones scaled down. One of the most critical reconciliation loops.
Node controller
Detects failed nodes by monitoring kubelet heartbeats. After node-monitor-grace-period (default 40s) of no heartbeat: marks the node NotReady. After pod-eviction-timeout (default 5m) in NotReady: evicts pods from the node so they can be rescheduled elsewhere.
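The two timeouts above can be sketched as a simple decision function. The constants mirror the defaults quoted in the text, but the function itself is illustrative, not kube code:

```python
# Hedged sketch of the Node controller's heartbeat timing.
NODE_MONITOR_GRACE_PERIOD = 40    # seconds without heartbeat -> mark NotReady
POD_EVICTION_TIMEOUT = 300        # seconds in NotReady -> evict pods

def node_action(seconds_since_heartbeat: float) -> str:
    """What the Node controller does given the heartbeat age."""
    if seconds_since_heartbeat <= NODE_MONITOR_GRACE_PERIOD:
        return "Ready"
    if seconds_since_heartbeat <= NODE_MONITOR_GRACE_PERIOD + POD_EVICTION_TIMEOUT:
        return "NotReady"             # marked, but pods stay (for now)
    return "NotReady+EvictPods"       # pods rescheduled elsewhere

assert node_action(10) == "Ready"
assert node_action(60) == "NotReady"              # past 40s grace period
assert node_action(400) == "NotReady+EvictPods"   # past 40s + 5m
```

The practical consequence: a node can be dead for over five minutes before its pods are rescheduled, which is why latency-sensitive workloads often rely on tighter taint-based eviction tolerations.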
SCHEDULING DECISION PATH:
kubectl apply -> API server -> etcd: Pod created with nodeName=""
|
kube-scheduler WATCHES for unscheduled pods
|
For each unscheduled pod:
+-------------------------------------+
| FILTER PHASE |
| Check each node: |
| - Resources fit? (CPU/mem requests) |
| - No unmatched Taints? |
| - nodeAffinity satisfied? |
| - podAntiAffinity satisfied? |
| - TopologySpread max skew met? |
| -> Set of FEASIBLE nodes |
+-------------------------------------+
| SCORE PHASE |
| Score each feasible node: |
| - LeastRequested (prefer empty) |
| - NodeAffinity soft match bonus |
| - TopologySpread balance score |
| -> Highest score node selected |
+-------------------------------------+
| BIND |
| Write pod.spec.nodeName = node-X |
| kubelet on node-X sees the pod |
| and starts the container |
+-------------------------------------+
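The filter -> score -> bind path in the diagram can be condensed into a few lines of Python. Node shapes, resource fields, and the single LeastRequested-style score are toy stand-ins for the real plugin framework:

```python
def filter_nodes(pod, nodes):
    # FILTER: keep only nodes whose free capacity fits the pod's requests
    return [n for n in nodes if n["free_cpu"] >= pod["cpu"]
                             and n["free_mem"] >= pod["mem"]]

def score_node(pod, node):
    # SCORE: LeastRequested-style -- prefer the node left emptiest
    return (node["free_cpu"] - pod["cpu"]) + (node["free_mem"] - pod["mem"])

def schedule(pod, nodes):
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None                      # 0 feasible nodes -> pod stays Pending
    # BIND: in real life, write pod.spec.nodeName via the API server
    return max(feasible, key=lambda n: score_node(pod, n))["name"]

nodes = [
    {"name": "node-a", "free_cpu": 2, "free_mem": 4},
    {"name": "node-b", "free_cpu": 8, "free_mem": 16},
]
assert schedule({"cpu": 4, "mem": 8}, nodes) == "node-b"    # only feasible node
assert schedule({"cpu": 16, "mem": 1}, nodes) is None       # nothing fits: Pending
```

Note how a `None` result maps directly to the "0/N nodes available" event: filtering eliminated everything, so there was nothing to score.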
CONTROLLER RECONCILIATION PATH:
API server watch event (ADDED/MODIFIED/DELETED)
|
v
Controller (Deployment, ReplicaSet, Node, etc.)
|-- Compute diff: desired vs actual
|-- Create/update/delete objects via API server
    |-- Watch for next change event

The scheduler runs filter then score for each pod independently. A topology constraint failure in the filter phase means 0 feasible nodes -- the pod stays Pending. Read the filter failure reason in kubectl describe pod events to understand which constraint failed.
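The controller reconciliation path above boils down to "diff desired vs actual, issue API calls". A toy sketch of that diff step -- real controllers use informers, work queues, and owner references, none of which appear here:

```python
def reconcile(desired: int, actual: list) -> list:
    """Return the API actions needed to converge actual -> desired replicas."""
    actions = []
    # Too few pods: create the missing ones (names here are illustrative).
    for i in range(len(actual), desired):
        actions.append(("create", f"pod-{i}"))
    # Too many pods: delete the surplus.
    for name in actual[desired:]:
        actions.append(("delete", name))
    return actions

assert reconcile(3, ["pod-0"]) == [("create", "pod-1"), ("create", "pod-2")]
assert reconcile(1, ["pod-0", "pod-1"]) == [("delete", "pod-1")]
assert reconcile(2, ["pod-0", "pod-1"]) == []   # already converged: do nothing
```

The loop is level-triggered: it compares current state to desired state on every pass, so a missed watch event is repaired on the next reconciliation rather than lost forever.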
Scheduling failure diagnosis
A pod is stuck in Pending with 5 nodes in the cluster
“The cluster must be out of resources. Scale up nodes to fix the scheduling problem.”
“Resource shortage is only ONE of many reasons for Pending pods. kubectl describe pod events list the exact filter failure: it might be a missing toleration for a node taint, an unsatisfied node affinity, a topology spread constraint requiring more zones, or a PVC that has not bound yet. Read the events FIRST -- scaling nodes solves resource issues but not constraint issues.”
The full scheduling and reconciliation lifecycle
1. Pod enters the scheduling queue -- the scheduler maintains a priority queue of unscheduled pods. Pods with higher PriorityClass values are scheduled first. New pods are added to the queue when the scheduler's watch detects nodeName=="" on a pod.
2. Filter phase runs against the nodes -- the scheduler evaluates nodes against every filter plugin for this pod. In large clusters, percentageOfNodesToScore limits how many nodes are considered for performance; by default the percentage shrinks with cluster size, from roughly 50% down toward a 5% floor. If all considered nodes fail filtering, a "0/N nodes available: X reason, Y reason..." event is created on the pod.
3. Score phase ranks feasible nodes -- each feasible node receives a score from each scoring plugin. Scores are normalized and weighted. The node with the highest total score is selected. Ties are broken randomly.
4. Scheduler binds the pod -- the scheduler writes pod.spec.nodeName = "selected-node". The kubelet on that node sees the pod via its watch and begins the container creation sequence.
5. Controller reconciliation: Deployment -> ReplicaSet -> Pod -- when you apply a Deployment, the Deployment controller creates a ReplicaSet. The ReplicaSet controller creates Pods. Each Pod gets scheduled. The Deployment controller monitors rollout progress (ready pods vs desired) and manages the rolling update strategy.
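Step 1's priority ordering can be illustrated with a heap; pod names and priority values are made up for the example (2000001000 happens to be the value of the built-in system-node-critical PriorityClass, but treat it as just a number here):

```python
import heapq

# The scheduler's active queue pops higher-PriorityClass pods first.
# heapq is a min-heap, so we negate the priority to get max-first order.
queue = []
for name, priority in [("batch-job", 0),
                       ("system-critical", 2000001000),
                       ("web-api", 1000)]:
    heapq.heappush(queue, (-priority, name))

order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
assert order == ["system-critical", "web-api", "batch-job"]
```

This is why a flood of low-priority batch pods cannot starve a high-priority system pod out of the queue: the system pod is always scheduled first.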
The Events section in kubectl describe pod output contains the exact scheduling failure reason. This is the most important diagnostic information for Pending pods -- always read it before guessing.

# Find all Pending pods across the cluster
$ kubectl get pods -A --field-selector=status.phase=Pending

# Get the scheduling failure reason (ALWAYS check events first)
$ kubectl describe pod <pending-pod-name>
# Look for the Events section at the bottom:
# Events:
#   Warning  FailedScheduling  <time>  default-scheduler
#   0/5 nodes available:
#   1 node(s) didn't match pod affinity rules,
#   2 node(s) had untolerated taint {node-role.kubernetes.io/control-plane:NoSchedule},
#   2 node(s) didn't have enough memory.

# Common failure reasons and fixes:
# "Insufficient cpu/memory"          -> add nodes or reduce requests
# "didn't match node affinity"       -> check nodeAffinity in pod spec
# "had untolerated taint"            -> add a toleration for the taint
# "didn't match pod topology spread" -> add nodes in missing zones
#                                       OR change whenUnsatisfiable: ScheduleAnyway

# Monitor scheduling queue depth (alert on Pending > 5 minutes)
$ kubectl get --raw='/metrics' | grep 'scheduler_pending_pods'
# scheduler_pending_pods{queue="active"}        = pods being scheduled
# scheduler_pending_pods{queue="unschedulable"} = pods that cannot be scheduled

Topology spread constraints with whenUnsatisfiable: DoNotSchedule will keep pods Pending indefinitely if the constraint cannot be met. Change to ScheduleAnyway to allow scheduling even if spread is not perfect.
Blast radius: scheduler and controller failures
Topology spread constraint too strict -- pods Pending indefinitely
# Topology spread: maxSkew=1 means the pod count difference between
# the most and least loaded zone must be <= 1.
# Gotcha: with whenUnsatisfiable: DoNotSchedule, once one zone can no
# longer accept pods (nodes full or cordoned), new pods cannot go to
# the other zones either -- that would push the skew above 1.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule   # BLOCKS scheduling
        labelSelector:
          matchLabels:
            app: my-api
      containers:
      - name: api
        image: my-api:v1
# Result: pods spread evenly while capacity allows, but any replica that
# would push the skew above 1 stays Pending FOREVER

# Option 1: Allow scheduling even if spread is not perfect
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway   # Schedule even if constraint unmet
  labelSelector:
    matchLabels:
      app: my-api

# Option 2: Use pod anti-affinity for zone spread (soft preference)
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:   # Soft: prefer but don't require
    - weight: 100
      podAffinityTerm:
        topologyKey: topology.kubernetes.io/zone
        labelSelector:
          matchLabels:
            app: my-api
# Note: preferredDuring... = soft constraint (won't block scheduling)
#       requiredDuring...  = hard constraint (blocks if unsatisfiable)

topologySpreadConstraints with whenUnsatisfiable: DoNotSchedule creates a hard constraint: if placing a pod would push the zone skew above maxSkew, the pod is not scheduled at all. Once the emptier zones can no longer accept pods, every additional replica stays Pending indefinitely. Use ScheduleAnyway for non-critical spread requirements, or preferredDuringSchedulingIgnoredDuringExecution anti-affinity for soft zone preferences.
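The skew rule driving all of this fits in one function: a placement is feasible only if the resulting max-min pod count across zones stays within maxSkew. A sketch with hypothetical zone names and placements:

```python
from collections import Counter

def can_place(zone, placements, zones, max_skew=1):
    """Would placing one more pod in `zone` keep skew <= max_skew?"""
    counts = Counter({z: 0 for z in zones})   # zones with no pods count as 0
    counts.update(placements)                 # existing pod placements
    counts[zone] += 1                         # hypothetical new placement
    return max(counts.values()) - min(counts.values()) <= max_skew

zones = ["zone-a", "zone-b"]
placed = ["zone-a", "zone-a", "zone-b", "zone-b", "zone-b"]
# zone-b already leads 3 to 2; a 4th pod there would make the skew 2:
assert not can_place("zone-b", placed, zones)
assert can_place("zone-a", placed, zones)
# So if zone-a's nodes are full, the next pod has NO legal zone -> Pending.
```

With DoNotSchedule, `can_place` returning False for every zone with capacity is exactly the "Pending forever" failure mode; ScheduleAnyway turns the same check into a scoring penalty instead of a veto.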
| Constraint Type | Effect if Unsatisfied | Use Case | Risk |
|---|---|---|---|
| requiredDuringScheduling anti-affinity | Pod stays Pending forever | True HA -- must separate replicas | High -- can block all scheduling |
| preferredDuringScheduling anti-affinity | Pod schedules on any node | Prefer separation -- tolerate co-location | Low -- never blocks scheduling |
| topologySpread DoNotSchedule | Pod stays Pending if spread impossible | Strict zone distribution | High -- blocks on zone count mismatch |
| topologySpread ScheduleAnyway | Pod schedules but spread penalised | Prefer zone distribution | Low -- never blocks scheduling |
| nodeAffinity required | Pod stays Pending if no matching nodes | GPU/specialized hardware | Medium -- ensure matching nodes exist |
| nodeAffinity preferred | Pod schedules on any node | Prefer specific hardware | Low -- never blocks scheduling |
How the scheduler makes placement decisions
📖 What the exam expects
The scheduler watches for unscheduled pods (nodeName is empty). For each pod, it runs filtering (which nodes can run this pod? -- resource availability, taints, affinity, topology constraints) then scoring (which of the feasible nodes is the best? -- spreading, resource balance, custom scores). It assigns the highest-scoring node by writing pod.spec.nodeName.
Appears as "why is my pod Pending?" in troubleshooting interviews, and "how does Kubernetes scheduling work?" in architecture discussions. Controller-manager questions often appear as "what is a reconciliation loop?" or "how does Kubernetes detect and replace failed nodes?".
Common questions:
Strong answer: Mentioning kube_pod_status_phase{phase="Pending"} as the alert metric for scheduling backlogs. Knowing that the Node controller has a --pod-eviction-timeout (default 5 minutes) before evicting pods from NotReady nodes. Understanding that scheduler plugins can be extended for custom placement logic.
Red flags: Thinking the scheduler can override resource limits or node taints. Not knowing that kubectl describe pod events contain the scheduling failure reason. Believing multiple controller-manager replicas all process reconciliation simultaneously.