Kubernetes did not emerge in a vacuum. It was shaped by a decade of operating Borg and Omega at Google scale, distilled into an open-source project in 2014. Understanding the design decisions inherited from Borg -- and the ones deliberately changed -- explains why Kubernetes behaves the way it does under failure.
Know the historical arc: Borg -> Kubernetes -> CNCF ecosystem. Understand why containers became popular before Kubernetes existed. Know what the CRI, CNI, and CSI interfaces enable.
Understand the design decisions inherited from Borg: declarative desired state, reconciliation loops, separation of control and data plane. Know why these matter when you're debugging why a pod did not get scheduled.
Evaluate the CNCF ecosystem critically -- understand which projects solve real problems vs adding complexity. Know the Kubernetes release cycle and its impact on deprecation planning (APIs removed 12-18 months after deprecation notice).
- 2003 -- Google's Borg manages hundreds of thousands of machines across cells -- tightly coupled control and data plane
- 2008 -- Google's cgroups work is merged into the Linux kernel -- the foundation for container resource management
- 2013 -- Docker democratises containers -- makes the Linux namespace/cgroup stack accessible to developers
- 2014 -- Google open-sources Kubernetes at DockerCon -- based on Borg lessons, designed from day one with control/data plane separation
- 2015 -- CNCF (Cloud Native Computing Foundation) formed -- Kubernetes donated as founding project
- 2018 -- Kubernetes 1.11 era -- CRI, CNI, and CSI mature into pluggable runtime, network, and storage interfaces
The question this raises
Why do existing Kubernetes pods keep running even when etcd is down and the API server is unreachable -- and what does this tell you about the control plane / data plane separation that Borg taught the Kubernetes designers?
Your cluster's etcd is down and the API server is unreachable. What happens to pods that were already running before the outage?
Lesson outline
The scale problem that created Kubernetes
By 2010, Google was running billions of containers per week on Borg. The lessons: manually placing workloads does not scale, tightly coupled control and data planes create failure cascades, and the same container runtime should run anywhere (test, staging, prod, across cloud providers). Kubernetes was designed to solve exactly these problems -- with open interfaces so the community could build the runtime, network, and storage layers independently.
Declarative desired state
Use for: You declare what should run (Deployment with 3 replicas). Controllers continuously reconcile actual state toward that desired state. No imperative "start this container now" -- instead "ensure 3 replicas always exist".
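The "ensure 3 replicas always exist" idea can be sketched as a tiny reconcile function (illustrative Python, not real controller code): given declared desired state and observed actual state, it returns the actions needed to converge -- and returns nothing when the two already match, which is why re-applying the same spec is harmless.

```python
# Minimal sketch of a reconciliation step: compare desired vs actual,
# emit only the actions needed to converge. Names are illustrative.

def reconcile(desired_replicas, actual_pods):
    """Return the actions needed to converge actual state to desired state."""
    diff = desired_replicas - len(actual_pods)
    if diff > 0:
        # too few pods: create the missing ones
        return [("create", f"pod-{i}") for i in range(diff)]
    if diff < 0:
        # too many pods: delete the surplus
        return [("delete", pod) for pod in actual_pods[:-diff]]
    return []  # already converged -- reconciling again is a no-op

print(reconcile(3, ["pod-a", "pod-b"]))           # one create needed
print(reconcile(3, ["pod-a", "pod-b", "pod-c"]))  # nothing to do
```

The empty-list case is the heart of idempotency: running the loop twice against the same state produces no extra work.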
CRI (Container Runtime Interface)
Use for: gRPC API between kubelet and container runtime. Enables swapping runtimes (Docker -> containerd -> CRI-O -> gVisor) without changing Kubernetes code. Created after learning that tight coupling to Docker caused integration problems.
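The pluggability the CRI provides can be sketched with a plain interface boundary. This is a Python analogy, not the real CRI: the actual interface is a gRPC API (RuntimeService/ImageService), and the class and method names below are invented for illustration. The point is that the caller (kubelet) never changes when the implementation behind the boundary is swapped.

```python
# Illustrative sketch of an interface boundary like the CRI.
# Real CRI is gRPC; these Python classes and names are invented.
from abc import ABC, abstractmethod

class ContainerRuntime(ABC):
    @abstractmethod
    def pull_image(self, image: str) -> None: ...
    @abstractmethod
    def create_container(self, name: str, image: str) -> str: ...

class ContainerdRuntime(ContainerRuntime):
    def pull_image(self, image):
        print(f"containerd: pulled {image}")
    def create_container(self, name, image):
        return f"containerd://{name}"

class CrioRuntime(ContainerRuntime):
    def pull_image(self, image):
        print(f"cri-o: pulled {image}")
    def create_container(self, name, image):
        return f"cri-o://{name}"

def kubelet_start_pod(runtime: ContainerRuntime, name: str, image: str) -> str:
    # This caller never changes when the runtime behind the interface is swapped
    runtime.pull_image(image)
    return runtime.create_container(name, image)

print(kubelet_start_pod(ContainerdRuntime(), "web", "nginx:1.25"))
print(kubelet_start_pod(CrioRuntime(), "web", "nginx:1.25"))
```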
CNI (Container Network Interface)
Use for: Plugin spec for container networking setup. Enables swapping network implementations (Flannel, Calico, Cilium) without changing Kubernetes. Called by kubelet at pod creation to set up veth pairs and IP assignment.
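The CNI contract itself is simple: the caller executes a plugin binary with `CNI_*` environment variables and a JSON network config on stdin. A sketch of what that invocation carries (the network name, subnet, and container ID below are illustrative values, not anything a real cluster requires):

```python
# Sketch of the CNI invocation contract: plugin binaries receive CNI_*
# environment variables plus a JSON network config on stdin.
import json

def build_cni_invocation(container_id: str, netns_path: str):
    env = {
        "CNI_COMMAND": "ADD",             # ADD at pod creation, DEL at teardown
        "CNI_CONTAINERID": container_id,
        "CNI_NETNS": netns_path,          # e.g. /var/run/netns/<id>
        "CNI_IFNAME": "eth0",             # interface name inside the pod
    }
    stdin_config = json.dumps({
        "cniVersion": "1.0.0",
        "name": "cluster-net",            # illustrative network name
        "type": "bridge",                 # selects which plugin binary runs
        "ipam": {"type": "host-local", "subnet": "10.244.0.0/16"},
    })
    return env, stdin_config

env, config = build_cni_invocation("abc123", "/var/run/netns/abc123")
print(env["CNI_COMMAND"], json.loads(config)["type"])
```

Because the contract is just "binary + env + JSON", Flannel, Calico, and Cilium can each ship their own binary without Kubernetes knowing anything about their internals.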
CSI (Container Storage Interface)
Use for: Plugin spec for persistent storage. Enables storage vendors (AWS EBS, GCP Persistent Disk, Portworx, Ceph) to integrate without modifying Kubernetes source code. Replaced the old in-tree volume plugins.
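The indirection CSI relies on can be modeled as data: a StorageClass names a driver via its `provisioner` field, and a PVC selects the class by name, so Kubernetes core never contains vendor code. A toy lookup (the driver names are real CSI driver identifiers; the surrounding objects are heavily simplified):

```python
# Toy model of CSI indirection: StorageClass -> provisioner (driver name),
# PVC -> storageClassName. Objects simplified to dicts for illustration.
storage_classes = {
    "fast-ssd": {"provisioner": "ebs.csi.aws.com"},        # AWS EBS CSI driver
    "shared":   {"provisioner": "pd.csi.storage.gke.io"},  # GCP PD CSI driver
}

pvc = {"name": "data", "storageClassName": "fast-ssd", "size": "10Gi"}

def driver_for(pvc):
    # The control plane routes provisioning to whichever CSI driver
    # the StorageClass names -- no in-tree vendor code involved.
    return storage_classes[pvc["storageClassName"]]["provisioner"]

print(driver_for(pvc))
```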
```
BORG (2003-2014)                      KUBERNETES (2014-present)

+---------------------------+         +----------------------------------+
| Borgmaster (control)      |         | Control Plane                    |
| - Task scheduling         |         | - kube-apiserver (REST + auth)   |
| - Health monitoring       |         | - etcd (distributed state)       |
| - Tightly coupled to      |         | - kube-scheduler (placement)     |
|   Borglet (data plane)    |         | - controller-manager (reconcile) |
+---------------------------+         +----------------|-----------------+
                                                       | (independent)
+---------------------------+         +----------------|-----------------+
| Borglet (per-machine)     |         | Data Plane (per node)            |
| - Runs tasks              |         | - kubelet (local reconciliation) |
| - Reports to Borgmaster   |         | - container runtime (CRI)        |
| - Cannot operate if       |         | - kube-proxy (network rules)     |
|   Borgmaster unreachable  |         | RUNS INDEPENDENTLY FROM CP!      |
+---------------------------+         +----------------------------------+

KEY DIFFERENCE:
  Borg: data plane depends on control plane -- CP down = DP disrupted
  K8s:  data plane operates autonomously   -- CP down = no new pods,
        but existing pods keep running until the control plane recovers
```

The control plane / data plane separation is the key design lesson from Borg. Existing workloads continue running autonomously even during complete control plane outages.
Design model evolution
The etcd cluster loses quorum and the API server becomes unavailable
Common misconception: "All workloads stop immediately because the orchestration system is unavailable. Similar to a database being down -- nothing works."

What actually happens: "Existing pods continue running. The kubelet on each node continues to maintain its locally known pod state. Pods that crash are restarted by the local kubelet using its cached spec. New scheduling, new deployments, and config changes are blocked -- but the data plane runs autonomously."
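A toy simulation (all names invented) makes the kubelet's autonomy concrete: crash recovery uses only the node's cached pod specs, so it works with or without the API server -- only operations that need the control plane are blocked.

```python
# Toy model of kubelet autonomy during a control plane outage.
# Class, field, and message names are invented for illustration.
class Kubelet:
    def __init__(self, cached_specs):
        self.cached_specs = cached_specs              # pods already assigned here
        self.running = {name: True for name in cached_specs}

    def sync(self, api_server_up: bool):
        actions = []
        for name, up in self.running.items():
            if not up:                                # crashed container
                self.running[name] = True             # restart from cached spec
                actions.append(f"restarted {name} locally")
        if not api_server_up:
            # status reporting and NEW scheduling need the API server
            actions.append("status reports queued; no new pods can be scheduled")
        return actions

node = Kubelet(cached_specs=["web-1", "web-2"])
node.running["web-1"] = False                         # simulate a crash mid-outage
print(node.sync(api_server_up=False))
```

Note what the sketch cannot do: with the API server down, a pod on a *failed node* is not rescheduled elsewhere, because that decision belongs to the control plane.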
How Kubernetes maintains desired state
1. You declare desired state via the API -- kubectl apply -f deployment.yaml sends a REST request to the API server, which validates and stores the Deployment object in etcd. No container has been created yet.
2. Controllers watch for changes -- the Deployment controller (inside controller-manager) watches the API server for Deployment objects. It sees the new Deployment and calculates: "desired replicas=3, actual replicas=0, need to create 3 ReplicaSets and Pods".
3. Scheduler assigns pods to nodes -- the scheduler watches for unscheduled pods (pod.spec.nodeName is empty). It evaluates node resources, affinity rules, and taints. It assigns each pod to a node by writing pod.spec.nodeName.
4. Kubelet creates the container -- the kubelet on each node watches for pods assigned to it (pod.spec.nodeName == this node). It calls CRI to pull the image and create the container. It reports status back to the API server.
5. Continuous reconciliation -- if a pod crashes, the kubelet reports the failure to the API server. The ReplicaSet controller sees "desired=3, actual=2" and creates a new pod. The scheduler assigns it. The kubelet runs it. This loop runs continuously, forever.
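The pipeline above can be condensed into a single-process toy (all names invented, the dict standing in for etcd): the controller creates pod objects, the scheduler fills in nodeName, and each kubelet runs only the pods assigned to its node.

```python
# Toy, single-process sketch of declare -> controller -> scheduler -> kubelet.
# The `store` dict stands in for etcd; all names are invented.
store = {"deployment": {"replicas": 3}, "pods": []}

def deployment_controller():
    # desired vs actual: create missing pod objects (unscheduled at first)
    missing = store["deployment"]["replicas"] - len(store["pods"])
    for i in range(missing):
        store["pods"].append({"name": f"my-app-{i}", "nodeName": None})

def scheduler(nodes):
    # assign a node to every pod whose nodeName is still empty
    for i, pod in enumerate(store["pods"]):
        if pod["nodeName"] is None:
            pod["nodeName"] = nodes[i % len(nodes)]

def kubelet(node):
    # each node's kubelet acts only on the pods assigned to it
    return [p["name"] for p in store["pods"] if p["nodeName"] == node]

deployment_controller()
scheduler(["node-a", "node-b"])
print(kubelet("node-a"))  # pods assigned to node-a
```

Each function only reads and writes the shared store -- no component calls another directly, which mirrors how the real components coordinate solely through the API server.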
```shell
# Watch the reconciliation loop in action
# Terminal 1: watch pod count
$ kubectl get pods -w

# Terminal 2: manually delete a pod (the control plane will fight this)
$ kubectl delete pod my-app-abc123
pod "my-app-abc123" deleted

# Immediately in Terminal 1:
# NAME            READY   STATUS
# my-app-abc123   1/1     Running             <- being deleted
# my-app-abc123   0/1     Terminating         <- deletion in progress
# my-app-xyz789   0/1     Pending             <- NEW pod created by ReplicaSet controller
# my-app-xyz789   0/1     ContainerCreating
# my-app-xyz789   1/1     Running             <- back to 3 replicas in ~10 seconds

# The ReplicaSet controller saw desired=3, actual=2, created 1 new pod.
# This is the reconciliation loop -- it runs continuously.

# To ACTUALLY reduce replicas, change desired state:
$ kubectl scale deployment my-app --replicas=2
# Now desired=2, actual=3, controller DELETES one pod.
```

Manually deleting a pod does not reduce the replica count -- it just triggers a replacement, because the desired state (3 replicas) lives in the Deployment spec in etcd. To actually reduce replicas, you must change the desired state; the controller then reconciles actual state toward the new desired state.
Blast radius when Kubernetes design decisions are misunderstood
Imperative kubectl commands in CI/CD instead of declarative GitOps
```shell
#!/bin/bash
# Imperative: state lives in CI/CD history, not in a source of truth.
# Different team members run different versions of this script.
kubectl scale deployment my-app --replicas=5
kubectl set image deployment/my-app app=my-app:v2
kubectl set resources deployment/my-app --limits=cpu=500m,memory=512Mi
# If any command fails, state is partially applied.
# No audit trail. No reconciliation. No self-healing.
```

```yaml
# Declarative: desired state in git, reconciled continuously.
# deployment.yaml in git (source of truth):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 5                # scale: change here and commit
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app          # must match the selector above
    spec:
      containers:
        - name: app
          image: my-app:v2   # image: change here and commit
          resources:
            limits:
              cpu: "500m"
              memory: "512Mi"
# Apply with: kubectl apply -f deployment.yaml
# OR let ArgoCD/Flux apply from git automatically
```

Imperative commands create state that exists only in shell history. Declarative YAML in git creates state that is auditable (git blame), reconciled (controllers fix drift), and reproducible (re-applying restores desired state). The reconciliation loop is only as useful as the quality of your declared desired state stored in version control.
| Deployment Model | Operational Overhead | Flexibility | Best For |
|---|---|---|---|
| Managed K8s (EKS/GKE/AKS) | Low -- cloud provider manages control plane | Medium -- limited control plane customisation | Most teams -- start here unless requirements dictate self-managed |
| Self-managed (kubeadm) | High -- all control plane upgrades manual | Full -- any configuration | Teams with specific compliance or on-prem requirements |
| K3s / K0s (lightweight) | Low -- minimal footprint | Medium -- some features removed | Edge computing, IoT, dev environments |
| Fully managed + GitOps (EKS + ArgoCD) | Low ops, high automation | High -- declarative everything | Platform teams wanting self-service developer experience |
Declarative vs imperative orchestration
📖 What the exam expects
Kubernetes uses a declarative model: you specify desired state in YAML, the control plane continuously reconciles actual state toward desired state. Controllers watch resource objects and make changes to move the cluster toward the desired configuration.
Asked in senior engineering and staff interviews as "why is Kubernetes designed this way" or "what is the control plane / data plane separation and why does it matter". Also appears as context for deep-dive questions on any specific component.
Strong answer: Mentioning Borg as the design ancestor. Explaining the reconciliation loop and why it makes kubectl apply idempotent. Knowing that the CNCF graduated projects (Prometheus, Envoy, Fluentd) represent production-proven ecosystem choices.
Red flags: Thinking the control plane and data plane must both be healthy for pods to run. Not knowing what CRI, CNI, or CSI abstractions enable (pluggable runtime, network, and storage). Treating Kubernetes as a black box without understanding the declarative reconciliation model.