Jobs run tasks to completion. CronJobs schedule them. Both have subtle failure modes -- from infinite job accumulation filling etcd to concurrency policy misconfigurations spawning duplicate batch runs.
Know how to create a Job (run to completion) and a CronJob (scheduled Job). Understand completions vs parallelism.
Configure backoffLimit, activeDeadlineSeconds, successfulJobsHistoryLimit, failedJobsHistoryLimit. Understand concurrencyPolicy: Forbid vs Replace vs Allow.
Design batch pipelines with proper job cleanup, dead letter queues for failed jobs, and monitoring on job duration SLOs. Understand startingDeadlineSeconds for missed schedules.
CronJob typo: schedule */5 instead of 0 2 -- fires every 5 minutes instead of nightly
576 job executions accumulated; etcd object count explodes
etcd reaches 8GB storage quota -- control plane switches to read-only mode
All kubectl write operations fail: "etcdserver: mvcc: database space exceeded"
etcd defrag run, old jobs manually deleted, control plane restored
The question this raises
What limits exist on CronJob accumulation, and what happens when those limits are not configured against a runaway schedule?
A CronJob has schedule "*/10 * * * *" and concurrencyPolicy: Allow. The job normally takes 8 minutes, but today it takes 15 minutes due to slow network. What happens at the next scheduled trigger?
The Finite Work Problem
Deployments are designed for continuous services. Jobs solve a different problem: run a task until N pods have completed successfully, retrying transient failures along the way. Finished Jobs and their pods are not removed automatically unless you configure cleanup. CronJobs add time-based scheduling on top of Jobs.
Indexed Job (map-reduce)
Use for: Process N shards in parallel. Each pod gets JOB_COMPLETION_INDEX (0 to N-1) and processes its shard. completionMode: Indexed. Used for large-scale data processing, image resizing, ML training shards.
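A minimal sketch of an Indexed Job, to make the shard-to-pod mapping concrete. The Job name, the shard count of 4, and the busybox image are illustrative assumptions, not values from the lesson; a real worker would use the index to pick its input shard.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: shard-processor            # hypothetical name
spec:
  completionMode: Indexed          # each pod receives a unique completion index
  completions: 4                   # assumed shard count: indexes 0..3
  parallelism: 2                   # at most 2 shards processed at once
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox:1.36      # placeholder image for illustration
          # The shell expands JOB_COMPLETION_INDEX, which Kubernetes sets per pod.
          command: ["sh", "-c", "echo processing shard $JOB_COMPLETION_INDEX"]
```

The Job is complete once there is one successful pod for every index from 0 to 3.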
Work Queue Job
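Use for: Pods pull work from a queue (SQS, Redis, RabbitMQ) until empty. Leave completions unset and set parallelism=N. Once any pod exits 0 (signaling the queue is empty), no new pods are created, and the Job completes when the remaining pods have terminated. Setting completions=1 would cap the Job at a single running pod. Used for email sending, notification dispatch.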
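A sketch of the work queue pattern as a manifest. It leaves `completions` unset, which is how the Kubernetes documentation describes this pattern. The Job name, image, and queue address are hypothetical placeholders, and the worker is assumed to exit 0 only when it finds the queue empty.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: queue-drain                          # hypothetical name
spec:
  parallelism: 3                             # three workers pull from the queue concurrently
  # completions is intentionally unset for the work queue pattern
  backoffLimit: 4
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: worker
          image: example.com/queue-worker:v1 # hypothetical image
          env:
            - name: QUEUE_URL                # hypothetical variable read by the worker
              value: "redis://queue.example.internal:6379"
```

Because any successful exit stops the Job from creating new pods, a worker that exits 0 early (for example on a momentarily empty queue) can leave work unprocessed, so the exit condition deserves care.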
CronJob with Forbid
Use for: Schedule ETL or backup jobs. concurrencyPolicy: Forbid ensures no overlapping runs. startingDeadlineSeconds: 300 allows a 5-minute window to catch missed runs without backfilling every missed instance.
CronJob: nightly-etl (schedule: "0 2 * * *")
|
v [at 02:00 UTC]
Job: nightly-etl-28492810
|
+-- Pod: nightly-etl-28492810-xk7p2 (Running)
| |-- exits 0 (success)
+-- Pod: nightly-etl-28492810-mn3q1 (Completed)
CronJob history limits:
successfulJobsHistoryLimit: 3 <- keep last 3 successful jobs
failedJobsHistoryLimit: 1 <- keep last 1 failed job
Without TTLAfterFinished or history limits:
Every run creates a new Job object in etcd
576 runs in 48h = 576 Job objects + N*576 Pod objects
etcd storage quota exceeded -> control plane read-only

CronJob spawns Jobs; Jobs spawn Pods; history limits control etcd accumulation
CronJob Safety Configuration
Long-running ETL with default concurrencyPolicy: Allow
“Slow network causes job to run 15 min instead of 8; next scheduled run overlaps; concurrent runs start to stack up”
“concurrencyPolicy: Forbid skips overlapping runs; alert on job duration SLO to catch slow runs before they pile up”
No job history limits set
“Every run accumulates forever in etcd; 576 runs in 48h fills 8GB quota; control plane dies”
“successfulJobsHistoryLimit: 3, failedJobsHistoryLimit: 3, ttlSecondsAfterFinished: 86400 -- automatic cleanup”
CronJob trigger and Job lifecycle
1. CronJob controller checks schedule every 10 seconds against current time
2. At scheduled time, creates a new Job object owned by the CronJob
3. Job controller creates pods per parallelism setting; tracks completions
4. Pod exits 0: Job marks one completion; creates next pod if completions not met
5. Pod exits non-0: Job increments failure count; retries up to backoffLimit
6. Job reaches completions or backoffLimit: Job marked Complete or Failed
7. CronJob trims history to successfulJobsHistoryLimit / failedJobsHistoryLimit
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-etl
spec:
  schedule: "0 2 * * *"
  # Forbid prevents concurrent ETL runs from overlapping
  concurrencyPolicy: Forbid            # skip if previous still running
  startingDeadlineSeconds: 300         # 5min window to catch missed runs
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 2
      # activeDeadlineSeconds: hard timeout on entire job -- prevents runaway jobs
      activeDeadlineSeconds: 7200      # 2h hard timeout
      # ttlSecondsAfterFinished: auto-delete job after 24h -- prevents etcd accumulation
      ttlSecondsAfterFinished: 86400
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: etl
              image: etl-pipeline:v3
Job and CronJob failure modes
CronJob with default history limits causes etcd accumulation
Misconfigured:

spec:
  schedule: "*/5 * * * *"        # typo: every 5min not nightly
  concurrencyPolicy: Allow       # default: concurrent runs OK
  # no successfulJobsHistoryLimit -- defaults to 3
  # no failedJobsHistoryLimit -- defaults to 1
  # no ttlSecondsAfterFinished
  # Result: rapid accumulation fills etcd

Corrected:

spec:
  schedule: "0 2 * * *"          # verified: nightly at 02:00
  concurrencyPolicy: Forbid      # no overlapping runs
  startingDeadlineSeconds: 300
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      activeDeadlineSeconds: 7200
      ttlSecondsAfterFinished: 86400

The typo "*/5" instead of "0 2" is common. Combine correct schedule with concurrencyPolicy: Forbid and ttlSecondsAfterFinished to prevent runaway accumulation regardless of schedule errors.
| Setting | Safe value | Risk if omitted | Impact |
|---|---|---|---|
| concurrencyPolicy | Forbid for ETL | Allow: overlapping runs pile up whenever a run outlasts its interval | etcd saturation, node exhaustion |
| activeDeadlineSeconds | 2x expected duration | None: hanging jobs run forever | Resource leak, Forbid policy deadlock |
| ttlSecondsAfterFinished | 86400 (24h) | None: jobs accumulate in etcd | etcd storage quota exceeded |
| backoffLimit | 2-3 for ETL | 6 (default): slow failure detection | Resource waste, alert delay |
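| startingDeadlineSeconds | 300 | None: no cutoff for late starts; after more than 100 missed schedules the controller refuses to start the Job and logs an error | Stale run fires long after its window, or the CronJob stops scheduling after an outage |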
Job completions vs parallelism
📖 What the exam expects
completions: total successful pod completions required. parallelism: max pods running simultaneously. completions=5, parallelism=2 means 5 total runs with max 2 at once.
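The completions=5, parallelism=2 case can be sketched as a manifest. The Job name, image, command, and the timeout and TTL values are illustrative assumptions rather than values from the lesson.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-five                 # hypothetical name
spec:
  completions: 5                   # Job succeeds after 5 pods exit 0
  parallelism: 2                   # never more than 2 pods running at once
  backoffLimit: 3                  # failed-pod retries allowed before the Job is marked Failed
  activeDeadlineSeconds: 600       # assumed hard timeout for the whole Job
  ttlSecondsAfterFinished: 3600    # assumed: delete the finished Job after 1h
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: task
          image: busybox:1.36      # placeholder image for illustration
          command: ["sh", "-c", "echo working; sleep 10"]
```

If every pod takes about the same time, the pods run in waves of roughly 2, 2, and 1, since the controller does not run more pods than there are completions remaining.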
System design questions about batch processing pipelines and operational questions about CronJob failure modes.
Strong answer: Mentions TTLAfterFinished for automatic job cleanup, startingDeadlineSeconds to handle missed schedules, and completionMode: Indexed for sharded batch processing.
Red flags: Leaving concurrencyPolicy at default Allow for long-running ETL jobs, or not knowing that job history accumulates in etcd.