The Simplified Tech

© 2026 TheSimplifiedTech. All rights reserved.

Interactive Explainer

Jobs & CronJobs: Batch and Scheduled Workloads

Jobs run tasks to completion. CronJobs schedule them. Both have subtle failure modes -- from infinite job accumulation filling etcd to concurrency policy misconfigurations spawning duplicate batch runs.

Relevant for: Junior, Mid-level, Senior
Why this matters at your level
Junior

Know how to create a Job (run to completion) and a CronJob (scheduled Job). Understand completions vs parallelism.

Mid-level

Configure backoffLimit, activeDeadlineSeconds, successfulJobsHistoryLimit, failedJobsHistoryLimit. Understand concurrencyPolicy: Forbid vs Replace vs Allow.

Senior

Design batch pipelines with proper job cleanup, dead letter queues for failed jobs, and monitoring on job duration SLOs. Understand startingDeadlineSeconds for missed schedules.


~3 min read
Incident timeline: etcd Saturation -- CronJob Runaway -- 2022
T+0

CronJob typo: schedule "*/5 * * * *" instead of "0 2 * * *" -- fires every 5 minutes instead of nightly at 02:00

T+48h

576 job executions accumulated; etcd object count explodes

T+48h 12m

etcd reaches 8GB storage quota -- control plane switches to read-only mode

T+48h 15m

All kubectl write operations fail: "etcdserver: mvcc: database space exceeded"

T+50h

etcd defrag run, old jobs manually deleted, control plane restored

576 -- runaway job executions in 48h
8GB -- the etcd quota that killed the control plane
~2h -- full cluster outage during recovery

The question this raises

What limits exist on CronJob accumulation, and what happens when those limits are not configured against a runaway schedule?

Test your assumption first

A CronJob has schedule "*/10 * * * *" and concurrencyPolicy: Allow. The job normally takes 8 minutes, but today it takes 15 minutes due to slow network. What happens at the next scheduled trigger?

Lesson outline

What Jobs and CronJobs Solve

The Finite Work Problem

Deployments are designed for continuous services. Jobs solve a different problem: run a task until it reaches N successful completions, retrying pods on transient failures, and record the outcome as Complete or Failed. Completed pods are kept for log inspection until the Job is deleted or its TTL expires. CronJobs add time-based scheduling on top of Jobs.
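As a concrete sketch, a minimal run-to-completion Job might look like this (the db-migrate image name and the timeout values are illustrative, not from the incident above):

```yaml
# Minimal run-to-completion Job; db-migrate:v1 is a hypothetical image.
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
spec:
  backoffLimit: 3                # retry failed pods up to 3 times
  activeDeadlineSeconds: 600     # hard 10-minute cap on the whole Job
  ttlSecondsAfterFinished: 3600  # auto-delete 1h after finishing
  template:
    spec:
      restartPolicy: OnFailure   # Jobs require OnFailure or Never
      containers:
        - name: migrate
          image: db-migrate:v1
```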

Indexed Job (map-reduce)

Use for: Process N shards in parallel. Each pod gets JOB_COMPLETION_INDEX (0 to N-1) and processes its shard. completionMode: Indexed. Used for large-scale data processing, image resizing, ML training shards.
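An Indexed Job sketch under these assumptions (shard-worker:v1 and its command are hypothetical; the JOB_COMPLETION_INDEX environment variable is injected by Kubernetes in Indexed mode):

```yaml
# Indexed Job: 10 shards (indexes 0-9), at most 3 pods in flight.
apiVersion: batch/v1
kind: Job
metadata:
  name: resize-shards
spec:
  completionMode: Indexed
  completions: 10    # one successful pod required per index 0..9
  parallelism: 3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: shard-worker:v1   # hypothetical shard processor
          command: ["sh", "-c", "process-shard \"$JOB_COMPLETION_INDEX\""]
```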

Work Queue Job

Use for: Pods pull work from a queue (SQS, Redis, RabbitMQ) until it is empty. Leave completions unset, parallelism=N. The Job completes once a pod exits 0 (signaling an empty queue) and the remaining pods terminate. Used for email sending, notification dispatch.
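A work-queue Job sketch, assuming a hypothetical queue-worker:v2 image that exits 0 when it finds the queue empty (the QUEUE_URL value is illustrative):

```yaml
# Work-queue Job: 5 identical workers drain a shared queue.
apiVersion: batch/v1
kind: Job
metadata:
  name: send-notifications
spec:
  parallelism: 5   # completions left unset: Job completes once a pod succeeds
  backoffLimit: 4
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: worker
          image: queue-worker:v2      # hypothetical: pulls until queue empty
          env:
            - name: QUEUE_URL         # hypothetical config
              value: "redis://queue:6379/0"
```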

CronJob with Forbid

Use for: Schedule ETL or backup jobs. concurrencyPolicy: Forbid ensures no overlapping runs. startingDeadlineSeconds: 300 allows a 5-minute window to catch missed runs without backfilling every missed instance.

The System View

CronJob: nightly-etl (schedule: "0 2 * * *")
         |
         v  [at 02:00 UTC]
      Job: nightly-etl-28492810
         |
         +-- Pod: nightly-etl-28492810-xk7p2  (Running)
         |        |-- exits 0 (success)
         +-- Pod: nightly-etl-28492810-mn3q1  (Completed)

CronJob history limits:
  successfulJobsHistoryLimit: 3  <- keep last 3 successful jobs
  failedJobsHistoryLimit: 1      <- keep last 1 failed job

Without TTLAfterFinished or history limits:
  Every run creates a new Job object in etcd
  576 runs in 48h = 576 Job objects + N*576 Pod objects
  etcd storage quota exceeded -> control plane read-only

CronJob spawns Jobs; Jobs spawn Pods; history limits control etcd accumulation

CronJob Safety Configuration

Situation: Long-running ETL with default concurrencyPolicy: Allow
  Before: Slow network causes the job to run 15 min instead of 8; the next scheduled run overlaps; exponential accumulation begins
  After: concurrencyPolicy: Forbid skips overlapping runs; alert on a job-duration SLO to catch slow runs before they pile up

Situation: No job history limits set
  Before: Every run accumulates forever in etcd; 576 runs in 48h fill the 8GB quota; the control plane dies
  After: successfulJobsHistoryLimit: 3, failedJobsHistoryLimit: 3, ttlSecondsAfterFinished: 86400 -- automatic cleanup

How CronJob Scheduling Works

CronJob trigger and Job lifecycle

1. CronJob controller checks the schedule every 10 seconds against the current time
2. At the scheduled time, it creates a new Job object owned by the CronJob
3. Job controller creates pods per the parallelism setting and tracks completions
4. Pod exits 0: the Job marks one completion and creates the next pod if completions is not yet met
5. Pod exits non-0: the Job increments its failure count and retries up to backoffLimit
6. Job reaches completions or backoffLimit: the Job is marked Complete or Failed
7. CronJob trims history to successfulJobsHistoryLimit / failedJobsHistoryLimit

cronjob-safe.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-etl
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid        # skip this run if the previous is still running
  startingDeadlineSeconds: 300     # 5-minute window to catch missed runs
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 7200    # 2h hard timeout on the entire job -- prevents runaway jobs
      ttlSecondsAfterFinished: 86400 # auto-delete 24h after finishing -- prevents etcd accumulation
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: etl
              image: etl-pipeline:v3

What Breaks in Production: Blast Radius

Job and CronJob failure modes

  • etcd saturation from job accumulation — No history limits + frequent schedule = thousands of Job/Pod objects in etcd. etcd hits 8GB quota -> control plane read-only. Fix: successfulJobsHistoryLimit, failedJobsHistoryLimit, ttlSecondsAfterFinished.
  • CronJob with Allow policy accumulates — Default concurrencyPolicy: Allow + job longer than schedule interval = exponential pod accumulation. Set Forbid for ETL, Replace for cache-warming jobs that must always have a fresh run.
  • Missing activeDeadlineSeconds — A hanging job (DB connection never returns) runs forever, consuming node resources and blocking the CronJob from running future instances (if using Forbid). Always set a hard timeout.
  • restartPolicy: Always on Job pods — The Job API rejects restartPolicy: Always; only OnFailure and Never are valid. Always is for Deployments, where the kubelet restarts containers in place indefinitely -- a Job's failure count would never increment, so backoffLimit could never trigger.

CronJob with default history limits causes etcd accumulation

Bug
spec:
  schedule: "*/5 * * * *"    # typo: every 5 min, not nightly
  concurrencyPolicy: Allow   # default: overlapping runs permitted
  # no ttlSecondsAfterFinished -- finished Jobs linger until history trimming
  # history limits (defaults: 3 successful, 1 failed) trim only finished Jobs;
  # runs that overlap under Allow keep stacking up while still active
  # Result: rapid accumulation fills etcd
Fix
spec:
  schedule: "0 2 * * *"          # verified: nightly at 02:00
  concurrencyPolicy: Forbid       # no overlapping runs
  startingDeadlineSeconds: 300
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      activeDeadlineSeconds: 7200
      ttlSecondsAfterFinished: 86400

The typo "*/5" instead of "0 2" is common. Combine correct schedule with concurrencyPolicy: Forbid and ttlSecondsAfterFinished to prevent runaway accumulation regardless of schedule errors.

Decision Guide: Job Configuration

Can multiple instances of this job run at the same time safely?
  Yes → concurrencyPolicy: Allow -- but set activeDeadlineSeconds and monitor duration
  No → concurrencyPolicy: Forbid (skip overlap) or Replace (cancel and restart)
Does each pod process a specific partition of data?
  Yes → completionMode: Indexed -- pods get JOB_COMPLETION_INDEX; no coordination needed
  No → use the work-queue pattern: pods pull from a queue, completions left unset
Is this a fire-and-forget task where failures should not retry?
  Yes → backoffLimit: 0 and restartPolicy: Never -- fail fast, alert immediately
  No → backoffLimit: 2-3 with activeDeadlineSeconds to bound total retry time
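For the fire-and-forget branch, a fail-fast Job sketch (report-gen:v1 is an illustrative image):

```yaml
# Fail-fast Job: no retries; a single pod failure marks the Job Failed.
apiVersion: batch/v1
kind: Job
metadata:
  name: one-shot-report
spec:
  backoffLimit: 0            # do not retry
  template:
    spec:
      restartPolicy: Never   # never restart the container in place
      containers:
        - name: report
          image: report-gen:v1   # hypothetical one-shot task
```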

Cost and Complexity: Batch Configuration Tradeoffs

Setting | Safe value | Risk if omitted | Impact
concurrencyPolicy | Forbid for ETL | Allow: exponential pod accumulation | etcd saturation, node exhaustion
activeDeadlineSeconds | 2x expected duration | None: hanging jobs run forever | Resource leak, Forbid-policy deadlock
ttlSecondsAfterFinished | 86400 (24h) | None: jobs accumulate in etcd | etcd storage quota exceeded
backoffLimit | 2-3 for ETL | 6 (default): slow failure detection | Resource waste, alert delay
startingDeadlineSeconds | 300 | None: after an outage, more than 100 missed schedules stop the CronJob with "too many missed start times" | Missed runs never recover without manual intervention

Exam Answer vs. Production Reality


Job completions vs parallelism

📖 What the exam expects

completions: total successful pod completions required. parallelism: max pods running simultaneously. completions=5, parallelism=2 means 5 total runs with max 2 at once.
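The completions=5, parallelism=2 case from the card, as a sketch (batch-task:v1 is an illustrative image):

```yaml
# 5 successful completions required, at most 2 pods running at once.
apiVersion: batch/v1
kind: Job
metadata:
  name: five-runs
spec:
  completions: 5
  parallelism: 2
  backoffLimit: 3
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: task
          image: batch-task:v1   # hypothetical batch workload
```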


How this might come up in interviews

System design questions about batch processing pipelines and operational questions about CronJob failure modes.

Common questions:

  • What is the difference between a Job and a CronJob?
  • Explain concurrencyPolicy options and when you would use each.
  • How would you prevent a CronJob from accumulating failed jobs in etcd?
  • What happens if a CronJob misses its scheduled run?

Strong answer: Mentions TTLAfterFinished for automatic job cleanup, startingDeadlineSeconds to handle missed schedules, and completionMode: Indexed for sharded batch processing.

Red flags: Leaving concurrencyPolicy at default Allow for long-running ETL jobs, or not knowing that job history accumulates in etcd.

Related concepts

Explore topics that connect to this one.

  • StatefulSets & DaemonSets: Identity and Ubiquity
  • Kubernetes Pods: The Atomic Unit
  • Resource Requests & Limits: CPU Throttle and OOM Kill

Suggested next

Often learned after this topic.

Services & Endpoints: Stable Networking for Ephemeral Pods

