Kubernetes Job Deep Dive¶
This guide provides an in-depth look at the Kubernetes Job resource. Jobs are used to run batch or finite-duration tasks in Kubernetes, where a specified number of Pods are created to run to completion.
What is a Job in Kubernetes?¶
A Job in Kubernetes ensures that a specified number of Pods successfully terminate (complete) execution. Unlike a Deployment that runs long-lived applications, Jobs are used for short-lived, one-off tasks such as database migrations, data processing, report generation, or backups.
Jobs are part of the batch/v1 API group.
Usage:
kubectl create job NAME --image=image [--from=cronjob/name] -- [COMMAND] [args...] [options]
m@ibtisam-iq:~$ kubectl create job abc --image nginx -o yaml --dry-run=client
apiVersion: batch/v1
kind: Job
metadata:
  creationTimestamp: null
  name: abc
spec:
  template:
    metadata:
      creationTimestamp: null
    spec:
      containers:
      - image: nginx
        name: abc
        resources: {}
      restartPolicy: Never
status: {}
Job Spec Overview¶
A basic Job YAML manifest might look like this:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  completions: 1
  parallelism: 1
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
Key Fields:¶
- completions: The number of times the Job needs to complete successfully. Default is 1.
- parallelism: The maximum number of Pods the Job can run in parallel. Controls the concurrency.
- template: The Pod template that describes the work to be done. Each Pod executes this template.
- restartPolicy: Must be OnFailure or Never for Jobs. Always is not permitted.
How a Job Works¶
When a Job is created:
- The Job controller creates one or more Pods using the specified template.
- These Pods run the task to completion.
- When enough Pods complete successfully (completions), the Job is marked as Complete.
- If Pods fail, they may be retried depending on the backoff policy.
Pod Failure Handling¶
If a Pod fails (exits with non-zero), the Job may retry it depending on:
restartPolicy¶
- Never: a failed Pod is not restarted; the Job controller creates a replacement Pod instead.
- OnFailure: the failed container is restarted in place within the same Pod.
backoffLimit¶
- Limits the number of retries for failed Pods before the entire Job is marked as Failed.

spec:
  backoffLimit: 4  # default is 6
If a Pod fails more than backoffLimit times, the Job is terminated with status Failed.
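As a minimal sketch, restartPolicy and backoffLimit can be combined in one manifest. The Job name `flaky-task` and the deliberately failing command are illustrative, not from the original text:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: flaky-task           # illustrative name
spec:
  backoffLimit: 4            # give up after 4 failed attempts
  template:
    spec:
      containers:
      - name: task
        image: busybox:1.36
        # exits non-zero to simulate a failing workload
        command: ["sh", "-c", "exit 1"]
      restartPolicy: Never   # each failure creates a new Pod, counted against backoffLimit
```

With this configuration the Job ends up Failed after the retry budget is exhausted.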
Parallelism and Completions¶
Parallel Jobs¶
- Run multiple Pods at once.
- Example: transcoding multiple videos simultaneously.
spec:
  parallelism: 5
  completions: 10
This configuration means 5 Pods can run at the same time until a total of 10 successful completions are achieved.
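For context, the fragment above sits inside a full manifest. Here is a hedged sketch, where the name `transcode` and the busybox command stand in for a real workload:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: transcode            # illustrative name
spec:
  parallelism: 5             # at most 5 Pods run at once
  completions: 10            # Job is Complete after 10 successful Pods
  template:
    spec:
      containers:
      - name: worker
        image: busybox:1.36
        # placeholder for real per-item work
        command: ["sh", "-c", "echo processing one item && sleep 5"]
      restartPolicy: OnFailure
```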
Single Pod Job¶
spec:
  completions: 1
  parallelism: 1
Job Termination and Cleanup¶
Manual Deletion¶
When a Job completes, the Pods it created are usually not deleted automatically. You may want to check logs first.
kubectl delete jobs/pi
kubectl delete -f job.yaml
Deleting the Job will cascade delete its Pods.
Automatic Termination Mechanisms¶
.spec.backoffLimit¶
Stops retrying failed Pods after N attempts.
.spec.activeDeadlineSeconds¶
Sets a timeout for the whole Job. All Pods will be terminated when the Job exceeds this time.
spec:
  activeDeadlineSeconds: 100
This ensures that a Job does not run forever. Even if Pods keep retrying, the entire Job fails when the deadline is hit.
Precedence:¶
activeDeadlineSeconds > backoffLimit
Once the time limit is reached, the Job fails regardless of backoff status.
Example: Job with Timeout¶
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-timeout
spec:
  backoffLimit: 5
  activeDeadlineSeconds: 100
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
Note: The activeDeadlineSeconds should be defined in the Job spec, not just in the Pod template.
Terminal Job Conditions¶
Jobs end in one of two terminal conditions:
- Complete → Job succeeded (condition type: Complete)
- Failed → Job failed (condition type: Failed)
Reasons for Job Failure:¶
- Pod failures exceeded backoffLimit
- Job ran longer than activeDeadlineSeconds
- podFailurePolicy rules triggered a Job failure
- For Indexed Jobs: too many failed indexes
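As a hedged sketch of a podFailurePolicy, the rules below fail the Job immediately on a specific exit code while not counting Pod disruptions (evictions) against backoffLimit. The Job name and exit code 42 are illustrative; podFailurePolicy requires restartPolicy: Never:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-pod-failure-policy   # illustrative name
spec:
  backoffLimit: 6
  template:
    spec:
      containers:
      - name: main
        image: busybox:1.36
        command: ["sh", "-c", "exit 42"]   # illustrative failing command
      restartPolicy: Never               # required when using podFailurePolicy
  podFailurePolicy:
    rules:
    - action: FailJob        # fail the whole Job, skipping remaining retries
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
    - action: Ignore         # don't count disruptions against backoffLimit
      onPodConditions:
      - type: DisruptionTarget
```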
Reasons for Job Success:¶
- The number of Pods that completed successfully equals completions
- Success conditions defined in .spec.successPolicy are met
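As a hedged sketch of a success policy (a newer feature that requires an Indexed Job; the name and index values are illustrative), the Job below is declared successful as soon as index 0 completes, even if other indexes are still running:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-success-policy   # illustrative name
spec:
  completionMode: Indexed         # successPolicy applies to Indexed Jobs
  completions: 5
  parallelism: 5
  successPolicy:
    rules:
    - succeededIndexes: "0"       # Job succeeds once index 0 completes
      succeededCount: 1
  template:
    spec:
      containers:
      - name: main
        image: busybox:1.36
        command: ["sh", "-c", "echo done"]
      restartPolicy: Never
```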
Version Differences:¶
- v1.30 and earlier: Marks Job Complete/Failed as soon as finalizers are removed.
- v1.31 and later: Waits for all Pods to actually terminate before setting Complete/Failed.
You can customize this with JobManagedBy and JobPodReplacementPolicy feature gates.
Job Pod Termination¶
Once success or failure conditions are met:
- The Job controller sets FailureTarget or SuccessCriteriaMet.
- All Pods are terminated.
- Only then is the Job marked Complete or Failed.
Practical Use:¶
- If you want to save compute resources, wait until Failed before spawning a new Job.
- If you want fast retry, act on FailureTarget immediately (but be careful of resource overlap).
Automatic Job Cleanup¶
Too many completed Jobs can overload the Kubernetes API server.
CronJob-managed Jobs¶
If a CronJob manages your Jobs, it can clean up old Jobs via history limits.
TTL Controller¶
Set .spec.ttlSecondsAfterFinished to enable automatic deletion of Jobs after completion:
spec:
  ttlSecondsAfterFinished: 100
After 100 seconds, Job and Pods will be deleted. If set to 0, deletion is immediate. If unset, the Job is not auto-deleted.
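A hedged sketch of a one-off Job with automatic cleanup; the name and command are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: one-off-cleanup          # illustrative name
spec:
  ttlSecondsAfterFinished: 100   # delete the Job and its Pods 100s after it finishes
  template:
    spec:
      containers:
      - name: task
        image: busybox:1.36
        command: ["sh", "-c", "echo running once && sleep 2"]
      restartPolicy: Never
```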
Recommendation:¶
Always set this for one-off Jobs to prevent orphaned Pods from consuming cluster resources unnecessarily.
Common Job Patterns¶
1. One Job Per Work Item¶
- Simple but resource-intensive for large workloads.
- Good for independent and isolated tasks.
2. One Job for All Work Items¶
- Lower overhead
- Uses Pod parallelism or work queues
- Better for scale
3. Pod = One Work Item¶
- Each Pod picks one unit of work
- Often easier to modify code this way
4. Pod = Multiple Work Items¶
- Optimized for large batches
- Requires code support to fetch from queue/bucket
5. Collaborative Jobs via Headless Service¶
- For jobs needing Pod-to-Pod communication (e.g., distributed computing)
- Use a headless Service to let Pods discover and talk to each other

kind: Service
spec:
  clusterIP: None
This lets each Pod in the Job get a stable DNS entry via the Service.
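Putting the pattern together, here is a hedged sketch of an Indexed Job whose Pods can resolve each other through a headless Service. The names `workers` and `distributed-task` are illustrative; Indexed Jobs give each Pod a stable hostname of the form <job-name>-<index>, and setting the Pod template's subdomain to the Service name makes those hostnames resolvable:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: workers                  # illustrative name
spec:
  clusterIP: None                # headless: DNS records point at individual Pods
  selector:
    job-name: distributed-task   # job-name label is added automatically to Job Pods
---
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-task         # illustrative name
spec:
  completionMode: Indexed        # stable per-Pod hostnames <job-name>-<index>
  completions: 3
  parallelism: 3
  template:
    spec:
      subdomain: workers         # must match the headless Service name
      containers:
      - name: worker
        image: busybox:1.36
        # Pods can now reach peers at e.g. distributed-task-0.workers
        command: ["sh", "-c", "hostname && sleep 10"]
      restartPolicy: Never
```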