Jobs Handling
Advanced Job Handling: Failures, Policies, and Success Criteria¶
Handling Pod and Container Failures¶
In Kubernetes, containers and Pods can fail for multiple reasons, such as:
- Container exits with a non-zero code (e.g., software bug)
- Container is OOMKilled (e.g., memory overuse)
- Pod is evicted (e.g., node reboot, upgrade, or preemption)
Depending on the .spec.template.spec.restartPolicy, the behavior differs:
OnFailure: Failed container restarts on the same Pod.Never: No restart; Pod is terminated, and Job controller may spawn a new one.
Your application must be designed to handle re-execution, possibly in a different Pod, especially for temporary files, locks, partial outputs, etc.
Backoff Policy¶
To limit retries, configure .spec.backoffLimit (default: 6). Failed Pods are retried with exponential back-off:
10s β 20s β 40s β¦ capped at 6 minutes
A Job fails when either: 1. Number of Pods in Failed phase >= backoffLimit 2. Pod restarts (with OnFailure) >= backoffLimit
π οΈ Debug Tip: Use
restartPolicy: Neverand enable persistent logging for better visibility into Job failures.
Backoff Limit Per Index (Indexed Jobs)¶
Indexed Jobs can track failures per index using:
.spec.backoffLimitPerIndex: 1
.spec.maxFailedIndexes: 5
Example YAML:
apiVersion: batch/v1
kind: Job
metadata:
name: job-backoff-limit-per-index-example
spec:
completions: 10
parallelism: 3
completionMode: Indexed
backoffLimitPerIndex: 1
maxFailedIndexes: 5
template:
spec:
restartPolicy: Never
containers:
- name: example
image: python
command:
- python3
- -c
- |
import os, sys
print("Hello world")
if int(os.environ.get("JOB_COMPLETION_INDEX")) % 2 == 0:
sys.exit(1)
In this setup: - Even indexes fail (0,2,4,6,8) - Max failed indexes is not exceeded - So Job completes with a Failed condition
Result:
status:
completedIndexes: 1,3,5,7,9
failedIndexes: 0,2,4,6,8
succeeded: 5
failed: 10
conditions:
- type: FailureTarget
reason: FailedIndexes
message: Job has failed indexes
- type: Failed
reason: FailedIndexes
message: Job has failed indexes
β Use
.spec.podFailurePolicy.rules[*].action: FailIndexto avoid retries for permanently failing indexes.
Pod Failure Policy¶
You can finely control Pod failure behavior using:
.spec.podFailurePolicy.rules
It allows decisions based on: - Exit codes - Pod conditions (e.g., DisruptionTarget)
Available actions: - FailJob: Immediately fail the Job - Ignore: Donβt count towards backoff limit - Count: Default failure behavior - FailIndex: For indexed jobs only
Example YAML:
apiVersion: batch/v1
kind: Job
metadata:
name: job-pod-failure-policy-example
spec:
completions: 12
parallelism: 3
backoffLimit: 6
template:
spec:
restartPolicy: Never
containers:
- name: main
image: docker.io/library/bash:5
command: ["bash"]
args:
- -c
- echo "Hello world!" && sleep 5 && exit 42
podFailurePolicy:
rules:
- action: FailJob
onExitCodes:
containerName: main
operator: In
values: [42]
- action: Ignore
onPodConditions:
- type: DisruptionTarget
π¨ Only Pods in
Failedphase are evaluated by the failure policy. Terminating pods are not considered failed until they reach a terminal phase.
Success Policy (Indexed Jobs)¶
The .spec.successPolicy defines custom success criteria for Indexed Jobs.
You can specify: - succeededIndexes: Indexes to monitor - succeededCount: How many of those indexes must succeed
Use Cases: - Partial success (e.g., simulations) - Leader-worker models (e.g., MPI, PyTorch)
Example YAML:
apiVersion: batch/v1
kind: Job
metadata:
name: job-success
spec:
parallelism: 10
completions: 10
completionMode: Indexed
successPolicy:
rules:
- succeededIndexes: 0,2-3
succeededCount: 1
template:
spec:
restartPolicy: Never
containers:
- name: main
image: busybox
command: ["sh", "-c", "echo done"]
This Job will succeed once one Pod from the set {0, 2, 3} completes successfully.
π§ Indexed Job Feature Requirements: -
completionMode: Indexedis required for backoff per index and success policies. -restartPolicy: Neveris required for most advanced failure handling features.
Recap Table¶
| Feature | Field | Use Case |
|---|---|---|
| Retry limits | backoffLimit | Retry on any pod failure, exponential delay |
| Retry per index | backoffLimitPerIndex, maxFailedIndexes | Indexed jobs - retry control for each index |
| Failure control | podFailurePolicy | Exit code or condition-based failure behavior |
| Partial success | successPolicy | Custom rule-based Job completion |