Kubernetes Probes: Comprehensive Guide
This document consolidates the essential concepts, configurations, and best practices for Kubernetes probes (liveness, readiness, and startup). It explains why probes are critical, how they function within the Pod lifecycle, and how to configure them effectively for resilient workloads.
Introduction to Pods and Probes
Why Probes Matter
Pods in Kubernetes are ephemeral, meaning they are temporary entities that can crash, complete, or be deleted. Kubernetes does not inherently know if an application inside a container is healthy just because the container is running. Probes address this by enabling Kubernetes to:
- Monitor container health (livenessProbe).
- Determine if a container is ready to serve traffic (readinessProbe).
- Allow extra time for slow-starting applications (startupProbe).
Without probes, a container could be in a Running state but fail to serve requests (e.g., due to deadlocks or memory leaks), leading to service disruptions.
Pod Lifecycle and Phases
Understanding the Pod lifecycle is critical for configuring probes effectively. The .status.phase field indicates a Pod's high-level state:
| Phase | Description |
|---|---|
| Pending | Pod accepted by the cluster, but one or more containers are not yet running (e.g., waiting to be scheduled or pulling images). |
| Running | Pod bound to a node and all containers created; at least one container is running, starting, or restarting. |
| Succeeded | All containers terminated successfully and won't restart. |
| Failed | All containers have terminated, and at least one exited with failure. |
| Unknown | Pod state couldn't be retrieved (e.g., node communication error). |
Note: CrashLoopBackOff is not a phase. It is the status kubectl reports (the container's Waiting reason) when a container crashes repeatedly and the kubelet restarts it with exponential backoff.
Container States
Each container within a Pod has a granular lifecycle state, visible via kubectl describe pod <pod-name>:
| State | Description |
|---|---|
| Waiting | Container not yet running (e.g., pulling images or applying secrets). |
| Running | Container is active; any postStart hook has completed. |
| Terminated | Container has exited (success or failure), with details like exit code and reason. |
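As a rough illustration of how these states surface in the API, the fragment below sketches what the status of a Pod with one crash-looping container might look like (for example, in the output of kubectl get pod <pod-name> -o yaml). The container name, restart count, and exit code are made-up values; only the field names follow the Pod API.

```yaml
# Illustrative status fragment; all values are hypothetical.
status:
  phase: Running                     # the Pod phase typically stays Running while the container restarts
  containerStatuses:
    - name: app-container            # hypothetical container name
      ready: false
      restartCount: 5
      state:
        waiting:
          reason: CrashLoopBackOff   # a Waiting reason, not a Pod phase
      lastState:
        terminated:                  # details of the most recent exit
          exitCode: 1
          reason: Error
```

The current state explains why the container is not running now, while lastState records how the previous attempt ended, which is usually the more useful clue when debugging.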
Pod Conditions
Pod conditions are checkpoints (each with a status of True, False, or Unknown) that the kubelet reports to describe Pod health:
| Condition | Description |
|---|---|
| PodScheduled | Pod assigned to a node. |
| Initialized | All init containers completed successfully. |
| ContainersReady | All containers in the Pod are ready. |
| Ready | Pod is able to serve requests and should be added to the endpoints of matching Services. |
| PodReadyToStartContainers | Networking and sandbox setup complete (beta feature). |
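Container Restart Policies
The restartPolicy in a Pod spec dictates container restart behavior:
| Policy | Behavior |
|---|---|
| Always | Always restart the container after it exits (default). |
| OnFailure | Restart only if the container exits with a non-zero code. |
| Never | Never restart. |
Note: The Pod-level restartPolicy applies to app containers and regular init containers. Sidecar containers (entries in initContainers that set their own restartPolicy: Always) ignore it and are always restarted.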
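A minimal sketch of how the Pod-level policy and a sidecar can coexist, assuming a cluster version where sidecar containers are available. The Pod name, container names, and images are placeholders, not taken from any real workload.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-with-sidecar            # hypothetical name
spec:
  restartPolicy: OnFailure            # Pod-level policy for app and regular init containers
  initContainers:
    - name: log-shipper               # hypothetical sidecar
      image: registry.example.com/log-shipper:1.0   # placeholder image
      restartPolicy: Always           # container-level field that marks this entry as a sidecar
  containers:
    - name: job
      image: registry.example.com/job:1.0           # placeholder image
```

Here the job container is restarted only when it exits with a non-zero code, while the sidecar is restarted whenever it exits, regardless of the Pod-level setting.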
CrashLoopBackOff
CrashLoopBackOff occurs when a container repeatedly crashes, triggering exponential backoff restarts. Common causes include:
- Application bugs or misconfigurations.
- Insufficient CPU/memory.
- Failing probes.
- Missing secrets or configs.
Understanding Probes
What is a Probe?
A probe is a periodic diagnostic performed by the kubelet on a container. Probes allow Kubernetes to:
- Restart unhealthy containers (livenessProbe).
- Prevent traffic to unready containers (readinessProbe).
- Delay other probes for slow-starting apps (startupProbe).
Probes make containers observable, enabling proactive management before users notice issues.
Probe Check Mechanisms
Probes use one of four mechanisms to check container health:
- exec: Runs a command inside the container. Success if the exit code is 0.

      exec:
        command:
          - cat
          - /tmp/healthy

  - Use Case: Check file existence or process-specific health.
  - Warning: Spawns a process on every check; avoid in high-density clusters.
- httpGet: Sends an HTTP GET request. Success if the status code is in the range 200-399.

      httpGet:
        path: /healthz
        port: 8080

  - Use Case: Web services with /health or /ping endpoints.
- tcpSocket: Checks if a TCP port is open. Success if a connection is established.

      tcpSocket:
        port: 3306

  - Use Case: Databases or services without HTTP (e.g., MySQL, Redis).
- grpc: Performs a gRPC health check (the Check method of the gRPC Health Checking Protocol). Success if the returned status is SERVING.

      grpc:
        port: 50051

  - Use Case: gRPC-based microservices with health servers.
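The httpGet mechanism accepts a few more fields than the short snippet shows. The sketch below is one possible variant that refers to the container port by name and sends a custom request header; the Pod name, image, port name, and header are all hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: http-probe-demo               # hypothetical name
spec:
  containers:
    - name: web
      image: registry.example.com/web:1.0   # placeholder image
      ports:
        - name: http
          containerPort: 8080
      readinessProbe:
        httpGet:
          path: /healthz
          port: http                  # refers to the named containerPort above
          scheme: HTTP                # HTTP or HTTPS
          httpHeaders:
            - name: X-Probe           # hypothetical header, e.g. to tell probe traffic apart in logs
              value: readiness
```

Referring to the port by name means the probe keeps working if the port number changes later.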
Probe Outcomes
Each probe results in one of three outcomes:
- Success: Container passed the check.
- Failure: Container failed the check; the action depends on the probe type.
- Unknown: The diagnostic itself failed to run; no action is taken and the kubelet checks again.
Types of Probes
Kubernetes supports three probe types, each with a distinct role:
- Liveness Probe
  - Purpose: Detects if a container is alive or stuck (e.g., deadlocks).
  - Action: If it fails, the container is killed and restarted per restartPolicy.
  - Use Case: Restart broken apps or resolve deadlocks.
  - Default: Assumes success if not defined.
- Readiness Probe
  - Purpose: Determines if a container is ready to serve traffic.
  - Action: If it fails, the Pod's IP is removed from the endpoints of all matching Services; the container is not restarted.
  - Use Case: Wait for database connections or during maintenance.
  - Default: Assumes success if not defined. When defined, the state is Failure until the initial delay has elapsed.
- Startup Probe
  - Purpose: Ensures an application has started before enabling liveness/readiness probes.
  - Action: Liveness and readiness probes are disabled until it succeeds, preventing premature restarts. If it fails, the container is killed and restarted per restartPolicy.
  - Use Case: Apps with long startup times (e.g., migrations, warmups).
  - Default: Assumes success if not defined.
Configuring Probes
Let's dive into the key configuration fields that fine-tune how probes behave:
initialDelaySeconds
Time (in seconds) to wait after the container starts before running the probe.
Default: 0
Use Case: Your app takes 10s to boot up? Set this to 10.
periodSeconds
How often (in seconds) to run the probe.
Default: 10
Use Case: Lower for rapid detection (5s), higher to reduce CPU/network traffic.
timeoutSeconds
If a probe takes more than this time (in seconds), it's considered a failure.
Default: 1
Use Case: Slow network or backend? Consider bumping this to 3-5.
failureThreshold
Number of consecutive failures before the probe is considered failed.
Default: 3
Use Case: Avoid false alarms due to temporary blips.
successThreshold
Number of consecutive successes required to mark a previously failed probe as passed. Must be 1 for liveness and startup probes.
Default: 1
Use Case: Ensure your service stabilizes before re-adding it to the load balancer.
Key Configuration Fields
Fine-tune probe behavior using these fields:
| Field | Description | Default |
|---|---|---|
| initialDelaySeconds | Delay before probe starts (seconds). | 0 |
| periodSeconds | Frequency of probe execution (seconds). | 10 |
| timeoutSeconds | Time before probe is considered failed (seconds). | 1 |
| failureThreshold | Consecutive failures before action is taken. | 3 |
| successThreshold | Consecutive successes to mark probe as passed. | 1 |
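To make the interaction of these fields concrete, here is a small sketch with the timing worked out in comments. The endpoint paths and port are placeholder values, and the timings are approximations, since the kubelet's exact scheduling of checks can vary slightly.

```yaml
# Fragment of a container spec; /live, /ready, and port 8080 are placeholders.
livenessProbe:
  httpGet:
    path: /live
    port: 8080
  initialDelaySeconds: 15   # no liveness checks during the first 15s
  periodSeconds: 10         # then one check every 10s
  timeoutSeconds: 2         # a check with no response within 2s counts as a failure
  failureThreshold: 3       # restart after 3 consecutive failures
  # A hung app is restarted after roughly 3 x 10 = 30s.
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 3       # removed from Service endpoints after roughly 3 x 5 = 15s
  successThreshold: 2       # re-added only after 2 consecutive successes (about 10s)
```

A successThreshold above 1 is only meaningful on the readiness probe here, because liveness and startup probes require the value 1.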
Example Configuration

    startupProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 0
      periodSeconds: 5
      failureThreshold: 12

This allows up to 60 seconds (12 × 5) for the app to start responding to /health.
How Fields Work Together
- initialDelaySeconds prevents premature probe failures during startup.
- periodSeconds balances detection speed and resource usage.
- timeoutSeconds accommodates slow networks or backends.
- failureThreshold avoids false positives from temporary issues.
- successThreshold ensures stability before re-adding to load balancers.
When to Use Which Probe?
| Probe Type | Purpose | Kubernetes Action |
|---|---|---|
| livenessProbe | Detect if container is dead or stuck | Kill and restart container |
| readinessProbe | Control traffic during boot/maintenance | Remove from Service endpoints |
| startupProbe | Delay other probes until app starts | Prevent premature restarts |
Probe Check Mechanisms: When to Use What
| Type | Use Case | Example |
|---|---|---|
| HTTP GET | App exposes /health, /ready, etc. | Web servers, APIs |
| TCP Socket | Port-based readiness (e.g., DB, services with no HTTP) | Redis, PostgreSQL |
| Exec | Fine-grained in-container check using shell commands | Check file existence |
| gRPC | App runs a gRPC health server | gRPC-based microservices |
Best Practices for Probes
General Guidelines
- Always Use Readiness Probes for Delayed-Start Apps
  - Prevent traffic to apps still initializing (e.g., awaiting database connections or cache warm-up).
- Use Liveness Probes for Long-Running Containers
  - Automatically restart containers in deadlock or hung states, but only if the app might not crash naturally.
- Separate Liveness and Readiness Probes
  - Use distinct endpoints (e.g., /live vs. /ready) to avoid confusing signals.
  - Using the same probe for both may lead to unnecessary restarts.
- Design Lightweight Probes
  - Use simple endpoints like /healthz or /ready that return 200 OK.
  - Avoid external dependencies (e.g., database queries) to minimize latency and failure points.
Configuration Best Practices
- Set initialDelaySeconds for Slow-Starting Apps
  - Match the delay to the app's startup time (e.g., 80s for a 70s boot).
- Tune periodSeconds, timeoutSeconds, and failureThreshold
  - Use timeoutSeconds: 2 for HTTP probes to handle network variability.
  - Lower periodSeconds (e.g., 5s) for faster detection, or increase it (e.g., 15s) to reduce resource usage.
  - Set failureThreshold to avoid false positives (e.g., 5 for flaky networks).
- Prefer httpGet or tcpSocket Over exec
  - exec spawns processes, which can strain high-density clusters.
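For slow starters, one alternative to a large initialDelaySeconds is a startupProbe with a generous failure budget. The sketch below sizes that budget for the 70s boot mentioned above; the paths, port, and exact numbers are illustrative assumptions rather than recommended values.

```yaml
# Fragment of a container spec; /health, /live, and port 8080 are placeholders.
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 9       # startup budget: 9 x 10 = 90s, covering a 70s boot with some margin
livenessProbe:
  httpGet:
    path: /live
    port: 8080
  periodSeconds: 10         # runs only after the startup probe has succeeded
  timeoutSeconds: 2
  failureThreshold: 3
```

The trade-off is that a fixed delay always waits the full time, whereas the startup probe hands over to the liveness probe as soon as the app responds, so a fast start is not penalized.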
Probe Hygiene Checklist
- Clearly separate readiness and liveness roles.
- Match delays to application startup times.
- Define /live and /ready endpoints in the application.
- Avoid slow or expensive probe logic.
- Test failure scenarios (e.g., /crash or /freeze) in staging.
Pro Tips
- Readiness failures only remove the pod from traffic, not restart it.
- Liveness failures trigger restarts, so use cautiously to avoid unnecessary cycling.
- Test probe configurations in staging to simulate failures and ensure correct behavior.
Example: Complete Probe Configuration

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-pod
    spec:
      containers:
        - name: app-container
          image: my-app:latest
          livenessProbe:
            httpGet:
              path: /live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
            timeoutSeconds: 2
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 2
            failureThreshold: 3
          startupProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 5
            failureThreshold: 12

- Startup Probe: Allows up to 60s (12 × 5s) for the app to start; liveness and readiness checks begin only after it succeeds.
- Liveness Probe: Checks /live every 10s, restarting the container after 3 consecutive failures.
- Readiness Probe: Checks /ready every 5s, removing the Pod from Service endpoints after 3 consecutive failures.
Summary
Probes are Kubernetes' mechanism for ensuring application health and availability. By understanding the Pod lifecycle, container states, and probe configurations, you can design resilient workloads that:
- Automatically recover from failures (livenessProbe).
- Only receive traffic when ready (readinessProbe).
- Avoid premature restarts during startup (startupProbe).
Following best practices, such as using lightweight probes, tuning thresholds, and testing failure scenarios, ensures your applications remain stable and performant in production.