July 5, 2026 · 9 min read · kubernetes
Liveness and Readiness Probe Failures: Why Kubernetes Keeps Restarting a Healthy Pod
Your pod is restarting every couple of minutes. kubectl get pods shows a climbing RESTARTS count, and your first instinct is the usual one: the app is crashing. So you read the logs, and the logs look fine. The app starts, serves a request or two, then dies, with no panic, no stack trace, no error. That's the tell. A pod that restarts cleanly on a schedule isn't crashing. Kubernetes is killing it, and a liveness probe is almost always holding the knife.
Three probes, three jobs
Kubernetes has three health checks, and confusing them is the root of most probe incidents:
- Liveness probe: "is this container still alive?" If it fails, the kubelet restarts the container. This is the one that creates restart loops.
- Readiness probe: "is this container ready for traffic?" If it fails, the pod shows 0/1 under READY and is pulled out of the Service endpoints. No restart; it just stops receiving traffic, silently.
- Startup probe: "has this slow app finished booting yet?" While it runs, the liveness and readiness probes are held off. This is the one nobody configures, and it's the fix for half of all probe restart loops.
The two failure modes look nothing alike from the outside. Liveness failures restart your pod and look exactly like CrashLoopBackOff. Readiness failures leave the pod Running but 0/1, serving nothing, while your ingress returns 503s and the pod logs stay completely clean. Knowing which one you're looking at is half the fight.
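As a rough sketch, the three probes might sit together on one container like this. The paths, port, and numbers are illustrative assumptions, not values from a real deployment, and the fragment belongs under a single entry in spec.containers.

```yaml
# Illustrative fragment for one container; paths, port, and numbers are assumptions.
startupProbe:            # holds off the other two probes until the app has booted
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # 30 x 10s boot budget
livenessProbe:           # failure => kubelet restarts the container
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
readinessProbe:          # failure => pod removed from Service endpoints, no restart
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
```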
Always start with describe
kubectl describe pod <pod-name>
Scroll to Events. A probe failure names itself:
Liveness probe failed: HTTP probe failed with statuscode: 500
Liveness probe failed: Get "http://10.1.2.3:8080/healthz": dial tcp 10.1.2.3:8080: connect: connection refused
Readiness probe failed: Get "http://10.1.2.3:8080/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
The first word (Liveness vs Readiness) tells you the failure mode. The rest of the line tells you the cause. Here are the five you'll actually hit.
The five real causes
1. The probe points at the wrong port or path
connection refused or statuscode: 404. The probe is checking /healthz on 8080, but the app serves health on /health, or listens on 8000. The container is perfectly healthy; the probe is knocking on the wrong door, and the kubelet restarts a working app on a timer.
Fix: confirm what the app actually serves, then match the probe to it.
kubectl exec <pod> -- wget -qO- localhost:8080/healthz
If that succeeds and the probe still fails, your httpGet.port or httpGet.path in the manifest is wrong. Check it:
kubectl get pod <pod> -o jsonpath='{.spec.containers[0].livenessProbe}'
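One way to make this mismatch harder to reintroduce is to name the container port and reference that name from the probe, since httpGet.port accepts a port name in place of a number. A minimal sketch, assuming the app really serves /health on 8000; the container name, image, and port name are hypothetical:

```yaml
containers:
  - name: api                      # hypothetical container name
    image: example.com/api:1.0     # hypothetical image
    ports:
      - name: http
        containerPort: 8000        # assumption: the port the app listens on
    livenessProbe:
      httpGet:
        path: /health              # assumption: the path the app serves
        port: http                 # resolves to the containerPort named "http"
```

If the listening port changes later, only containerPort needs updating and the probe follows it.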
2. initialDelaySeconds is too short: the CrashLoopBackOff impostor
This is the most misdiagnosed probe failure there is. Your app takes 40 seconds to boot: JVM warmup, migrations, loading a model into memory. Your liveness probe has initialDelaySeconds: 10. So ten seconds in, the probe fires against an app that isn't listening yet and fails. With the default periodSeconds: 10 and failureThreshold: 3, the third consecutive failure lands around the 30-second mark, and the kubelet restarts the container before it ever finishes booting. The restart starts the 40-second boot over. The probe fires at 10 seconds again. Forever.
You will see RESTARTS climbing and conclude the app is crash-looping. It isn't. It never finished starting.
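To make the timing concrete, here is a sketch of a liveness probe that would produce this loop. The unset fields are written out at their Kubernetes defaults, and the path and port are assumptions. With those defaults the kill lands after the third consecutive failure rather than the first, but that is still well short of a 40-second boot.

```yaml
# Sketch of the looping configuration; path and port are assumptions.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10   # first probe roughly 10s after the container starts
  periodSeconds: 10         # Kubernetes default, written out for clarity
  failureThreshold: 3       # Kubernetes default: restart after 3 consecutive failures
# Approximate timeline for an app that needs 40s before it listens:
#   ~10s, ~20s, ~30s  probes fail with "connection refused"
#   ~30s              third consecutive failure, kubelet restarts the container
#   40s               never reached; the boot starts over
```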
Fix: stop using initialDelaySeconds for slow boots and use a startup probe instead. It gives the app a generous window to come up, and only after it passes do the liveness and readiness probes begin:
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10   # 30 x 10s allows up to 5 minutes to boot
3. timeoutSeconds is too aggressive: failure under load
Default timeoutSeconds is 1. Under a traffic spike, an app that normally answers /healthz in 50ms might take 1.5 seconds because the event loop is busy. The probe times out, and once failureThreshold consecutive probes have failed (3 by default) the kubelet restarts the container. It does this to your busiest pods first, at the exact moment you need them. A latency blip becomes a self-inflicted outage as Kubernetes restarts healthy pods one after another.
The Event reads:
Liveness probe failed: context deadline exceeded
Fix: raise timeoutSeconds (3 to 5 is sane) and keep failureThreshold at the default of 3 or higher, so one slow response doesn't equal a kill. A liveness probe should only fire on a genuinely wedged process, not a busy one.
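A minimal sketch of a more forgiving liveness probe along these lines; the path and port are assumptions, and the exact numbers should be tuned against how slow the endpoint actually gets under load:

```yaml
livenessProbe:
  httpGet:
    path: /healthz         # assumption
    port: 8080             # assumption
  periodSeconds: 10
  timeoutSeconds: 5        # default is 1; lets a busy app answer slowly
  failureThreshold: 3      # restart only after 3 consecutive failed probes
```

With these values a single slow response is absorbed, and a restart requires the endpoint to miss the 5-second deadline on three probes in a row.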
4. Readiness failing: the silent 503 machine
No restarts. kubectl get pods shows STATUS Running, but READY is 0/1. The pod is alive but pulled from the Service endpoints, so no traffic reaches it. If every replica is in this state, the Service has nothing to route to, and your ingress serves 503s with zero pods logging an error. This is the quiet one: nothing crashes, nothing alerts, and you only notice because users are seeing errors.
kubectl get pods
NAME           READY   STATUS    RESTARTS   AGE
api-7d9f-abc   0/1     Running   0          4m
STATUS: Running with READY: 0/1 and RESTARTS: 0 is the signature. Read the readiness Event to see why it never went ready. It is usually the same wrong-port/wrong-path problem as cause #1, but on the readiness endpoint.
5. The probe checks a dependency it shouldn't
Someone wired the liveness probe to a /health endpoint that pings the database. The database stalls long enough to fail failureThreshold probes in a row (with failureThreshold: 1, a two-second blip is enough). Every liveness probe across every pod fails at once, and Kubernetes restarts your entire deployment simultaneously, turning a momentary DB hiccup into a full cold-start outage with an empty connection pool.
Fix: liveness should check only whether this process is wedged, never a downstream dependency. Dependency health belongs in the readiness probe (so a pod stops taking traffic) or in application retry logic, not in the check that triggers a restart.
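One possible shape for that split, as a sketch only. The /livez and /ready paths are hypothetical endpoints the app would have to implement: the first answers as long as the process can serve a request, the second may also check the database connection.

```yaml
livenessProbe:
  httpGet:
    path: /livez       # hypothetical: process-only check, no downstream calls
    port: 8080         # assumption
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready       # hypothetical: may include the database check
    port: 8080         # assumption
  failureThreshold: 3
```

With this layout a database blip can take pods out of the Service for a moment, but it never triggers a restart.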
A fast triage order
- kubectl describe pod <pod> → read the Event. Liveness = restarts; Readiness = removed from traffic.
- Restart loop but clean logs? It's a probe, not a crash. Suspect cause #2 (too-short delay) or #3 (too-tight timeout).
- connection refused / 404? Wrong port or path (#1). Verify with kubectl exec ... wget localhost:<port><path>.
- STATUS Running but READY 0/1, no restarts? Readiness failure (#4); you're 503-ing silently.
- Whole deployment restarted at once? A probe is checking a shared dependency (#5).
The fastest way to tell a probe failure from a real crash: kubectl logs <pod> --previous. If the previous container's logs end with no error, the app didn't crash; Kubernetes killed it, and the Event names the probe that did.
Why this still eats an hour
A probe restart loop is the single most convincing CrashLoopBackOff impersonator in Kubernetes. The pod is restarting, the status says CrashLoopBackOff, and every instinct says "debug the crash." So you read logs that show a healthy startup, attach a debugger to an app that has nothing wrong with it, and burn an hour before someone thinks to check whether the liveness probe is killing a process that simply hadn't finished booting. The answer was in the describe Event the whole time, one line, Liveness probe failed, that nobody read because the word "crash" sent them the wrong way.
This is exactly the gap Tracegrid closes. When a pod starts restarting, Tracegrid reads the Event and tells you whether it's a genuine crash or a probe killing a healthy container, and if it's the probe, which one, and whether you need a startup probe, a longer timeout, or a corrected port. It posts that to Slack with the fix, so you never lose an hour debugging a crash that isn't one.
It installs in 60 seconds, watches Kubernetes, Linux, Docker, ECS, and Azure, and there's a free tier with AI explanations included. If you've ever attached a debugger to a perfectly healthy app at 3am, that's the hour it's built to give back.
curl -sSL https://tracegrid.app/install.sh | bash, or tracegrid.app.
Written by Pradip, founder of Tracegrid, building AI infrastructure intelligence so small teams get senior-SRE answers at 3am.