June 5, 2026 · 9 min read · kubernetes
OOMKilled: Why Kubernetes Is Killing Your Pods and How to Stop It
Your pod restarted. The logs are empty. kubectl describe says Reason: OOMKilled and Exit Code: 137. Welcome to one of the most frustrating incidents in Kubernetes — the one where the evidence is destroyed at the moment of the crash.
What OOMKilled means
OOMKilled means the Linux Out-Of-Memory killer terminated your container because it used more memory than it was allowed. Exit code 137 is the giveaway: 128 + 9, where 9 is SIGKILL. The process did not crash on its own — the kernel executed it.
This happens at two levels in Kubernetes:
- Container limit exceeded. The container hit the resources.limits.memory you set. The cgroup enforcing that limit triggers the OOM killer against processes inside the container.
- Node memory exhausted. The node itself ran out of memory and the kernel killed the highest-scoring process to survive. This often hits pods that set requests too low and got scheduled onto an already-tight node.
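The exit-code arithmetic above (128 + 9) is easy to check by hand. Here is a minimal sketch in plain shell; the function name signal_from_exit_code is made up for this example and is not part of any tool.

```shell
# Decode an exit code into the signal that ended the process.
# By convention, "killed by signal N" is reported as 128 + N.
signal_from_exit_code() {
  local code="$1"
  if [ "$code" -gt 128 ]; then
    echo $((code - 128))
  else
    echo 0   # the process exited on its own; no signal involved
  fi
}

signal_from_exit_code 137   # prints 9, i.e. SIGKILL
```

An exit code of 143 decodes the same way to 15 (SIGTERM), which is what a normal pod shutdown looks like.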
How the Linux OOM killer works
The kernel tracks an oom_score for every process, based mainly on how much memory it uses and adjusted by oom_score_adj. When it cannot reclaim enough memory to satisfy an allocation, it kills the process with the highest score, which usually frees the most memory. Inside a memory cgroup (which is how Kubernetes limits work), the killer is scoped to that cgroup, so your one greedy container dies without taking the node down with it.
That is by design and it is good behaviour. The problem is purely that it is silent.
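You can look at these scores yourself on any Linux host or inside a container. The sketch below reads the two /proc files the kernel exposes; the helper name oom_info is an assumption for this example, and it falls back gracefully where /proc is not available.

```shell
# Print the OOM score and adjustment for a process (defaults to this shell).
oom_info() {
  local pid="${1:-$$}"
  local dir="/proc/$pid"
  if [ -r "$dir/oom_score" ] && [ -r "$dir/oom_score_adj" ]; then
    echo "pid=$pid oom_score=$(cat "$dir/oom_score") oom_score_adj=$(cat "$dir/oom_score_adj")"
  else
    echo "pid=$pid oom_score=unavailable"   # not Linux, or the process is gone
  fi
}

oom_info
```

A higher oom_score means the process is a more likely victim; a negative oom_score_adj makes it less likely.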
Why the pod has no logs
When the OOM killer fires, it sends SIGKILL — not SIGTERM. There is no grace period, no shutdown hook, no final flush. The process is stopped instantly. Anything buffered in memory, including the last few log lines, is gone. That is why OOMKilled pods so often show clean-looking logs that just stop mid-sentence.
This is the single biggest reason OOM incidents take so long to diagnose: the most natural tool — the logs — is the one tool the failure mode erases.
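The difference between the two signals is easy to reproduce locally. This is a small sketch, not a Kubernetes tool: the helper run_and_signal is hypothetical, and the "flushed" message stands in for whatever your shutdown hook would do.

```shell
# Start a child that installs a TERM handler (standing in for a shutdown
# hook), send it the given signal, and report how it ended.
run_and_signal() {
  local sig="$1" pid rc=0
  sh -c 'trap "echo flushed; exit 0" TERM; while :; do sleep 0.2; done' &
  pid=$!
  sleep 1                     # give the child time to install its handler
  kill -s "$sig" "$pid"
  wait "$pid" || rc=$?        # 128 + N if the child died from signal N
  echo "exit=$rc"
}

# run_and_signal TERM   -> the handler runs, then the child exits cleanly
# run_and_signal KILL   -> no handler output; the child is simply gone
```

With TERM the handler gets its chance to flush; with KILL there is nothing the process can do, which is exactly the situation an OOMKilled container is in.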
How to find OOMKilled pods
Check the pod's last state:
kubectl describe pod <pod> | grep -iA3 'last state'
You are looking for Reason: OOMKilled. To check for recent OOM kill events (not every cluster emits these, so an empty result does not rule out an OOM):
kubectl get events --field-selector reason=OOMKilling
And to see the memory limit the pod was actually running under:
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].resources}'
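If you want to sweep every pod at once rather than describing them one by one, something like the following sketch can help. It assumes kubectl is installed and pointed at your cluster; the function names are made up for this example, and the filter is kept separate so you can try it on sample text without a cluster.

```shell
# Keep only "namespace<TAB>pod<TAB>reasons" rows whose reasons include OOMKilled.
filter_oomkilled() {
  awk -F '\t' '$3 ~ /OOMKilled/ { print $1 "/" $2 }'
}

# Print one row per pod with the last termination reason of its containers,
# then filter. Needs kubectl and cluster access, so it is only defined here.
list_oomkilled_pods() {
  kubectl get pods --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
    | filter_oomkilled
}
```

Run list_oomkilled_pods to get a namespace/pod line for every pod whose last container termination was an OOM kill.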
How to set the right memory limits
The fix is rarely "make the number bigger and hope." Do this instead:
- Measure real usage. Watch the pod under normal and peak load (kubectl top pod, or your metrics). Note the steady-state and the peak.
- Set the request to steady-state. This tells the scheduler how much the pod actually needs, so it lands on a node with room.
- Set the limit to peak plus ~20% headroom. Enough to absorb spikes without letting a leak run unbounded.
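The three steps above reduce to simple arithmetic. Here is a rough sketch that turns a measured steady-state and peak (in Mi) into a resources stanza; the helper name suggest_memory and the example numbers are assumptions, and the 20% default is just the headroom suggested above.

```shell
# Usage: suggest_memory <steady_mi> <peak_mi> [headroom_pct]
# Request = steady-state; limit = peak plus headroom, rounded up.
suggest_memory() {
  local steady_mi="$1" peak_mi="$2" headroom_pct="${3:-20}"
  local limit_mi=$(( (peak_mi * (100 + headroom_pct) + 99) / 100 ))
  printf 'resources:\n  requests:\n    memory: "%sMi"\n  limits:\n    memory: "%sMi"\n' \
    "$steady_mi" "$limit_mi"
}

suggest_memory 300 400   # request 300Mi, limit 480Mi
```

Treat the output as a starting point and re-measure after deploying; a leak will eventually hit any limit.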
Requests vs limits — why both matter
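requests is what the scheduler reserves; limits is the hard ceiling the kernel enforces. If you set a request far below what the pod really uses, the scheduler may pack it onto a crowded node and you get node-level OOM. If you set only a request, a leaking app can consume the whole node. If you set only a limit, Kubernetes by default copies it into the request, which is safe but reserves the full limit on the node. Set both deliberately. Many teams set requests == limits to get the "Guaranteed" QoS class, which makes the pod among the last to be evicted under node pressure. Note that Guaranteed requires requests == limits for CPU as well as memory, in every container.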
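To make the QoS rules concrete, here is a simplified sketch for a single-container pod. The function qos_class is hypothetical and compares quantities as plain strings, so "500m" and "0.5" would not count as equal here even though Kubernetes treats them as the same amount.

```shell
# Usage: qos_class <cpu_request> <cpu_limit> <mem_request> <mem_limit>
# Pass "" for anything that is not set.
qos_class() {
  local cpu_req="$1" cpu_lim="$2" mem_req="$3" mem_lim="$4"
  # A missing request defaults to the limit when a limit is set.
  if [ -z "$cpu_req" ]; then cpu_req="$cpu_lim"; fi
  if [ -z "$mem_req" ]; then mem_req="$mem_lim"; fi
  if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
    echo BestEffort
  elif [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] \
    && [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    echo Guaranteed
  else
    echo Burstable
  fi
}

qos_class 500m 500m 256Mi 256Mi   # Guaranteed
qos_class ""   ""   256Mi 256Mi   # Burstable: memory matches, but no CPU set
```

On a live pod you do not need to guess: kubectl get pod <pod> -o jsonpath='{.status.qosClass}' prints the class Kubernetes actually assigned.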
Language runtimes and container limits
Two runtimes catch people constantly:
- JVM: older JVMs do not see the cgroup limit and size the heap from the host's total RAM, then get OOMKilled. Use a modern JDK with -XX:+UseContainerSupport (on by default since JDK 10, backported to 8u191) or set -XX:MaxRAMPercentage so the heap stays inside the container limit.
- Node.js: the V8 old-space defaults can exceed a small container. Set --max-old-space-size to roughly 75% of the container limit (e.g. --max-old-space-size=384 for a 512Mi limit) so V8 garbage-collects before the cgroup kills it.
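The 75% rule of thumb for Node.js is one line of arithmetic. A minimal sketch follows; old_space_mb is a made-up helper, and server.js and app.jar in the comments are placeholder names.

```shell
# Usage: old_space_mb <container_limit_mi> [percent]
# Returns the value to pass to --max-old-space-size (in MB).
old_space_mb() {
  local limit_mi="$1" pct="${2:-75}"
  echo $(( limit_mi * pct / 100 ))
}

old_space_mb 512   # prints 384

# Example start commands (placeholders, adjust to your app):
#   node --max-old-space-size="$(old_space_mb 512)" server.js
#   java -XX:MaxRAMPercentage=75.0 -jar app.jar
```

The 75.0 for the JVM is an assumption carried over from the Node.js figure, not a recommendation from the JDK; the heap is only part of a JVM's footprint, so leave room for metaspace, threads and native buffers.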
How to prevent OOMKilled
- Always set memory requests and limits, informed by real measurement.
- Pin runtime heap sizes (JVM, Node) to the container limit.
- Alert on a sustained downward trend in available memory — a leak is visible for hours before it kills.
- Load-test before launch so you discover the peak in staging, not production.
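The trend alert in the list above can be prototyped before you wire it into a real monitoring system. The sketch below fits a straight line through "seconds used_Mi" samples and estimates how long until usage reaches the limit (rising usage is the same signal as falling available memory). The function name, the input format and the sample numbers are all assumptions for illustration.

```shell
# Usage: seconds_until_limit <limit_mi> < samples
# Each input line: <time_in_seconds> <memory_used_mi>
seconds_until_limit() {
  awk -v limit="$1" '
    { n++; sx += $1; sy += $2; sxx += $1 * $1; sxy += $1 * $2; last = $2 }
    END {
      d = n * sxx - sx * sx
      if (n < 2 || d == 0) { print "unknown"; exit }
      slope = (n * sxy - sx * sy) / d        # least-squares slope, Mi per second
      if (slope <= 0) { print "stable"; exit }
      printf "%.0f\n", (limit - last) / slope
    }'
}

# Usage growing 10 Mi per minute toward a 512 Mi limit:
printf '0 100\n60 110\n120 120\n' | seconds_until_limit 512
```

A linear fit is crude, and real leaks are often bursty, so treat the estimate as a reason to look rather than a countdown.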
How Tracegrid handles it
Because OOMKilled erases the logs, Tracegrid does not rely on them. It reads the Kubernetes OOM event and the kernel signal directly, so it detects the kill even when the container left nothing behind. It then tells you which pod died, that the cause was memory, and recommends a new limit based on the usage it observed — turning a silent, log-less mystery into a one-line fix. Tracegrid also watches the slow memory trend that precedes the kill, so often you hear about it before the OOM killer ever runs.
Written by Pradip — founder of Tracegrid, building AI infrastructure intelligence so small teams get senior-SRE answers at 3am.