
June 5, 2026 · 9 min read · kubernetes

OOMKilled: Why Kubernetes Is Killing Your Pods and How to Stop It

Your pod restarted. The logs are empty. kubectl describe says Reason: OOMKilled and Exit Code: 137. Welcome to one of the most frustrating incidents in Kubernetes — the one where the evidence is destroyed at the moment of the crash.

What OOMKilled means

OOMKilled means the Linux Out-Of-Memory killer terminated your container because it used more memory than it was allowed. Exit code 137 is the giveaway: 128 + 9, where 9 is SIGKILL. The process did not crash on its own — the kernel executed it.
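The 128 + 9 arithmetic is easy to check in any shell. This is a minimal sketch; signal_from_exit is a made-up helper name for illustration, not a standard tool.

```shell
# Exit statuses above 128 mean "terminated by signal (status - 128)".
# signal_from_exit is a hypothetical helper, not part of any standard tool.
signal_from_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo $(( code - 128 ))
  else
    echo 0   # exited on its own, no signal involved
  fi
}

# Reproduce the status locally: a child shell sends SIGKILL to itself.
status=0
sh -c 'kill -KILL $$' || status=$?
echo "exit status: $status, signal: $(signal_from_exit "$status")"
```

On a typical Linux shell this reports exit status 137 and signal 9, the same pair kubectl describe shows for an OOMKilled container.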

This happens at two levels in Kubernetes:

  • Container limit exceeded. The container hit the resources.limits.memory you set. The cgroup enforcing that limit triggers the OOM killer against processes inside the container.
  • Node memory exhausted. The node itself ran out of memory and the kernel killed the highest-scoring process to survive. This often hits pods that set requests too low and got scheduled onto an already-tight node.

How the Linux OOM killer works

The kernel computes an oom_score for every process, based mainly on how much memory it uses and shifted by oom_score_adj. When the kernel can no longer reclaim enough memory, it kills the process with the highest score, which usually frees the most memory in one step. Inside a memory cgroup (which is how Kubernetes limits work), the killer only considers processes in that cgroup, so your one greedy container dies without taking the node down with it.
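You can read these scores straight from /proc on any Linux host or inside a container. The sketch below assumes a Linux /proc filesystem; show_oom_score is a hypothetical helper name. The kubelet sets oom_score_adj per container according to the pod's QoS class (strongly negative for Guaranteed pods, 1000 for BestEffort), which is how Kubernetes steers the kernel's choice on a tight node.

```shell
# Print the OOM score and its adjustment for a PID (defaults to this shell).
# Linux only: both files live under /proc. show_oom_score is a hypothetical name.
show_oom_score() {
  pid=${1:-$$}
  if [ -r "/proc/$pid/oom_score" ]; then
    echo "pid=$pid oom_score=$(cat "/proc/$pid/oom_score") oom_score_adj=$(cat "/proc/$pid/oom_score_adj")"
  else
    echo "pid=$pid oom_score=unavailable"
  fi
}

show_oom_score
```

Run it with the PID of your application process (often 1 inside a container) to see where it stands relative to its neighbours.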

That is by design and it is good behaviour. The problem is purely that it is silent.

Why the pod has no logs

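When the OOM killer fires, it sends SIGKILL, not SIGTERM. There is no grace period, no shutdown hook, no final flush. The process is terminated instantly. Anything still buffered in memory, including the last few log lines, is gone. That is why OOMKilled pods so often show clean-looking logs that just stop mid-sentence. Whatever did reach stdout before the kill is usually still readable with kubectl logs <pod> --previous, but it will not contain an error message explaining the death.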

This is the single biggest reason OOM incidents take so long to diagnose: the most natural tool — the logs — is the one tool the failure mode erases.

How to find OOMKilled pods

Check the pod's last state:

kubectl describe pod <pod> | grep -iA3 'last state'

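You are looking for Reason: OOMKilled. To check for recent kernel OOM events across the cluster (these are recorded against the node, typically by node-problem-detector, so they may be missing on clusters that do not run it):

kubectl get events -A --field-selector reason=OOMKilling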

And to see the memory limit the pod was actually running under:

kubectl get pod <pod> -o jsonpath='{.spec.containers[*].resources}'
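If you would rather sweep every pod than describe them one by one, the last termination reason is available through jsonpath. This is a sketch, not a hardened script: the function names are made up, and list_oomkilled_pods needs kubectl and cluster access, so it is defined but not run here.

```shell
# Keep only rows whose second tab-separated column mentions OOMKilled.
# filter_oomkilled and list_oomkilled_pods are hypothetical helper names.
filter_oomkilled() {
  awk -F'\t' '$2 ~ /OOMKilled/ { print $1 }'
}

# Emit "namespace/pod<TAB>last termination reasons" for every pod, then filter.
# Requires kubectl and a reachable cluster.
list_oomkilled_pods() {
  kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
    | filter_oomkilled
}
```

Calling list_oomkilled_pods prints one namespace/pod per line for pods where at least one container was last terminated by the OOM killer. It only sees the most recent termination per container, so a pod that was OOMKilled two restarts ago and then crashed for another reason will not show up.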

How to set the right memory limits

The fix is rarely "make the number bigger and hope." Do this instead:

  1. Measure real usage. Watch the pod under normal and peak load (kubectl top pod, or your metrics). Note the steady-state and the peak.
  2. Set the request to steady-state. This tells the scheduler how much the pod actually needs, so it lands on a node with room.
  3. Set the limit to peak plus ~20% headroom. Enough to absorb spikes without letting a leak run unbounded.
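The three steps above reduce to simple arithmetic. Here is a minimal sketch that turns two measurements into a resources stanza; suggest_memory is a made-up helper, and the 300Mi steady-state and 400Mi peak figures are invented for illustration.

```shell
# suggest_memory STEADY_MI PEAK_MI [HEADROOM_PERCENT]
# Prints a resources stanza: request = steady state, limit = peak + headroom.
# Hypothetical helper; the inputs come from your own measurements.
suggest_memory() {
  steady=$1
  peak=$2
  headroom=${3:-20}
  limit=$(( peak * (100 + headroom) / 100 ))
  printf 'resources:\n  requests:\n    memory: "%sMi"\n  limits:\n    memory: "%sMi"\n' "$steady" "$limit"
}

# Example with invented numbers: 300Mi steady state, 400Mi peak.
suggest_memory 300 400
```

With those inputs the request comes out at 300Mi and the limit at 480Mi. Treat the 20% as a starting point and revisit it once you have a few weeks of real peaks.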

Requests vs limits — why both matter

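requests is what the scheduler reserves; limits is the hard ceiling the kernel enforces. If the request is well below what the pod really uses, the scheduler may pack it onto a crowded node and you get node-level OOM. (If you set only a limit, Kubernetes copies it into the request, so the pod is scheduled as if it always needed its peak.) If you set only a request, a leaking app can consume the whole node. Set both. For memory specifically, many teams set requests == limits; when the CPU request and limit also match for every container, the pod gets the "Guaranteed" QoS class, which puts it last in line for eviction under node pressure.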
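The QoS rules are easier to remember as a small decision function. The sketch below is a simplification for a single-container pod: qos_class is a hypothetical helper, it compares quantities as plain strings (so 512Mi and 0.5Gi would not count as equal), and it ignores defaults injected by a LimitRange.

```shell
# qos_class CPU_REQ CPU_LIM MEM_REQ MEM_LIM   (pass "" for an unset value)
# Simplified single-container model of the QoS rules; hypothetical helper.
qos_class() {
  cpu_req=$1; cpu_lim=$2; mem_req=$3; mem_lim=$4
  # A limit without a request is copied into the request.
  if [ -z "$cpu_req" ]; then cpu_req=$cpu_lim; fi
  if [ -z "$mem_req" ]; then mem_req=$mem_lim; fi
  if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
    echo BestEffort      # nothing set at all
  elif [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] \
    && [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    echo Guaranteed      # CPU and memory both set, request equals limit
  else
    echo Burstable       # anything in between
  fi
}

qos_class 500m 500m 512Mi 512Mi   # Guaranteed
qos_class ""   ""   512Mi 512Mi   # Burstable: memory matches but CPU is unset
```

To see what Kubernetes actually assigned to a running pod, kubectl get pod <pod> -o jsonpath='{.status.qosClass}' prints the class directly.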

Language runtimes and container limits

Two runtimes catch people constantly:

  • JVM: older JVMs (before 8u191 and JDK 10) do not see the cgroup limit and size the heap from the host's total RAM, then get OOMKilled. Use a JDK with container support (-XX:+UseContainerSupport, on by default in those releases and later) and set -XX:MaxRAMPercentage so the heap is sized deliberately against the container limit; the default is only 25% of it.
  • Node.js: V8's default old-space limit can exceed a small container's memory limit. Set --max-old-space-size to roughly 75% of the container limit (e.g. --max-old-space-size=384 for a 512Mi limit) so V8 garbage-collects before the cgroup kills it.
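Both runtimes can be configured through environment variables, which is convenient in a container entrypoint. A minimal sketch, assuming a 512Mi limit: heap_mb and LIMIT_MI are made-up names, and using 75% for the JVM as well is an assumption carried over from the Node.js guidance, not a universal rule, since the JVM also needs room for metaspace, thread stacks and native buffers.

```shell
# heap_mb LIMIT_MI [PERCENT]: share of the container limit to hand to the heap.
# Hypothetical helper; 75 percent is a starting point, not a guarantee.
heap_mb() {
  limit_mi=$1
  percent=${2:-75}
  echo $(( limit_mi * percent / 100 ))
}

LIMIT_MI=512   # assumed container memory limit, in Mi

# Node.js reads NODE_OPTIONS; --max-old-space-size takes a value in MiB.
export NODE_OPTIONS="--max-old-space-size=$(heap_mb "$LIMIT_MI")"

# The JVM reads JAVA_TOOL_OPTIONS; MaxRAMPercentage is relative to the cgroup limit.
export JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75.0"

echo "$NODE_OPTIONS"
echo "$JAVA_TOOL_OPTIONS"
```

For the 512Mi example this yields --max-old-space-size=384, matching the figure above. In a real deployment you would normally set these variables in the pod spec's env section rather than in a shell.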

How to prevent OOMKilled

  • Always set memory requests and limits, informed by real measurement.
  • Pin runtime heap sizes (JVM, Node) to the container limit.
  • Alert on a sustained downward trend in available memory — a leak is visible for hours before it kills.
  • Load-test before launch so you discover the peak in staging, not production.
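To make the trend alert concrete: two usage samples and the limit are enough for a rough time-to-kill estimate. This sketch assumes linear growth, which real leaks only approximate; minutes_to_limit is a hypothetical helper and the numbers are invented.

```shell
# minutes_to_limit USED_START_MI USED_END_MI ELAPSED_MIN LIMIT_MI
# Linear extrapolation of memory growth toward the container limit.
# Hypothetical helper; real alerting belongs in your metrics system.
minutes_to_limit() {
  start=$1; end=$2; elapsed=$3; limit=$4
  growth=$(( end - start ))
  if [ "$growth" -le 0 ]; then
    echo "stable"   # usage is flat or shrinking, nothing to extrapolate
    return 0
  fi
  echo $(( (limit - end) * elapsed / growth ))
}

# Example with invented numbers: 300Mi -> 360Mi over 60 minutes, 512Mi limit.
minutes_to_limit 300 360 60 512
```

With those numbers the pod has about 152 minutes left, which is plenty of time to act if an alert fires on the estimate rather than on the kill.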

How Tracegrid handles it

Because OOMKilled erases the logs, Tracegrid does not rely on them. It reads the Kubernetes OOM event and the kernel signal directly, so it detects the kill even when the container left nothing behind. It then tells you which pod died, that the cause was memory, and recommends a new limit based on the usage it observed — turning a silent, log-less mystery into a one-line fix. Tracegrid also watches the slow memory trend that precedes the kill, so often you hear about it before the OOM killer ever runs.

Written by Pradip — founder of Tracegrid, building AI infrastructure intelligence so small teams get senior-SRE answers at 3am.
