Solving challenges caused by Out Of Memory (OOM) Killer in Linux

Learn how out of memory events created challenges for our team, and how we solved them.

By Rafał Korepta on July 7, 2022

Introduction

Out of memory (OOM) events are common on Linux when programs allocate a lot of memory. Redpanda is one such program: it uses the Seastar library, which tries to use the hardware to its limits.

The kernel has a dedicated mechanism, the Out Of Memory Killer (OOM Killer), that keeps Linux machines operational by killing the biggest process with the lowest priority. OOM Killer also recognizes and respects the constraints that Linux cgroups place on processes.
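To see how the kernel currently ranks a process, you can read the per-process files `/proc/<pid>/oom_score` and `/proc/<pid>/oom_score_adj`; a higher score means the process is a likelier victim. The sketch below is only an illustration: the helper names and the `proc_root` parameter are hypothetical, and when a cgroup limit is hit the kernel only considers processes in that cgroup, which this helper does not model.

```python
from pathlib import Path


def read_oom_info(pid, proc_root="/proc"):
    """Return (oom_score, oom_score_adj) for a process as integers."""
    base = Path(proc_root) / str(pid)
    score = int((base / "oom_score").read_text().strip())
    adj = int((base / "oom_score_adj").read_text().strip())
    return score, adj


def likeliest_victim(pids, proc_root="/proc"):
    """Return the PID with the highest oom_score among the given PIDs."""
    return max(pids, key=lambda pid: read_oom_info(pid, proc_root)[0])
```

Passing the PIDs that share a container to `likeliest_victim` gives a rough idea of which one OOM Killer would pick first.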

Unless you specify input parameters, Redpanda reads the memory available on the hardware, sets aside at least 1.5 GiB for the operating system (OS), and divides the rest equally among the machine's cores to maximize the efficiency of the Seastar memory allocator. If Redpanda runs alongside other programs, the Linux OS might run out of memory.
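The default split described above can be sketched as simple arithmetic. This is a minimal illustration, not Seastar's actual code: the function name is hypothetical, and the reserve is modeled as a flat 1.5 GiB even though the text says "at least" 1.5 GiB.

```python
GIB = 1024 ** 3


def default_memory_per_core(total_bytes, cores, os_reserve_bytes=int(1.5 * GIB)):
    """Split the memory left after the OS reserve evenly across cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    usable = total_bytes - os_reserve_bytes
    if usable <= 0:
        raise ValueError("not enough memory left after the OS reserve")
    # Integer division: any remainder smaller than `cores` bytes is unused.
    return usable // cores


# A 64 GiB machine with 16 cores leaves 62.5 GiB, or 3.90625 GiB per core.
print(default_memory_per_core(64 * GIB, 16) / GIB)
```

Nothing in this calculation accounts for other programs on the machine, which is why running Redpanda alongside them can exhaust memory.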

If you’ve also experienced problems with OOM Killer, keep reading to learn how we resolved our issues with it so you can do the same.

How OOM Killer began interrupting our sidecar

When we began experiencing problems with OOM Killer, Redpanda Cloud used (and still uses) Kubernetes (K8s), and relied on cgroups and Linux namespaces to constrain the workloads. If Redpanda wasn't told which memory parameters to use, the underlying Seastar library would reserve 1.5 GiB for the OS and divide the rest of the cgroup's memory among the available CPU cores.

Such a setup didn't make sense for a containerized environment where Redpanda was isolated from every other process. Users of the Redpanda operator shouldn't have to worry about Redpanda's advanced memory parameters. However, depending on the desired capacity, adequate K8s nodes must be available for Redpanda, and correct limits and requests need to be set. The first sizing for the Redpanda pod in K8s reserved 0.5 GiB of memory for the other pods running on a dedicated Redpanda node.

To automate and ease the K8s deployment of Redpanda, we created an operator. In order to constrain Redpanda and leverage cgroup capability, we provided a resource configuration option in the cluster custom resource. This configuration was mapped directly to the Redpanda configuration so that Redpanda could use all memory available to the container.

In our first Redpanda operator implementation, the K8s deployment resource was configured not to overwrite the container entry point. The default entry point used supervisord to schedule the Redpanda process, telemetry reporting, and WebAssembly (Wasm) coprocessors. That single entry point had been chosen to keep local deployments simple (e.g. docker run).

When Redpanda warmed up its cache, OOM Killer saw that memory inside the Redpanda cgroup was exhausted, and it killed the biggest Redpanda process. Users would see that the broker was unavailable until the container runtime restarted the Redpanda process. The Redpanda operator could take over supervisord's job by scheduling only one process per container, leaving the container runtime to do the heavy lifting and isolate each process. This also made further problems easier to debug, because OOM Killer recognized individual processes and only those were affected.

The first solution we tried was a K8s deployment in which every process ran in its own dedicated container. While investigating it, we saw that rpk debug info, which sends telemetry data, was executed every 10 minutes. When Redpanda had a higher-than-usual load, our sidecar used more memory than was set in its cgroup, and OOM Killer started to kill the sidecar container.

Next, the Cloud team optimized the managed solution by eliminating all sidecars from the deployment. Telemetry was moved outside the Redpanda pod, and Wasm coprocessors were disabled until GA. With only one Redpanda process running in the pod, the memory cgroup constraints were mapped directly to Redpanda memory. In long-running clusters, memory allocation grew to the point where, from the OS perspective, all available memory was consumed by Redpanda, and the processes were again killed by OOM Killer. At this point we were looking for a bug in Redpanda, but it turned out that the K8s pod implementation is backed by a pause container.

Solving the OOM Killer challenge

The pause process plays a crucial role in a pod: it creates the container sandbox and makes it possible to restart individual containers in a multi-container pod. Judging by its source code, this process is tiny in terms of memory, but it still needs one page from the operating system just to work. That one page is decisive when OOM Killer scans the cgroups and finds that the Redpanda container has exceeded its memory limit.
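One way to watch how close a cgroup is to its limit is to compare the `memory.current` and `memory.max` files that cgroup v2 exposes. The following is a minimal sketch that assumes cgroup v2; the function names and the directory you pass in are hypothetical and depend on how your node lays out its cgroups.

```python
from pathlib import Path


def parse_limit(raw):
    """Parse a cgroup v2 memory.max value; the literal 'max' means no limit."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)


def cgroup_headroom(cgroup_dir):
    """Return the bytes left before the cgroup hits its limit.

    Returns None when the cgroup has no memory limit.
    """
    base = Path(cgroup_dir)
    limit = parse_limit((base / "memory.max").read_text())
    current = int((base / "memory.current").read_text().strip())
    return None if limit is None else limit - current
```

Every process charged to the cgroup counts toward `memory.current`, so if one process is configured to use the whole limit, even a single extra page from another process leaves no headroom.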

Once the OOM Killer report proved that the pause container was listed alongside the Redpanda process, we implemented a memory reservation to solve this issue. Even with a single container, we couldn't allocate the whole memory to the Redpanda process. The Redpanda operator extends the cluster custom resource definition to include a Redpanda resource configuration, so cgroup memory is no longer tied to Redpanda's maximum memory allocation. Depending on the K8s worker node size and the traffic on a particular node, clients can assign less memory to Redpanda than to the container.

The next improvement was to add a default 10% memory reservation for the OS, to prevent memory pressure on overprovisioned K8s worker nodes. If Redpanda operator users did not set Redpanda memory, then in big enough clusters, where the whole memory limit was distributed among the pods, clients could observe memory pressure events. With spikes in traffic and Kafka client usage, the SRE team might observe that the default kubelet host memory reservation is not enough for the operating system. The 10% reservation was implemented to help clients that were already using the Redpanda operator: an operator upgrade recalculates the necessary memory reservation. It also leaves room for the pause container and for the kernel data structures that the K8s node needs to work correctly.
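The effect of the default reservation can be sketched as follows. This is an illustration under assumptions, not the operator's code: the function name is hypothetical, and the exact rounding the operator applies is not stated here, so the sketch simply rounds down.

```python
GIB = 1024 ** 3


def redpanda_memory_from_container(container_bytes, reserve_percent=10):
    """Return the memory left for Redpanda after the percentage reservation."""
    if not 0 <= reserve_percent < 100:
        raise ValueError("reserve_percent must be in [0, 100)")
    # Integer math avoids floating-point drift on large byte counts.
    return container_bytes * (100 - reserve_percent) // 100


# With a 64 GiB container, roughly 57.6 GiB goes to Redpanda.
print(redpanda_memory_from_container(64 * GIB) / GIB)
```

The remaining 10% is the cushion that absorbs the pause container and other overhead.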

Optimizing resource consumption in bigger machines

In bigger clusters (e.g. 16 cores and 64 GiB), Redpanda needs to give more room to the auxiliary services. Each core is occupied by a Redpanda shard. A single shard doesn't overload the metrics system or the logging aggregator but, multiplied by the number of cores, the shards can significantly change the resource requirements of those services (for example, Prometheus for metrics or FluentBit for logging). Because OOM Killer looks for the biggest process with the lowest priority inside each cgroup, Redpanda was picked to be terminated, and the K8s node-exporter started to report node memory pressure events. For our biggest deployments, we adjusted memory to leave more room for the logging collector, kubelet, node-exporter, and kube-proxy.

Ironically, to prevent OOM kills of Redpanda, we actually reduced the amount of memory Redpanda used: first by reducing the memory allocated to the cgroup, and then by reducing the memory Seastar can use within that cgroup.

How to adjust default memory allocation

Despite these challenges with OOM Killer, we were able to troubleshoot the memory usage issues effectively, and we are now more mindful of resource constraints in a containerized environment. We improved both our observability stack and the Redpanda operator to make it easier to debug the loss of Redpanda nodes.

For any user of the Redpanda operator, the most important thing to understand is that, by default, the operator reserves 10% of the memory in the provided K8s resource requests and gives the rest to Redpanda.

Users who want to change the 10% reservation can set the redpanda section of the cluster custom resource. They must then calculate requests, limits, and Redpanda options to match the desired configuration:

```yaml
kind: Cluster
spec:
  resources:
    requests:
      cpu: 2
      memory: 2.23Gi
    redpanda:
      cpu: 2
      memory: 2Gi
    limits:
      cpu: 2
      memory: 2.23Gi
```

If they do not need to change the 10% memory cushion, the redpanda section can be omitted.
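When you do set the values by hand, the request and limit can be derived from the memory you want Redpanda to have. The sketch below is a hedged illustration: the function name is hypothetical, it assumes you want to keep the same 10% cushion, and it rounds up so the cushion is never smaller than intended.

```python
GIB = 1024 ** 3


def container_memory_for_redpanda(redpanda_bytes, reserve_percent=10):
    """Return the smallest container size that keeps the cushion intact."""
    if not 0 <= reserve_percent < 100:
        raise ValueError("reserve_percent must be in [0, 100)")
    keep = 100 - reserve_percent
    # Ceiling division using integers only.
    return -(-redpanda_bytes * 100 // keep)


# 2 GiB for Redpanda needs a little over 2.22 GiB for the container,
# which is consistent with the 2.23Gi request and limit in the example.
print(round(container_memory_for_redpanda(2 * GIB) / GIB, 4))
```

Setting requests and limits to the same value, as in the example, keeps the cgroup limit equal to what the scheduler reserved for the pod.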

Not only should the cgroup be taken into account, but so should the overall memory resource exhaustion on the K8s node.

Conclusion

By optimizing the overhead of the containerized environment, we’re able to provide a better-managed cloud experience and meet our users wherever they are in their streaming applications journey.

For more information about using Redpanda on Linux, view our documentation. Learn more about Redpanda in our GitHub repo, or join our Community Slack to interact directly with our engineers and team.