Hub receives thousands of /hub/error/503 messages when jupyter pod is OOMKilled

We’ve recently updated our host node AMIs to Amazon Linux 2023, and our EKS to 1.32, so it looks like cgroupv2 is in full effect.
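If anyone wants to sanity-check the same thing, here's a minimal sketch of how one could confirm cgroup v2 from inside a user pod - the cgroup.controllers file only exists on the v2 unified hierarchy, though the paths are worth double-checking on your own image:

```python
from pathlib import Path

# On the cgroup v2 unified hierarchy, cgroup.controllers exists at the cgroup root.
# Under cgroup v1 it does not, and per-controller directories (memory/, cpu/, ...) do.
if (Path("/sys/fs/cgroup") / "cgroup.controllers").exists():
    print("cgroup v2 (unified hierarchy)")
else:
    print("cgroup v1 (legacy hierarchy)")
```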

Basically we’re having an issue where a single user will generate anywhere from about 2k~10k log entries when their pod experiences an OOMKilled event. All of these /hub/error/503?url= entries are for various API calls (/api/kernels, /api/terminals, /ai/chat, etc.), effectively any polling the user had going on. From what I can tell that pretty much means it’s an ECONNRESET or ECONNREFUSED error, which makes sense since the pod is being forcibly restarted. These events understandably overwhelm the hub, sometimes impacting service for other users (since the hub will also be forced to restart if its liveness probes fail as a result of these events).
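For anyone hitting the same thing, this is a rough sketch of how one could count the 503s per endpoint straight from the hub logs to see what is actually flooding it - the regex assumes the log line contains the literal /hub/error/503?url=<encoded path> fragment, so adjust it to your own log format:

```python
import re
import sys
from collections import Counter
from urllib.parse import unquote

# Assumed log fragment: ".../hub/error/503?url=%2Fuser%2Fsomeone%2Fapi%2Fkernels..."
PATTERN = re.compile(r"/hub/error/503\?url=(\S+)")

counts = Counter()
for line in sys.stdin:
    match = PATTERN.search(line)
    if match:
        # Decode the path the browser was polling, e.g. /user/someone/api/kernels
        counts[unquote(match.group(1))] += 1

for url, n in counts.most_common(20):
    print(f"{n:6d}  {url}")
```

Run it with something like `kubectl logs <hub-pod> | python tally_503.py` (names are placeholders) to see which polling endpoints dominate.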

We had never observed user pods hitting OOM before - usually the kernel process would die, which saved the pod from having to restart. Now with cgroup v2, that doesn’t seem possible anymore. Really just wondering if anyone else has experienced this, or if there’s any advice on how we should start picking this apart. Thanks!
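One thing worth checking while picking this apart: the cgroup v2 mechanism behind the whole-cgroup kill appears to be the memory.oom.group flag, which (per the PR linked further down this thread) kubelet sets to 1 as of 1.28, so the OOM killer takes out every process in the cgroup rather than just the biggest one. A minimal sketch to inspect it from inside a user pod, assuming the standard cgroup v2 file names:

```python
from pathlib import Path

CGROUP = Path("/sys/fs/cgroup")

def show(name: str) -> None:
    """Print a cgroup v2 control file, or note that it's missing."""
    path = CGROUP / name
    value = path.read_text().strip() if path.exists() else "<not present>"
    print(f"{name}: {value}")

# memory.oom.group=1 means an OOM kill takes the entire cgroup
# (single-user server, kernels, terminals, ...) rather than one process.
show("memory.oom.group")
show("memory.max")      # the limit the whole pod shares
show("memory.current")  # current usage counted against that shared limit
```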

Hub: 3.1.2
Lab: 4.1.8

Do you have any monitoring (e.g. Prometheus, or AWS CloudWatch)? If so, try comparing the pod and node memory usage for the old and new clusters. This will hopefully tell you whether memory usage of the user pods has increased, whether it’s the same but pods were previously (incorrectly) being allowed to use more memory, or whether something else is using more memory.
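If you’re using Container Insights, a rough sketch of that comparison could look like this - the namespace, metric, and dimension names here (ContainerInsights, pod_memory_utilization, ClusterName/Namespace) are my assumptions about the Container Insights defaults, and the cluster/namespace values are placeholders, so check what your account actually publishes:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Assumed Container Insights metric and dimensions -- verify against your account.
resp = cloudwatch.get_metric_statistics(
    Namespace="ContainerInsights",
    MetricName="pod_memory_utilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "my-eks-cluster"},  # placeholder
        {"Name": "Namespace", "Value": "jupyterhub"},        # placeholder
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(days=7),
    EndTime=datetime.now(timezone.utc),
    Period=3600,
    Statistics=["Average", "Maximum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"avg={point['Average']:.1f}%  max={point['Maximum']:.1f}%")
```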

You could also try running a standalone JupyterLab pod on EKS, and see if there’s a particular configuration that causes the OOM.
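Something along these lines, for example - a throwaway notebook pod with an explicit memory limit, created via the kubernetes Python client; the image, namespace, and limits are placeholders to experiment with:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run from a pod

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="lab-oom-test", labels={"purpose": "oom-debug"}),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="lab",
                image="jupyter/base-notebook:latest",  # placeholder image
                resources=client.V1ResourceRequirements(
                    requests={"memory": "1Gi", "cpu": "500m"},
                    limits={"memory": "1Gi", "cpu": "1"},
                ),
                ports=[client.V1ContainerPort(container_port=8888)],
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
print("Created lab-oom-test; exec in and allocate memory to reproduce the OOM.")
```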

Thanks manics!

So to be clear, I think the circumstances in which a pod hits OOM are normal - typically it’s some data a user is trying to unpack or use, and they simply over-use resources. The major difference is whether the pod or the kernel dies, however - in the past, if a user over-used resources, the kernel was able to die on its own so that the pod stayed alive.

What I’m trying to figure out is whether there’s a way, under cgroup v2, to have the kernel die first again, as well as how to reduce the sheer volume of hub errors when a pod is OOMKilled.

We do have CloudWatch and New Relic monitoring, and there doesn’t seem to be a trend of increased memory usage compared to before - just a difference in how it’s handled. I believe under cgroup v2 both the jupyter and ipykernel processes are in the same cgroup, so ipykernel can no longer receive an individual SIGKILL? That’s my understanding anyway.
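For what it’s worth, a quick way to check that assumption from a terminal inside the user pod is to list every PID in the container’s cgroup along with its command line - a minimal sketch, assuming the standard cgroup v2 and procfs paths:

```python
from pathlib import Path

# cgroup.procs at the container's cgroup root lists every PID in the group.
pids = Path("/sys/fs/cgroup/cgroup.procs").read_text().split()

for pid in pids:
    cmdline = Path(f"/proc/{pid}/cmdline")
    # cmdline is NUL-separated; processes may exit between the two reads.
    args = (
        cmdline.read_bytes().replace(b"\x00", b" ").decode().strip()
        if cmdline.exists()
        else "<gone>"
    )
    print(f"{pid:>8}  {args}")
```

If the single-user server and the ipykernel processes all show up in the same list, they’re sharing the one memory.max limit, which would match the behavior we’re seeing.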

The thousands of hits per OOM event is also something that’s bothering me, though. In our test environments (same setup, just lower usage) we only see something like 20~30 hits for an OOM event, but in our prod environment it’s thousands of hits for a single OOM event, all from the one user that ran OOM.

I believe we’ve found the root cause of the flood of messages - it seems that the initiator of these requests is the browser, specifically Chrome. In Firefox the websocket requests are made and fail, whereas in Chrome the websocket requests stay pending and are retried relentlessly.


From this article - Solving Out-Of-Memory Issues in Kubernetes with cgroup v2 - Preferred Networks Research & Development

This is the evidence that, as of 1.28 (cgroup v2), Kubernetes will actually kill the whole pod when any of its processes hits an OOM event: use the cgroup aware OOM killer if available by tzneal · Pull Request #117793 · kubernetes/kubernetes · GitHub

Jupyter user pods being terminated this way seems to be unexpected behavior for both the CHP and the hub.