A few times in my Kubernetes clusters, I’ve encountered situations where some process consumes all the CPU or RAM and starves critical services. For example, in one case Longhorn consumed all the CPU and RAM on a machine, and the pi-hole running on that same machine could no longer answer DNS requests. Other issues have included shutting down one of my worker nodes and discovering the remaining nodes didn’t have enough capacity to take its pods, so important pods never got scheduled, and the time I made a mistake in a pod selector label and Kubernetes just spawned thousands of pods.
The graph below shows the disk I/O of a node with excessive disk writes because the OS is swapping RAM out to disk and back.
My home lab servers are now running what I consider to be “business critical” services, and I don’t want those to be impacted. Kubernetes has several knobs for this: resource requests and limits, which use Linux cgroups to guarantee specific pods a certain amount of CPU and RAM, and pod prioritization, so that important pods get scheduled first and lower-priority pods get evicted when there isn’t enough room.
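As a rough sketch of what the requests-and-limits side looks like, here’s a hedged example of a Deployment with resources set. The names, image, and numbers are placeholders for illustration, not my actual manifests:

```yaml
# Hypothetical Deployment showing resource requests and limits.
# Requests are what the scheduler reserves for the pod; limits are
# enforced by the kernel via cgroups (CPU throttling, OOM kill for memory).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pihole
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pihole
  template:
    metadata:
      labels:
        app: pihole
    spec:
      containers:
        - name: pihole
          image: pihole/pihole:latest
          resources:
            requests:
              cpu: 250m       # reserved for scheduling decisions
              memory: 256Mi
            limits:
              cpu: "1"        # throttled above one core
              memory: 512Mi   # OOM-killed if exceeded
```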
Lately I’ve even been hitting the limit of 110 pods per node on my single-node cluster. Not everything is important, and I want to make sure certain cron jobs always run even if I’m also running some low-priority jobs. It turns out it’s entirely possible to end up with 110 different pods running.
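To sketch how prioritization ties into that, here’s a hedged example of a PriorityClass plus a CronJob that references it. The names, the priority value, the schedule, and the image are all made up for illustration:

```yaml
# Hypothetical PriorityClass for jobs that must always run.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-cron
value: 1000000                      # higher value = scheduled first; can preempt lower-priority pods
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "For cron jobs that should run even when the cluster is full."
---
# A CronJob that uses it; everything here is a placeholder.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          priorityClassName: critical-cron
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: alpine:3.19
              command: ["sh", "-c", "echo running backup"]
```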