K8s pod resource limits and requests
How does k8s manage pod resource limits and requests?
1. Cgroup memory resource limits
Some useful links:
- https://jvns.ca/blog/2016/12/03/how-much-memory-is-my-process-using-/ talks about the overall memory model.
- Another of Julia's posts suggests that, within a cgroup, once you run out of the memory allocated to you, you start using swap (as long as the other kernel params allow it) rather than immediately getting killed by the OOM killer.
- The hands-on cgroup tutorial here
- Kernel documentation
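To make the cgroup numbers concrete, here is a minimal sketch that reads the memory limit and current usage of a cgroup. It assumes a cgroup v1 memory hierarchy mounted at /sys/fs/cgroup/memory (cgroup v2 nodes use different file names such as memory.max); the cgroup path is a placeholder.

```python
from pathlib import Path

# Hypothetical cgroup path; substitute the cgroup of the container you care about.
CGROUP = Path("/sys/fs/cgroup/memory/kubepods/some-pod-cgroup")

def read_int(name: str) -> int:
    """Read a single integer value from a cgroup v1 memory control file."""
    return int((CGROUP / name).read_text().strip())

if __name__ == "__main__":
    limit = read_int("memory.limit_in_bytes")       # hard limit
    usage = read_int("memory.usage_in_bytes")       # current usage incl. page cache
    soft  = read_int("memory.soft_limit_in_bytes")  # soft limit, usually left at max
    print(f"usage {usage} / limit {limit} (soft limit {soft})")
```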
2. Docker resource limits
Let's look at the Container layer to see what limits exist.
From the Docker documentation on resource limits, there is a provision for hard limits (using --memory) and soft limits (using --memory-reservation), with some relation to each other.
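As a hedged illustration (using the Docker SDK for Python rather than the CLI flags), a container can be started with both a hard and a soft memory limit; the image and values below are placeholders.

```python
import docker  # Docker SDK for Python (pip install docker)

client = docker.from_env()

# Hard limit (--memory) and soft limit (--memory-reservation); values are illustrative.
container = client.containers.run(
    "nginx:alpine",
    detach=True,
    mem_limit="256m",        # cgroup hard limit: exceeding this invites the OOM killer
    mem_reservation="128m",  # soft limit: only enforced when the host is under memory pressure
)
print(container.name)
```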
With some experimentation with pods and examining the underlying containers, some things become clear:
- All Guaranteed pods go into their own top-level cgroup under /kubepods.
- All Burstable and BestEffort pods go into cgroups under /kubepods/burstable and /kubepods/besteffort respectively.
- k8s pod limits directly map to Docker container-level hard limits.
- At the Docker layer, none of the other limits (soft limit, swap limit) are changed from their default max values.
Essentially, k8s requests are an entirely new concept that is not enforced in the underlying layers. The Docker containerization and cgroup layers are unaware of the request values; no minimum resource reservation is done.
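One way to spot-check this is to walk the kubepods cgroup tree on a node and print the hard limit configured for each cgroup. This is a rough sketch assuming the cgroup v1 memory hierarchy under /sys/fs/cgroup/memory/kubepods (the cgroupfs driver layout); exact paths vary with kubelet configuration.

```python
import os

# Assumed cgroup v1 layout used with the cgroupfs driver; adjust for your node.
KUBEPODS = "/sys/fs/cgroup/memory/kubepods"

def walk_pod_cgroups(root: str):
    """Yield (cgroup path, memory hard limit in bytes) for every cgroup under root."""
    for dirpath, _dirs, filenames in os.walk(root):
        if "memory.limit_in_bytes" in filenames:
            with open(os.path.join(dirpath, "memory.limit_in_bytes")) as f:
                yield dirpath, int(f.read().strip())

if __name__ == "__main__":
    for path, limit in walk_pod_cgroups(KUBEPODS):
        # Guaranteed pods sit directly under kubepods; Burstable/BestEffort under their QoS subtree.
        print(f"{path}: limit={limit}")
```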
3. Request management
The final piece of the puzzle is provided by the original design document. This is certainly more detailed than the reference on the website.
This link makes it clear that resource requests are fed into the scheduling algorithm for choosing which nodes to place pods on. So the requests of a pod, the total requests on a node, etc. are all purely bookkeeping. That is, once the sum total of requests of the pods on a node goes above a value (say 100% of the available resource), k8s can choose not to place Guaranteed pods on that node. The actual limits and the resources actually in use are not captured in this calculation.
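A toy sketch of this bookkeeping (not the actual scheduler code): a pod fits on a node if the sum of the requests already placed there plus the new pod's request stays within the node's allocatable capacity. All names and numbers are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    allocatable_memory: int                       # bytes the scheduler may hand out
    placed_requests: list[int] = field(default_factory=list)

    def fits(self, request: int) -> bool:
        # Pure bookkeeping: compares requests only, never actual usage or limits.
        return sum(self.placed_requests) + request <= self.allocatable_memory

    def place(self, request: int) -> None:
        self.placed_requests.append(request)

node = Node(allocatable_memory=8 * 1024**3)   # 8 GiB allocatable
pod_request = 2 * 1024**3                     # pod requesting 2 GiB
if node.fits(pod_request):
    node.place(pod_request)
```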
4. Limit management
In the context of memory (CPU throttling seems to be a much simpler operation):
- If any single pod goes above its hard limit (which may not be specified), it is killed immediately on the cgroup hard-limit violation.
- Based on the QoS class, k8s tweaks the oom_score_adj setting for each cgroup to make the OOM killer choose pods in the order BestEffort -> Burstable -> Guaranteed at crunch time, i.e. when the node is running out of memory.
- This ordering is achieved by heuristics that take a Burstable pod's request into account, making it more likely that a pod that has gone beyond its request will be popped. See the code here; a rough sketch of the heuristic follows this list.
- The important thing to note is that the actual score is calculated automatically by the kernel based on the current memory usage, other factors, and finally the score adjuster above. See man proc.
- This setup happens once, and then k8s is hands off.
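A rough sketch of the kubelet-style heuristic (my approximation, not the actual kubelet source): Guaranteed pods get a strongly negative adjustment, BestEffort pods the maximum, and Burstable pods a value scaled by how large their memory request is relative to the node's capacity.

```python
def oom_score_adj(qos_class: str, memory_request: int, node_memory_capacity: int) -> int:
    """Approximate the per-container oom_score_adj the kubelet assigns.

    The shape follows the kubelet heuristic: Guaranteed pods are nearly exempt,
    BestEffort pods are first in line, Burstable pods fall in between
    depending on their memory request.
    """
    if qos_class == "Guaranteed":
        return -998                      # almost never chosen by the OOM killer
    if qos_class == "BestEffort":
        return 1000                      # first to be killed
    # Burstable: the bigger the request relative to the node, the lower the score.
    adj = 1000 - (1000 * memory_request) // node_memory_capacity
    return max(2, min(adj, 999))         # keep it between system daemons and BestEffort

# Example: a Burstable pod requesting 1 GiB on an 8 GiB node.
print(oom_score_adj("Burstable", 1 * 1024**3, 8 * 1024**3))   # -> 875
```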
Also, it seems that k8s does not allow swap to be enabled if the above system is to work correctly. This simplifies some things. See issues:
5. Summary
- Limit directly translates to the cgroup (hard) limit. If a pod crosses this limit, it gets whacked. Cgroup soft limits are not used at all (though Docker does provide knobs for them).
- Requests are only used by the k8s scheduler to calculate placement of the pod: a node with available space (purely by the math of requests, not from a system point of view) is chosen. The request value has nothing to do with what is actually happening on the system.
- Occasionally, the scheduler may decide to overcommit; this happens when Σ(limits) > node capacity. This business is not related to Σ(requests).
- Based on the request and limit values, a QoS class is calculated: Guaranteed, Burstable, or BestEffort. See the formula in the docs; a small sketch of the classification follows this list.
- When the system is in trouble, the OOM killer calculates an OOM score based on the current memory consumed and an adjustment factor (the oom_score_adj, which is set per process/cgroup) and starts whacking processes in descending score order.
- When pods are placed, the initial cgroup setup also sets this oom_score_adj (one time) based on heuristics to encourage the purge in the order BestEffort -> Burstable -> Guaranteed.
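As a rough sketch of the QoS classification (my reading of the documented rules, simplified; it ignores the defaulting of requests to limits): a pod is Guaranteed when every container sets limits equal to requests for both CPU and memory, BestEffort when no container sets any request or limit, and Burstable otherwise.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ContainerResources:
    cpu_request: Optional[str] = None
    cpu_limit: Optional[str] = None
    memory_request: Optional[str] = None
    memory_limit: Optional[str] = None

def qos_class(containers: list[ContainerResources]) -> str:
    """Classify a pod into Guaranteed / Burstable / BestEffort (simplified)."""
    if all(c.cpu_request is None and c.cpu_limit is None and
           c.memory_request is None and c.memory_limit is None
           for c in containers):
        return "BestEffort"
    if all(c.cpu_limit is not None and c.cpu_request == c.cpu_limit and
           c.memory_limit is not None and c.memory_request == c.memory_limit
           for c in containers):
        return "Guaranteed"
    return "Burstable"

# Example: one container with limits equal to requests -> Guaranteed.
print(qos_class([ContainerResources("500m", "500m", "256Mi", "256Mi")]))
```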