Getting HugePages to work with Kubernetes unofficially

While trying to get DPDK-enabled Click working with containers, I ran into a slight problem. DPDK applications depend on HugePages support to work. This worked easily enough on hosts, but I needed to containerize it for eventual use with k8s.

The first jump, getting it to work with Docker containers, was easy enough:

docker run -it --rm -v /mnt/huge:/mnt/huge --privileged ...
# command that uses DPDK with the EAL parameters set to --huge-dir /mnt/huge

This assumes that you have already enabled HugePages and mounted the hugetlbfs file system at /mnt/huge on the host.
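One way to confirm that assumption before starting the container is to look for a hugetlbfs entry in /proc/mounts. The snippet below is only a sketch: the function name hugetlbfs_mounted_at and its optional second argument (an alternative mounts file, so the check can be tried against a saved copy) are my own additions, not part of any tool.

```shell
#!/bin/sh
# Succeeds (exit status 0) if a hugetlbfs file system is mounted at DIR.
# Usage: hugetlbfs_mounted_at DIR [MOUNTS_FILE]
# MOUNTS_FILE defaults to /proc/mounts; the parameter is a hypothetical
# convenience for running the check against a saved copy.
hugetlbfs_mounted_at() {
    dir=$1
    mounts=${2:-/proc/mounts}
    # /proc/mounts fields: device mountpoint fstype options dump pass
    awk -v d="$dir" '
        $2 == d && $3 == "hugetlbfs" { found = 1 }
        END { exit !found }
    ' "$mounts"
}
```

On the host this could be used as: hugetlbfs_mounted_at /mnt/huge || echo "hugetlbfs is not mounted at /mnt/huge"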

I thought that converting the same setup over to pods would be a five-minute job, but I was wrong. Just running privileged and mounting the directory in was not enough. The error that I kept hitting was:

EAL: Not enough memory available! Requested: 1024MB, available: 0MB
EAL: Cannot init memory

It was not a shortage of huge pages: the same container ran fine when started directly with Docker.
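To rule out an exhausted pool on the host, the kernel's per-size counters under /sys/kernel/mm/hugepages can be read directly. A minimal sketch, assuming 1 GB pages (1048576 kB); the function name free_hugepages and the optional directory argument are hypothetical and only there to make the check easy to try out.

```shell
#!/bin/sh
# Prints the number of free huge pages of one page size.
# Usage: free_hugepages SIZE_KB [SYSFS_DIR]
# SYSFS_DIR defaults to /sys/kernel/mm/hugepages; overriding it is a
# hypothetical convenience for testing against a copy of the tree.
free_hugepages() {
    size_kb=$1
    sysfs=${2:-/sys/kernel/mm/hugepages}
    cat "$sysfs/hugepages-${size_kb}kB/free_hugepages"
}
```

For 1 GB pages the call would be: free_hugepages 1048576. A non-zero count on the host, combined with "available: 0MB" inside the pod, points at a limit rather than at the pool itself.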

One way to get HugePages working with k8s is to simply use the built-in feature. This works if your cluster is recent enough and you can enable the feature. The cluster I was working with was slightly older, which ruled this option out. In any case, I was fine with a hacky option for my use case.

What finally clued me in to the solution was the top comment on this Stack Overflow question.

What happens is that the first non-leaf cgroup in the Kubernetes hugetlb hierarchy, that is, the cgroup at the pod level, has hugetlb.1GB.limit_in_bytes set to 0. The limits at the container level are set to the defaults, and the limits above the pod level are non-zero as well.
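To see this for yourself, it helps to print the limit at every level from the container's cgroup up to the root. The following is a sketch under assumptions: it expects a cgroup v1 hugetlb hierarchy with 1 GB pages, and the function name print_hugetlb_limits is made up for illustration.

```shell
#!/bin/sh
# Prints "<cgroup path> <limit>" for a cgroup and each of its ancestors,
# innermost first. Levels without a readable limit file are skipped.
# Usage: print_hugetlb_limits CGROUP_ROOT REL_PATH
print_hugetlb_limits() {
    root=${1%/}   # drop a trailing slash from the mount point
    rel=${2#/}    # drop the leading slash from the cgroup path
    while :; do
        file="$root/${rel:+$rel/}hugetlb.1GB.limit_in_bytes"
        if [ -r "$file" ]; then
            printf '%s %s\n' "/$rel" "$(cat "$file")"
        fi
        [ -z "$rel" ] && break
        case $rel in
            */*) rel=${rel%/*} ;;   # step up one level
            *)   rel= ;;            # next iteration checks the root itself
        esac
    done
}
```

From a container that has the host hierarchy mounted at /hugetlb-cgfs, a call could look like: print_hugetlb_limits /hugetlb-cgfs "$(grep hugetlb /proc/self/cgroup | cut -d: -f3)". The line showing 0 is the level that needs fixing.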

So one can edit the cgroup limit to get the right value, but the pod's cgroup only exists once the pod has been created. The only way to deal with this, then, is to use an init container.

...
volumes:
- name: hugetlb-cgfs
  hostPath: 
    path: /sys/fs/cgroup/hugetlb/
...
initContainers:
- name: adjust-cgroup
  image: busybox:latest
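  command:
  - sh
  - -c
  - |
    # Pod-level cgroup: this container's hugetlb cgroup path minus its last component
    path=$(grep hugetlb /proc/self/cgroup | awk -F: '{print $3}' | rev | cut -d/ -f2- | rev)
    echo $path
    cd /hugetlb-cgfs/$path
    pwd
    # Overwrite the pod-level limit, then read it back to confirm
    echo 9223372036854771712 | tee hugetlb.1GB.limit_in_bytes
    cat hugetlb.1GB.limit_in_bytes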
  volumeMounts:
  - mountPath: /hugetlb-cgfs
    name: hugetlb-cgfs

Basically, we start by mounting the host's hugetlb cgroup hierarchy into the container. This is obviously unsafe: proceed with caution, and only use it with trusted sources.

The /proc/self/cgroup file lists the cgroups that the init container belongs to. The rev and cut after that strip the innermost path component (the init container's own cgroup), leaving the cgroup of the pod. Then it is simply a matter of going to the right place in the hierarchy and overwriting the limit.
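The path manipulation is easier to follow on a concrete line. Here is the same pipeline wrapped in a function and fed a sample entry; the function name pod_cgroup_path and the sample line are hypothetical, and real pod and container IDs are much longer.

```shell
#!/bin/sh
# Reads /proc/self/cgroup-style lines on stdin and prints the parent of the
# hugetlb cgroup path, i.e. the pod-level cgroup.
pod_cgroup_path() {
    # Field 3 is the cgroup path. Reversing it lets cut drop the last
    # component (the container's own cgroup) before reversing back.
    grep hugetlb | awk -F: '{print $3}' | rev | cut -d/ -f2- | rev
}

# Hypothetical sample line, in the hierarchy-ID:controller:path format.
sample='4:hugetlb:/kubepods/besteffort/pod1234/abcd'
printf '%s\n' "$sample" | pod_cgroup_path
# prints /kubepods/besteffort/pod1234
```

The value written in the init container, 9223372036854771712, is 2^63 - 4096, which in practice means no limit.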