--- title: "Control Group version 2 by hand" slug: "cgroup-v2-by-hand" description: null date: 2021-07-17T17:05:00+02:00 type: posts draft: false tags: - cgroup - Linux toc: true --- :source-highlighter: pygments :idprefix: :experimental: true :toc: :toclevels: 2 :url-openrc: https://wiki.gentoo.org/wiki/OpenRC/CGroups :url-kernel-doc: https://www.kernel.org/doc/html/v5.10/admin-guide/cgroup-v2.html :url-kernel-doc-14: https://www.kernel.org/doc/html/v5.14-rc1/admin-guide/cgroup-v2.html :url-nice: https://manpages.debian.org/buster/coreutils/nice.1.en.html :url-htop: https://htop.dev/ :url-ionice: https://manpages.debian.org/buster/util-linux/ionice.1.en.html We have Control Group v2 since 2016 but I had trouble finding good documentation on how to use it. Most tutorials and blog posts only cover v1 or are specific to systemdfootnote:[If you are looking for OpenRC specific documentation, take a look at the link:{url-openrc}[article in the Gentoo Wiki]]. The link:{url-kernel-doc}[kernel documentation] is a great reference and the basis for this post but not always easy to follow. I will give you a few short examples on how to use it. I will not explain everything, but hopefully enough to get an idea and understand the reference better. Your interface to cgroups is a special file-system. Most distributions have cgroup v1 mounted at `/sys/fs/cgroup` and cgroup v2 at `/sys/fs/cgroup/unified`. Some distributions removed v1 support by default and have v2 mounted at `/sys/fs/cgroup`. You can find out where cgroup v2 is mounted with `mount | grep cgroup2`. If it is not mounted, you can do it yourself with `mount -t cgroup2 none /sys/fs/cgroup/unified`. You can theoretically mount it anywhere you like, but tools expect it in the path mentioned above. Going forward I will assume you are in a terminal in the cgroup v2 directory. Linux distributions should have all cgroup options compiled in. If you built the kernel yourself, or you are missing files in `/sys/fs/cgroup`, you can check with `zgrep CGROUP /proc/config.gz | grep -Ev 'DEBUG|=y'` if you are missing anything important. [NOTE] All examples on this page are tested with kernel 5.10. == Enabling controllers There are 8 controllers currentlyfootnote:controllers[link:{url-kernel-doc}#controllers[Kernel documentation on Control Group v2, section “Controllers”]]: cpu, memory, io, pids, cpuset, rdma, hugetlb and perf_event. You can find out which are available with `cat cgroup.controllers`. perf_event is automatically enabled, all others have to be enabled explicitly, with `echo "+cpu +memory" > cgroup.subtree_control` for example. You can disable a controller by using a `-` instead of a `+`. [quote, citetitle = "Kernel documentation on Control Group v2"] ________________________________________________________________________________ Resources are distributed top-down and a cgroup can further distribute a resource only if the resource has been distributed to it from the parent. This means that all non-root “cgroup.subtree_control” files can only contain controllers which are enabled in the parent’s “cgroup.subtree_control” file. A controller can be enabled only if the parent has the controller enabled and a controller can’t be disabled if one or more children have it enabled. […] Non-root cgroups can distribute domain resources to their children only when they don’t have any processes of their own. 
We will keep it simple by only setting controllers globally in our root cgroup.

== Controlling CPU usage

This control group will use the cpu controllerfootnote:controllers[]. Every process in this group will be deprioritized, and all processes together can only use the power of 2 CPU cores.

[source,shell]
--------------------------------------------------------------------------------
echo "+cpu" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo "50" > cpu.weight
echo "200000 100000" > cpu.max
--------------------------------------------------------------------------------

Try adding your current shell to the group with `echo "$$" > cgroup.procs`. All processes you start from this shell are now in the same cgroup. But what does the example do, exactly?

- *cpu.weight* is the relative amount of CPU cycles the cgroup gets under load. The CPU cycles are distributed by adding up the weights of all _active_ children and giving each the fraction matching the ratio of its weight against the sum.footnote:[link:{url-kernel-doc}#weights[Kernel documentation on Control Group v2, section “Resource Distribution Models” → “Weights”]] It has a range from 1 to 10,000. If one process has a weight of 3,000 and the only other active process has a weight of 7,000, the former will get 30% and the latter 70% of CPU cycles. The default is 100.
- *cpu.max* sets the “maximum bandwidth limit”. We told the kernel that the processes may use at most 200,000 µs of CPU time every 100,000 µs, meaning they can use the power of up to 2 cores.

Try running `for process in $(seq 1 4); do (cat /dev/urandom > /dev/null &); done`. You will see that the CPU usage of each process hovers at around 50% instead of 100%. The processes were added to *cgroup.procs*.

[TIP]
You can add a cgroup column to link:{url-htop}[htop] by pressing kbd:[F2] and then navigating to “Columns”. Select “CGROUP” in “Available Columns” and press kbd:[Enter].

[TIP]
*cpu.weight.nice* is an alternate interface to *cpu.weight* that uses the same values as link:{url-nice}[nice] and has a range from -20 to 19.

=== Controlling CPU core usage

This control group will use the cpuset controller to restrict the processes to the CPU cores 0 and 3.

[source,shell]
--------------------------------------------------------------------------------
echo "+cpuset" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo "0,3" > cpuset.cpus
echo "$$" > cgroup.procs
--------------------------------------------------------------------------------

*cpuset.cpus* takes comma-separated numbers or ranges. For example: “0-4,6,8-10”.

[TIP]
You can add a CPU column to link:{url-htop}[htop] by pressing kbd:[F2] and then navigating to “Columns”. Select “PROCESSOR” in “Available Columns” and press kbd:[Enter].
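To verify the restriction, you can read back the CPUs the group is actually allowed to use and ask the scheduler for the shell’s affinity, which reflects the cpuset — a quick check, assuming `taskset` from util-linux is installed:

[source,shell]
--------------------------------------------------------------------------------
cat cpuset.cpus.effective   # the CPUs the group can actually use, e.g. “0,3”
taskset -cp $$              # the affinity list should only contain 0 and 3
--------------------------------------------------------------------------------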
== Controlling memory usage

This control group will use the memory controllerfootnote:controllers[]. All processes together can only use 1 pass:[GiB] of memory at most.

[source,shell]
--------------------------------------------------------------------------------
echo "+memory" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo "512M" > memory.high
echo "1G" > memory.max
echo "$$" > cgroup.procs
--------------------------------------------------------------------------------

- If the memory usage of a cgroup goes over *memory.high*, the kernel throttles allocations by forcing the processes into direct reclaim to work off the excess.
- *memory.max* is the hard limit. If the cgroup reaches that limit and the memory usage cannot be reduced, the pass:[OOM] killer is invoked in the cgroup.

== Controlling Input/Output usage

This control group will increase the IO priority and limit the write speed to 2 pass:[MiB] a second using the io controllerfootnote:controllers[]. IO limits are set per device. You need to specify the major and minor device numbers of the _device_ (not partition) you want to limit (in my case it is “8:0” for `/dev/sda`). Run `lsblk` or `cat /proc/partitions` to find them.

[source,shell]
--------------------------------------------------------------------------------
echo "+io" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo "default 500" > io.weight
echo "8:0 wbps=$((2 * 1024 * 1024)) rbps=max" > io.max
echo "$$" > cgroup.procs
--------------------------------------------------------------------------------

- *io.weight* specifies the relative amount of IO time the cgroup can use in relation to its siblings and has a range from 1 to 10,000. The priority can be overridden for individual devices with the major:minor syntax, like “8:0 90”. The default value is 100.
- *io.max* limits bytes per second (_rbps/wbps_) and/or IO operations per second (_riops/wiops_).

Try running ``dd if=/dev/zero bs=1M count=100 of=test.img oflag=direct``footnote:[`oflag=direct` opens the file with the `O_DIRECT` flag, bypassing caches.]. You will see that the speed is around 2 MiB a second.

[TIP]
Kernel 5.14 introduced **blkio.prio.class**footnote:[link:{url-kernel-doc-14}#io-priority[Kernel documentation on Control Group v2, section “IO Priority”]], which controls the IO priority. It seems to work like link:{url-ionice}[ionice]. I could not test it yet, since I run kernel 5.10.

== Controlling process numbers

This control group will limit the number of processes to 10 using the process number controllerfootnote:controllers[].

[source,shell]
--------------------------------------------------------------------------------
echo "+pids" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo 10 > pids.max
echo "$$" > cgroup.procs
--------------------------------------------------------------------------------

Try running `for process in $(seq 1 10); do ((sleep 2 && echo ${process}) &); done`. You will get error messages from your shell that it cannot fork another process.

// LocalWords: cgroups cgroup cpuset
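When you are done experimenting, move your shell back to the root cgroup and remove the test group. A short sketch, assuming your shell is still inside `testgroup`; a cgroup directory can only be removed once it contains no processes and no child cgroups:

[source,shell]
--------------------------------------------------------------------------------
cd ..
echo "$$" > cgroup.procs   # move the shell back to the root cgroup
rmdir testgroup            # only works for empty cgroups
--------------------------------------------------------------------------------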