tastytea 2021-07-17 16:41:58 +02:00
---
title: "Control Group version 2 by hand"
slug: "cgroup-v2-by-hand"
description: null
date: 2021-07-17T17:05:00+02:00
type: posts
draft: false
tags:
- cgroup
- Linux
toc: true
---
:source-highlighter: pygments
:idprefix:
:experimental: true
:toc:
:toclevels: 2
:url-openrc: https://wiki.gentoo.org/wiki/OpenRC/CGroups
:url-kernel-doc: https://www.kernel.org/doc/html/v5.10/admin-guide/cgroup-v2.html
:url-kernel-doc-14: https://www.kernel.org/doc/html/v5.14-rc1/admin-guide/cgroup-v2.html
:url-nice: https://manpages.debian.org/buster/coreutils/nice.1.en.html
:url-htop: https://htop.dev/
:url-ionice: https://manpages.debian.org/buster/util-linux/ionice.1.en.html
We have had Control Group v2 since 2016, but I had trouble finding good
documentation on how to use it. Most tutorials and blog posts only cover v1 or
are specific to systemdfootnote:[If you are looking for OpenRC-specific
documentation, take a look at the link:{url-openrc}[article in the Gentoo
Wiki]]. The link:{url-kernel-doc}[kernel documentation] is a great reference but
not always easy to follow. I will give you a few short examples of how to use
it. I will not explain everything, but hopefully enough to get an idea and
understand the reference better.
Your interface to cgroups is a special file-system. Most distributions have
cgroup v1 mounted at `/sys/fs/cgroup` and cgroup v2 at
`/sys/fs/cgroup/unified`. Some distributions removed v1 support by default and
have v2 mounted at `/sys/fs/cgroup`. You can find out where cgroup v2 is mounted
with `mount | grep cgroup2`. If it is not mounted, you can do it yourself with
`mount -t cgroup2 none /sys/fs/cgroup/unified`. You can theoretically mount it
anywhere you like, but tools expect it at the path mentioned above. Going
forward, I will assume you are in a terminal in the cgroup v2 directory.
Linux distributions should have all cgroup options compiled into their kernels.
If you built the kernel yourself, or files are missing in `/sys/fs/cgroup`, you
can check whether anything important is missing with
`zgrep CGROUP /proc/config.gz | grep -Ev 'DEBUG|=y'`.
[NOTE]
All examples on this page are tested with kernel 5.10.
== Enabling controllers
There are currently 8
controllersfootnote:controllers[link:{url-kernel-doc}#controllers[Kernel
documentation on Control Group v2, section “Controllers”]]: cpu, memory, io,
pids, cpuset, rdma, hugetlb and perf_event. You can find out which are
available with `cat cgroup.controllers`. perf_event is automatically enabled,
all others have to be enabled explicitly, with `echo "+cpu +memory" >
cgroup.subtree_control` for example. You can disable a controller by using a `-`
instead of a `+`.
[quote, citetitle = "Kernel documentation on Control Group v2"]
________________________________________________________________________________
Resources are distributed top-down and a cgroup can further distribute a
resource only if the resource has been distributed to it from the parent. This
means that all non-root “cgroup.subtree_control” files can only contain
controllers which are enabled in the parent's “cgroup.subtree_control” file. A
controller can be enabled only if the parent has the controller enabled and a
controller can't be disabled if one or more children have it enabled. […]
Non-root cgroups can distribute domain resources to their children only when
they don't have any processes of their own. In other words, __only domain
cgroups which don't contain any processes can have domain controllers enabled in
their “cgroup.subtree_control”
files.__footnote:[link:{url-kernel-doc}#top-down-constraint[Kernel documentation
on Control Group v2, sections “Top-down Constraint” and “No Internal Process
Constraint”]]
________________________________________________________________________________
We will keep it simple by only setting controllers globally in our root cgroup.
== Controlling CPU usage
This control group will use the cpu controllerfootnote:controllers[]. Every
process in this group will be deprioritized, and all processes together can only
use the power of 2 CPU cores.
[source,shell]
--------------------------------------------------------------------------------
echo "+cpu" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo "10" > cpu.weight.nice
echo "200000 100000" > cpu.max
--------------------------------------------------------------------------------
Try adding your current shell to the group with `echo "$$" > cgroup.procs`. All
processes you start from this shell are now in the same cgroup. But what does
the example do, exactly?
- *cpu.weight.nice* works like the link:{url-nice}[nice] command and has a range
from -20 to 19. It is an alternate interface to *cpu.weight* which has a range
from 1 to 10,000.
- *cpu.max* sets the “maximum bandwidth limit”. We told the kernel that the
processes should use at most 200,000 µs every 100,000 µs, meaning they can use
the power of up to 2 cores.
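The quota is simply the number of cores times the period. A quick sketch of the
arithmetic (100,000 µs is the default period; the variable names are mine):

```shell
# cpu.max takes "<quota> <period>", both in microseconds.
# To allow the power of N cores, set quota = N * period.
period=100000   # 100 ms, the default period
cores=2
echo "$((cores * period)) ${period}"   # the "200000 100000" we wrote above
# Half a core would be:
echo "$((period / 2)) ${period}"
```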
Try running `for process in $(seq 1 4); do (cat /dev/urandom > /dev/null &);
done`. You will see that the CPU usage of each process hovers at around 50%
instead of 100%. The processes were automatically added to *cgroup.procs*,
because they inherit the cgroup of the shell they were started from.
[TIP]
You can add a cgroup column to link:{url-htop}[htop] by pressing kbd:[F2] and
then navigating to “Columns”. Select “CGROUP” in “Available Columns” and press
kbd:[Enter].
=== Controlling CPU core usage
This control group will use the cpuset controller to restrict the processes to
the CPU cores 0 and 3.
[source,shell]
--------------------------------------------------------------------------------
echo "+cpuset" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo "0,3" > cpuset.cpus
echo "$$" > cgroup.procs
--------------------------------------------------------------------------------
*cpuset.cpus* takes comma-separated numbers or ranges. For example:
“0-4,6,8-10”.
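If you want to double-check which CPUs such a string covers, here is a small
helper of my own (`expand_cpuset` is not part of any cgroup tooling) that
expands it the way the kernel reads it:

```shell
# Expand a cpuset list like "0-4,6,8-10" into individual CPU numbers.
expand_cpuset() {
  for part in $(echo "$1" | tr ',' ' '); do
    case "$part" in
      *-*) seq "${part%-*}" "${part#*-}" ;;  # a range, e.g. "0-4"
      *)   echo "$part" ;;                   # a single CPU, e.g. "6"
    esac
  done
}
expand_cpuset "0-4,6,8-10" | tr '\n' ' '
```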
[TIP]
You can add a CPU column to link:{url-htop}[htop] by pressing kbd:[F2] and then
navigating to “Columns”. Select “PROCESSOR” in “Available Columns” and press
kbd:[Enter].
== Controlling memory usage
This control group will use the memory controllerfootnote:controllers[]. All
processes together can use at most 1
pass:[<abbr title="Gibibyte, 1024 Mebibyte">GiB</abbr>] of memory.
[source,shell]
--------------------------------------------------------------------------------
echo "+memory" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo "512M" > memory.high
echo "1G" > memory.max
echo "$$" > cgroup.procs
--------------------------------------------------------------------------------
- If the memory usage of a cgroup goes over *memory.high*, the kernel throttles
allocations by forcing them into direct reclaim to work off the excess.
- *memory.max* is the hard limit. If the cgroup reaches that limit and the
memory usage cannot be reduced, the
pass:[<abbr title="Out Of Memory">OOM</abbr>] killer is invoked in the
cgroup.
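The memory files accept the K, M and G suffixes as binary multiples and report
plain bytes when read back. A sketch of the conversion (the helper name
`to_bytes` is mine):

```shell
# Convert a size with a binary suffix, as accepted by memory.high and
# memory.max, into the byte value the kernel will report back.
to_bytes() {
  case "$1" in
    *K) echo $(( ${1%K} * 1024 )) ;;
    *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
    *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
    *)  echo "$1" ;;   # already plain bytes
  esac
}
to_bytes 512M   # what reading memory.high back will show
to_bytes 1G     # what reading memory.max back will show
```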
== Controlling Input/Output usage
This control group will limit the write speed to 2
pass:[<abbr title="Mebibyte, 1024 Kibibyte">MiB</abbr>] a second using the io
controllerfootnote:controllers[]. IO limits are set per device. You need to
specify the major and minor device numbers of the _device_ (not partition) you
want to limit (in my case it is “8:0” for `/dev/sda`). Run `lsblk` or `cat
/proc/partitions` to find them out.
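The major and minor numbers are the first two columns of `/proc/partitions`,
so a one-liner like this prints them in the “major:minor” form that `io.max`
expects (the output of course depends on your machine):

```shell
# /proc/partitions has the columns "major minor #blocks name"; the data
# starts after a header line and a blank line.
awk 'NR > 2 { print $1 ":" $2, $4 }' /proc/partitions
```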
[source,shell]
--------------------------------------------------------------------------------
echo "+io" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo "8:0 wbps=$((2 * 1024 * 1024)) rbps=max" > io.max
echo "$$" > cgroup.procs
--------------------------------------------------------------------------------
*io.max* limits bytes per second (_rbps/wbps_) and/or IO operations per
second (_riops/wiops_).
Try running ``dd if=/dev/zero bs=1M count=100 of=test.img
oflag=direct``footnote:[`oflag=direct` opens the file with the `O_DIRECT` flag,
bypassing caches.]. You will see that the speed is around 2 MiB a second.
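As a quick sanity check: the `dd` command writes 100 MiB, so with the limit set
above it should take roughly

```shell
# 100 MiB written at 2 MiB per second:
echo "$(( 100 / 2 )) seconds"   # → 50 seconds
```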
[TIP]
Kernel 5.14 introduced
**blkio.prio.class**footnote:[link:{url-kernel-doc-14}#io-priority[Kernel
documentation on Control Group v2, section “IO Priority”]] that controls the IO
priority. It seems to work like link:{url-ionice}[ionice]. I could not test it
yet, since I run kernel 5.10.
== Controlling process numbers
This control group will limit the number of processes to 10 using the process
number controllerfootnote:controllers[].
[source,shell]
--------------------------------------------------------------------------------
echo "+pids" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo 10 > pids.max
echo "$$" > cgroup.procs
--------------------------------------------------------------------------------
Try running `for process in $(seq 1 10); do ((sleep 2 && echo ${process}) &);
done`. You will get error messages from your shell saying that it can't fork
another process.
// LocalWords: cgroups cgroup cpuset