blog/content/posts/cgroup v2 by hand.adoc

---
title: "Control Group version 2 by hand"
slug: "cgroup-v2-by-hand"
description: null
date: 2021-07-17T17:05:00+02:00
type: posts
draft: false
tags:
- CGroup
- Linux
toc: true
---

:source-highlighter: pygments
:idprefix:
:experimental: true
:toc:
:toclevels: 2

:url-openrc: https://wiki.gentoo.org/wiki/OpenRC/CGroups
:url-kernel-doc: https://www.kernel.org/doc/html/v5.10/admin-guide/cgroup-v2.html
:url-kernel-doc-14: https://www.kernel.org/doc/html/v5.14-rc1/admin-guide/cgroup-v2.html
:url-nice: https://manpages.debian.org/buster/coreutils/nice.1.en.html
:url-htop: https://htop.dev/
:url-ionice: https://manpages.debian.org/buster/util-linux/ionice.1.en.html
:url-linux-git: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

We have had Control Group v2 since 2016, but I had trouble finding good
documentation on how to use it. Most tutorials and blog posts only cover v1 or
are specific to systemdfootnote:[If you are looking for OpenRC-specific
documentation, take a look at the link:{url-openrc}[article in the Gentoo
Wiki]]. The link:{url-kernel-doc}[kernel documentation] is a great reference and
the basis for this post, but it is not always easy to follow. I will give you a
few short examples of how to use it. I will not explain everything, but
hopefully enough to give you an idea and help you understand the reference
better.

Your interface to cgroups is a special file system. Most distributions have
cgroup v1 mounted at `/sys/fs/cgroup` and cgroup v2 at
`/sys/fs/cgroup/unified`. Some distributions have removed v1 support by default
and mount v2 at `/sys/fs/cgroup`. You can find out where cgroup v2 is mounted
with `mount | grep cgroup2`. If it is not mounted, you can mount it yourself
with `mount -t cgroup2 none /sys/fs/cgroup/unified`. You can theoretically mount
it anywhere you like, but tools expect it at one of the paths mentioned above.
From here on I will assume that the working directory of your shell is the
cgroup v2 directory.
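
If you want to script this, here is a minimal sketch that looks up the mount
point in `/proc/mounts` instead of parsing the output of `mount`. The function
name `cgroup2_mountpoint` is made up for this example, and it simply takes the
first cgroup2 entry it finds.

[source,shell]
--------------------------------------------------------------------------------
# Print the mount point of the first cgroup2 file system.
# Reads /proc/mounts by default; another file can be passed as $1 for testing.
# The columns are: device, mount point, file system type, options, ...
cgroup2_mountpoint() {
    awk '$3 == "cgroup2" { print $2; exit }' "${1:-/proc/mounts}"
}
--------------------------------------------------------------------------------

With that, `cd "$(cgroup2_mountpoint)"` takes you to the cgroup v2 directory,
provided one is mounted.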

Linux distributions should have all cgroup options compiled in. If you built the
kernel yourself, or you are missing files in `/sys/fs/cgroup`, you can run
`zgrep CGROUP /proc/config.gz | grep -Ev 'DEBUG|=y'` to list the cgroup options
that are not built in and check whether anything important is missing.

[NOTE]
All examples on this page were tested with kernel 5.10.

== Enabling controllers

There are currently 8
controllersfootnote:controllers[link:{url-kernel-doc}#controllers[Kernel
documentation on Control Group v2, section “Controllers”]]: cpu, memory, io,
pids, cpuset, rdma, hugetlb and perf_event. You can find out which ones are
available with `cat cgroup.controllers`. perf_event is enabled automatically;
all others have to be enabled explicitly, for example with `echo "+cpu +memory" >
cgroup.subtree_control`. You can disable a controller by using a `-` instead of
a `+`.
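
A minimal sketch of how this could be scripted: check `cgroup.controllers` first
and only request the controllers that are listed there. The helper names
`has_controller` and `enable_available` are my own and not part of any tool. The
variables are global because plain POSIX shell has no `local`.

[source,shell]
--------------------------------------------------------------------------------
# Succeed if controller $1 is listed in the cgroup.controllers file of
# directory $2 (default: the current directory).
has_controller() {
    tr ' ' '\n' < "${2:-.}/cgroup.controllers" | grep -qx "$1"
}

# Enable every requested controller that is available.
# $1 is the cgroup directory, the remaining arguments are controller names.
enable_available() {
    cg_dir=$1; shift
    request=""
    for name in "$@"; do
        if has_controller "${name}" "${cg_dir}"; then
            request="${request:+${request} }+${name}"
        else
            echo "controller ${name} is not available in ${cg_dir}" >&2
        fi
    done
    # Write all controllers in one go, like: echo "+cpu +memory" > ...
    [ -z "${request}" ] || echo "${request}" > "${cg_dir}/cgroup.subtree_control"
}
--------------------------------------------------------------------------------

Run as root in the cgroup v2 directory, `enable_available . cpu memory` does the
same as the `echo` above, but skips controllers your kernel does not offer.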

[quote, citetitle = "Kernel documentation on Control Group v2"]
________________________________________________________________________________
Resources are distributed top-down and a cgroup can further distribute a
resource only if the resource has been distributed to it from the parent. This
means that all non-root “cgroup.subtree_control” files can only contain
controllers which are enabled in the parent’s “cgroup.subtree_control” file. A
controller can be enabled only if the parent has the controller enabled and a
controller can’t be disabled if one or more children have it enabled. […]
Non-root cgroups can distribute domain resources to their children only when
they don’t have any processes of their own. In other words, __only domain
cgroups which don’t contain any processes can have domain controllers enabled in
their “cgroup.subtree_control”
files.__footnote:[link:{url-kernel-doc}#top-down-constraint[Kernel documentation
on Control Group v2, sections “Top-down Constraint” and “No Internal Process
Constraint”]]
________________________________________________________________________________

We will keep it simple by only setting controllers globally in our root cgroup.

== Controlling CPU usage

This control group will use the cpu
controllerfootnote:[link:{url-kernel-doc}#cpu[Kernel documentation on Control
Group v2, section Controllers → CPU]]. Every process in this group will be
deprioritized, and all processes together can use at most the power of 2 CPU
cores.

[source,shell]
--------------------------------------------------------------------------------
echo "+cpu" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo "50" > cpu.weight
echo "200000 100000" > cpu.max
--------------------------------------------------------------------------------

Try adding your current shell to the group with `echo "$$" > cgroup.procs`. All
processes you start from this shell are now in the same cgroup. But what does
the example do, exactly?

- *cpu.weight* is the relative amount of CPU cycles the cgroup gets under load.
  The CPU cycles are distributed by adding up the weights of all _active_
  children and giving each the fraction matching the ratio of its weight against
  the sum.footnote:weight[link:{url-kernel-doc}#weights[Kernel documentation on
  Control Group v2, section “Resource Distribution Models” → “Weights”]] It has
  a range from 1 to 10,000. If one cgroup has a weight of 3,000 and its only
  active sibling has a weight of 7,000, the former will get 30% and the latter
  70% of the CPU cycles. The default is 100.
- *cpu.max* sets the “maximum bandwidth limit”. We told the kernel that the
  processes may use at most 200,000 µs of CPU time in every period of
  100,000 µs, meaning they can use the power of up to 2 cores.
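
To make the arithmetic concrete, here is a small sketch that computes both
numbers. The function names are invented for this post, and `weight_share`
rounds down to whole percent.

[source,shell]
--------------------------------------------------------------------------------
# Share of CPU time (in percent) for a cgroup with weight $1. The weights of
# all active siblings, including the cgroup itself, follow as $2, $3, ...
weight_share() {
    own=$1; shift
    sum=0
    for w in "$@"; do sum=$((sum + w)); done
    echo $((100 * own / sum))
}

# Number of cores that a "quota period" pair from cpu.max corresponds to.
cpu_max_cores() {
    awk -v quota="$1" -v period="$2" 'BEGIN { print quota / period }'
}

weight_share 3000 3000 7000   # prints 30
cpu_max_cores 200000 100000   # prints 2
--------------------------------------------------------------------------------

For the group above, `weight_share 50 50 100` shows what happens when it
competes with a single sibling that has the default weight.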

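Try running `for process in $(seq 1 4); do (cat /dev/urandom > /dev/null &);
done`. You will see that the CPU usage of each process hovers around 50% instead
of 100%, because the 4 processes share the power of 2 cores. The processes were
added to *cgroup.procs* automatically, since a new process starts in the cgroup
of its parent.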
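
You can check the membership from the point of view of a process as well. This
sketch assumes the format described in cgroups(7), where the cgroup v2 entry in
`/proc/<pid>/cgroup` starts with `0::`. The function name `cgroup2_of` is made
up.

[source,shell]
--------------------------------------------------------------------------------
# Print the cgroup v2 path of process $1 (default: the current shell),
# relative to the cgroup v2 mount point.
# A second argument can name a file to read instead of /proc/<pid>/cgroup.
cgroup2_of() {
    pid=${1:-$$}
    sed -n 's/^0:://p' "${2:-/proc/${pid}/cgroup}"
}
--------------------------------------------------------------------------------

After adding your shell to the group, `cgroup2_of` should print `/testgroup`,
assuming you created the group directly below the root cgroup.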

[TIP]
You can add a cgroup column to link:{url-htop}[htop] by pressing kbd:[F2] and
then navigating to “Columns”. Select “CGROUP” in “Available Columns” and press
kbd:[Enter].

[TIP]
*cpu.weight.nice* is an alternate interface to *cpu.weight* that uses the same
values used by link:{url-nice}[nice] and has a range from -20 to 19.

=== Controlling CPU core usage

This control group will use the cpuset
controllerfootnote:[link:{url-kernel-doc}#cpuset[Kernel documentation on Control
Group v2, section Controllers → Cpuset]] to restrict the processes to the CPU
cores 0 and 3.

[source,shell]
--------------------------------------------------------------------------------
echo "+cpuset" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo "0,3" > cpuset.cpus
echo "$$" > cgroup.procs
--------------------------------------------------------------------------------

*cpuset.cpus* takes comma-separated numbers or ranges. For example:
“0-4,6,8-10”.
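
Such a list is easy to expand if you want to count or loop over the selected
cores. A minimal sketch with a function name (`expand_cpus`) that I made up; it
does not validate its input.

[source,shell]
--------------------------------------------------------------------------------
# Expand a cpuset list such as "0-4,6,8-10" into one CPU number per line.
expand_cpus() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r first last; do
        # A single number has no upper bound, so it is its own range.
        seq "${first}" "${last:-${first}}"
    done
}

expand_cpus "0-4,6,8-10" | wc -l   # counts 9 CPUs
--------------------------------------------------------------------------------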

[TIP]
You can add a CPU column to link:{url-htop}[htop] by pressing kbd:[F2] and then
navigating to “Columns”. Select “PROCESSOR” in “Available Columns” and press
kbd:[Enter].

== Controlling memory usage

This control group will use the memory
controllerfootnote:[link:{url-kernel-doc}#memory[Kernel documentation on Control
Group v2, section Controllers → Memory]]. All processes together can use at most
1 pass:[<abbr title="Gibibyte, 1024 Mebibyte">GiB</abbr>] of memory.

[source,shell]
--------------------------------------------------------------------------------
echo "+memory" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo "512M" > memory.high
echo "1G" > memory.max
echo "$$" > cgroup.procs
--------------------------------------------------------------------------------

- If the memory usage of a cgroup goes over *memory.high*, the kernel throttles
  allocations by forcing them into direct reclaim to work off the excess.
- *memory.max* is the hard limit. If the cgroup reaches that limit and the
  memory usage cannot be reduced, the
  pass:[<abbr title="Out Of Memory">OOM</abbr>] killer is invoked in the
  cgroup.
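
The files accept the suffixes K, M and G as powers of 1024 when you write to
them, but as far as I can tell they report plain bytes when you read them back.
Here is a small sketch of the conversion, which is handy for comparing a limit
against *memory.current*. `to_bytes` is a hypothetical helper, not an existing
tool.

[source,shell]
--------------------------------------------------------------------------------
# Convert a size such as "512M" or "1G" to bytes (suffixes are powers of 1024).
# Values without a suffix are printed unchanged.
to_bytes() {
    case "$1" in
        *[Kk]) echo $(( ${1%?} * 1024 )) ;;
        *[Mm]) echo $(( ${1%?} * 1024 * 1024 )) ;;
        *[Gg]) echo $(( ${1%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$1" ;;
    esac
}

to_bytes 512M   # prints 536870912
to_bytes 1G     # prints 1073741824
--------------------------------------------------------------------------------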

== Controlling Input/Output usage

This control group will increase the IO priority and limit the write speed to 2
pass:[<abbr title="Mebibyte, 1024 Kibibyte">MiB</abbr>] a second using the io
controllerfootnote:io[link:{url-kernel-doc}#io[Kernel documentation on Control
Group v2, section Controllers → IO]]. IO limits are set per device. You need to
specify the major and minor device numbers of the _device_ (not partition) you
want to limit (in my case it is “8:0” for `/dev/sda`). Run `lsblk` or `cat
/proc/partitions` to find them.

[source,shell]
--------------------------------------------------------------------------------
echo "+io" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo "default 500" > io.weight
echo "8:0 wbps=$((2 * 1024 * 1024)) rbps=max" > io.max
echo "$$" > cgroup.procs
--------------------------------------------------------------------------------

- *io.weight* specifies the relative amount of IO time the cgroup can use in
  relation to its siblings and has a range from 1 to 10,000.footnote:weight[]
  The weight can be overridden for individual devices with the major:minor
  syntax, like “8:0 90”. The default value is 100.
- *io.max* limits bytes per second (_rbps/wbps_) and/or IO operations per second
  (_riops/wiops_).
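
If you do not want to look up the device numbers by hand, you can read them from
`/proc/partitions`. This is a sketch that assumes the usual four-column layout
of that file (major, minor, blocks, name). The function names `dev_numbers` and
`io_max_line` are invented for this example.

[source,shell]
--------------------------------------------------------------------------------
# Print "major:minor" for the block device named $1 (for example "sda").
# Reads /proc/partitions by default; another file can be passed as $2.
dev_numbers() {
    awk -v name="$1" '$4 == name { print $1 ":" $2; exit }' \
        "${2:-/proc/partitions}"
}

# Build an io.max line that limits writes on device $1 to $2 MiB per second
# and leaves reads unlimited. $3 is passed on to dev_numbers.
io_max_line() {
    echo "$(dev_numbers "$1" "$3") wbps=$(( $2 * 1024 * 1024 )) rbps=max"
}
--------------------------------------------------------------------------------

`io_max_line sda 2 > io.max` then sets the same limit as the example above. If
the device name is not found, the line starts with a space, and I would expect
the write to fail.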

Try running ``dd if=/dev/zero bs=1M count=100 of=test.img
oflag=direct``footnote:[`oflag=direct` opens the file with the `O_DIRECT` flag,
bypassing caches.]. You will see that the speed is around 2 MiB a second, so
writing the 100 MiB takes about 50 seconds.

[TIP]
Kernel 5.14 introduced
**blkio.prio.class**footnote:[link:{url-kernel-doc-14}#io-priority[Kernel
documentation on Control Group v2, section “IO Priority”]] that controls the IO
priority. It seems to work similarly to link:{url-ionice}[ionice].

[IMPORTANT]
Weight-based distribution (*io.weight*) is available only if cfq-iosched is in
use, and absolute bandwidth or IOPS limit distribution (*io.max*) is not
available for blk-mq devices.footnote:io[] The CFQ scheduler was removed in
kernel 5.0.footnote:[link:{url-linux-git}/commit/?id=f382fb0[git commit: “block:
remove legacy IO schedulers”]]

== Controlling process numbers

This control group will limit the number of processes to 10 using the process
number controllerfootnote:[link:{url-kernel-doc}#pid[Kernel documentation on
Control Group v2, section Controllers → PID]].

[source,shell]
--------------------------------------------------------------------------------
echo "+pids" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo 10 > pids.max
echo "$$" > cgroup.procs
--------------------------------------------------------------------------------

Try running `for process in $(seq 1 10); do ((sleep 2 && echo ${process}) &);
done`. You will get error messages from your shell that it cannot fork another
process.
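
To see how close a group is to its limit, you can compare *pids.current* with
*pids.max*. A minimal sketch; `pids_headroom` is a made-up name, and the
directory argument exists mainly so that the function can be tried on ordinary
files.

[source,shell]
--------------------------------------------------------------------------------
# Print how many more processes the cgroup in directory $1 (default: the
# current directory) may create, or "unlimited" if pids.max is "max".
pids_headroom() {
    cg_dir=${1:-.}
    max=$(cat "${cg_dir}/pids.max")
    current=$(cat "${cg_dir}/pids.current")
    if [ "${max}" = "max" ]; then
        echo "unlimited"
    else
        echo $(( max - current ))
    fi
}
--------------------------------------------------------------------------------

Keep in mind that the shell you added to the group counts towards the limit as
well.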


//  LocalWords:  cgroups cgroup cpuset