---
title: "Control Group version 2 by hand"
slug: "cgroup-v2-by-hand"
description: null
date: 2021-07-17T17:05:00+02:00
type: posts
draft: false
tags:
- cgroup
- Linux
toc: true
---
:source-highlighter: pygments
:idprefix:
:experimental: true
:toc:
:toclevels: 2
:url-openrc: https://wiki.gentoo.org/wiki/OpenRC/CGroups
:url-kernel-doc: https://www.kernel.org/doc/html/v5.10/admin-guide/cgroup-v2.html
:url-kernel-doc-14: https://www.kernel.org/doc/html/v5.14-rc1/admin-guide/cgroup-v2.html
:url-nice: https://manpages.debian.org/buster/coreutils/nice.1.en.html
:url-htop: https://htop.dev/
:url-ionice: https://manpages.debian.org/buster/util-linux/ionice.1.en.html

We have had Control Group v2 since 2016, but I had trouble finding good
documentation on how to use it. Most tutorials and blog posts only cover v1 or
are specific to systemdfootnote:[If you are looking for OpenRC-specific
documentation, take a look at the link:{url-openrc}[article in the Gentoo
Wiki]]. The link:{url-kernel-doc}[kernel documentation] is a great reference
and the basis for this post, but it is not always easy to follow. I will give
you a few short examples of how to use cgroup v2. I will not explain
everything, but hopefully enough to give you an idea and help you understand
the reference better.

Your interface to cgroups is a special file-system. Most distributions have
cgroup v1 mounted at `/sys/fs/cgroup` and cgroup v2 at
`/sys/fs/cgroup/unified`. Some distributions removed v1 support by default and
have v2 mounted at `/sys/fs/cgroup`. You can find out where cgroup v2 is mounted
with `mount | grep cgroup2`. If it is not mounted, you can do it yourself with
`mount -t cgroup2 none /sys/fs/cgroup/unified`. You can theoretically mount it
anywhere you like, but tools expect it in the path mentioned above. Going
forward I will assume you are in a terminal in the cgroup v2 directory.
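
For example, on a machine that has both versions mounted, the check could look
like this (the mount options vary between distributions):

[source,shell]
--------------------------------------------------------------------------------
mount | grep cgroup2
# cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime)
--------------------------------------------------------------------------------
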
Linux distributions should have all cgroup options compiled in. If you built
the kernel yourself, or you are missing files in `/sys/fs/cgroup`, you can
check whether you are missing anything important with
`zgrep CGROUP /proc/config.gz | grep -Ev 'DEBUG|=y'`.

[NOTE]
All examples on this page are tested with kernel 5.10.

== Enabling controllers

There are currently 8
controllersfootnote:controllers[link:{url-kernel-doc}#controllers[Kernel
documentation on Control Group v2, section “Controllers”]]: cpu, memory, io,
pids, cpuset, rdma, hugetlb and perf_event. You can find out which ones are
available with `cat cgroup.controllers`. perf_event is enabled automatically;
all others have to be enabled explicitly, for example with
`echo "+cpu +memory" > cgroup.subtree_control`. You can disable a controller by
using a `-` instead of a `+`.
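
Putting these commands together, a session in the cgroup v2 root could look
like this (the list of available controllers depends on your kernel
configuration):

[source,shell]
--------------------------------------------------------------------------------
cat cgroup.controllers
# e.g.: cpuset cpu io memory hugetlb pids rdma
echo "+cpu +memory" > cgroup.subtree_control
cat cgroup.subtree_control
# cpu memory
echo "-memory" > cgroup.subtree_control
cat cgroup.subtree_control
# cpu
--------------------------------------------------------------------------------
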
[quote, citetitle = "Kernel documentation on Control Group v2"]
________________________________________________________________________________
Resources are distributed top-down and a cgroup can further distribute a
resource only if the resource has been distributed to it from the parent. This
means that all non-root “cgroup.subtree_control” files can only contain
controllers which are enabled in the parent's “cgroup.subtree_control” file. A
controller can be enabled only if the parent has the controller enabled and a
controller can't be disabled if one or more children have it enabled. […]
Non-root cgroups can distribute domain resources to their children only when
they don't have any processes of their own. In other words, __only domain
cgroups which don't contain any processes can have domain controllers enabled
in their “cgroup.subtree_control”
files.__footnote:[link:{url-kernel-doc}#top-down-constraint[Kernel documentation
on Control Group v2, sections “Top-down Constraint” and “No Internal Process
Constraint”]]
________________________________________________________________________________

We will keep it simple by only enabling controllers globally, in the root
cgroup.
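
To see the “no internal process” constraint in action, try enabling a
controller for the children of a non-root cgroup that still contains a process.
This is a sketch using a throwaway cgroup I call `demo`; the exact error
message depends on your shell:

[source,shell]
--------------------------------------------------------------------------------
echo "+cpu" > cgroup.subtree_control      # make cpu available to child cgroups
mkdir demo
echo "$$" > demo/cgroup.procs             # demo now contains a process: our shell
echo "+cpu" > demo/cgroup.subtree_control
# echo: write error: Device or resource busy
echo "$$" > cgroup.procs                  # move the shell back to the root cgroup
rmdir demo
--------------------------------------------------------------------------------
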
== Controlling CPU usage

This control group will use the cpu controllerfootnote:controllers[]. Every
process in this group will be deprioritized, and all processes together can use
at most the power of 2 CPU cores.

[source,shell]
--------------------------------------------------------------------------------
echo "+cpu" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo "50" > cpu.weight
echo "200000 100000" > cpu.max
--------------------------------------------------------------------------------

Try adding your current shell to the group with `echo "$$" > cgroup.procs`. All
processes you start from this shell are now in the same cgroup. But what does
the example do, exactly?

- *cpu.weight* is the relative amount of CPU cycles the cgroup is getting under
load. The CPU cycles are distributed by adding up the weights of all _active_
children and giving each the fraction matching the ratio of its weight against
the sum.footnote:[link:{url-kernel-doc}#weights[Kernel documentation on
Control Group v2, section “Resource Distribution Models” → “Weights”]] It has
a range from 1 to 10,000. If one process has a weight of 3,000 and the only
other active process has a weight of 7,000, the former will get 30% and the
latter 70% of CPU cycles. The default is 100.
- *cpu.max* sets the “maximum bandwidth limit”. We told the kernel that the
processes should use at most 200,000 µs every 100,000 µs, meaning they can use
the power of up to 2 cores.

Try running `for process in $(seq 1 4); do (cat /dev/urandom > /dev/null &);
done`. You will see that the CPU usage of each process hovers at around 50%
instead of 100%. The processes inherited your shell's cgroup and were
automatically added to *cgroup.procs*.
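
You can check that the limit is actually enforced by reading *cpu.stat* in the
group: `nr_throttled` and `throttled_usec` keep growing while the bandwidth
limit kicks in (the numbers below are only illustrative):

[source,shell]
--------------------------------------------------------------------------------
cat cpu.stat
# usage_usec 20455783
# user_usec 3716504
# system_usec 16739279
# nr_periods 102
# nr_throttled 96
# throttled_usec 19478672
--------------------------------------------------------------------------------
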
[TIP]
You can add a cgroup column to link:{url-htop}[htop] by pressing kbd:[F2] and
then navigating to “Columns”. Select “CGROUP” in “Available Columns” and press
kbd:[Enter].

[TIP]
*cpu.weight.nice* is an alternate interface to *cpu.weight* that uses the same
values as link:{url-nice}[nice] and has a range from -20 to 19.

=== Controlling CPU core usage

This control group will use the cpuset controller to restrict the processes to
the CPU cores 0 and 3.

[source,shell]
--------------------------------------------------------------------------------
echo "+cpuset" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo "0,3" > cpuset.cpus
echo "$$" > cgroup.procs
--------------------------------------------------------------------------------

*cpuset.cpus* takes comma-separated numbers or ranges. For example:
“0-4,6,8-10”.
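
The kernel reports the cores it actually granted in *cpuset.cpus.effective*,
which is worth a look when a parent cgroup restricts the available cores:

[source,shell]
--------------------------------------------------------------------------------
cat cpuset.cpus.effective
# 0,3
--------------------------------------------------------------------------------
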
[TIP]
You can add a CPU column to link:{url-htop}[htop] by pressing kbd:[F2] and then
navigating to “Columns”. Select “PROCESSOR” in “Available Columns” and press
kbd:[Enter].

== Controlling memory usage

This control group will use the memory controllerfootnote:controllers[]. All
processes together can use at most 1
pass:[<abbr title="Gibibyte, 1024 Mebibyte">GiB</abbr>] of memory.

[source,shell]
--------------------------------------------------------------------------------
echo "+memory" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo "512M" > memory.high
echo "1G" > memory.max
echo "$$" > cgroup.procs
--------------------------------------------------------------------------------

- If the memory usage of the cgroup goes over *memory.high*, the kernel
throttles allocations by forcing them into direct reclaim to work off the
excess.
- *memory.max* is the hard limit. If the cgroup reaches that limit and the
memory usage cannot be reduced, the
pass:[<abbr title="Out Of Memory">OOM</abbr>] killer is invoked in the
cgroup.
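
To watch both limits at work, read *memory.current* and *memory.events*: the
`high` and `max` counters increase every time the respective limit was hit, and
`oom_kill` counts the processes killed in the group (the values below are only
illustrative):

[source,shell]
--------------------------------------------------------------------------------
cat memory.current
# 536870912
cat memory.events
# low 0
# high 2391
# max 131
# oom 1
# oom_kill 1
--------------------------------------------------------------------------------
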
== Controlling Input/Output usage

This control group will increase the IO priority and limit the write speed to 2
pass:[<abbr title="Mebibyte, 1024 Kibibyte">MiB</abbr>] per second using the io
controllerfootnote:controllers[]. IO limits are set per device. You need to
specify the major and minor device numbers of the _device_ (not partition) you
want to limit (in my case it is “8:0” for `/dev/sda`). Run `lsblk` or `cat
/proc/partitions` to find them.
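
For example, `lsblk` can print the numbers directly (output from my machine,
shortened):

[source,shell]
--------------------------------------------------------------------------------
lsblk -o NAME,MAJ:MIN,TYPE
# NAME   MAJ:MIN TYPE
# sda      8:0   disk
# ├─sda1   8:1   part
# └─sda2   8:2   part
--------------------------------------------------------------------------------
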
[source,shell]
--------------------------------------------------------------------------------
echo "+io" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo "default 500" > io.weight
echo "8:0 wbps=$((2 * 1024 * 1024)) rbps=max" > io.max
echo "$$" > cgroup.procs
--------------------------------------------------------------------------------

- *io.weight* specifies the relative amount of IO time the cgroup can use in
relation to its siblings and has a range from 1 to 10,000. The priority can
be overridden for individual devices with the major:minor syntax, like “8:0
90”. The default value is 100.
- *io.max* limits bytes per second (_rbps/wbps_) and/or IO operations per second
(_riops/wiops_).

Try running ``dd if=/dev/zero bs=1M count=100 of=test.img
oflag=direct``footnote:[`oflag=direct` opens the file with the `O_DIRECT` flag,
bypassing caches.]. You will see that the speed is around 2 MiB per second.
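
The io controller also exposes per-device statistics in *io.stat*, which lets
you confirm that the limit applies to the device you meant (the values below
are only illustrative):

[source,shell]
--------------------------------------------------------------------------------
cat io.stat
# 8:0 rbytes=1459200 wbytes=104857600 rios=192 wios=100 dbytes=0 dios=0
--------------------------------------------------------------------------------
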
[TIP]
Kernel 5.14 introduced
**blkio.prio.class**footnote:[link:{url-kernel-doc-14}#io-priority[Kernel
documentation on Control Group v2, section “IO Priority”]], which controls the
IO priority. It seems to work like link:{url-ionice}[ionice]. I could not test
it yet, since I am running kernel 5.10.

== Controlling process numbers

This control group will limit the number of processes to 10 using the process
number controllerfootnote:controllers[].

[source,shell]
--------------------------------------------------------------------------------
echo "+pids" > cgroup.subtree_control
mkdir testgroup
cd testgroup
echo 10 > pids.max
echo "$$" > cgroup.procs
--------------------------------------------------------------------------------

Try running `for process in $(seq 1 10); do ((sleep 2 && echo ${process}) &);
done`. You will get error messages from your shell saying that it cannot fork
another process.
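
*pids.current* shows how many processes are currently in the cgroup, and the
`max` counter in *pids.events* increments each time a fork fails because of the
limit (the values below are only illustrative):

[source,shell]
--------------------------------------------------------------------------------
cat pids.current
# 10
cat pids.events
# max 3
--------------------------------------------------------------------------------
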
// LocalWords: cgroups cgroup cpuset