---
title: "Control Group version 2 by hand"
slug: "cgroup-v2-by-hand"
description: null
date: 2021-07-17T17:05:00+02:00
type: posts
draft: false
tags:
- CGroup
- Linux
toc: true
---

:source-highlighter: pygments
:idprefix:
:experimental: true
:toc:
:toclevels: 2

:url-openrc: https://wiki.gentoo.org/wiki/OpenRC/CGroups
:url-kernel-doc: https://www.kernel.org/doc/html/v5.10/admin-guide/cgroup-v2.html
:url-kernel-doc-14: https://www.kernel.org/doc/html/v5.14-rc1/admin-guide/cgroup-v2.html
:url-nice: https://manpages.debian.org/buster/coreutils/nice.1.en.html
:url-htop: https://htop.dev/
:url-ionice: https://manpages.debian.org/buster/util-linux/ionice.1.en.html
:url-linux-git: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Control Group v2 has been available since 2016, but I had trouble finding good
documentation on how to use it. Most tutorials and blog posts only cover v1 or
are specific to
systemdfootnote:[If you are looking for OpenRC-specific documentation, take a
look at the link:{url-openrc}[article in the Gentoo Wiki]]. The
link:{url-kernel-doc}[kernel documentation] is a great reference and the basis
for this post, but it is not always easy to follow. I will give you a few short
examples of how to use cgroup v2. I will not explain everything, but hopefully
enough for you to get an idea and understand the reference better.

Your interface to cgroups is a special file system. Most distributions have
cgroup v1 mounted at `/sys/fs/cgroup` and cgroup v2 at
`/sys/fs/cgroup/unified`. Some distributions have removed v1 support by default
and have v2 mounted at `/sys/fs/cgroup`. You can find out where cgroup v2 is
mounted with `mount | grep cgroup2`. If it is not mounted, you can mount it
yourself with `mount -t cgroup2 none /sys/fs/cgroup/unified`. In theory you can
mount it anywhere you like, but tools expect it at one of the paths mentioned
above. Going forward I will assume you are in a terminal in the cgroup v2
directory.

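If you would rather script the lookup than read the `mount` output, something
like the following could work. It is only a sketch: the function name
`cgroup2_mountpoint` and its optional file argument are my own invention, and
it assumes the usual `/proc/mounts` layout (device, mount point, file-system
type, options).

```shell
# Print the first cgroup v2 mount point found in a file in /proc/mounts format.
# The name and the optional file argument are illustrative, not a standard tool.
cgroup2_mountpoint() {
    # Field 3 is the file-system type, field 2 the mount point.
    awk '$3 == "cgroup2" { print $2; exit }' "${1:-/proc/mounts}"
}
```

With that in place, `cd "$(cgroup2_mountpoint)"` should take you to the cgroup
v2 directory, provided one is mounted.
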
Linux distributions should have all cgroup options compiled in. If you built the
kernel yourself, or you are missing files in `/sys/fs/cgroup`, you can check
whether anything important is missing with
`zgrep CGROUP /proc/config.gz | grep -Ev 'DEBUG|=y'`. It lists every non-debug
cgroup option that is not built in.

[NOTE]
All examples on this page are tested with kernel 5.10.

== Enabling controllers

There are 8 controllers
currentlyfootnote:controllers[link:{url-kernel-doc}#controllers[Kernel
documentation on Control Group v2, section “Controllers”]]: cpu, memory, io,
pids, cpuset, rdma, hugetlb and perf_event. You can find out which are
available with `cat cgroup.controllers`. perf_event is enabled automatically;
all others have to be enabled explicitly, for example with
`echo "+cpu +memory" > cgroup.subtree_control`. You can disable a controller by
using a `-` instead of a `+`.

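If you want to enable everything that *cgroup.controllers* lists in one go, the
list has to be turned into the `+name` form first. Here is a rough sketch of
that conversion. The helper name `controllers_to_enable` is made up for this
post, and I am assuming the file contains a single space-separated line of
controller names.

```shell
# Turn a controller list such as "cpu memory io" (read from standard input)
# into "+cpu +memory +io", the form cgroup.subtree_control expects.
# The function name is illustrative.
controllers_to_enable() {
    awk '{
        for (i = 1; i <= NF; i++)
            printf "%s+%s", (i > 1 ? " " : ""), $i
        print ""
    }'
}
```

You could then run
`controllers_to_enable < cgroup.controllers > cgroup.subtree_control`. The
kernel may still reject the write, for example in a non-root cgroup that
contains processes.
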
[quote, citetitle = "Kernel documentation on Control Group v2"]
________________________________________________________________________________
Resources are distributed top-down and a cgroup can further distribute a
resource only if the resource has been distributed to it from the parent. This
means that all non-root “cgroup.subtree_control” files can only contain
controllers which are enabled in the parent’s “cgroup.subtree_control” file. A
controller can be enabled only if the parent has the controller enabled and a
controller can’t be disabled if one or more children have it enabled. […]
Non-root cgroups can distribute domain resources to their children only when
they don’t have any processes of their own. In other words, __only domain
cgroups which don’t contain any processes can have domain controllers enabled in
their “cgroup.subtree_control”
files.__footnote:[link:{url-kernel-doc}#top-down-constraint[Kernel documentation
on Control Group v2, sections “Top-down Constraint” and “No Internal Process
Constraint”]]
________________________________________________________________________________

We will keep it simple by only setting controllers globally in our root cgroup.

== Controlling CPU usage

This control group will use the cpu
controllerfootnote:[link:{url-kernel-doc}#cpu[Kernel documentation on Control
Group v2, section Controllers → CPU]]. Every process in this group will be
deprioritized, and all processes together can use at most the power of 2 CPU
cores.

[source,shell]
--------------------------------------------------------------------------------
# Make the cpu controller available to child cgroups
echo "+cpu" > cgroup.subtree_control
mkdir testgroup
cd testgroup
# Half of the default weight of 100
echo "50" > cpu.weight
# At most 200,000 µs of CPU time per 100,000 µs period
echo "200000 100000" > cpu.max
--------------------------------------------------------------------------------

Try adding your current shell to the group with `echo "$$" > cgroup.procs`. All
processes you start from this shell are now in the same cgroup. But what does
the example do, exactly?

- *cpu.weight* is the relative amount of CPU cycles the cgroup gets under
load. The CPU cycles are distributed by adding up the weights of all _active_
children and giving each the fraction matching the ratio of its weight against
the sum.footnote:weight[link:{url-kernel-doc}#weights[Kernel documentation on
Control Group v2, section “Resource Distribution Models” → “Weights”]] It has
a range from 1 to 10,000. If one cgroup has a weight of 3,000 and the only
other active cgroup has a weight of 7,000, the former will get 30% and the
latter 70% of CPU cycles. The default is 100.
- *cpu.max* sets the “maximum bandwidth limit”. We told the kernel that the
processes may use at most 200,000 µs of CPU time in every 100,000 µs period,
meaning they can use the power of up to 2 cores.

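The arithmetic behind both files is simple enough to check in the shell. The
two helpers below are just a sketch for playing with the numbers; the names
`cpu_max_cores` and `weight_share` are mine, and the share is rounded down to a
whole percent.

```shell
# Convert a cpu.max line ("QUOTA PERIOD", both in microseconds, read from
# standard input) into the number of cores the cgroup may use.
cpu_max_cores() {
    awk '{ if ($1 == "max") print "unlimited"; else printf "%.2f\n", $1 / $2 }'
}

# Percentage of CPU cycles a cgroup gets under load. The first argument is the
# weight of the cgroup itself, the remaining ones are the weights of the other
# active siblings.
weight_share() {
    own=$1
    total=0
    for weight in "$@"; do
        total=$((total + weight))
    done
    echo $((100 * own / total))
}
```

`cpu_max_cores < cpu.max` should print `2.00` for the example above, and
`weight_share 3000 7000` prints `30`. With our weight of 50 next to a single
sibling at the default of 100, `weight_share 50 100` gives `33`.
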
Try running
`for process in $(seq 1 4); do (cat /dev/urandom > /dev/null &); done`. You will
see that the CPU usage of each process hovers at around 50% instead of 100%,
because 4 busy processes share the power of 2 cores. The processes were added to
*cgroup.procs* automatically.

[TIP]
You can add a cgroup column to link:{url-htop}[htop] by pressing kbd:[F2] and
then navigating to “Columns”. Select “CGROUP” in “Available Columns” and press
kbd:[Enter].

[TIP]
*cpu.weight.nice* is an alternate interface to *cpu.weight* that uses the same
values used by link:{url-nice}[nice] and has a range from -20 to 19.

=== Controlling CPU core usage

This control group will use the cpuset
controllerfootnote:[link:{url-kernel-doc}#cpuset[Kernel documentation on Control
Group v2, section Controllers → Cpuset]] to restrict the processes to the CPU
cores 0 and 3.

[source,shell]
--------------------------------------------------------------------------------
echo "+cpuset" > cgroup.subtree_control
mkdir testgroup
cd testgroup
# Allow only the CPU cores 0 and 3
echo "0,3" > cpuset.cpus
# Move the current shell into the cgroup
echo "$$" > cgroup.procs
--------------------------------------------------------------------------------

*cpuset.cpus* takes comma-separated numbers or ranges. For example:
“0-4,6,8-10”.

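To see which cores such a list actually covers, you can expand it. This is a
small sketch with a made-up helper name (`expand_cpu_list`); it assumes the
list is well-formed and relies on `seq`.

```shell
# Expand a cpuset list such as "0-4,6,8-10" into one CPU number per line.
# The function name is illustrative.
expand_cpu_list() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r first last; do
        # A single number has no upper bound, so it counts from itself to itself.
        seq "$first" "${last:-$first}"
    done
}
```

`expand_cpu_list "0-4,6,8-10"` prints the nine numbers 0 to 4, 6 and 8 to 10,
one per line.
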
[TIP]
You can add a CPU column to link:{url-htop}[htop] by pressing kbd:[F2] and then
navigating to “Columns”. Select “PROCESSOR” in “Available Columns” and press
kbd:[Enter].

== Controlling memory usage

This control group will use the memory
controllerfootnote:[link:{url-kernel-doc}#memory[Kernel documentation on Control
Group v2, section Controllers → Memory]]. All processes together can use at most
1 pass:[<abbr title="Gibibyte, 1024 Mebibyte">GiB</abbr>] of memory.

[source,shell]
--------------------------------------------------------------------------------
echo "+memory" > cgroup.subtree_control
mkdir testgroup
cd testgroup
# Start throttling allocations above 512 MiB
echo "512M" > memory.high
# Hard limit of 1 GiB
echo "1G" > memory.max
echo "$$" > cgroup.procs
--------------------------------------------------------------------------------

- If the memory usage of a cgroup goes over *memory.high*, the kernel throttles
allocations by forcing them into direct reclaim to work off the excess.
- *memory.max* is the hard limit. If the cgroup reaches that limit and the
memory usage cannot be reduced, the
pass:[<abbr title="Out Of Memory">OOM</abbr>] killer is invoked in the
cgroup.

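The kernel counts how often these limits were hit in *memory.events*, a file
with one `key value` pair per line (`low`, `high`, `max`, `oom` and `oom_kill`
on my kernel). A minimal sketch for reading a single counter could look like
this; the helper name `cgroup_event_count` is hypothetical.

```shell
# Print the counter stored under KEY in a flat-keyed cgroup file such as
# memory.events (lines of the form "KEY VALUE").
# Usage: cgroup_event_count FILE KEY (the function name is illustrative).
cgroup_event_count() {
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}
```

For example, `cgroup_event_count memory.events oom_kill` shows how many
processes in the cgroup were killed by the OOM killer, and
`cgroup_event_count memory.events high` shows how often the cgroup was
throttled for exceeding *memory.high*.
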
== Controlling Input/Output usage

This control group will increase the IO priority and limit the write speed to 2
pass:[<abbr title="Mebibyte, 1024 Kibibyte">MiB</abbr>] per second using the io
controllerfootnote:io[link:{url-kernel-doc}#io[Kernel documentation on Control
Group v2, section Controllers → IO]]. IO limits are set per device. You need to
specify the major and minor device numbers of the _device_ (not the partition)
you want to limit (in my case it is “8:0” for `/dev/sda`). Run `lsblk` or
`cat /proc/partitions` to find them.

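If you need the device numbers in a script, you can pull them out of
`/proc/partitions`, which lists major number, minor number, block count and
name in four columns. The following is only a sketch with an invented helper
name (`device_numbers`).

```shell
# Print "MAJOR:MINOR" for a block device name such as "sda", read from a file
# in /proc/partitions format (columns: major, minor, #blocks, name).
# Usage: device_numbers NAME [FILE] (the function name is illustrative).
device_numbers() {
    awk -v name="$1" '$4 == name { print $1 ":" $2 }' "${2:-/proc/partitions}"
}
```

On my machine `device_numbers sda` prints `8:0`, which is the value used in the
example below.
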
[source,shell]
--------------------------------------------------------------------------------
echo "+io" > cgroup.subtree_control
mkdir testgroup
cd testgroup
# Five times the default weight of 100
echo "default 500" > io.weight
# Limit writes to device 8:0 to 2 MiB per second, leave reads unlimited
echo "8:0 wbps=$((2 * 1024 * 1024)) rbps=max" > io.max
echo "$$" > cgroup.procs
--------------------------------------------------------------------------------

- *io.weight* specifies the relative amount of IO time the cgroup can use in
relation to its siblings and has a range from 1 to 10,000.footnote:weight[]
The weight can be overridden for individual devices with the major:minor
syntax, like “8:0 90”. The default value is 100.
- *io.max* limits bytes per second (_rbps/wbps_) and/or IO operations per second
(_riops/wiops_).

Try running ``dd if=/dev/zero bs=1M count=100 of=test.img
oflag=direct``footnote:[`oflag=direct` opens the file with the `O_DIRECT` flag,
bypassing caches.]. You will see that the speed is around 2 MiB a second.

[TIP]
Kernel 5.14 introduced
**blkio.prio.class**footnote:[link:{url-kernel-doc-14}#io-priority[Kernel
documentation on Control Group v2, section “IO Priority”]] that controls the IO
priority. It seems to work similarly to link:{url-ionice}[ionice].

[IMPORTANT]
Weight-based distribution (*io.weight*) is available only if cfq-iosched is in
use, and absolute bandwidth or IOPS limit distribution (*io.max*) is not
available for blk-mq devices.footnote:io[] The CFQ scheduler was removed in
kernel 5.0.footnote:[link:{url-linux-git}/commit/?id=f382fb0[git commit: “block:
remove legacy IO schedulers”]]

== Controlling process numbers

This control group will limit the number of processes to 10 using the process
number controllerfootnote:[link:{url-kernel-doc}#pid[Kernel documentation on
Control Group v2, section Controllers → PID]].

[source,shell]
--------------------------------------------------------------------------------
echo "+pids" > cgroup.subtree_control
mkdir testgroup
cd testgroup
# Allow at most 10 processes in this cgroup
echo 10 > pids.max
echo "$$" > cgroup.procs
--------------------------------------------------------------------------------

Try running
`for process in $(seq 1 10); do ((sleep 2 && echo ${process}) &); done`. You
will get error messages from your shell that it cannot fork another process,
since the shell itself counts toward the limit as well.


// LocalWords: cgroups cgroup cpuset