This was deprecated in commit e6a4870e4ac40 ("libct: better errors for
hooks"), and users have had ample time to migrate to Hooks.Run since.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The per-file deprecation in cgroup_deprecated.go is not working,
let's replace it.
Link to Hooks.Run in Hook.Run deprecation notice.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Implement support for passing Linux Network Devices to the container
network namespace.
The network device is passed during the creation of the container,
before the process is started.
It implements the logic defined in the OCI runtime specification.
Signed-off-by: Antonio Ojea <aojea@google.com>
This makes the state.json file 1303 bytes or almost 25% smaller (when
using the default spec, YMMV) by omitting default values.
Before: 5496 bytes
After: 4193 bytes
(With cgroups#9 applied, the new size is 3424, which is almost 40%
savings, compared to the original).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
As per
- https://github.com/opencontainers/runtime-spec/pull/1253
- https://github.com/opencontainers/runtime-spec/pull/1261
CPU affinity can be set in two ways:
1. When creating/starting a container, in config.json's
Process.ExecCPUAffinity, which is when applied to all execs.
2. When running an exec, in process.json's CPUAffinity, which
applied to a given exec and overrides the value from (1).
Add some basic tests.
Note that older kernels (RHEL8, Ubuntu 20.04) change CPU affinity of a
process to that of a container's cgroup, as soon as it is moved to that
cgroup, while newer kernels (Ubuntu 24.04, Fedora 41) don't do that.
Because of the above,
- it's impossible to really test initial CPU affinity without adding
debug logging to libcontainer/nsenter;
- for older kernels, there can be a brief moment when exec's affinity
is different than either initial or final affinity being set;
- exec's final CPU affinity, if not specified, can be different
depending on the kernel, therefore we don't test it.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This removes libcontainer/cgroups packages and starts
using those from github.com/opencontainers/cgroups repo.
Mostly generated by:
git rm -f libcontainer/cgroups
find . -type f -name "*.go" -exec sed -i \
's|github.com/opencontainers/runc/libcontainer/cgroups|github.com/opencontainers/cgroups|g' \
{} +
go get github.com/opencontainers/cgroups@v0.0.1
make vendor
gofumpt -w .
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Use the old package name as an alias to minimize the patch.
No functional change; this just eliminates a bunch of deprecation
warnings.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Make CommandHook.Command a pointer, which reduces the amount of data
being copied when using hooks, and allows to modify command hooks.
2. Add SetDefaultEnv, which is to be used by the next commit.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is an internal implementation detail and should not be either
public or visible.
Amend setIOPriority to do own class conversion.
Fixes: bfbd0305 ("Add I/O priority")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This allows to omit a call to c.currentOCIState (which can be somewhat
costly when there are many annotations) when the hooks of a given kind
won't be run.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Using ints for all of our mapping structures means that a 32-bit binary
errors out when trying to parse /proc/self/*id_map:
failed to cache mappings for userns: failed to parse uid_map of userns /proc/1/ns/user:
parsing id map failed: invalid format in line " 0 0 4294967295": integer overflow on token 4294967295
This issue was unearthed by commit 1912d5988b ("*: actually support
joining a userns with a new container") but the underlying issue has
been present since the docker/libcontainer days.
In theory, switching to uint32 (to match the spec) instead of int64
would also work, but keeping everything signed seems much less
error-prone. It's also important to note that a mapping might be too
large for an int on 32-bit, so we detect this during the mapping.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The original reasoning for this option was to avoid having mount options
be overwritten by runc. However, adding command-line arguments has
historically been a bad idea because it forces strict-runc-compatible
OCI runtimes to copy out-of-spec features directly from runc and these
flags are usually quite difficult to enable by users when using runc
through several layers of engines and orchestrators.
A far more preferable solution is to have a heuristic which detects
whether copying the original mount's mount options would override an
explicit mount option specified by the user. In this case, we should
return an error. You only end up in this path in the userns case, if you
have a bind-mount source with locked flags.
During the course of writing this patch, I discovered that several
aspects of our handling of flags for bind-mounts left much to be
desired. We have completely botched the handling of explicitly cleared
flags since commit 97f5ee4e6a ("Only remount if requested flags differ
from current"), with our behaviour only becoming increasingly more weird
with 50105de1d8 ("Fix failure with rw bind mount of a ro fuse") and
da780e4d27 ("Fix bind mounts of filesystems with certain options
set"). In short, we would only clear flags explicitly request by the
user purely by chance, in ways that it really should've been reported to
us by now. The most egregious is that mounts explicitly marked "rw" were
actually mounted "ro" if the bind-mount source was "ro" and no other
special flags were included. In addition, our handling of atime was
completely broken -- mostly due to how subtle the semantics of atime are
on Linux.
Unfortunately, while the runtime-spec requires us to implement
mount(8)'s behaviour, several aspects of the util-linux mount(8)'s
behaviour are broken and thus copying them makes little sense. Since the
runtime-spec behaviour for this case (should mount options for a "bind"
mount use the "mount --bind -o ..." or "mount --bind -o remount,..."
semantics? Is the fallback code we have for userns actually
spec-compliant?) and the mount(8) behaviour (see [1]) are not
well-defined, this commit simply fixes the most obvious aspects of the
behaviour that are broken while keeping the current spirit of the
implementation.
NOTE: The handling of atime in the base case is left for a future PR to
deal with. This means that the atime of the source mount will be
silently left alone unless the fallback path needs to be taken, and any
flags not explicitly set will be cleared in the base case. Whether we
should always be operating as "mount --bind -o remount,..." (where we
default to the original mount source flags) is a topic for a separate PR
and (probably) associated runtime-spec PR.
So, to resolve this:
* We store which flags were explicitly requested to be cleared by the
user, so that we can detect whether the userns fallback path would end
up setting a flag the user explicitly wished to clear. If so, we
return an error because we couldn't fulfil the configuration settings.
* Revert 97f5ee4e6a ("Only remount if requested flags differ from
current"), as missing flags do not mean we can skip MS_REMOUNT (in
fact, missing flags are how you indicate a flag needs to be cleared
with mount(2)). The original purpose of the patch was to fix the
userns issue, but as mentioned above the correct mechanism is to do a
fallback mount that copies the lockable flags from statfs(2).
* Improve handling of atime in the fallback case by:
- Correctly handling the returned flags in statfs(2).
- Implement the MNT_LOCK_ATIME checks in our code to ensure we
produce errors rather than silently producing incorrect atime
mounts.
* Improve the tests so we correctly detect all of these contingencies,
including a general "bind-mount atime handling" test to ensure that
the behaviour described here is accurate.
This change also inlines the remount() function -- it was only ever used
for the bind-mount remount case, and its behaviour is very bind-mount
specific.
[1]: https://github.com/util-linux/util-linux/issues/2433
Reverts: 97f5ee4e6a ("Only remount if requested flags differ from current")
Fixes: 50105de1d8 ("Fix failure with rw bind mount of a ro fuse")
Fixes: da780e4d27 ("Fix bind mounts of filesystems with certain options set")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
When a hook has failed, the error message looks like this:
> error running hook: error running hook #1: exit status 1, stdout: ...
The two problems here are:
1. it is impossible to know what kind of hook it was;
2. "error running hook" stuttering;
Change that to
> error running createContainer hook #1: exit status 1, stdout: ...
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
"time" namespace was introduced in Linux v5.6
support new time namespace to set boottime and monotonic time offset
Example runtime spec
"timeOffsets": {
"monotonic": {
"secs": 172800,
"nanosecs": 0
},
"boottime": {
"secs": 604800,
"nanosecs": 0
}
}
Signed-off-by: Chethan Suresh <chethan.suresh@sony.com>
Currently bind mounts of filesystems with nodev, nosuid, noexec,
noatime, relatime, strictatime, nodiratime options set fail in rootless
mode if the same options are not set for the bind mount.
For ro filesystems this was resolved by #2570 by remounting again
with ro set.
Follow the same approach for nodev, nosuid, noexec, noatime, relatime,
strictatime, nodiratime but allow to revert back to the old behaviour
via the new `--no-mount-fallback` command line option.
Add a testcase to verify that bind mounts of filesystems with nodev,
nosuid, noexec, noatime options set work in rootless mode.
Add a testcase that mounts a nodev, nosuid, noexec, noatime filesystem
with a ro flag.
Add two further testcases that ensure that the above testcases would
fail if the `--no-mount-fallback` command line option is set.
* contrib/completions/bash/runc:
Add `--no-mount-fallback` command line option for bash completion.
* create.go:
Add `--no-mount-fallback` command line option.
* restore.go:
Add `--no-mount-fallback` command line option.
* run.go:
Add `--no-mount-fallback` command line option.
* libcontainer/configs/config.go:
Add `NoMountFallback` field to the `Config` struct to store
the command line option value.
* libcontainer/specconv/spec_linux.go:
Add `NoMountFallback` field to the `CreateOpts` struct to store
the command line option value and store it in the libcontainer
config.
* utils_linux.go:
Store the command line option value in the `CreateOpts` struct.
* libcontainer/rootfs_linux.go:
In case that `--no-mount-fallback` is not set try to remount the
bind filesystem again with the options nodev, nosuid, noexec,
noatime, relatime, strictatime or nodiratime if they are set on
the source filesystem.
* tests/integration/mounts_sshfs.bats:
Add testcases and rework sshfs setup to allow specifying
different mount options depending on the test case.
Signed-off-by: Ruediger Pluem <ruediger.pluem@vodafone.com>
Two new seccomp actions have been added to the libseccomp-golang
dependency, which can be now supported by runc, too.
ActKillThread kills the thread that violated the rule. It is the same as
ActKill. All other threads from the same thread group will continue to
execute.
ActKillProcess kills the process that violated the rule. All threads in
the thread group are also terminated. This action is only usable when
libseccomp API level 3 or higher is supported.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This commit implements support for the SCMP_ACT_NOTIFY action. It
requires libseccomp-2.5.0 to work but runc still works with older
libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY
action.
A new synchronization step between runc[INIT] and runc run is introduced
to pass the seccomp fd. runc run fetches the seccomp fd with pidfd_get
from the runc[INIT] process and sends it to the seccomp agent using
SCM_RIGHTS.
As suggested by @kolyshkin, we also make writeSync() a wrapper of
writeSyncWithFd() and wrap the error there. To avoid pointless errors,
we made some existing code paths just return the error instead of
re-wrapping it. If we don't do it, error will look like:
writing syncT <act>: writing syncT: <err>
By adjusting the code path, now they just look like this
writing syncT <act>: <err>
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
This should result in no change when the error is printed, but make the
errors returned unwrappable, meaning errors.As and errors.Is will work.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.
Brought to you by
git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w
Looking at the diff, all these changes make sense.
Also, replace gofmt with gofumpt in golangci.yml.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit ccdd75760c introduced the HookName type
for hooks, but only set this type on the Prestart const, but not for the
other hooks.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
As reported by go test -race ./libcontainer/configs:
=== RUN TestCommandHookRunTimeout
==================
WARNING: DATA RACE
Read at 0x00c000202230 by goroutine 23:
os/exec.(*Cmd).Wait()
/usr/lib/golang/src/os/exec/exec.go:502 +0x91
github.com/opencontainers/runc/libcontainer/configs.Command.Run()
/home/kir/go/src/github.com/opencontainers/runc/libcontainer/configs/config.go:390 +0x58c
github.com/opencontainers/runc/libcontainer/configs_test.TestCommandHookRunTimeout()
/home/kir/go/src/github.com/opencontainers/runc/libcontainer/configs/config_test.go:223 +0x3ed
testing.tRunner()
/usr/lib/golang/src/testing/testing.go:1123 +0x202
Previous write at 0x00c000202230 by goroutine 27:
os/exec.(*Cmd).Wait()
/usr/lib/golang/src/os/exec/exec.go:505 +0xb4
github.com/opencontainers/runc/libcontainer/configs.Command.Run.func1()
/home/kir/go/src/github.com/opencontainers/runc/libcontainer/configs/config.go:373 +0x55
Goroutine 23 (running) created at:
testing.(*T).Run()
/usr/lib/golang/src/testing/testing.go:1168 +0x5bb
testing.runTests.func1()
/usr/lib/golang/src/testing/testing.go:1439 +0xa6
testing.tRunner()
/usr/lib/golang/src/testing/testing.go:1123 +0x202
testing.runTests()
/usr/lib/golang/src/testing/testing.go:1437 +0x612
testing.(*M).Run()
/usr/lib/golang/src/testing/testing.go:1345 +0x3b3
main.main()
_testmain.go:69 +0x236
Goroutine 27 (running) created at:
github.com/opencontainers/runc/libcontainer/configs.Command.Run()
/home/kir/go/src/github.com/opencontainers/runc/libcontainer/configs/config.go:372 +0x415
github.com/opencontainers/runc/libcontainer/configs_test.TestCommandHookRunTimeout()
/home/kir/go/src/github.com/opencontainers/runc/libcontainer/configs/config_test.go:223 +0x3ed
testing.tRunner()
/usr/lib/golang/src/testing/testing.go:1123 +0x202
==================
testing.go:1038: race detected during execution of test
--- FAIL: TestCommandHookRunTimeout (0.10s)
Apparently, the issue is we call two Wait()s for the same command
which can race internally.
Fix is easy -- since we already have a waiting goroutine,
wait for it to return instead of calling a second Wait().
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Move the Device-related types to libcontainer/devices, so that
the package can be used in isolation. Aliases have been created
in libcontainer/configs for backward compatibility.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Finish off the work started in a344b2d6 (sync up `HookState` with OCI
spec `State`, 2016-12-19, #1201).
And drop HookState, since there's no need for a local alias for
specs.State.
Also set c.initProcess in newInitProcess to support OCIState calls
from within initProcess.start(). I think the cyclic references
between linuxContainer and initProcess are unfortunate, but didn't
want to address that here.
I've also left the timing of the Prestart hooks alone, although the
spec calls for them to happen before start (not as part of creation)
[1,2]. Once the timing gets fixed we can drop the
initProcessStartTime hacks which initProcess.start currently needs.
I'm not sure why we trigger the prestart hooks in response to both
procReady and procHooks. But we've had two prestart rounds in
initProcess.start since 2f276498 (Move pre-start hooks after container
mounts, 2016-02-17, #568). I've left that alone too.
I really think we should have len() guards to avoid computing the
state when .Hooks is non-nil but the particular phase we're looking at
is empty. Aleksa, however, is adamantly against them [3] citing a
risk of sloppy copy/pastes causing the hook slice being len-guarded to
diverge from the hook slice being iterated over within the guard. I
think that ort of thing is very lo-risk, because:
* We shouldn't be copy/pasting this, right? DRY for the win :).
* There's only ever a few lines between the guard and the guarded
loop. That makes broken copy/pastes easy to catch in review.
* We should have test coverage for these. Guarding with the wrong
slice is certainly not the only thing you can break with a sloppy
copy/paste.
But I'm not a maintainer ;).
[1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.0/config.md#prestart
[2]: https://github.com/opencontainers/runc/issues/1710
[3]: https://github.com/opencontainers/runc/pull/1741#discussion_r233331570
Signed-off-by: W. Trevor King <wking@tremily.us>
Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature
of Intel Resource Director Technology (RDT) which is supported on some
Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate
throttle over memory bandwidth for the software. A user controls the
resource by indicating the percentage of maximum memory bandwidth.
Hardware details of Intel RDT/MBA can be found in section 17.18 of
Intel Software Developer Manual:
https://software.intel.com/en-us/articles/intel-sdm
In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel
config CONFIG_INTEL_RDT. If hardware support, CPU flags `rdt_a` and
`mba` will be set in /proc/cpuinfo.
Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
| |-- L3
| | |-- cbm_mask
| | |-- min_cbm_bits
| | |-- num_closids
| |-- MB
| |-- bandwidth_gran
| |-- delay_linear
| |-- min_bandwidth
| |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
|-- ...
|-- schemata
|-- tasks
For MBA support for `runc`, we will reuse the infrastructure and code
base of Intel RDT/CAT which implemented in #1279. We could also make
use of `tasks` and `schemata` configuration for memory bandwidth
resource constraints.
The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the
task ID to the "tasks" file (which will automatically remove them from
the previous group to which they belonged). New tasks created by
fork(2) and clone(2) are added to the same group as their parent.
The file `schemata` has a list of all the resources available to this
group. Each resource (L3 cache, memory bandwidth) has its own line and
format.
Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket, which
contains L3 cache id and memory bandwidth percentage.
Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
The minimum bandwidth percentage value for each CPU model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the CPU model and
can be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are
rounded to the next control step available on the hardware.
For more information about Intel RDT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
An example for runc:
Consider a two-socket machine with two L3 caches where the minimum
memory bandwidth of 10% with a memory bandwidth granularity of 10%.
Tasks inside the container may use a maximum memory bandwidth of 20%
on socket 0 and 70% on socket 1.
"linux": {
"intelRdt": {
"memBwSchema": "MB:0=20;1=70"
}
}
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.
`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.
`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)
When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.
This PR does not have any impact on CLI flags and `state.json`.
Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
For runc-in-userns, `runc spec` without `--rootless` should work, when sufficient numbers of
UID/GID are mapped.
Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
(`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.
Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
Previously if oomScoreAdj was not set in config.json we would implicitly
set oom_score_adj to 0. This is not allowed according to the spec:
> If oomScoreAdj is not set, the runtime MUST NOT change the value of
> oom_score_adj.
Change this so that we do not modify oom_score_adj if oomScoreAdj is not
present in the configuration. While this modifies our internal
configuration types, the on-disk format is still compatible.
Signed-off-by: Aleksa Sarai <asarai@suse.de>