zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2026-04-24 08:35:53 +08:00

Author	SHA1	Message	Date
Aleksa Sarai	87b0804345	libct: remove deprecated HooksList.RunHooks This was deprecated in commit 6a4870e4ac40 ("libct: better errors for hooks"), and users have had ample time to migrate to Hooks.Run since. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2026-03-04 23:01:30 +11:00
Antti Kervinen	eda7bdf80c	Add memory policy support Implement support for Linux memory policy in OCI spec PR: https://github.com/opencontainers/runtime-spec/pull/1282 Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>	2025-10-07 15:06:37 +03:00
Kir Kolyshkin	89e59902c4	Modernize code for Go 1.24 Brought to you by modernize -fix -test ./... Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-08-27 19:11:02 -07:00
Kir Kolyshkin	71bd84f32e	Merge pull request #4784 from kolyshkin/cgr-fup cgroups separation followup	2025-06-19 10:32:33 -07:00
Kir Kolyshkin	d22a42113d	libct/configs: stop using deprecated id Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-06-18 18:14:54 -07:00
Kir Kolyshkin	b25bcaa8b3	libct/configs: fix/improve deprecation notices The per-file deprecation in cgroup_deprecated.go is not working, let's replace it. Link to Hooks.Run in Hook.Run deprecation notice. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-06-18 18:14:46 -07:00
Kir Kolyshkin	a10d338eb2	libct/configs: add package docstring Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-06-18 18:10:51 -07:00
Antonio Ojea	8d180e9658	Add support for Linux Network Devices Implement support for passing Linux Network Devices to the container network namespace. The network device is passed during the creation of the container, before the process is started. It implements the logic defined in the OCI runtime specification. Signed-off-by: Antonio Ojea <aojea@google.com>	2025-06-18 15:52:30 +01:00
Kir Kolyshkin	7fdec327a0	Use any instead of interface{} The keyword is available since Go 1.18 (see https://pkg.go.dev/builtin#any). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-03-31 17:15:06 -07:00
Kir Kolyshkin	3a33b6a3df	Make state.json 25% smaller This makes the state.json file 1303 bytes or almost 25% smaller (when using the default spec, YMMV) by omitting default values. Before: 5496 bytes After: 4193 bytes (With cgroups#9 applied, the new size is 3424, which is almost 40% savings, compared to the original). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-03-19 15:51:52 -07:00
Kir Kolyshkin	10ca66bff5	runc exec: implement CPU affinity As per - https://github.com/opencontainers/runtime-spec/pull/1253 - https://github.com/opencontainers/runtime-spec/pull/1261 CPU affinity can be set in two ways: 1. When creating/starting a container, in config.json's Process.ExecCPUAffinity, which is then applied to all execs. 2. When running an exec, in process.json's CPUAffinity, which applies to a given exec and overrides the value from (1). Add some basic tests. Note that older kernels (RHEL8, Ubuntu 20.04) change CPU affinity of a process to that of a container's cgroup, as soon as it is moved to that cgroup, while newer kernels (Ubuntu 24.04, Fedora 41) don't do that. Because of the above, - it's impossible to really test initial CPU affinity without adding debug logging to libcontainer/nsenter; - for older kernels, there can be a brief moment when exec's affinity is different than either initial or final affinity being set; - exec's final CPU affinity, if not specified, can be different depending on the kernel, therefore we don't test it. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-03-02 19:17:41 -08:00
Kir Kolyshkin	a75076b4a4	Switch to opencontainers/cgroups This removes libcontainer/cgroups packages and starts using those from github.com/opencontainers/cgroups repo. Mostly generated by: git rm -f libcontainer/cgroups find . -type f -name "*.go" -exec sed -i \ 's\|github.com/opencontainers/runc/libcontainer/cgroups\|github.com/opencontainers/cgroups\|g' \ {} + go get github.com/opencontainers/cgroups@v0.0.1 make vendor gofumpt -w . Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-02-28 15:20:33 -08:00
Kir Kolyshkin	6c9ddcc648	libct: switch from libct/devices to libct/cgroups/devices/config Use the old package name as an alias to minimize the patch. No functional change; this just eliminates a bunch of deprecation warnings. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-01-31 16:51:09 -08:00
Kir Kolyshkin	6171da6005	libct/configs: add HookList.SetDefaultEnv 1. Make CommandHook.Command a pointer, which reduces the amount of data being copied when using hooks, and allows to modify command hooks. 2. Add SetDefaultEnv, which is to be used by the next commit. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-01-09 18:22:53 +08:00
lfbzhm	d48d9cfefc	Merge pull request #4459 from kolyshkin/prio-nits Fixups to scheduler/priority settings	2024-12-25 23:41:27 +08:00
Kir Kolyshkin	7334ee01e6	libct/configs: rm IOPrioClassMapping This is an internal implementation detail and should not be either public or visible. Amend setIOPriority to do own class conversion. Fixes: `bfbd0305` ("Add I/O priority") Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-22 18:17:44 -08:00
Kir Kolyshkin	8afeb58398	libct: add/use configs.HasHook This allows to omit a call to c.currentOCIState (which can be somewhat costly when there are many annotations) when the hooks of a given kind won't be run. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-22 17:47:09 -08:00
utam0k	bfbd0305ba	Add I/O priority Signed-off-by: utam0k <k0ma@utam0k.jp>	2024-03-30 22:31:54 +09:00
Aleksa Sarai	482e56379a	configs: make id mappings int64 to better handle 32-bit Using ints for all of our mapping structures means that a 32-bit binary errors out when trying to parse /proc/self/id_map: failed to cache mappings for userns: failed to parse uid_map of userns /proc/1/ns/user: parsing id map failed: invalid format in line " 0 0 4294967295": integer overflow on token 4294967295 This issue was unearthed by commit `1912d5988b` ("*: actually support joining a userns with a new container") but the underlying issue has been present since the docker/libcontainer days. In theory, switching to uint32 (to match the spec) instead of int64 would also work, but keeping everything signed seems much less error-prone. It's also important to note that a mapping might be too large for an int on 32-bit, so we detect this during the mapping. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 12:14:32 +11:00
Zheao.Li	98511bb40e	linux: Support setting execution domain via linux personality carry #3126 Co-authored-by: Aditya R <arajan@redhat.com> Signed-off-by: Zheao.Li <me@manjusaka.me>	2023-10-27 19:33:37 +08:00
Aleksa Sarai	7c71a22705	rootfs: remove --no-mount-fallback and finally fix MS_REMOUNT The original reasoning for this option was to avoid having mount options be overwritten by runc. However, adding command-line arguments has historically been a bad idea because it forces strict-runc-compatible OCI runtimes to copy out-of-spec features directly from runc and these flags are usually quite difficult to enable by users when using runc through several layers of engines and orchestrators. A far more preferable solution is to have a heuristic which detects whether copying the original mount's mount options would override an explicit mount option specified by the user. In this case, we should return an error. You only end up in this path in the userns case, if you have a bind-mount source with locked flags. During the course of writing this patch, I discovered that several aspects of our handling of flags for bind-mounts left much to be desired. We have completely botched the handling of explicitly cleared flags since commit `97f5ee4e6a` ("Only remount if requested flags differ from current"), with our behaviour only becoming increasingly more weird with `50105de1d8` ("Fix failure with rw bind mount of a ro fuse") and `da780e4d27` ("Fix bind mounts of filesystems with certain options set"). In short, we would only clear flags explicitly request by the user purely by chance, in ways that it really should've been reported to us by now. The most egregious is that mounts explicitly marked "rw" were actually mounted "ro" if the bind-mount source was "ro" and no other special flags were included. In addition, our handling of atime was completely broken -- mostly due to how subtle the semantics of atime are on Linux. Unfortunately, while the runtime-spec requires us to implement mount(8)'s behaviour, several aspects of the util-linux mount(8)'s behaviour are broken and thus copying them makes little sense. 
Since the runtime-spec behaviour for this case (should mount options for a "bind" mount use the "mount --bind -o ..." or "mount --bind -o remount,..." semantics? Is the fallback code we have for userns actually spec-compliant?) and the mount(8) behaviour (see [1]) are not well-defined, this commit simply fixes the most obvious aspects of the behaviour that are broken while keeping the current spirit of the implementation. NOTE: The handling of atime in the base case is left for a future PR to deal with. This means that the atime of the source mount will be silently left alone unless the fallback path needs to be taken, and any flags not explicitly set will be cleared in the base case. Whether we should always be operating as "mount --bind -o remount,..." (where we default to the original mount source flags) is a topic for a separate PR and (probably) associated runtime-spec PR. So, to resolve this: * We store which flags were explicitly requested to be cleared by the user, so that we can detect whether the userns fallback path would end up setting a flag the user explicitly wished to clear. If so, we return an error because we couldn't fulfil the configuration settings. * Revert `97f5ee4e6a` ("Only remount if requested flags differ from current"), as missing flags do not mean we can skip MS_REMOUNT (in fact, missing flags are how you indicate a flag needs to be cleared with mount(2)). The original purpose of the patch was to fix the userns issue, but as mentioned above the correct mechanism is to do a fallback mount that copies the lockable flags from statfs(2). * Improve handling of atime in the fallback case by: - Correctly handling the returned flags in statfs(2). - Implement the MNT_LOCK_ATIME checks in our code to ensure we produce errors rather than silently producing incorrect atime mounts. 
* Improve the tests so we correctly detect all of these contingencies, including a general "bind-mount atime handling" test to ensure that the behaviour described here is accurate. This change also inlines the remount() function -- it was only ever used for the bind-mount remount case, and its behaviour is very bind-mount specific. [1]: https://github.com/util-linux/util-linux/issues/2433 Reverts: `97f5ee4e6a` ("Only remount if requested flags differ from current") Fixes: `50105de1d8` ("Fix failure with rw bind mount of a ro fuse") Fixes: `da780e4d27` ("Fix bind mounts of filesystems with certain options set") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-10-24 17:28:25 +11:00
utam0k	770728e16e	Support `process.scheduler` Spec: https://github.com/opencontainers/runtime-spec/pull/1188 Fix: https://github.com/opencontainers/runc/issues/3895 Co-authored-by: lifubang <lifubang@acmcoder.com> Signed-off-by: utam0k <k0ma@utam0k.jp> Signed-off-by: lifubang <lifubang@acmcoder.com>	2023-10-04 15:53:18 +08:00
Kir Kolyshkin	6a4870e4ac	libct: better errors for hooks When a hook has failed, the error message looks like this: > error running hook: error running hook #1: exit status 1, stdout: ... The two problems here are: 1. it is impossible to know what kind of hook it was; 2. "error running hook" stuttering; Change that to > error running createContainer hook #1: exit status 1, stdout: ... Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-24 19:44:05 -07:00
Chethan Suresh	ebc2e7c435	Support time namespace The "time" namespace was introduced in Linux v5.6. Support the new time namespace to set boottime and monotonic time offsets. Example runtime spec: "timeOffsets": { "monotonic": { "secs": 172800, "nanosecs": 0 }, "boottime": { "secs": 604800, "nanosecs": 0 } } Signed-off-by: Chethan Suresh <chethan.suresh@sony.com>	2023-08-03 10:12:01 +05:30
Ruediger Pluem	da780e4d27	Fix bind mounts of filesystems with certain options set Currently bind mounts of filesystems with nodev, nosuid, noexec, noatime, relatime, strictatime, nodiratime options set fail in rootless mode if the same options are not set for the bind mount. For ro filesystems this was resolved by #2570 by remounting again with ro set. Follow the same approach for nodev, nosuid, noexec, noatime, relatime, strictatime, nodiratime but allow to revert back to the old behaviour via the new `--no-mount-fallback` command line option. Add a testcase to verify that bind mounts of filesystems with nodev, nosuid, noexec, noatime options set work in rootless mode. Add a testcase that mounts a nodev, nosuid, noexec, noatime filesystem with a ro flag. Add two further testcases that ensure that the above testcases would fail if the `--no-mount-fallback` command line option is set. * contrib/completions/bash/runc: Add `--no-mount-fallback` command line option for bash completion. * create.go: Add `--no-mount-fallback` command line option. * restore.go: Add `--no-mount-fallback` command line option. * run.go: Add `--no-mount-fallback` command line option. * libcontainer/configs/config.go: Add `NoMountFallback` field to the `Config` struct to store the command line option value. * libcontainer/specconv/spec_linux.go: Add `NoMountFallback` field to the `CreateOpts` struct to store the command line option value and store it in the libcontainer config. * utils_linux.go: Store the command line option value in the `CreateOpts` struct. * libcontainer/rootfs_linux.go: In case that `--no-mount-fallback` is not set try to remount the bind filesystem again with the options nodev, nosuid, noexec, noatime, relatime, strictatime or nodiratime if they are set on the source filesystem. * tests/integration/mounts_sshfs.bats: Add testcases and rework sshfs setup to allow specifying different mount options depending on the test case. 
Signed-off-by: Ruediger Pluem <ruediger.pluem@vodafone.com>	2023-07-28 16:32:02 -07:00
Francis Laniel	c47f58c4e9	Capitalize [UG]idMappings as [UG]IDMappings Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>	2023-07-21 13:55:34 +02:00
utam0k	d9230602e9	Implement to set a domainname opencontainers/runtime-spec#1156 Signed-off-by: utam0k <k0ma@utam0k.jp>	2023-04-12 13:31:20 +00:00
Alban Crequy	58ea21daef	seccomp: add support for flags List of seccomp flags defined in runtime-spec: * SECCOMP_FILTER_FLAG_TSYNC * SECCOMP_FILTER_FLAG_LOG * SECCOMP_FILTER_FLAG_SPEC_ALLOW Note that runc does not apply SECCOMP_FILTER_FLAG_TSYNC. It does not make sense to apply the seccomp filter on only one thread; other threads will be terminated after exec anyway. See similar commit in crun: https://github.com/containers/crun/commit/fefabffa2816ea343068ed036a86944393db189a Note that SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (introduced by https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?id=c2aa2dfef243 in Linux 5.19-rc1) is not added yet because Linux 5.19 is not released yet. Signed-off-by: Alban Crequy <albancrequy@microsoft.com>	2022-07-28 16:25:26 +02:00
Kir Kolyshkin	1b14d97484	libct/configs: rm Windows TODO It's clear at this point that runc won't support Windows. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-01-26 18:59:47 -08:00
Akihiro Suda	520702dac5	Add `runc features` command Fix issue 3274 See `types/features/features.go`. Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2021-11-30 16:40:39 +09:00
Sascha Grunert	4a4d4f109b	Add support for seccomp actions ActKillThread and ActKillProcess Two new seccomp actions have been added to the libseccomp-golang dependency, which can be now supported by runc, too. ActKillThread kills the thread that violated the rule. It is the same as ActKill. All other threads from the same thread group will continue to execute. ActKillProcess kills the process that violated the rule. All threads in the thread group are also terminated. This action is only usable when libseccomp API level 3 or higher is supported. Signed-off-by: Sascha Grunert <sgrunert@redhat.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2021-09-09 17:47:00 +10:00
Alban Crequy	2b025c0173	Implement Seccomp Notify This commit implements support for the SCMP_ACT_NOTIFY action. It requires libseccomp-2.5.0 to work but runc still works with older libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY action. A new synchronization step between runc[INIT] and runc run is introduced to pass the seccomp fd. runc run fetches the seccomp fd with pidfd_get from the runc[INIT] process and sends it to the seccomp agent using SCM_RIGHTS. As suggested by @kolyshkin, we also make writeSync() a wrapper of writeSyncWithFd() and wrap the error there. To avoid pointless errors, we made some existing code paths just return the error instead of re-wrapping it. If we don't do it, error will look like: writing syncT <act>: writing syncT: <err> By adjusting the code path, now they just look like this writing syncT <act>: <err> Signed-off-by: Alban Crequy <alban@kinvolk.io> Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>	2021-09-07 13:04:24 +02:00
Kir Kolyshkin	a7cfb23b88	*: stop using pkg/errors Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-22 16:09:47 -07:00
Kir Kolyshkin	7be93a66b9	*: fmt.Errorf: use %w when appropriate This should result in no change when the error is printed, but make the errors returned unwrappable, meaning errors.As and errors.Is will work. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-22 16:09:47 -07:00
Kir Kolyshkin	e6048715e4	Use gofumpt to format code gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules. Brought to you by git ls-files \*.go \| grep -v ^vendor/ \| xargs gofumpt -s -w Looking at the diff, all these changes make sense. Also, replace gofmt with gofumpt in golangci.yml. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-01 12:17:27 -07:00
Sebastiaan van Stijn	7e7ff8722a	errcheck: libcontainer/configs Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2021-05-20 14:17:45 +02:00
Giuseppe Scrivano	c61f606254	libcontainer: honor seccomp defaultErrnoRet https://github.com/opencontainers/runtime-spec/pull/1087 added support for defaultErrnoRet to the OCI runtime specs. If a defaultErrnoRet is specified, disable patching the generated libseccomp cBPF. Closes: https://github.com/opencontainers/runc/issues/2943 Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2021-05-17 09:23:32 +02:00
Sebastiaan van Stijn	4126b807cc	libcontainer/configs: add missing type for hooknames Commit `ccdd75760c` introduced the HookName type for hooks, but only set this type on the Prestart const, but not for the other hooks. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2021-04-02 11:08:24 +02:00
Kir Kolyshkin	c342872276	libct/config: fix a data race As reported by go test -race ./libcontainer/configs: === RUN TestCommandHookRunTimeout ================== WARNING: DATA RACE Read at 0x00c000202230 by goroutine 23: os/exec.(Cmd).Wait() /usr/lib/golang/src/os/exec/exec.go:502 +0x91 github.com/opencontainers/runc/libcontainer/configs.Command.Run() /home/kir/go/src/github.com/opencontainers/runc/libcontainer/configs/config.go:390 +0x58c github.com/opencontainers/runc/libcontainer/configs_test.TestCommandHookRunTimeout() /home/kir/go/src/github.com/opencontainers/runc/libcontainer/configs/config_test.go:223 +0x3ed testing.tRunner() /usr/lib/golang/src/testing/testing.go:1123 +0x202 Previous write at 0x00c000202230 by goroutine 27: os/exec.(Cmd).Wait() /usr/lib/golang/src/os/exec/exec.go:505 +0xb4 github.com/opencontainers/runc/libcontainer/configs.Command.Run.func1() /home/kir/go/src/github.com/opencontainers/runc/libcontainer/configs/config.go:373 +0x55 Goroutine 23 (running) created at: testing.(T).Run() /usr/lib/golang/src/testing/testing.go:1168 +0x5bb testing.runTests.func1() /usr/lib/golang/src/testing/testing.go:1439 +0xa6 testing.tRunner() /usr/lib/golang/src/testing/testing.go:1123 +0x202 testing.runTests() /usr/lib/golang/src/testing/testing.go:1437 +0x612 testing.(M).Run() /usr/lib/golang/src/testing/testing.go:1345 +0x3b3 main.main() _testmain.go:69 +0x236 Goroutine 27 (running) created at: github.com/opencontainers/runc/libcontainer/configs.Command.Run() /home/kir/go/src/github.com/opencontainers/runc/libcontainer/configs/config.go:372 +0x415 github.com/opencontainers/runc/libcontainer/configs_test.TestCommandHookRunTimeout() /home/kir/go/src/github.com/opencontainers/runc/libcontainer/configs/config_test.go:223 +0x3ed testing.tRunner() /usr/lib/golang/src/testing/testing.go:1123 +0x202 ================== testing.go:1038: race detected during execution of test --- FAIL: TestCommandHookRunTimeout (0.10s) Apparently, the issue is we call two Wait()s for 
the same command which can race internally. Fix is easy -- since we already have a waiting goroutine, wait for it to return instead of calling a second Wait(). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-02-05 13:02:47 -08:00
Sebastiaan van Stijn	677baf22d2	libcontainer: isolate libcontainer/devices Move the Device-related types to libcontainer/devices, so that the package can be used in isolation. Aliases have been created in libcontainer/configs for backward compatibility. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2020-12-01 11:11:21 +01:00
Giuseppe Scrivano	a63f99fcc5	Add support for umask Signed-off-by: Ashley Cui <acui@redhat.com>	2020-08-20 11:39:43 -04:00
Sebastiaan van Stijn	901dccf05d	vendor: update runtime-spec v1.0.3-0.20200728170252-4d89ac9fbff6 Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2020-07-30 22:08:54 +02:00
Renaud Gaubert	2f7bdf9d3b	Tests the new Hook Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>	2020-06-19 02:39:20 +00:00
Renaud Gaubert	ccdd75760c	Add the CreateRuntime, CreateContainer and StartContainer Hooks Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>	2020-06-17 02:10:00 +00:00
Giuseppe Scrivano	41aa19662b	libcontainer: honor seccomp errnoRet Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2020-05-20 09:11:55 +02:00
blacktop	84373aaa56	Add SCMP_ACT_LOG as a valid Seccomp action (#1951 ) Signed-off-by: blacktop <blacktop@users.noreply.github.com>	2019-09-26 11:03:03 -04:00
W. Trevor King	e23868603a	libcontainer: Set 'status' in hook stdin Finish off the work started in `a344b2d6` (sync up `HookState` with OCI spec `State`, 2016-12-19, #1201). And drop HookState, since there's no need for a local alias for specs.State. Also set c.initProcess in newInitProcess to support OCIState calls from within initProcess.start(). I think the cyclic references between linuxContainer and initProcess are unfortunate, but didn't want to address that here. I've also left the timing of the Prestart hooks alone, although the spec calls for them to happen before start (not as part of creation) [1,2]. Once the timing gets fixed we can drop the initProcessStartTime hacks which initProcess.start currently needs. I'm not sure why we trigger the prestart hooks in response to both procReady and procHooks. But we've had two prestart rounds in initProcess.start since `2f276498` (Move pre-start hooks after container mounts, 2016-02-17, #568). I've left that alone too. I really think we should have len() guards to avoid computing the state when .Hooks is non-nil but the particular phase we're looking at is empty. Aleksa, however, is adamantly against them [3] citing a risk of sloppy copy/pastes causing the hook slice being len-guarded to diverge from the hook slice being iterated over within the guard. I think that sort of thing is very low-risk, because: * We shouldn't be copy/pasting this, right? DRY for the win :). * There's only ever a few lines between the guard and the guarded loop. That makes broken copy/pastes easy to catch in review. * We should have test coverage for these. Guarding with the wrong slice is certainly not the only thing you can break with a sloppy copy/paste. But I'm not a maintainer ;). [1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.0/config.md#prestart [2]: https://github.com/opencontainers/runc/issues/1710 [3]: https://github.com/opencontainers/runc/pull/1741#discussion_r233331570 Signed-off-by: W. Trevor King <wking@tremily.us>	2018-11-14 06:49:49 -08:00
Xiaochen Shen	27560ace2f	libcontainer: intelrdt: add support for Intel RDT/MBA in runc Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature of Intel Resource Director Technology (RDT) which is supported on some Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate throttle over memory bandwidth for the software. A user controls the resource by indicating the percentage of maximum memory bandwidth. Hardware details of Intel RDT/MBA can be found in section 17.18 of Intel Software Developer Manual: https://software.intel.com/en-us/articles/intel-sdm In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel config CONFIG_INTEL_RDT. If the hardware supports it, CPU flags `rdt_a` and `mba` will be set in /proc/cpuinfo. Intel RDT "resource control" filesystem hierarchy: mount -t resctrl resctrl /sys/fs/resctrl tree /sys/fs/resctrl /sys/fs/resctrl/ \|-- info \| \|-- L3 \| \| \|-- cbm_mask \| \| \|-- min_cbm_bits \| \| \|-- num_closids \| \|-- MB \| \|-- bandwidth_gran \| \|-- delay_linear \| \|-- min_bandwidth \| \|-- num_closids \|-- ... \|-- schemata \|-- tasks \|-- <container_id> \|-- ... \|-- schemata \|-- tasks For MBA support for `runc`, we will reuse the infrastructure and code base of Intel RDT/CAT which was implemented in #1279. We could also make use of `tasks` and `schemata` configuration for memory bandwidth resource constraints. The file `tasks` has a list of tasks that belong to this group (e.g., the "<container_id>" group). Tasks can be added to a group by writing the task ID to the "tasks" file (which will automatically remove them from the previous group to which they belonged). New tasks created by fork(2) and clone(2) are added to the same group as their parent. The file `schemata` has a list of all the resources available to this group. Each resource (L3 cache, memory bandwidth) has its own line and format. 
Memory bandwidth schema: It has allocation values for memory bandwidth on each socket, which contains L3 cache id and memory bandwidth percentage. Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..." The minimum bandwidth percentage value for each CPU model is predefined and can be looked up through "info/MB/min_bandwidth". The bandwidth granularity that is allocated is also dependent on the CPU model and can be looked up at "info/MB/bandwidth_gran". The available bandwidth control steps are: min_bw + N * bw_gran. Intermediate values are rounded to the next control step available on the hardware. For more information about Intel RDT kernel interface: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt An example for runc: Consider a two-socket machine with two L3 caches where the minimum memory bandwidth is 10% with a memory bandwidth granularity of 10%. Tasks inside the container may use a maximum memory bandwidth of 20% on socket 0 and 70% on socket 1. "linux": { "intelRdt": { "memBwSchema": "MB:0=20;1=70" } } Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-10-16 14:29:29 +08:00
Akihiro Suda	06f789cf26	Disable rootless mode except RootlessCgMgr when executed as the root in userns This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and `RootlessCgroups bool`, so as to make "runc-in-userns" more compatible with "rootful" runc. `RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in the current user namespace. `RootlessEUID` is almost identical to the former `Rootless` except cgroups stuff. `RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups. `RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace. Otherwise `RootlessCgroups` is set to true. (Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well) When runc is executed as the root (euid == 0) in a user namespace (e.g. by Docker-in-LXD, Podman, Usernetes), `RootlessEUID` is set to false but `RootlessCgroups` is set to true. So, "runc-in-userns" behaves almost the same as "rootful" runc except that cgroups errors are ignored. This PR does not have any impact on CLI flags and `state.json`. Note about CLI: * Now `runc --rootless=(auto\|true\|false)` CLI flag is only used for setting `RootlessCgroups`. * Now `runc spec --rootless` is only required when `RootlessEUID` is set to true. For runc-in-userns, `runc spec` without `--rootless` should work, when sufficient numbers of UID/GID are mapped. Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`): * `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility. (`/run/runc` is used) * If runc is executed as the root (euid == 0) in a user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`. This allows unprivileged users to execute runc as the root in userns, without mounting writable `/run/runc`. 
Note about `state.json`: * `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`. Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2018-09-07 15:05:03 +09:00
Aleksa Sarai	fd3a6e6c83	libcontainer: handle unset oomScoreAdj correctly Previously if oomScoreAdj was not set in config.json we would implicitly set oom_score_adj to 0. This is not allowed according to the spec: > If oomScoreAdj is not set, the runtime MUST NOT change the value of > oom_score_adj. Change this so that we do not modify oom_score_adj if oomScoreAdj is not present in the configuration. While this modifies our internal configuration types, the on-disk format is still compatible. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2018-03-17 13:53:42 +11:00


82 Commits