zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2026-04-25 08:56:43 +08:00

Author	SHA1	Message	Date
Aleksa Sarai	3b57e45cbf	mount: add support for ridmap and idmap ridmap indicates that the id mapping should be applied recursively (only really relevant for rbind mount entries), and idmap indicates that it should not be applied recursively (the default). If no mappings are specified for the mount, we use the userns configuration of the container. This matches the behaviour in the currently-unreleased runtime-spec. This includes a minor change to the state.json serialisation format, but because there has been no released version of runc with commit `fbf183c6f8` ("Add uid and gid mappings to mounts"), we can safely make this change without affecting running containers. Doing it this way makes it much easier to handle m.IsIDMapped() and indicating that a mapping has been specified. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:42 +11:00
Aleksa Sarai	8e8b136c49	tree-wide: use /proc/thread-self for thread-local state With the idmap work, we will have a tainted Go thread in our thread-group that has a different mount namespace to the other threads. It seems that (due to some bad luck) the Go scheduler tends to make this thread the thread-group leader in our tests, which results in very baffling failures where /proc/self/mountinfo produces gibberish results. In order to avoid this, switch to using /proc/thread-self for everything that is thread-local. This primarily includes switching all file descriptor paths (CLONE_FS), all of the places that check the current cgroup (technically we never will run a single runc thread in a separate cgroup, but better to be safe than sorry), and the aforementioned mountinfo code. We don't need to do anything for the following because the results we need aren't thread-local: * Checks that certain namespaces are supported by stat(2)ing /proc/self/ns/... * /proc/self/exe and /proc/self/cmdline are not thread-local. * While threads can be in different cgroups, we do not do this for the runc binary (or libcontainer) and thus we do not need to switch to the thread-local version of /proc/self/cgroups. * All of the CLONE_NEWUSER files are not thread-local because you cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER) is blocked for multi-threaded programs). Note that we have to use runtime.LockOSThread when we have an open handle to a tid-specific procfs file that we are operating on multiple times. Go can reschedule us such that we are running on a different thread and then kill the original thread (causing -ENOENT or similarly confusing errors). This is not strictly necessary for most usages of /proc/thread-self (such as using /proc/thread-self/fd/$n directly) since only operating on the actual inodes associated with the tid requires this locking, but because of the pre-3.17 fallback for CentOS, we have to do this in most cases. 
In addition, CentOS's kernel is too old for /proc/thread-self, which requires us to emulate it -- however in rootfs_linux.go, we are in the container pid namespace but /proc is the host's procfs. This leads to the incredibly frustrating situation where there is no way (on pre-4.1 Linux) to figure out which /proc/self/task/... entry refers to the current tid. We can just use /proc/self in this case. Yes this is all pretty ugly. I also wish it wasn't necessary. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:41 +11:00
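The locking rule described in this commit can be sketched as follows. This is a minimal illustration, not runc's code: `withThreadSelf` is a hypothetical helper name, and the sketch assumes a kernel that provides /proc/thread-self (it omits the pre-3.17 fallback the message mentions).

```go
package main

import (
	"fmt"
	"path"
	"runtime"
)

// withThreadSelf (hypothetical helper) pins the calling goroutine to its
// current OS thread for the duration of fn, so that a tid-specific procfs
// path keeps referring to the same thread while fn operates on it.
// Assumption: /proc/thread-self exists; no fallback is implemented here.
func withThreadSelf(subpath string, fn func(procPath string) error) error {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	return fn(path.Join("/proc/thread-self", subpath))
}

func main() {
	err := withThreadSelf("mountinfo", func(p string) error {
		// A real caller would open and read p here.
		fmt.Println("would read", p)
		return nil
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```

Passing the path to a callback keeps the lock and unlock paired, so the caller cannot keep using the path after the goroutine is free to migrate to another thread.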
Aleksa Sarai	5ae88daf06	idmap: allow arbitrary idmap mounts regardless of userns configuration With the rework of nsexec.c to handle MOUNT_ATTR_IDMAP in our Go code we can now handle arbitrary mappings without issue, so remove the primary artificial limit of mappings (must use the same mapping as the container's userns) and add some tests. We still only support idmap mounts for bind-mounts because configuring mappings for other filesystems would require switching our entire mount machinery to the new mount API. The current design would easily allow for this but we would need to convert new mount options entirely to the fsopen/fsconfig/fsmount API. This can be done in the future. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:41 +11:00
Aleksa Sarai	09822c3da8	configs: disallow ambiguous userns and timens configurations For userns and timens, the mappings (and offsets, respectively) cannot be changed after the namespace is first configured. Thus, configuring a container with a namespace path to join means that you cannot also provide configuration for said namespace. Previously we would silently ignore the configuration (and just join the provided path), but we really should be returning an error (especially when you consider that the configuration userns mappings are used quite a bit in runc with the assumption that they are the correct mapping for the userns -- but in this case they are not). In the case of userns, the mappings are also required if you _do not_ specify a path, while in the case of the time namespace you can have a container with a timens but no mappings specified. It should be noted that the case checking that the user has not specified a userns path and a userns mapping needs to be handled in specconv (as opposed to the configuration validator) because with this patchset we now cache the mappings of path-based userns configurations and thus the validator can't be sure whether the mapping is a cached mapping or a user-specified one. So we do the validation in specconv, and thus the test for this needs to be an integration test. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-05 17:46:09 +11:00
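The userns rule stated in this commit (a path and mappings are mutually exclusive, and mappings are required when no path is given) could be checked roughly as below. The type and function names are assumptions made for illustration and do not come from runc.

```go
package main

import (
	"errors"
	"fmt"
)

// idMap and usernsConfig are hypothetical types for this sketch only.
type idMap struct{ ContainerID, HostID, Size int64 }

type usernsConfig struct {
	Path        string // namespace path to join, empty to create a new userns
	UIDMappings []idMap
	GIDMappings []idMap
}

// validateUserns rejects the two ambiguous or incomplete cases described
// in the commit message.
func validateUserns(c usernsConfig) error {
	hasMappings := len(c.UIDMappings) > 0 || len(c.GIDMappings) > 0
	switch {
	case c.Path != "" && hasMappings:
		return errors.New("userns path and id mappings are mutually exclusive")
	case c.Path == "" && !hasMappings:
		return errors.New("a new userns requires id mappings")
	}
	return nil
}

func main() {
	bad := usernsConfig{
		Path:        "/proc/1234/ns/user",
		UIDMappings: []idMap{{ContainerID: 0, HostID: 1000, Size: 1}},
	}
	fmt.Println(validateUserns(bad))
}
```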
Aleksa Sarai	3bab7e9223	configs: clean up error messages for Host[UG]ID If a user has misconfigured their userns mappings, they need to know which id specifically is not mapped. There's no need to be vague. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-05 17:46:09 +11:00
Aleksa Sarai	1912d5988b	: actually support joining a userns with a new container Our handling for namespace paths with user namespaces has been broken for a long time. In particular, the need to parse /proc/self/id_map in quite a few places meant that we would treat userns configurations that had a namespace path as if they were a userns configuration without mappings, resulting in errors. The primary issue was down to the id translation helper functions, which could only handle configurations that had explicit mappings. Obviously, when joining a user namespace we need to map the ids but figuring out the correct mapping is non-trivial in comparison. In order to get the mapping, you need to read /proc/<pid>/id_map of a process inside the userns -- while most userns paths will be of the form /proc/<pid>/ns/user (and we have a fast-path for this case), this is not guaranteed and thus it is necessary to spawn a process inside the container and read its /proc/<pid>/id_map files in the general case. As Go does not allow us to spawn a subprocess into a target userns, we have to use CGo to fork a sub-process which does the setns(2). To be honest, this is a little dodgy in regards to POSIX signal-safety(7) but since we do no allocations and we are executing in the forked context from a Go program (not a C program), it should be okay. The other alternative would be to do an expensive re-exec (a-la nsexec which would make several other bits of runc more complicated), or to use nsenter(1) which might not exist on the system and is less than ideal. Because we need to logically remap users quite a few times in runc (including in "runc init", where joining the namespace is not feasible), we cache the mapping inside the libcontainer config struct. A future patch will make sure that we stop allowing invalid user configurations where a mapping is specified as well as a userns path to join. Finally, add an integration test to make sure we don't regress this again. 
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-05 17:46:08 +11:00
Aleksa Sarai	669f4dbef8	configs: validate: add validation for bind-mount fsflags Bind-mounts cannot have any filesystem-specific "data" arguments, because the kernel ignores the data argument for MS_BIND and MS_BIND\|MS_REMOUNT and we cannot safely try to override the flags because those would affect mounts on the host (these flags affect the superblock). It should be noted that there are cases where the filesystem-specified flags will also be ignored for non-bind-mounts but those are kernel quirks and there's no real way for us to work around them. And users wouldn't get any real benefit from us adding guardrails to existing kernel behaviour. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-11-02 07:50:03 +11:00
Rodrigo Campos	4bf8b55594	libct: Remove old comment We changed it in PR: https://github.com/opencontainers/runtime-spec/pull/1225 but forgot to remove this comment. Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>	2023-11-01 12:48:42 +01:00
Zheao.Li	98511bb40e	linux: Support setting execution domain via linux personality carry #3126 Co-authored-by: Aditya R <arajan@redhat.com> Signed-off-by: Zheao.Li <me@manjusaka.me>	2023-10-27 19:33:37 +08:00
Aleksa Sarai	7c71a22705	rootfs: remove --no-mount-fallback and finally fix MS_REMOUNT The original reasoning for this option was to avoid having mount options be overwritten by runc. However, adding command-line arguments has historically been a bad idea because it forces strict-runc-compatible OCI runtimes to copy out-of-spec features directly from runc and these flags are usually quite difficult to enable by users when using runc through several layers of engines and orchestrators. A far more preferable solution is to have a heuristic which detects whether copying the original mount's mount options would override an explicit mount option specified by the user. In this case, we should return an error. You only end up in this path in the userns case, if you have a bind-mount source with locked flags. During the course of writing this patch, I discovered that several aspects of our handling of flags for bind-mounts left much to be desired. We have completely botched the handling of explicitly cleared flags since commit `97f5ee4e6a` ("Only remount if requested flags differ from current"), with our behaviour only becoming increasingly more weird with `50105de1d8` ("Fix failure with rw bind mount of a ro fuse") and `da780e4d27` ("Fix bind mounts of filesystems with certain options set"). In short, we would only clear flags explicitly requested by the user purely by chance, in ways that it really should've been reported to us by now. The most egregious is that mounts explicitly marked "rw" were actually mounted "ro" if the bind-mount source was "ro" and no other special flags were included. In addition, our handling of atime was completely broken -- mostly due to how subtle the semantics of atime are on Linux. Unfortunately, while the runtime-spec requires us to implement mount(8)'s behaviour, several aspects of the util-linux mount(8)'s behaviour are broken and thus copying them makes little sense. 
Since the runtime-spec behaviour for this case (should mount options for a "bind" mount use the "mount --bind -o ..." or "mount --bind -o remount,..." semantics? Is the fallback code we have for userns actually spec-compliant?) and the mount(8) behaviour (see [1]) are not well-defined, this commit simply fixes the most obvious aspects of the behaviour that are broken while keeping the current spirit of the implementation. NOTE: The handling of atime in the base case is left for a future PR to deal with. This means that the atime of the source mount will be silently left alone unless the fallback path needs to be taken, and any flags not explicitly set will be cleared in the base case. Whether we should always be operating as "mount --bind -o remount,..." (where we default to the original mount source flags) is a topic for a separate PR and (probably) associated runtime-spec PR. So, to resolve this: * We store which flags were explicitly requested to be cleared by the user, so that we can detect whether the userns fallback path would end up setting a flag the user explicitly wished to clear. If so, we return an error because we couldn't fulfil the configuration settings. * Revert `97f5ee4e6a` ("Only remount if requested flags differ from current"), as missing flags do not mean we can skip MS_REMOUNT (in fact, missing flags are how you indicate a flag needs to be cleared with mount(2)). The original purpose of the patch was to fix the userns issue, but as mentioned above the correct mechanism is to do a fallback mount that copies the lockable flags from statfs(2). * Improve handling of atime in the fallback case by: - Correctly handling the returned flags in statfs(2). - Implement the MNT_LOCK_ATIME checks in our code to ensure we produce errors rather than silently producing incorrect atime mounts. 
* Improve the tests so we correctly detect all of these contingencies, including a general "bind-mount atime handling" test to ensure that the behaviour described here is accurate. This change also inlines the remount() function -- it was only ever used for the bind-mount remount case, and its behaviour is very bind-mount specific. [1]: https://github.com/util-linux/util-linux/issues/2433 Reverts: `97f5ee4e6a` ("Only remount if requested flags differ from current") Fixes: `50105de1d8` ("Fix failure with rw bind mount of a ro fuse") Fixes: `da780e4d27` ("Fix bind mounts of filesystems with certain options set") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-10-24 17:28:25 +11:00
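The "explicitly cleared flags" check in the first bullet of this commit can be illustrated with a small bitmask sketch. It is a simplification under stated assumptions: the function name is hypothetical, the constants are local copies of the Linux mount(2) flag values, and atime handling is left out entirely.

```go
package main

import "fmt"

// Local copies of Linux mount(2) flag values, defined here so the sketch
// builds without golang.org/x/sys/unix.
const (
	msRdonly uint32 = 0x1
	msNosuid uint32 = 0x2
	msNodev  uint32 = 0x4
	msNoexec uint32 = 0x8
)

// checkFallbackFlags (hypothetical) reports an error when the fallback
// remount, which copies srcFlags from the bind-mount source, would set a
// flag that the user explicitly asked to clear.
func checkFallbackFlags(srcFlags, clearedFlags uint32) error {
	if conflict := srcFlags & clearedFlags; conflict != 0 {
		return fmt.Errorf("fallback remount would set explicitly cleared flags %#x", conflict)
	}
	return nil
}

func main() {
	// The source is "ro,nosuid" and the user requested "rw" (clears MS_RDONLY).
	fmt.Println(checkFallbackFlags(msRdonly|msNosuid, msRdonly))
	// The source is only "nosuid", so copying its flags does not conflict.
	fmt.Println(checkFallbackFlags(msNosuid, msRdonly))
}
```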
utam0k	770728e16e	Support `process.scheduler` Spec: https://github.com/opencontainers/runtime-spec/pull/1188 Fix: https://github.com/opencontainers/runc/issues/3895 Co-authored-by: lifubang <lifubang@acmcoder.com> Signed-off-by: utam0k <k0ma@utam0k.jp> Signed-off-by: lifubang <lifubang@acmcoder.com>	2023-10-04 15:53:18 +08:00
Rodrigo Campos	b17c6f237d	validator: Relax warning for not abs mount dst path The runtime spec now allows relative mount dst paths, so remove the comment saying we will switch this to an error later and change the error messages to reflect that. Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>	2023-09-11 16:02:41 +02:00
Kailun Qin	e1584831b6	libct/cg: add CFS bandwidth burst for CPU Burstable CFS controller is introduced in Linux 5.14. This helps with parallel workloads that might be bursty. They can get throttled even when their average utilization is under quota. And they may be latency sensitive at the same time so that throttling them is undesired. This feature borrows time now against the future underrun, at the cost of increased interference against the other system users, by introducing cfs_burst_us into CFS bandwidth control to enact the cap on unused bandwidth accumulation, which will then be used additionally for burst. The patch adds the support/control for CFS bandwidth burst. runtime-spec: https://github.com/opencontainers/runtime-spec/pull/1120 Co-authored-by: Akihiro Suda <suda.kyoto@gmail.com> Co-authored-by: Nadeshiko Manju <me@manjusaka.me> Signed-off-by: Kailun Qin <kailun.qin@intel.com>	2023-09-06 23:23:30 +08:00
Kir Kolyshkin	6a4870e4ac	libct: better errors for hooks When a hook has failed, the error message looks like this: > error running hook: error running hook #1: exit status 1, stdout: ... The two problems here are: 1. it is impossible to know what kind of hook it was; 2. "error running hook" stuttering; Change that to > error running createContainer hook #1: exit status 1, stdout: ... Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-24 19:44:05 -07:00
Aleksa Sarai	1f25724a96	configs: fix idmapped mounts json field names In the runc state JSON we always use snake_case. This is a no-op change, but it will cause any existing container state files to be incorrectly parsed. Luckily, commit `fbf183c6f8` ("Add uid and gid mappings to mounts") has never been in a runc release so we can change this before a 1.2.z release. Fixes: `fbf183c6f8` ("Add uid and gid mappings to mounts") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-08-15 19:54:24 -07:00
Aleksa Sarai	aa5f4c1137	tests: add several timens tests These are not exhaustive, but at least confirm that the feature is not obviously broken (we correctly set the time offsets). Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-08-10 19:01:31 +10:00
Aleksa Sarai	9acfd7b1a3	timens: minor cleanups Fix up a few things that were flagged in the review of the original timens PR, namely around error handling and validation. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-08-10 18:59:55 +10:00
Aleksa Sarai	0866112e81	merge #3876 into opencontainers/runc:main Chethan Suresh (1): Support time namespace LGTMs: kolyskin cyphar Closes #3876	2023-08-10 18:27:17 +10:00
Rodrigo Campos	19d26a6596	Revert "libct/validator: Error out on non-abs paths" This reverts commit `881e92a3fd` and adjusts the code so the idmap validations are strict. We now only throw a warning and the container is started just fine. Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>	2023-08-08 13:45:31 +02:00
Chethan Suresh	ebc2e7c435	Support time namespace The "time" namespace was introduced in Linux v5.6. Support the new time namespace to set boottime and monotonic time offsets. Example runtime spec: "timeOffsets": { "monotonic": { "secs": 172800, "nanosecs": 0 }, "boottime": { "secs": 604800, "nanosecs": 0 } } Signed-off-by: Chethan Suresh <chethan.suresh@sony.com>	2023-08-03 10:12:01 +05:30
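As an illustration of how the example "timeOffsets" fragment might be consumed, the sketch below decodes it and renders one "<clock> <secs> <nanosecs>" line per clock, the format documented in time_namespaces(7) for /proc/PID/timens_offsets. The type and function names are assumptions, the fragment is wrapped in braces to make it valid JSON, and the sketch does not write to procfs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// timeOffset mirrors one entry of the example fragment (hypothetical type).
type timeOffset struct {
	Secs     int64  `json:"secs"`
	Nanosecs uint32 `json:"nanosecs"`
}

type linuxSection struct {
	TimeOffsets map[string]timeOffset `json:"timeOffsets"`
}

// timensOffsets decodes the JSON and returns one line per clock, sorted by
// clock name so that the output is deterministic.
func timensOffsets(raw []byte) (string, error) {
	var l linuxSection
	if err := json.Unmarshal(raw, &l); err != nil {
		return "", err
	}
	names := make([]string, 0, len(l.TimeOffsets))
	for name := range l.TimeOffsets {
		names = append(names, name)
	}
	sort.Strings(names)
	var b strings.Builder
	for _, name := range names {
		o := l.TimeOffsets[name]
		fmt.Fprintf(&b, "%s %d %d\n", name, o.Secs, o.Nanosecs)
	}
	return b.String(), nil
}

func main() {
	raw := []byte(`{"timeOffsets": {"monotonic": {"secs": 172800, "nanosecs": 0}, "boottime": {"secs": 604800, "nanosecs": 0}}}`)
	out, err := timensOffsets(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```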
Ruediger Pluem	da780e4d27	Fix bind mounts of filesystems with certain options set Currently bind mounts of filesystems with nodev, nosuid, noexec, noatime, relatime, strictatime, nodiratime options set fail in rootless mode if the same options are not set for the bind mount. For ro filesystems this was resolved by #2570 by remounting again with ro set. Follow the same approach for nodev, nosuid, noexec, noatime, relatime, strictatime, nodiratime but allow to revert back to the old behaviour via the new `--no-mount-fallback` command line option. Add a testcase to verify that bind mounts of filesystems with nodev, nosuid, noexec, noatime options set work in rootless mode. Add a testcase that mounts a nodev, nosuid, noexec, noatime filesystem with a ro flag. Add two further testcases that ensure that the above testcases would fail if the `--no-mount-fallback` command line option is set. * contrib/completions/bash/runc: Add `--no-mount-fallback` command line option for bash completion. * create.go: Add `--no-mount-fallback` command line option. * restore.go: Add `--no-mount-fallback` command line option. * run.go: Add `--no-mount-fallback` command line option. * libcontainer/configs/config.go: Add `NoMountFallback` field to the `Config` struct to store the command line option value. * libcontainer/specconv/spec_linux.go: Add `NoMountFallback` field to the `CreateOpts` struct to store the command line option value and store it in the libcontainer config. * utils_linux.go: Store the command line option value in the `CreateOpts` struct. * libcontainer/rootfs_linux.go: In case that `--no-mount-fallback` is not set try to remount the bind filesystem again with the options nodev, nosuid, noexec, noatime, relatime, strictatime or nodiratime if they are set on the source filesystem. * tests/integration/mounts_sshfs.bats: Add testcases and rework sshfs setup to allow specifying different mount options depending on the test case. 
Signed-off-by: Ruediger Pluem <ruediger.pluem@vodafone.com>	2023-07-28 16:32:02 -07:00
lifubang	c875ea8529	use the length of UIDMappings/GIDMappings to check whether empty or not Signed-off-by: lifubang <lifubang@acmcoder.com>	2023-07-23 11:59:18 +08:00
Francis Laniel	c47f58c4e9	Capitalize [UG]idMappings as [UG]IDMappings Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>	2023-07-21 13:55:34 +02:00
Rodrigo Campos	fbf183c6f8	Add uid and gid mappings to mounts Co-authored-by: Francis Laniel <flaniel@linux.microsoft.com> Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>	2023-07-11 16:17:48 +02:00
Rodrigo Campos	881e92a3fd	libct/validator: Error out on non-abs paths This was a warning already and it was requested to make this an error while we will add validation of idmap mounts: https://github.com/opencontainers/runc/pull/3717#discussion_r1154705318 I've also tested a k8s cluster and the config.json generated by containerd didn't use any relative paths. I tested one pod, so it was definitely not an extensive test. Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>	2023-07-07 12:00:33 +02:00
Kir Kolyshkin	2a7dcbbb40	libct: fix shared pidns detection When someone is using libcontainer to start and kill containers from a long lived process (i.e. the same process creates and removes the container), initProcess.wait method is used, which has a kludge to work around killing containers that do not have their own PID namespace. The code that checks for own PID namespace is not entirely correct. To be exact, it does not set sharePidns flag when the host/caller PID namespace is implicitly used. As a result, the above mentioned kludge does not work. Fix the issue, add a test case (which fails without the fix). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-06-08 09:23:29 -07:00
utam0k	d9230602e9	Implement to set a domainname opencontainers/runtime-spec#1156 Signed-off-by: utam0k <k0ma@utam0k.jp>	2023-04-12 13:31:20 +00:00
wineway	81c379fa8b	support SCHED_IDLE for runc cgroupfs Signed-off-by: wineway <wangyuweihx@gmail.com>	2023-01-31 15:19:05 +08:00
Eric Ernst	e29e57b5fc	libcontainer: configs: ensure can build on darwin configs package can no longer be built on non-Linux OS, such as Darwin. When running `GOOS=darwin go build` on the package, we had the following errors: ``` ./configs/mount.go:34:16: undefined: unix.MountAttr ./configs/mount.go:47:22: undefined: unix.MS_BIND ``` Let's ensure that the linux specific bits are handled in mount_linux.go, and introduce a _unsupported file, similar to how cgroups file is handled within the package. This'll facilitate utilization of the pkg for other projects that care about Darwin. Signed-off-by: Eric Ernst <eric_ernst@apple.com>	2023-01-17 15:57:08 -08:00
Kir Kolyshkin	6462e9de67	runc update: implement memory.checkBeforeUpdate This is aimed at solving the problem of cgroup v2 memory controller behavior which is not compatible with that of cgroup v1. In cgroup v1, if the new memory limit being set is lower than the current usage, setting the new limit fails. In cgroup v2, same operation succeeds, and the container is OOM killed. Introduce a new setting, memory.checkBeforeUpdate, and use it to mimic cgroup v1 behavior. Note that this is not 100% reliable because of TOCTOU, but this is the best we can do. Add some test cases. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-11-02 17:15:26 -07:00
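The check described here amounts to comparing the requested limit with the current usage before writing it. The sketch below assumes both values are already known as byte counts and that a non-positive limit means "no limit"; the function name is hypothetical, and reading the usage from the cgroup is not shown. As the message notes, usage can still grow between the check and the write.

```go
package main

import "fmt"

// checkMemoryLimit (hypothetical) mimics the cgroup v1 behaviour described
// above: refuse a new limit that is lower than the current usage.
// Assumption: newLimit <= 0 means "unlimited" and is always accepted.
func checkMemoryLimit(usage, newLimit int64) error {
	if newLimit <= 0 {
		return nil
	}
	if newLimit < usage {
		return fmt.Errorf("new memory limit %d is below current usage %d", newLimit, usage)
	}
	return nil
}

func main() {
	fmt.Println(checkMemoryLimit(200<<20, 100<<20)) // rejected
	fmt.Println(checkMemoryLimit(50<<20, 100<<20))  // accepted
}
```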
Kir Kolyshkin	45cc290f02	libct: fixes for godoc 1.19 Since Go 1.19, godoc recognizes lists, code blocks, headings etc. It also reformats the sources making it more apparent that these features are used. Fix a few places where it misinterpreted the formatting (such as indented vs unindented), and format the result using the gofumpt from HEAD, which already incorporates gofmt 1.19 changes. Some more fixes (and enhancements) might be required. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-08-16 09:53:54 -07:00
Alban Crequy	58ea21daef	seccomp: add support for flags List of seccomp flags defined in runtime-spec: * SECCOMP_FILTER_FLAG_TSYNC * SECCOMP_FILTER_FLAG_LOG * SECCOMP_FILTER_FLAG_SPEC_ALLOW Note that runc does not apply SECCOMP_FILTER_FLAG_TSYNC. It does not make sense to apply the seccomp filter on only one thread; other threads will be terminated after exec anyway. See similar commit in crun: https://github.com/containers/crun/commit/fefabffa2816ea343068ed036a86944393db189a Note that SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (introduced by https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?id=c2aa2dfef243 in Linux 5.19-rc1) is not added yet because Linux 5.19 is not released yet. Signed-off-by: Alban Crequy <albancrequy@microsoft.com>	2022-07-28 16:25:26 +02:00
cdoern	c0be1aa2d1	export blockIODevice the struct blockIODevice is used in an exported struct but it is not itself exported rendering that type inaccessible to outside projects Signed-off-by: cdoern <cdoern@redhat.com>	2022-06-13 13:40:39 -04:00
Kir Kolyshkin	48006d0007	libct/configs/validate: rootlessEUIDMount: speedup 1. Fix function docs. In particular, remove the part which is not true ("verifies that the user isn't trying to set up any mounts they don't have the rights to do"), and fix the part that says "that doesn't resolve to root" (which is no longer true since commit `d8b669400a`). 2. Replace fmt.Sscanf (which is slow and does lots of allocations) with strings.TrimPrefix and strconv.Atoi. 3. Add a benchmark for rootlessEUIDMount. Comparing the old and the new implementations: name old time/op new time/op delta RootlessEUIDMount-4 1.01µs ± 2% 0.16µs ± 1% -84.15% (p=0.008 n=5+5) name old alloc/op new alloc/op delta RootlessEUIDMount-4 224B ± 0% 80B ± 0% -64.29% (p=0.008 n=5+5) name old allocs/op new allocs/op delta RootlessEUIDMount-4 7.00 ± 0% 1.00 ± 0% -85.71% (p=0.008 n=5+5) Note this code is already tested (in rootless_test.go). Fixes: `d8b669400a` Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-03-17 13:39:55 -07:00
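The replacement described in item 2 follows a common pattern, sketched below. The function name and the "uid=" option format are assumptions for illustration and are not taken from rootlessEUIDMount itself.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseIDOption (hypothetical) extracts the number from an option such as
// "uid=1000" using strings.TrimPrefix and strconv.Atoi instead of
// fmt.Sscanf. ok is false when the prefix is absent or the value is not
// a valid integer.
func parseIDOption(opt, prefix string) (id int, ok bool) {
	rest := strings.TrimPrefix(opt, prefix)
	if rest == opt {
		// TrimPrefix returned the input unchanged, so the prefix is missing.
		return 0, false
	}
	n, err := strconv.Atoi(rest)
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	for _, opt := range []string{"uid=1000", "uid=abc", "mode=755"} {
		id, ok := parseIDOption(opt, "uid=")
		fmt.Println(opt, id, ok)
	}
}
```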
Markus Lehtonen	1d5c331042	configs/validate: looser validation for RDT Don't require CAT or MBA because we don't detect those correctly (we don't support L2 or L3DATA/L3CODE for example, and in the future possibly even more). With plain "ClosId mode" we don't really care: we assign the container to a pre-configured CLOS without trying to do anything smarter. Moreover, this was a duplicate/redundant check anyway, as for CAT and MBA there is another specific sanity check that is done if L3 or MB is specified in the config. Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>	2022-02-18 16:24:50 +02:00
Kir Kolyshkin	0d21515038	libct: remove Validator interface We only have one implementation of config validator, which is always used. It makes no sense to have Validator interface. Having validate.Validator field in Factory does not make sense for all the same reasons. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-02-03 11:40:29 -08:00
Sebastiaan van Stijn	7173c53906	Merge pull request #3350 from kolyshkin/mount-cmds libct: Mount: rm {Pre,Post}mountCmds	2022-02-03 10:12:56 +01:00
Kir Kolyshkin	1b14d97484	libct/configs: rm Windows TODO It's clear at this point that runc won't support Windows. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-01-26 18:59:47 -08:00
Kir Kolyshkin	0fec1c2d8c	libct: Mount: rm {Pre,Post}mountCmds Those were added by commit `59c5c3ac0` back in Apr 2015, but AFAICS were never used and are obsoleted by more generic container hooks (initially added by commit `05567f2c94` in Sep 2015). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-01-26 15:51:55 -08:00
Akihiro Suda	382eba4354	Support recursive mount attrs ("rro", "rnosuid", "rnodev", ...) The new mount option "rro" makes the mount point recursively read-only, by calling `mount_setattr(2)` with `MOUNT_ATTR_RDONLY` and `AT_RECURSIVE`. https://man7.org/linux/man-pages/man2/mount_setattr.2.html Requires kernel >= 5.12. The "rro" option string conforms to the proposal in util-linux/util-linux Issue 1501. Fix issue 2823 Similarly, this commit also adds the following mount options: - rrw - r[no]{suid,dev,exec,relatime,atime,strictatime,diratime,symfollow} Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2021-12-07 17:39:57 +09:00
Aleksa Sarai	cdce249635	merge branch 'pr-3057' Fraser Tweedale (1): chown cgroup to process uid in container namespace LGTMs: kolyshkin cyphar Closes #3057	2021-12-07 17:06:19 +11:00
Akihiro Suda	520702dac5	Add `runc features` command Fix issue 3274 See `types/features/features.go`. Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2021-11-30 16:40:39 +09:00
Fraser Tweedale	35d20c4e0b	chown cgroup to process uid in container namespace Delegating cgroups to the container enables more complex workloads, including systemd-based workloads. The OCI runtime-spec was recently updated to explicitly admit such delegation, through specification of cgroup ownership semantics: https://github.com/opencontainers/runtime-spec/pull/1123 Pursuant to the updated OCI runtime-spec, change the ownership of the container's cgroup directory and particular files therein, when using cgroups v2 and when the cgroupfs is to be mounted read/write. As a result of this change, systemd workloads can run in isolated user namespaces on OpenShift when the sandbox's cgroupfs is mounted read/write. It might be possible to implement this feature in other cgroup managers, but that work is deferred. Signed-off-by: Fraser Tweedale <ftweedal@redhat.com>	2021-11-30 08:52:59 +10:00
Mengjiao Liu	a9bb11ec3c	Fix the conversion of sysctl variable dots and slashes Signed-off-by: Mengjiao Liu <mengjiao.liu@daocloud.io>	2021-11-04 11:45:15 +08:00
Mengjiao Liu	0f933d54fe	Rename package validate_test to package validate Signed-off-by: Mengjiao Liu <mengjiao.liu@daocloud.io>	2021-11-04 11:45:15 +08:00
Kir Kolyshkin	972aea3af0	libct/configs/validate: allow / in sysctl names Runtime spec says: > sysctl (object, OPTIONAL) allows kernel parameters to be modified at > runtime for the container. For more information, see the sysctl(8) > man page. and sysctl(8) says: > variable > The name of a key to read from. An example is > kernel.ostype. The '/' separator is also accepted in place of a '.'. Apparently, runc config validator does not support sysctls with / as a separator. Fortunately this is a one-line fix. Add some more test data where / is used as a separator. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-10-29 09:45:55 -07:00
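One way to accept both separators is to normalise every name to the dotted form before validating it. The sketch below follows the usual sysctl(8) convention that the first separator decides the style: when it is '/', slashes and dots swap roles. The function name is an assumption and this is not necessarily how runc implements it.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeSysctlName (hypothetical) converts a slash-separated sysctl name
// to the dotted form. Dotted names are returned unchanged. In a
// slash-separated name a literal '.' (for example in an interface name)
// becomes '/' in the dotted form.
func normalizeSysctlName(name string) string {
	first := strings.IndexAny(name, "./")
	if first == -1 || name[first] == '.' {
		return name
	}
	return strings.Map(func(r rune) rune {
		switch r {
		case '.':
			return '/'
		case '/':
			return '.'
		}
		return r
	}, name)
}

func main() {
	for _, n := range []string{"kernel.ostype", "net/ipv4/ip_forward", "net/ipv4/conf/eth0.100/forwarding"} {
		fmt.Println(n, "->", normalizeSysctlName(n))
	}
}
```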
Akihiro Suda	4d17654479	Merge pull request #2576 from kinvolk/alban/userns-2484-take2 Open bind mount sources from the host userns	2021-10-28 14:50:33 +09:00
Kir Kolyshkin	5516294172	Remove io/ioutil use See https://golang.org/doc/go1.16#ioutil Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-10-14 13:46:02 -07:00
Alban Crequy	9c444070ec	Open bind mount sources from the host userns The source of the bind mount might not be accessible in a different user namespace because a component of the source path might not be traversed under the users and groups mapped inside the user namespace. This caused errors such as the following: # time="2020-06-22T13:48:26Z" level=error msg="container_linux.go:367: starting container process caused: process_linux.go:459: container init caused: rootfs_linux.go:58: mounting \"/tmp/busyboxtest/source-inaccessible/dir\" to rootfs at \"/tmp/inaccessible\" caused: stat /tmp/busyboxtest/source-inaccessible/dir: permission denied" To solve this problem, this patch performs the following: 1. in nsexec.c, it opens the source path in the host userns (so we have the right permissions to open it) but in the container mntns (so the kernel cross mntns mount check lets us mount it later: https://github.com/torvalds/linux/blob/v5.8/fs/namespace.c#L2312). 2. in nsexec.c, it passes the file descriptors of the source to the child process with SCM_RIGHTS. 3. In runc-init in Golang, it finishes the mounts while inside the userns even without access to some components of the source paths. Passing the fds with SCM_RIGHTS is necessary because once the child process is in the container mntns, it is already in the container userns so it cannot temporarily join the host mntns. This patch uses the existing mechanism with _LIBCONTAINER_* environment variables to pass the file descriptors from runc to runc init. This patch uses the existing mechanism with the Netlink-style bootstrap to pass information about the list of source mounts to nsexec.c. Rootless containers don't use this bind mount sources fdpassing mechanism because we can't setns() to the target mntns in a rootless container (we don't have the privileges when we are in the host userns). This patch takes care of using O_CLOEXEC on mount fds, and closes them early. Fixes: #2484. 
Signed-off-by: Alban Crequy <alban@kinvolk.io> Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>	2021-10-12 15:13:45 +02:00
Kir Kolyshkin	097c6d7425	libct/cg: simplify getting cgroup manager 1. Make Rootless and Systemd flags part of config.Cgroups. 2. Make all cgroup managers (not just fs2) return error (so it can do more initialization -- added by the following commits). 3. Replace complicated cgroup manager instantiation in factory_linux by a single (and simple) libcontainer/cgroups/manager.New() function. 4. getUnifiedPath is simplified to check that only a single path is supplied (rather than checking that other paths, if supplied, are the same). [v2: can't -> cannot] Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-09-23 09:11:44 -07:00


293 Commits