zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2026-04-22 23:17:17 +08:00

Author	SHA1	Message	Date
Aleksa Sarai	ca509e76ff	libct: move cmsg helpers to new internal/cmsg package These helpers all make more sense as a self-contained package and moving them has the added benefit of removing an unneeded libpathrs dependency (from libcontainer/utils's import of pathrs-lite) from several test binaries. Signed-off-by: Aleksa Sarai <aleksa@amutable.com>	2026-04-08 01:21:41 +10:00
Kir Kolyshkin	5996fe143a	Revert "Preventing containers from being unable to be deleted" This fixes random failures to start a container in conmon integration tests (see issue 5151). I guess we need to find another way to fix issue 4645. This reverts commit `1b39997e73`. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-03-05 18:44:30 -08:00
Antti Kervinen	700c944c4d	libct: fix resetting CPU affinity unix.CPUSet is limited to 1024 CPUs. Calling unix.SchedSetaffinity(pid, cpuset) removes all CPUs starting from 1024 from allowed CPUs of pid, even if cpuset is all ones. As a consequence, when runc tries to reset CPU affinity to "allow all" by default, it prevents all containers from CPUs 1024 onwards. This change uses a huge CPU mask to play safe and get all possible CPUs enabled with a single sched_setaffinity call. Fixes: #5023 Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>	2026-03-04 13:06:33 -08:00
Kir Kolyshkin	6c07a37a58	libct: prepareCgroupFD: fall back to container init cgroup Previously, when prepareCgroupFD could not open the container's cgroup (as configured in config.json and saved to state.json), it returned a fatal error, as we presumed a container can't exist without its own cgroup. Apparently, it can. In a case when container is configured without cgroupns (i.e. it uses the host's cgroups), and /sys/fs/cgroup is mounted read-write, a rootful container's init can move itself to an entirely different cgroup (even a new one that it just created), and then the original container cgroup is removed by the kernel (or systemd?) as it has no processes left. By the way, from the systemd point of view the container is gone. And yet it is still there, and users want runc exec to work! And it worked, thanks to the "let's try container init's cgroup" fallback as added by commit `c91fe9aeba` ("cgroup2: exec: join the cgroup of the init process on EBUSY"). The fallback was added for an entirely different reason, but it happened to work in this very case, too. This behavior was broken with the introduction of CLONE_INTO_CGROUP support. While it is debatable whether this is a valid scenario when a container moves itself into a different cgroup, this very setup is used by e.g. buildkitd running in a privileged Kubernetes container (see issue 5089). To restore the way things are expected to work, add the same "try container init's cgroup" fallback into prepareCgroupFD. While at it, simplify the code flow. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-02-11 11:57:25 -08:00
Kir Kolyshkin	1d030fab7d	libct: refactor addIntoCgroupV2, fix wrt rootless 1. Refactor addIntoCgroupV2 in an attempt to simplify it. 2. Fix the bug of not trying the init cgroup fallback if rootlessCgroup is set. This is a bug because rootlessCgroup tells to ignore cgroup join errors, not to never try the fallback. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-02-11 11:56:57 -08:00
Kir Kolyshkin	94133fab97	libct: factor out initProcessCgroupPath Separate initProcessCgroupPath code out of addIntoCgroupV2. To be used by the next patch. While at it, describe the new scenario in which the container's configured cgroup might not be available. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-02-11 11:52:59 -08:00
Kir Kolyshkin	cb31d62f1c	Fix exec vs Go 1.26 Since [PR 4812], runc exec tries to use clone3 syscall with CLONE_INTO_CGROUP, falling back to the old method if it is not supported. One issue with that approach is, a > Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run], > [Cmd.Output], or [Cmd.CombinedOutput] methods. (from https://pkg.go.dev/os/exec#Cmd). This is enforced since Go 1.26, see [CL 728642], and so runc exec actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP support). The easiest workaround is to pre-copy the p.cmd structure (copy = *cmd). From the [CL 734200] it looks like it is an acceptable way, but it might break in the future as it also copies the private fields, so let's do a proper field-by-field copy. If the upstream will add cmd.Clone method, we will switch to it. Also, we can probably be fine with a post-copy (once the first Start has failed), but let's be conservative here and do a pre-copy. [PR 4812]: https://github.com/opencontainers/runc/pull/4812 [CL 728642]: https://go.dev/cl/728642 [CL 734200]: https://go.dev/cl/734200 Reported-by: Efim Verzakov <efimverzakov@gmail.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-01-29 13:49:34 -08:00
Curd Becker	536e183451	Replace os.Is* error checking functions with their errors.Is counterpart Signed-off-by: Curd Becker <me@curd-becker.de>	2025-12-11 03:16:02 +01:00
Kir Kolyshkin	c24965b742	libct: newProcessComm: close fds on error Reported in issue 5008. Reported-by: Arina Cherednik <arinacherednik034@gmail.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-12-02 15:15:23 -08:00
Kir Kolyshkin	871052b791	libct: refactor setnsProcess.start Factor startWithCgroupFD out of start to reduce the start complexity. This also implements a more future-proof way of calling p.comm.closeChild. Co-authored-by: lifubang <lifubang@acmcoder.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-10-14 11:48:25 -07:00
Kir Kolyshkin	4e262509b8	libct: close child fds on prepareCgroupFD error The (*setns).start is supposed to close child fds once the child has started, or upon an error. Commit `5af4dd4e6` added a bug -- child fds are not closed if prepareCgroupFD fails. Fix by adding a missing call to closeChild. I'm not sure how to write a good test case for it. Found when working on PR 4928 (and tested in there). Fixes: `5af4dd4e6` Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-10-14 11:48:25 -07:00
Aleksa Sarai	93f9a392cf	libct: switch to (*CPUSet).Fill Now that we've updated to golang.org/x/sys@v0.37.0, CPUSet has a Fill helper that does the equivalent to our underflow trick to make setting all CPUs efficient. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2025-10-09 15:39:02 +11:00
Ariel Otilibili	4404cdf94b	libcontainer: switch goCreateMountSources() to ctx.AfterFunc `ba0b5e26` ("libcontainer: remove all mount logic from nsexec") introduced a request function that handles two tasks: - the exchanges with the request and response channels - the closing of the request channel. From 1.21, the closing of the request channel may be done with context.AfterFunc(). Moreover, context.AfterFunc() is guaranteed to run once. Link: https://pkg.go.dev/context#AfterFunc Suggested-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Ariel Otilibili <otilibil@eurecom.fr>	2025-10-02 16:46:31 +02:00
Kir Kolyshkin	5af4dd4e64	runc exec: use CLONE_INTO_CGROUP when available It makes sense to make runc exec benefit from clone3(CLONE_INTO_CGROUP), if it is available. Since it requires a recent kernel and might not work, implement a fallback to the older way of joining the cgroup. Based on: - https://go-review.googlesource.com/c/go/+/417695 - https://github.com/coreos/go-systemd/pull/458 - https://github.com/opencontainers/cgroups/pull/26 - https://github.com/opencontainers/runc/pull/4822 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-09-26 14:27:18 -07:00
Kir Kolyshkin	37b5acc2d7	libct: use manager.AddPid to add exec to cgroup The main benefit here is when we are using a systemd cgroup driver, we actually ask systemd to add a PID, rather than doing it ourselves. This way, we can add rootless exec PID to a cgroup. This requires newer opencontainers/cgroups and coreos/go-systemd. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-09-16 13:31:16 -07:00
Kir Kolyshkin	5730a141f1	libct: move exec sub-cgroup handling down the line Remove cgroupPaths field from struct setnsProcess, because: - we can get base cgroup paths from p.manager.GetPaths(); - we can get sub-cgroup paths from p.process.SubCgroupPaths. But mostly because we are going to need separate cgroup paths when adopting cgroups.AddPid. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-09-16 13:27:56 -07:00
Kir Kolyshkin	5560020cbb	libct: split addIntoCgroup into V1 and V2 The main idea is to maintain the code separately (and eventually kill V1 implementation). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-09-16 13:27:56 -07:00
Kir Kolyshkin	b39e0d6468	libct: factor out addIntoCgroup from setnsProcess.start Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-09-16 13:27:56 -07:00
Aleksa Sarai	121192ade6	libct: reset CPU affinity by default In certain deployments, it's possible for runc to be spawned by a process with a restrictive cpumask (such as from a systemd unit with CPUAffinity=... configured) which will be inherited by runc and thus the container process by default. The cpuset cgroup used to reconfigure the cpumask automatically for joining processes, but kcommit da019032819a ("sched: Enforce user requested affinity") changed this behaviour in Linux 6.2. The solution is to try to emulate the expected behaviour by resetting our cpumask to correspond with the configured cpuset (in the case of "runc exec", if the user did not configure an alternative one). Normally we would have to parse /proc/stat and /sys/fs/cgroup, but luckily sched_setaffinity(2) will transparently convert an all-set cpumask (even if it has more entries than the number of CPUs on the system) to the correct value for our usecase. For some reason, in our CI it seems that rootless --systemd-cgroup results in the cpuset (presumably temporarily?) being configured such that sched_setaffinity(2) will allow the full set of CPUs. For this particular case, all we care about is that it is different to the original set, so include some special-casing (but we should probably investigate this further...). Reported-by: ningmingxiao <ning.mingxiao@zte.com.cn> Reported-by: Martin Sivak <msivak@redhat.com> Reported-by: Peter Hunt <pehunt@redhat.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2025-08-28 08:25:46 +10:00
lfbzhm	4d4cedd650	Merge pull request #4796 from astrawind/fix/seccomp-agent-conn-leak libcontainer: close seccomp agent connection to prevent resource leaks	2025-07-04 00:08:21 +08:00
Pavel Liubimov	aa0e7989c4	libcontainer: close seccomp agent connection to prevent resource leaks Add missing defer conn.Close(). Signed-off-by: Pavel Liubimov <prlyubimov@gmail.com>	2025-07-01 12:31:55 +03:00
HirazawaUi	1b39997e73	Preventing containers from being unable to be deleted Signed-off-by: HirazawaUi <695097494plus@gmail.com>	2025-06-19 20:17:50 +08:00
Antonio Ojea	8d180e9658	Add support for Linux Network Devices Implement support for passing Linux Network Devices to the container network namespace. The network device is passed during the creation of the container, before the process is started. It implements the logic defined in the OCI runtime specification. Signed-off-by: Antonio Ojea <aojea@google.com>	2025-06-18 15:52:30 +01:00
Kir Kolyshkin	17570625c0	Use for range over integers This appears in Go 1.22 (see https://tip.golang.org/ref/spec#For_range). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-03-31 17:15:06 -07:00
lfbzhm	346c80d714	libct: replace unix.Kill with os.Process.Signal We should switch to unix.PidFDSendSignal on new kernels, and the Go runtime already supports it, so we don't need to add a fallback to unix.Kill here. Signed-off-by: lifubang <lifubang@acmcoder.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-03-07 14:07:47 -08:00
Kir Kolyshkin	10ca66bff5	runc exec: implement CPU affinity As per - https://github.com/opencontainers/runtime-spec/pull/1253 - https://github.com/opencontainers/runtime-spec/pull/1261 CPU affinity can be set in two ways: 1. When creating/starting a container, in config.json's Process.ExecCPUAffinity, which is then applied to all execs. 2. When running an exec, in process.json's CPUAffinity, which is applied to a given exec and overrides the value from (1). Add some basic tests. Note that older kernels (RHEL8, Ubuntu 20.04) change CPU affinity of a process to that of a container's cgroup, as soon as it is moved to that cgroup, while newer kernels (Ubuntu 24.04, Fedora 41) don't do that. Because of the above, - it's impossible to really test initial CPU affinity without adding debug logging to libcontainer/nsenter; - for older kernels, there can be a brief moment when exec's affinity is different than either initial or final affinity being set; - exec's final CPU affinity, if not specified, can be different depending on the kernel, therefore we don't test it. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-03-02 19:17:41 -08:00
Kir Kolyshkin	a75076b4a4	Switch to opencontainers/cgroups This removes libcontainer/cgroups packages and starts using those from github.com/opencontainers/cgroups repo. Mostly generated by: git rm -f libcontainer/cgroups find . -type f -name "*.go" -exec sed -i \ 's\|github.com/opencontainers/runc/libcontainer/cgroups\|github.com/opencontainers/cgroups\|g' \ {} + go get github.com/opencontainers/cgroups@v0.0.1 make vendor gofumpt -w . Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-02-28 15:20:33 -08:00
lfbzhm	d48d9cfefc	Merge pull request #4459 from kolyshkin/prio-nits Fixups to scheduler/priority settings	2024-12-25 23:41:27 +08:00
Kir Kolyshkin	5d3942eec3	libct: unify IOPriority setting For some reason, io priority is set in different places between runc start/run and runc exec: - for runc start/run, it is done in the middle of (linuxStandardInit).Init, close to the place where we exec runc init. - for runc exec, it is done much earlier, in (setnsProcess) start(). Let's move setIOPriority call for runc exec to (linuxSetnsInit).Init, so it is in the same logical place as for runc start/run. Also, move the function itself to init_linux.go as it's part of init. Should not have any visible effect, except part of runc init is run with a different I/O priority. While at it, rename setIOPriority to setupIOPriority, and make it accept the whole configs.Config, for uniformity with other similar functions. Fixes: `bfbd0305` ("Add I/O priority") Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-22 18:15:31 -08:00
Kir Kolyshkin	2dc3ea4b87	libct: simplify setIOPriority/setupScheduler calls Move the nil check inside, simplifying the callers. Fixes: `bfbd0305` ("Add I/O priority") Fixes: `770728e1` ("Support `process.scheduler`") Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-22 18:06:20 -08:00
Kir Kolyshkin	93091e6ac2	libct: don't pass SpecState to init unless needed SpecState field of initConfig is only needed to run hooks that are executed inside a container -- namely CreateContainer and StartContainer. If these hooks are not configured, there is no need to fill, marshal and unmarshal SpecState. While at it, inline updateSpecState as it is trivial and only has one user. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-22 17:52:15 -08:00
Kir Kolyshkin	8afeb58398	libct: add/use configs.HasHook This allows to omit a call to c.currentOCIState (which can be somewhat costly when there are many annotations) when the hooks of a given kind won't be run. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-22 17:47:09 -08:00
lfbzhm	171c414904	refactor init and setns process Introduce a common parent struct `containerProcess`, and let both `initProcess` and `setnsProcess` inherit from it. Signed-off-by: lfbzhm <lifubang@acmcoder.com>	2024-12-21 19:16:01 -08:00
Kir Kolyshkin	b5bdf592f2	libct: rm initWaiter This initWaiter logic was introduced by commit `4ecff8d9`, but since the logic of /proc/self/exe was moved out of runc init in commit `0e9a335`, this seems unnecessary to have initWaiter. Remove it. This essentially reverts commit `4ecff8d9`. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-10-17 08:05:42 -07:00
Kir Kolyshkin	30f8f51eab	runc create/run: warn on rootless + shared pidns + no cgroup Shared pid namespace means `runc kill` (or `runc delete -f`) have to kill all container processes, not just init. To do so, it needs a cgroup to read the PIDs from. If there is no cgroup, processes will be leaked, and so such configuration is bad and should not be allowed. To keep backward compatibility, though, let's merely warn about this for now. Alas, the only way to know if cgroup access is available is by returning an error from Manager.Apply. Amend fs cgroup managers to do so (systemd doesn't need it, since v1 can't work with rootless, and cgroup v2 does not have a special rootless case). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-09-17 22:49:37 -07:00
Sebastiaan van Stijn	30b530ca94	libct/userns: split userns detection from internal userns code Commit `4316df8b53` isolated RunningInUserNS to a separate package to make it easier to consume without bringing in additional dependencies, and with the potential to move it separate in a similar fashion as libcontainer/user was moved to a separate module in commit `ca32014adb`. While RunningInUserNS is fairly trivial to implement, it (or variants of this utility) is used in many codebases, and moving to a separate module could consolidate those implementations, as well as making it easier to consume without large dependency trees (when being a package as part of a larger code base). Commit `1912d5988b` and follow-ups introduced cgo code into the userns package, and code introduced in those commits are not intended for external use, therefore complicating the potential of moving the userns package separate. This commit moves the new code to a separate package; some of this code was included in v1.1.11 and up, but I could not find external consumers of `GetUserNamespaceMappings` and `IsSameMapping`. The `Mapping` and `Handles` types (added in `ba0b5e2698`) only exist in main and in non-stable releases (v1.2.0-rc.x), so don't need an alias / deprecation. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2024-06-30 20:06:30 +02:00
Kir Kolyshkin	1c505fffdc	Revert "Set temporary single CPU affinity..." There's too much logic here figuring out which CPUs to use. Runc is a low level tool and is not supposed to be that "smart". What's worse, this logic is executed on every exec, making it slower. Some of the logic in (*setnsProcess).start is executed even if no annotation is set, thus making ALL execs slow. Also, this should be a property of a process, rather than annotation. The plan is to rework this. This reverts commit `afc23e3397`. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-06-10 06:31:03 +08:00
lifubang	a853a82677	runc exec: setupRlimits after syscall.rlimit.init() completed Issue: https://github.com/opencontainers/runc/issues/4195 Since https://go-review.googlesource.com/c/go/+/476097, there is a get/set race between runc exec and syscall.rlimit.init, so we need to call setupRlimits after syscall.rlimit.init() completed. Signed-off-by: lifubang <lifubang@acmcoder.com>	2024-05-08 10:40:07 +00:00
Cédric Clerget	afc23e3397	Set temporary single CPU affinity before cgroup cpuset transition. This handles a corner case when joining a container having all the processes running exclusively on isolated CPU cores to force the kernel to schedule runc process on the first CPU core within the cgroups cpuset. The introduction of the kernel commit 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76 has affected this deterministic scheduling behavior by distributing tasks across CPU cores within the cgroups cpuset. Some intensive real-time applications rely on this deterministic behavior and use the first CPU core to run a slow thread while other CPU cores are fully used by real-time threads with SCHED_FIFO policy. Such applications prevent the runc process from joining a container when the runc process is randomly scheduled on a CPU core owned by a real-time thread. Introduces isolated CPU affinity transition OCI runtime annotation org.opencontainers.runc.exec.isolated-cpu-affinity-transition to restore the behavior during runc exec. Fix issue with kernel >= 6.2 not resetting CPU affinity for container processes. Signed-off-by: Cédric Clerget <cedric.clerget@gmail.com>	2024-04-16 08:59:49 +02:00
utam0k	bfbd0305ba	Add I/O priority Signed-off-by: utam0k <k0ma@utam0k.jp>	2024-03-30 22:31:54 +09:00
Aleksa Sarai	ba0b5e2698	libcontainer: remove all mount logic from nsexec With open_tree(OPEN_TREE_CLONE), it is possible to implement both the id-mapped mounts and bind-mount source file descriptor logic entirely in Go without requiring any complicated handling from nsexec. However, implementing it the naive way (do the OPEN_TREE_CLONE in the host namespace before the rootfs is set up -- which is what the existing implementation did) exposes issues in how mount ordering (in particular when handling mount sources from inside the container rootfs, but also in relation to mount propagation) was handled for idmapped mounts and bind-mount sources. In order to solve this problem completely, it is necessary to spawn a thread which joins the container mount namespace and provides mountfds when requested by the rootfs setup code (ensuring that the mount order and mount propagation of the source of the bind-mount are handled correctly). While the need to join the mount namespace leads to other complications (such as with the usage of /proc/self -- fixed in a later patch) the resulting code is still reasonable and is the only real way to solve the issue. This allows us to reduce the amount of C code we have in nsexec, as well as simplifying a whole host of places that were made more complicated with the addition of id-mapped mounts and the bind sourcefd logic. Because we join the container namespace, we can continue to use regular O_PATH file descriptors for non-id-mapped bind-mount sources (which means we don't have to raise the kernel requirement for that case). In addition, we can easily add support for id-mappings that don't match the container's user namespace. The approach taken here is to use Go's officially supported mechanism for spawning a process in a user namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a different process.
The most efficient way to implement this would be to do clone() in cgo directly to run a function that just does kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out this approach is too slow. It should be noted that the included micro-benchmark seems to indicate this is Fast Enough(TM): goos: linux goarch: amd64 pkg: github.com/opencontainers/runc/libcontainer/userns cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz BenchmarkSpawnProc BenchmarkSpawnProc-8 1670 770065 ns/op Fixes: `fda12ab101` ("Support idmap mounts on volumes") Fixes: `9c444070ec` ("Open bind mount sources from the host userns") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:40 +11:00
Aleksa Sarai	8da42aaec2	sync: split init config (stream) and synchronisation (seqpacket) pipes We have different requirements for the initial configuration and initWaiter pipe (just send netlink and JSON blobs with no complicated handling needed for message coalescing) and the packet-based synchronisation pipe. Tests with switching everything to SOCK_SEQPACKET lead to endless issues with runc hanging on start-up because random things would try to do short reads (which SOCK_SEQPACKET will not allow and the Go stdlib explicitly treats as a streaming source), so splitting it was the only reasonable solution. Even doing somewhat dodgy tricks such as adding a Read() wrapper which actually calls ReadPacket() and makes it seem like a stream source doesn't work -- and is a bit too magical. One upside is that doing it this way makes the difference between the modes clearer -- INITPIPE is still used for initWaiter synchronisation but aside from that all other synchronisation is done by SYNCPIPE. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-09-24 20:31:14 +08:00
Aleksa Sarai	ccc76713a7	sync: rename procResume -> procHooksDone The old name was quite confusing, and with the addition of the procMountPlease sync message there are now multiple sync messages that are related to "resuming" runc-init. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-09-24 20:02:11 +08:00
Kir Kolyshkin	41778ddc2c	Fix for host mount ns containers If the container does not have own mount namespace configured (i.e. it shares the mount namespace with the host), its "prestart" (obsoleted) and "createRuntime" hooks are called twice, and its cgroups and Intel RDT settings are also applied twice. The code being removed was originally added by commit `2f2764984` ("Move pre-start hooks after container mounts", Feb 17 2016). At that time, the syncParentHooks() was called from setupRootfs(), which was only used when the container config has mount namespace (NEWNS) enabled. Later, commit `244c9fc426` ("*: console rewrite", Jun 4 2016) split the relevant part of setupRootfs() into prepareRootfs(). It was still called conditionally (only if mount namespace was enabled). Finally, commit `91ca331474` ("chroot when no mount namespaces is provided", Jan 25 2018) removed the above condition, meaning prepareRootfs(), and thus syncParentHooks(), is now called for any container. Meaning, the special case for when mount namespace is not enabled is no longer needed. Remove it. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-28 12:45:39 -07:00
Kir Kolyshkin	6a4870e4ac	libct: better errors for hooks When a hook has failed, the error message looks like this: > error running hook: error running hook #1: exit status 1, stdout: ... The two problems here are: 1. it is impossible to know what kind of hook it was; 2. "error running hook" stuttering; Change that to > error running createContainer hook #1: exit status 1, stdout: ... Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-24 19:44:05 -07:00
Aleksa Sarai	f81ef1493d	libcontainer: sync: cleanup synchronisation code This includes quite a few cleanups and improvements to the way we do synchronisation. The core behaviour is unchanged, but switching to embedding json.RawMessage into the synchronisation structure will allow us to do more complicated synchronisation operations in future patches. The file descriptor passing through the synchronisation system feature will be used as part of the idmapped-mount and bind-mount-source features when switching that code to use the new mount API outside of nsexec.c. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-15 19:54:24 -07:00
Kir Kolyshkin	c6e7b1a8ec	libct: initProcess.start: fix sync logic The code in this function became quite complicated and not entirely correct over time. As a result, if an error is returned from parseSync, it might end up stuck waiting for the child to finish. 1. Let's not wait() for the child twice. We already do it in the defer statement (call p.terminate()) when we are returning an error. 2. Remove sentResume and sentRun since we do not want to check if these were sent or not. Instead, introduce and check seenProcReady, as procReady is always expected from runc init. 3. Eliminate the possibility to wrap nil as an error. 4. Make sure we always call shutdown on the sync socket, and do not let shutdown error shadow the ierr. This fixes the issue of stuck `runc run` with the optimization patch (sending procSeccompDone earlier) applied. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-15 19:54:24 -07:00
Kir Kolyshkin	9583b3d1c2	libct: move killing logic to container.Signal By default, the container has its own PID namespace, and killing (with SIGKILL) its init process from the parent PID namespace also kills all the other processes. Obviously, it does not work that way when the container is sharing its PID namespace with the host or another container, since init is no longer special (it's not PID 1). In this case, killing the container's init will result in a bunch of other processes left running (and thus the inability to remove the cgroup). The solution to the above problem is killing all the container processes, not just init. The problem with the current implementation is, the killing logic is implemented in libcontainer's initProcess.wait, and thus only available to libcontainer users, but not the runc kill command (which uses nonChildProcess.kill and does not use wait at all). So, some workarounds exist: - func destroy(c *Container) calls signalAllProcesses; - runc kill implements -a flag. This code became very tangled over time. Let's simplify things by moving the killing of all processes from initProcess.wait to container.Signal, and document the new behavior. In essence, this also makes `runc kill` automatically kill all container processes when the container does not have its own PID namespace. Document that as well. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-06-08 09:29:25 -07:00
Vipul Newaskar	9fc707e703	Fixed init state error variable Init State Error message was using the err variable instead of uerr, which has been fixed now. The error message should not show "nil" now. Signed-off-by: Vipul Newaskar <vipulnewaskar7@gmail.com>	2022-11-15 09:41:16 +05:30
Kir Kolyshkin	102b8abd26	libct: rm BaseContainer and Container interfaces The only implementation of these is linuxContainer. It does not make sense to have an interface with a single implementation, and we do not foresee other types of containers being added to runc. Remove BaseContainer and Container interfaces, moving their methods documentation to linuxContainer. Rename linuxContainer to Container. Adopt users from using interface to using struct. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-03-23 11:04:12 -07:00

150 Commits