Commit Graph

150 Commits

Author SHA1 Message Date
Aleksa Sarai ca509e76ff libct: move cmsg helpers to new internal/cmsg package
These helpers all make more sense as a self-contained package and moving
them has the added benefit of removing an unneeded libpathrs dependency
(from libcontainer/utils's import of pathrs-lite) from several test
binaries.

Signed-off-by: Aleksa Sarai <aleksa@amutable.com>
2026-04-08 01:21:41 +10:00
Kir Kolyshkin 5996fe143a Revert "Preventing containers from being unable to be deleted"
This fixes random failures to start a container in conmon integration
tests (see issue 5151).

I guess we need to find another way to fix issue 4645.

This reverts commit 1b39997e73.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-03-05 18:44:30 -08:00
Antti Kervinen 700c944c4d libct: fix resetting CPU affinity
unix.CPUSet is limited to 1024 CPUs. Calling
unix.SchedSetaffinity(pid, cpuset) removes all CPUs starting from 1024
from allowed CPUs of pid, even if cpuset is all ones. As a
consequence, when runc tries to reset CPU affinity to "allow all" by
default, it prevents all containers from CPUs 1024 onwards.

This change uses a huge CPU mask to play safe and get all possible
CPUs enabled with a single sched_setaffinity call.

Fixes: #5023

Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>
2026-03-04 13:06:33 -08:00
Kir Kolyshkin 6c07a37a58 libct: prepareCgroupFD: fall back to container init cgroup
Previously, when prepareCgroupFD would not open container's cgroup
(as configured in config.json and saved to state.json), it returned
a fatal error, as we presumed a container can't exist without its own
cgroup.

Apparently, it can. In a case when container is configured without
cgroupns (i.e. it uses hosts cgroups), and /sys/fs/cgroup is mounted
read-write, a rootful container's init can move itself to an entirely
different cgroup (even a new one that it just created), and then the
original container cgroup is removed by the kernel (or systemd?) as
it has no processes left. By the way, from the systemd point of view
the container is gone. And yet it is still there, and users want
runc exec to work!

And it worked, thanks to the "let's try container init's cgroup"
fallback as added by commit c91fe9aeba ("cgroup2: exec: join the
cgroup of the init process on EBUSY"). The fallback was added for
the entirely different reason, but it happened to work in this very
case, too.

This behavior was broken with the introduction of CLONE_INTO_CGROUP
support.

While it is debatable whether this is a valid scenario when a container
moves itself into a different cgroup, this very setup is used by e.g.
buildkitd running in a privileged kubernetes container (see issue 5089).

To restore the way things are expected to work, add the same "try
container init's cgroup" fallback into prepareCgroupFD.

While at it, simplify the code flow.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-02-11 11:57:25 -08:00
Kir Kolyshkin 1d030fab7d libct: refactor addIntoCgroupV2, fix wrt rootless
1. Refactor addIntoCgroupV2 in an attempt to simplify it.

2. Fix the bug of not trying the init cgroup fallback if
   rootlessCgroup is set. This is a bug because rootlessCgroup
   tells to ignore cgroup join errors, not to never try the fallback.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-02-11 11:56:57 -08:00
Kir Kolyshkin 94133fab97 libct: factor out initProcessCgroupPath
Separate initProcessCgroupPath code out of addIntoCgroupV2.
To be used by the next patch.

While at it, describe the new scenario in which the container's
configured cgroup might not be available.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-02-11 11:52:59 -08:00
Kir Kolyshkin cb31d62f1c Fix exec vs Go 1.26
Since [PR 4812], runc exec tries to use clone3 syscall with
CLONE_INTO_CGROUP, falling back to the old method if it is not
supported.

One issue with that approach is, a

> Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run],
> [Cmd.Output], or [Cmd.CombinedOutput] methods.

(from https://pkg.go.dev/os/exec#Cmd).

This is enforced since Go 1.26, see [CL 728642], and so runc exec
actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP
support).

The easiest workaround is to pre-copy the p.cmd structure (copy = *cmd).
From the [CL 734200] it looks like it is an acceptable way, but it might
break in the future as it also copies the private fields, so let's do a
proper field-by-field copy. If the upstream will add cmd.Clone method,
we will switch to it.

Also, we can probably be fine with a post-copy (once the first Start has
failed), but let's be conservative here and do a pre-copy.

[PR 4812]: https://github.com/opencontainers/runc/pull/4812
[CL 728642]: https://go.dev/cl/728642
[CL 734200]: https://go.dev/cl/734200

Reported-by: Efim Verzakov <efimverzakov@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-01-29 13:49:34 -08:00
Curd Becker 536e183451 Replace os.Is* error checking functions with their errors.Is counterpart
Signed-off-by: Curd Becker <me@curd-becker.de>
2025-12-11 03:16:02 +01:00
Kir Kolyshkin c24965b742 libct: newProcessComm: close fds on error
Reported in issue 5008.

Reported-by: Arina Cherednik <arinacherednik034@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-12-02 15:15:23 -08:00
Kir Kolyshkin 871052b791 libct: refactor setnsProcess.start
Factor startWithCgroupFD out of start to reduce the start complexity.
This also implements a more future-proof way of calling p.comm.closeChild.

Co-authored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-10-14 11:48:25 -07:00
Kir Kolyshkin 4e262509b8 libct: close child fds on prepareCgroupFD error
The (*setns).start is supposed to close child fds once the child has
started, or upon an error. Commit 5af4dd4e6 added a bug -- child fds
are not closed if prepareCgroupFD fails.

Fix by adding a missing call to closeChild.

I'm not sure how to write a good test case for it. Found when working
on PR 4928 (and tested in there).

Fixes: 5af4dd4e6
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-10-14 11:48:25 -07:00
Aleksa Sarai 93f9a392cf libct: switch to (*CPUSet).Fill
Now that we've updated to golang.org/x/sys@v0.37.0, CPUSet has a Fill
helper that does the equivalent to our underflow trick to make setting
all CPUs efficient.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-10-09 15:39:02 +11:00
Ariel Otilibili 4404cdf94b libcontainer: switch goCreateMountSources() to ctx.AfterFunc
ba0b5e26 ("libcontainer: remove all mount logic from nsexec") introduced
a request function that handles two tasks:
- the exchanges with the request and response channels
- the closing of the request channel.

From 1.21, the closing of the request channel may be done with
context.AfterFunc(). Moreover, context.AfterFunc() is guaranteed to run
once.

Link: https://pkg.go.dev/context#AfterFunc
Suggested-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Ariel Otilibili <otilibil@eurecom.fr>
2025-10-02 16:46:31 +02:00
Kir Kolyshkin 5af4dd4e64 runc exec: use CLONE_INTO_CGROUP when available
It makes sense to make runc exec benefit from clone2(CLONE_INTO_CGROUP),
if it is available. Since it requires a recent kernel and might not work,
implement a fallback to older way of joining the cgroup.

Based on:
 - https://go-review.googlesource.com/c/go/+/417695
 - https://github.com/coreos/go-systemd/pull/458
 - https://github.com/opencontainers/cgroups/pull/26
 - https://github.com/opencontainers/runc/pull/4822

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-09-26 14:27:18 -07:00
Kir Kolyshkin 37b5acc2d7 libct: use manager.AddPid to add exec to cgroup
The main benefit here is when we are using a systemd cgroup driver,
we actually ask systemd to add a PID, rather than doing it ourselves.
This way, we can add rootless exec PID to a cgroup.

This requires newer opencontainers/cgroups and coreos/go-systemd.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-09-16 13:31:16 -07:00
Kir Kolyshkin 5730a141f1 libct: move exec sub-cgroup handling down the line
Remove cgroupPaths field from struct setnsProcess, because:
 - we can get base cgroup paths from p.manager.GetPaths();
 - we can get sub-cgroup paths from p.process.SubCgroupPaths.

But mostly because we are going to need separate cgroup paths when
adopting cgroups.AddPid.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-09-16 13:27:56 -07:00
Kir Kolyshkin 5560020cbb libct: split addIntoCgroup into V1 and V2
The main idea is to maintain the code separately (and eventually kill V1
implementation).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-09-16 13:27:56 -07:00
Kir Kolyshkin b39e0d6468 libct: factor out addIntoCgroup from setnsProcess.start
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-09-16 13:27:56 -07:00
Aleksa Sarai 121192ade6 libct: reset CPU affinity by default
In certain deployments, it's possible for runc to be spawned by a
process with a restrictive cpumask (such as from a systemd unit with
CPUAffinity=... configured) which will be inherited by runc and thus the
container process by default.

The cpuset cgroup used to reconfigure the cpumask automatically for
joining processes, but kcommit da019032819a ("sched: Enforce user
requested affinity") changed this behaviour in Linux 6.2.

The solution is to try to emulate the expected behaviour by resetting
our cpumask to correspond with the configured cpuset (in the case of
"runc exec", if the user did not configure an alternative one). Normally
we would have to parse /proc/stat and /sys/fs/cgroup, but luckily
sched_setaffinity(2) will transparently convert an all-set cpumask (even
if it has more entries than the number of CPUs on the system) to the
correct value for our usecase.

For some reason, in our CI it seems that rootless --systemd-cgroup
results in the cpuset (presumably temporarily?) being configured such
that sched_setaffinity(2) will allow the full set of CPUs. For this
particular case, all we care about is that it is different to the
original set, so include some special-casing (but we should probably
investigate this further...).

Reported-by: ningmingxiao <ning.mingxiao@zte.com.cn>
Reported-by: Martin Sivak <msivak@redhat.com>
Reported-by: Peter Hunt <pehunt@redhat.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-08-28 08:25:46 +10:00
lfbzhm 4d4cedd650 Merge pull request #4796 from astrawind/fix/seccomp-agent-conn-leak
libcontainer: close seccomp agent connection to prevent resource leaks
2025-07-04 00:08:21 +08:00
Pavel Liubimov aa0e7989c4 libcontainer: close seccomp agent connection to prevent resource leaks
Add missing defer conn.Close().

Signed-off-by: Pavel Liubimov <prlyubimov@gmail.com>
2025-07-01 12:31:55 +03:00
HirazawaUi 1b39997e73 Preventing containers from being unable to be deleted
Signed-off-by: HirazawaUi <695097494plus@gmail.com>
2025-06-19 20:17:50 +08:00
Antonio Ojea 8d180e9658 Add support for Linux Network Devices
Implement support for passing Linux Network Devices to the container
network namespace.

The network device is passed during the creation of the container,
before the process is started.

It implements the logic defined in the OCI runtime specification.

Signed-off-by: Antonio Ojea <aojea@google.com>
2025-06-18 15:52:30 +01:00
Kir Kolyshkin 17570625c0 Use for range over integers
This appears in Go 1.22 (see https://tip.golang.org/ref/spec#For_range).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-31 17:15:06 -07:00
lfbzhm 346c80d714 libct: replace unix.Kill with os.Process.Signal
Because we should switch to unix.PidFDSendSignal in new kernels, it has
been supported in go runtime. We don't need to add fall back to
unix.Kill code here.

Signed-off-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-07 14:07:47 -08:00
Kir Kolyshkin 10ca66bff5 runc exec: implement CPU affinity
As per
- https://github.com/opencontainers/runtime-spec/pull/1253
- https://github.com/opencontainers/runtime-spec/pull/1261

CPU affinity can be set in two ways:
1. When creating/starting a container, in config.json's
   Process.ExecCPUAffinity, which is when applied to all execs.
2. When running an exec, in process.json's CPUAffinity, which
   applied to a given exec and overrides the value from (1).

Add some basic tests.

Note that older kernels (RHEL8, Ubuntu 20.04) change CPU affinity of a
process to that of a container's cgroup, as soon as it is moved to that
cgroup, while newer kernels (Ubuntu 24.04, Fedora 41) don't do that.

Because of the above,
 - it's impossible to really test initial CPU affinity without adding
   debug logging to libcontainer/nsenter;
 - for older kernels, there can be a brief moment when exec's affinity
   is different than either initial or final affinity being set;
 - exec's final CPU affinity, if not specified, can be different
   depending on the kernel, therefore we don't test it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-02 19:17:41 -08:00
Kir Kolyshkin a75076b4a4 Switch to opencontainers/cgroups
This removes libcontainer/cgroups packages and starts
using those from github.com/opencontainers/cgroups repo.

Mostly generated by:

  git rm -f libcontainer/cgroups

  find . -type f -name "*.go" -exec sed -i \
    's|github.com/opencontainers/runc/libcontainer/cgroups|github.com/opencontainers/cgroups|g' \
    {} +

  go get github.com/opencontainers/cgroups@v0.0.1
  make vendor
  gofumpt -w .

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-28 15:20:33 -08:00
lfbzhm d48d9cfefc Merge pull request #4459 from kolyshkin/prio-nits
Fixups to scheduler/priority settings
2024-12-25 23:41:27 +08:00
Kir Kolyshkin 5d3942eec3 libct: unify IOPriority setting
For some reason, io priority is set in different places between runc
start/run and runc exec:

 - for runc start/run, it is done in the middle of (*linuxStandardInit).Init,
   close to the place where we exec runc init.
 - for runc exec, it is done much earlier, in (*setnsProcess) start().

Let's move setIOPriority call for runc exec to (*linuxSetnsInit).Init,
so it is in the same logical place as for runc start/run.

Also, move the function itself to init_linux.go as it's part of init.

Should not have any visible effect, except part of runc init is run with
a different I/O priority.

While at it, rename setIOPriority to setupIOPriority, and make it accept
the whole *configs.Config, for uniformity with other similar functions.

Fixes: bfbd0305 ("Add I/O priority")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-22 18:15:31 -08:00
Kir Kolyshkin 2dc3ea4b87 libct: simplify setIOPriority/setupScheduler calls
Move the nil check inside, simplifying the callers.

Fixes: bfbd0305 ("Add I/O priority")
Fixes: 770728e1 ("Support `process.scheduler`")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-22 18:06:20 -08:00
Kir Kolyshkin 93091e6ac2 libct: don't pass SpecState to init unless needed
SpecState field of initConfig is only needed to run hooks that are
executed inside a container -- namely CreateContainer and
StartContainer.

If these hooks are not configured, there is no need to fill, marshal and
unmarshal SpecState.

While at it, inline updateSpecState as it is trivial and only has one user.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-22 17:52:15 -08:00
Kir Kolyshkin 8afeb58398 libct: add/use configs.HasHook
This allows to omit a call to c.currentOCIState (which can be somewhat
costly when there are many annotations) when the hooks of a given kind
won't be run.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-22 17:47:09 -08:00
lfbzhm 171c414904 refactor init and setns process
Introduce a common parent struct `containerProcess`,
let both `initProcess` and `setnsProcess` are inherited
from it.

Signed-off-by: lfbzhm <lifubang@acmcoder.com>
2024-12-21 19:16:01 -08:00
Kir Kolyshkin b5bdf592f2 libct: rm initWaiter
This initWaiter logic was introduced by commit 4ecff8d9, but since the logic of
/proc/self/exe was moved out of runc init in commit 0e9a335, this
seems unnecessary to have initWaiter.

Remove it.

This essentially reverts commit 4ecff8d9.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-10-17 08:05:42 -07:00
Kir Kolyshkin 30f8f51eab runc create/run: warn on rootless + shared pidns + no cgroup
Shared pid namespace means `runc kill` (or `runc delete -f`) have to
kill all container processes, not just init. To do so, it needs a cgroup
to read the PIDs from.

If there is no cgroup, processes will be leaked, and so such
configuration is bad and should not be allowed. To keep backward
compatibility, though, let's merely warn about this for now.

Alas, the only way to know if cgroup access is available is by returning
an error from Manager.Apply. Amend fs cgroup managers to do so (systemd
doesn't need it, since v1 can't work with rootless, and cgroup v2 does
not have a special rootless case).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-09-17 22:49:37 -07:00
Sebastiaan van Stijn 30b530ca94 libct/userns: split userns detection from internal userns code
Commit 4316df8b53 isolated RunningInUserNS
to a separate package to make it easier to consume without bringing in
additional dependencies, and with the potential to move it separate in
a similar fashion as libcontainer/user was moved to a separate module
in commit ca32014adb. While RunningInUserNS
is fairly trivial to implement, it (or variants of this utility) is used
in many codebases, and moving to a separate module could consolidate
those implementations, as well as making it easier to consume without
large dependency trees (when being a package as part of a larger code
base).

Commit 1912d5988b and follow-ups introduced
cgo code into the userns package, and code introduced in those commits
are not intended for external use, therefore complicating the potential
of moving the userns package separate.

This commit moves the new code to a separate package; some of this code
was included in v1.1.11 and up, but I could not find external consumers
of `GetUserNamespaceMappings` and `IsSameMapping`. The `Mapping` and
`Handles` types (added in ba0b5e2698) only
exist in main and in non-stable releases (v1.2.0-rc.x), so don't need
an alias / deprecation.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2024-06-30 20:06:30 +02:00
Kir Kolyshkin 1c505fffdc Revert "Set temporary single CPU affinity..."
There's too much logic here figuring out which CPUs to use. Runc is a
low level tool and is not supposed to be that "smart". What's worse,
this logic is executed on every exec, making it slower. Some of the
logic in (*setnsProcess).start is executed even if no annotation is set,
thus making ALL execs slow.

Also, this should be a property of a process, rather than annotation.

The plan is to rework this.

This reverts commit afc23e3397.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-06-10 06:31:03 +08:00
lifubang a853a82677 runc exec: setupRlimits after syscall.rlimit.init() completed
Issue: https://github.com/opencontainers/runc/issues/4195
Since https://go-review.googlesource.com/c/go/+/476097, there is
a get/set race between runc exec and syscall.rlimit.init, so we
need to call setupRlimits after syscall.rlimit.init() completed.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2024-05-08 10:40:07 +00:00
Cédric Clerget afc23e3397 Set temporary single CPU affinity before cgroup cpuset transition.
This handles a corner case when joining a container having all
the processes running exclusively on isolated CPU cores to force
the kernel to schedule runc process on the first CPU core within the
cgroups cpuset.

The introduction of the kernel commit
46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76 has affected this deterministic
scheduling behavior by distributing tasks across CPU cores within the
cgroups cpuset. Some intensive real-time application are relying on this
deterministic behavior and use the first CPU core to run a slow thread
while other CPU cores are fully used by real-time threads with SCHED_FIFO
policy. Such applications prevents runc process from joining a container
when the runc process is randomly scheduled on a CPU core owned by a
real-time thread.

Introduces isolated CPU affinity transition OCI runtime annotation
org.opencontainers.runc.exec.isolated-cpu-affinity-transition to restore
the behavior during runc exec.

Fix issue with kernel >= 6.2 not resetting CPU affinity for container processes.

Signed-off-by: Cédric Clerget <cedric.clerget@gmail.com>
2024-04-16 08:59:49 +02:00
utam0k bfbd0305ba Add I/O priority
Signed-off-by: utam0k <k0ma@utam0k.jp>
2024-03-30 22:31:54 +09:00
Aleksa Sarai ba0b5e2698 libcontainer: remove all mount logic from nsexec
With open_tree(OPEN_TREE_CLONE), it is possible to implement both the
id-mapped mounts and bind-mount source file descriptor logic entirely in
Go without requiring any complicated handling from nsexec.

However, implementing it the naive way (do the OPEN_TREE_CLONE in the
host namespace before the rootfs is set up -- which is what the existing
implementation did) exposes issues in how mount ordering (in particular
when handling mount sources from inside the container rootfs, but also
in relation to mount propagation) was handled for idmapped mounts and
bind-mount sources. In order to solve this problem completely, it is
necessary to spawn a thread which joins the container mount namespace
and provides mountfds when requested by the rootfs setup code (ensuring
that the mount order and mount propagation of the source of the
bind-mount are handled correctly). While the need to join the mount
namespace leads to other complicated (such as with the usage of
/proc/self -- fixed in a later patch) the resulting code is still
reasonable and is the only real way to solve the issue.

This allows us to reduce the amount of C code we have in nsexec, as well
as simplifying a whole host of places that were made more complicated
with the addition of id-mapped mounts and the bind sourcefd logic.
Because we join the container namespace, we can continue to use regular
O_PATH file descriptors for non-id-mapped bind-mount sources (which
means we don't have to raise the kernel requirement for that case).

In addition, we can easily add support for id-mappings that don't match
the container's user namespace. The approach taken here is to use Go's
officially supported mechanism for spawning a process in a user
namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a
different process. The most efficient way to implement this would be to
do clone() in cgo directly to run a function that just does
kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out
this approach is too slow. It should be noted that the included
micro-benchmark seems to indicate this is Fast Enough(TM):

  goos: linux
  goarch: amd64
  pkg: github.com/opencontainers/runc/libcontainer/userns
  cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
  BenchmarkSpawnProc
  BenchmarkSpawnProc-8        1670            770065 ns/op

Fixes: fda12ab101 ("Support idmap mounts on volumes")
Fixes: 9c444070ec ("Open bind mount sources from the host userns")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:40 +11:00
Aleksa Sarai 8da42aaec2 sync: split init config (stream) and synchronisation (seqpacket) pipes
We have different requirements for the initial configuration and
initWaiter pipe (just send netlink and JSON blobs with no complicated
handling needed for message coalescing) and the packet-based
synchronisation pipe.

Tests with switching everything to SOCK_SEQPACKET lead to endless issues
with runc hanging on start-up because random things would try to do
short reads (which SOCK_SEQPACKET will not allow and the Go stdlib
explicitly treats as a streaming source), so splitting it was the only
reasonable solution. Even doing somewhat dodgy tricks such as adding a
Read() wrapper which actually calls ReadPacket() and makes it seem like
a stream source doesn't work -- and is a bit too magical.

One upside is that doing it this way makes the difference between the
modes clearer -- INITPIPE is still used for initWaiter syncrhonisation
but aside from that all other synchronisation is done by SYNCPIPE.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-09-24 20:31:14 +08:00
Aleksa Sarai ccc76713a7 sync: rename procResume -> procHooksDone
The old name was quite confusing, and with the addition of the
procMountPlease sync message there are now multiple sync messages that
are related to "resuming" runc-init.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-09-24 20:02:11 +08:00
Kir Kolyshkin 41778ddc2c Fix for host mount ns containers
If the container does not have own mount namespace configured (i.e. it
shares the mount namespace with the host), its "prestart" (obsoleted)
and "createRuntime" hooks are called twice, and its cgroups and Intel
RDT settings are also applied twice.

The code being removed was originally added by commit 2f2764984 ("Move
pre-start hooks after container mounts", Feb 17 2016). At that time,
the syncParentHooks() was called from setupRootfs(), which was only
used when the container config has mount namespace (NEWNS) enabled.

Later, commit 244c9fc426 ("*: console rewrite", Jun 4 2016) spli
the relevant part of setupRootfs() into prepareRootfs(). It was still
called conditionally (only if mount namespace was enabled).

Finally, commit 91ca331474 ("chroot when no mount namespaces is
provided", Jan 25 2018) removed the above condition, meaning
prepareRootfs(), and thus syncParentHooks(), is now called for any
container.

Meaning, the special case for when mount namespace is not enabled is no
longer needed.

Remove it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-28 12:45:39 -07:00
Kir Kolyshkin 6a4870e4ac libct: better errors for hooks
When a hook has failed, the error message looks like this:

> error running hook: error running hook #1: exit status 1, stdout: ...

The two problems here are:
1. it is impossible to know what kind of hook it was;
2. "error running hook" stuttering;

Change that to

> error running createContainer hook #1: exit status 1, stdout: ...

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-24 19:44:05 -07:00
Aleksa Sarai f81ef1493d libcontainer: sync: cleanup synchronisation code
This includes quite a few cleanups and improvements to the way we do
synchronisation. The core behaviour is unchanged, but switching to
embedding json.RawMessage into the synchronisation structure will allow
us to do more complicated synchronisation operations in future patches.

The file descriptor passing through the synchronisation system feature
will be used as part of the idmapped-mount and bind-mount-source
features when switching that code to use the new mount API outside of
nsexec.c.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-15 19:54:24 -07:00
Kir Kolyshkin c6e7b1a8ec libct: initProcess.start: fix sync logic
The code in this function became quite complicated and not entirely
correct over time. As a result, if an error is returned from parseSync,
it might end up stuck waiting for the child to finish.

1. Let's not wait() for the child twice. We already do it in the
defer statement (call p.terminate()) when we are returning an error.

2. Remove sentResume and sentRun since we do not want to check if
these were sent or not. Instead, introduce and check seenProcReady, as
procReady is always expected from runc init.

3. Eliminate the possibility to wrap nil as an error.

4. Make sure we always call shutdown on the sync socket, and do not let
   shutdown error shadow the ierr.

This fixes the issue of stuck `runc runc` with the optimization patch
(sending procSeccompDone earlier) applied.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-15 19:54:24 -07:00
Kir Kolyshkin 9583b3d1c2 libct: move killing logic to container.Signal
By default, the container has its own PID namespace, and killing (with
SIGKILL) its init process from the parent PID namespace also kills all
the other processes.

Obviously, it does not work that way when the container is sharing its
PID namespace with the host or another container, since init is no
longer special (it's not PID 1). In this case, killing container's init
will result in a bunch of other processes left running (and thus the
inability to remove the cgroup).

The solution to the above problem is killing all the container
processes, not just init.

The problem with the current implementation is, the killing logic is
implemented in libcontainer's initProcess.wait, and thus only available
to libcontainer users, but not the runc kill command (which uses
nonChildProcess.kill and does not use wait at all). So, some workarounds
exist:
 - func destroy(c *Container) calls signalAllProcesses;
 - runc kill implements -a flag.

This code became very tangled over time. Let's simplify things by moving
the killing all processes from initProcess.wait to container.Signal,
and documents the new behavior.

In essence, this also makes `runc kill` to automatically kill all container
processes when the container does not have its own PID namespace.
Document that as well.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-06-08 09:29:25 -07:00
Vipul Newaskar 9fc707e703 Fixed init state error variable
Init State Error message was using the err variable instead of uerr, which has been fixed now.
The error message should not show "nil" now.

Signed-off-by: Vipul Newaskar <vipulnewaskar7@gmail.com>
2022-11-15 09:41:16 +05:30
Kir Kolyshkin 102b8abd26 libct: rm BaseContainer and Container interfaces
The only implementation of these is linuxContainer. It does not make
sense to have an interface with a single implementation, and we do not
foresee other types of containers being added to runc.

Remove BaseContainer and Container interfaces, moving their methods
documentation to linuxContainer.

Rename linuxContainer to Container.

Adopt users from using interface to using struct.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-03-23 11:04:12 -07:00