3053 Commits

Author SHA1 Message Date
Sebastiaan van Stijn ba83c7c7d7 libcontainer/devices: add '//go:fix inline' directives
This allows users to automatically migrate to the new location
using `go fix`. It has some limitations, but can help smooth
the transition. For example, take this file:

```
package main

import (
	"github.com/opencontainers/runc/libcontainer/devices"
)

func main() {
	_, _ = devices.DeviceFromPath("a", "b")
	_, _ = devices.HostDevices()
	_, _ = devices.GetDevices("a")
}
```

Running `go fix -mod=readonly ./...` will migrate the code:

```
package main

import (
	devices0 "github.com/moby/sys/devices"
)

func main() {
	_, _ = devices0.DeviceFromPath("a", "b")
	_, _ = devices0.HostDevices()
	_, _ = devices0.GetDevices("a")
}
```

updates b345c78dca

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2026-04-04 19:36:43 +02:00
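As a rough illustration of the shim pattern that ba83c7c7d7 describes, the sketch below keeps an old function as a thin wrapper and marks it with the `//go:fix inline` directive. The names `OldDeviceName` and `NewDeviceName` are hypothetical and live in one package so the example is self-contained; the real shims forward to a different module.

```go
package main

import "fmt"

// NewDeviceName stands in for a function at its new location.
// Hypothetical: it is not part of runc or moby/sys.
func NewDeviceName(major, minor int) string {
	return fmt.Sprintf("%d:%d", major, minor)
}

// OldDeviceName is a compatibility shim kept for existing callers.
//
// Deprecated: use [NewDeviceName] instead.
//
//go:fix inline
func OldDeviceName(major, minor int) string {
	return NewDeviceName(major, minor)
}

func main() {
	// Until callers are migrated, the shim behaves exactly like the
	// function it forwards to.
	fmt.Println(OldDeviceName(8, 1) == NewDeviceName(8, 1))
}
```

Because the shim body is a single forwarding call, a tool that honours the directive can replace each call site with the body without changing behaviour.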
Aleksa Sarai b345c78dca libct/devices: deprecate in favour of moby/sys/devices
The libcontainer/devices package has been moved to moby/sys/devices, so
we can just point users to that and keep some compatibility shims around
until runc 1.6. We don't use it at all so there are no other changes
needed.

Signed-off-by: Aleksa Sarai <aleksa@amutable.com>
2026-04-02 22:54:14 +11:00
lfbzhm 5b094ed1ac libct: use preopened rootfs more
This uses preopened rootfs in Chdir and pivotRoot.

While at it, add O_PATH when opening oldroot in pivotRoot.

Co-authored-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: lfbzhm <lifubang@acmcoder.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-03-29 12:02:38 -07:00
Kir Kolyshkin 28cb321887 Pre-open container root directory
A lot of filesystem-related stuff happens inside the container root
directory, and until now we have referred to it by name. It makes sense
to pre-open it and use a *os.File handle instead.

Function names in internal/pathrs are kept as is for simplicity (and it
is an internal package), but they now accept root as *os.File.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-03-29 12:02:36 -07:00
Kir Kolyshkin 78b80677f6 libct: minor refactor in mountToRootfs
No change in functionality, just a preparation for the next patch.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-03-28 23:48:07 -07:00
Kir Kolyshkin 60352524d3 libct: mountCgroupV1: address TODO
Indeed, it does not make sense to prepend c.root once we started using
MkdirAllInRoot in commit 63c29081.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-03-28 23:48:07 -07:00
Aleksa Sarai 7b40afb6cc merge #5177 into opencontainers/runc:main
Li Fubang (3):
  test: check mount source fds are cleaned up with idmapped mounts
  libct: close mount source fd as soon as possible
  libct: add a nil check for mountError

LGTMs: kolyshkin rata cyphar
2026-03-28 17:32:21 +11:00
Kir Kolyshkin f00b2f9fd5 libct/exeseal: drop own F_SEAL_EXEC
Since golang.org/x/sys@v0.22 it is available from unix.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-03-20 15:57:55 -07:00
lifubang c77e71a3e7 libct: close mount source fd as soon as possible
This commit factors out setupAndMountToRootfs without changing any
logic. Use "Hide whitespace changes" during review to focus on the
actual changes.

The refactor ensures the mount source file descriptor is closed via
defer in each loop iteration, reducing the total number of open FDs
in runc. This helps avoid hitting the file descriptor limit under
high concurrency or when handling many mounts.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2026-03-20 01:09:49 +00:00
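The effect described in c77e71a3e7 (a `defer` that runs once per loop iteration because the loop body was factored out into its own function) can be sketched as follows. `fakeFD`, `mountOne` and `mountAll` are hypothetical stand-ins and do not reflect runc's actual code.

```go
package main

import "fmt"

// fakeFD is a hypothetical stand-in for a mount source file descriptor.
type fakeFD struct {
	open *int // shared counter of currently open descriptors
}

func openFD(open *int) *fakeFD {
	*open++
	return &fakeFD{open: open}
}

func (f *fakeFD) Close() { *f.open-- }

// mountOne handles a single mount. Because the defer lives in this
// function, the descriptor is closed before the next iteration starts.
func mountOne(open *int) int {
	fd := openFD(open)
	defer fd.Close()
	// Report how many descriptors are open while this mount is processed.
	return *open
}

// mountAll returns the highest number of simultaneously open descriptors.
func mountAll(mounts []string) int {
	open, peak := 0, 0
	for range mounts {
		if cur := mountOne(&open); cur > peak {
			peak = cur
		}
	}
	return peak
}

func main() {
	fmt.Println(mountAll([]string{"/proc", "/sys", "/dev"})) // prints 1
}
```

Had the `defer` been placed directly in the loop body of `mountAll`, every descriptor would stay open until `mountAll` returned.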
lifubang 0d0fd95731 libct: add a nil check for mountError
Signed-off-by: lifubang <lifubang@acmcoder.com>
2026-03-19 15:47:32 +00:00
Kir Kolyshkin 0079bee17f Support specs.LinuxSeccompFlagWaitKillableRecv
This adds support for WaitKillableRecv seccomp flag
(also known as SCMP_FLTATR_CTL_WAITKILL in libseccomp and
as SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV in the kernel).

This requires:
 - libseccomp >= 2.6.0
 - libseccomp-golang >= 0.11.0
 - linux kernel >= 5.19

Note that this flag does not make sense without NEW_LISTENER, and
the kernel returns EINVAL when SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV
is set but SECCOMP_FILTER_FLAG_NEW_LISTENER is not set.

For runc this means that .linux.seccomp.listenerPath should also be set,
and some of the seccomp rules should have SCMP_ACT_NOTIFY action. This
is why the flag is tested separately in seccomp-notify.bats.

At the moment the only adequate CI environment for this functionality is
Fedora 43. On all other platforms (including CentOS 10 and Ubuntu 24.04)
it is skipped similar to this:

> ok 251 runc run [seccomp] (SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) # skip requires libseccomp >= 2.6.0 and API level >= 7 (current version: 2.5.6, API level: 6)

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-03-16 10:48:42 -07:00
Kir Kolyshkin d2abe47689 libct/configs: exclude Relabel from json [un]marshaling
When deprecating Relabel field, its json attributes were mistakenly
removed, so now it is:
 - saved to JSON under "Relabel" (rather than "relabel");
 - won't be ignored if empty.

Let's fix it before it's too late.

Fixes: 8b2b5e94 ("libct: remove relabeling dead code")

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-03-10 14:13:11 -07:00
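To illustrate the two symptoms listed in d2abe47689, here is a small sketch of how `encoding/json` treats a field without a tag, with a lowercase `omitempty` tag, and with a `json:"-"` tag. The struct types are hypothetical, and using `json:"-"` to exclude the field is an assumption based on the commit title rather than a copy of the actual change.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// untagged mimics a field whose JSON attributes were removed.
type untagged struct {
	Relabel string
}

// tagged mimics the field with a lowercase name and omitempty.
type tagged struct {
	Relabel string `json:"relabel,omitempty"`
}

// excluded mimics a field hidden from JSON marshaling and unmarshaling.
type excluded struct {
	Relabel string `json:"-"`
}

// marshal returns the JSON encoding of v as a string.
func marshal(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	fmt.Println(marshal(untagged{}))             // {"Relabel":""}
	fmt.Println(marshal(tagged{}))               // {}
	fmt.Println(marshal(excluded{Relabel: "z"})) // {}
}
```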
Aleksa Sarai 5f3ac16d18 merge #5152 into opencontainers/runc:main
Kir Kolyshkin (1):
  libct: remove relabeling dead code

LGTMs: cyphar rata
2026-03-08 00:16:03 +09:00
Rodrigo Campos Catelin 2db0c5e8b1 Merge pull request #5155 from cyphar/intelrdt-improve-mkdir
libct: intelrdt: improve directory cleanup logic
2026-03-06 14:40:40 +01:00
Aleksa Sarai 1c35df9ea2 merge #5153 into opencontainers/runc:main
Kir Kolyshkin (1):
  Revert "Preventing containers from being unable to be deleted"

LGTMs: cyphar rata
2026-03-06 18:48:53 +09:00
Aleksa Sarai fbaf5e3161 libct: intelrdt: improve directory cleanup logic
It makes more sense to save whether we should clean up the directory
after it gets created (to avoid error cases deleting a different
directory) as well as tying this check to the existing os.ErrExist
check rather than doing an extra stat(2).

Fixes: e2baa3ad10 ("Intel RDT: update according to spec changes.")
Suggested-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2026-03-06 18:45:36 +09:00
Kir Kolyshkin 5996fe143a Revert "Preventing containers from being unable to be deleted"
This fixes random failures to start a container in conmon integration
tests (see issue 5151).

I guess we need to find another way to fix issue 4645.

This reverts commit 1b39997e73.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-03-05 18:44:30 -08:00
Kir Kolyshkin 8b2b5e9492 libct: remove relabeling dead code
There is no way to set Mount.Relabel field via OCI spec (config.json),
and so the relabeling code is never used.

My guess is that it's a leftover from the times when runc used to be part of Docker.

Remove it, and mark Relabel field as deprecated.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-03-05 14:57:21 -08:00
Ismo Puustinen e2baa3ad10 Intel RDT: update according to spec changes.
There is one proposed clarification to the OCI spec: the subdirectory
needs to be deleted. Runc already does that, but the clarification
requires removing the directory only if it was created by us.

Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2026-03-05 12:28:32 +11:00
Antti Kervinen 700c944c4d libct: fix resetting CPU affinity
unix.CPUSet is limited to 1024 CPUs. Calling
unix.SchedSetaffinity(pid, cpuset) removes all CPUs starting from 1024
from allowed CPUs of pid, even if cpuset is all ones. As a
consequence, when runc tries to reset CPU affinity to "allow all" by
default, it prevents all containers from using CPUs 1024 onwards.

This change uses a huge CPU mask to play safe and get all possible
CPUs enabled with a single sched_setaffinity call.

Fixes: #5023

Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>
2026-03-04 13:06:33 -08:00
Aleksa Sarai 625ef531b7 libct: devices: drop deprecated cgroup types
These were all marked deprecated in commit a75076b4a4 ("Switch to
opencontainers/cgroups") when we switched maintenance of our cgroup code
to opencontainers/cgroups.

Users have had ample time to switch to opencontainers/cgroups
themselves, so we can finally remove this.

Note that the whole libcontainer/devices package will be moved to
moby/sys in the near future, so this whole package will be marked
deprecated soon.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2026-03-05 00:04:34 +11:00
Aleksa Sarai 6a77ee7864 libct: remove deprecated MPOL_* constants
These were inadvertently added to our exported APIs by commit
eeda7bdf80cca ("Add memory policy support"). We couldn't remove them
from runc 1.4.x, but we deprecated them in commit 3741f9186d
("libct/configs: mark MPOL_* constants as deprecated") and marked them
for removal in runc 1.5. Users should never have used these in the first
place.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2026-03-05 00:04:34 +11:00
Aleksa Sarai 87b0804345 libct: remove deprecated HooksList.RunHooks
This was deprecated in commit e6a4870e4ac40 ("libct: better errors for
hooks"), and users have had ample time to migrate to Hooks.Run since.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2026-03-04 23:01:30 +11:00
Aleksa Sarai 8fd8e433f8 libct: config: remove deprecated cgroup types
These were all marked deprecated in commit a75076b4a4 ("Switch to
opencontainers/cgroups") when we switched maintenance of our cgroup code
to opencontainers/cgroups.

Users have had ample time to switch to opencontainers/cgroups
themselves, so we can finally remove this.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2026-03-04 23:01:16 +11:00
lfbzhm 4d4e064109 Merge pull request #5133 from kolyshkin/usec
libct/specconv: fix panic in initSystemdProps
2026-02-27 18:55:42 +08:00
lfbzhm 8de198f11d Merge pull request #5118 from kolyshkin/lint29
ci: bump golangci-lint to v2.10, fix some prealloc linter warnings
2026-02-27 12:37:12 +08:00
Kir Kolyshkin a48a7cef96 libct/specconv: fix panic in initSystemdProps
There is a chance of panic here -- eliminate it.

Add a test case (which panics before the fix).

Reported-by: Luke Hinds <luke@stacklok.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-02-26 18:26:46 -08:00
Kir Kolyshkin 392a221293 libct/specconv: TestInitSystemdProps: use t.Run
Use t.Run for individual tests. Add missing desc fields.

Best reviewed with --ignore-all-space.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-02-26 18:26:46 -08:00
Kir Kolyshkin 6a374e6c1d libcontainer: move example code out of README
Example code in README is outdated (especially since cgroups is moved to
a separate repository) and lacks proper import statements. And, since it
is not code, it is hard to keep it up to date.

Let's move it out to the example_test.go file and refer to it. Note we
still don't run it, but it will be compiled and linted in CI.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-02-26 09:36:56 -08:00
Kir Kolyshkin 5a6e1e18f9 Preallocate some slices
Fix *some* of the prealloc linter warnings. While it does not make sense
to address all warnings (or add prealloc to the list of linters we run
in CI), some do make sense.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-02-25 13:48:55 -08:00
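For readers unfamiliar with the prealloc linter mentioned in 5a6e1e18f9, the kind of change it suggests looks roughly like the sketch below: when the final length is known up front, the slice is created with that capacity so `append` never has to grow it. The `mount` type and `destinations` function are hypothetical.

```go
package main

import "fmt"

// mount is a hypothetical, trimmed-down mount description.
type mount struct {
	dest string
}

// destinations collects the destination of every mount. The result
// slice is preallocated because its final length is known in advance.
func destinations(mounts []mount) []string {
	dests := make([]string, 0, len(mounts))
	for _, m := range mounts {
		dests = append(dests, m.dest)
	}
	return dests
}

func main() {
	d := destinations([]mount{{dest: "/proc"}, {dest: "/sys"}})
	fmt.Println(len(d), cap(d)) // prints "2 2"
}
```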
Kir Kolyshkin 6c07a37a58 libct: prepareCgroupFD: fall back to container init cgroup
Previously, when prepareCgroupFD could not open the container's cgroup
(as configured in config.json and saved to state.json), it returned
a fatal error, as we presumed a container can't exist without its own
cgroup.

Apparently, it can. When a container is configured without
cgroupns (i.e. it uses the host's cgroups), and /sys/fs/cgroup is mounted
read-write, a rootful container's init can move itself to an entirely
different cgroup (even a new one that it just created), and then the
original container cgroup is removed by the kernel (or systemd?) as
it has no processes left. By the way, from the systemd point of view
the container is gone. And yet it is still there, and users want
runc exec to work!

And it worked, thanks to the "let's try container init's cgroup"
fallback as added by commit c91fe9aeba ("cgroup2: exec: join the
cgroup of the init process on EBUSY"). The fallback was added for
an entirely different reason, but it happened to work in this very
case, too.

This behavior was broken with the introduction of CLONE_INTO_CGROUP
support.

While it is debatable whether this is a valid scenario when a container
moves itself into a different cgroup, this very setup is used by e.g.
buildkitd running in a privileged kubernetes container (see issue 5089).

To restore the way things are expected to work, add the same "try
container init's cgroup" fallback into prepareCgroupFD.

While at it, simplify the code flow.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-02-11 11:57:25 -08:00
Kir Kolyshkin 1d030fab7d libct: refactor addIntoCgroupV2, fix wrt rootless
1. Refactor addIntoCgroupV2 in an attempt to simplify it.

2. Fix the bug of not trying the init cgroup fallback if
   rootlessCgroup is set. This is a bug because rootlessCgroup
   tells us to ignore cgroup join errors, not to never try the fallback.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-02-11 11:56:57 -08:00
Kir Kolyshkin 94133fab97 libct: factor out initProcessCgroupPath
Separate initProcessCgroupPath code out of addIntoCgroupV2.
To be used by the next patch.

While at it, describe the new scenario in which the container's
configured cgroup might not be available.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-02-11 11:52:59 -08:00
lifubang 5560d55bfd libct/specconv: fix partial clear of atime mount flags
When parsing mount options into recAttrSet and recAttrClr,
the code sets attr_clr to individual atime flags (e.g.
MOUNT_ATTR_NOATIME or MOUNT_ATTR_STRICTATIME) when clearing
atime attributes. However, this violates the kernel's
requirement documented in mount_setattr(2)[1]:

> Note that, since the access-time values are an enumeration
> rather than bit values, a caller wanting to transition to a
> different access-time setting cannot simply specify the
> access-time setting in attr_set, but must also include
> MOUNT_ATTR__ATIME in the attr_clr field.  The kernel will
> verify that MOUNT_ATTR__ATIME isn't partially set in
> attr_clr (i.e., either all bits in the MOUNT_ATTR__ATIME
> bit field are either set or clear), and that attr_set
> doesn't have any access-time bits set if MOUNT_ATTR__ATIME
> isn't set in attr_clr.

Passing only a single atime flag (e.g. MOUNT_ATTR_RELATIME) in
attr_clr causes mount_setattr() to fail with EINVAL.

This change ensures that whenever an atime mode is updated,
attr_clr includes MOUNT_ATTR__ATIME to properly reset the
entire access-time attribute field before applying the new mode.

[1] https://man7.org/linux/man-pages/man2/mount_setattr.2.html

Signed-off-by: lifubang <lifubang@acmcoder.com>
2026-02-06 03:30:55 +00:00
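The rule quoted from mount_setattr(2) in 5560d55bfd can be modelled in a few lines. In the sketch below the constants mirror the values in the kernel's uapi mount header, `kernelAccepts` is a simplified model of the quoted check (not kernel code), and `setAtime` is a hypothetical helper showing the corrected attr_set/attr_clr pair.

```go
package main

import "fmt"

// Access-time attribute values, mirroring the kernel's uapi definitions.
const (
	mountAttrAtime       = 0x70 // MOUNT_ATTR__ATIME (whole access-time field)
	mountAttrRelatime    = 0x00 // MOUNT_ATTR_RELATIME
	mountAttrNoatime     = 0x10 // MOUNT_ATTR_NOATIME
	mountAttrStrictatime = 0x20 // MOUNT_ATTR_STRICTATIME
)

// kernelAccepts is a simplified model of the access-time validation
// described in the man page excerpt.
func kernelAccepts(attrSet, attrClr uint64) bool {
	clrAtime := attrClr & mountAttrAtime
	// The access-time field must not be partially set in attr_clr.
	if clrAtime != 0 && clrAtime != mountAttrAtime {
		return false
	}
	// attr_set may carry access-time bits only if attr_clr clears the field.
	if attrSet&mountAttrAtime != 0 && clrAtime == 0 {
		return false
	}
	return true
}

// setAtime returns the attr_set/attr_clr pair for switching to mode:
// the whole access-time field is cleared before the new mode is set.
func setAtime(mode uint64) (attrSet, attrClr uint64) {
	return mode, mountAttrAtime
}

func main() {
	// Clearing a single atime flag is rejected by the model.
	fmt.Println(kernelAccepts(mountAttrStrictatime, mountAttrNoatime)) // false

	set, clr := setAtime(mountAttrStrictatime)
	fmt.Println(kernelAccepts(set, clr)) // true
}
```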
Kir Kolyshkin cb31d62f1c Fix exec vs Go 1.26
Since [PR 4812], runc exec tries to use clone3 syscall with
CLONE_INTO_CGROUP, falling back to the old method if it is not
supported.

One issue with that approach is that a

> Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run],
> [Cmd.Output], or [Cmd.CombinedOutput] methods.

(from https://pkg.go.dev/os/exec#Cmd).

This is enforced since Go 1.26, see [CL 728642], and so runc exec
actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP
support).

The easiest workaround is to pre-copy the p.cmd structure (copy = *cmd).
From the [CL 734200] it looks like it is an acceptable way, but it might
break in the future as it also copies the private fields, so let's do a
proper field-by-field copy. If upstream adds a cmd.Clone method,
we will switch to it.

Also, we can probably be fine with a post-copy (once the first Start has
failed), but let's be conservative here and do a pre-copy.

[PR 4812]: https://github.com/opencontainers/runc/pull/4812
[CL 728642]: https://go.dev/cl/728642
[CL 734200]: https://go.dev/cl/734200

Reported-by: Efim Verzakov <efimverzakov@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-01-29 13:49:34 -08:00
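A field-by-field copy of the kind cb31d62f1c describes might look like the sketch below. This is an illustration only and not runc's actual helper: which fields are copied is an assumption, and `demoClone` is a hypothetical function added to show usage without starting a process.

```go
package main

import (
	"fmt"
	"os/exec"
)

// cloneCmd returns a new exec.Cmd carrying the exported configuration
// fields of cmd. Private state is left at its zero value, so the copy
// can be started even after Start on the original has failed.
func cloneCmd(cmd *exec.Cmd) *exec.Cmd {
	return &exec.Cmd{
		Path:        cmd.Path,
		Args:        cmd.Args,
		Env:         cmd.Env,
		Dir:         cmd.Dir,
		Stdin:       cmd.Stdin,
		Stdout:      cmd.Stdout,
		Stderr:      cmd.Stderr,
		ExtraFiles:  cmd.ExtraFiles,
		SysProcAttr: cmd.SysProcAttr,
	}
}

// demoClone builds a command and a pre-copy of it, without running either.
func demoClone() (orig, cp *exec.Cmd) {
	orig = exec.Command("/bin/true", "--flag")
	return orig, cloneCmd(orig)
}

func main() {
	orig, cp := demoClone()
	fmt.Println(orig != cp, cp.Path == orig.Path, len(cp.Args) == len(orig.Args))
}
```

The pre-copy is taken before the first Start, so a fallback path can start the copy instead of reusing the original.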
Kir Kolyshkin 82b7597a26 libct: check cmd.Err after exec.Command call
Theoretically, exec.Command can set cmd.Err.

Practically, this should never happen (Linux, Go <= 1.26, exePath is
absolute), but in the unlikely case it does, let's fail early.

This is related to the cloneCmd (to be introduced by the following
commit) which chooses to not copy the Err field. Theoretically,
exec.Command can set Err and so the first call to cmd.Start will fail
(since Err != nil), and the second call to cmd.Start may succeed because
Err == nil. This scenario is highly unlikely, but better safe
than sorry.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-01-29 13:49:04 -08:00
Kir Kolyshkin 593ac3b7d9 libct: use pointers for Process methods
The Process type is quite big (currently 368 bytes on a 64 bit Linux)
and using non-pointer receivers in its methods results in copying which
is totally unnecessary.

Change the methods to use pointer receivers.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-01-26 14:17:46 -08:00
Kir Kolyshkin 6cd91f665e libct/configs: use pointers for Config methods
The Config type is quite big (currently 554 bytes on a 64 bit Linux)
and using non-pointer receivers in its methods results in copying which
is totally unnecessary.

Change the methods to use pointer receivers.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-01-26 14:17:44 -08:00
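The difference between value and pointer receivers that motivates 593ac3b7d9 and 6cd91f665e can be seen in the sketch below. `bigConfig` is a hypothetical type, padded only to make the copy cost visible; it is unrelated to the real Config or Process layout.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bigConfig is a hypothetical large struct.
type bigConfig struct {
	payload [512]byte
	hits    int
}

// touchByValue receives a copy of the whole struct, so the increment
// is applied to the copy and lost when the method returns.
func (c bigConfig) touchByValue() { c.hits++ }

// touchByPointer receives only a pointer, so no struct copy is made
// and the increment is visible to the caller.
func (c *bigConfig) touchByPointer() { c.hits++ }

func main() {
	var c bigConfig
	c.touchByValue()
	c.touchByPointer()
	fmt.Println(c.hits)                 // prints 1
	fmt.Println(unsafe.Sizeof(c) > 512) // prints true
}
```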
Kir Kolyshkin 2088e000eb libct/configs: Id -> ID
Rename a function parameter (containerId -> containerID) to avoid a
linter warning:

> var-naming: method parameter containerId should be containerID (revive)

In many other places, including config.json (.linux.uidMappings and
.gidMappings) it is already called containerID, so let's rename.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-01-26 14:16:19 -08:00
Kir Kolyshkin 652269729d libc/int: use strings.Builder
Generated by modernize@latest (v0.21.0).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-12-16 15:04:04 -08:00
Akihiro Suda 4dcda051da Merge pull request #5055 from kolyshkin/mpol-2
libct/configs: mark MPOL_* constants as deprecated
2025-12-16 10:39:09 +09:00
Curd Becker 536e183451 Replace os.Is* error checking functions with their errors.Is counterpart
Signed-off-by: Curd Becker <me@curd-becker.de>
2025-12-11 03:16:02 +01:00
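One practical reason for the replacement in 536e183451 is that the `os.Is*` helpers do not look through errors wrapped with `%w`, while `errors.Is` does. A minimal sketch, with the hypothetical helpers `wrapped` and `check`:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// wrapped returns a "not exist" error wrapped with extra context.
func wrapped() error {
	return fmt.Errorf("loading state: %w", fs.ErrNotExist)
}

// check reports what the legacy and the modern test say about the
// wrapped error.
func check() (legacy, modern bool) {
	err := wrapped()
	return os.IsNotExist(err), errors.Is(err, os.ErrNotExist)
}

func main() {
	legacy, modern := check()
	fmt.Println(legacy, modern) // prints "false true"
}
```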
Kir Kolyshkin 3741f9186d libct/configs: mark MPOL_* constants as deprecated
Alas, these new constants are already in v1.4.0 release so we can't
remove those right away, but we can mark them as deprecated now
and target removal for v1.5.0.

So,
 - mark them as deprecated;
 - redefine via unix.MPOL_* counterparts;
 - fix the validator code to use unix.MPOL_* directly.

This amends commit a0e809a8.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-12-08 15:36:29 -08:00
Kir Kolyshkin 8a9b4dcda6 libct: mountFd: close mountFile on error
Reported in issue 5008.

Reported-by: Arina Cherednik <arinacherednik034@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-12-02 15:15:23 -08:00
Kir Kolyshkin c24965b742 libct: newProcessComm: close fds on error
Reported in issue 5008.

Reported-by: Arina Cherednik <arinacherednik034@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-12-02 15:15:23 -08:00
Kir Kolyshkin 88f897160c libct: startInitialization: add defer close
This function calls Init, which normally never returns, so the defer only
works if there is an error and we can safely use it to close those fds
we opened. This was done for most but not all fds.

Reported in issue 5008.

Reported-by: Arina Cherednik <arinacherednik034@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-12-02 15:15:23 -08:00
Kir Kolyshkin 1f1ff4be06 Merge pull request #5051 from cyphar/libct-utils-deprecated
libct/utils: remove Deprecated functions
2025-12-02 15:06:01 -08:00
Akihiro Suda 64c3c8eea6 Merge pull request #4994 from kolyshkin/gofumpt-extra
Enable gofumpt extra rules
2025-11-28 09:30:57 +09:00
Aleksa Sarai a412bd93e9 libct/utils: remove Deprecated functions
These were all marked for deprecation in runc 1.5.0, so remove them now
to make sure we don't forget.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-28 11:11:11 +11:00
Aleksa Sarai 195e9551e4 pathrs: add MkdirAllParentInRoot helper
While CreateInRoot supports hallucinating the target path, we do not use
it directly when constructing device inode targets because we need to
have different handling for mknod and bind-mounts.

The solution is to simply have a more generic MkdirAllParentInRoot
helper that MkdirAll's the parent directory of the target path and then
allows the caller to create the trailing component however they like.
(This can be used by CreateInRoot internally as well!)

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-26 21:04:05 +11:00