This allows users to automaticaly migrate to the new location
using `go fix`. It has some limitations, but can help smoothen
the transition; for example, taking this file;
```
package main
import (
"github.com/opencontainers/runc/libcontainer/devices"
)
func main() {
_, _ = devices.DeviceFromPath("a", "b")
_, _ = devices.HostDevices()
_, _ = devices.GetDevices("a")
}
```
Running `go fix -mod=readonly ./...` will migrate the code;
```
package main
import (
devices0 "github.com/moby/sys/devices"
)
func main() {
_, _ = devices0.DeviceFromPath("a", "b")
_, _ = devices0.HostDevices()
_, _ = devices0.GetDevices("a")
}
```
updates b345c78dca
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The libcontainer/devices package has been moved to moby/sys/devices, so
we can just point users to that and keep some compatibility shims around
until runc 1.6. We don't use it at all so there are no other changes
needed.
Signed-off-by: Aleksa Sarai <aleksa@amutable.com>
This uses preopened rootfs in Chdir and pivotRoot.
While at it, add O_PATH when opening oldroot in pivotRoot.
Co-authored-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: lfbzhm <lifubang@acmcoder.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
A lot of filesystem-related stuff happens inside the container root
directory, and we have used its name before. It makes sense to pre-open
it and use a *os.File handle instead.
Function names in internal/pathrs are kept as is for simplicity (and it
is an internal package), but they now accept root as *os.File.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Indeed, it does not make sense to prepend c.root once we started using
MkdirAllInRoot in commit 63c29081.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Li Fubang (3):
test: check mount source fds are cleaned up with idmapped mounts
libct: close mount source fd as soon as possible
libct: add a nil check for mountError
LGTMs: kolyshkin rata cyphar
This commit factors out setupAndMountToRootfs without changing any
logic. Use "Hide whitespace changes" during review to focus on the
actual changes.
The refactor ensures the mount source file descriptor is closed via
defer in each loop iteration, reducing the total number of open FDs
in runc. This helps avoid hitting the file descriptor limit under
high concurrency or when handling many mounts.
Signed-off-by: lifubang <lifubang@acmcoder.com>
This adds support for WaitKillableRecv seccomp flag
(also known as SCMP_FLTATR_CTL_WAITKILL in libseccomp and
as SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV in the kernel).
This requires:
- libseccomp >= 2.6.0
- libseccomp-golang >= 0.11.0
- linux kernel >= 5.19
Note that this flag does not make sense without NEW_LISTENER, and
the kernel returns EINVAL when SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV
is set but SECCOMP_FILTER_FLAG_NEW_LISTENER is not set.
For runc this means that .linux.seccomp.listenerPath should also be set,
and some of the seccomp rules should have SCMP_ACT_NOTIFY action. This
is why the flag is tested separately in seccomp-notify.bats.
At the moment the only adequate CI environment for this functionality is
Fedora 43. On all other platforms (including CentOS 10 and Ubuntu 24.04)
it is skipped similar to this:
> ok 251 runc run [seccomp] (SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) # skip requires libseccomp >= 2.6.0 and API level >= 7 (current version: 2.5.6, API level: 6)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When deprecating Relabel field, its json attributes were mistakenly
removed, so now it is:
- saved to JSON under "Relabel" (rather than "relabel");
- won't be ignored if empty.
Let's fix it before it's too late.
Fixes: 8b2b5e94 ("libct: remove relabeling dead code")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
It makes more sense to save whether we should cleanup the directory
after it gets created (to avoid error cases deleting a different
directory) as well as tying this check to the existing os.ErrExist
check rather than doing an extra stat(2).
Fixes: e2baa3ad10 ("Intel RDT: update according to spec changes.")
Suggested-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This fixes random failures to start a container in conmon integration
tests (see issue 5151).
I guess we need to find another way to fix issue 4645.
This reverts commit 1b39997e73.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
There is no way to set Mount.Relabel field via OCI spec (config.json),
and so the relabeling code is never used.
My guess it's a leftover from times when runc used to be part of Docker.
Remove it, and mark Relabel field as deprecated.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
There is one proposed clarification to the OCI spec: the subdirectory
needs to be deleted. Runc already does that, but the clarification adds
for directory removal only if the directory was created by us.
Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
unix.CPUSet is limited to 1024 CPUs. Calling
unix.SchedSetaffinity(pid, cpuset) removes all CPUs starting from 1024
from allowed CPUs of pid, even if cpuset is all ones. As a
consequence, when runc tries to reset CPU affinity to "allow all" by
default, it prevents all containers from CPUs 1024 onwards.
This change uses a huge CPU mask to play safe and get all possible
CPUs enabled with a single sched_setaffinity call.
Fixes: #5023
Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>
These were all marked deprecated in commit a75076b4a4 ("Switch to
opencontainers/cgroups") when we switched maintenance of our cgroup code
to opencontainers/cgroups.
Users have had ample time to switch to opencontainers/cgroups
themselves, so we can finally remove this.
Note that the whole libcontainer/devices package will be moved to
moby/sys in the near future, so this whole package will be marked
deprecated soon.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
These were inadvertently added to our exported APIs by commit
eeda7bdf80cca ("Add memory policy support"). We couldn't remove them
from runc 1.4.x, but we deprecated them in commit 3741f9186d
("libct/configs: mark MPOL_* constants as deprecated") and marked them
for removal in runc 1.5. Users should never have used these in the first
place.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This was deprecated in commit e6a4870e4ac40 ("libct: better errors for
hooks"), and users have had ample time to migrate to Hooks.Run since.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
These were all marked deprecated in commit a75076b4a4 ("Switch to
opencontainers/cgroups") when we switched maintenance of our cgroup code
to opencontainers/cgroups.
Users have had ample time to switch to opencontainers/cgroups
themselves, so we can finally remove this.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
There is a chance of panic here -- eliminate it.
Add a test case (which panics before the fix).
Reported-by: Luke Hinds <luke@stacklok.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Example code in README is outdated (especially since cgroups is moved to
a separate repository) and lacks proper import statements. And, since it
is not code, it is hard to keep it up to date.
Let's move it out to the example_test.go file and refer to it. Note we
still don't run it, but it will be compiled and linted in CI.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Fix *some* of the prealloc linter warnings. While it does not make sense
to address all warnings (or add prealloc to the list of linters we run
in CI), some do make sense.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Previously, when prepareCgroupFD would not open container's cgroup
(as configured in config.json and saved to state.json), it returned
a fatal error, as we presumed a container can't exist without its own
cgroup.
Apparently, it can. In a case when container is configured without
cgroupns (i.e. it uses hosts cgroups), and /sys/fs/cgroup is mounted
read-write, a rootful container's init can move itself to an entirely
different cgroup (even a new one that it just created), and then the
original container cgroup is removed by the kernel (or systemd?) as
it has no processes left. By the way, from the systemd point of view
the container is gone. And yet it is still there, and users want
runc exec to work!
And it worked, thanks to the "let's try container init's cgroup"
fallback as added by commit c91fe9aeba ("cgroup2: exec: join the
cgroup of the init process on EBUSY"). The fallback was added for
the entirely different reason, but it happened to work in this very
case, too.
This behavior was broken with the introduction of CLONE_INTO_CGROUP
support.
While it is debatable whether this is a valid scenario when a container
moves itself into a different cgroup, this very setup is used by e.g.
buildkitd running in a privileged kubernetes container (see issue 5089).
To restore the way things are expected to work, add the same "try
container init's cgroup" fallback into prepareCgroupFD.
While at it, simplify the code flow.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Refactor addIntoCgroupV2 in an attempt to simplify it.
2. Fix the bug of not trying the init cgroup fallback if
rootlessCgroup is set. This is a bug because rootlessCgroup
tells to ignore cgroup join errors, not to never try the fallback.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Separate initProcessCgroupPath code out of addIntoCgroupV2.
To be used by the next patch.
While at it, describe the new scenario in which the container's
configured cgroup might not be available.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When parsing mount options into recAttrSet and recAttrClr,
the code sets attr_clr to individual atime flags (e.g.
MOUNT_ATTR_NOATIME or MOUNT_ATTR_STRICTATIME) when clearing
atime attributes. However, this violates the kernel's
requirement documented in mount_setattr(2)[1]:
> Note that, since the access-time values are an enumeration
> rather than bit values, a caller wanting to transition to a
> different access-time setting cannot simply specify the
> access-time setting in attr_set, but must also include
> MOUNT_ATTR__ATIME in the attr_clr field. The kernel will
> verify that MOUNT_ATTR__ATIME isn't partially set in
> attr_clr (i.e., either all bits in the MOUNT_ATTR__ATIME
> bit field are either set or clear), and that attr_set
> doesn't have any access-time bits set if MOUNT_ATTR__ATIME
> isn't set in attr_clr.
Passing only a single atime flag (e.g. MOUNT_ATTR_RELATIME) in
attr_clr causes mount_setattr() to fail with EINVAL.
This change ensures that whenever an atime mode is updated,
attr_clr includes MOUNT_ATTR__ATIME to properly reset the
entire access-time attribute field before applying the new mode.
[1] https://man7.org/linux/man-pages/man2/mount_setattr.2.html
Signed-off-by: lifubang <lifubang@acmcoder.com>
Since [PR 4812], runc exec tries to use clone3 syscall with
CLONE_INTO_CGROUP, falling back to the old method if it is not
supported.
One issue with that approach is, a
> Cmd cannot be reused after calling its [Cmd.Start], [Cmd.Run],
> [Cmd.Output], or [Cmd.CombinedOutput] methods.
(from https://pkg.go.dev/os/exec#Cmd).
This is enforced since Go 1.26, see [CL 728642], and so runc exec
actually fails in specific scenarios (go1.26 and no CLONE_INTO_CGROUP
support).
The easiest workaround is to pre-copy the p.cmd structure (copy = *cmd).
From the [CL 734200] it looks like it is an acceptable way, but it might
break in the future as it also copies the private fields, so let's do a
proper field-by-field copy. If the upstream will add cmd.Clone method,
we will switch to it.
Also, we can probably be fine with a post-copy (once the first Start has
failed), but let's be conservative here and do a pre-copy.
[PR 4812]: https://github.com/opencontainers/runc/pull/4812
[CL 728642]: https://go.dev/cl/728642
[CL 734200]: https://go.dev/cl/734200
Reported-by: Efim Verzakov <efimverzakov@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Theoretically, exec.Command can set cmd.Err.
Practically, this should never happen (Linux, Go <= 1.26, exePath is
absolute), but in the unlikely case it does, let's fail early.
This is related to the cloneCmd (to be introduced by the following
commit) which chooses to not copy the Err field. Theoretically,
exec.Command can set Err and so the first call to cmd.Start will fail
(since Err != nil), and the second call to cmd.Start may succeed because
Err == nil. Yet, this scenario is highly unlikely, but better be safe
than sorry.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The Process type is quite big (currently 368 bytes on a 64 bit Linux)
and using non-pointer receivers in its methods results in copying which
is totally unnecessary.
Change the methods to use pointer receivers.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The Config type is quite big (currently 554 bytes on a 64 bit Linux)
and using non-pointer receivers in its methods results in copying which
is totally unnecessary.
Change the methods to use pointer receivers.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Rename a function parameter (containerId -> containerID) to avoid a
linter warning:
> var-naming: method parameter containerId should be containerID (revive)
In many other places, including config.json (.linux.uidMappings and
.gidMappings) it is already called containerID, so let's rename.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Alas, these new constants are already in v1.4.0 release so we can't
remove those right away, but we can mark them as deprecated now
and target removal for v1.5.0.
So,
- mark them as deprecated;
- redefine via unix.MPOL_* counterparts;
- fix the validator code to use unix.MPOL_* directly.
This amends commit a0e809a8.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This function calls Init what normally never returns, so the defer only
works if there is an error and we can safely use it to close those fds
we opened. This was done for most but not all fds.
Reported in issue 5008.
Reported-by: Arina Cherednik <arinacherednik034@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
While CreateInRoot supports hallucinating the target path, we do not use
it directly when constructing device inode targets because we need to
have different handling for mknod and bind-mounts.
The solution is to simply have a more generic MkdirAllParentInRoot
helper that MkdirAll's the parent directory of the target path and then
allows the caller to create the trailing component however they like.
(This can be used by CreateInRoot internally as well!)
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>