zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2026-04-24 16:39:52 +08:00

Author	SHA1	Message	Date
Kir Kolyshkin	28cb321887	Pre-open container root directory A lot of filesystem-related stuff happens inside the container root directory, and we have used its name before. It makes sense to pre-open it and use a os.File handle instead. Function names in internal/pathrs are kept as is for simplicity (and it is an internal package), but they now accept root as os.File. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-03-29 12:02:36 -07:00
Curd Becker	536e183451	Replace os.Is* error checking functions with their errors.Is counterpart Signed-off-by: Curd Becker <me@curd-becker.de>	2025-12-11 03:16:02 +01:00
Akihiro Suda	64c3c8eea6	Merge pull request #4994 from kolyshkin/gofumpt-extra Enable gofumpt extra rules	2025-11-28 09:30:57 +09:00
Aleksa Sarai	9dbd37e06f	libct: switch final WithProcfd users to WithProcfdFile This probably should've been done as part of commit `d40b3439a9` ("rootfs: switch to fd-based handling of mountpoint targets") but it seems I missed them when doing the rest of the conversions. This also lets us remove utils.WithProcfd entirely, as well as pathrs.MkdirAllInRoot. Unfortunately, WithProcfd was exposed in the externally-importable "libcontainer/utils" package and so we need to have a deprecation notice to remove it in runc 1.5. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2025-11-26 21:03:30 +11:00
Kir Kolyshkin	67840cce4b	Enable gofumpt extra rules Commit `b2f8a74d` "clothed" the naked return as inflicted by gofumpt v0.9.0. Since gofumpt v0.9.2 this rule was moved to "extra" category, not enabled by default. The only other "extra" rule is to group adjacent parameters with the same type, which also makes sense. Enable gofumpt "extra" rules, and reformat the code accordingly. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-11-10 13:18:45 -08:00
Aleksa Sarai	d40b3439a9	rootfs: switch to fd-based handling of mountpoint targets An attacker could race with us during mount configuration in order to trick us into mounting over an unexpected path. This would bypass checkProcMount() and would allow for security profiles to be left unapplied by mounting over /proc/self/attr/... (or even more serious outcomes such as killing the entire system by tricking runc into writing strings to /proc/sysrq-trigger). This is a larger issue with our current mount infrastructure, and the ideal solution would be to rewrite it all to be fd-based (which would also allow us to support the "new" mount API, which also avoids a bunch of other issues with mount(8)). However, such a rewrite is not really workable as a security fix, so this patch is a bit of a compromise approach to fix the issue while also moving us a bit towards that eventual end-goal. The core issue in CVE-2025-52881 is that we currently use the (insecure) SecureJoin to re-resolve mountpoint target paths multiple times during mounting. Rather than generating a string from createMountpoint(), we instead open an os.File handle to the target mountpoint directly and then operate on that handle. This will make it easier to remove utils.WithProcfd() and rework mountViaFds() in the future. The only real issue we need to work around is that we need to re-open the mount target after doing the mount in order to get a handle to the mountpoint -- pathrs.Reopen() doesn't work in this case (it just re-opens the inode under the mountpoint) so we need to do a naive re-open using the full path. Note that if we used move_mount(2) this wouldn't be a problem because we would have a handle to the mountpoint itself. Note that this is still somewhat of a temporary solution -- ideally mountViaFds would use os.File directly to let us avoid some other issues with using bare /proc/... paths, as well as also letting us more easily use the new mount API on modern kernels. 
Fixes: GHSA-cgrx-mc8f-2prm CVE-2025-52881 Co-developed-by: lifubang <lifubang@acmcoder.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2025-11-01 21:24:06 +11:00
Aleksa Sarai	aee7d3fe35	ci: add lint to forbid the usage of os.Create os.Create is shorthand for open(O_CREAT|O_TRUNC) without O_EXCL, which is incredibly unsafe for us to do when interacting with a container rootfs (especially before pivot_root) as an attacker could swap the target path with a symlink that points to the host filesystem, causing us to delete the contents of or create host files. We did have a similar bug in CVE-2024-45310, but in that case we (luckily) didn't have O_TRUNC set which avoided the worst possible case. However, os.Create does set O_TRUNC and we were using it in scenarios that may have been exploitable. Because of how easy it is for us to accidentally introduce this kind of bug, we should simply not allow the usage of os.Create in our entire codebase. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2025-11-01 21:24:04 +11:00
Osama Abdelkader	1adb070b58	criu: replace deprecated strings.Title strings.Title is deprecated since Go 1.18. Replace it with a simple manual capitalization of the first character in criuNsToKey(). Signed-off-by: Osama Abdelkader <osama.abdelkader@gmail.com> Co-authored-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-10-12 14:02:21 +03:00
Kir Kolyshkin	ce3cd4234c	criu: simplify isOnTmpfs check in prepareCriuRestoreMounts Instead of generating a list of tmpfs mounts and having a special function to check whether the path is in the list, let's go over the list of mounts directly. This simplifies the code and improves readability. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-05-20 16:56:55 -07:00
Kir Kolyshkin	f91fbd34d9	criu: inline makeCriuRestoreMountpoints Since its code is now trivial, and it is only called from a single place, it does not make sense to have it as a separate function. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-05-20 16:56:55 -07:00
Kir Kolyshkin	b8aa5481db	criu: ignore cgroup early in prepareCriuRestoreMounts It makes sense to ignore cgroup mounts much earlier in the code, saving some time on unnecessary operations. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-05-20 16:56:55 -07:00
Kir Kolyshkin	0c93d41c65	criu: improve prepareCriuRestoreMounts 1. Replace the big "if !" block with the if block and continue, simplifying the code flow. 2. Move comments closer to the code, improving readability. This commit is best reviewed with --ignore-all-space or similar. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-05-20 16:56:55 -07:00
Rodrigo Campos	c3a41d77db	Merge pull request #4696 from avagin/criu-vs-exec criu: Add time namespace to container config after checkpoint/restore	2025-04-01 14:54:33 -03:00
Kir Kolyshkin	17570625c0	Use for range over integers This appears in Go 1.22 (see https://tip.golang.org/ref/spec#For_range). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-03-31 17:15:06 -07:00
Andrei Vagin	b68cbdff34	criu: Add time namespace to container config after checkpoint/restore Since v3.14, CRIU always restores processes into a time namespace to prevent backward jumps of monotonic and boottime clocks. This change updates the container configuration to ensure that `runc exec` launches new processes within the container's time namespace. Fixes #2610 Signed-off-by: Andrei Vagin <avagin@gmail.com>	2025-03-26 15:12:01 +00:00
Kir Kolyshkin	9510ffb658	Fix a few staticcheck QF1001 warnings Like these: > libcontainer/criu_linux.go:959:3: QF1001: could apply De Morgan's law (staticcheck) > !(req.GetType() == criurpc.CriuReqType_FEATURE_CHECK || > ^ > libcontainer/rootfs_linux.go:360:19: QF1001: could apply De Morgan's law (staticcheck) > if err == nil || !(errors.Is(err, unix.EPERM) || errors.Is(err, unix.EBUSY)) { > ^ Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-03-25 16:06:44 -07:00
Prajwal S N	05e83fc600	deps: bump go-criu to v7 Signed-off-by: Prajwal S N <prajwalnadig21@gmail.com>	2025-03-05 01:02:53 +05:30
Kir Kolyshkin	a75076b4a4	Switch to opencontainers/cgroups This removes libcontainer/cgroups packages and starts using those from github.com/opencontainers/cgroups repo. Mostly generated by: git rm -f libcontainer/cgroups find . -type f -name "*.go" -exec sed -i \ 's|github.com/opencontainers/runc/libcontainer/cgroups|github.com/opencontainers/cgroups|g' \ {} + go get github.com/opencontainers/cgroups@v0.0.1 make vendor gofumpt -w . Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-02-28 15:20:33 -08:00
Daniel Levi-Minzi	1d047e44ed	expose criu options for link remap and skip in flight Signed-off-by: Daniel Levi-Minzi <dleviminzi@gmail.com>	2025-02-25 10:35:31 -05:00
Kir Kolyshkin	8afeb58398	libct: add/use configs.HasHook This allows omitting a call to c.currentOCIState (which can be somewhat costly when there are many annotations) when the hooks of a given kind won't be run. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-22 17:47:09 -08:00
Kir Kolyshkin	47dc185880	Add runc_nocriu build tag This allows to make a 17% smaller runc binary by not compiling in checkpoint/restore support. It turns out that google.golang.org/protobuf package, used by go-criu, is quite big, and go linker can't drop unused stuff if reflection is used anywhere in the code. Currently there's no alternative to using protobuf in go-criu, and since not all users use c/r, let's provide them an option for a smaller binary. For the reference, here's top10 biggest vendored packages, as reported by gsa[1]: $ gsa runc \| grep vendor \| head │ 8.59% │ google.golang.org/protobuf │ 1.3 MB │ vendor │ │ 5.76% │ github.com/opencontainers/runc │ 865 kB │ vendor │ │ 4.05% │ github.com/cilium/ebpf │ 608 kB │ vendor │ │ 2.86% │ github.com/godbus/dbus/v5 │ 429 kB │ vendor │ │ 1.25% │ github.com/urfave/cli │ 188 kB │ vendor │ │ 0.90% │ github.com/vishvananda/netlink │ 135 kB │ vendor │ │ 0.59% │ github.com/sirupsen/logrus │ 89 kB │ vendor │ │ 0.56% │ github.com/checkpoint-restore/go-criu/v6 │ 84 kB │ vendor │ │ 0.51% │ golang.org/x/sys │ 76 kB │ vendor │ │ 0.47% │ github.com/seccomp/libseccomp-golang │ 71 kB │ vendor │ And here is a total binary size saving when `runc_nocriu` is used. For non-stripped binaries: $ gsa runc-cr runc-nocr \| tail -3 │ -17.04% │ runc-cr │ 15 MB │ 12 MB │ -2.6 MB │ │ │ runc-nocr │ │ │ │ └─────────┴──────────────────────────────────────────┴──────────┴──────────┴─────────┘ And for stripped binaries: │ -17.01% │ runc-cr-stripped │ 11 MB │ 8.8 MB │ -1.8 MB │ │ │ runc-nocr-stripped │ │ │ │ └─────────┴──────────────────────────────────────────┴──────────┴──────────┴─────────┘ [1]: https://github.com/Zxilly/go-size-analyzer Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-09 11:19:23 -08:00
Kir Kolyshkin	c487840f75	Remove main package dependency on criurpc Commit `7f64fb47` made the main package, and runc/libcontainer's CriuOpts depend on criu/rpc. This is not good; among other things, it makes it complicated to make c/r optional. Let's switch CriuOpts.ManageCgroupsMode to a string (yes, it's an API-breaking change) and move the cgroup mode string parsing to libcontainer. While at it, let's better document ManageCgroupsMode. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-09 11:19:23 -08:00
yangzhao.hjh	324fcea4ec	Terminate execution for criu that does not meet version requirements Signed-off-by: yangzhao.hjh <yangzhao.hjh@alibaba-inc.com>	2024-10-11 09:56:07 +08:00
Aleksa Sarai	1410a6988d	rootfs: consolidate mountpoint creation logic The logic for how we create mountpoints is spread over each mountpoint preparation function, when in reality the behaviour is pretty uniform with only a handful of exceptions. So just move it all to one function that is easier to understand. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2024-07-25 14:16:05 +10:00
Kir Kolyshkin	e676dac523	libct/criu: simplify checkCriuFeatures Since criu 2.12, rpcOpts is not needed when checking criu features. As we require criu >= 3.0 in Checkpoint, we can remove rpcOpts. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-05-09 15:18:29 -07:00
Kir Kolyshkin	f6a8c9b816	libct: checkCriuFeatures: return underlying error Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-05-09 15:18:29 -07:00
Akihiro Suda	3f4a73d632	TestCheckpoint: skip on ErrCriuMissingFeatures Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2023-12-19 18:47:03 +09:00
Aleksa Sarai	cdff09ab87	rootfs: fix 'can we mount on top of /proc' check Our previous test for whether we can mount on top of /proc incorrectly assumed that it would only be called with bind-mount sources. This meant that having a non bind-mount entry for a pseudo-filesystem (like overlayfs) with a dummy source set to /proc on the host would let you bypass the check, which could easily lead to security issues. In addition, the check should be applied more uniformly to all mount types, so fix that as well. And add some tests for some of the tricky cases to make sure we protect against them properly. Fixes: `331692baa7` ("Only allow proc mount if it is procfs") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:42 +11:00
Aleksa Sarai	8e8b136c49	tree-wide: use /proc/thread-self for thread-local state With the idmap work, we will have a tainted Go thread in our thread-group that has a different mount namespace to the other threads. It seems that (due to some bad luck) the Go scheduler tends to make this thread the thread-group leader in our tests, which results in very baffling failures where /proc/self/mountinfo produces gibberish results. In order to avoid this, switch to using /proc/thread-self for everything that is thread-local. This primarily includes switching all file descriptor paths (CLONE_FS), all of the places that check the current cgroup (technically we never will run a single runc thread in a separate cgroup, but better to be safe than sorry), and the aforementioned mountinfo code. We don't need to do anything for the following because the results we need aren't thread-local: * Checks that certain namespaces are supported by stat(2)ing /proc/self/ns/... * /proc/self/exe and /proc/self/cmdline are not thread-local. * While threads can be in different cgroups, we do not do this for the runc binary (or libcontainer) and thus we do not need to switch to the thread-local version of /proc/self/cgroups. * All of the CLONE_NEWUSER files are not thread-local because you cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER) is blocked for multi-threaded programs). Note that we have to use runtime.LockOSThread when we have an open handle to a tid-specific procfs file that we are operating on multiple times. Go can reschedule us such that we are running on a different thread and then kill the original thread (causing -ENOENT or similarly confusing errors). This is not strictly necessary for most usages of /proc/thread-self (such as using /proc/thread-self/fd/$n directly) since only operating on the actual inodes associated with the tid requires this locking, but because of the pre-3.17 fallback for CentOS, we have to do this in most cases. 
In addition, CentOS's kernel is too old for /proc/thread-self, which requires us to emulate it -- however in rootfs_linux.go, we are in the container pid namespace but /proc is the host's procfs. This leads to the incredibly frustrating situation where there is no way (on pre-4.1 Linux) to figure out which /proc/self/task/... entry refers to the current tid. We can just use /proc/self in this case. Yes this is all pretty ugly. I also wish it wasn't necessary. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:41 +11:00
Aleksa Sarai	ba0b5e2698	libcontainer: remove all mount logic from nsexec With open_tree(OPEN_TREE_CLONE), it is possible to implement both the id-mapped mounts and bind-mount source file descriptor logic entirely in Go without requiring any complicated handling from nsexec. However, implementing it the naive way (do the OPEN_TREE_CLONE in the host namespace before the rootfs is set up -- which is what the existing implementation did) exposes issues in how mount ordering (in particular when handling mount sources from inside the container rootfs, but also in relation to mount propagation) was handled for idmapped mounts and bind-mount sources. In order to solve this problem completely, it is necessary to spawn a thread which joins the container mount namespace and provides mountfds when requested by the rootfs setup code (ensuring that the mount order and mount propagation of the source of the bind-mount are handled correctly). While the need to join the mount namespace leads to other complications (such as with the usage of /proc/self -- fixed in a later patch) the resulting code is still reasonable and is the only real way to solve the issue. This allows us to reduce the amount of C code we have in nsexec, as well as simplifying a whole host of places that were made more complicated with the addition of id-mapped mounts and the bind sourcefd logic. Because we join the container namespace, we can continue to use regular O_PATH file descriptors for non-id-mapped bind-mount sources (which means we don't have to raise the kernel requirement for that case). In addition, we can easily add support for id-mappings that don't match the container's user namespace. The approach taken here is to use Go's officially supported mechanism for spawning a process in a user namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a different process. 
The most efficient way to implement this would be to do clone() in cgo directly to run a function that just does kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out this approach is too slow. It should be noted that the included micro-benchmark seems to indicate this is Fast Enough(TM): goos: linux goarch: amd64 pkg: github.com/opencontainers/runc/libcontainer/userns cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz BenchmarkSpawnProc BenchmarkSpawnProc-8 1670 770065 ns/op Fixes: `fda12ab101` ("Support idmap mounts on volumes") Fixes: `9c444070ec` ("Open bind mount sources from the host userns") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:40 +11:00
Kir Kolyshkin	efbebb39b5	libct: rename root to stateDir in struct Container The name "root" (or "containerRoot") is confusing; one might think it is the root of container's file system (the directory we chroot into). Rename to stateDir for clarity. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-10-04 14:57:10 +11:00
Akihiro Suda	0a5cd69462	Merge pull request #3995 from kolyshkin/rm-unix-nolint bump golangci-lint; remove nolint annotations for unix errno comparisons	2023-08-25 16:54:57 +09:00
Kir Kolyshkin	6a4870e4ac	libct: better errors for hooks When a hook has failed, the error message looks like this: > error running hook: error running hook #1: exit status 1, stdout: ... The two problems here are: 1. it is impossible to know what kind of hook it was; 2. "error running hook" stuttering; Change that to > error running createContainer hook #1: exit status 1, stdout: ... Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-24 19:44:05 -07:00
Kir Kolyshkin	f62f0bdfbf	Remove nolint annotations for unix errno comparisons golangci-lint v1.54.2 comes with errorlint v1.4.4, which contains the fix [1] whitelisting all errno comparisons for errors coming from x/sys/unix. Thus, these annotations are no longer necessary. Hooray! [1] https://github.com/polyfloyd/go-errorlint/pull/47 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-24 17:28:10 -07:00
Aleksa Sarai	f81ef1493d	libcontainer: sync: cleanup synchronisation code This includes quite a few cleanups and improvements to the way we do synchronisation. The core behaviour is unchanged, but switching to embedding json.RawMessage into the synchronisation structure will allow us to do more complicated synchronisation operations in future patches. The file descriptor passing through the synchronisation system feature will be used as part of the idmapped-mount and bind-mount-source features when switching that code to use the new mount API outside of nsexec.c. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-15 19:54:24 -07:00
Kir Kolyshkin	38676931ed	criu: do not add log file into error message As we now log the log file name in logCriuErrors. While at it, there is no need to use var.String() with %s as it is done by the runtime. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-03 10:39:33 -07:00
Kir Kolyshkin	c77aaa3f95	criu checkpoint/restore: print errors from criu log When criu fails, it does not give us much context to understand what was the cause of an error -- for that, we need to take a look into its log file. This is somewhat complicated to do (as you can see in parts of checkpoint.bats removed by this commit), and not very user-friendly. Add a function to find and log errors from criu logs, together with some preceding context, in case either checkpoint or restore has failed. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-03 10:33:20 -07:00
Kir Kolyshkin	e4478e9fff	criuSwrk: simplify switch 1. Use "switch t" since we only check t. 2. Remove unneeded t assignment. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-03 10:33:20 -07:00
Kir Kolyshkin	cb981e510b	libct: move criu-related stuff to separate file No code change, only added periods to some comments to make godot happy. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-03 10:16:01 -07:00

39 Commits