zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2026-04-25 08:56:43 +08:00

Author	SHA1	Message	Date
Kir Kolyshkin	0079bee17f	Support specs.LinuxSeccompFlagWaitKillableRecv This adds support for WaitKillableRecv seccomp flag (also known as SCMP_FLTATR_CTL_WAITKILL in libseccomp and as SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV in the kernel). This requires: - libseccomp >= 2.6.0 - libseccomp-golang >= 0.11.0 - linux kernel >= 5.19 Note that this flag does not make sense without NEW_LISTENER, and the kernel returns EINVAL when SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV is set but SECCOMP_FILTER_FLAG_NEW_LISTENER is not set. For runc this means that .linux.seccomp.listenerPath should also be set, and some of the seccomp rules should have SCMP_ACT_NOTIFY action. This is why the flag is tested separately in seccomp-notify.bats. At the moment the only adequate CI environment for this functionality is Fedora 43. On all other platforms (including CentOS 10 and Ubuntu 24.04) it is skipped similar to this: > ok 251 runc run [seccomp] (SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) # skip requires libseccomp >= 2.6.0 and API level >= 7 (current version: 2.5.6, API level: 6) Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-03-16 10:48:42 -07:00
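The flag dependency described in this commit can be sketched as a small pre-check. This is only an illustration: the two bit values match the kernel's `SECCOMP_FILTER_FLAG_NEW_LISTENER` and `SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV` definitions, but `checkFlags` and `errNeedsListener` are hypothetical names and are not taken from runc.

```go
package main

import (
	"errors"
	"fmt"
)

// Bit values as defined in the kernel's include/uapi/linux/seccomp.h.
const (
	flagNewListener      = 1 << 3 // SECCOMP_FILTER_FLAG_NEW_LISTENER
	flagWaitKillableRecv = 1 << 5 // SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV
)

// errNeedsListener is a hypothetical error used only in this sketch.
var errNeedsListener = errors.New("WAIT_KILLABLE_RECV requires NEW_LISTENER")

// checkFlags mirrors the kernel rule quoted in the commit message:
// WAIT_KILLABLE_RECV without NEW_LISTENER is rejected (EINVAL in the kernel).
func checkFlags(flags uint) error {
	if flags&flagWaitKillableRecv != 0 && flags&flagNewListener == 0 {
		return errNeedsListener
	}
	return nil
}

func main() {
	fmt.Println(checkFlags(flagWaitKillableRecv))
	fmt.Println(checkFlags(flagWaitKillableRecv | flagNewListener))
}
```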
Kir Kolyshkin	b2f8a74de5	all: format sources with gofumpt v0.9.1 Since gofumpt v0.9.0 there's a new formatting rule to "clothe" any naked returns. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-10-07 17:08:56 -07:00
jokemanfire	859feb8e44	build(seccomp): Add audit support for loong64 Co-authored-by: Rodrigo Campos <rata@users.noreply.github.com> Signed-off-by: jokemanfire <hu.dingyang@zte.com.cn>	2025-07-16 09:39:11 +08:00
Kir Kolyshkin	a31efe7045	libct/seccomp/patchbpf: use binary.NativeEndian It is available since Go 1.21 and is defined during compile time (i.e. based on GOARCH during build). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-09-11 22:06:58 -07:00
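A minimal sketch of what `binary.NativeEndian` provides, assuming Go 1.21 or later. The helper names `encodeNative` and `decodeNative` are made up for illustration and do not appear in patchbpf.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeNative stores v using the byte order of the platform the binary
// was built for; the order is fixed at compile time via GOARCH.
func encodeNative(v uint32) []byte {
	buf := make([]byte, 4)
	binary.NativeEndian.PutUint32(buf, v)
	return buf
}

// decodeNative reads a uint32 back using the same native byte order.
func decodeNative(b []byte) uint32 {
	return binary.NativeEndian.Uint32(b)
}

func main() {
	b := encodeNative(0x01020304)
	// The byte layout differs between little- and big-endian builds,
	// but the round trip always yields the original value.
	fmt.Printf("% x -> %#x\n", b, decodeNative(b))
}
```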
Akihiro Suda	e7848482e2	Revert "libcontainer: seccomp: pass around *os.File for notifyfd" This reverts commit `20b95f23ca`. > Conflicts: > libcontainer/init_linux.go Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2024-07-03 17:28:12 +09:00
Sebastiaan van Stijn	c14213399a	remove pre-go1.17 build-tags Removed pre-go1.17 build-tags with go fix; go fix -mod=readonly ./... Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2024-06-29 15:45:25 +02:00
Aleksa Sarai	ccc500c427	seccomp: patchbpf: always include native architecture in stub It turns out that on ppc64le (at least), Docker doesn't include any architectures in the list of allowed architectures. libseccomp interprets this as "just include the default architecture" but patchbpf would return a no-op ENOSYS stub, which would lead to the exact issues that commit `7a8d7162f9` ("seccomp: prepend -ENOSYS stub to all filters") fixed for other architectures. So, just always include the running architecture in the list. There's no real downside. Ref: https://bugzilla.suse.com/show_bug.cgi?id=1192051#c6 Reported-by: Fabian Vogt <fvogt@suse.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2024-03-29 12:01:47 +11:00
Aleksa Sarai	b288abeaa5	seccomp: patchbpf: rename nativeArch -> linuxAuditArch Calling the Linux AUDIT_* architecture constants "native" leads to confusing code when we are getting the actual native architecture of the running system. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2024-03-29 12:01:47 +11:00
lifubang	109dcadd9d	fix two typos Signed-off-by: lifubang <lifubang@acmcoder.com>	2023-10-03 20:08:17 +08:00
Aleksa Sarai	20b95f23ca	libcontainer: seccomp: pass around os.File for notifyfd os.File is correctly tracked by the garbage collector, and there's no need to use raw file descriptors for this code. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-08-15 19:54:24 -07:00
Kir Kolyshkin	076745a40f	runc features: add seccomp filter flags Amend runc features to print seccomp flags. Two sets of flags are added: * known flags are those that this version of runc is aware of; * supported flags are those that can be set; normally, this is the same set as known flags, but due to older version of kernel and/or libseccomp, some known flags might be unsupported. This commit also consolidates three different switch statements dealing with flags into one, in func setFlag. A note is added to this function telling what else to look for when adding new flags. Unfortunately, it also adds a list of known flags, that should be kept in sync with the switch statement. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-11-29 17:24:32 -08:00
Kir Kolyshkin	2cd05e44b6	libct/seccomp/patchbpf: rm duplicated code In findLastSyscalls, we convert libseccomp.ArchNative to the real libseccomp architecture, but archToNative already does that, so this code is redundant. Remove the redundant code, and move its comment to archToNative. Fixes: `7a8d7162f` Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-11-01 18:47:16 -07:00
Kir Kolyshkin	26dc55ef1a	seccomp: fix flag test to actually check the value Add a debug print of seccomp flags value, so the test can check those (without using something like strace, that is). Amend the flags setting test with the numeric values expected, and the logic to check those. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-08-30 19:16:08 -07:00
Kir Kolyshkin	c7dc8b1fed	libct/seccomp/patchbpf: support SPEC_ALLOW Commit `58ea21daef` added support for seccomp flags such as SPEC_ALLOW, but it does not work as expected, because since commit `7a8d7162f9` we do not use libseccomp-golang's Load(), but handle flags separately in patchbpf. This fixes setting SPEC_ALLOW flag. Add a comment to not forget to amend filterFlags when adding new flags. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-08-29 15:48:10 -07:00
Aleksa Sarai	6a79271c31	seccomp: patchbpf: minor cleanups Define sizeof(int) as a constant, and also return ENOSYS earlier in the filter if it doesn't increase the number of instructions we generate (this is a negligible performance improvement but it does make it easier to understand the generated filter stub). Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2022-05-23 16:36:07 +10:00
Aleksa Sarai	be6488a5a9	seccomp: enosys: always return -ENOSYS for setup(2) on s390(x) On s390x, syscalls above 255 are multiplexed using the (now otherwise unused) setup(2) syscall (syscall number 0). If the kernel supports the syscall then it will correctly translate the syscall number such that seccomp will correctly detect it -- however, for unknown syscalls the syscall number remains unchanged. This can be verified by running the following program under strace: int main(void) { scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_TRAP); seccomp_load(ctx); return syscall(439, AT_FDCWD, "asdf", X_OK, 0); } Which will then die with the following signal (on pre-5.8 kernels): --- SIGSYS {si_signo=SIGSYS, si_code=SYS_SECCOMP, si_call_addr=0x3ffb3006c22, si_syscall=__NR_setup, si_arch=AUDIT_ARCH_S390X} --- (Note that the si_syscall is __NR_setup, not __NR_faccessat2.) As a result, the -ENOSYS handling we had previously did not work completely correctly on s390x because any syscall not supported by the kernel would be treated as syscall number 0 rather than the actual syscall number. Always returning -ENOSYS will not cause any issues because in all of the cases where this multiplexing occurs, seccomp will see the remapped syscall number -- and no userspace program will call setup(2) intentionally (the syscall has not existed in Linux for decades and was originally a hack used early in Linux init prior to spawning pid1 -- so you will get -ENOSYS from the kernel anyway). Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2022-05-23 16:36:07 +10:00
Akihiro Suda	1d7b297128	libct/seccomp: add riscv64 Co-authored-by: Kir Kolyshkin <kolyshkin@gmail.com> Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-05-11 17:23:45 -07:00
Alban Crequy	2b025c0173	Implement Seccomp Notify This commit implements support for the SCMP_ACT_NOTIFY action. It requires libseccomp-2.5.0 to work but runc still works with older libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY action. A new synchronization step between runc[INIT] and runc run is introduced to pass the seccomp fd. runc run fetches the seccomp fd with pidfd_get from the runc[INIT] process and sends it to the seccomp agent using SCM_RIGHTS. As suggested by @kolyshkin, we also make writeSync() a wrapper of writeSyncWithFd() and wrap the error there. To avoid pointless errors, we made some existing code paths just return the error instead of re-wrapping it. If we don't do it, error will look like: writing syncT <act>: writing syncT: <err> By adjusting the code path, now they just look like this writing syncT <act>: <err> Signed-off-by: Alban Crequy <alban@kinvolk.io> Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>	2021-09-07 13:04:24 +02:00
Alban Crequy	c55530bedc	vendoring: Use libseccomp with notify support The notify support has been merged in libseccomp-golang in this PR: https://github.com/seccomp/libseccomp-golang/pull/59 Also, we update to new API of libseccomp-golang so code doesn't break. Signed-off-by: Alban Crequy <alban@kinvolk.io> Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>	2021-09-07 12:38:12 +02:00
Kir Kolyshkin	d8da00355e	*: add go-1.17+ go:build tags Go 1.17 introduces this new (and better) way to specify build tags. For more info, see https://golang.org/design/draft-gobuild. As a way to seamlessly switch from old to new build tags, gofmt (and gopls) from go 1.17 adds the new tags along with the old ones. Later, when go < 1.17 is no longer supported, the old build tags can be removed. Now, as I started to use latest gopls (v0.7.1), it adds these tags while I edit. Rather than to randomly add new build tags, I guess it is better to do it once for all files. Mind that previous commits removed some tags that were useless, so this one only touches packages that can at least be built on non-linux. Brought to you by go1.17 fmt ./... Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-08-30 20:58:22 -07:00
Kir Kolyshkin	9ff64c3d97	*: rm redundant linux build tag For files that end with _linux.go or _linux_test.go, there is no need to specify linux build tag, as it is assumed from the file name. In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go for the file name to make sense. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-08-30 20:15:00 -07:00
Kir Kolyshkin	a7cfb23b88	*: stop using pkg/errors Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-22 16:09:47 -07:00
Kir Kolyshkin	f6a0899b7f	*: use errors.As and errors.Is Do this for all errors except one from unix.*. This fixes a bunch of errorlint warnings, like these libcontainer/generic_error.go:25:15: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint) if le, ok := err.(Error); ok { ^ libcontainer/factory_linux_test.go:145:14: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint) lerr, ok := err.(Error) ^ libcontainer/state_linux_test.go:28:11: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint) _, ok := err.(*stateTransitionError) ^ libcontainer/seccomp/patchbpf/enosys_linux.go:88:4: switch on an error will fail on wrapped errors. Use errors.Is to check for specific errors (errorlint) switch err { ^ Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-22 16:09:47 -07:00
Kir Kolyshkin	e6048715e4	Use gofumpt to format code gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules. Brought to you by git ls-files \*.go \| grep -v ^vendor/ \| xargs gofumpt -s -w Looking at the diff, all these changes make sense. Also, replace gofmt with gofumpt in golangci.yml. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-01 12:17:27 -07:00
Giuseppe Scrivano	c61f606254	libcontainer: honor seccomp defaultErrnoRet https://github.com/opencontainers/runtime-spec/pull/1087 added support for defaultErrnoRet to the OCI runtime specs. If a defaultErrnoRet is specified, disable patching the generated libseccomp cBPF. Closes: https://github.com/opencontainers/runc/issues/2943 Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2021-05-17 09:23:32 +02:00
Danail Branekov	7b3e0bcf29	Ensure the scratch pipe is read during ExportBPF There is a potential deadlock where the ExportBPF method call writes to a pipe but the pipe is not read until after the method call returns. ExportBPF might fill the pipe buffer, in which case it will block waiting for a read on the other side which can't happen until the method returns. Here we concurrently read from the pipe into a buffer to ensure ExportBPF will always return. Co-authored-by: Kieron Browne <kbrowne@vmware.com> Co-authored-by: Danail Branekov <danailster@gmail.com> Signed-off-by: Kieron Browne <kbrowne@vmware.com> Signed-off-by: Danail Branekov <danailster@gmail.com>	2021-03-30 12:29:35 +03:00
Mauricio Vásquez	5c0342ba2c	libcontainer: fix bad conversion from syscall.Errno to error The correct way to do that conversion according to https://pkg.go.dev/syscall#Errno is: ``` err = nil if errno != 0 { err = errno } ``` In this case the error check will always report a false positive in unix.RawSyscall(unix.SYS_SECCOMP, ...), probably nobody has faced this problem because the code takes the other path in most of the cases. Fixes: `7a8d7162f9` ("seccomp: prepend -ENOSYS stub to all filters") Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>	2021-02-03 16:12:33 -05:00
Aleksa Sarai	7a8d7162f9	seccomp: prepend -ENOSYS stub to all filters Having -EPERM is the default was a fairly significant mistake from a future-proofing standpoint in that it makes any new syscall return a non-ignorable error (from glibc's point of view). We need to correct this now because faccessat2(2) is something glibc critically needs to have support for, but they're blocked on container runtimes because we return -EPERM unconditionally (leading to confusion in glibc). This is also a problem we're probably going to keep running into in the future. Unfortunately there are several issues which stop us from having a clean solution to this problem: 1. libseccomp has several limitations which require us to emulate behaviour we want: a. We cannot do logic based on syscall number, meaning we cannot specify a "largest known syscall number"; b. libseccomp doesn't know in which kernel version a syscall was added, and has no API for "minimum kernel version" so we cannot simply ask libseccomp to generate sane -ENOSYS rules for us. c. Additional seccomp rules for the same syscall are not treated as distinct rules -- if rules overlap, seccomp will merge them. This means we cannot add per-syscall -EPERM fallbacks; d. There is no inverse operation for SCMP_CMP_MASKED_EQ; e. libseccomp does not allow you to specify multiple rules for a single argument, making it impossible to invert OR rules for arguments. 2. The runtime-spec does not have any way of specifying: a. The errno for the default action; b. The minimum kernel version or "newest syscall at time of profile creation"; nor c. Which syscalls were intentionally excluded from the allow list (weird syscalls that are no longer used were excluded entirely, but Docker et al expect those syscalls to get EPERM not ENOSYS). 3. Certain syscalls should not return -ENOSYS (especially only for certain argument combinations) because this could also trigger glibc confusion. 
This means we have to return -EPERM for certain syscalls but not as a global default. 4. There is not an obvious (and reasonable) upper limit to syscall numbers, so we cannot create a set of rules for each syscall above the largest syscall number in libseccomp. This means we must handle inverse rules as described below. 5. Any syscall can be specified multiple times, which can make generation of hotfix rules much harder. As a result, we have to work around all of these things by coming up with a heuristic to stop the bleeding. In the future we could hopefully improve the situation in the runtime-spec and libseccomp. The solution applied here is to prepend a "stub" filter which returns -ENOSYS if the requested syscall has a larger syscall number than any syscall mentioned in the filter. The reason for this specific rule is that syscall numbers are (roughly) allocated sequentially and thus newer syscalls will (usually) have a larger syscall number -- thus causing our filters to produce -ENOSYS if the filter was written before the syscall existed. Sadly this is not a perfect solution because syscalls can be added out-of-order and the syscall table can contain holes for several releases. Unfortunately we do not have a nicer solution at the moment because there is no library which provides information about which Linux version a syscall was introduced in. Until that exists, this workaround will have to be good enough. The above behaviour only happens if the default action is a blocking action (in other words it is not SCMP_ACT_LOG or SCMP_ACT_ALLOW). If the default action is permissive then we don't do any patching. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2021-01-28 23:11:22 +11:00
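The heuristic described in this commit can be modelled in plain Go. The real stub is generated as cBPF and handles multiple architectures, so this is only a conceptual sketch; `largestSyscall`, `stubVerdict`, the `verdict` type, and the syscall numbers in `main` are all hypothetical.

```go
package main

import "fmt"

// verdict is what the modelled stub decides for one syscall.
type verdict string

const (
	verdictENOSYS verdict = "ENOSYS"     // stub returns -ENOSYS
	verdictFilter verdict = "run-filter" // fall through to the real filter
)

// largestSyscall returns the largest syscall number mentioned in the
// profile, or -1 if the profile mentions none.
func largestSyscall(nrs []int) int {
	largest := -1
	for _, nr := range nrs {
		if nr > largest {
			largest = nr
		}
	}
	return largest
}

// stubVerdict applies the rule from the commit message: a syscall number
// above everything the profile mentions is assumed to be newer than the
// profile and gets -ENOSYS; anything else is left to the filter.
func stubVerdict(nr, largest int) verdict {
	if nr > largest {
		return verdictENOSYS
	}
	return verdictFilter
}

func main() {
	profile := []int{0, 1, 2, 257, 334} // hypothetical profile contents
	largest := largestSyscall(profile)
	for _, nr := range []int{257, 439} {
		fmt.Printf("syscall %d: %s\n", nr, stubVerdict(nr, largest))
	}
}
```

As the commit message notes, the model is imperfect: a syscall added out of order, with a number below the largest one in the profile, still reaches the filter and gets the profile's default action.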

28 Commits