This adds support for the WaitKillableRecv seccomp flag
(also known as SCMP_FLTATR_CTL_WAITKILL in libseccomp and
as SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV in the kernel).
This requires:
- libseccomp >= 2.6.0
- libseccomp-golang >= 0.11.0
- linux kernel >= 5.19
Note that this flag does not make sense without NEW_LISTENER, and
the kernel returns EINVAL when SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV
is set but SECCOMP_FILTER_FLAG_NEW_LISTENER is not set.
For runc this means that .linux.seccomp.listenerPath should also be set,
and some of the seccomp rules should have SCMP_ACT_NOTIFY action. This
is why the flag is tested separately in seccomp-notify.bats.
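The dependency between the two kernel flags can be sketched in Go as follows. The bit values are the ones defined in the kernel's linux/seccomp.h; the checkFlags helper and errInvalid are hypothetical names used only for illustration, and the sketch merely mimics the EINVAL behavior described above.

```go
package main

import (
	"errors"
	"fmt"
)

// Kernel flag bits, as defined in <linux/seccomp.h>.
const (
	flagNewListener      = 1 << 3 // SECCOMP_FILTER_FLAG_NEW_LISTENER
	flagWaitKillableRecv = 1 << 5 // SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV
)

// errInvalid stands in for the EINVAL returned by the kernel.
var errInvalid = errors.New("invalid argument")

// checkFlags is a hypothetical helper (not runc code): it rejects
// WAIT_KILLABLE_RECV when NEW_LISTENER is not set as well.
func checkFlags(flags uint) error {
	if flags&flagWaitKillableRecv != 0 && flags&flagNewListener == 0 {
		return errInvalid
	}
	return nil
}

func main() {
	fmt.Println(checkFlags(flagWaitKillableRecv))
	fmt.Println(checkFlags(flagWaitKillableRecv | flagNewListener))
}
```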
At the moment the only adequate CI environment for this functionality is
Fedora 43. On all other platforms (including CentOS 10 and Ubuntu 24.04)
the test is skipped with a message similar to this:
> ok 251 runc run [seccomp] (SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) # skip requires libseccomp >= 2.6.0 and API level >= 7 (current version: 2.5.6, API level: 6)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit b2f8a74d "clothed" the naked return as inflicted by gofumpt
v0.9.0. Since gofumpt v0.9.2 this rule was moved to "extra" category,
not enabled by default. The only other "extra" rule is to group adjacent
parameters with the same type, which also makes sense.
Enable gofumpt "extra" rules, and reformat the code accordingly.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The commit mentioned below missed these changes.
Fixes: 17570625 ("Use for range over integers")
Signed-off-by: Ariel Otilibili <otilibil@eurecom.fr>
It is available since Go 1.21 and is defined during compile time
(i.e. based on GOARCH during build).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
It turns out that on ppc64le (at least), Docker doesn't include any
architectures in the list of allowed architectures. libseccomp
interprets this as "just include the default architecture" but patchbpf
would return a no-op ENOSYS stub, which would lead to the exact issues
that commit 7a8d7162f9 ("seccomp: prepend -ENOSYS stub to all
filters") fixed for other architectures.
So, just always include the running architecture in the list. There's
no real downside.
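A minimal sketch of "always include the running architecture", assuming architectures are plain strings. The withNativeArch helper is hypothetical, and runtime.GOARCH is only a stand-in for the libseccomp architecture name that the real code would use.

```go
package main

import (
	"fmt"
	"runtime"
	"slices"
)

// withNativeArch returns archs with native appended if it is missing.
// The input slice is never modified.
func withNativeArch(archs []string, native string) []string {
	if slices.Contains(archs, native) {
		return archs
	}
	return append(slices.Clone(archs), native)
}

func main() {
	// An empty list (as seen with Docker on ppc64le) still ends up
	// containing the running architecture.
	fmt.Println(withNativeArch(nil, runtime.GOARCH))
}
```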
Ref: https://bugzilla.suse.com/show_bug.cgi?id=1192051#c6
Reported-by: Fabian Vogt <fvogt@suse.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Calling the Linux AUDIT_* architecture constants "native" leads to
confusing code when we are getting the actual native architecture of the
running system.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
*os.File is correctly tracked by the garbage collector, and there's no
need to use raw file descriptors for this code.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This test (initially added by commit 58ea21daef and later amended in
commit 26dc55ef1a) currently has two major deficiencies:
1. All possible flag combinations, and their respective numeric values,
have to be explicitly listed. Currently we support 3 flags, so
there are only 2^3 - 1 = 7 combinations, but adding more flags will
become increasingly difficult (for example, 5 flags will result in
31 combinations).
2. The test requires kernel 4.17 (for SECCOMP_FILTER_FLAG_SPEC_ALLOW),
and does not run any checks on an older kernel. This, too,
will make it more difficult to add extra flags in the future.
Both issues can be solved by using runc features, which now prints all
known and supported runc flags. We still have to hardcode the numeric
values of all flags, but most of the other work is automated now.
In particular:
* The test only uses supported flags, meaning it can be used with
older kernels, removing the limitation (2) above.
* The test calculates the powerset (all possible combinations) of
flags and their numeric values. This makes it easier to add more
flags, removing the limitation (1) above.
* The test will fail (in flags_value) if any new flags are added
to runc but the test itself is not amended.
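The powerset calculation described above can be sketched as follows. The actual test is a shell (bats) script; this Go version is only an illustration, and the flag, combo, and combinations names are hypothetical. The numeric values are the kernel's bits for the three flags.

```go
package main

import (
	"fmt"
	"strings"
)

// flag pairs a seccomp flag name with its kernel bit value.
type flag struct {
	name  string
	value uint
}

// combo is one non-empty subset of flags and the OR of its values.
type combo struct {
	names string
	value uint
}

// combinations returns every non-empty subset of flags (2^n - 1 entries).
func combinations(flags []flag) []combo {
	var out []combo
	for mask := 1; mask < 1<<len(flags); mask++ {
		var names []string
		var value uint
		for i, f := range flags {
			if mask&(1<<i) != 0 {
				names = append(names, f.name)
				value |= f.value
			}
		}
		out = append(out, combo{strings.Join(names, ","), value})
	}
	return out
}

func main() {
	flags := []flag{
		{"SECCOMP_FILTER_FLAG_TSYNC", 1},
		{"SECCOMP_FILTER_FLAG_LOG", 2},
		{"SECCOMP_FILTER_FLAG_SPEC_ALLOW", 4},
	}
	for _, c := range combinations(flags) {
		fmt.Println(c.value, c.names)
	}
}
```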
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Amend runc features to print seccomp flags. Two sets of flags are added:
* known flags are those that this version of runc is aware of;
* supported flags are those that can be set; normally, this is the same
set as known flags, but due to an older version of the kernel and/or
libseccomp, some known flags might be unsupported.
This commit also consolidates three different switch statements dealing
with flags into one, in func setFlag. A note is added to this function
telling what else to look for when adding new flags.
Unfortunately, it also adds a list of known flags, which should be
kept in sync with the switch statement.
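A rough sketch of the known/supported split, not the actual runc implementation. The names knownFlags, flagBit, and supportedFlags are hypothetical, and the probe callback is an assumed stand-in for whatever kernel or libseccomp check decides that a flag can be set.

```go
package main

import "fmt"

// knownFlags must be kept in sync with the switch in flagBit.
var knownFlags = []string{
	"SECCOMP_FILTER_FLAG_TSYNC",
	"SECCOMP_FILTER_FLAG_LOG",
	"SECCOMP_FILTER_FLAG_SPEC_ALLOW",
}

// flagBit is the single place that maps a flag name to its kernel bit.
func flagBit(name string) (uint, error) {
	switch name {
	case "SECCOMP_FILTER_FLAG_TSYNC":
		return 1 << 0, nil
	case "SECCOMP_FILTER_FLAG_LOG":
		return 1 << 1, nil
	case "SECCOMP_FILTER_FLAG_SPEC_ALLOW":
		return 1 << 2, nil
	default:
		return 0, fmt.Errorf("unknown seccomp flag %q", name)
	}
}

// supportedFlags returns the known flags accepted by probe, which is a
// hypothetical stand-in for a real kernel/libseccomp capability check.
func supportedFlags(probe func(bit uint) bool) []string {
	var out []string
	for _, name := range knownFlags {
		bit, err := flagBit(name)
		if err == nil && probe(bit) {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	// Pretend SPEC_ALLOW is unavailable, as on a pre-4.17 kernel.
	fmt.Println(supportedFlags(func(bit uint) bool { return bit != 1<<2 }))
}
```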
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In findLastSyscalls, we convert libseccomp.ArchNative to the real
libseccomp architecture, but archToNative already does that, so
this code is redundant.
Remove the redundant code, and move its comment to archToNative.
Fixes: 7a8d7162f
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This makes libseccomp produce a BPF program which uses a binary tree for
syscalls (instead of a linear set of if statements).
It does not make sense to enable the binary tree for a small set of rules,
so don't do that if we have fewer than 8 syscalls (the number is chosen
arbitrarily).
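A sketch of the threshold logic, under the assumption that the filter exposes a SetOptimize(level int) method and that level 2 selects the binary tree (as the libseccomp optimization attribute does). The optimizer interface, fakeFilter type, and maybeOptimize helper are hypothetical and exist only so the example runs without cgo.

```go
package main

import "fmt"

// optimizer is a hypothetical interface modelling a filter whose
// optimization level can be changed.
type optimizer interface {
	SetOptimize(level int) error
}

// minSyscallsForBinaryTree is the arbitrary threshold from the text above.
const minSyscallsForBinaryTree = 8

// maybeOptimize enables the binary tree only for larger rule sets.
func maybeOptimize(f optimizer, nSyscalls int) error {
	if nSyscalls < minSyscallsForBinaryTree {
		return nil
	}
	return f.SetOptimize(2) // assumed: 2 means "binary tree"
}

// fakeFilter records the requested level; it stands in for a real filter.
type fakeFilter struct{ level int }

func (f *fakeFilter) SetOptimize(level int) error {
	f.level = level
	return nil
}

func main() {
	f := &fakeFilter{level: 1}
	_ = maybeOptimize(f, 3)
	fmt.Println(f.level)
	_ = maybeOptimize(f, 300)
	fmt.Println(f.level)
}
```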
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Add a debug print of seccomp flags value, so the test can check
those (without using something like strace, that is).
Amend the flags setting test with the numeric values expected, and the
logic to check those.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 58ea21daef added support for seccomp flags such as
SPEC_ALLOW, but it does not work as expected, because since commit
7a8d7162f9 we do not use libseccomp-golang's Load(), but
handle flags separately in patchbpf.
This fixes setting the SPEC_ALLOW flag.
Add a comment to not forget to amend filterFlags when adding new flags.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Define sizeof(int) as a constant, and also return ENOSYS earlier in the
filter if it doesn't increase the number of instructions we generate
(this is a negligible performance improvement but it does make it easier
to understand the generated filter stub).
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
On s390x, syscalls above 255 are multiplexed using the (now otherwise
unused) setup(2) syscall (syscall number 0). If the kernel supports the
syscall then it will correctly translate the syscall number such that
seccomp will correctly detect it -- however, for unknown syscalls the
syscall number remains unchanged. This can be verified by running the
following program under strace:
    #include <fcntl.h>
    #include <seccomp.h>
    #include <unistd.h>

    int main(void)
    {
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_TRAP);
        seccomp_load(ctx);
        return syscall(439, AT_FDCWD, "asdf", X_OK, 0);
    }
Which will then die with the following signal (on pre-5.8 kernels):
--- SIGSYS {si_signo=SIGSYS, si_code=SYS_SECCOMP,
si_call_addr=0x3ffb3006c22, si_syscall=__NR_setup,
si_arch=AUDIT_ARCH_S390X} ---
(Note that the si_syscall is __NR_setup, not __NR_faccessat2.)
As a result, the -ENOSYS handling we had previously did not work
completely correctly on s390x because any syscall not supported by the
kernel would be treated as syscall number 0 rather than the actual
syscall number.
Always returning -ENOSYS will not cause any issues because in all of the
cases where this multiplexing occurs, seccomp will see the remapped
syscall number -- and no userspace program will call setup(2)
intentionally (the syscall has not existed in Linux for decades and was
originally a hack used early in Linux init prior to spawning pid1 -- so
you will get -ENOSYS from the kernel anyway).
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Two new seccomp actions have been added to the libseccomp-golang
dependency, which can now be supported by runc, too.
ActKillThread kills the thread that violated the rule. It is the same as
ActKill. All other threads from the same thread group will continue to
execute.
ActKillProcess kills the process that violated the rule. All threads in
the thread group are also terminated. This action is only usable when
libseccomp API level 3 or higher is supported.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
These are just boilerplate and are only really useful for the two
actions which require us to set a default errno/aux value (ActErrno and
ActTrace).
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This commit implements support for the SCMP_ACT_NOTIFY action. It
requires libseccomp-2.5.0 to work but runc still works with older
libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY
action.
A new synchronization step between runc[INIT] and runc run is introduced
to pass the seccomp fd. runc run fetches the seccomp fd with pidfd_getfd(2)
from the runc[INIT] process and sends it to the seccomp agent using
SCM_RIGHTS.
As suggested by @kolyshkin, we also make writeSync() a wrapper of
writeSyncWithFd() and wrap the error there. To avoid pointless errors,
we made some existing code paths just return the error instead of
re-wrapping it. Otherwise, the error would look like:
writing syncT <act>: writing syncT: <err>
With the adjusted code paths, errors now look like this:
writing syncT <act>: <err>
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
Go 1.17 introduces a new (and better) way to specify build tags.
For more info, see https://golang.org/design/draft-gobuild.
As a way to seamlessly switch from old to new build tags, gofmt (and
gopls) from Go 1.17 adds the new tags along with the old ones.
Later, when Go < 1.17 is no longer supported, the old build tags
can be removed.
Now, as I started to use the latest gopls (v0.7.1), it adds these tags
while I edit. Rather than adding new build tags at random, I guess
it is better to do it once for all files.
Mind that previous commits removed some tags that were useless,
so this one only touches packages that can at least be built
on non-linux.
Brought to you by
go1.17 fmt ./...
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
For files that end with _linux.go or _linux_test.go, there is no need to
specify the linux build tag, as it is implied by the file name.
In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go
for the file name to make sense.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This fixes using runc with podman on my system (Fedora 34).
> $ podman --runtime `pwd`/runc run --rm --memory 4M fedora echo it works
> Error: unable to start container process: error adding seccomp filter rule for syscall bdflush: permission denied: OCI permission denied
The problem is, libseccomp returns EPERM when a redundant rule (i.e. the
rule with the same action as the default one) is added, and podman (on
my machine) sets the following rules in config.json:
<....>
    "seccomp": {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": [
            "SCMP_ARCH_X86_64",
            "SCMP_ARCH_X86",
            "SCMP_ARCH_X32"
        ],
        "syscalls": [
            {
                "names": [
                    "bdflush",
                    "io_pgetevents",
                    <....>
                ],
                "action": "SCMP_ACT_ERRNO",
                "errnoRet": 1
            },
            <....>
(Note that defaultErrnoRet is not set, but it defaults to 1).
With this commit, it works:
> $ podman --runtime `pwd`/runc run --memory 4M fedora echo it works
> it works
Add an integration test (that fails without the fix).
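One way to avoid the EPERM is to skip rules that merely repeat the default action; the sketch below assumes that approach and is not taken from the commit itself. The rule type and isRedundant helper are hypothetical, with actions represented as the OCI strings from the config above.

```go
package main

import "fmt"

// rule is a simplified, hypothetical view of one "syscalls" entry.
type rule struct {
	names    []string
	action   string
	errnoRet uint
}

// isRedundant reports whether r has the same action (and errno value)
// as the filter's default, in which case adding it gains nothing.
func isRedundant(r rule, defAction string, defErrno uint) bool {
	return r.action == defAction && r.errnoRet == defErrno
}

func main() {
	// Defaults from the config above: SCMP_ACT_ERRNO, errno 1.
	rules := []rule{
		{[]string{"bdflush", "io_pgetevents"}, "SCMP_ACT_ERRNO", 1},
		{[]string{"read", "write"}, "SCMP_ACT_ALLOW", 0},
	}
	for _, r := range rules {
		if isRedundant(r, "SCMP_ACT_ERRNO", 1) {
			fmt.Println("skipping", r.names)
			continue
		}
		fmt.Println("adding", r.names)
	}
}
```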
Similar crun commit:
* https://github.com/containers/crun/commit/08229f3fb904c5ea19a7d9
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Rather than silently ignoring unknown syscalls, print a warning.
While at it, fix imports ordering (stdlib, others, ours).
[v2: demote Warn to Debug]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Do this for all errors except one from unix.*.
This fixes a bunch of errorlint warnings, like these
libcontainer/generic_error.go:25:15: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint)
if le, ok := err.(Error); ok {
^
libcontainer/factory_linux_test.go:145:14: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint)
lerr, ok := err.(Error)
^
libcontainer/state_linux_test.go:28:11: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint)
_, ok := err.(*stateTransitionError)
^
libcontainer/seccomp/patchbpf/enosys_linux.go:88:4: switch on an error will fail on wrapped errors. Use errors.Is to check for specific errors (errorlint)
switch err {
^
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This should result in no change when the error is printed, but make the
errors returned unwrappable, meaning errors.As and errors.Is will work.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.
Brought to you by
git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w
Looking at the diff, all these changes make sense.
Also, replace gofmt with gofumpt in golangci.yml.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Most of these were false positives or cases where we want to ignore the
lint, but the change to the BPF generation is actually useful.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
TestPatchHugeSeccompFilterDoesNotBlock is only testing the
disassembleFilter function. There is no need to invoke PatchAndLoad
which has the side effect of loading a seccomp profile.
Co-authored-by: Danail Branekov <danailster@gmail.com>
Co-authored-by: Kieron Browne <kbrowne@vmware.com>
Signed-off-by: Kieron Browne <kbrowne@vmware.com>
Signed-off-by: Danail Branekov <danailster@gmail.com>
There is a potential deadlock where the ExportBPF method call writes to
a pipe but the pipe is not read until after the method call returns.
ExportBPF might fill the pipe buffer, in which case it will block
waiting for a read on the other side which can't happen until the method
returns.
Here we concurrently read from the pipe into a buffer to ensure
ExportBPF will always return.
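The pattern can be sketched as follows. This is not the patchbpf code: the export function is a hypothetical stand-in for ExportBPF that simply writes a given number of bytes, and readExport is an illustrative name.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// export is a hypothetical stand-in for ExportBPF: it writes size bytes
// to w and returns only after all of them have been written.
func export(w *os.File, size int) error {
	_, err := w.Write(make([]byte, size))
	return err
}

// readExport drains the pipe in a goroutine while export is still
// writing, so export cannot block forever on a full pipe buffer.
func readExport(size int) ([]byte, error) {
	rdr, wtr, err := os.Pipe()
	if err != nil {
		return nil, err
	}
	defer rdr.Close()

	type result struct {
		data []byte
		err  error
	}
	done := make(chan result, 1)
	go func() {
		var buf bytes.Buffer
		_, err := io.Copy(&buf, rdr)
		done <- result{buf.Bytes(), err}
	}()

	werr := export(wtr, size)
	wtr.Close() // signals EOF to the reader goroutine
	res := <-done
	if werr != nil {
		return nil, werr
	}
	return res.data, res.err
}

func main() {
	// 1 MiB is larger than the default 64 KiB Linux pipe buffer.
	data, err := readExport(1 << 20)
	fmt.Println(len(data), err)
}
```

Without the goroutine, the write in export would have to complete before any read starts, which is exactly the situation that can block once the pipe buffer is full.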
Co-authored-by: Kieron Browne <kbrowne@vmware.com>
Co-authored-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Kieron Browne <kbrowne@vmware.com>
Signed-off-by: Danail Branekov <danailster@gmail.com>
This test fails to compile on i386:
> libcontainer/seccomp/patchbpf/enosys_linux_test.go:180:20: constant 3735928559 overflows int
> libcontainer/seccomp/patchbpf/enosys_linux_test.go:204:19: constant 3735928559 overflows int
> libcontainer/seccomp/patchbpf/enosys_linux_test.go:227:25: constant 3735928559 overflows int
This is because golang.org/x/net/bpf returns an int from their emulated
BPF VM implementation when they should really be returning uint32.
Fix by switching to uint32 in the test code.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The correct way to do that conversion according to
https://pkg.go.dev/syscall#Errno is:
```
err = nil
if errno != 0 {
	err = errno
}
```
In this case, the error check after
unix.RawSyscall(unix.SYS_SECCOMP, ...) always reports a false positive.
Probably nobody has faced this problem because the code takes the other
path in most cases.
Fixes: 7a8d7162f9 ("seccomp: prepend -ENOSYS stub to all filters")
Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
This function is called by `InitSeccomp`, but only when compiled
with seccomp (and cgo) enabled, so should not be needed for other
situations.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>