If an attacker were to make the target of a device inode creation be a
symlink to some host path, os.Create would happily truncate the target,
which could lead to all sorts of issues. This issue is probably not that
exploitable in practice because device inodes are usually only bind-mounted for
rootless containers, which cannot overwrite important host files (though
user files would still be up for grabs).
The regular inode creation logic could also theoretically be tricked
into changing the access mode and ownership of host files if the
newly-created device inode was swapped with a symlink to a host path.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This helper was added for runc-dmz in commit dac417174, but runc-dmz was
later removed in commit 871057d, which forgot to remove the helper.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
filepath-securejoin has a bunch of extra hardening features and is very
well-tested, so we should use it instead of our own homebrew solution.
A lot of rootfs_linux.go callers pass a SecureJoin'd path, which means
we need to keep the wrapper helpers in utils, but at least the core
logic is no longer in runc. In future we will want to remove this dodgy
logic and just use file handles for everything (using libpathrs,
ideally).
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
While we use SecureJoin to try to make all of our target paths inside
the container safe, SecureJoin is not safe against an attacker that can
change the path after we "resolve" it.
os.MkdirAll can inadvertently follow symlinks and thus an attacker could
end up tricking runc into creating empty directories on the host (note
that the container doesn't get access to these directories, and the host
just sees empty directories). However, this could potentially cause DoS
issues by (for instance) creating a directory in a conf.d directory for
a daemon that doesn't handle subdirectories properly.
In addition, the handling for creating file bind-mounts did a plain
open(O_CREAT) on the SecureJoin'd path, which is even more obviously
unsafe (luckily we didn't use O_TRUNC, or this bug could've allowed an
attacker to cause data loss...). Regardless of the symlink issue,
opening an untrusted file could result in a DoS if the file is a hung
tty or some other "nasty" file. We can use mknodat to safely create a
regular file without opening anything anyway (O_CREAT|O_EXCL would also
work but it makes the logic a bit more complicated, and we don't want to
open the file for any particular reason anyway).
libpathrs[1] is the long-term solution for these kinds of problems, but
for now we can patch this particular issue by creating a more restricted
MkdirAll that refuses to resolve symlinks and does the creation using
file descriptors. This is loosely based on a more secure version that
filepath-securejoin now has[2] and will be added to libpathrs soon[3].
[1]: https://github.com/openSUSE/libpathrs
[2]: https://github.com/cyphar/filepath-securejoin/releases/tag/v0.3.0
[3]: https://github.com/openSUSE/libpathrs/issues/10
Fixes: CVE-2024-45310
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
There's too much logic here figuring out which CPUs to use. Runc is a
low level tool and is not supposed to be that "smart". What's worse,
this logic is executed on every exec, making it slower. Some of the
logic in (*setnsProcess).start is executed even if no annotation is set,
thus making ALL execs slow.
Also, this should be a property of a process, rather than annotation.
The plan is to rework this.
This reverts commit afc23e3397.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Go 1.23 tightens access to internal symbols, and even puts runc into
the "hall of shame" for using an internal symbol (recently added by commit
da68c8e3). So, while not impossible, it becomes harder to access those
internal symbols, and it is a bad idea in general.
Since Go 1.23 includes https://go.dev/cl/588076, we can clean the
internal rlimit cache by setting the RLIMIT_NOFILE for ourselves,
essentially disabling the rlimit cache.
Once Go 1.22 is no longer supported, we will remove the go:linkname hack.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
As reported in issue #4195, newer versions of the Go runtime (since 1.19)
cache RLIMIT_NOFILE. Before calling execve, the runtime restores the
process's RLIMIT_NOFILE from that cache. In runc, this causes the
RLIMIT_NOFILE set by the parent process for the container to be
overwritten. This can be solved by clearing the cache.
Signed-off-by: ls-ggg <335814617@qq.com>
(cherry picked from commit f9f8abf310)
Signed-off-by: lifubang <lifubang@acmcoder.com>
This handles a corner case when joining a container having all
the processes running exclusively on isolated CPU cores to force
the kernel to schedule runc process on the first CPU core within the
cgroups cpuset.
The introduction of the kernel commit
46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76 has affected this deterministic
scheduling behavior by distributing tasks across CPU cores within the
cgroups cpuset. Some intensive real-time applications rely on this
deterministic behavior and use the first CPU core to run a slow thread
while other CPU cores are fully used by real-time threads with SCHED_FIFO
policy. Such applications prevent the runc process from joining a container
when the runc process is randomly scheduled on a CPU core owned by a
real-time thread.
Introduce the isolated CPU affinity transition OCI runtime annotation
org.opencontainers.runc.exec.isolated-cpu-affinity-transition to restore
the behavior during runc exec.
Fix issue with kernel >= 6.2 not resetting CPU affinity for container processes.
Signed-off-by: Cédric Clerget <cedric.clerget@gmail.com>
Since Go 1.16, [Go issue 1435][1] is solved, and the stdlib syscall
implementations work on Linux. While they are a bit more
flexible/heavier-weight than the implementations that were copied to
libcontainer/system (working across all threads), we compile with Cgo,
and using the libc wrappers should be just as suitable.
[1]: https://github.com/golang/go/issues/1435
Signed-off-by: Bjorn Neergaard <bjorn.neergaard@docker.com>
The idea is to remove the need for cloning the entire runc binary by
replacing the final execve() call of the container process with an
execve() call to a clone of a small C binary which just does an execve()
of its arguments.
This provides similar protection against CVE-2019-5736 but without
requiring a >10MB binary copy for each "runc init". When compiled with
musl, runc-dmz is 13kB (though unfortunately with glibc, it is 1.1MB
which is still quite large).
It should be noted that there is still a window where the container
processes could get access to the host runc binary, but because we set
ourselves as non-dumpable the container would need CAP_SYS_PTRACE (which
is not enabled by default in Docker) in order to get around the
proc_fd_access_allowed() checks. In addition, since Linux 4.10[1] the
kernel blocks access entirely for user namespaced containers in this
scenario. For those cases we cannot use runc-dmz, but most containers
won't have this issue.
This new runc-dmz binary can be opted out of at compile time by setting
the "runc_nodmz" buildtag, and at runtime by setting the RUNC_DMZ=legacy
environment variable. In both cases, runc will fall back to the classic
/proc/self/exe-based cloning trick. If /proc/self/exe is already a
sealed memfd (namely if the user is using contrib/cmd/memfd-bind to
create a persistent sealed memfd for runc), neither runc-dmz nor
/proc/self/exe cloning will be used because they are not necessary.
[1]: https://github.com/torvalds/linux/commit/bfedb589252c01fa505ac9f6f2a3d5d68d707ef4
Co-authored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
[cyphar: address various review nits]
[cyphar: fix runc-dmz cross-compilation]
[cyphar: embed runc-dmz into runc binary and clone in Go code]
[cyphar: make runc-dmz optional, with fallback to /proc/self/exe cloning]
[cyphar: do not use runc-dmz when the container has certain privs]
Co-authored-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This allows us to reduce the amount of C code in runc quite
substantially, as well as removing a whole execve(2) from the nsexec
path because we no longer spawn "runc init" only to re-exec "runc init"
after doing the clone.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
golangci-lint v1.54.2 comes with errorlint v1.4.4, which contains
the fix [1] whitelisting all errno comparisons for errors coming from
x/sys/unix.
Thus, these annotations are no longer necessary. Hooray!
[1] https://github.com/polyfloyd/go-errorlint/pull/47
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When starting a new container, and the very last step of executing a
user process fails (last lines of (*linuxStandardInit).Init), it is too
late to print a proper error since both the log pipe and the init pipe
are closed.
This is partially mitigated by using exec.LookPath() which is supposed
to say whether we will be able to execute or not. Alas, it fails to do
so when the binary to be executed resides on a filesystem mounted with
noexec flag.
A workaround would be to use access(2) with X_OK flag. Alas, it does not
work when runc itself is a setuid (or setgid) binary. In this case,
faccessat2(2) with AT_EACCESS can be used, but it is only available
since Linux v5.8.
So, use faccessat2(2) with AT_EACCESS if available. If not, fall back to
access(2) for non-setuid runc, and do nothing for setuid runc (as there
is nothing we can do). Note that this check is in addition to whatever
exec.LookPath does.
Fixes https://github.com/opencontainers/runc/issues/3520
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is not used since commit 5e7b48f7c0 (23 Mar 2017).
In case there are external users, they should switch to
opencontainers/selinux.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
If the container binary to be run is removed in between runc create
and runc start, the latter spits the following error:
> can't exec user process: no such file or directory
This is a bit confusing since we don't see what file is missing.
Wrap the unix.Exec error into os.PathError, like in many other cases,
to provide some context. Remove the error wrapping from
(*linuxStandardInit).Init as it is now redundant.
With this patch, the error is now:
> exec /bin/false: no such file or directory
Reported-by: Daniel J Walsh <dwalsh@redhat.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Kir Kolyshkin (3):
libct/system: add I and P process states
libct/system.Stat: fix/improve/speedup
libct/system/proc_test: fix, improve, add benchmark
LGTMs: thaJeztah cyphar
Go 1.17 introduces this new (and better) way to specify build tags.
For more info, see https://golang.org/design/draft-gobuild.
As a way to seamlessly switch from old to new build tags, gofmt (and
gopls) from go 1.17 adds the new tags along with the old ones.
Later, when go < 1.17 is no longer supported, the old build tags
can be removed.
Now, as I started to use latest gopls (v0.7.1), it adds these tags
while I edit. Rather than randomly adding new build tags, I guess
it is better to do it once for all files.
Mind that previous commits removed some tags that were useless,
so this one only touches packages that can at least be built
on non-linux.
Brought to you by
go1.17 fmt ./...
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Those states are available since Linux 4.14 (kernel commits
8ef9925b02c23e3838d5 and 06eb61844d841d003). Before this
patch, they were shown as unknown.
This is mostly cosmetic.
Note that I is described in /proc/pid/status as just "idle", although
elsewhere it says it's an idle kernel thread. Let's have it as "idle"
for brevity.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Remove PID field as it is useless.
2. Rewrite parseStat() to make it faster and more correct:
- do not use fmt.Scanf as it is very slow;
- avoid splitting data into 20+ fields, of which we only need 2;
- make sure to not panic on short lines and other bad input;
- add some bad input tests (some fail with old code);
- use LastIndexByte instead of LastIndex.
Benchmarks:
before (from the previous commit message):
> BenchmarkParseStat-4 116415 10804 ns/op
> BenchmarkParseRealStat-4 240 4781769 ns/op
after:
> BenchmarkParseStat-4 1164948 1068 ns/op
> BenchmarkParseRealStat-4 331 3458315 ns/op
We are seeing 10x speedup in a synthetic benchmark, and about 1.4x
speedup in a real world benchmark.
While at it, do not ignore any possible errors, and properly wrap those.
[v2: use pkg/errors more, remove t.Logf from test]
[v3: rebased; drop pkg/errors; gofumpt'ed]
[v4: rebased; improved description]
[v5: rebased; mention bad input tests, added second benchmark results]
[v6: remove PID field, do not use strings.Split, further speedup]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Add a test case that tests parentheses in command.
2. Replace individual comparisons with reflect.DeepEqual.
This also fixes wrong %-style types in Fatalf statements.
3. Replace Fatalf with Errorf so we don't bail out on the first
failure, and do not check result on error.
4. Add two benchmarks. On my laptop, they show:
BenchmarkParseStat-4 116415 10804 ns/op
BenchmarkParseRealStat-4 240 4781769 ns/op
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When running a script from an Azure file share, interrupted syscalls
occur quite frequently. To remedy this, add retries around the execve
syscall when EINTR is returned.
Signed-off-by: Maksim An <maksiman@microsoft.com>
Errors from unix.* are always bare and thus can be used directly.
Add //nolint:errorlint annotation to ignore errors such as these:
libcontainer/system/xattrs_linux.go:18:7: comparing with == will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint)
case errno == unix.ERANGE:
^
libcontainer/container_linux.go:1259:9: comparing with != will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint)
if e != unix.EINVAL {
^
libcontainer/rootfs_linux.go:919:7: comparing with != will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint)
if err != unix.EINVAL && err != unix.EPERM {
^
libcontainer/rootfs_linux.go:1002:4: switch on an error will fail on wrapped errors. Use errors.Is to check for specific errors (errorlint)
switch err {
^
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
These were deprecated and moved; the stubs were included in the
last two (rc94, rc95) releases, so external consumers would have
the chance to update their code.
Remove them so that they don't get into v1.0.0 GA.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Most of these were false positives or cases where we want to ignore the
lint, but the change to the BPF generation is actually useful.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Moving these utilities to a separate package, so that consumers of this
package don't have to pull in the whole "system" package.
Looking at uses of these utilities (outside of runc itself),
`RunningInUserNS()` is used by [various external consumers][1],
so adding a "Deprecated" alias for this.
[1]: https://grep.app/search?current=2&q=.RunningInUserNS
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
GetProcessStartTime was deprecated over three years ago in
439eaa3584, so we may as well remove
it now.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
In many places (not all of them though) we can use `unix.`
instead of `syscall.` as these are identical.
In particular, x/sys/unix defines:
```go
type Signal = syscall.Signal
type Errno = syscall.Errno
type SysProcAttr = syscall.SysProcAttr
const ENODEV = syscall.Errno(0x13)
```
and unix.Exec() calls syscall.Exec().
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>