293 Commits

Author SHA1 Message Date
Aleksa Sarai 3b57e45cbf mount: add support for ridmap and idmap
ridmap indicates that the id mapping should be applied recursively (only
really relevant for rbind mount entries), and idmap indicates that it
should not be applied recursively (the default). If no mappings are
specified for the mount, we use the userns configuration of the
container. This matches the behaviour in the currently-unreleased
runtime-spec.

This includes a minor change to the state.json serialisation format, but
because there has been no released version of runc with commit
fbf183c6f8 ("Add uid and gid mappings to mounts"), we can safely make
this change without affecting running containers. Doing it this way
makes it much easier to handle m.IsIDMapped() and indicating that a
mapping has been specified.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:42 +11:00
Aleksa Sarai 8e8b136c49 tree-wide: use /proc/thread-self for thread-local state
With the idmap work, we will have a tainted Go thread in our
thread-group that has a different mount namespace to the other threads.
It seems that (due to some bad luck) the Go scheduler tends to make this
thread the thread-group leader in our tests, which results in very
baffling failures where /proc/self/mountinfo produces gibberish results.

In order to avoid this, switch to using /proc/thread-self for everything
that is thread-local. This primarily includes switching all file
descriptor paths (CLONE_FS), all of the places that check the current
cgroup (technically we never will run a single runc thread in a separate
cgroup, but better to be safe than sorry), and the aforementioned
mountinfo code. We don't need to do anything for the following because
the results we need aren't thread-local:

 * Checks that certain namespaces are supported by stat(2)ing
   /proc/self/ns/...

 * /proc/self/exe and /proc/self/cmdline are not thread-local.

 * While threads can be in different cgroups, we do not do this for the
   runc binary (or libcontainer) and thus we do not need to switch to
   the thread-local version of /proc/self/cgroup.

 * All of the CLONE_NEWUSER files are not thread-local because you
   cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER)
   is blocked for multi-threaded programs).

Note that we have to use runtime.LockOSThread when we have an open
handle to a tid-specific procfs file that we are operating on multiple
times. Go can reschedule us such that we are running on a different
thread and then kill the original thread (causing -ENOENT or similarly
confusing errors). This is not strictly necessary for most usages of
/proc/thread-self (such as using /proc/thread-self/fd/$n directly) since
only operating on the actual inodes associated with the tid requires
this locking, but because of the pre-3.17 fallback for CentOS, we have
to do this in most cases.

In addition, CentOS's kernel is too old for /proc/thread-self, which
requires us to emulate it -- however in rootfs_linux.go, we are in the
container pid namespace but /proc is the host's procfs. This leads to
the incredibly frustrating situation where there is no way (on pre-4.1
Linux) to figure out which /proc/self/task/... entry refers to the
current tid. We can just use /proc/self in this case.
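
The fallback order described above can be sketched roughly as follows. This
is only an illustration, not the code from this commit: threadSelfPath and
its exists callback are hypothetical names, and a real caller would hold
runtime.LockOSThread and pass its own tid (for example from unix.Gettid()).

```go
package main

import (
	"fmt"
	"os"
	"path"
	"strconv"
)

// threadSelfPath is a hypothetical helper. It picks the procfs path for
// subpath that refers to thread tid, preferring /proc/thread-self and
// falling back for kernels (or procfs views) where that is not usable.
func threadSelfPath(exists func(string) bool, tid int, subpath string) string {
	// Preferred: the kernel-provided /proc/thread-self link.
	if exists("/proc/thread-self") {
		return path.Join("/proc/thread-self", subpath)
	}
	// Emulation for older kernels: /proc/self/task/<tid>.
	task := path.Join("/proc/self/task", strconv.Itoa(tid))
	if exists(task) {
		return path.Join(task, subpath)
	}
	// Last resort: our tid is not visible in this procfs (for example a
	// host procfs seen from another pid namespace), so use /proc/self.
	return path.Join("/proc/self", subpath)
}

func main() {
	onDisk := func(p string) bool {
		_, err := os.Lstat(p)
		return err == nil
	}
	// 1 is a placeholder tid for the demonstration.
	fmt.Println(threadSelfPath(onDisk, 1, "mountinfo"))
}
```

Passing the existence check as a callback keeps the sketch testable without
a real procfs; it is a choice made for this example only.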

Yes this is all pretty ugly. I also wish it wasn't necessary.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:41 +11:00
Aleksa Sarai 5ae88daf06 idmap: allow arbitrary idmap mounts regardless of userns configuration
With the rework of nsexec.c to handle MOUNT_ATTR_IDMAP in our Go code we
can now handle arbitrary mappings without issue, so remove the primary
artificial limit of mappings (must use the same mapping as the
container's userns) and add some tests.

We still only support idmap mounts for bind-mounts because configuring
mappings for other filesystems would require switching our entire mount
machinery to the new mount API. The current design would easily allow
for this but we would need to convert new mount options entirely to the
fsopen/fsconfig/fsmount API. This can be done in the future.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:41 +11:00
Aleksa Sarai 09822c3da8 configs: disallow ambiguous userns and timens configurations
For userns and timens, the mappings (and offsets, respectively) cannot
be changed after the namespace is first configured. Thus, configuring a
container with a namespace path to join means that you cannot also
provide configuration for said namespace. Previously we would silently
ignore the configuration (and just join the provided path), but we
really should be returning an error (especially when you consider that
the configuration userns mappings are used quite a bit in runc with the
assumption that they are the correct mapping for the userns -- but in
this case they are not).

In the case of userns, the mappings are also required if you _do not_
specify a path, while in the case of the time namespace you can have a
container with a timens but no mappings specified.

It should be noted that the case checking that the user has not
specified a userns path and a userns mapping needs to be handled in
specconv (as opposed to the configuration validator) because with this
patchset we now cache the mappings of path-based userns configurations
and thus the validator can't be sure whether the mapping is a cached
mapping or a user-specified one. So we do the validation in specconv,
and thus the test for this needs to be an integration test.
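
As a rough illustration of the rule for userns (path and mappings are
mutually exclusive, and a new userns needs mappings), a check could look
like the sketch below. The UserNS and IDMap types and the error strings are
simplified, hypothetical stand-ins, not the actual specconv code.

```go
package main

import (
	"errors"
	"fmt"
)

// IDMap and UserNS are hypothetical, simplified configuration types.
type IDMap struct {
	ContainerID, HostID, Size int64
}

type UserNS struct {
	Path        string // existing namespace to join; empty means "create a new one"
	UIDMappings []IDMap
	GIDMappings []IDMap
}

// checkUserNS rejects ambiguous or incomplete userns configurations.
func checkUserNS(ns UserNS) error {
	hasMappings := len(ns.UIDMappings) > 0 || len(ns.GIDMappings) > 0
	switch {
	case ns.Path != "" && hasMappings:
		// Mappings of an existing namespace cannot be changed.
		return errors.New("userns: cannot specify both a namespace path and id mappings")
	case ns.Path == "" && !hasMappings:
		// A newly created userns is unusable without mappings.
		return errors.New("userns: a new user namespace requires id mappings")
	}
	return nil
}

func main() {
	ns := UserNS{
		Path:        "/proc/1234/ns/user",
		UIDMappings: []IDMap{{ContainerID: 0, HostID: 1000, Size: 1}},
	}
	fmt.Println(checkUserNS(ns))
}
```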

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-05 17:46:09 +11:00
Aleksa Sarai 3bab7e9223 configs: clean up error messages for Host[UG]ID
If a user has misconfigured their userns mappings, they need to know
which id specifically is not mapped. There's no need to be vague.
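
For illustration, an id translation helper that names the unmapped id might
look like the following sketch. The hostID function, the IDMap type, and
the error wording are assumptions for this example, not the Host[UG]ID code
itself.

```go
package main

import "fmt"

// IDMap is a hypothetical, simplified id mapping entry.
type IDMap struct {
	ContainerID, HostID, Size int64
}

// hostID translates a container id to a host id and, on failure, reports
// exactly which id had no mapping.
func hostID(containerID int64, mappings []IDMap) (int64, error) {
	for _, m := range mappings {
		if containerID >= m.ContainerID && containerID < m.ContainerID+m.Size {
			return m.HostID + (containerID - m.ContainerID), nil
		}
	}
	return -1, fmt.Errorf("no mapping found for container id %d", containerID)
}

func main() {
	maps := []IDMap{{ContainerID: 0, HostID: 100000, Size: 10}}
	fmt.Println(hostID(5, maps))
	fmt.Println(hostID(10, maps))
}
```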

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-05 17:46:09 +11:00
Aleksa Sarai 1912d5988b *: actually support joining a userns with a new container
Our handling of namespace paths with user namespaces has been broken
for a long time. In particular, the need to parse /proc/self/*id_map in
quite a few places meant that we would treat userns configurations that
had a namespace path as if they were a userns configuration without
mappings, resulting in errors.

The primary issue was down to the id translation helper functions, which
could only handle configurations that had explicit mappings. Obviously,
when joining a user namespace we need to map the ids but figuring out
the correct mapping is non-trivial in comparison.

In order to get the mapping, you need to read /proc/<pid>/*id_map of a
process inside the userns -- while most userns paths will be of the form
/proc/<pid>/ns/user (and we have a fast-path for this case), this is not
guaranteed and thus it is necessary to spawn a process inside the
container and read its /proc/<pid>/*id_map files in the general case.
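
Once the contents of such a file are available, turning them into mappings
is straightforward. The sketch below assumes the usual three-column layout
of uid_map/gid_map (id inside the namespace, id outside, range length);
parseIDMap and IDMap are hypothetical names for this example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// IDMap is a hypothetical, simplified id mapping entry.
type IDMap struct {
	ContainerID, HostID, Size int64
}

// parseIDMap parses the contents of a /proc/<pid>/uid_map or gid_map file.
func parseIDMap(contents string) ([]IDMap, error) {
	var maps []IDMap
	for _, line := range strings.Split(contents, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if len(fields) != 3 {
			return nil, fmt.Errorf("malformed id map line %q", line)
		}
		var v [3]int64
		for i, f := range fields {
			n, err := strconv.ParseInt(f, 10, 64)
			if err != nil {
				return nil, fmt.Errorf("malformed id map line %q: %w", line, err)
			}
			v[i] = n
		}
		maps = append(maps, IDMap{ContainerID: v[0], HostID: v[1], Size: v[2]})
	}
	return maps, nil
}

func main() {
	maps, err := parseIDMap("0 1000 1\n1 100000 65536\n")
	fmt.Println(maps, err)
}
```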

As Go does not allow us to spawn a subprocess into a target userns,
we have to use CGo to fork a sub-process which does the setns(2). To be
honest, this is a little dodgy in regards to POSIX signal-safety(7) but
since we do no allocations and we are executing in the forked context
from a Go program (not a C program), it should be okay. The other
alternative would be to do an expensive re-exec (a-la nsexec which would
make several other bits of runc more complicated), or to use nsenter(1)
which might not exist on the system and is less than ideal.

Because we need to logically remap users quite a few times in runc
(including in "runc init", where joining the namespace is not feasable),
we cache the mapping inside the libcontainer config struct. A future
patch will make sure that we stop allow invalid user configurations
where a mapping is specified as well as a userns path to join.

Finally, add an integration test to make sure we don't regress this again.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-05 17:46:08 +11:00
Aleksa Sarai 669f4dbef8 configs: validate: add validation for bind-mount fsflags
Bind-mounts cannot have any filesystem-specific "data" arguments,
because the kernel ignores the data argument for MS_BIND and
MS_BIND|MS_REMOUNT and we cannot safely try to override the flags
because those would affect mounts on the host (these flags affect the
superblock).

It should be noted that there are cases where the filesystem-specified
flags will also be ignored for non-bind-mounts but those are kernel
quirks and there's no real way for us to work around them. And users
wouldn't get any real benefit from us adding guardrails to existing
kernel behaviour.
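
A minimal sketch of such a check is shown below. The Mount type and the
error text are hypothetical, and msBind is a local copy of the kernel's
MS_BIND value so that the example builds on any platform.

```go
package main

import "fmt"

// msBind mirrors the kernel's MS_BIND flag value.
const msBind = 0x1000

// Mount is a hypothetical, simplified mount entry.
type Mount struct {
	Destination string
	Flags       int
	Data        string // filesystem-specific options
}

// checkBindData rejects bind-mounts that carry filesystem-specific data,
// since the kernel would silently ignore it.
func checkBindData(m Mount) error {
	if m.Flags&msBind != 0 && m.Data != "" {
		return fmt.Errorf("bind mount %q cannot have filesystem-specific options %q",
			m.Destination, m.Data)
	}
	return nil
}

func main() {
	fmt.Println(checkBindData(Mount{Destination: "/data", Flags: msBind, Data: "size=64m"}))
	fmt.Println(checkBindData(Mount{Destination: "/tmp", Data: "size=64m"}))
}
```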

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-11-02 07:50:03 +11:00
Rodrigo Campos 4bf8b55594 libct: Remove old comment
We changed it in PR:
	https://github.com/opencontainers/runtime-spec/pull/1225

But we forgot to remove this comment.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
2023-11-01 12:48:42 +01:00
Zheao.Li 98511bb40e linux: Support setting execution domain via linux personality
carry #3126

Co-authored-by: Aditya R <arajan@redhat.com>
Signed-off-by: Zheao.Li <me@manjusaka.me>
2023-10-27 19:33:37 +08:00
Aleksa Sarai 7c71a22705 rootfs: remove --no-mount-fallback and finally fix MS_REMOUNT
The original reasoning for this option was to avoid having mount options
be overwritten by runc. However, adding command-line arguments has
historically been a bad idea because it forces strict-runc-compatible
OCI runtimes to copy out-of-spec features directly from runc and these
flags are usually quite difficult to enable by users when using runc
through several layers of engines and orchestrators.

A far more preferable solution is to have a heuristic which detects
whether copying the original mount's mount options would override an
explicit mount option specified by the user. In this case, we should
return an error. You only end up in this path in the userns case, if you
have a bind-mount source with locked flags.

During the course of writing this patch, I discovered that several
aspects of our handling of flags for bind-mounts left much to be
desired. We have completely botched the handling of explicitly cleared
flags since commit 97f5ee4e6a ("Only remount if requested flags differ
from current"), with our behaviour only becoming increasingly weird
with 50105de1d8 ("Fix failure with rw bind mount of a ro fuse") and
da780e4d27 ("Fix bind mounts of filesystems with certain options
set"). In short, we would only clear flags explicitly requested by the
user purely by chance, in ways that it really should've been reported to
us by now. The most egregious is that mounts explicitly marked "rw" were
actually mounted "ro" if the bind-mount source was "ro" and no other
special flags were included. In addition, our handling of atime was
completely broken -- mostly due to how subtle the semantics of atime are
on Linux.

Unfortunately, while the runtime-spec requires us to implement
mount(8)'s behaviour, several aspects of the util-linux mount(8)'s
behaviour are broken and thus copying them makes little sense. Since the
runtime-spec behaviour for this case (should mount options for a "bind"
mount use the "mount --bind -o ..." or "mount --bind -o remount,..."
semantics? Is the fallback code we have for userns actually
spec-compliant?) and the mount(8) behaviour (see [1]) are not
well-defined, this commit simply fixes the most obvious aspects of the
behaviour that are broken while keeping the current spirit of the
implementation.

NOTE: The handling of atime in the base case is left for a future PR to
deal with. This means that the atime of the source mount will be
silently left alone unless the fallback path needs to be taken, and any
flags not explicitly set will be cleared in the base case. Whether we
should always be operating as "mount --bind -o remount,..." (where we
default to the original mount source flags) is a topic for a separate PR
and (probably) associated runtime-spec PR.

So, to resolve this:

* We store which flags were explicitly requested to be cleared by the
  user, so that we can detect whether the userns fallback path would end
  up setting a flag the user explicitly wished to clear. If so, we
  return an error because we couldn't fulfil the configuration settings.

* Revert 97f5ee4e6a ("Only remount if requested flags differ from
  current"), as missing flags do not mean we can skip MS_REMOUNT (in
  fact, missing flags are how you indicate a flag needs to be cleared
  with mount(2)). The original purpose of the patch was to fix the
  userns issue, but as mentioned above the correct mechanism is to do a
  fallback mount that copies the lockable flags from statfs(2).

* Improve handling of atime in the fallback case by:
    - Correctly handling the returned flags in statfs(2).
    - Implement the MNT_LOCK_ATIME checks in our code to ensure we
      produce errors rather than silently producing incorrect atime
      mounts.

* Improve the tests so we correctly detect all of these contingencies,
  including a general "bind-mount atime handling" test to ensure that
  the behaviour described here is accurate.

This change also inlines the remount() function -- it was only ever used
for the bind-mount remount case, and its behaviour is very bind-mount
specific.
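
The "explicitly cleared flags" check in the first bullet can be pictured
with the following sketch. It is a simplification under stated assumptions:
fallbackFlags is a hypothetical name, the flag constants are local copies
of a few kernel MS_* values, and the atime handling is left out entirely.

```go
package main

import "fmt"

// Local copies of a few kernel MS_* flag values.
const (
	msRdonly = 0x1
	msNosuid = 0x2
	msNodev  = 0x4
	msNoexec = 0x8
)

// fallbackFlags computes the flags for the userns fallback remount.
// requested holds the flags the user set, cleared the flags the user
// explicitly asked to clear (such as "rw"), and locked the locked flags
// of the bind-mount source (as reported by statfs(2)).
func fallbackFlags(requested, cleared, locked uint) (uint, error) {
	if conflict := cleared & locked; conflict != 0 {
		// Copying the locked flags would undo an explicit request.
		return 0, fmt.Errorf("cannot clear locked mount flags %#x", conflict)
	}
	return requested | locked, nil
}

func main() {
	// "rw" was requested explicitly, but the source is locked read-only.
	_, err := fallbackFlags(msNoexec, msRdonly, msRdonly|msNosuid)
	fmt.Println(err)

	flags, _ := fallbackFlags(msNoexec, 0, msNodev)
	fmt.Printf("%#x\n", flags)
}
```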

[1]: https://github.com/util-linux/util-linux/issues/2433

Reverts: 97f5ee4e6a ("Only remount if requested flags differ from current")
Fixes: 50105de1d8 ("Fix failure with rw bind mount of a ro fuse")
Fixes: da780e4d27 ("Fix bind mounts of filesystems with certain options set")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-10-24 17:28:25 +11:00
utam0k 770728e16e Support process.scheduler
Spec: https://github.com/opencontainers/runtime-spec/pull/1188
Fix: https://github.com/opencontainers/runc/issues/3895

Co-authored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: utam0k <k0ma@utam0k.jp>
Signed-off-by: lifubang <lifubang@acmcoder.com>
2023-10-04 15:53:18 +08:00
Rodrigo Campos b17c6f237d validator: Relax warning for not abs mount dst path
The runtime spec now allows relative mount dst paths, so remove the
comment saying we will switch this to an error later and change the
error messages to reflect that.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
2023-09-11 16:02:41 +02:00
Kailun Qin e1584831b6 libct/cg: add CFS bandwidth burst for CPU
Burstable CFS controller is introduced in Linux 5.14. This helps with
parallel workloads that might be bursty. They can get throttled even
when their average utilization is under quota. And they may be latency
sensitive at the same time so that throttling them is undesired.

This feature borrows time now against the future underrun, at the cost
of increased interference against the other system users, by introducing
cfs_burst_us into CFS bandwidth control to enact the cap on unused
bandwidth accumulation, which will then be used additionally for burst.

The patch adds the support/control for CFS bandwidth burst.

runtime-spec: https://github.com/opencontainers/runtime-spec/pull/1120

Co-authored-by: Akihiro Suda <suda.kyoto@gmail.com>
Co-authored-by: Nadeshiko Manju <me@manjusaka.me>
Signed-off-by: Kailun Qin <kailun.qin@intel.com>
2023-09-06 23:23:30 +08:00
Kir Kolyshkin 6a4870e4ac libct: better errors for hooks
When a hook has failed, the error message looks like this:

> error running hook: error running hook #1: exit status 1, stdout: ...

The two problems here are:
1. it is impossible to know what kind of hook it was;
2. "error running hook" stuttering;

Change that to

> error running createContainer hook #1: exit status 1, stdout: ...
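
A sketch of how such a message could be produced is shown below. The Hook
type and the runHooks, ok, and failing helpers are hypothetical, and the
hook number is simply the slice index, which is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Hook is a hypothetical, simplified hook: a function that may fail.
type Hook func() error

// Two trivial hooks used for the demonstration.
func ok() error      { return nil }
func failing() error { return errors.New("exit status 1") }

// runHooks runs hooks of the given kind in order and wraps the first
// failure once, naming the hook kind and its index.
func runHooks(kind string, hooks []Hook) error {
	for i, h := range hooks {
		if err := h(); err != nil {
			return fmt.Errorf("error running %s hook #%d: %w", kind, i, err)
		}
	}
	return nil
}

func main() {
	fmt.Println(runHooks("createContainer", []Hook{ok, failing}))
}
```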

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-24 19:44:05 -07:00
Aleksa Sarai 1f25724a96 configs: fix idmapped mounts json field names
In the runc state JSON we always use snake_case. This is a no-op change,
but it will cause any existing container state files to be incorrectly
parsed. Luckily, commit fbf183c6f8 ("Add uid and gid mappings to
mounts") has never been in a runc release so we can change this before a
1.2.z release.

Fixes: fbf183c6f8 ("Add uid and gid mappings to mounts")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-08-15 19:54:24 -07:00
Aleksa Sarai aa5f4c1137 tests: add several timens tests
These are not exhaustive, but at least confirm that the feature is not
obviously broken (we correctly set the time offsets).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-08-10 19:01:31 +10:00
Aleksa Sarai 9acfd7b1a3 timens: minor cleanups
Fix up a few things that were flagged in the review of the original
timens PR, namely around error handling and validation.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-08-10 18:59:55 +10:00
Aleksa Sarai 0866112e81 merge #3876 into opencontainers/runc:main
Chethan Suresh (1):
  Support time namespace

LGTMs: kolyshkin cyphar
Closes #3876
2023-08-10 18:27:17 +10:00
Rodrigo Campos 19d26a6596 Revert "libct/validator: Error out on non-abs paths"
This reverts commit 881e92a3fd and adjusts
the code so the idmap validations are strict.

We now only throw a warning and the container is started just fine.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
2023-08-08 13:45:31 +02:00
Chethan Suresh ebc2e7c435 Support time namespace
"time" namespace was introduced in Linux v5.6
Add support for the new time namespace, which allows setting the boottime
and monotonic clock offsets.

Example runtime spec:

"timeOffsets": {
    "monotonic": {
        "secs": 172800,
        "nanosecs": 0
    },
    "boottime": {
        "secs": 604800,
        "nanosecs": 0
    }
}
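
As an illustration only, offsets like these could be turned into the
"<clock> <secs> <nanosecs>" lines that /proc/<pid>/timens_offsets accepts
roughly as follows. timensOffsets and TimeOffset are hypothetical names,
and the input is the value of the "timeOffsets" key wrapped in braces.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// TimeOffset is a hypothetical type mirroring one entry of the example.
type TimeOffset struct {
	Secs     int64  `json:"secs"`
	Nanosecs uint32 `json:"nanosecs"`
}

// timensOffsets renders a timeOffsets JSON object as timens_offsets lines,
// sorted by clock name so the output is deterministic.
func timensOffsets(raw string) (string, error) {
	var offsets map[string]TimeOffset
	if err := json.Unmarshal([]byte(raw), &offsets); err != nil {
		return "", err
	}
	clocks := make([]string, 0, len(offsets))
	for c := range offsets {
		if c != "monotonic" && c != "boottime" {
			return "", fmt.Errorf("unsupported clock %q", c)
		}
		clocks = append(clocks, c)
	}
	sort.Strings(clocks)

	var b strings.Builder
	for _, c := range clocks {
		o := offsets[c]
		fmt.Fprintf(&b, "%s %d %d\n", c, o.Secs, o.Nanosecs)
	}
	return b.String(), nil
}

func main() {
	out, err := timensOffsets(`{"monotonic": {"secs": 172800, "nanosecs": 0}}`)
	fmt.Print(out, err)
}
```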

Signed-off-by: Chethan Suresh <chethan.suresh@sony.com>
2023-08-03 10:12:01 +05:30
Ruediger Pluem da780e4d27 Fix bind mounts of filesystems with certain options set
Currently bind mounts of filesystems with nodev, nosuid, noexec,
noatime, relatime, strictatime, nodiratime options set fail in rootless
mode if the same options are not set for the bind mount.
For ro filesystems this was resolved by #2570 by remounting again
with ro set.

Follow the same approach for nodev, nosuid, noexec, noatime, relatime,
strictatime, nodiratime but allow to revert back to the old behaviour
via the new `--no-mount-fallback` command line option.

Add a testcase to verify that bind mounts of filesystems with nodev,
nosuid, noexec, noatime options set work in rootless mode.
Add a testcase that mounts a nodev, nosuid, noexec, noatime filesystem
with a ro flag.
Add two further testcases that ensure that the above testcases would
fail if the `--no-mount-fallback` command line option is set.

* contrib/completions/bash/runc:
      Add `--no-mount-fallback` command line option for bash completion.

* create.go:
      Add `--no-mount-fallback` command line option.

* restore.go:
      Add `--no-mount-fallback` command line option.

* run.go:
      Add `--no-mount-fallback` command line option.

* libcontainer/configs/config.go:
      Add `NoMountFallback` field to the `Config` struct to store
      the command line option value.

* libcontainer/specconv/spec_linux.go:
      Add `NoMountFallback` field to the `CreateOpts` struct to store
      the command line option value and store it in the libcontainer
      config.

* utils_linux.go:
      Store the command line option value in the `CreateOpts` struct.

* libcontainer/rootfs_linux.go:
      If `--no-mount-fallback` is not set, try to remount the
      bind filesystem again with the options nodev, nosuid, noexec,
      noatime, relatime, strictatime or nodiratime if they are set on
      the source filesystem.

* tests/integration/mounts_sshfs.bats:
      Add testcases and rework sshfs setup to allow specifying
      different mount options depending on the test case.

Signed-off-by: Ruediger Pluem <ruediger.pluem@vodafone.com>
2023-07-28 16:32:02 -07:00
lifubang c875ea8529 use the length of UIDMappings/GIDMappings to check whether empty or not
Signed-off-by: lifubang <lifubang@acmcoder.com>
2023-07-23 11:59:18 +08:00
Francis Laniel c47f58c4e9 Capitalize [UG]idMappings as [UG]IDMappings
Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
2023-07-21 13:55:34 +02:00
Rodrigo Campos fbf183c6f8 Add uid and gid mappings to mounts
Co-authored-by: Francis Laniel <flaniel@linux.microsoft.com>
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
2023-07-11 16:17:48 +02:00
Rodrigo Campos 881e92a3fd libct/validator: Error out on non-abs paths
This was already a warning, and it was requested that we make it an error
while we add validation of idmap mounts:
	https://github.com/opencontainers/runc/pull/3717#discussion_r1154705318

I've also tested a k8s cluster and the config.json generated by
containerd didn't use any relative paths. I tested one pod, so it was
definitely not an extensive test.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
2023-07-07 12:00:33 +02:00
Kir Kolyshkin 2a7dcbbb40 libct: fix shared pidns detection
When someone is using libcontainer to start and kill containers from a
long lived process (i.e. the same process creates and removes the
container), initProcess.wait method is used, which has a kludge to work
around killing containers that do not have their own PID namespace.

The code that checks for own PID namespace is not entirely correct.
To be exact, it does not set sharePidns flag when the host/caller PID
namespace is implicitly used. As a result, the above mentioned kludge
does not work.

Fix the issue, add a test case (which fails without the fix).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-06-08 09:23:29 -07:00
utam0k d9230602e9 Implement to set a domainname
opencontainers/runtime-spec#1156

Signed-off-by: utam0k <k0ma@utam0k.jp>
2023-04-12 13:31:20 +00:00
wineway 81c379fa8b support SCHED_IDLE for runc cgroupfs
Signed-off-by: wineway <wangyuweihx@gmail.com>
2023-01-31 15:19:05 +08:00
Eric Ernst e29e57b5fc libcontainer: configs: ensure can build on darwin
configs package can no longer be built on non-Linux OS, such as Darwin.

When running `GOOS=darwin go build` on the package, we had the following
errors:
```
./configs/mount.go:34:16: undefined: unix.MountAttr
./configs/mount.go:47:22: undefined: unix.MS_BIND
```

Let's ensure that the linux specific bits are handled in mount_linux.go,
and introduce a _unsupported file, similar to how cgroups file is
handled within the package. This'll facilitate utilization of the pkg
for other projects that care about Darwin.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2023-01-17 15:57:08 -08:00
Kir Kolyshkin 6462e9de67 runc update: implement memory.checkBeforeUpdate
This is aimed at solving the problem of cgroup v2 memory controller
behavior which is not compatible with that of cgroup v1.

In cgroup v1, if the new memory limit being set is lower than the
current usage, setting the new limit fails.

In cgroup v2, the same operation succeeds, and the container is OOM killed.

Introduce a new setting, memory.checkBeforeUpdate, and use it to mimic
cgroup v1 behavior.

Note that this is not 100% reliable because of TOCTOU, but this is the
best we can do.
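
The check itself amounts to a comparison made just before the write, as in
the sketch below. checkMemoryLimit is a hypothetical name; in practice the
usage would be read from the cgroup immediately beforehand, which is where
the TOCTOU window comes from.

```go
package main

import "fmt"

// checkMemoryLimit mimics the cgroup v1 behaviour: refuse a new limit
// that is below the current usage. Both values are in bytes; a limit of
// zero or less is treated here as "no limit to check".
func checkMemoryLimit(usage, newLimit int64) error {
	if newLimit > 0 && newLimit < usage {
		return fmt.Errorf("memory limit %d is below current usage %d", newLimit, usage)
	}
	return nil
}

func main() {
	fmt.Println(checkMemoryLimit(200<<20, 100<<20)) // limit below usage
	fmt.Println(checkMemoryLimit(50<<20, 100<<20))  // limit above usage
}
```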

Add some test cases.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-11-02 17:15:26 -07:00
Kir Kolyshkin 45cc290f02 libct: fixes for godoc 1.19
Since Go 1.19, godoc recognizes lists, code blocks, headings etc. It
also reformats the sources making it more apparent that these features
are used.

Fix a few places where it misinterpreted the formatting (such as
indented vs unindented), and format the result using the gofumpt
from HEAD, which already incorporates gofmt 1.19 changes.

Some more fixes (and enhancements) might be required.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-08-16 09:53:54 -07:00
Alban Crequy 58ea21daef seccomp: add support for flags
List of seccomp flags defined in runtime-spec:
* SECCOMP_FILTER_FLAG_TSYNC
* SECCOMP_FILTER_FLAG_LOG
* SECCOMP_FILTER_FLAG_SPEC_ALLOW

Note that runc does not apply SECCOMP_FILTER_FLAG_TSYNC. It does not
make sense to apply the seccomp filter on only one thread; other threads
will be terminated after exec anyway.

See similar commit in crun:
https://github.com/containers/crun/commit/fefabffa2816ea343068ed036a86944393db189a

Note that SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (introduced by
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?id=c2aa2dfef243
in Linux 5.19-rc1) is not added yet because Linux 5.19 is not released
yet.

Signed-off-by: Alban Crequy <albancrequy@microsoft.com>
2022-07-28 16:25:26 +02:00
cdoern c0be1aa2d1 export blockIODevice
The struct blockIODevice is used in an exported struct but is not itself
exported, rendering that type inaccessible to outside projects.

Signed-off-by: cdoern <cdoern@redhat.com>
2022-06-13 13:40:39 -04:00
Kir Kolyshkin 48006d0007 libct/configs/validate: rootlessEUIDMount: speedup
1. Fix function docs. In particular, remove the part
   which is not true ("verifies that the user isn't trying to set up any
   mounts they don't have the rights to do"), and fix the part that
   says "that doesn't resolve to root" (which is no longer true since
   commit d8b669400a).

2. Replace fmt.Sscanf (which is slow and does lots of allocations)
   with strings.TrimPrefix and strconv.Atoi.

3. Add a benchmark for rootlessEUIDMount. Comparing the old and the new
   implementations:

	name                 old time/op    new time/op    delta
	RootlessEUIDMount-4    1.01µs ± 2%    0.16µs ± 1%  -84.15%  (p=0.008 n=5+5)

	name                 old alloc/op   new alloc/op   delta
	RootlessEUIDMount-4      224B ± 0%       80B ± 0%  -64.29%  (p=0.008 n=5+5)

	name                 old allocs/op  new allocs/op  delta
	RootlessEUIDMount-4      7.00 ± 0%      1.00 ± 0%  -85.71%  (p=0.008 n=5+5)

Note this code is already tested (in rootless_test.go).
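
The replacement in item 2 follows a common pattern, sketched here with a
hypothetical parseIDOpt helper (the real function and its option handling
may differ):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseIDOpt extracts the numeric id from a mount option such as
// "uid=1000". It reports false if the prefix is absent or the remainder
// is not a number.
func parseIDOpt(opt, prefix string) (int, bool) {
	rest := strings.TrimPrefix(opt, prefix)
	if rest == opt {
		return 0, false // prefix not present
	}
	id, err := strconv.Atoi(rest)
	if err != nil {
		return 0, false
	}
	return id, true
}

func main() {
	fmt.Println(parseIDOpt("uid=1000", "uid="))
	fmt.Println(parseIDOpt("mode=755", "uid="))
}
```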

Fixes: d8b669400a
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-03-17 13:39:55 -07:00
Markus Lehtonen 1d5c331042 configs/validate: looser validation for RDT
Don't require CAT or MBA because we don't detect those correctly (we
don't support L2 or L3DATA/L3CODE for example, and in the future
possibly even more). With plain "ClosId mode" we don't really care: we
assign the container to a pre-configured CLOS without trying to do
anything smarter.

Moreover, this was a duplicate/redundant check anyway, as for CAT and
MBA there is another specific sanity check that is done if L3 or MB
is specified in the config.

Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
2022-02-18 16:24:50 +02:00
Kir Kolyshkin 0d21515038 libct: remove Validator interface
We only have one implementation of config validator, which is always
used. It makes no sense to have Validator interface.

Having validate.Validator field in Factory does not make sense for all
the same reasons.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-02-03 11:40:29 -08:00
Sebastiaan van Stijn 7173c53906 Merge pull request #3350 from kolyshkin/mount-cmds
libct: Mount: rm {Pre,Post}mountCmds
2022-02-03 10:12:56 +01:00
Kir Kolyshkin 1b14d97484 libct/configs: rm Windows TODO
It's clear at this point that runc won't support Windows.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-01-26 18:59:47 -08:00
Kir Kolyshkin 0fec1c2d8c libct: Mount: rm {Pre,Post}mountCmds
Those were added by commit 59c5c3ac0 back in Apr 2015, but AFAICS were
never used and are obsoleted by more generic container hooks (initially
added by commit 05567f2c94 in Sep 2015).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-01-26 15:51:55 -08:00
Akihiro Suda 382eba4354 Support recursive mount attrs ("rro", "rnosuid", "rnodev", ...)
The new mount option "rro" makes the mount point recursively read-only,
by calling `mount_setattr(2)` with `MOUNT_ATTR_RDONLY` and `AT_RECURSIVE`.
https://man7.org/linux/man-pages/man2/mount_setattr.2.html

Requires kernel >= 5.12.

The "rro" option string conforms to the proposal in util-linux/util-linux Issue 1501.

Fix issue 2823

Similarly, this commit also adds the following mount options:
- rrw
- r[no]{suid,dev,exec,relatime,atime,strictatime,diratime,symfollow}
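
As a rough sketch, a few of these options could be mapped to the attr_set
and attr_clr masks of mount_setattr(2) as shown below. The table covers
only four attributes, the constants are local copies of the kernel's
MOUNT_ATTR_* values, and parseRecAttrs is a hypothetical name.

```go
package main

import "fmt"

// Local copies of some MOUNT_ATTR_* values from <linux/mount.h>.
const (
	attrRdonly = 0x1
	attrNosuid = 0x2
	attrNodev  = 0x4
	attrNoexec = 0x8
)

type attrChange struct{ set, clr uint64 }

// recAttrs maps a subset of the recursive options to attribute changes.
var recAttrs = map[string]attrChange{
	"rro":     {set: attrRdonly},
	"rrw":     {clr: attrRdonly},
	"rnosuid": {set: attrNosuid},
	"rsuid":   {clr: attrNosuid},
	"rnodev":  {set: attrNodev},
	"rdev":    {clr: attrNodev},
	"rnoexec": {set: attrNoexec},
	"rexec":   {clr: attrNoexec},
}

// parseRecAttrs folds the options into set/clear masks; a later option
// overrides an earlier, opposite one.
func parseRecAttrs(opts []string) (set, clr uint64, err error) {
	for _, o := range opts {
		c, ok := recAttrs[o]
		if !ok {
			return 0, 0, fmt.Errorf("unknown recursive mount option %q", o)
		}
		set = (set | c.set) &^ c.clr
		clr = (clr | c.clr) &^ c.set
	}
	return set, clr, nil
}

func main() {
	set, clr, err := parseRecAttrs([]string{"rro", "rnosuid", "rdev"})
	fmt.Printf("set=%#x clr=%#x err=%v\n", set, clr, err)
}
```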

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-12-07 17:39:57 +09:00
Aleksa Sarai cdce249635 merge branch 'pr-3057'
Fraser Tweedale (1):
  chown cgroup to process uid in container namespace

LGTMs: kolyshkin cyphar
Closes #3057
2021-12-07 17:06:19 +11:00
Akihiro Suda 520702dac5 Add runc features command
Fix issue 3274

See `types/features/features.go`.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-11-30 16:40:39 +09:00
Fraser Tweedale 35d20c4e0b chown cgroup to process uid in container namespace
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  https://github.com/opencontainers/runtime-spec/pull/1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <ftweedal@redhat.com>
2021-11-30 08:52:59 +10:00
Mengjiao Liu a9bb11ec3c Fix the conversion of sysctl variable dots and slashes
Signed-off-by: Mengjiao Liu <mengjiao.liu@daocloud.io>
2021-11-04 11:45:15 +08:00
Mengjiao Liu 0f933d54fe Rename package validate_test to package validate
Signed-off-by: Mengjiao Liu <mengjiao.liu@daocloud.io>
2021-11-04 11:45:15 +08:00
Kir Kolyshkin 972aea3af0 libct/configs/validate: allow / in sysctl names
Runtime spec says:

> sysctl (object, OPTIONAL) allows kernel parameters to be modified at
> runtime for the container. For more information, see the sysctl(8)
> man page.

and sysctl(8) says:

> variable
>    The name of a key to read from. An example is
>    kernel.ostype. The '/' separator is also accepted in place of a '.'.

Apparently, the runc config validator does not support sysctls with / as a
separator. Fortunately this is a one-line fix.

Add some more test data where / is used as a separator.
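
A one-line normalisation along these lines might look like the sketch
below. normalizeSysctl is a hypothetical name, and this simple form assumes
that no path component (such as an interface name) itself contains a dot.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeSysctl converts the '/' separator form of a sysctl name into
// the '.' form, so both spellings validate identically.
func normalizeSysctl(name string) string {
	return strings.ReplaceAll(name, "/", ".")
}

func main() {
	fmt.Println(normalizeSysctl("net/ipv4/ip_forward"))
	fmt.Println(normalizeSysctl("kernel.ostype"))
}
```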

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-10-29 09:45:55 -07:00
Akihiro Suda 4d17654479 Merge pull request #2576 from kinvolk/alban/userns-2484-take2
Open bind mount sources from the host userns
2021-10-28 14:50:33 +09:00
Kir Kolyshkin 5516294172 Remove io/ioutil use
See https://golang.org/doc/go1.16#ioutil

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-10-14 13:46:02 -07:00
Alban Crequy 9c444070ec Open bind mount sources from the host userns
The source of the bind mount might not be accessible in a different user
namespace because a component of the source path might not be traversable
by the users and groups mapped inside the user namespace. This caused
errors such as the following:

  # time="2020-06-22T13:48:26Z" level=error msg="container_linux.go:367:
  starting container process caused: process_linux.go:459:
  container init caused: rootfs_linux.go:58:
  mounting \"/tmp/busyboxtest/source-inaccessible/dir\"
  to rootfs at \"/tmp/inaccessible\" caused:
  stat /tmp/busyboxtest/source-inaccessible/dir: permission denied"

To solve this problem, this patch performs the following:

1. in nsexec.c, it opens the source path in the host userns (so we have
   the right permissions to open it) but in the container mntns (so the
   kernel's cross-mntns mount check lets us mount it later:
   https://github.com/torvalds/linux/blob/v5.8/fs/namespace.c#L2312).

2. in nsexec.c, it passes the file descriptors of the source to the
   child process with SCM_RIGHTS.

3. in runc init (in Go), it finishes the mounts while inside the
   userns even without access to some components of the source
   paths.

Passing the fds with SCM_RIGHTS is necessary because once the child
process is in the container mntns, it is already in the container userns
so it cannot temporarily join the host mntns.

This patch uses the existing mechanism with _LIBCONTAINER_* environment
variables to pass the file descriptors from runc to runc init.

This patch uses the existing mechanism with the Netlink-style bootstrap
to pass information about the list of source mounts to nsexec.c.

Rootless containers don't use this bind mount sources fdpassing
mechanism because we can't setns() to the target mntns in a rootless
container (we don't have the privileges when we are in the host userns).

This patch takes care of using O_CLOEXEC on mount fds, and closes them
early.

Fixes: #2484.

Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
2021-10-12 15:13:45 +02:00
Kir Kolyshkin 097c6d7425 libct/cg: simplify getting cgroup manager
1. Make Rootless and Systemd flags part of config.Cgroups.

2. Make all cgroup managers (not just fs2) return an error (so they can do
   more initialization -- added by the following commits).

3. Replace complicated cgroup manager instantiation in factory_linux
   by a single (and simple) libcontainer/cgroups/manager.New() function.

4. getUnifiedPath is simplified to check that only a single path is
   supplied (rather than checking that other paths, if supplied,
   are the same).

[v2: can't -> cannot]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-09-23 09:11:44 -07:00