Commit Graph

82 Commits

Author SHA1 Message Date
Aleksa Sarai 87b0804345 libct: remove deprecated HooksList.RunHooks
This was deprecated in commit e6a4870e4ac40 ("libct: better errors for
hooks"), and users have had ample time to migrate to Hooks.Run since.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2026-03-04 23:01:30 +11:00
Antti Kervinen eda7bdf80c Add memory policy support
Implement support for Linux memory policy in OCI spec PR:
https://github.com/opencontainers/runtime-spec/pull/1282

Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>
2025-10-07 15:06:37 +03:00
Kir Kolyshkin 89e59902c4 Modernize code for Go 1.24
Brought to you by

	modernize -fix -test ./...

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-08-27 19:11:02 -07:00
Kir Kolyshkin 71bd84f32e Merge pull request #4784 from kolyshkin/cgr-fup
cgroups separation followup
2025-06-19 10:32:33 -07:00
Kir Kolyshkin d22a42113d libct/configs: stop using deprecated id
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-06-18 18:14:54 -07:00
Kir Kolyshkin b25bcaa8b3 libct/configs: fix/improve deprecation notices
The per-file deprecation in cgroup_deprecated.go is not working,
let's replace it.

Link to Hooks.Run in Hook.Run deprecation notice.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-06-18 18:14:46 -07:00
Kir Kolyshkin a10d338eb2 libct/configs: add package docstring
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-06-18 18:10:51 -07:00
Antonio Ojea 8d180e9658 Add support for Linux Network Devices
Implement support for passing Linux Network Devices to the container
network namespace.

The network device is passed during the creation of the container,
before the process is started.

It implements the logic defined in the OCI runtime specification.

Signed-off-by: Antonio Ojea <aojea@google.com>
2025-06-18 15:52:30 +01:00
Kir Kolyshkin 7fdec327a0 Use any instead of interface{}
The keyword is available since Go 1.18 (see
https://pkg.go.dev/builtin#any).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-31 17:15:06 -07:00
Kir Kolyshkin 3a33b6a3df Make state.json 25% smaller
This makes the state.json file 1303 bytes or almost 25% smaller (when
using the default spec, YMMV) by omitting default values.

Before: 5496 bytes
After: 4193 bytes

(With cgroups#9 applied, the new size is 3424, which is almost 40%
savings, compared to the original).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-19 15:51:52 -07:00
Kir Kolyshkin 10ca66bff5 runc exec: implement CPU affinity
As per
- https://github.com/opencontainers/runtime-spec/pull/1253
- https://github.com/opencontainers/runtime-spec/pull/1261

CPU affinity can be set in two ways:
1. When creating/starting a container, in config.json's
   Process.ExecCPUAffinity, which is when applied to all execs.
2. When running an exec, in process.json's CPUAffinity, which
   applied to a given exec and overrides the value from (1).

Add some basic tests.

Note that older kernels (RHEL8, Ubuntu 20.04) change CPU affinity of a
process to that of a container's cgroup, as soon as it is moved to that
cgroup, while newer kernels (Ubuntu 24.04, Fedora 41) don't do that.

Because of the above,
 - it's impossible to really test initial CPU affinity without adding
   debug logging to libcontainer/nsenter;
 - for older kernels, there can be a brief moment when exec's affinity
   is different than either initial or final affinity being set;
 - exec's final CPU affinity, if not specified, can be different
   depending on the kernel, therefore we don't test it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-02 19:17:41 -08:00
Kir Kolyshkin a75076b4a4 Switch to opencontainers/cgroups
This removes libcontainer/cgroups packages and starts
using those from github.com/opencontainers/cgroups repo.

Mostly generated by:

  git rm -f libcontainer/cgroups

  find . -type f -name "*.go" -exec sed -i \
    's|github.com/opencontainers/runc/libcontainer/cgroups|github.com/opencontainers/cgroups|g' \
    {} +

  go get github.com/opencontainers/cgroups@v0.0.1
  make vendor
  gofumpt -w .

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-28 15:20:33 -08:00
Kir Kolyshkin 6c9ddcc648 libct: switch from libct/devices to libct/cgroups/devices/config
Use the old package name as an alias to minimize the patch.

No functional change; this just eliminates a bunch of deprecation
warnings.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-01-31 16:51:09 -08:00
Kir Kolyshkin 6171da6005 libct/configs: add HookList.SetDefaultEnv
1. Make CommandHook.Command a pointer, which reduces the amount of data
   being copied when using hooks, and allows to modify command hooks.

2. Add SetDefaultEnv, which is to be used by the next commit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-01-09 18:22:53 +08:00
lfbzhm d48d9cfefc Merge pull request #4459 from kolyshkin/prio-nits
Fixups to scheduler/priority settings
2024-12-25 23:41:27 +08:00
Kir Kolyshkin 7334ee01e6 libct/configs: rm IOPrioClassMapping
This is an internal implementation detail and should not be either
public or visible.

Amend setIOPriority to do own class conversion.

Fixes: bfbd0305 ("Add I/O priority")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-22 18:17:44 -08:00
Kir Kolyshkin 8afeb58398 libct: add/use configs.HasHook
This allows to omit a call to c.currentOCIState (which can be somewhat
costly when there are many annotations) when the hooks of a given kind
won't be run.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-22 17:47:09 -08:00
utam0k bfbd0305ba Add I/O priority
Signed-off-by: utam0k <k0ma@utam0k.jp>
2024-03-30 22:31:54 +09:00
Aleksa Sarai 482e56379a configs: make id mappings int64 to better handle 32-bit
Using ints for all of our mapping structures means that a 32-bit binary
errors out when trying to parse /proc/self/*id_map:

  failed to cache mappings for userns: failed to parse uid_map of userns /proc/1/ns/user:
  parsing id map failed: invalid format in line "         0          0 4294967295": integer overflow on token 4294967295

This issue was unearthed by commit 1912d5988b ("*: actually support
joining a userns with a new container") but the underlying issue has
been present since the docker/libcontainer days.

In theory, switching to uint32 (to match the spec) instead of int64
would also work, but keeping everything signed seems much less
error-prone. It's also important to note that a mapping might be too
large for an int on 32-bit, so we detect this during the mapping.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 12:14:32 +11:00
Zheao.Li 98511bb40e linux: Support setting execution domain via linux personality
carry #3126

Co-authored-by: Aditya R <arajan@redhat.com>
Signed-off-by: Zheao.Li <me@manjusaka.me>
2023-10-27 19:33:37 +08:00
Aleksa Sarai 7c71a22705 rootfs: remove --no-mount-fallback and finally fix MS_REMOUNT
The original reasoning for this option was to avoid having mount options
be overwritten by runc. However, adding command-line arguments has
historically been a bad idea because it forces strict-runc-compatible
OCI runtimes to copy out-of-spec features directly from runc and these
flags are usually quite difficult to enable by users when using runc
through several layers of engines and orchestrators.

A far more preferable solution is to have a heuristic which detects
whether copying the original mount's mount options would override an
explicit mount option specified by the user. In this case, we should
return an error. You only end up in this path in the userns case, if you
have a bind-mount source with locked flags.

During the course of writing this patch, I discovered that several
aspects of our handling of flags for bind-mounts left much to be
desired. We have completely botched the handling of explicitly cleared
flags since commit 97f5ee4e6a ("Only remount if requested flags differ
from current"), with our behaviour only becoming increasingly more weird
with 50105de1d8 ("Fix failure with rw bind mount of a ro fuse") and
da780e4d27 ("Fix bind mounts of filesystems with certain options
set"). In short, we would only clear flags explicitly request by the
user purely by chance, in ways that it really should've been reported to
us by now. The most egregious is that mounts explicitly marked "rw" were
actually mounted "ro" if the bind-mount source was "ro" and no other
special flags were included. In addition, our handling of atime was
completely broken -- mostly due to how subtle the semantics of atime are
on Linux.

Unfortunately, while the runtime-spec requires us to implement
mount(8)'s behaviour, several aspects of the util-linux mount(8)'s
behaviour are broken and thus copying them makes little sense. Since the
runtime-spec behaviour for this case (should mount options for a "bind"
mount use the "mount --bind -o ..." or "mount --bind -o remount,..."
semantics? Is the fallback code we have for userns actually
spec-compliant?) and the mount(8) behaviour (see [1]) are not
well-defined, this commit simply fixes the most obvious aspects of the
behaviour that are broken while keeping the current spirit of the
implementation.

NOTE: The handling of atime in the base case is left for a future PR to
deal with. This means that the atime of the source mount will be
silently left alone unless the fallback path needs to be taken, and any
flags not explicitly set will be cleared in the base case. Whether we
should always be operating as "mount --bind -o remount,..." (where we
default to the original mount source flags) is a topic for a separate PR
and (probably) associated runtime-spec PR.

So, to resolve this:

* We store which flags were explicitly requested to be cleared by the
  user, so that we can detect whether the userns fallback path would end
  up setting a flag the user explicitly wished to clear. If so, we
  return an error because we couldn't fulfil the configuration settings.

* Revert 97f5ee4e6a ("Only remount if requested flags differ from
  current"), as missing flags do not mean we can skip MS_REMOUNT (in
  fact, missing flags are how you indicate a flag needs to be cleared
  with mount(2)). The original purpose of the patch was to fix the
  userns issue, but as mentioned above the correct mechanism is to do a
  fallback mount that copies the lockable flags from statfs(2).

* Improve handling of atime in the fallback case by:
    - Correctly handling the returned flags in statfs(2).
    - Implement the MNT_LOCK_ATIME checks in our code to ensure we
      produce errors rather than silently producing incorrect atime
      mounts.

* Improve the tests so we correctly detect all of these contingencies,
  including a general "bind-mount atime handling" test to ensure that
  the behaviour described here is accurate.

This change also inlines the remount() function -- it was only ever used
for the bind-mount remount case, and its behaviour is very bind-mount
specific.

[1]: https://github.com/util-linux/util-linux/issues/2433

Reverts: 97f5ee4e6a ("Only remount if requested flags differ from current")
Fixes: 50105de1d8 ("Fix failure with rw bind mount of a ro fuse")
Fixes: da780e4d27 ("Fix bind mounts of filesystems with certain options set")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-10-24 17:28:25 +11:00
utam0k 770728e16e Support process.scheduler
Spec: https://github.com/opencontainers/runtime-spec/pull/1188
Fix: https://github.com/opencontainers/runc/issues/3895

Co-authored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: utam0k <k0ma@utam0k.jp>
Signed-off-by: lifubang <lifubang@acmcoder.com>
2023-10-04 15:53:18 +08:00
Kir Kolyshkin 6a4870e4ac libct: better errors for hooks
When a hook has failed, the error message looks like this:

> error running hook: error running hook #1: exit status 1, stdout: ...

The two problems here are:
1. it is impossible to know what kind of hook it was;
2. "error running hook" stuttering;

Change that to

> error running createContainer hook #1: exit status 1, stdout: ...

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-24 19:44:05 -07:00
Chethan Suresh ebc2e7c435 Support time namespace
"time" namespace was introduced in Linux v5.6
support new time namespace to set boottime and monotonic time offset

Example runtime spec

"timeOffsets": {
    "monotonic": {
        "secs": 172800,
        "nanosecs": 0
    },
    "boottime": {
        "secs": 604800,
        "nanosecs": 0
    }
}

Signed-off-by: Chethan Suresh <chethan.suresh@sony.com>
2023-08-03 10:12:01 +05:30
Ruediger Pluem da780e4d27 Fix bind mounts of filesystems with certain options set
Currently bind mounts of filesystems with nodev, nosuid, noexec,
noatime, relatime, strictatime, nodiratime options set fail in rootless
mode if the same options are not set for the bind mount.
For ro filesystems this was resolved by #2570 by remounting again
with ro set.

Follow the same approach for nodev, nosuid, noexec, noatime, relatime,
strictatime, nodiratime but allow to revert back to the old behaviour
via the new `--no-mount-fallback` command line option.

Add a testcase to verify that bind mounts of filesystems with nodev,
nosuid, noexec, noatime options set work in rootless mode.
Add a testcase that mounts a nodev, nosuid, noexec, noatime filesystem
with a ro flag.
Add two further testcases that ensure that the above testcases would
fail if the `--no-mount-fallback` command line option is set.

* contrib/completions/bash/runc:
      Add `--no-mount-fallback` command line option for bash completion.

* create.go:
      Add `--no-mount-fallback` command line option.

* restore.go:
      Add `--no-mount-fallback` command line option.

* run.go:
      Add `--no-mount-fallback` command line option.

* libcontainer/configs/config.go:
      Add `NoMountFallback` field to the `Config` struct to store
      the command line option value.

* libcontainer/specconv/spec_linux.go:
      Add `NoMountFallback` field to the `CreateOpts` struct to store
      the command line option value and store it in the libcontainer
      config.

* utils_linux.go:
      Store the command line option value in the `CreateOpts` struct.

* libcontainer/rootfs_linux.go:
      In case that `--no-mount-fallback` is not set try to remount the
      bind filesystem again with the options nodev, nosuid, noexec,
      noatime, relatime, strictatime or nodiratime if they are set on
      the source filesystem.

* tests/integration/mounts_sshfs.bats:
      Add testcases and rework sshfs setup to allow specifying
      different mount options depending on the test case.

Signed-off-by: Ruediger Pluem <ruediger.pluem@vodafone.com>
2023-07-28 16:32:02 -07:00
Francis Laniel c47f58c4e9 Capitalize [UG]idMappings as [UG]IDMappings
Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
2023-07-21 13:55:34 +02:00
utam0k d9230602e9 Implement to set a domainname
opencontainers/runtime-spec#1156

Signed-off-by: utam0k <k0ma@utam0k.jp>
2023-04-12 13:31:20 +00:00
Alban Crequy 58ea21daef seccomp: add support for flags
List of seccomp flags defined in runtime-spec:
* SECCOMP_FILTER_FLAG_TSYNC
* SECCOMP_FILTER_FLAG_LOG
* SECCOMP_FILTER_FLAG_SPEC_ALLOW

Note that runc does not apply SECCOMP_FILTER_FLAG_TSYNC. It does not
make sense to apply the seccomp filter on only one thread; other threads
will be terminated after exec anyway.

See similar commit in crun:
https://github.com/containers/crun/commit/fefabffa2816ea343068ed036a86944393db189a

Note that SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (introduced by
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?id=c2aa2dfef243
in Linux 5.19-rc1) is not added yet because Linux 5.19 is not released
yet.

Signed-off-by: Alban Crequy <albancrequy@microsoft.com>
2022-07-28 16:25:26 +02:00
Kir Kolyshkin 1b14d97484 libct/configs: rm Windows TODO
It's clear at this point that runc won't support Windows.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-01-26 18:59:47 -08:00
Akihiro Suda 520702dac5 Add runc features command
Fix issue 3274

See `types/features/features.go`.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-11-30 16:40:39 +09:00
Sascha Grunert 4a4d4f109b Add support for seccomp actions ActKillThread and ActKillProcess
Two new seccomp actions have been added to the libseccomp-golang
dependency, which can be now supported by runc, too.

ActKillThread kills the thread that violated the rule. It is the same as
ActKill. All other threads from the same thread group will continue to
execute.

ActKillProcess kills the process that violated the rule. All threads in
the thread group are also terminated. This action is only usable when
libseccomp API level 3 or higher is supported.

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2021-09-09 17:47:00 +10:00
Alban Crequy 2b025c0173 Implement Seccomp Notify
This commit implements support for the SCMP_ACT_NOTIFY action. It
requires libseccomp-2.5.0 to work but runc still works with older
libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY
action.

A new synchronization step between runc[INIT] and runc run is introduced
to pass the seccomp fd. runc run fetches the seccomp fd with pidfd_get
from the runc[INIT] process and sends it to the seccomp agent using
SCM_RIGHTS.

As suggested by @kolyshkin, we also make writeSync() a wrapper of
writeSyncWithFd() and wrap the error there. To avoid pointless errors,
we made some existing code paths just return the error instead of
re-wrapping it. If we don't do it, error will look like:

	writing syncT <act>: writing syncT: <err>

By adjusting the code path, now they just look like this
	writing syncT <act>: <err>

Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
2021-09-07 13:04:24 +02:00
Kir Kolyshkin a7cfb23b88 *: stop using pkg/errors
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 16:09:47 -07:00
Kir Kolyshkin 7be93a66b9 *: fmt.Errorf: use %w when appropriate
This should result in no change when the error is printed, but make the
errors returned unwrappable, meaning errors.As and errors.Is will work.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 16:09:47 -07:00
Kir Kolyshkin e6048715e4 Use gofumpt to format code
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.

Brought to you by

	git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w

Looking at the diff, all these changes make sense.

Also, replace gofmt with gofumpt in golangci.yml.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-01 12:17:27 -07:00
Sebastiaan van Stijn 7e7ff8722a errcheck: libcontainer/configs
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-05-20 14:17:45 +02:00
Giuseppe Scrivano c61f606254 libcontainer: honor seccomp defaultErrnoRet
https://github.com/opencontainers/runtime-spec/pull/1087 added support
for defaultErrnoRet to the OCI runtime specs.

If a defaultErrnoRet is specified, disable patching the generated
libseccomp cBPF.

Closes: https://github.com/opencontainers/runc/issues/2943

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2021-05-17 09:23:32 +02:00
Sebastiaan van Stijn 4126b807cc libcontainer/configs: add missing type for hooknames
Commit ccdd75760c introduced the HookName type
for hooks, but only set this type on the Prestart const, but not for the
other hooks.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-04-02 11:08:24 +02:00
Kir Kolyshkin c342872276 libct/config: fix a data race
As reported by go test -race ./libcontainer/configs:

=== RUN   TestCommandHookRunTimeout
==================
WARNING: DATA RACE
Read at 0x00c000202230 by goroutine 23:
  os/exec.(*Cmd).Wait()
      /usr/lib/golang/src/os/exec/exec.go:502 +0x91
  github.com/opencontainers/runc/libcontainer/configs.Command.Run()
      /home/kir/go/src/github.com/opencontainers/runc/libcontainer/configs/config.go:390 +0x58c
  github.com/opencontainers/runc/libcontainer/configs_test.TestCommandHookRunTimeout()
      /home/kir/go/src/github.com/opencontainers/runc/libcontainer/configs/config_test.go:223 +0x3ed
  testing.tRunner()
      /usr/lib/golang/src/testing/testing.go:1123 +0x202

Previous write at 0x00c000202230 by goroutine 27:
  os/exec.(*Cmd).Wait()
      /usr/lib/golang/src/os/exec/exec.go:505 +0xb4
  github.com/opencontainers/runc/libcontainer/configs.Command.Run.func1()
      /home/kir/go/src/github.com/opencontainers/runc/libcontainer/configs/config.go:373 +0x55

Goroutine 23 (running) created at:
  testing.(*T).Run()
      /usr/lib/golang/src/testing/testing.go:1168 +0x5bb
  testing.runTests.func1()
      /usr/lib/golang/src/testing/testing.go:1439 +0xa6
  testing.tRunner()
      /usr/lib/golang/src/testing/testing.go:1123 +0x202
  testing.runTests()
      /usr/lib/golang/src/testing/testing.go:1437 +0x612
  testing.(*M).Run()
      /usr/lib/golang/src/testing/testing.go:1345 +0x3b3
  main.main()
      _testmain.go:69 +0x236

Goroutine 27 (running) created at:
  github.com/opencontainers/runc/libcontainer/configs.Command.Run()
      /home/kir/go/src/github.com/opencontainers/runc/libcontainer/configs/config.go:372 +0x415
  github.com/opencontainers/runc/libcontainer/configs_test.TestCommandHookRunTimeout()
      /home/kir/go/src/github.com/opencontainers/runc/libcontainer/configs/config_test.go:223 +0x3ed
  testing.tRunner()
      /usr/lib/golang/src/testing/testing.go:1123 +0x202
==================
    testing.go:1038: race detected during execution of test
--- FAIL: TestCommandHookRunTimeout (0.10s)

Apparently, the issue is we call two Wait()s for the same command
which can race internally.

Fix is easy -- since we already have a waiting goroutine,
wait for it to return instead of calling a second Wait().

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-02-05 13:02:47 -08:00
Sebastiaan van Stijn 677baf22d2 libcontainer: isolate libcontainer/devices
Move the Device-related types to libcontainer/devices, so that
the package can be used in isolation. Aliases have been created
in libcontainer/configs for backward compatibility.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-12-01 11:11:21 +01:00
Giuseppe Scrivano a63f99fcc5 Add support for umask
Signed-off-by: Ashley Cui <acui@redhat.com>
2020-08-20 11:39:43 -04:00
Sebastiaan van Stijn 901dccf05d vendor: update runtime-spec v1.0.3-0.20200728170252-4d89ac9fbff6
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-07-30 22:08:54 +02:00
Renaud Gaubert 2f7bdf9d3b Tests the new Hook
Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
2020-06-19 02:39:20 +00:00
Renaud Gaubert ccdd75760c Add the CreateRuntime, CreateContainer and StartContainer Hooks
Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
2020-06-17 02:10:00 +00:00
Giuseppe Scrivano 41aa19662b libcontainer: honor seccomp errnoRet
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2020-05-20 09:11:55 +02:00
blacktop 84373aaa56 Add SCMP_ACT_LOG as a valid Seccomp action (#1951)
Signed-off-by: blacktop <blacktop@users.noreply.github.com>
2019-09-26 11:03:03 -04:00
W. Trevor King e23868603a libcontainer: Set 'status' in hook stdin
Finish off the work started in a344b2d6 (sync up `HookState` with OCI
spec `State`, 2016-12-19, #1201).

And drop HookState, since there's no need for a local alias for
specs.State.

Also set c.initProcess in newInitProcess to support OCIState calls
from within initProcess.start().  I think the cyclic references
between linuxContainer and initProcess are unfortunate, but didn't
want to address that here.

I've also left the timing of the Prestart hooks alone, although the
spec calls for them to happen before start (not as part of creation)
[1,2].  Once the timing gets fixed we can drop the
initProcessStartTime hacks which initProcess.start currently needs.

I'm not sure why we trigger the prestart hooks in response to both
procReady and procHooks.  But we've had two prestart rounds in
initProcess.start since 2f276498 (Move pre-start hooks after container
mounts, 2016-02-17, #568).  I've left that alone too.

I really think we should have len() guards to avoid computing the
state when .Hooks is non-nil but the particular phase we're looking at
is empty.  Aleksa, however, is adamantly against them [3] citing a
risk of sloppy copy/pastes causing the hook slice being len-guarded to
diverge from the hook slice being iterated over within the guard.  I
think that ort of thing is very lo-risk, because:

* We shouldn't be copy/pasting this, right?  DRY for the win :).
* There's only ever a few lines between the guard and the guarded
  loop.  That makes broken copy/pastes easy to catch in review.
* We should have test coverage for these.  Guarding with the wrong
  slice is certainly not the only thing you can break with a sloppy
  copy/paste.

But I'm not a maintainer ;).

[1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.0/config.md#prestart
[2]: https://github.com/opencontainers/runc/issues/1710
[3]: https://github.com/opencontainers/runc/pull/1741#discussion_r233331570

Signed-off-by: W. Trevor King <wking@tremily.us>
2018-11-14 06:49:49 -08:00
Xiaochen Shen 27560ace2f libcontainer: intelrdt: add support for Intel RDT/MBA in runc
Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature
of Intel Resource Director Technology (RDT) which is supported on some
Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate
throttle over memory bandwidth for the software. A user controls the
resource by indicating the percentage of maximum memory bandwidth.

Hardware details of Intel RDT/MBA can be found in section 17.18 of
Intel Software Developer Manual:
https://software.intel.com/en-us/articles/intel-sdm

In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel
config CONFIG_INTEL_RDT. If hardware support, CPU flags `rdt_a` and
`mba` will be set in /proc/cpuinfo.

Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|   |   |-- cbm_mask
|   |   |-- min_cbm_bits
|   |   |-- num_closids
|   |-- MB
|       |-- bandwidth_gran
|       |-- delay_linear
|       |-- min_bandwidth
|       |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
    |-- ...
    |-- schemata
    |-- tasks

For MBA support for `runc`, we will reuse the infrastructure and code
base of Intel RDT/CAT which implemented in #1279. We could also make
use of `tasks` and `schemata` configuration for memory bandwidth
resource constraints.

The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the
task ID to the "tasks" file (which will automatically remove them from
the previous group to which they belonged). New tasks created by
fork(2) and clone(2) are added to the same group as their parent.

The file `schemata` has a list of all the resources available to this
group. Each resource (L3 cache, memory bandwidth) has its own line and
format.

Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket, which
contains L3 cache id and memory bandwidth percentage.
    Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."

The minimum bandwidth percentage value for each CPU model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the CPU model and
can be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are
rounded to the next control step available on the hardware.

For more information about Intel RDT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

An example for runc:
Consider a two-socket machine with two L3 caches where the minimum
memory bandwidth of 10% with a memory bandwidth granularity of 10%.
Tasks inside the container may use a maximum memory bandwidth of 20%
on socket 0 and 70% on socket 1.

"linux": {
    "intelRdt": {
        "memBwSchema": "MB:0=20;1=70"
    }
}

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2018-10-16 14:29:29 +08:00
Akihiro Suda 06f789cf26 Disable rootless mode except RootlessCgMgr when executed as the root in userns
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.

`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.

`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)

When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.

This PR does not have any impact on CLI flags and `state.json`.

Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
  For runc-in-userns, `runc spec`  without `--rootless` should work, when sufficient numbers of
  UID/GID are mapped.

Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
  (`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
  This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.

Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-09-07 15:05:03 +09:00
Aleksa Sarai fd3a6e6c83 libcontainer: handle unset oomScoreAdj corectly
Previously if oomScoreAdj was not set in config.json we would implicitly
set oom_score_adj to 0. This is not allowed according to the spec:

> If oomScoreAdj is not set, the runtime MUST NOT change the value of
> oom_score_adj.

Change this so that we do not modify oom_score_adj if oomScoreAdj is not
present in the configuration. While this modifies our internal
configuration types, the on-disk format is still compatible.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-03-17 13:53:42 +11:00