Commit Graph

101 Commits

Author SHA1 Message Date
Kir Kolyshkin e6048715e4 Use gofumpt to format code
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.

Brought to you by

	git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w

Looking at the diff, all these changes make sense.

Also, replace gofmt with gofumpt in golangci.yml.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-01 12:17:27 -07:00
Giuseppe Scrivano c61f606254 libcontainer: honor seccomp defaultErrnoRet
https://github.com/opencontainers/runtime-spec/pull/1087 added support
for defaultErrnoRet to the OCI runtime specs.

If a defaultErrnoRet is specified, disable patching the generated
libseccomp cBPF.

Closes: https://github.com/opencontainers/runc/issues/2943

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2021-05-17 09:23:32 +02:00
Kir Kolyshkin 1f1e91b1a0 libct/specconv: check mount destination is absolute
Per OCI runtime spec, mount destination MUST be absolute. Let's check
that and return an error if not.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-20 11:26:16 -07:00
Qiang Huang 2d38476c96 Merge pull request #2840 from kolyshkin/ignore-kmem
Ignore kernel memory settings
2021-04-13 09:44:14 +08:00
Kir Kolyshkin 52390d6804 Ignore kernel memory settings
This is somewhat radical approach to deal with kernel memory.

Per-cgroup kernel memory limiting was always problematic. A few
examples:

 - older kernels had bugs and were even oopsing sometimes (best example
   is RHEL7 kernel);
 - kernel is unable to reclaim the kernel memory so once the limit is
   hit a cgroup is toasted;
 - some kernel memory allocations don't allow failing.

In addition to that,

 - users don't have a clue about how to set kernel memory limits
   (as the concept is much more complicated than e.g. [user] memory);
 - different kernels might have different kernel memory usage,
   which is sort of unexpected;
 - cgroup v2 do not have a [dedicated] kmem limit knob, and thus
   runc silently ignores kernel memory limits for v2;
 - kernel v5.4 made cgroup v1 kmem.limit obsoleted (see
   https://github.com/torvalds/linux/commit/0158115f702b).

In view of all this, and as the runtime-spec lists memory.kernel
and memory.kernelTCP as OPTIONAL, let's ignore kernel memory
limits (for cgroup v1, same as we're already doing for v2).

This should result in less bugs and better user experience.

The only bad side effect from it might be that stat can show kernel
memory usage as 0 (since the accounting is not enabled).

[v2: add a warning in specconv that limits are ignored]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-12 12:18:11 -07:00
Kir Kolyshkin 27bb1bd5ea libct/specconv/CreateCgroupConfig: don't set c.Parent default
c.Parent is only used by systemd cgroup drivers, and both v1 and v2
drivers do have code to set the default if it is empty, so setting
it here is redundant.

In addition, in case of cgroup v2 rootless container setting it here
is harmful as the default should be user.slice not system.slice.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-01 19:50:37 -07:00
Kir Kolyshkin f0dec0b4bf libct/specconv/CreateCgroupConfig: nit
Do not call libcontainerUtils.CleanPath in case its result will
not be used.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-01 19:33:01 -07:00
Iceber Gu fa52df9493 libcontainer: fix the file mode of the device
Signed-off-by: Iceber Gu <wei.cai-nat@daocloud.io>
2021-02-17 15:08:22 +08:00
Aleksa Sarai 7a8d7162f9 seccomp: prepend -ENOSYS stub to all filters
Having -EPERM is the default was a fairly significant mistake from a
future-proofing standpoint in that it makes any new syscall return a
non-ignorable error (from glibc's point of view). We need to correct
this now because faccessat2(2) is something glibc critically needs to
have support for, but they're blocked on container runtimes because we
return -EPERM unconditionally (leading to confusion in glibc). This is
also a problem we're probably going to keep running into in the future.

Unfortunately there are several issues which stop us from having a clean
solution to this problem:

 1. libseccomp has several limitations which require us to emulate
    behaviour we want:

    a. We cannot do logic based on syscall number, meaning we cannot
       specify a "largest known syscall number";
    b. libseccomp doesn't know in which kernel version a syscall was
       added, and has no API for "minimum kernel version" so we cannot
       simply ask libseccomp to generate sane -ENOSYS rules for us.
    c. Additional seccomp rules for the same syscall are not treated as
       distinct rules -- if rules overlap, seccomp will merge them. This
       means we cannot add per-syscall -EPERM fallbacks;
    d. There is no inverse operation for SCMP_CMP_MASKED_EQ;
    e. libseccomp does not allow you to specify multiple rules for a
       single argument, making it impossible to invert OR rules for
       arguments.

 2. The runtime-spec does not have any way of specifying:

    a. The errno for the default action;
    b. The minimum kernel version or "newest syscall at time of profile
       creation"; nor
    c. Which syscalls were intentionally excluded from the allow list
       (weird syscalls that are no longer used were excluded entirely,
       but Docker et al expect those syscalls to get EPERM not ENOSYS).

 3. Certain syscalls should not return -ENOSYS (especially only for
    certain argument combinations) because this could also trigger glibc
    confusion. This means we have to return -EPERM for certain syscalls
    but not as a global default.

 4. There is not an obvious (and reasonable) upper limit to syscall
    numbers, so we cannot create a set of rules for each syscall above
    the largest syscall number in libseccomp. This means we must handle
    inverse rules as described below.

 5. Any syscall can be specified multiple times, which can make
    generation of hotfix rules much harder.

As a result, we have to work around all of these things by coming up
with a heuristic to stop the bleeding. In the future we could hopefully
improve the situation in the runtime-spec and libseccomp.

The solution applied here is to prepend a "stub" filter which returns
-ENOSYS if the requested syscall has a larger syscall number than any
syscall mentioned in the filter. The reason for this specific rule is
that syscall numbers are (roughly) allocated sequentially and thus newer
syscalls will (usually) have a larger syscall number -- thus causing our
filters to produce -ENOSYS if the filter was written before the syscall
existed.

Sadly this is not a perfect solution because syscalls can be added
out-of-order and the syscall table can contain holes for several
releases. Unfortuntely we do not have a nicer solution at the moment
because there is no library which provides information about which Linux
version a syscall was introduced in. Until that exists, this workaround
will have to be good enough.

The above behaviour only happens if the default action is a blocking
action (in other words it is not SCMP_ACT_LOG or SCMP_ACT_ALLOW). If the
default action is permissive then we don't do any patching.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2021-01-28 23:11:22 +11:00
Sebastiaan van Stijn 4fc2de77e9 libcontainer/devices: remove "Device" prefix from types
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-12-01 11:11:23 +01:00
Sebastiaan van Stijn 677baf22d2 libcontainer: isolate libcontainer/devices
Move the Device-related types to libcontainer/devices, so that
the package can be used in isolation. Aliases have been created
in libcontainer/configs for backward compatibility.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-12-01 11:11:21 +01:00
Ashok Pon Kumar fcf210d631 Fix goreport warnings of ineffassign and misspell
Signed-off-by: Ashok Pon Kumar <ashokponkumar@gmail.com>
2020-10-02 09:03:45 +05:30
Kenta Tada 3d5dec2f44 libcontainer: remove the unused variable from spec
Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2020-10-01 12:29:48 +09:00
Sebastiaan van Stijn 8bf216728c use string-concatenation instead of sprintf for simple cases
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-09-30 10:51:59 +02:00
Kir Kolyshkin b006f4a180 libct/cgroups: support Cgroups.Resources.Unified
Add support for unified resource map (as per [1]), and add some test
cases for the new functionality.

[1] https://github.com/opencontainers/runtime-spec/pull/1040

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-09-24 15:29:35 -07:00
Giuseppe Scrivano a63f99fcc5 Add support for umask
Signed-off-by: Ashley Cui <acui@redhat.com>
2020-08-20 11:39:43 -04:00
Cesar Talledo 0709202da7 Remove runc default devices that overlap with spec devices.
Runc has a set of default devices that it includes in Linux containers
(e.g., /dev/null, /dev/random, /dev/tty, etc.)

However if the container's OCI spec includes all or a subset of those same devices,
runc is currently not detecting the redundancy, causing it to create a lib
container config that has redundant device configurations.

This causes a failure in rootless mode, in particular when the /dev/tty device
has a redundant config:

container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: rootfs_linux.go:70: creating device nodes caused: open /tmp/busyboxtest/rootfs/dev/tty: no such device or address"

The reason this fails in rootless mode only is that in this case runc sets up
/dev/tty not by doing mknod (it's not allowed within a user-ns) but rather by
creating a regular file under /dev/tty and bind-mounting the host's /dev/tty to
the container's /dev/tty. When this operation is done redundantly, it fails the
second time.

This change fixes this problem by ensuring runc checks for redundant devices
between the OCI spec it receives and the default devices it configures. If
a redundant device is detected, the OCI spec takes priority.

The change adds both a unit test and an integration test to verify the
behavior. Without this fix, this new integration test fails as shown above.

Signed-off-by: Cesar Talledo <ctalledo@nestybox.com>
2020-08-07 16:46:15 -07:00
Renaud Gaubert ccdd75760c Add the CreateRuntime, CreateContainer and StartContainer Hooks
Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
2020-06-17 02:10:00 +00:00
Kir Kolyshkin 4189cb65f8 cgroups: remove cgroup.Resources.CpuMax
This (and the converting function) is only used by one of the four
cgroup drivers. The other three do some checking and conversion in
place, so let the fs2 do the same.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-09 17:15:38 -07:00
Giuseppe Scrivano 41aa19662b libcontainer: honor seccomp errnoRet
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2020-05-20 09:11:55 +02:00
Aleksa Sarai 24388be71e configs: use different types for .Devices and .Resources.Devices
Making them the same type is simply confusing, but also means that you
could accidentally use one in the wrong context. This eliminates that
problem. This also includes a whole bunch of cleanups for the types
within DeviceRule, so that they can be used more ergonomically.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Aleksa Sarai 60e21ec26e specconv: remove default /dev/console access
/dev/console is a host resouce which gives a bunch of permissions that
we really shouldn't be giving to containers, not to mention that
/dev/console in containers is actually /dev/pts/$n. Drop this since
arguably this is a fairly scary thing to allow...

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Aleksa Sarai b2bec9806f cgroup: devices: eradicate the Allow/Deny lists
These lists have been in the codebase for a very long time, and have
been unused for a large portion of that time -- specconv doesn't
generate them and the only user of these flags has been tests (which
doesn't inspire much confidence).

In addition, we had an incorrect implementation of a white-list policy.
This wasn't exploitable because all of our users explicitly specify
"deny all" as the first rule, but it was a pretty glaring issue that
came from the "feature" that users can select whether they prefer a
white- or black- list. Fix this by always writing a deny-all rule (which
is what our users were doing anyway, to work around this bug).

This is one of many changes needed to clean up the devices cgroup code.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Pradyumna Agrawal 4aa9101477 Honor spec.Process.NoNewPrivileges in specconv.CreateLibcontainerConfig
The change ensures that the passed in value of NoNewPrivileges under spec.Process
is reflected in the container config generated by specconv.CreateLibcontainerConfig

Closes #2397

Signed-off-by: Pradyumna Agrawal <pradyumnaa@vmware.com>
2020-05-11 13:38:14 -07:00
lifubang d2a9c5da37 using default allowed devices when linux resources is null
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-16 11:40:44 +08:00
Akihiro Suda cc183ca662 Merge pull request #2242 from AkihiroSuda/vendor-systemd
vendor: update go-systemd and godbus
2020-03-25 02:40:22 +09:00
Akihiro Suda 492d525e55 vendor: update go-systemd and godbus
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-03-16 13:26:03 +09:00
Akihiro Suda aa269315a4 cgroup2: add CpuMax conversion
Fix #2243

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-03-13 02:58:39 +09:00
Kir Kolyshkin 1cd71dfd71 systemd properties: support for *Sec values
Some systemd properties are documented as having "Sec" suffix
(e.g. "TimeoutStopSec") but are expected to have "USec" suffix
when passed over dbus, so let's provide appropriate conversion
to improve compatibility.

This means, one can specify TimeoutStopSec with a numeric argument,
in seconds, and it will be properly converted to TimeoutStopUsec
with the argument in microseconds. As a side bonus, even float
values are converted, so e.g. TimeoutStopSec=1.5 is possible.

This turned out a bit more tricky to implement when I was
originally expected, since there are a handful of numeric
types in dbus and each one requires explicit conversion.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-02-17 16:07:19 -08:00
Kir Kolyshkin 4c5c3fb960 Support for setting systemd properties via annotations
In case systemd is used to set cgroups for the container,
it creates a scope unit dedicated to it (usually named
`runc-$ID.scope`).

This patch adds an ability to set arbitrary systemd properties
for the systemd unit via runtime spec annotations.

Initially this was developed as an ability to specify the
`TimeoutStopUSec` property, but later generalized to work with
arbitrary ones.

Example usage: add the following to runtime spec (config.json):

```
	"annotations": {
		"org.systemd.property.TimeoutStopUSec": "uint64 123456789",
		"org.systemd.property.CollectMode":"'inactive-or-failed'"
	},
```

and start the container (e.g. `runc --systemd-cgroup run $ID`).

The above will set the following systemd parameters:
* `TimeoutStopSec` to 2 minutes and 3 seconds,
* `CollectMode` to "inactive-or-failed".

The values are in the gvariant format (see [1]). To figure out
which type systemd expects for a particular parameter, see
systemd sources.

In particular, parameters with `USec` suffix require an `uint64`
typed argument, while gvariant assumes int32 for a numeric values,
therefore the explicit type is required.

NOTE that systemd receives the time-typed parameters as *USec
but shows them (in `systemctl show`) as *Sec. For example,
the stop timeout should be set as `TimeoutStopUSec` but
is shown as `TimeoutStopSec`.

[1] https://developer.gnome.org/glib/stable/gvariant-text.html

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-02-17 16:07:19 -08:00
Boris Popovschi 7c439cc6f6 Added conversion for cpu.weight v2
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-02-12 11:32:34 +02:00
Julio Montes cd7c59d042 libcontainer: export createCgroupConfig
A `config.Cgroups` object is required to manipulate cgroups v1 and v2 using
libcontainer.
Export `createCgroupConfig` to allow API users to create `config.Cgroups`
objects using directly libcontainer API.

Signed-off-by: Julio Montes <julio.montes@intel.com>
2019-12-17 22:46:03 +00:00
Akihiro Suda faf673ee45 cgroup2: port over eBPF device controller from crun
The implementation is based on https://github.com/containers/crun/blob/0.10.2/src/libcrun/ebpf.c

Although ebpf.c is originally licensed under LGPL-3.0-or-later, the author
Giuseppe Scrivano agreed to relicense the file in Apache License 2.0:
https://github.com/opencontainers/runc/issues/2144#issuecomment-543116397

See libcontainer/cgroups/ebpf/devicefilter/devicefilter_test.go for tested configurations.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-31 14:01:46 +09:00
Aleksa Sarai 8296826da5 specconv: always set "type: bind" in case of MS_BIND
We discovered in umoci that setting a dummy type of "none" would result
in file-based bind-mounts no longer working properly, which is caused by
a restriction for when specconv will change the device type to "bind" to
work around rootfs_linux.go's ... issues.

However, bind-mounts don't have a type (and Linux will ignore any type
specifier you give it) because the type is copied from the source of the
bind-mount. So we should always overwrite it to avoid user confusion.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-04-08 15:08:08 +10:00
Filipe Brandenburger 4b2b978291 Add cgroup name to error message
More information should help troubleshoot an issue when this error occurs.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2019-03-14 10:25:00 -07:00
Mrunal Patel 4769cdf607 Merge pull request #1916 from crosbymichael/cgns
Add support for cgroup namespace
2018-11-13 12:21:38 -08:00
Ace-Tang 16d55f17a8 libcontainer: fix potential panic if spec.Process is nil
for the code logic, pointer 'spec.Process' should be judge first
to avoid panic.

Signed-off-by: Ace-Tang <aceapril@126.com>
2018-11-06 11:55:30 +08:00
Aleksa Sarai 9a3a8a5ebf libcontainer: implement CLONE_NEWCGROUP
This is a very simple implementation because it doesn't require any
configuration unlike the other namespaces, and in its current state it
only masks paths.

This feature is available in Linux 4.6+ and is enabled by default for
kernels compiled with CONFIG_CGROUP=y.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-10-23 16:23:00 -04:00
Xiaochen Shen 27560ace2f libcontainer: intelrdt: add support for Intel RDT/MBA in runc
Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature
of Intel Resource Director Technology (RDT) which is supported on some
Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate
throttle over memory bandwidth for the software. A user controls the
resource by indicating the percentage of maximum memory bandwidth.

Hardware details of Intel RDT/MBA can be found in section 17.18 of
Intel Software Developer Manual:
https://software.intel.com/en-us/articles/intel-sdm

In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel
config CONFIG_INTEL_RDT. If hardware support, CPU flags `rdt_a` and
`mba` will be set in /proc/cpuinfo.

Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|   |   |-- cbm_mask
|   |   |-- min_cbm_bits
|   |   |-- num_closids
|   |-- MB
|       |-- bandwidth_gran
|       |-- delay_linear
|       |-- min_bandwidth
|       |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
    |-- ...
    |-- schemata
    |-- tasks

For MBA support for `runc`, we will reuse the infrastructure and code
base of Intel RDT/CAT which implemented in #1279. We could also make
use of `tasks` and `schemata` configuration for memory bandwidth
resource constraints.

The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the
task ID to the "tasks" file (which will automatically remove them from
the previous group to which they belonged). New tasks created by
fork(2) and clone(2) are added to the same group as their parent.

The file `schemata` has a list of all the resources available to this
group. Each resource (L3 cache, memory bandwidth) has its own line and
format.

Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket, which
contains L3 cache id and memory bandwidth percentage.
    Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."

The minimum bandwidth percentage value for each CPU model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the CPU model and
can be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are
rounded to the next control step available on the hardware.

For more information about Intel RDT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

An example for runc:
Consider a two-socket machine with two L3 caches where the minimum
memory bandwidth of 10% with a memory bandwidth granularity of 10%.
Tasks inside the container may use a maximum memory bandwidth of 20%
on socket 0 and 70% on socket 1.

"linux": {
    "intelRdt": {
        "memBwSchema": "MB:0=20;1=70"
    }
}

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2018-10-16 14:29:29 +08:00
Mrunal Patel a00bf01908 Merge pull request #1862 from AkihiroSuda/decompose-rootless-pr
Disable rootless mode except RootlessCgMgr when executed as the root in userns (fix Docker-in-LXD regression)
2018-10-15 17:32:15 -07:00
Jonathan Marler 1499c746a1 Move spec.Linux.IntelRdt check to spec.Linux != nil block
Signed-off-by: Jonathan Marler <johnnymarler@gmail.com>
2018-10-04 21:30:55 -06:00
Akihiro Suda 06f789cf26 Disable rootless mode except RootlessCgMgr when executed as the root in userns
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.

`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.

`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)

When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.

This PR does not have any impact on CLI flags and `state.json`.

Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
  For runc-in-userns, `runc spec`  without `--rootless` should work, when sufficient numbers of
  UID/GID are mapped.

Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
  (`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
  This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.

Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-09-07 15:05:03 +09:00
Alban Crequy 3321aa1af7 Fix regression with mounts with non-absolute source path
PR #1753 introduced a test on the mount flags but the binary operator
was wrong, see https://github.com/opencontainers/runc/pull/1753#discussion_r203445652

This was noticed when investigating https://github.com/opencontainers/runtime-tools/issues/651

Symptoms: in the container, /proc/self/mountinfo displays some mounts as
follow:

296 279 0:67 / /tmp rw,nosuid - tmpfs /home/dpark/go/src/github.com/opencontainers/runc/tmpfs rw,size=65536k,mode=755

Signed-off-by: Alban Crequy <alban@kinvolk.io>
2018-07-18 18:30:49 +02:00
Qiang Huang dd67ab10d7 Merge pull request #1759 from cyphar/rootless-erofs-as-eperm
rootless: cgroup: treat EROFS as a skippable error
2018-05-25 09:24:16 +08:00
dlorenc 40680b2d37 Make the setupSeccomp function public.
This function is useful for converting from the OCI spec format to the one used by runC/libcontainer.

Signed-off-by: dlorenc <lorenc.d@gmail.com>
2018-04-17 10:47:22 -07:00
Michael Crosby d56f6cc202 Merge pull request #1753 from wking/do-not-require-bind-mount-type
libcontainer/specconv/spec_linux: Support empty 'type' for bind mounts
2018-04-16 11:01:53 -04:00
Michael Crosby 9f0eca2a94 Merge pull request #1777 from nalind/no-config-for-extant-netns
Only configure networking when creating a net ns
2018-04-12 10:55:02 -04:00
Nalin Dahyabhai 4521d4b19c Only configure networking when creating a net ns
When joining an existing namespace, don't default to configuring a
loopback interface in that namespace.

Its creator should have done that, and we don't want to fail to create
the container when we don't have sufficient privileges to configure the
network namespace.

Signed-off-by: Nalin Dahyabhai <nalin@redhat.com>
2018-04-11 13:28:19 -04:00
Aleksa Sarai fd3a6e6c83 libcontainer: handle unset oomScoreAdj corectly
Previously if oomScoreAdj was not set in config.json we would implicitly
set oom_score_adj to 0. This is not allowed according to the spec:

> If oomScoreAdj is not set, the runtime MUST NOT change the value of
> oom_score_adj.

Change this so that we do not modify oom_score_adj if oomScoreAdj is not
present in the configuration. While this modifies our internal
configuration types, the on-disk format is still compatible.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-03-17 13:53:42 +11:00
W. Trevor King 0aa6e4e5d3 libcontainer/specconv/spec_linux: Support empty 'type' for bind mounts
From the "Creating a bind mount" section of mount(2) [1]:

> If mountflags includes MS_BIND (available since Linux 2.4), then
> perform a bind mount...
>
> The filesystemtype and data arguments are ignored.

This commit adds support for configurations that leave the OPTIONAL
type [2] unset for bind mounts.  There's a related spec-example change
in flight with [3], although my personal preference would be a more
explicit spec for the whole mount structure [4].

[1]: http://man7.org/linux/man-pages/man2/mount.2.html
[2]: https://github.com/opencontainers/runtime-spec/blame/v1.0.1/config.md#L102
[3]: https://github.com/opencontainers/runtime-spec/pull/954
[4]: https://github.com/opencontainers/runtime-spec/pull/771

Signed-off-by: W. Trevor King <wking@tremily.us>
2018-03-07 10:23:42 -08:00