158 changes: 108 additions & 50 deletions README.md
@@ -6,9 +6,114 @@ on this list as being implementation requests. Some of the ideas on this list
are rather rough and unrefined. They serve as entry points for exploring the
associated problem space.

**When implementing ideas on this list or ideas inspired by this list please
point that out explicitly and clearly in the associated patches and Cc
`Christian Brauner <brauner (at) kernel (dot) org>`.**
* **When implementing ideas on this list or ideas inspired by this list
please point that out explicitly and clearly in the associated patches
and Cc `Christian Brauner <brauner (at) kernel (dot) org>`.**

* Move the item you are working on to the In-Progress section.
Please add your GitHub handle or mail address to the issue so we can
ping you.

## In-Progress

### Ability to put user xattrs on `S_IFSOCK` socket entrypoint inodes in the file system

Currently, the kernel only allows extended attributes in the
`user.*` namespace to be attached to directory and regular file
inodes. It would be tremendously useful to allow them to be
associated with socket inodes, too.

**Use-Case:** There are two syslog RFCs in use today: RFC3164 and
RFC5424. `glibc`'s `syslog()` API generates events close to the
former, but there are programs which would like to generate the
latter instead (as it supports structured logging). The two formats
are not backwards compatible: a client sending RFC5424 messages to a
server only understanding RFC3164 will cause an ugly mess. On Linux
there's only a single `/dev/log` AF_UNIX/SOCK_DGRAM socket backing
`syslog()`, which is used in a one-way, fire-and-forget style. This
means that feature negotiation is not really possible within the
protocol. Various tools bind mount the socket inode into `chroot()`
and container environments, hence it would be fantastic to associate
supported feature information directly with the inode (and thus
outside of the protocol) to make it easy for clients to determine
which features are spoken on a socket, in a way that survives bind
mounts. The implementation idea is that syslog daemons
implementing RFC5424 could simply set an xattr `user.rfc5424` to `1`
(or something like that) on the socket inode, and thereby clearly inform
clients in a natural and simple way that they'd be happy to parse
the newer format. Also see:
https://github.com/systemd/systemd/issues/19251

This idea could also be extended to other sockets and other protocols: by
setting an extended attribute on a socket inode, services could advertise
which protocols they support on it. For example, D-Bus sockets
could carry `user.dbus` set to `1`, Varlink sockets
`user.varlink` set to `1`, and so on.

### Support detached mounts with `pivot_root()`

The new rootfs must currently refer to an attached mount. This restriction
seems unnecessary. We should allow the new rootfs to refer to a detached
mount.

This will allow a service- or container manager to create a new rootfs as
a detached, private mount that isn't exposed anywhere in the filesystem and
then `pivot_root()` into it.

Since `pivot_root()` only takes path arguments the new rootfs would need to
be passed via `/proc/<pid>/fd/<nr>`. In the long run we should add a new
`pivot_root()` syscall operating on file descriptors instead of paths.
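
A rough sketch of how a caller might hand a mount file descriptor to today's path-based `pivot_root()`, assuming the restriction on detached mounts were lifted. The helper names `fd_to_proc_path()` and `pivot_root_fd()` are hypothetical, and using `/proc/self` rather than an explicit `<pid>` is an assumption made for brevity:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Format "/proc/self/fd/<nr>" for `fd` into `buf`.
 * Returns the string length on success, or -1 if `fd` is negative
 * or the buffer is too small.
 */
int fd_to_proc_path(int fd, char *buf, size_t len)
{
	int n;

	assert(buf != NULL);

	if (fd < 0)
		return -1;

	n = snprintf(buf, len, "/proc/self/fd/%d", fd);
	if (n < 0 || (size_t)n >= len)
		return -1;

	return n;
}

/*
 * Hypothetical wrapper: pivot_root() has no glibc wrapper, so this goes
 * through syscall(2). `put_old` must still satisfy the usual
 * pivot_root(2) constraints.
 */
int pivot_root_fd(int new_root_fd, const char *put_old)
{
	char path[64];

	if (fd_to_proc_path(new_root_fd, path, sizeof(path)) < 0) {
		errno = EBADF;
		return -1;
	}

	return (int)syscall(SYS_pivot_root, path, put_old);
}
```

On current kernels the call is expected to fail when the file descriptor refers to a detached mount, which is exactly the restriction this item proposes to remove.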

### Create mount namespace with custom rootfs via `open_tree()` and `fsmount()`

Add `OPEN_TREE_NAMESPACE` flag to `open_tree()` and `FSMOUNT_NAMESPACE` flag
to `fsmount()` that create a new mount namespace with the specified mount tree
as the rootfs mounted on top of a copy of the real rootfs. These return a
namespace file descriptor instead of a mount file descriptor.

This allows `OPEN_TREE_NAMESPACE` to function as a combined
`unshare(CLONE_NEWNS)` and `pivot_root()`.

When creating containers the setup usually involves using `CLONE_NEWNS` via
`clone3()` or `unshare()`. This copies the caller's complete mount namespace.
The runtime will also assemble a new rootfs and then use `pivot_root()` to
switch the old mount tree with the new rootfs. Afterward it will recursively
unmount the old mount tree thereby getting rid of all mounts.

Copying all of these mounts only to get rid of them later is wasteful. With a
large mount table and a system where thousands of containers are spawned in
parallel this quickly becomes a bottleneck, increasing contention on the
namespace semaphore (`namespace_sem`).

**Use-Case:** Container runtimes can create an extremely minimal rootfs
directly:

```c
fd_mntns = open_tree(-EBADF, "/var/lib/containers/wootwoot", OPEN_TREE_NAMESPACE);
```

This creates a mount namespace where "wootwoot" has become the rootfs. The
caller can `setns()` into this new mount namespace and assemble additional
mounts without copying and destroying the entire parent mount table.
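
The `setns()` step could look roughly like the sketch below. It assumes that `OPEN_TREE_NAMESPACE` is exposed as a macro by `<linux/mount.h>`; where the installed headers do not define it, the code compiles to a stub returning `-ENOSYS`. The function name `enter_rootfs_namespace()` is hypothetical:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/mount.h>

/*
 * Hypothetical helper: create a mount namespace whose rootfs is `rootfs`
 * and switch the calling thread into it.
 * Returns 0 on success or a negative errno value on failure.
 */
int enter_rootfs_namespace(const char *rootfs)
{
	assert(rootfs != NULL);

#if defined(OPEN_TREE_NAMESPACE) && defined(SYS_open_tree)
	int fd_mntns, ret = 0;

	/* With the proposed flag this returns a namespace fd, not a mount fd. */
	fd_mntns = (int)syscall(SYS_open_tree, -EBADF, rootfs,
				OPEN_TREE_NAMESPACE | OPEN_TREE_CLOEXEC);
	if (fd_mntns < 0)
		return -errno;

	/* Join the new mount namespace; additional mounts can follow. */
	if (setns(fd_mntns, CLONE_NEWNS) < 0)
		ret = -errno;

	close(fd_mntns);
	return ret;
#else
	/* The installed headers do not know the proposed flag. */
	return -ENOSYS;
#endif
}
```

The raw `syscall()` is used so that the sketch does not depend on a libc wrapper for `open_tree()`.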

### Query mount information via file descriptor with `statmount()`

Extend `struct mnt_id_req` to accept a file descriptor and introduce
`STATMOUNT_BY_FD` flag. When a valid fd is provided and `STATMOUNT_BY_FD`
is set, `statmount()` returns mount info about the mount the fd is on.

This works even for "unmounted" mounts (mounts that have been unmounted using
`umount2(mnt, MNT_DETACH)`), if you have access to a file descriptor on that
mount. These unmounted mounts will have no mountpoint and no valid mount
namespace, so `STATMOUNT_MNT_POINT` and `STATMOUNT_MNT_NS_ID` are unset in
`statmount.mask` for such mounts.

**Use-Case:** Query mount information directly from a file descriptor without
needing the mount ID, which is particularly useful for detached or unmounted
mounts.

---

### TODO

### xattrs for pidfd

@@ -376,20 +481,6 @@ Namespace-able loop and block devices, usable inside user namespaces.
**Use-Case:** Allow mounting images inside nspawn containers, and using
RootImage= and friends in the systemd user manager.

### Support detached mounts with `pivot_root()`

The new rootfs must currently refer to an attached mount. This restriction
seems unnecessary. We should allow the new rootfs to refer to a detached
mount.

This will allow a service- or container manager to create a new rootfs as
a detached, private mount that isn't exposed anywhere in the filesystem and
then `pivot_root()` into it.

Since `pivot_root()` only takes path arguments the new rootfs would need to
be passed via `/proc/<pid>/fd/<nr>`. In the long run we should add a new
`pivot_root()` syscall operating on file descriptors instead of paths.

### Device cgroup guard to allow `mknod()` in non-initial userns

If a container manager restricts its unprivileged (user namespaced)
@@ -532,39 +623,6 @@ in case the process dies and its PID is quickly recycled. (This
assumes systemd can acquire a pidfd of the foreign process without
races, for example via `SCM_PIDFD` and `SO_PEERPIDFD` or similar.)

### Ability to put user xattrs on `S_IFSOCK` socket entrypoint inodes in the file system

Currently, the kernel only allows extended attributes in the
`user.*` namespace to be attached to directory and regular file
inodes. It would be tremendously useful to allow them to be
associated with socket inodes, too.

**Use-Case:** There are two syslog RFCs in use today: RFC3164 and
RFC5424. `glibc`'s `syslog()` API generates events close to the
former, but there are programs which would like to generate the
latter instead (as it supports structured logging). The two formats
are not backwards compatible: a client sending RFC5424 messages to a
server only understanding RFC3164 will cause an ugly mess. On Linux
there's only a single `/dev/log` AF_UNIX/SOCK_DGRAM socket backing
`syslog()`, which is used in a one-way, fire-and-forget style. This
means that feature negotiation is not really possible within the
protocol. Various tools bind mount the socket inode into `chroot()`
and container environments, hence it would be fantastic to associate
supported feature information directly with the inode (and thus
outside of the protocol) to make it easy for clients to determine
which features are spoken on a socket, in a way that survives bind
mounts. The implementation idea is that syslog daemons
implementing RFC5424 could simply set an xattr `user.rfc5424` to `1`
(or something like that) on the socket inode, and thereby clearly inform
clients in a natural and simple way that they'd be happy to parse
the newer format. Also see:
https://github.com/systemd/systemd/issues/19251

This idea could also be extended to other sockets and other protocols: by
setting an extended attribute on a socket inode, services could advertise
which protocols they support on it. For example, D-Bus sockets
could carry `user.dbus` set to `1`, Varlink sockets
`user.varlink` set to `1`, and so on.

### Open thread-group leader via `pidfd_open()`

Extend `pidfd_open()` to allow opening the thread-group leader based on the