Rootfull, rootless containers on Btrfs and ZFS

I recently tried to automate my rootless container setup with an Ansible playbook. It didn’t turn out well, mainly because I couldn’t handle every disk scenario (e.g. single disk or RAID, getting the disk’s UUID, mounting the Btrfs root volume, mount options). After a couple of days without success, I decided to write a blog post about it instead, to remind my future self of what to do, step by step.

All of my computers have either ZFS or Btrfs on root. Therefore, I’ll discuss how to make nerdctl and podman (in both rootless and rootfull mode) work on these filesystems. I don’t use Docker personally (because everyone does), so it won’t be covered here [1].

Motivation

The default overlayfs storage backend [2] of containerd (what nerdctl uses behind the scenes) and podman pretty much works out of the box. So here are the reasons why I’m trying not to use it:

  • I like using non-default settings (tinkering is fun).

  • containerd and podman have ZFS and Btrfs storage backends (mostly since they are Copy on Write filesystems) so using them makes my systems feel more cohesive.

  • A lot of people just add themselves to the docker group [3]. Meanwhile, setting up rootless containers to work nicely usually isn’t straightforward, and there are quite a few shortcomings [4].

With that out of the way, let’s dive into the setup.

Prerequisites

  • A ZFS pool or Btrfs filesystem to store container layers if either one of these storage backends is used

  • cgroup v2 enabled (either with systemd or cgroupfs)

  • /etc/subuid and /etc/subgid properly configured [5] (on Alpine Linux you’ll also need the shadow-subids package)

  • nerdctl requires at least rootlesskit and slirp4netns for rootless mode
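
The prerequisites above can be checked before going any further. What follows is only a minimal sketch in POSIX sh, not something nerdctl or podman ship: the function names are hypothetical, it assumes cgroup v2 is mounted at /sys/fs/cgroup, and it assumes /etc/subuid and /etc/subgid list users by name rather than by numeric UID.

```sh
#!/bin/sh
# Hypothetical helpers; paths are parameters so they can point at test files.

# Succeeds if a unified (v2) cgroup hierarchy is mounted at the given root.
# Only cgroup v2 exposes cgroup.controllers at the top of the hierarchy.
has_cgroup_v2() {
    [ -f "${1:-/sys/fs/cgroup}/cgroup.controllers" ]
}

# Succeeds if USER has at least one range in the given subuid/subgid file.
# Assumption: entries are keyed by user name, e.g. "alice:100000:65536".
has_subid_range() {
    [ -r "$2" ] && grep -q "^$1:" "$2"
}

# Prints the name of every listed helper binary that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage:
#   has_cgroup_v2 || echo "cgroup v2 is not mounted"
#   has_subid_range "$(id -un)" /etc/subuid || echo "no subuid range"
#   has_subid_range "$(id -un)" /etc/subgid || echo "no subgid range"
#   missing_tools rootlesskit slirp4netns
```

On a correctly prepared machine the first three checks stay silent and missing_tools prints nothing.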

Choosing storage backends

Things are straightforward in rootfull mode. You can just stick with the backend (ZFS/Btrfs) that matches your filesystem, with little to no configuration [6].

Rootless mode, on the other hand, comes with some limitations:

  • ZFS doesn’t grant unprivileged users all the capabilities needed to run containers [7], so we can’t use it (yet?).

  • For Btrfs, podman supports it well while nerdctl currently doesn’t seem to work [8].

Here’s what we’ll use for rootless mode:

         nerdctl   podman
ZFS      native    overlay [9]
Btrfs    overlay   btrfs

Notice that we’ll use the native snapshotter for nerdctl on ZFS. Why not the default overlayfs? Because ZFS doesn’t like having an overlayfs mount on top of it [10]. Other snapshotters (fuse-overlayfs, stargz) will probably also work, but they require installing the corresponding gRPC helper binaries and setting up additional services alongside containerd, so native is the easiest choice here.
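
The rootless choices from the table above can be written down as a small lookup. This is just a sketch: pick_backend and fs_type_of are hypothetical helper names, and the detection step assumes a stat that supports -f -c %T and reports "zfs" or "btrfs" (GNU coreutils does; other implementations may differ).

```sh
#!/bin/sh
# pick_backend TOOL FSTYPE
# Prints the rootless backend chosen in the table above.
# Hypothetical helper, not part of nerdctl or podman.
pick_backend() {
    case "$1:$2" in
        nerdctl:zfs)   echo native ;;
        nerdctl:btrfs) echo overlayfs ;;  # containerd's name for its overlay snapshotter
        podman:zfs)    echo overlay ;;    # may need fuse-overlayfs as the mount program
        podman:btrfs)  echo btrfs ;;
        *) echo "no recommendation for $1 on $2" >&2; return 1 ;;
    esac
}

# Assumption: this stat understands -f -c %T (filesystem type as a name).
fs_type_of() {
    stat -f -c %T "$1"
}

# Usage:
#   pick_backend nerdctl "$(fs_type_of "$HOME")"
#   pick_backend podman  "$(fs_type_of "$HOME")"
```

Any other combination prints a message to stderr and returns non-zero instead of guessing.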

Now comes the actual setup process.

nerdctl

Rootfull mode

ZFS

We’ll start with ZFS. The README in the ZFS snapshotter repository already tells us what to do:

  1. Set up a ZFS filesystem. The ZFS filesystem name is arbitrary, but the mount point needs to be /var/lib/containerd/io.containerd.snapshotter.v1.zfs, when the containerd root is set to /var/lib/containerd/.
$ zfs create -o mountpoint=/var/lib/containerd/io.containerd.snapshotter.v1.zfs your-zpool/containerd
  2. Start containerd.

Pretty simple, right? For something new, I’ll do it the Ansible way:

- name: Create ZFS dataset for containerd
  community.general.zfs:
    name: rpool/ROOT/containerd
    extra_zfs_properties:
      devices: off
      xattr: sa
      acltype: posixacl
      canmount: on
      mountpoint: /var/lib/containerd/io.containerd.snapshotter.v1.zfs
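
Whichever way the dataset is created, it is worth confirming that it ended up where the snapshotter expects it. Here is a small sketch, assuming containerd’s root is /var/lib/containerd; dataset_mountpoint_ok is a hypothetical helper that simply compares the output of zfs get with the expected path.

```sh
#!/bin/sh
# Mount point the ZFS snapshotter expects when containerd's root is /var/lib/containerd.
ZFS_SNAPSHOTTER_DIR=/var/lib/containerd/io.containerd.snapshotter.v1.zfs

# dataset_mountpoint_ok DATASET [EXPECTED]
# Returns 0 if the dataset's mountpoint property equals EXPECTED,
# 1 if it differs, 2 if zfs itself failed (e.g. the dataset does not exist).
dataset_mountpoint_ok() {
    expected=${2:-$ZFS_SNAPSHOTTER_DIR}
    actual=$(zfs get -H -o value mountpoint "$1") || return 2
    [ "$actual" = "$expected" ]
}

# Usage:
#   dataset_mountpoint_ok rpool/ROOT/containerd || echo "unexpected mountpoint"
```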

Btrfs

It’s pretty similar with the Btrfs storage backend. Most Btrfs-on-root setups don’t mount the top-level subvolume on /, and you probably want to keep container layers when switching the subvolume mounted on /, so there are some extra steps involved:

# Mount the top-level subvolume somewhere first, assuming the filesystem is on /dev/sda1
mount -t btrfs -o rw,noatime,user_subvol_rm_allowed,subvol=/ /dev/sda1 /mnt

# Create a subvolume for containerd's storage directly under the top-level subvolume (ID 5)
btrfs subvolume create /mnt/@containerd

Then stick the newly created subvolume into /etc/fstab and mount it:

# Replace /dev/sda1 with something more proper, like UUID=...
/dev/sda1	/var/lib/containerd/io.containerd.snapshotter.v1.btrfs	btrfs	rw,noatime,nodev,compress-force=zstd,rescue=usebackuproot,ssd,space_cache=v2,commit=60,subvol=/@containerd 0 2
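
Since the fstab line should use a UUID rather than /dev/sda1, here is a minimal sketch of how such a line could be generated. fstab_entry is a hypothetical helper; the default options are only a subset of the ones used above, and the usage comment assumes blkid from util-linux.

```sh
#!/bin/sh
# fstab_entry UUID MOUNTPOINT SUBVOL [OPTIONS]
# Prints one tab-separated /etc/fstab line for a Btrfs subvolume.
# Hypothetical helper; OPTIONS defaults to a small subset of the options above.
fstab_entry() {
    printf 'UUID=%s\t%s\tbtrfs\t%s,subvol=%s\t0 2\n' \
        "$1" "$2" "${4:-rw,noatime,nodev}" "$3"
}

# Usage (as root; assumes blkid from util-linux):
#   uuid=$(blkid -s UUID -o value /dev/sda1)
#   fstab_entry "$uuid" \
#       /var/lib/containerd/io.containerd.snapshotter.v1.btrfs \
#       /@containerd >> /etc/fstab
```

Review the printed line before appending it to /etc/fstab, since a bad entry can stop the system from booting cleanly.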

Configuring nerdctl

Once the storage is set up, enable the containerd service with your system’s service manager.

You can start using nerdctl immediately: nerdctl --snapshotter zfs run --rm -it alpine:edge

To avoid specifying the snapshotter backend every time in the CLI, a configuration file may be created:

/etc/nerdctl/nerdctl.toml:

snapshotter = "zfs"

Voilà! Enjoy your new rootfull container setup!

Rootless mode

Starting containerd

In rootless mode, you need the containerd-rootless.sh script and a way to run it on user login (for convenience’s sake).

If you use systemd, great! nerdctl already provides the containerd-rootless-setuptool.sh script, which does all the work for you. Just follow the instructions!

For other service managers, if yours supports creating user services (runit and dinit do), use it. Otherwise, just start containerd-rootless.sh inside ~/.profile (or a similar file) or using your desktop environment’s autostart mechanism.

Personally, I use Alpine Linux with OpenRC, which currently doesn’t have this functionality. For demonstration, I’ll set the daemon up using superd [11]:

  • First, we obviously need to start superd on user login (the author suggests doing it with your desktop environment)
  • Now, create a service file for the containerd daemon (adapted from the example systemd unit):

~/.config/services/containerd.service:

[Unit]
Description=containerd (rootless)

[Service]
ExecStart=/full/path/to/containerd-rootless.sh
ExecReload=/bin/kill -s HUP $MAINPID
TimeoutSec=0
RestartSec=2
Restart=always
StartLimitBurst=3
StartLimitInterval=60s
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
Delegate=yes
Type=simple
KillMode=mixed
  • The final step is to enable the service and start it: superctl enable --now containerd

Configuring nerdctl

We’re done with starting the containerd daemon inside a user namespace. It’s time to configure nerdctl. If the overlay snapshotter is chosen (when Btrfs is the underlying filesystem), there is nothing else to be done. For the native snapshotter, we again create a configuration file:

~/.config/nerdctl/nerdctl.toml:

# The only important field is "snapshotter"
debug_full = false
snapshotter = "native"
insecure_registry = false

At this point, running nerdctl commands should work without issues.

buildkitd

Unlike podman, which has buildah baked into the binary, nerdctl relies on the buildkitd daemon and the buildctl command for building container images.

In rootfull mode, it’s as easy as starting the buildkitd daemon with your OS’s service manager.

In rootless mode, the buildkitd daemon should be started after containerd, in the same user namespace. Again, I’ll demonstrate the service setup using superd. Also, the containerd worker will be used here instead of the default OCI worker, as it appears to speed up loading images into containerd.

~/.config/services/buildkit.service:

[Unit]
Description=BuildKit Daemon (Rootless)
After=containerd.service

[Service]
Type=simple
# containerd uses 'default' namespace.
# For Kubernetes the namespace is 'k8s.io'.
# The default namespace of buildkitd is 'buildkit'.
ExecStart=/path/to/containerd-rootless-setuptool.sh nsenter -- /usr/bin/buildkitd --addr=unix:///run/user/<uid>/buildkit-default/buildkitd.sock --root=/home/user/.local/share/buildkit-default --containerd-worker-namespace=default --containerd-worker-snapshotter=native
ExecReload=/bin/kill -s HUP $MAINPID
RestartSec=2
Restart=on-failure
KillMode=mixed
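
The ExecStart line above contains two placeholders, <uid> and /home/user. A tiny sketch for rendering the flags with real values follows; buildkitd_flags is a hypothetical helper, and it assumes the runtime directory is /run/user/<uid>, as in the service file.

```sh
#!/bin/sh
# buildkitd_flags UID HOME [SNAPSHOTTER]
# Prints the buildkitd flags from the service file above with the
# placeholders filled in. Hypothetical helper; SNAPSHOTTER defaults to native.
buildkitd_flags() {
    printf '%s %s %s %s\n' \
        "--addr=unix:///run/user/$1/buildkit-default/buildkitd.sock" \
        "--root=$2/.local/share/buildkit-default" \
        "--containerd-worker-namespace=default" \
        "--containerd-worker-snapshotter=${3:-native}"
}

# Usage:
#   buildkitd_flags "$(id -u)" "$HOME"
```

Paste the printed flags after /usr/bin/buildkitd in the ExecStart line.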

buildkitd can be forced to use the containerd worker in its configuration file:

~/.config/buildkit/buildkitd.toml:

[worker.oci]
  enabled = false

[worker.containerd]
  enabled = true
  rootless = true

podman

Rootfull

The filesystem setup in rootfull mode for podman is roughly the same as for nerdctl. You just need to change the storage mountpoint from /var/lib/containerd/io.containerd.snapshotter.v1.{zfs,btrfs} to /var/lib/containers/storage and create a ZFS dataset / Btrfs subvolume there. Additionally, a system configuration file is required:

/etc/containers/storage.conf:

[storage]
  driver = "zfs" # or "btrfs"
  runroot = "/run/containers/storage"
  graphroot = "/var/lib/containers/storage"
  rootless_storage_path = "$HOME/.local/share/containers/storage"

[storage.options]
  pull_options = {enable_partial_images = "false", use_hard_links = "false", ostree_repos=""}

# podman uses legacy mount for ZFS
[storage.options.zfs]
  fsname = "rpool/ROOT/containers"
  mountopt = "nodev"

podman runs without a daemon, so you don’t need to do anything beyond what’s mentioned above.

Rootless

In the Btrfs case, podman will fall back to the system configuration, and we can just use it as is (no configuration required). If you intend to use podman in both rootless and rootfull modes, a good way to manage the container storage is to use separate nested subvolumes inside the same top-level one:

btrfs subvolume create /mnt/@containers
btrfs subvolume create /mnt/@containers/your_user
btrfs subvolume create /mnt/@containers/root

chown your_user:your_user /mnt/@containers/your_user

And your /etc/fstab should now look like this:

/dev/sda1	/var/lib/containers/storage	btrfs	rw,noatime,nodev,compress-force=zstd,rescue=usebackuproot,ssd,space_cache=v2,commit=60,subvol=/@containers/root 0 2
/dev/sda1	/home/your_user/.local/share/containers/storage	btrfs	rw,noatime,nodev,compress-force=zstd,rescue=usebackuproot,ssd,space_cache=v2,commit=60,subvol=/@containers/your_user 0 2

For ZFS, since we have to use fuse-overlayfs, let’s override the system settings with a user’s configuration file:

~/.config/containers/storage.conf:

[storage]
  driver = "overlay"

[storage.options.overlay]
  force_mask = "private"
  mount_program = "/usr/bin/fuse-overlayfs"
  mountopt = "nodev"
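
To confirm that podman actually picked up the intended driver, its own report can be compared with the expected name. This is a sketch: podman_driver_is is a hypothetical helper, and it assumes podman info exposes the driver as .Store.GraphDriverName in its Go-template output.

```sh
#!/bin/sh
# podman_driver_is EXPECTED
# Returns 0 if podman reports EXPECTED as its storage driver,
# 1 if it reports something else, 2 if podman info itself failed.
# Assumption: the driver name is available as .Store.GraphDriverName.
podman_driver_is() {
    actual=$(podman info --format '{{.Store.GraphDriverName}}') || return 2
    [ "$actual" = "$1" ]
}

# Usage (as the rootless user on a ZFS machine):
#   podman_driver_is overlay || echo "podman is not using the overlay driver"
```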

Conclusion

I’m quite happy with the experiment so far. Since then, podman and nerdctl have been working nicely in my simple day-to-day usage. There are still things I want to try out in the future, though:

  • docker in rootless mode (since compared to nerdctl, it supports Btrfs)
  • stargz snapshotter (the new hot thing, and its features look promising)

I’ll probably write another blog post if they turn out to be interesting, and if I have some free time to test them. See you then, and thanks for reading until the end!


  1. Docker has extensive documentation for setting up rootless containers. I think it is good enough already. In short, Btrfs works while ZFS support is absent.

  2. containerd uses the term “snapshotter” while podman calls it “storage driver”.

  3. you can do the same with nerdctl by the way, though it is highly discouraged.

  4. see https://github.com/containers/podman/blob/main/rootless.md

  5. detailed instructions are available at https://rootlesscontaine.rs/getting-started/common/

  6. be aware that containerd’s Btrfs storage implementation has some performance issues, e.g. not using Btrfs quota (see containerd/containerd#4217, containerd/containerd#6067 and containerd/containerd#6581)

  7. check out this answer

  8. I opened issue containerd/containerd#7514 on GitHub. You can keep track of the bug there.

  9. the overlay storage driver of podman can be configured to use fuse-overlayfs if the default overlayfs doesn’t work

  10. see issue openzfs/zfs#8648.
    Bonus tip for Alpine Linux users: by default, akms mounts an overlay filesystem inside /tmp/akms to build kernel modules. So, either disable this behavior in /etc/akms.conf or don’t create the <your_root_pool>/tmp dataset in the first place (mount /tmp as tmpfs instead, which you should always do).

  11. another option is to use s6-rc. Artix Linux has a wonderful guide on the topic.