Fixing the Subpath Volume Vulnerability in Kubernetes

Wednesday, April 04, 2018

Fixing the Subpath Volume Vulnerability in Kubernetes

On March 12, 2018, the Kubernetes Product Security team disclosed CVE-2017-1002101, which allowed containers using subpath volume mounts to access files outside of the volume. This means that a container could access any file available on the host, including volumes for other containers that it should not have access to.

The vulnerability has been fixed and released in the latest Kubernetes patch releases. We recommend that all users upgrade to get the fix. For more details on the impact and how to get the fix, please see the announcement. (Note, some functional regressions were found after the initial fix and are being tracked in issue #61563).

This post presents a technical deep dive on the vulnerability and the solution.

Kubernetes Background

To understand the vulnerability, one must first understand how volume and subpath mounting works in Kubernetes.

Before a container is started on a node, the kubelet volume manager locally mounts all the volumes specified in the PodSpec under a directory for that Pod on the host system. Once all the volumes are successfully mounted, it constructs the list of volume mounts to pass to the container runtime. Each volume mount contains information that the container runtime needs, the most relevant being:

  • Path of the volume in the container
  • Path of the volume on the host (/var/lib/kubelet/pods/<pod uid>/volumes/<volume type>/<volume name>)

When starting the container, the container runtime creates the path in the container root filesystem, if necessary, and then bind mounts it to the provided host path.

Subpath mounts are passed to the container runtime just like any other volume. The container runtime does not distinguish between a base volume and a subpath volume, and handles them the same way. Instead of passing the host path to the root of the volume, Kubernetes constructs the host path by appending the Pod-specified subpath (a relative path) to the base volume’s host path.

For example, here is a spec for a subpath volume mount:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    <snip>
    volumeMounts:
    - mountPath: /mnt/data
      name: my-volume
      subPath: dataset1
  volumes:
  - name: my-volume
    emptyDir: {}

In this example, when the Pod gets scheduled to a node, the system will:

  • Set up an EmptyDir volume at /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume
  • Construct the host path for the subpath mount: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/ + dataset1
  • Pass the following mount information to the container runtime:
    • Container path: /mnt/data
    • Host path: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/dataset1
  • The container runtime bind mounts /mnt/data in the container root filesystem to /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/dataset1 on the host.
  • The container runtime starts the container.

The Vulnerability

The vulnerability with subpath volumes was discovered by Maxim Ivanov, by making a few observations:

  • Subpath references files or directories that are controlled by the user, not the system.
  • Volumes can be shared by containers that are brought up at different times in the Pod lifecycle, including by different Pods.
  • Kubernetes passes host paths to the container runtime to bind mount into the container.

The basic example below demonstrates the vulnerability. It takes advantage of the observations outlined above by:

  • Using an init container to setup the volume with a symlink.
  • Using a regular container to mount that symlink as a subpath later.
  • Causing kubelet to evaluate the symlink on the host before passing it into the container runtime.
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  initContainers:
  - name: prep-symlink
    image: "busybox"
    command: ["bin/sh", "-ec", "ln -s / /mnt/data/symlink-door"]
    volumeMounts:
    - name: my-volume
      mountPath: /mnt/data
  containers:
  - name: my-container
    image: "busybox"
    command: ["/bin/sh", "-ec", "ls /mnt/data; sleep 999999"]
    volumeMounts:
    - mountPath: /mnt/data
      name: my-volume
      subPath: symlink-door
  volumes:
  - name: my-volume
    emptyDir: {}

For this example, the system will:

  • Setup an EmptyDir volume at /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume
  • Pass the following mount information for the init container to the container runtime:
    • Container path: /mnt/data
    • Host path: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume
  • The container runtime bind mounts /mnt/data in the container root filesystem to /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume on the host.
  • The container runtime starts the init container.
  • The init container creates a symlink inside the container: /mnt/data/symlink-door -> /, and then exits.
  • Kubelet starts to prepare the volume mounts for the normal containers.
  • It constructs the host path for the subpath volume mount: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/ + symlink-door.
  • And passes the following mount information to the container runtime:
    • Container path: /mnt/data
    • Host path: /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty-dir/my-volume/symlink-door
  • The container runtime bind mounts /mnt/data in the container root filesystem to /var/lib/kubelet/pods/1234/volumes/kubernetes.io~empty~dir/my-volume/symlink-door
  • However, the bind mount resolves symlinks, which in this case, resolves to / on the host! Now the container can see all of the host’s filesystem through its mount point /mnt/data.

This is a manifestation of a symlink race, where a malicious user program can gain access to sensitive data by causing a privileged program (in this case, kubelet) to follow a user-created symlink.

It should be noted that init containers are not always required for this exploit, depending on the volume type. It is used in the EmptyDir example because EmptyDir volumes cannot be shared with other Pods, and only created when a Pod is created, and destroyed when the Pod is destroyed. For persistent volume types, this exploit can also be done across two different Pods sharing the same volume.

The Fix

The underlying issue is that the host path for subpaths are untrusted and can point anywhere in the system. The fix needs to ensure that this host path is both:

  • Resolved and validated to point inside the base volume.
  • Not changeable by the user in between the time of validation and when the container runtime bind mounts it.

The Kubernetes product security team went through many iterations of possible solutions before finally agreeing on a design.

Idea 1

Our first design was relatively simple. For each subpath mount in each container:

  • Resolve all the symlinks for the subpath.
  • Validate that the resolved path is within the volume.
  • Pass the resolved path to the container runtime.

However, this design is prone to the classic time-of-check-to-time-of-use (TOCTTOU) problem. In between steps 2) and 3), the user could change the path back to a symlink. The proper solution needs some way to “lock” the path so that it cannot be changed in between validation and bind mounting by the container runtime. All the subsequent ideas use an intermediate bind mount by kubelet to achieve this “lock” step before handing it off to the container runtime. Once a bind mount is performed, the mount source is fixed and cannot be changed.

Idea 2

We went a bit wild with this idea:

  • Create a working directory under the kubelet’s pod directory. Let’s call it dir1.
  • Bind mount the base volume to under the working directory, dir1/volume.
  • Chroot to the working directory dir1.
  • Inside the chroot, bind mount volume/subpath to subpath. This ensures that any symlinks get resolved to inside the chroot environment.
  • Exit the chroot.
  • On the host again, pass the bind mounted dir1/subpath to the container runtime.

While this design does ensure that the symlinks cannot point outside of the volume, it was ultimately rejected due to difficulties of implementing the chroot mechanism in 4) across all the various distros and environments that Kubernetes has to support, including containerized kubelets.

Idea 3

Coming back to earth a little bit, our next idea was to:

  • Bind mount the subpath to a working directory under the kubelet’s pod directory.
  • Get the source of the bind mount, and validate that it is within the base volume.
  • Pass the bind mount to the container runtime.

In theory, this sounded pretty simple, but in reality, 2) was quite difficult to implement correctly. Many scenarios had to be handled where volumes (like EmptyDir) could be on a shared filesystem, on a separate filesystem, on the root filesystem, or not on the root filesystem. NFS volumes ended up handling all bind mounts as a separate mount, instead of as a child to the base volume. There was additional uncertainty about how out-of-tree volume types (that we couldn’t test) would behave.

The Solution

Given the amount of scenarios and corner cases that had to be handled with the previous design, we really wanted to find a solution that was more generic across all volume types. The final design that we ultimately went with was to:

  • Resolve all the symlinks in the subpath.
  • Starting with the base volume, open each path segment one by one, using the openat() syscall, and disallow symlinks. With each path segment, validate that the current path is within the base volume.
  • Bind mount /proc/<kubelet pid>/fd/<final fd> to a working directory under the kubelet’s pod directory. The proc file is a link to the opened file. If that file gets replaced while kubelet still has it open, then the link will still point to the original file.
  • Close the fd and pass the bind mount to the container runtime.

Note that this solution is different for Windows hosts, where the mounting semantics are different than Linux. In Windows, the design is to:

  • Resolve all the symlinks in the subpath.
  • Starting with the base volume, open each path segment one by one with a file lock, and disallow symlinks. With each path segment, validate that the current path is within the base volume.
  • Pass the resolved subpath to the container runtime, and start the container.
  • After the container has started, unlock and close all the files.

Both solutions are able to address all the requirements of:

  • Resolving the subpath and validating that it points to a path inside the base volume.
  • Ensuring that the subpath host path cannot be changed in between the time of validation and when the container runtime bind mounts it.
  • Being generic enough to support all volume types.

Acknowledgements

Special thanks to many folks involved with handling this vulnerability:

  • Maxim Ivanov, who responsibly disclosed the vulnerability to the Kubernetes Product Security team.
  • Kubernetes storage and security engineers from Google, Microsoft, and RedHat, who developed, tested, and reviewed the fixes.
  • Kubernetes test-infra team, for setting up the private build infrastructure
  • Kubernetes patch release managers, for coordinating and handling all the releases.
  • All the production release teams that worked to deploy the fix quickly after release.

If you find a vulnerability in Kubernetes, please follow our responsible disclosure process and let us know; we want to do our best to make Kubernetes secure for all users.

– Michelle Au, Software Engineer, Google; and Jan Šafránek, Software Engineer, Red Hat