Pod API: Information that the kata runtime is missing #2637
Other issues: current checkpoint/restore interfaces operate at the container level, which is simply not possible when using VM-based snapshots or migration.
I'm curious if there are cases where a checkpoint/restore in Kubernetes would happen at just the container level (from a user perspective). I'll need to read more of that KEP, though... so long as the requests are passed down to the runtime to execute this, I think we should be able to manage?
I'm not sure I agree with the specific example -- we can see that the resource is left unconstrained (cfs quota == -1) in this no-limit case, and make a decision to hotplug all or a percentage of CPUs (provide more vCPU threads).
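The decision described above can be sketched as a small policy function. This is illustrative only, assuming the runtime sees `quota == -1` when no CPU limit is set; the function name and the all-host-CPUs fallback are assumptions, not Kata's actual code.

```go
package main

import "fmt"

// vCPUsFromQuota decides how many vCPU threads to provide from CFS
// quota/period data. A negative quota means "unconstrained", in which
// case we fall back to a policy such as all host CPUs.
func vCPUsFromQuota(quota int64, period uint64, hostCPUs uint32) uint32 {
	if quota < 0 || period == 0 {
		// No limit set (cfs quota == -1): hotplug all host CPUs.
		return hostCPUs
	}
	// Round up so fractional limits (e.g. 2.5 CPUs) get enough threads.
	return uint32((uint64(quota) + period - 1) / period)
}

func main() {
	fmt.Println(vCPUsFromQuota(-1, 100000, 8))     // unconstrained -> 8
	fmt.Println(vCPUsFromQuota(250000, 100000, 8)) // 2.5 CPUs -> 3
}
```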
The area where we could use more detail is being provided the memory/CPU details at sandbox creation time (in the case of the pod lifecycle), in order to help us appropriately set up the queues for the various IO devices (virtio-net/virtiofs). This should be feasible to implement, and the required changes are targeting Kubernetes 1.23 (code freeze mid October).
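The queue-sizing dependency mentioned above can be sketched as follows. The one-queue-per-vCPU heuristic, capped by what the device supports, is an assumption for illustration, not Kata's exact policy; without sandbox-level CPU data at creation time, `vCPUs` here is only a guess.

```go
package main

import "fmt"

// virtioQueues sizes a virtio multi-queue device (virtio-net, virtiofs)
// from the sandbox's vCPU count: one queue per vCPU, bounded by the
// device's supported maximum.
func virtioQueues(vCPUs, deviceMaxQueues uint32) uint32 {
	if vCPUs < 1 {
		return 1
	}
	if vCPUs > deviceMaxQueues {
		return deviceMaxQueues
	}
	return vCPUs
}

func main() {
	fmt.Println(virtioQueues(4, 16))  // 4
	fmt.Println(virtioQueues(64, 16)) // capped at 16
}
```

Because the queues are configured when the VM devices are created, getting the CPU details only later (at container create) is too late, which is why the sandbox-time data matters.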
I'd need to better understand the rest of the image pull discussions that have been happening in the CC calls. In particular, I know that the Ali folks have been making progress in this space; I'm curious to understand how they're working through the ordering limitation (i.e., ensuring an image pull is associated with a given sandbox). Similarly, I'm not sure I appreciate yet the attestation flow relative to the container "start" - I would have thought that attestation would be done in create.
Get `SandboxName` from the `SandboxConfig` of the `PullImageRequest`, then get the `SandboxID` based on the `SandboxName` from the `SandboxNameIndex` of CRI:

```go
name := server.MakeSandboxName(r.SandboxConfig.GetMetadata())
sandboxID := cc.SandboxNameIndex.GetKeyByName(name)
```
See a use case for this in #2941
Make the license checker ignore drawio files. Commit is effectively a cherry pick of 11ea324 (which fixed kata-containers#2637). Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Purpose and motivation
This Feature Request is intended to document and present the kind of issues that a runtime like Kata Containers encounters with existing Kubernetes and container runtime APIs, for example due to API designs historically shaped by constraints that do not apply well to a VM-based runtime. We decided, during the Architecture Committee on Sep 7, 2021, to first document these as a Kata Containers issue, in order to be able to share our needs with other projects.
A primary motivation for this document is Confidential Containers, but the examples given below show that the problem is older than that. This is also intended to describe the problem rather than suggest solutions, with the assumption that the majority of the solution will happen outside of Kata.
Examples of issues
This issue uses three examples to illustrate the kinds of problems we run into:
These three examples are only meant to illustrate specific interactions, broadly related to the Container Runtime Interface (CRI), Container Network Interface (CNI) and Container Storage Interface (CSI). We will later refer to these "categories" of issue as "CPU", "network" and "disk" respectively.
There are other issues that can be seen as related, notably around device mapping (e.g. #2185) and changes in the authentication and trust domains, particularly as they pertain to confidential containers (e.g. #1834). Also, some of these topics have already been discussed outside of Kata Containers, e.g. the Sandbox API.
Overview of the problem
Historically, the existing APIs and flows between the various components were defined largely based on how the underlying operating system functions are used to implement containers. As a result, some APIs will simply not contain data that Kata Containers might need, provide the data at the wrong time, in the wrong order or to the wrong component, or perform operations that are detrimental to how Kata needs to do things.
This is illustrated by the three issues above: for example, with no CPU limit set, `runc` would run with all host CPUs.

Detailed problem description
Pod API - Need for changes outside of Kata Containers
There is little Kata Containers can do (other than work around or guess) when some data is missing. Therefore, it is likely that we will need a set of changes outside of Kata Containers in order to get things fixed "the right way". This can probably be done incrementally, since Kata Containers manages to work around a number of these issues for now.
I suggest calling "Pod API" an API that would describe pods in a way suitable for Kata Containers. We used to refer to this problem as the Sandbox API, but that is really a containerd effort that addresses only a subset of what is being discussed here.
Missing data
While we know at the moment that we are missing values such as the CPU request, it seems likely that there are other cases we have not seen yet. A good example could be support for NUMA or NUMA-awareness (hinted at by #2594). It may also be the case that we need to adjust the data. A good example is the device remapping that happens between host and guest devices to support SR-IOV and the SR-IOV device plug-in, requiring environment variable rewrites.
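The environment variable rewrite mentioned above can be sketched as below. The `PCIDEVICE_` prefix is how the SR-IOV device plugin exposes allocated devices; the remap table, the helper name and the sample addresses are assumptions for illustration, not Kata's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// rewriteDeviceEnv rewrites device-plugin environment variables so that
// host PCI addresses are replaced with the addresses the devices get
// after being passed through to the guest.
func rewriteDeviceEnv(env []string, hostToGuest map[string]string) []string {
	out := make([]string, 0, len(env))
	for _, kv := range env {
		if strings.HasPrefix(kv, "PCIDEVICE_") {
			for host, guest := range hostToGuest {
				kv = strings.ReplaceAll(kv, host, guest)
			}
		}
		out = append(out, kv)
	}
	return out
}

func main() {
	env := []string{"PCIDEVICE_INTEL_COM_SRIOV=0000:3b:02.0", "PATH=/usr/bin"}
	remap := map[string]string{"0000:3b:02.0": "0000:00:05.0"}
	fmt.Println(rewriteDeviceEnv(env, remap))
}
```

The point is that the runtime cannot do this rewrite unless the API hands it enough of the original specification to know which variables describe devices.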
Therefore, it seems desirable for the Pod API to receive a copy of the entire original workload specification, ideally in a sufficiently structured way that we can actually find fields in it easily. The downside, of course, is that this could in practice mean integrating "all of Kubernetes" into the Kata runtime, and the idea may even be flatly rejected if it exposes secrets.
Data flow between components
It is "natural" for a host-based runtime like `runc` to concern itself as little as possible with disks or networking. Separation of concerns is generally good, so it is perfectly sensible for the Kubernetes design to separate CRI, CSI and CNI.

The problem, however, is that this is presently a "top-down" approach, which does not make cross-component interactions easy. Having a runtime "intercept" a disk mount, for example, and be able to tell Kubernetes "Sorry, I'm taking care of this mount" (because it will be an in-guest mount) is therefore desirable in our case. Note that similar "reworks" of the interfaces are already in progress to enable confidential containers. As was presented during a recent Confidential Containers presentation, reworking the general control and data flow may be necessary.
Solving this is a complicated problem, which may involve being able to introduce the ability to have some "cross talk" between components that are currently seen as independent from one another. Being able to add "hooks" on operations such as volume mounts or networking setup may be a solution, with a runtime being able to tell CSI or CNI that it will take care of specific APIs.
This could also be implemented with a more general mechanism that would allow a component to insert itself around an API, while retaining the ability to call the original API.
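The hook idea could look roughly like the following. This interface is hypothetical (it exists in neither Kubernetes nor Kata today); the names and the boolean "claimed" convention are assumptions used to show the shape of the cross-talk, not a proposed API.

```go
package main

import "fmt"

// MountHook is a hypothetical hook a runtime could register on CSI
// volume-mount operations. Returning claimed == true tells the caller
// to skip the host-side mount because the runtime will handle it.
type MountHook interface {
	HandleMount(volumeID, source, target string) (claimed bool, err error)
}

// guestMounter models a VM-based runtime that claims mounts it can
// satisfy inside the guest (e.g. via virtiofs or a hotplugged disk).
type guestMounter struct{}

func (guestMounter) HandleMount(volumeID, source, target string) (bool, error) {
	// Record the mount and perform it later inside the guest.
	return true, nil
}

func main() {
	var h MountHook = guestMounter{}
	claimed, _ := h.HandleMount("pvc-123", "/dev/vdb", "/data")
	fmt.Println(claimed) // true: host-side CSI mount is skipped
}
```

A similar hook on CNI operations would let the runtime take over network setup that must happen inside the guest.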
Timing considerations
An even more problematic issue is when the timing "is not right", i.e. when the existing flow assumes a sequence of events that is no longer valid.
An example of this is that in the "normal" scenario, image download conceptually happens before pod creation. In the Confidential Containers case, we need a (confidential) pod to download images in. As an aside, an interesting implication of this is that we can currently enumerate the images that a cluster has already downloaded to decide if we need to pull an image, but in Confidential Containers this is not possible. A similar issue exists with respect to starting a container, because with Confidential Containers this will involve an attestation step, which implies that the guest, not Kubernetes, ultimately decides if and when to actually start the workload.
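The reordered start flow can be reduced to a minimal gate. This is only a sketch of the ordering constraint described above, with illustrative names: the guest-side agent, not Kubernetes, holds the decision, and the start request succeeds only after attestation.

```go
package main

import (
	"errors"
	"fmt"
)

// startWorkload models the Confidential Containers ordering: container
// start is gated on an in-guest attestation result, instead of being
// unconditionally executed when Kubernetes issues the start request.
func startWorkload(attested bool) error {
	if !attested {
		return errors.New("attestation not complete: refusing to start workload")
	}
	// ...proceed with the in-guest container start...
	return nil
}

func main() {
	fmt.Println(startWorkload(false)) // non-nil error: start refused
	fmt.Println(startWorkload(true))  // nil: workload starts
}
```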
Action plan