
Pod API: Information that the kata runtime is missing #2637

Status: Open
c3d opened this issue Sep 14, 2021 · 7 comments
Labels: area/api (Application Programming Interface), discussion (issues meant to record a discussion rather than describe code changes)
c3d (Member) commented Sep 14, 2021

Purpose and motivation

This Feature Request is intended to document and present the kinds of issues that a runtime like Kata Containers encounters with existing Kubernetes and container runtime APIs, for example due to API designs historically shaped by constraints that do not apply well to a VM-based runtime. We decided, during the Architecture Committee meeting on Sep 7, 2021, to first document these issues here, in order to be able to share our needs with other projects.

A primary motivation for this document is Confidential Containers, but the examples given below show that the problem predates it. This document is also intended to describe the problem rather than to suggest solutions, with the assumption that the majority of the solution will happen outside of Kata.

Examples of issues

This issue uses three examples to illustrate the kinds of problems we run into:

These three examples are only meant to illustrate specific interactions, broadly related to the Container Runtime Interface (CRI), Container Network Interface (CNI) and Container Storage Interface (CSI). We will later refer to these categories of issues as "CPU", "network" and "disk" respectively.

There are other issues that can be seen as related, notably device mapping (e.g. #2185) and changes in the authentication and trust domains, in particular as they pertain to confidential containers (e.g. #1834). Some of these topics have also already been discussed outside of Kata Containers, e.g. the Sandbox API.

Overview of the problem

Historically, the existing APIs and flows between the various components were defined largely based on how the underlying operating system functions are used to implement containers. As a result, some APIs simply do not contain data that Kata Containers might need, provide the data at the wrong time, in the wrong order or to the wrong component, or perform operations that are detrimental to how Kata needs to do things.

This is illustrated by the three issues above:

  • The CPU issue used as an example is caused by a lack of data, specifically the number of requested CPUs being absent from the data that Kata receives. This forces Kata to hot-plug CPUs using data "guessed" from the CPU limit, and seriously restricts, for example, CPU-bound network performance in cases where we run with 1 CPU whereas runc would run with all host CPUs (a sketch of this guessing follows this list).
  • The network issue is an example where the data does not currently flow easily between components the way we want. For example, k8s might talk to the CNI directly and not pass information to the runtime at all, but for a virtual machine we need some of those details in order to set up the network efficiently.
  • The disk issue is an example where the host does things that are somewhat "harmful", in this case mounting disks, and may even cause logical or security problems that will become more apparent with confidential containers, such as possibly forcing us to expose guest-owned disk encryption secrets to the host.
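
To make the CPU example concrete, here is a minimal sketch (not the actual Kata code) of the kind of guessing the runtime is forced into today: deriving a vCPU count from the CFS quota and period visible in the OCI spec, with the unconstrained case falling back to an arbitrary default because the original CPU request is simply not available.

package main

import (
	"fmt"
	"math"
)

// guessVCPUs derives a vCPU count from the CFS quota/period found in the
// OCI spec. This mimics the "guess from the limit" behaviour described
// above; it is not the actual Kata implementation.
func guessVCPUs(quotaUs int64, periodUs uint64, defaultVCPUs uint32) uint32 {
	// quota <= 0 means "unconstrained": the CPU request is not visible to
	// the runtime, so we fall back to an arbitrary default (e.g. 1 vCPU),
	// whereas runc would simply see all host CPUs.
	if quotaUs <= 0 || periodUs == 0 {
		return defaultVCPUs
	}
	return uint32(math.Ceil(float64(quotaUs) / float64(periodUs)))
}

func main() {
	fmt.Println(guessVCPUs(-1, 100000, 1))     // no limit set: 1 vCPU
	fmt.Println(guessVCPUs(250000, 100000, 1)) // limit of 2.5 CPUs: 3 vCPUs
}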

Detailed problem description

Pod API - Need for changes outside of Kata Containers

There is little Kata Containers can do (other than work around or guess) when some data is missing. Therefore, it is likely that we will need a set of changes to happen outside of Kata Containers in order to get things fixed "the right way". This can probably be done in an incremental way, since Kata Containers manages to work around a number of issues for now.

I suggest calling "Pod API" an API that would describe pods in a way that is suitable for Kata Containers. We used to refer to this problem by talking about the Sandbox API, but that is really a containerd topic that addresses only a subset of what is being discussed here.

Missing data

While we know at the moment that we are missing values such as the CPU request, it seems likely that there are other cases we have not seen yet. A good example could be support for NUMA or NUMA awareness (hinted at by #2594). It may also be the case that we need to adjust the data. A good example is the device remapping that happens between host and guest devices to support SR-IOV and the SR-IOV device plug-in, which requires environment variable rewrites.
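
As an illustration of this "adjusting the data" case, here is a hedged sketch of the kind of environment variable rewrite the SR-IOV scenario implies: host PCI addresses exposed by the device plug-in have to be replaced by the addresses the devices get inside the guest after passthrough. The "PCIDEVICE_" prefix and the host-to-guest map are assumptions for illustration, not the actual Kata implementation.

package sketch

import "strings"

// rewriteSRIOVEnv rewrites device plug-in environment variables that carry
// host PCI addresses so that they point at the addresses the devices were
// re-enumerated to inside the guest. Variable prefix and mapping are
// illustrative only.
func rewriteSRIOVEnv(env []string, hostToGuest map[string]string) []string {
	out := make([]string, 0, len(env))
	for _, kv := range env {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) == 2 && strings.HasPrefix(parts[0], "PCIDEVICE_") {
			if guest, ok := hostToGuest[parts[1]]; ok {
				kv = parts[0] + "=" + guest
			}
		}
		out = append(out, kv)
	}
	return out
}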

Therefore, it seems desirable for the Pod API to receive a copy of the entire original workload specification, ideally in a sufficiently structured way that we can actually find fields in it easily. The downside, of course, is that this could practically mean integrating "all of Kubernetes" into the Kata runtime, and it may even be flatly rejected if it exposes secrets.
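
To show what "sufficiently structured" could mean without integrating all of Kubernetes, here is a sketch in which the runtime declares only the few fields it cares about and decodes them from a pod spec handed over as JSON. The idea that the spec would arrive as a JSON blob (for example through a sandbox annotation) is an assumption, not an existing interface.

package main

import (
	"encoding/json"
	"fmt"
)

// podView declares only the fields the runtime cares about, instead of
// importing the full Kubernetes API types.
type podView struct {
	Spec struct {
		Containers []struct {
			Name      string `json:"name"`
			Resources struct {
				Requests map[string]string `json:"requests"`
			} `json:"resources"`
		} `json:"containers"`
	} `json:"spec"`
}

// cpuRequests extracts the per-container CPU request, which is exactly the
// data that is missing from what Kata receives today.
func cpuRequests(podJSON []byte) (map[string]string, error) {
	var p podView
	if err := json.Unmarshal(podJSON, &p); err != nil {
		return nil, err
	}
	reqs := map[string]string{}
	for _, c := range p.Spec.Containers {
		reqs[c.Name] = c.Resources.Requests["cpu"]
	}
	return reqs, nil
}

func main() {
	spec := []byte(`{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"2"}}}]}}`)
	reqs, err := cpuRequests(spec)
	fmt.Println(reqs, err) // map[app:2] <nil>
}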

Data flow between components

It is "natural" for a host-based runtime like runc to concern itself as little as possible with disks or networking. Separation of concerns is generally good, so it is perfectly sensible for the Kubernetes design to separate CRI, CSI and CNI.

The problem, however, is that this is presently a "top-down" approach, which does not make cross-component interactions easy. Having a runtime "intercept" a disk mount, for example, and be able to tell Kubernetes "Sorry, I'm taking care of this mount" (because it will be an in-guest mount) is therefore desirable in our case. Note that similar "reworks" of the interfaces are already in progress to enable confidential containers. As was presented during a recent Confidential Containers presentation, reworking the general control and data flow may be necessary.

Solving this is a complicated problem, which may involve introducing some "cross talk" between components that are currently seen as independent from one another. Being able to add "hooks" on operations such as volume mounts or networking setup may be a solution, with a runtime being able to tell CSI or CNI that it will take care of specific operations.

This could also be implemented with a more general mechanism that would allow a component to insert itself around an API, with the ability to call the original API.
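
One possible shape for such a mechanism, purely as a sketch: a hook interface that a runtime could register with the component owning the operation (a volume mount here), with the original behaviour as the fallback. None of these names exist today; this only illustrates the "I'm taking care of this mount" interaction described above.

package sketch

// MountHook lets a runtime intercept a volume mount request. Returning
// handled == true tells the caller (e.g. the CSI flow) "I'm taking care of
// this mount" because it will happen inside the guest; returning false
// keeps the original host-side behaviour. Hypothetical interface.
type MountHook interface {
	PreMount(volumeID, target string) (handled bool, err error)
}

// mountVolume shows how the owning component could consult registered hooks
// before performing the host mount it does unconditionally today.
func mountVolume(hooks []MountHook, volumeID, target string, hostMount func(volumeID, target string) error) error {
	for _, h := range hooks {
		handled, err := h.PreMount(volumeID, target)
		if err != nil {
			return err
		}
		if handled {
			return nil // the runtime will handle this mount inside the guest
		}
	}
	return hostMount(volumeID, target) // original behaviour
}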

Timing considerations

An even more problematic issue is when the timing "is not right", i.e. when the existing flow assumes a sequence of events that is no longer valid.

An example of this is that in the "normal" scenario, image download conceptually happens before pod creation. In the Confidential Containers case, we need a (confidential) pod to download images in. As an aside, an interesting implication of this is that we can currently enumerate the images that a cluster has already downloaded to decide if we need to pull an image, but in Confidential Containers this is not possible. A similar issue exists with respect to starting a container, because with Confidential Containers this will involve an attestation step, which implies that the guest, not Kubernetes, ultimately decides if and when to actually start the workload.
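
For illustration only, the reordered flow could look like the sketch below. Every type and method here is hypothetical; the point is the ordering: the sandbox exists and is attested before any image is pulled, and the guest gates whether the workload actually starts.

package sketch

// guestAgent is a hypothetical interface to the agent running inside the
// confidential VM; none of these names exist in Kata today.
type guestAgent interface {
	Attest() error                  // the guest proves its measurement first
	PullImage(ref string) error     // images are pulled inside the guest
	StartContainer(id string) error // started only if attestation succeeded
}

// runConfidentialWorkload captures the reordering: attestation happens
// before any image pull, and the guest, not Kubernetes, decides whether
// the workload starts.
func runConfidentialWorkload(agent guestAgent, imageRef, containerID string) error {
	if err := agent.Attest(); err != nil {
		return err
	}
	if err := agent.PullImage(imageRef); err != nil {
		return err
	}
	return agent.StartContainer(containerID)
}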

Action plan

  1. Discuss on this issue, and modify the description above until we are satisfied with it
  2. Present the issue to stakeholders outside of the Kata Containers team (containerd, crio, kubelet, CNI and CSI at least)
  3. Open issues to describe the APIs themselves
wllenyj (Contributor) commented Sep 17, 2021

Other issues:
A checkpoint/restore API at the pod level (kubernetes/enhancements#1990).

The current checkpoint/restore interfaces operate at the container level, which simply does not work when using VM-based snapshots or migration.

egernst (Member) commented Sep 21, 2021

> Other issues:
> A checkpoint/restore API at the pod level (kubernetes/enhancements#1990).
>
> The current checkpoint/restore interfaces operate at the container level, which simply does not work when using VM-based snapshots or migration.

I'm curious if there are cases where a checkpoint/restore in Kubernetes would happen at just the container level (from a user perspective). I'll need to read more of that KEP, though... so long as the requests are passed down to the runtime to execute this, I think we should be able to manage?

egernst (Member) commented Sep 21, 2021

> The CPU issue used as an example is caused by a lack of data, specifically the number of requested CPUs being absent from the data that Kata receives. This forces Kata to hot-plug CPUs using data "guessed" from the CPU limit, and seriously restricts, for example, CPU-bound network performance in cases where we run with 1 CPU whereas runc would run with all host CPUs.

I'm not sure I agree with the specific example -- we can see that the resource is left unconstrained (CFS quota == -1) in this no-limit case, and make a decision to hot-plug all or a percentage of the host CPUs (i.e. provide more vCPU threads).

egernst (Member) commented Sep 21, 2021

The area where we could use more detail is being provided the memory/CPU details at sandbox creation time (in the pod lifecycle case), in order to help us appropriately set up the queues for the various IO devices (virtio-net/virtio-fs). This should be feasible to implement, and the required changes in Kubernetes are targeting 1.23 (code freeze mid October).
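
As a small illustration of why the sandbox-level sizing matters, here is a sketch (with an assumed device queue limit, not Kata's actual logic) of deriving a virtio multi-queue count from the vCPU count known at sandbox creation time:

package sketch

// queueCount derives a virtio multi-queue count (virtio-net, virtio-fs, ...)
// from the sandbox vCPU count. maxQueues is an assumed device limit; the
// point is only that vCPUs must be known when the devices are plugged,
// i.e. at sandbox creation time rather than at container creation time.
func queueCount(vcpus, maxQueues int) int {
	if vcpus < 1 {
		vcpus = 1
	}
	if vcpus > maxQueues {
		return maxQueues
	}
	return vcpus
}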

egernst (Member) commented Sep 21, 2021

I'd need to better understand the rest of the image pull discussions that have been happening in the CC calls. In particular, I know that the Ali folks have been progressing in this space; I'm curious to understand how they're working through the ordering limitation (i.e., ensuring an image pull is associated with a given sandbox).

Similarly, I'm not sure I yet appreciate the attestation flow relative to the container "start" - I would think that attestation would be done at create time.

wllenyj (Contributor) commented Sep 23, 2021

> ensuring an image pull is associated with a given sandbox

Get the sandbox name from the SandboxConfig of the PullImageRequest, then look up the sandbox ID for that name in the CRI plugin's SandboxNameIndex:

	// Derive the sandbox name from the metadata carried in the request's SandboxConfig...
	name := server.MakeSandboxName(r.SandboxConfig.GetMetadata())
	// ...then resolve that name to the sandbox ID via the CRI plugin's name index.
	sandboxID := cc.SandboxNameIndex.GetKeyByName(name)

containerd/containerd@3fd5083#diff-c1021ec6c3cecde436373b5c0a3eb6527258e1c3fe264efacc626dcb16e42c41R176

c3d (Member, Author) commented Jan 25, 2022

See a use case for this in #2941
