Pod API: Information that the kata runtime is missing #2637
Other issues: current checkpoint/restore interfaces operate at the container level, which is simply not possible when using VM-based snapshots or migration.
I'm curious if there are cases where a checkpoint/restore in Kubernetes would happen at just the container level (from a user perspective). I'll need to read more of that KEP, though... so long as the requests are passed down to the runtime to execute this, I think we should be able to manage?
I'm not sure I agree with the specific example -- we can see that the resource is left unconstrained (cfs quota == -1) in this no-limit case, and make a decision to hotplug all or a percentage of CPUs (provide more vCPU threads).
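The decision described above can be sketched as a small policy function. This is illustrative only, assuming the runtime sees `quota == -1` when no CPU limit is set; the function name and the all-host-CPUs fallback are assumptions, not Kata's actual code.

```go
package main

import "fmt"

// vCPUsFromQuota decides how many vCPU threads to provide from CFS
// quota/period data. A negative quota means "unconstrained", in which
// case we fall back to a policy such as all host CPUs.
func vCPUsFromQuota(quota int64, period uint64, hostCPUs uint32) uint32 {
	if quota < 0 || period == 0 {
		// No limit set (cfs quota == -1): hotplug all host CPUs.
		return hostCPUs
	}
	// Round up so fractional limits (e.g. 2.5 CPUs) get enough threads.
	return uint32((uint64(quota) + period - 1) / period)
}

func main() {
	fmt.Println(vCPUsFromQuota(-1, 100000, 8))     // unconstrained -> 8
	fmt.Println(vCPUsFromQuota(250000, 100000, 8)) // 2.5 CPUs -> 3
}
```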
The area where we could use more detail is being provided the memory/CPU details at sandbox creation time (in the case of the pod lifecycle), in order to help us appropriately set up the queues for the various IO devices (virtio-net/virtiofs). This should be feasible to implement, and the required changes are targeting Kubernetes 1.23 (code freeze mid October).
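The queue-sizing dependency mentioned above can be sketched as follows. The one-queue-per-vCPU heuristic, capped by what the device supports, is an assumption for illustration, not Kata's exact policy; without sandbox-level CPU data at creation time, `vCPUs` here is only a guess.

```go
package main

import "fmt"

// virtioQueues sizes a virtio multi-queue device (virtio-net, virtiofs)
// from the sandbox's vCPU count: one queue per vCPU, bounded by the
// device's supported maximum.
func virtioQueues(vCPUs, deviceMaxQueues uint32) uint32 {
	if vCPUs < 1 {
		return 1
	}
	if vCPUs > deviceMaxQueues {
		return deviceMaxQueues
	}
	return vCPUs
}

func main() {
	fmt.Println(virtioQueues(4, 16))  // 4
	fmt.Println(virtioQueues(64, 16)) // capped at 16
}
```

Because the queues are configured when the VM devices are created, getting the CPU details only later (at container create) is too late, which is why the sandbox-time data matters.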
I'd need to better understand the rest of the image pull discussions that have been happening in the CC calls. In particular, I know that the Ali folks have been making progress in this space; I'm curious to understand how they're working through the ordering limitation (i.e., ensuring an image pull is associated with a given sandbox). Similarly, I'm not sure I appreciate yet the attestation flow relative to the container "start" - I would have thought that attestation would be done in create.
Get `SandboxName` from the `SandboxConfig` of the `PullImageRequest`, then get the `SandboxID` based on the `SandboxName` from the `SandboxNameIndex` of CRI:

```go
name := server.MakeSandboxName(r.SandboxConfig.GetMetadata())
sandboxID := cc.SandboxNameIndex.GetKeyByName(name)
```
See a use case for this in #2941
Make the license checker ignore drawio files. Commit is effectively a cherry pick of 11ea324 (which fixed kata-containers#2637). Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Purpose and motivation
This Feature Request is intended to document and present the kind of issues that a runtime like Kata Containers encounters with existing Kubernetes and container runtime APIs, for example due to API designs historically shaped by constraints that do not apply well to a VM-based runtime. We decided, during the Architecture Committee on Sep 7, 2021, to first document these as a Kata Containers issue, in order to be able to share our needs with other projects.
A primary motivation for this document is Confidential Containers, but the examples given below show that the problem is older than that. This is also intended to describe the problem rather than suggest solutions, with the assumption that the majority of the solution will happen outside of Kata.
Examples of issues
This issue uses three examples to illustrate the kinds of problems we run into:
These three examples are only meant to illustrate specific interactions, broadly related to the Container Runtime Interface (CRI), Container Network Interface (CNI) and Container Storage Interface (CSI). We will later refer to these "categories" of issue as "CPU", "network" and "disk" respectively.
There are other issues that can be seen as related, notably around device mapping (e.g. #2185) and changes in the authentication and trust domains, particularly as they pertain to confidential containers (e.g. #1834). Also, some of these topics have already been discussed outside of Kata Containers, e.g. the Sandbox API.
Overview of the problem
Historically, the existing APIs and flows between the various components were defined largely based on how the underlying operating system functions are used to implement containers. As a result, some APIs will simply not contain data that Kata Containers might need, provide the data at the wrong time, in the wrong order or to the wrong component, or perform operations that are detrimental to how Kata needs to do things.
This is illustrated by the three issues above: for example, with no CPU limit set, `runc` would run with all host CPUs.

Detailed problem description
Pod API - Need for changes outside of Kata Containers
There is little Kata Containers can do (other than work around or guess) when some data is missing. Therefore, it is likely that we will need a set of changes outside of Kata Containers in order to get things fixed "the right way". This can probably be done incrementally, since Kata Containers manages to work around a number of these issues for now.
I suggest calling "Pod API" an API that would describe pods in a way suitable for Kata Containers. We used to refer to this problem as the Sandbox API, but that is really a containerd effort that addresses only a subset of what is being discussed here.
Missing data
While we know at the moment that we are missing values such as the CPU request, it seems likely that there are other cases we have not seen yet. A good example could be support for NUMA or NUMA-awareness (hinted at by #2594). It may also be the case that we need to adjust the data. A good example is the device remapping that happens between host and guest devices to support SR-IOV and the SR-IOV device plug-in, requiring environment variable rewrites.
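The environment variable rewrite mentioned above can be sketched as below. The `PCIDEVICE_` prefix is how the SR-IOV device plugin exposes allocated devices; the remap table, the helper name and the sample addresses are assumptions for illustration, not Kata's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// rewriteDeviceEnv rewrites device-plugin environment variables so that
// host PCI addresses are replaced with the addresses the devices get
// after being passed through to the guest.
func rewriteDeviceEnv(env []string, hostToGuest map[string]string) []string {
	out := make([]string, 0, len(env))
	for _, kv := range env {
		if strings.HasPrefix(kv, "PCIDEVICE_") {
			for host, guest := range hostToGuest {
				kv = strings.ReplaceAll(kv, host, guest)
			}
		}
		out = append(out, kv)
	}
	return out
}

func main() {
	env := []string{"PCIDEVICE_INTEL_COM_SRIOV=0000:3b:02.0", "PATH=/usr/bin"}
	remap := map[string]string{"0000:3b:02.0": "0000:00:05.0"}
	fmt.Println(rewriteDeviceEnv(env, remap))
}
```

The point is that the runtime cannot do this rewrite unless the API hands it enough of the original specification to know which variables describe devices.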
Therefore, it seems desirable for the Pod API to receive a copy of the entire original workload specification, ideally in a sufficiently structured way that we can actually find fields in it easily. The downside, of course, is that this could in practice mean integrating "all of Kubernetes" into the Kata runtime, and the idea may even be flatly rejected if it exposes secrets.
Data flow between components
It is "natural" for a host-based runtime like `runc` to concern itself as little as possible with disks or networking. Separation of concerns is generally good, so it is perfectly sensible for the Kubernetes design to separate CRI, CSI and CNI.

The problem, however, is that this is presently a "top-down" approach, which does not make cross-component interactions easy. Having a runtime "intercept" a disk mount, for example, and be able to tell Kubernetes "Sorry, I'm taking care of this mount" (because it will be an in-guest mount) is therefore desirable in our case. Note that similar "reworks" of the interfaces are already in progress to enable confidential containers. As was presented during a recent Confidential Containers presentation, reworking the general control and data flow may be necessary.
Solving this is a complicated problem, which may involve being able to introduce the ability to have some "cross talk" between components that are currently seen as independent from one another. Being able to add "hooks" on operations such as volume mounts or networking setup may be a solution, with a runtime being able to tell CSI or CNI that it will take care of specific APIs.
This could also be implemented with a more general mechanism that would allow a component to insert itself around an API, while retaining the ability to call the original API.
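The hook idea could look roughly like the following. This interface is hypothetical (it exists in neither Kubernetes nor Kata today); the names and the boolean "claimed" convention are assumptions used to show the shape of the cross-talk, not a proposed API.

```go
package main

import "fmt"

// MountHook is a hypothetical hook a runtime could register on CSI
// volume-mount operations. Returning claimed == true tells the caller
// to skip the host-side mount because the runtime will handle it.
type MountHook interface {
	HandleMount(volumeID, source, target string) (claimed bool, err error)
}

// guestMounter models a VM-based runtime that claims mounts it can
// satisfy inside the guest (e.g. via virtiofs or a hotplugged disk).
type guestMounter struct{}

func (guestMounter) HandleMount(volumeID, source, target string) (bool, error) {
	// Record the mount and perform it later inside the guest.
	return true, nil
}

func main() {
	var h MountHook = guestMounter{}
	claimed, _ := h.HandleMount("pvc-123", "/dev/vdb", "/data")
	fmt.Println(claimed) // true: host-side CSI mount is skipped
}
```

A similar hook on CNI operations would let the runtime take over network setup that must happen inside the guest.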
Timing considerations
An even more problematic issue is when the timing "is not right", i.e. when the existing flow assumes a sequence of events that is no longer valid.
An example of this is that in the "normal" scenario, image download conceptually happens before pod creation. In the Confidential Containers case, we need a (confidential) pod to download images in. As an aside, an interesting implication of this is that we can currently enumerate the images that a cluster has already downloaded to decide if we need to pull an image, but in Confidential Containers this is not possible. A similar issue exists with respect to starting a container, because with Confidential Containers this will involve an attestation step, which implies that the guest, not Kubernetes, ultimately decides if and when to actually start the workload.
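The reordered start flow can be reduced to a minimal gate. This is only a sketch of the ordering constraint described above, with illustrative names: the guest-side agent, not Kubernetes, holds the decision, and the start request succeeds only after attestation.

```go
package main

import (
	"errors"
	"fmt"
)

// startWorkload models the Confidential Containers ordering: container
// start is gated on an in-guest attestation result, instead of being
// unconditionally executed when Kubernetes issues the start request.
func startWorkload(attested bool) error {
	if !attested {
		return errors.New("attestation not complete: refusing to start workload")
	}
	// ...proceed with the in-guest container start...
	return nil
}

func main() {
	fmt.Println(startWorkload(false)) // non-nil error: start refused
	fmt.Println(startWorkload(true))  // nil: workload starts
}
```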
Action plan