[go: nahoru, domu]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-1397: descheduler integration with evacuation API as an alternative to eviction API #1354

Merged

Conversation

ingvagabund
Copy link
Contributor
@ingvagabund ingvagabund commented Feb 26, 2024

Is your feature request related to a problem? Please describe.

The descheduler eviction policy is built on top of the eviction API. The API currently does not support eviction requests that are not completed right away. Instead, any eviction needs to either succeed or be rejected in response. Nevertheless, there are cases where an eviction request is expected to only initiate eviction. While getting confirmation or rejection of the eviction initiation (or its promise).

Describe the solution you'd like

Utilize evacuation API as an alternative to eviction API. As an interim solution (until the evacuation API is available) allow to interpret pods with a special annotation as a request for eviction initiation instead of expecting an immediate eviction.

Resolves: #1397

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 26, 2024
@ingvagabund ingvagabund force-pushed the evictions-in-background branch 2 times, most recently from 320169c to 5af0e72 Compare February 26, 2024 10:12
@ingvagabund ingvagabund force-pushed the evictions-in-background branch 5 times, most recently from a53c30d to ec699a1 Compare May 7, 2024 13:43
@songtao98
Copy link
songtao98 commented May 11, 2024

Hi, @ingvagabund . I have some questions about how does this KEP works in Descheduling Framework.

  1. Do you expect to use this new eviction handling mechanism to replace existing PodEvictor (who call Eviciton API directly), so that all users evict their pods through it?
  2. Do you consider to take one more step to make it as a extension point, e.g. Evict, so that users can decide which way to evict their pods, and existing PodEvictor or your extension by this KEP can both be implemented as a EvictPlugin(different with existing EvictorPlugin)?

@ingvagabund ingvagabund changed the title KEP: Extend descheduler eviction handling for eviction requests KEP-1397: descheduler integration with evacuation API as an alternative to eviction API May 11, 2024
@ingvagabund ingvagabund force-pushed the evictions-in-background branch 5 times, most recently from f1649b7 to c5cae17 Compare May 11, 2024 18:16
@ingvagabund
Copy link
Contributor Author

@songtao98 thank you for asking the right questions.

  1. Do you expect to use this new eviction handling mechanism to replace existing PodEvictor (who call Eviciton API directly), so that all users evict their pods through it?

Eventually, that's the ultimate goal. However, the first implementations will provide both eviction API and evacuation API. So we have some soft time for early adopters to share their experience and corner case for which evacuation API does not work/scale well. Followed by discussions with evacuation API designers about possible resolutions before the new API goes GA.

  1. Do you consider to take one more step to make it as a extension point, e.g. Evict, so that users can decide which way to evict their pods, and existing PodEvictor or your extension by this KEP can both be implemented as a EvictPlugin(different with existing EvictorPlugin)?

Not right now. Ultimately, the new evacuation API should be a special case of the eviction API. The details are yet to be discussed.

@songtao98
Copy link

thank you for asking the right questions.

@ingvagabund Thank you for your reply! For question 1, the strategy sounds great. For question 2, I'm not sure if someone has ever proposed a custom Evictor mechanism(to customize the real evictor like PodEvictor, not currently EvictorPlugin), but we think this capability will eventually be meaningful and already implemented one. We are also looking forward to more discussion about details, and if the community think it's the right way, we will be more than happy to contribute : )

@ingvagabund
Copy link
Contributor Author

We think this capability will eventually be meaningful and already implemented one. We are also looking forward to more discussion about details, and if the community think it's the right way, we will be more than happy to contribute : )

Few sig calls back we discussed the possibility of writing mini-keps. Turning the low level eviction mechanism into a plugin is a good example for it :) If there are good use cases where a user can benefit from a custom eviction mechanism. Would you be willing to write a proposal for this? The current proposal focuses mainly on integration with the new evacuation API (and annotation as an interim solution). Pluginazing the evictor itself is related, yet it deserves its own attention.

@songtao98
Copy link

Few sig calls back we discussed the possibility of writing mini-keps. Turning the low level eviction mechanism into a plugin is a good example for it :) If there are good use cases where a user can benefit from a custom eviction mechanism. Would you be willing to write a proposal for this? The current proposal focuses mainly on integration with the new evacuation API (and annotation as an interim solution). Pluginazing the evictor itself is related, yet it deserves its own attention.

For sure! We are certainly willing to draft a proposal for customizing the eviction mechanism. However, it might take some time as I need to organize some details and perform the necessary abstractions. BTW, regarding the mini-keps you mentioned, are there any specific format requirements? If not, I will follow the current KEP template.

|**Pod nomination and sorting**|Preferring pods with both `descheduler.alpha.kubernetes.io/request-evict-only` and `descheduler.alpha.kubernetes.io/eviction-in-progress` annotations.|Preferring pods that have a corresponding evacuation CR present.|
|**Cache reconstruction**|When a descheduler gets restarted and a new internal caches (with tracked annotations) cleared the descheduler lists all pods and populate the cache with any pod that has both annotation (`descheduler.alpha.kubernetes.io/request-evict-only` and `descheduler.alpha.kubernetes.io/eviction-in-progress`) present. In case only the first annotation is present but the eviction request was already created, the update event handler will catch the second annotation addition and the cache gets synced. In the worst case the limit of max number of pods evicted gets exceeded.|The descheduler waits until all evacution CRs are synced.|

Given the evucation API is still a work in progress without any existing implementation, the annotation based eviction can be seen as an v1alpha1 implemenation of this proposal. Integration with the evacuation API as an v1alphaN or beta implementation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the evucation API is still a work in progress without any existing implementation,

Typo here "evucation"

@ingvagabund ingvagabund force-pushed the evictions-in-background branch 3 times, most recently from 6fccd48 to 27acbd2 Compare May 14, 2024 11:34
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 14, 2024
This is where to call out areas of the design that require closure before deciding
to implement the design.

* The evacution API proposal mentions the evacuation instigator should remove

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe cancellation will not be needed? #1354 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Answered in the comment.

The descheduler will account for existing requests and update the internal counters accordingally.
In the future a mechanism for deleting too old evacuation requests can be introduced.
I.e. based on a new `--max-evacuation-request-ttl` option.
* The evacution API proposal mentions more than a one entity can request an eviction.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the descheduler have a use for a custom value of .spec.progressDeadlineSeconds? We are thinking of using a unified value (1800 (30m)) in all of the components.

Might be good to clarify deschedulers position on this.

As it currently stands, the entity who creates the Evacuation has a control of this value. The following instigators who do not create the object, do not have the controll.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are thinking of using a unified value (1800 (30m)) in all of the components.

We can make it the default and configurable if needed. Can an evacuation request be created without .spec.progressDeadlineSeconds? I.e. can we expected the default value will get set to a reasonable value/30m?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good idea, I will add a default of 30m to the API.

from the descheduler's point of view?
**Answer**: The first implementation will consider all pods with an existing
evacuation request as "already getting evacuated". Independent of the pod priorities.
* With evacution requests a plugin does not evict a pod right away.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to move to just using the Evacuations? Would it be simpler to reason about by the plugins?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you more elaborate on what you mean? In my mind evacuations = evacuation requests.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot what I meant originally.

Can we pass the Evacuation objects through the plugins then and have them decide on their state?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The eviction/evacuation is exposed/expected to be exposed through a generic interface. All the knowledge about how eviction/evacuation processes are implemented is currently expected to be exposed through filters/sorters. So no plugin implementation depends on a particular eviction/evacuation implementation. Some community adopters would like to turn the eviction/evacuation handle into a plugin as well.

@knelasevero
Copy link
Contributor

Do we consider the alpha implementation here to be an workaround until the until the evacuation API is fully integrated (in the future, in beta)?

I understand trying to deliver value earlier and then already start to get feedback, but have you checked with Feature Requesters and stakeholders if that is really what they want? Maybe it makes sense to start with the evacuation API integration from the start.

@knelasevero
Copy link
Contributor

In Filip's proposal (I need to read it in more detail, but I skimmed it already), there is this notion for Evacuator Prioritization and multiple evacuators (and a lot of other details not considered here). To me it sounds like when going from alpha implementation (here) to beta, and trying to assimilate the other complexities from the real evacuator API too much would change to call the graduation here a graduation from alpha to beta? Maybe it would be another alpha for a while before becoming beta? (or, again, as above comment, it would be worth to consider all those things from the start...)

- Evacuation API based evictions
- Identifying corner cases that are not addressed by the evacuation API
- `descheduler.alpha.kubernetes.io/request-evict-only` annotation is deprecated
- Workloads do not set `descheduler.alpha.kubernetes.io/request-evict-only` annotation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atiratree then also implementation around prioritization, detailed status reporting, and cancellation policies as per the Evacuation API? Then, not a beta graduation, right?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose, second alpha would be more appropriate.

Also, Workloads do not set descheduler.alpha.kubernetes.io/request-evict-only annotation is not actionable. Maybe s/do/should/ and document this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation will not get promoted to beta without the evacuation API integration. We will be rolling at least two v1alpha versions.

Copy link
Contributor Author
@ingvagabund ingvagabund Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Workloads do not set descheduler.alpha.kubernetes.io/request-evict-only annotation is not actionable. Maybe s/do/should/ and document this?

What about Workloads are not expected to set descheduler.alpha.kubernetes.io/request-evict-only annotation when deprecated?

2. Error code, resp. response text is checked to distinguish between
a genuine error and a confirmation of an eviction in background.

##### Evacuation API (v1alphaN or beta)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, i see v1alphaN is considered, I missed this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe more than considered it should be a must given the expected changes

Copy link
Contributor Author
@ingvagabund ingvagabund Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to "v1alphaN and beta"

@knelasevero
Copy link
Contributor

This looks great and is very detailed. I need to read both proposals a bit more deeply to give more insights, but wanted to give my initial thoughts so the KEP can go forward

annotation or confronting the internal caches.
2. Each descheduling plugin nominates a set of pods to be evicted.
3. All the nominated pods are sorted based on the eviction-in-progress first priority.
4. Pod eviction: the descheduler starts evicting nominated pods until a limit is reached

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the eviction going to be handled in each plugin separately as it is currently?

What are the guarantees that a plugin will use the Evacuator/Evictor correctly? E.g. calling pre-checks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the eviction going to be handled in each plugin separately as it is currently?

Yes. The current mechanism will be persisted.

What are the guarantees that a plugin will use the Evacuator/Evictor correctly? E.g. calling pre-checks?

None. Each plugin is given a handler for evicting/evacuating a pod. No pod will be aware whether an eviction or evacuation API is invoked under the hood.

By pre-checks you mean excluding pods with already existing evacuation requests?

##### Evacuation API (v1alphaN or beta)

1. Policy configuration
1. The cluster administrator configures the descheduler to enable the new functionality.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the descheduler kubernetes version backwards compatibility? There should be probably a check if the Evacuation API resource exists in the cluster

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We run a check for whether an eviction API is supported in

evictionPolicyGroupVersion, err := eutils.SupportEviction(rs.Client)
. I expect a similar check for the evacuation API.

all the internal counters to take into account all pods that are subjects
to background eviction. By listing evacuation requests or confronting the internal caches.
2. Each descheduling plugin nominates a set of pods to be evicted.
3. All the nominated pods are sorted based on the eviction-in-progress first priority.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: eviction in progress is recognized by a presence of Evacuation object

Copy link
Contributor Author
@ingvagabund ingvagabund Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. The descheduler is expected to read all existing evacuation objects/CRs and classify corresponding pods as "evacuation requested or in progress".

to background eviction. By listing evacuation requests or confronting the internal caches.
2. Each descheduling plugin nominates a set of pods to be evicted.
3. All the nominated pods are sorted based on the eviction-in-progress first priority.
4. Pod eviction: the descheduler creates an evacuation request for each nominated pod until a limit is reached

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implementation detail, but

  • it should add a evacuation.coordination.k8s.io/instigator_${EVACUATION_INSTIGATOR_SUBDOMAIN} finalizer to the Evacuation when creating it
  • if the Evacuation exists it should ensure the finalizer is present

- The upstream eviction API does not currently support evictions in background
implemented by external components. For that, there's no community
dedicated error code nor a response text for this type of eviction.
- The eviction in background error code will be temporary mapped to 429 (TooManyRequests).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see. No retries.

There is also a standing issue about the confusion of today's Eviction API request errors: kubernetes/kubernetes#106286. I am not sure if it is a good idea to add another behavior dependent on external component to that. At least having a distinguished code could make it more manegable

- Evacuation API based evictions
- Identifying corner cases that are not addressed by the evacuation API
- `descheduler.alpha.kubernetes.io/request-evict-only` annotation is deprecated
- Workloads do not set `descheduler.alpha.kubernetes.io/request-evict-only` annotation

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose, second alpha would be more appropriate.

Also, Workloads do not set descheduler.alpha.kubernetes.io/request-evict-only annotation is not actionable. Maybe s/do/should/ and document this?

- Feature implemented behind a feature flag
- Initial e2e tests completed and enabled
- Annotation based evictions
- Workloads do not set `descheduler.alpha.kubernetes.io/request-evict-only` annotation

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a bit confusing to see the deprecation in alpha

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That depends on how many alphas we will have. The annotation gets deprecated the moment the evacuation API is implemented as alpha. While we will need to wait until the evacuation API gets promoted to beta before we can promote the eviction-in-background to beta.


```go
type PodEvacuator interface {
EvacutePod(ctx context.Context, pod *v1.Pod) error

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what guarantee we have each plugin will call this interface? Isn't it better to override/change the Evictor interface?

Copy link
Contributor Author
@ingvagabund ingvagabund Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Evictor interface will get replaced by PodEvacuator interface under the hood. Currently, all plugins are expected to invoke <plugin>.handle.Evictor().Evict method. The method will invoke the evacuation API once available and get renamed to a more suitable name.

Comment on lines +482 to +487
* The evacuation API proposal mentions more than a one entity can request an eviction.
Should the descheduler take these into account as well?
What if another entity decides to evict a pod that is of a low priority
from the descheduler's point of view?
**Answer**: The first implementation will consider all pods with an existing
evacuation request as "already getting evacuated". Independent of the pod priorities.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The important part to decide for the descheduler is, if it wants to subscribe to evacuating these lower evacuations as well. This is done by setting the finalizer and prevents the evacuation cancellation.

Copy link
Contributor Author
@ingvagabund ingvagabund Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. This depends on how a user will see these evacuations. We could introduce a new option for this case so the descheduler sets the finalizer even for these pods and count them as "marked for eviction by the descheduler".

Have you considered a concept of a maximal number of pods that can be evacuated per time? A global setting for an entire cluster?

with `descheduler.alpha.kubernetes.io/eviction-in-progress` or creating evacuation requests).
2. Run the descheduler and observe only an expected amount of pods
is requested to be evicted/evacuated.
- `EvacuationTTL` for testing an evacuation request (with a single owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit confused, do we introduce the TTL for the first version or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see I have mentioned the TTL will not be implemented in the first version under Open Questions. Yet, v1alpha1 section mentions it will. I removed that part from v1alpha1.

v1alpha1 will not introduce the TTL mechanism. Yet, it might get implemented in later v1alpha versions.

@atiratree
Copy link

Do we consider the alpha implementation here to be an workaround until the until the evacuation API is fully integrated (in the future, in beta)?

I understand trying to deliver value earlier and then already start to get feedback, but have you checked with Feature Requesters and stakeholders if that is really what they want? Maybe it makes sense to start with the evacuation API integration from the start.

FIY, we are still discussing the Evacuation API feature and it was deferred from 1.31

@ingvagabund
Copy link
Contributor Author

Do we consider the alpha implementation here to be an workaround until the until the evacuation API is fully integrated (in the future, in beta)?

Yes

I understand trying to deliver value earlier and then already start to get feedback, but have you checked with Feature Requesters and stakeholders if that is really what they want? Maybe it makes sense to start with the evacuation API integration from the start.

I did. This is an important feature for KubeVirt. There's a strong interest to have this feature merged sooner than later. Waiting until the Evacuation API is implemented/alpha would be the best course of actions. Yet, it's not a blocker for implementing the first alpha of this feature.

@ingvagabund
Copy link
Contributor Author

In Filip's proposal (I need to read it in more detail, but I skimmed it already), there is this notion for Evacuator Prioritization and multiple evacuators (and a lot of other details not considered here). To me it sounds like when going from alpha implementation (here) to beta, and trying to assimilate the other complexities from the real evacuator API too much would change to call the graduation here a graduation from alpha to beta? Maybe it would be another alpha for a while before becoming beta? (or, again, as above comment, it would be worth to consider all those things from the start...)

Currently, the promotion to beta requires implementation of the evacuation API. Thus, we need to stay aligned with it and can not promote to beta until the evacuation API is beta.

@knelasevero
Copy link
Contributor

Do we consider the alpha implementation here to be an workaround until the until the evacuation API is fully integrated (in the future, in beta)?

Yes

I understand trying to deliver value earlier and then already start to get feedback, but have you checked with Feature Requesters and stakeholders if that is really what they want? Maybe it makes sense to start with the evacuation API integration from the start.

I did. This is an important feature for KubeVirt. There's a strong interest to have this feature merged sooner than later. Waiting until the Evacuation API is implemented/alpha would be the best course of actions. Yet, it's not a blocker for implementing the first alpha of this feature.

I know it is not a blocker. I see it more from the frame of:

Is this delivering enough value when implemented as a workaround to justify not waiting for the actual thing that you want to use? Is the feedback and initial users you will get now be worth implementing this, even if the thing will be quite different when the Evacuation API is here (in beta or when it is safe to use it or w/e)?

@ingvagabund
Copy link
Contributor Author
ingvagabund commented Jun 18, 2024

Is this delivering enough value when implemented as a workaround to justify not waiting for the actual thing that you want to use? Is the feedback and initial users you will get now be worth implementing this, even if the thing will be quite different when the Evacuation API is here (in beta or when it is safe to use it or w/e)?

Disclaimer: this is my personal comment and it does not have to reflect the current reality.

We don't have any participating initial users besides the discussions I have had with some KubeVirt representatives. So currently there's only one user. The feedback is currently limited and influenced by the evacuation API proposal. I don't have any solid data to support anything I have proposed in the current proposal. The design was crafted in a best-effort manner in my best conscience. Taking into account many what-if scenarios. We still don't have many participating contributors in the descheduler community to extend the discussions/reviews and hand craft a better design. Practically, any contribution from my side is done either in my free time of when I prioritize my work in my current company.

I am aligned with your perspective and acknowledge you asked the right and important questions. I see two options. To wait (hard to say for how long) and collect more feedback with an assumption the design will improve. Or, pin the current design, produce v1alpha1 implementation, collect the feedback and see how well/bad the design proves the be.

@knelasevero
Copy link
Contributor
knelasevero commented Jun 19, 2024

We still don't have many participating contributors in the descheduler community to extend the discussions/reviews and hand craft a better design

You focus on the design reviews not being comprehensive for lack of contributors, but that is actually an argument to wait, no? Since if you consider the repo to have less support than it needs it would be harder to maintain a workaround than it would be to maintain what you envision to be the final implementation (that would also start at alpha, but would not have big changes of core internals, while evolving initially).

To wait (hard to say for how long) and collect more feedback with an assumption the design will improve. Or, pin the current design, produce v1alpha1 implementation, collect the feedback and see how well/bad the design proves the be.

I also see these two options, and I see pros and cons coming from both of them. I think the workaround design is great, I wouldn't consider it to be bad. I think it was detailed and a lot has been considered. Just raised this discussion to be the devil's advocate and check other paths. This is an important feature for KubeVirt -> but how urgent, I think that's the question. Also do they know of the plan to migrate later to the evacuation API approach? Would they explicitly prefer not to wait, or did they just say they need it asap?

with an assumption the design will improve

But you also assume that, or the goal of migrating to the intended final implementation using the actual evacuation API (and ditching the previous one) would not be written down here in you EP.

@fossedihelm
Copy link

Reading the great discussion here! To me, the proposal looks good, and thank you for this effort!

This is an important feature for KubeVirt -> but how urgent, I think that's the question. Also, do they know of the plan to migrate later to the evacuation API approach? Would they explicitly prefer not to wait, or did they just say they need it asap?

Answering this question: Yes, this is an essential feature for KubeVirt.
Let's start from the beginning: live migration is a key feature of KubeVirt; it allows the workload not to be compromised. For this reason, it is chosen as the default behavior during an eviction, drain, etc...
KubeVirt is also shipped by default with Openshift, and currently, Descheduler and KubeVirt behavior are not compatible with each other. Without this workaround, a cluster admin would have to choose between having an active KubeVirt core feature and the Descheduler, and this would put them in a difficult situation that would be best avoided.
As expressed above this is a workaround that will be overridden with the evacuation API and KubeVirt will move to the evacuation API accordingly.
IOW, yes, KubeVirt side it would definitely be better to have this ASAP.
Thank you!

@knelasevero
Copy link
Contributor

Reading the great discussion here! To me, the proposal looks good, and thank you for this effort!

This is an important feature for KubeVirt -> but how urgent, I think that's the question. Also, do they know of the plan to migrate later to the evacuation API approach? Would they explicitly prefer not to wait, or did they just say they need it asap?

Answering this question: Yes, this is an essential feature for KubeVirt. Let's start from the beginning: live migration is a key feature of KubeVirt; it allows the workload not to be compromised. For this reason, it is chosen as the default behavior during an eviction, drain, etc... KubeVirt is also shipped by default with Openshift, and currently, Descheduler and KubeVirt behavior are not compatible with each other. Without this workaround, a cluster admin would have to choose between having an active KubeVirt core feature and the Descheduler, and this would put them in a difficult situation that would be best avoided. As expressed above this is a workaround that will be overridden with the evacuation API and KubeVirt will move to the evacuation API accordingly. IOW, yes, KubeVirt side it would definitely be better to have this ASAP. Thank you!

Thanks! Having your direct input here is important. Thanks for chiming in!

@ingvagabund
Copy link
Contributor Author

I am currently refactoring the current code base so at the end the changes required to implement v1alpha1 of this KEP are minimal. The current PoC is under https://github.com/ingvagabund/descheduler/tree/eviction-in-background-code-2024-07-02.

Copy link
Contributor
@damemi damemi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me, and the evacuation API does seem like a more graceful evolution of the eviction API.

One question though, we have had requests for more immediate pod deletion (like #1253), and this seems to go a step further away from that. Should we have any continued support for direct eviction requests? Or is eviction going away entirely?

### Annotation vs. evacuation API based eviction

The [evacuation API](https://github.com/kubernetes/enhancements/pull/4565) is expected
to replace the [eviction API](https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any timeline on deprecation for the eviction API?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet. It will strongly depends on when the evacuation API graduates.

**Answer**: The first implementation will not reset any evacuation request.
The descheduler will account for existing requests and update the internal counters accordingally.
In the future a mechanism for deleting too old evacuation requests can be introduced.
I.e. based on a new `--max-evacuation-request-ttl` option.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a policy config field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. A policy config field is versioned. With a feature gate mechanism alpha feature fields can be ignored.

@damemi
Copy link
Contributor
damemi commented Jul 16, 2024

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damemi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 16, 2024
@ingvagabund
Copy link
Contributor Author

Thank you Mike for going through the proposal.

Should we have any continued support for direct eviction requests? Or is eviction going away entirely?

@atiratree will know better though based on my understand the eviction API will not go away. Instead, users will still have option to choose between the evacuation (delayed eviction) and the eviction APIs. Yet, with strong recommendation to prefer the evacuation API as it will serve as a cluster scoped coordinator for any eviction.

Wrt. to #1253 we will need to decide whether the descheduler will eventually switch to the evacuation API by default or not. Yet, we will still keep the direct eviction API and any potential immediate deletion configurable. It will take some time before the evacuation API becomes GA.

@atiratree
Copy link

@atiratree will know better though based on my understand the eviction API will not go away. Instead, users will still have option to choose between the evacuation (delayed eviction) and the eviction APIs. Yet, with strong recommendation to prefer the evacuation API as it will serve as a cluster scoped coordinator for any eviction.

@ingvagabund yes the eviction API is not going away. A better version of it (e.g. Evacuation API, etc.) is still being discussed upstream in sig-apps and should eventually be used by all the major components like the descheduler, cluster autoscaler, etc. Normal users do not really use eviction much.

@atiratree
Copy link

in accordance with the above consensus
/lgtm

@k8s-ci-robot
Copy link
Contributor

@atiratree: changing LGTM is restricted to collaborators

In response to this:

in accordance with the above consensus
/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ingvagabund ingvagabund added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 24, 2024
@k8s-ci-robot k8s-ci-robot merged commit f3569b5 into kubernetes-sigs:master Jul 24, 2024
9 checks passed
@ingvagabund ingvagabund deleted the evictions-in-background branch July 24, 2024 18:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KEP-1397: descheduler integration with evacuation API as an alternative to eviction API
8 participants