

Use better defaults for poll interval and max reconcile rate #203

Merged

Conversation

turkenh
Collaborator
@turkenh turkenh commented Feb 20, 2024

Description of your changes

We have observed that this provider struggles to process managed resources beyond a certain (not that high) scale, no matter how much CPU or memory we allocate to the process. After some investigation, we found that there are two primary reasons for this:

  • The --max-reconcile-rate parameter, which is also used to configure other internal parameters, prevents processing more than ~10 resources per second when left at its default of 10 (see the sketch after the results table below).
  • The --poll parameter, with its default of 1m, re-adds all resources to the queue for drift detection every minute.

We had never run scaling experiments for this provider; we simply started with some numbers. This PR tweaks the defaults for the poll interval and max reconcile rate parameters based on the following scaling experiments.

Using the setup linked below, I created 1000 and 10000 Objects with different poll interval and max-reconcile-rate settings. After giving the system some time to stabilize, I created one more Object and recorded the time until it was processed for the first time, i.e. the time until first reconcile, as a measure of system responsiveness.

poll, max-reconcile-rate  | With 1000 Objects | With 10000 Objects
------------------------- | ----------------- | ------------------
1m, 10 (current default)  | 2m 22s            | 28m
1m, 100                   | under 1s          | 1m
10m, 100 (proposed)       | under 1s          | under 1s

Please note that the proposed defaults for these two settings are also aligned with Upbound's official providers, e.g. provider-aws.
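For reference, below is a minimal sketch of what the proposed defaults could look like in the provider's entrypoint, and of how --max-reconcile-rate typically feeds both reconcile concurrency and the global rate limiter. The kingpin setup, flag help text, and controller option names are assumptions based on how Crossplane providers are usually wired, not the exact code in this repository.

package main

import (
	"os"

	"gopkg.in/alecthomas/kingpin.v2"

	xpcontroller "github.com/crossplane/crossplane-runtime/pkg/controller"
	"github.com/crossplane/crossplane-runtime/pkg/ratelimiter"
)

func main() {
	app := kingpin.New("provider-kubernetes", "A Crossplane provider for Kubernetes objects.")

	// The two flags touched by this PR, shown with the proposed defaults.
	pollInterval := app.Flag("poll", "How often an individual resource is checked for drift.").Default("10m").Duration()
	maxReconcileRate := app.Flag("max-reconcile-rate", "Global maximum rate per second at which resources may be checked for drift.").Default("100").Int()
	kingpin.MustParse(app.Parse(os.Args[1:]))

	// max-reconcile-rate typically configures both the number of concurrent
	// reconciles and the global rate limiter, which is why the old default of
	// 10 capped throughput at roughly 10 resources per second. These options
	// would normally be passed to the controllers' Setup functions.
	_ = xpcontroller.Options{
		PollInterval:            *pollInterval,
		MaxConcurrentReconciles: *maxReconcileRate,
		GlobalRateLimiter:       ratelimiter.NewGlobal(*maxReconcileRate),
	}
}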

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable test to ensure this PR is ready for review.

How has this code been tested?

See the experiment setup here: https://github.com/turkenh/provider-kubernetes-scalability

Signed-off-by: Hasan Turken <turkenh@gmail.com>
@turkenh
Collaborator Author
turkenh commented Feb 20, 2024

Some metrics from the experiments:

1m, 10: (1k Objects, 10k Objects)
Screenshot 2024-02-20 at 11 29 45

1m, 100: (1k Objects, 10k Objects)
Screenshot 2024-02-20 at 11 30 23

10m, 100: (1k Objects, 10k Objects)
Screenshot 2024-02-20 at 11 30 49

One thing to note here: it looks like we don't get a clear signal about responsiveness from metrics like workqueue_depth. I suspect crossplane-runtime's rate-limiting reconciler drops objects from the queue without really processing them, which prevents us from observing the number of items waiting to be processed (I'll create a separate issue for that).

Collaborator
@lsviben lsviben left a comment


Thanks @turkenh for the in-depth performance checks. This change seems sensible to me based on the results.

Using the setup linked below, I created 1000 and 10000 Objects with different poll interval and max-reconcile-rate settings. After giving the system some time to stabilize

I have a question about this: "After giving the system some time to stabilize". Did you also notice differences in the time needed for the system to stabilize? I would assume it would be faster with the new values, but maybe we would also be hitting k8s api-server throttling due to more requests?

@lsviben
Collaborator
lsviben commented Feb 20, 2024

I realized you can see this on the last graphs, so nvm :)

@turkenh
Collaborator Author
turkenh commented Feb 20, 2024

After further testing I found a problem with just bumping the poll interval. It works fine for the default ReadinessPolicy, i.e. SuccessfulCreate, but when we choose DeriveFromObject it delays readiness of the Object until the next poll, which goes up to 10 minutes with this PR.

I believe we need to find a way to differentiate the poll interval from initial readiness. Maybe this is something to implement in the managed reconciler, e.g. if the object is not ready, requeue with a shorter delay; otherwise fall back to the poll interval.

@negz
Member
negz commented Feb 20, 2024

Maybe this is something to implement in the managed reconciler, e.g. if the object is not ready, requeue with a shorter delay; otherwise fall back to the poll interval.

@turkenh the managed resource reconciler itself has no concept of MR readiness. That's implemented inside each InternalClient. That said, I think you could use the (relatively new) PollIntervalHook to do this:

https://github.com/crossplane/crossplane-runtime/blob/7fcb8c5/pkg/reconciler/managed/reconciler.go#L556

For example:

WithPollIntervalHook(func(mg resource.Managed, interval time.Duration) time.Duration {
    if mg.GetCondition(xpv1.TypeReady).Status == corev1.ConditionTrue {
        return interval // If it's ready, use the poll interval from the flag.
    }
    return 30 * time.Second // If it's not, requeue sooner.
})
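For illustration, a minimal sketch of how such a hook might be wired into the provider's controller setup. The Object kind reference and the other options shown here are placeholders, not the exact code in this repository; objectGVK, connecter, logger, pollInterval, and hook are assumed to be defined during setup.

r := managed.NewReconciler(mgr,
    resource.ManagedKind(objectGVK),          // placeholder for this provider's Object GroupVersionKind
    managed.WithExternalConnecter(connecter), // placeholder for the provider's external connecter
    managed.WithLogger(logger),
    managed.WithPollInterval(pollInterval),
    managed.WithPollIntervalHook(hook),       // the hook sketched above
)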

For this provider I think it would be ideal to support watches rather than relying on polling. I'd certainly feel better about a long poll interval like 10m if I knew it was really just a backup to a watch. That said, I imagine it won't be straightforward to build given that right now our Kubernetes clients don't persist across reconciles. I imagine we'd have to switch to using a pool of clients.

@turkenh
Collaborator Author
turkenh commented Feb 20, 2024

That said, I think you could use the (relatively new) PollIntervalHook to do this:

Thanks @negz, this is exactly where I found myself after checking the managed reconciler code. Great that we already have this flexibility. The only thing I didn't like is that the pollJitter implementation already uses that function, so I had to leak that code here:

if mg.GetCondition(xpv1.TypeReady).Status == v1.ConditionTrue {
    // This is the same as runtime default poll interval with jitter, see:
    // https://github.com/crossplane/crossplane-runtime/blob/7fcb8c5cad6fc4abb6649813b92ab92e1832d368/pkg/reconciler/managed/reconciler.go#L573
    return pollInterval + time.Duration((rand.Float64()-0.5)*2*float64(pollJitter)) //#nosec G404 -- no need for secure randomness
Member


Suggested change
- return pollInterval + time.Duration((rand.Float64()-0.5)*2*float64(pollJitter)) //#nosec G404 -- no need for secure randomness
+ return pollInterval + time.Duration((rand.Float64()-0.5)*2*float64(pollJitter)) //nolint G404 // No need for secure randomness

Nit: let's use golint style disable comments here (I know this was just copied from upstream).

Collaborator Author


Done. I also applied jitter to the 30-second requeue interval.
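For context, a sketch of what the resulting hook might look like with jitter applied to both branches. This is illustrative only and not necessarily the exact code merged in this PR; pollJitter is assumed to be computed during setup, as in crossplane-runtime, and the 30-second value is the shorter requeue delay discussed above.

managed.WithPollIntervalHook(func(mg resource.Managed, interval time.Duration) time.Duration {
    if mg.GetCondition(xpv1.TypeReady).Status == corev1.ConditionTrue {
        // Ready: use the configured poll interval with the same jitter the
        // runtime would apply by default.
        return interval + time.Duration((rand.Float64()-0.5)*2*float64(pollJitter)) //nolint:gosec // No need for secure randomness.
    }
    // Not ready yet (e.g. ReadinessPolicy DeriveFromObject still waiting for
    // readiness): requeue sooner than the long poll interval, also with jitter.
    return 30*time.Second + time.Duration((rand.Float64()-0.5)*2*float64(pollJitter)) //nolint:gosec // No need for secure randomness.
})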

@negz
Member
negz commented Feb 20, 2024

The only thing I didn't like is that the pollJitter implementation already uses that function, so I had to leak that code here.

I think that's fine.

Signed-off-by: Hasan Turken <turkenh@gmail.com>