
max_retries circuit breaker vs. dynamic cluster selection #17412

Closed
markdroth opened this issue Jul 19, 2021 · 11 comments
Labels: area/cluster, area/retry, question

Comments

@markdroth (Contributor)

There are two cases where the cluster is dynamically determined at load-balancing time:

  1. Aggregate clusters.
  2. Dynamic cluster selection via the new cluster_specifier_plugin extension point added to RouteAction in #16944.

In these cases, it's not clear how the max_retries circuit breaker can work, because max_retries is configured on a per-cluster basis, but when the cluster is being dynamically determined, the cluster is not known until after the retry code runs. And, in fact, the chosen cluster may be different for each retry attempt.

For aggregate clusters, one possible solution might be to say that we use the max_retries value from the aggregate cluster itself instead of the one from the chosen underlying cluster. As discussed in #13134, we agreed that for aggregate clusters, the aggregate cluster should control the LB policy, but all other functionality -- including circuit breakers -- should be controlled by the underlying clusters that the aggregate cluster points to. In other words, if aggregate cluster A points to underlying clusters B and C, then the circuit breakers configured for B or C should be used depending on whether the aggregate cluster chooses to send a given request to B or C; the circuit breaker limits configured for cluster A itself are basically ignored. However, we could change that to make the max_retries circuit breaker one specific exception, since it simply does not make sense for it to come from the underlying cluster.

Unfortunately, that approach would not work for the dynamic cluster selection case, because in that case there is no aggregate cluster that is configured to begin with.

Another possible option would be to add a per-route setting that overrides the max_retries circuit breaker setting from the cluster, which could be used in cases where the cluster is dynamically determined. However, this would also allow overriding the setting when the cluster is not dynamically determined, meaning that circuit breakers would no longer be configured in just one place for each cluster, which seems a little sub-optimal. On the other hand, retries are actually implemented on Envoy's downstream side, so configuring the retry circuit breaker on the upstream side seems a little counter-intuitive as well.

Anyone have any other suggestions on a better way to handle this?

CC @htuch @yxue @mattklein123 @ejona86 @dfawley @dapengzhang0 @donnadionne @AndresGuedez

markdroth added the enhancement and triage labels on Jul 19, 2021
@htuch (Member) commented Jul 20, 2021

Not a super expert on this code, but I think max_retries is a property of the cluster, not the request, i.e. it's the maximum number of concurrent retries in flight allowed across all requests. So, you determine the cluster and then check whether that cluster has any retry budget available?
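
To make that semantic concrete, here is a minimal standalone sketch (plain C++, not Envoy's actual classes; RetryBudget, tryAcquire, and release are hypothetical names) of a per-cluster limit on the number of concurrent retries in flight:

    // Hypothetical illustration of a per-cluster "max concurrent retries" gate;
    // this is not Envoy code.
    #include <atomic>
    #include <cstdint>

    class RetryBudget {
    public:
      explicit RetryBudget(uint32_t max_retries) : max_(max_retries) {}

      // Called before scheduling a retry attempt; returns false if the cluster
      // already has max_retries retries in flight.
      bool tryAcquire() {
        uint32_t current = active_.load();
        while (current < max_) {
          if (active_.compare_exchange_weak(current, current + 1)) {
            return true;
          }
        }
        return false;
      }

      // Called when the retry attempt finishes, regardless of outcome.
      void release() { active_.fetch_sub(1); }

    private:
      const uint32_t max_;
      std::atomic<uint32_t> active_{0};
    };

The key point is that the counter belongs to the cluster, so all requests retrying against the same cluster share one budget; nothing here is per-request.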

htuch added the area/cluster, area/retry, and question labels and removed the enhancement and triage labels on Jul 20, 2021
@snowp (Contributor) commented Jul 20, 2021

Yeah, what @htuch says seems like the most natural way for this to work, from what I understand. Is there a reason why we have to make the retry decision before we find out which cluster we'd be targeting? I can understand this from the aggregate cluster point of view, since it's a dummy cluster that might hide the fact that we are selecting different clusters. But if the router is the one deciding which cluster to route to, can't it just make this selection, check the circuit breakers, and only issue the retry if the budget allows for it?

@markdroth (Contributor, Author)

I'm not that familiar with Envoy's code here, but I don't think the router is the one deciding which cluster will be used. From the discussion in #13134, my understanding is that in the aggregate cluster case, the router thinks it's just choosing a single cluster, but that cluster happens to be an aggregate cluster that redirects to another cluster from within its LB policy, so I believe that that underlying redirection would not be visible to the router. And although the dynamic cluster selection plugin mechanism associated with #16944 is not yet implemented, I would think it would work in a similar way.

Regardless of how this is implemented in Envoy, this is definitely a problem for gRPC. In our implementation, both aggregate clusters and dynamic cluster selection are implemented in LB policies. The retry code sits in front of the LB code, and it does not have visibility into the cluster that gets chosen by the LB policy. It's non-trivial for the retry code to first pick a cluster and then decide whether to retry. (If we really have no other option, we can think about ways we can restructure our code to make this work, but it wouldn't be my first choice.)

Also, even if the retry code could pick the cluster and then check the circuit breaker, I think it would require changing the semantics of the circuit breaker. Currently, the circuit breaker limits the number of requests that are doing retries, but not the actual number of retry attempts: the count of active retries is checked and incremented before the first retry attempt for a given request and then decremented when the request is committed to a particular retry attempt (i.e., no more attempts will be started for that request). However, that counter is a per-cluster counter, which means that approach doesn't work when different attempts go to different clusters, especially if hedging is configured. The only alternative I see would be to change the circuit breaker semantics such that each individual attempt checks and increments the counter for the target cluster it is assigned to, which would basically mean that the counter would count the number of retry attempts instead of the number of requests that are doing retries. And it's not clear to me that that semantic is actually helpful, because if the intent of the circuit breaker is to protect the individual cluster, then it wouldn't really make sense to throttle an attempt to cluster B if all previous attempts for that request went to cluster A -- from cluster B's perspective, the first attempt it sees is the initial request, not a retry.

It seems like we've built a few different pieces that don't play well together here. I think it might be useful to take a step back and figure out how we want this to work from first principles.

@mattklein123 (Member)

The existing code and its intention are as @snowp says: limit the number of concurrent retries to a particular cluster at a given time. I think this works fine for the dynamic case. In the aggregate case I think it's fine either way (the actual cluster or the aggregate cluster holds the circuit breaker).

It seems like we've built a few different pieces that don't play well together here. I think it might be useful to take a step back and figure out how we want this to work from first principles.

From the Envoy side I think this all works out OK. If there are issues with the API as it maps to gRPC we can discuss that separately, but I think changing the Envoy semantics outright is going to be difficult/impossible due to existing deployments, so it would have to be something new if you want that.

@markdroth (Contributor, Author)

Can you help me understand how this actually works in the implementation? My understanding was that the router just used the aggregate cluster as any other cluster, using its LB policy to pick a host, and then sending the traffic to that host. Can you show me where in the code the retry logic checks the cluster of the resulting host?

Or are you saying that the code may not be doing this correctly right now but that that's how we should fix it to work? If the latter, where would that change be made? I'm just trying to get an idea of the objects involved here.

One thing I did notice while looking at this is that it looks like my previous understanding of the circuit breaker semantics was incorrect. It looks like Envoy does actually increment and decrement the counter for each individual attempt, not for the request as a whole. So that piece at least isn't a problem here.

@mattklein123 (Member)

Here is where the retry circuit breakers are consulted:

if (!cluster_.resourceManager(priority_).retries().canCreate()) {

A RetryStateImpl is created for every request when the cluster is selected. The same cluster is used for all retries.
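
As a rough illustration of that flow (a simplified, single-threaded sketch; ClusterInfo and RetryState below are stand-ins, not Envoy's real types), the retry state is bound to whichever cluster routing selected for the request, and every retry for that request is checked and charged against that one cluster, mirroring the canCreate() call quoted above:

    #include <cstdint>
    #include <memory>

    // Illustrative stand-ins, not Envoy's real types.
    struct ClusterInfo {
      uint32_t max_retries = 3;
      uint32_t active_retries = 0;  // retries currently in flight to this cluster
    };

    class RetryState {
    public:
      // Created once per request, at the point the route's cluster is selected.
      explicit RetryState(std::shared_ptr<ClusterInfo> cluster)
          : cluster_(std::move(cluster)) {}

      // Mirrors the canCreate() check: admit a retry only if this cluster still
      // has budget, and account for the retry against this same cluster.
      bool tryScheduleRetry() {
        if (cluster_->active_retries >= cluster_->max_retries) {
          return false;  // retry circuit breaker is open for this cluster
        }
        ++cluster_->active_retries;
        return true;
      }

      void onRetryFinished() { --cluster_->active_retries; }

    private:
      // Fixed for the lifetime of the request; all retries consult this one cluster.
      std::shared_ptr<ClusterInfo> cluster_;
    };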

I don't know what the aggregate cluster does in terms of either maintaining its own circuit breaker or forwarding it to the real cluster. IMO either is fine as long as it is documented.

I think this all works correctly today at least as far as it was originally designed.

@markdroth (Contributor, Author)

Currently, I suspect that the aggregate cluster just uses the max_retries value from the aggregate cluster itself, not from the underlying clusters. However, the semantics we agreed upon in #13134 say it should do the latter, which means we'll need to make changes here.

Does the current implementation increment the counter of active retries before or after it asks the LB policy for a host? If it does so before, how can we change this to use the max_retries setting for the underlying cluster, since we need to increment the counter before we pick a host, and we don't know the underlying cluster until after we've picked that host?

@mattklein123 (Member)

Currently, I suspect that the aggregate cluster just uses the max_retries value from the aggregate cluster itself, not from the underlying clusters. However, the semantics we agreed upon in #13134 say it should do the latter, which means we'll need to make changes here.

Sounds good, I think that should be pretty straightforward.

Does the current implementation increment the counter of active retries before or after it asks the LB policy for a host? If it does so before, how can we change this to use the max_retries setting for the underlying cluster, since we need to increment the counter before we pick a host, and we don't know the underlying cluster until after we've picked that host?

In Envoy the flow is Cluster -> LB -> Host. So, retries happen before host selection via the LB. This clearly works for Envoy, but if it's not going to work for gRPC, we will have to think about how to reconcile this from an API perspective, which might be tricky.
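
A condensed, self-contained sketch of that ordering (hypothetical names, not Envoy's actual call chain), illustrating why the retry budget has to be charged before the LB pick, i.e. before an aggregate cluster or a cluster_specifier_plugin has resolved the underlying cluster:

    #include <iostream>
    #include <optional>
    #include <string>

    // Hypothetical types and functions, for illustration only.
    struct Cluster { std::string name; int max_retries = 3; int active_retries = 0; };
    struct Host { std::string address; };

    // Routing selects a cluster; this may be an aggregate cluster or one chosen
    // by a cluster_specifier_plugin, i.e. not the final target cluster.
    Cluster& selectClusterFromRoute() {
      static Cluster routed{"aggregate_cluster_A"};
      return routed;
    }

    // max_retries circuit breaker: charged against the routed cluster.
    bool chargeRetryBudget(Cluster& c) {
      if (c.active_retries >= c.max_retries) return false;
      ++c.active_retries;
      return true;
    }

    // LB pick: for an aggregate cluster, this is where the underlying cluster
    // (e.g. B or C) would actually be chosen, after the budget is already charged.
    Host pickHost(const Cluster& c) { return Host{"10.0.0.1:443 via " + c.name}; }

    std::optional<Host> scheduleRetryAttempt() {
      Cluster& cluster = selectClusterFromRoute();
      // 1. Retry admission happens first, against the routed cluster.
      if (!chargeRetryBudget(cluster)) {
        return std::nullopt;  // over the max_retries limit; do not retry
      }
      // 2. Only now does the LB pick run and resolve the underlying cluster/host.
      return pickHost(cluster);
    }

    int main() {
      if (auto host = scheduleRetryAttempt()) {
        std::cout << "retry scheduled to " << host->address << "\n";
      }
    }

Under these assumptions, by the time pickHost() knows which underlying cluster is actually being used, the max_retries counter of the routed (possibly aggregate) cluster has already been charged.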

@markdroth (Contributor, Author)

gRPC's structure is basically the same: retries happen before the LB pick. But that's exactly the problem I'm trying to point out; I think there's a chicken-and-egg problem here.

If I'm understanding right (and please correct me if I'm not), I think the following statements are both true:

  • The retry counter needs to be incremented before the LB pick.
  • The aggregate cluster does not determine which underlying cluster it will send an attempt to until it does the LB pick.

Assuming that both of those statements are true in Envoy's current implementation, then I don't see how we can change the aggregate cluster behavior to use the circuit breaker from the underlying cluster, because we don't know which cluster's counter to update until after we do the LB pick. Am I missing something here?

@mattklein123 (Member)

Assuming that both of those statements are true in Envoy's current implementation, then I don't see how we can change the aggregate cluster behavior to use the circuit breaker from the underlying cluster, because we don't know which cluster's counter to update until after we do the LB pick. Am I missing something here?

Nope, you are not missing anything. I agree this won't work for the aggregate cluster case. IMO trying to fix this would be a huge amount of work for dubious benefit, so I would recommend just swapping what we do for the aggregate cluster. :)

@markdroth (Contributor, Author)

Okay. It does seem a little inconsistent to say that the max_retries circuit breaker comes from the aggregate cluster and the other circuit breakers come from the underlying cluster, but I guess that's a natural consequence of the fact that retries are handled in the router filter instead of in the cluster itself. I'll make a note about this in #13134.

For the broader issue here, we've discussed this on the gRPC side and decided not to support the max_retries circuit breaker for now, since we don't seem to have a very clean way to handle it for the dynamic cluster selection functionality that we're about to implement. While this remains a design problem, it's not something we have to tackle immediately, so I'll go ahead and close this.

Thanks for the discussion!
