xds: "Cluster not available" can cause backoff retries of [0,1s) for xDS requests using an Envoy cluster #27702

abeyad · 2023-05-30T16:44:55Z

When the Bootstrap config has xDS configured to use Envoy gRPC, we can get into a situation where we get a Cluster not available error when trying to establish a new gRPC stream.

This happens when the xDS cluster is not yet initialized, and the cluster that depends on it (e.g. for ADS or SDS using Envoy gRPC) gets initialized first, and attempts to create a connection to a not-yet-initialized cluster where the cluster is not found in the thread-local storage for the main thread.

When the cluster is not found, the gRPC stream schedules a retry attempt for some jittered back-off delay of up to 1 second. Typically, the xDS cluster is initialized by then, so the subsequent retry attempt to establish the gRPC stream, which gets triggered when the timer expires, ends up being successful.

This is not ideal, especially for Envoy Mobile, where we wouldn't want to wait for up to 1 second just for xDS initialization in the app.

This scenario can be triggered under two circumstances:

The static xDS cluster appears after the cluster that depends on it in the Bootstrap config's static_resources repeated field ordering.
We use ADS, where the gRPC service is using Envoy gRPC (i.e. an Envoy cluster in the static_resources field) and the cluster upon which ADS depends is not guaranteed to be fully warmed during initialization. The ClusterManagerImpl calls ads_mux_->start() after adding primary clusters, but depending on the cluster type, it may not yet be warmed when the ClusterManagerInitHelper::addCluster() call finishes executing. Examples of cluster types that will call the ADS mux's start(), even if wait_for_warm_on_init is set, include the STRICT_DNS cluster, which has to make asynchronous calls to resolve the DNS entries for the destination hosts.

For scenario 1, there are a few options to solve the issue:

Have the ClusterManagerImpl go through a first pass where it figures out the dependencies between clusters, then initializes them based on that ordering.
- This is the most sensible in terms of API experience
- But adds quite a bit of complexity to already complex code and will likely slow down the time it takes to get all the clusters initialized since we'll have to go through a first pass to figure out dependencies (and at that point, maybe it's not so much more efficient than just the backoff retry mechanism?)
Have a way for clusters to declare their dependencies on each other in the Bootstrap config.
- But this requires a new configuration knob in the Bootstrap and still requires the API user to do something to declare dependencies, which isn't much different than the third option which is...
Require that clusters that depend on a static xDS cluster are ordered after the cluster they depend upon in the Bootstrap's static_resources field.
- This would require an update to documentation and making that info well known to xDS users.
- Update the tests that set the correct cluster ordering in the Bootstrap config. tests: Eliminate retry backoffs in the SDS integration tests #27679 is an example of this.

For scenario 2, there are a couple options to "solve" the issue:

Move the ads_mux_->start() call to after the initialization is complete for the cluster that ADS is configured with. This will require a first pass to determine cluster dependencies, and adding the ADS mux initialization to the post-cluster-init callback if ADS depends on it.
Just have a caveat for the time being that ADS configured with an Envoy cluster that requires asynchronous initialization will result in a backoff retry that could delay app initialization. For the ADS use cases with Google's Traffic Director that we are thinking of, it will depend on Google gRPC having a target URI which doesn't have the same retry backoff problem that Envoy gRPC does.

Another option that applies to both scenarios is to tighten the backoff timer interval (e.g. to 500ms max instead of 1s).

The text was updated successfully, but these errors were encountered:

abeyad · 2023-05-30T16:45:33Z

cc @adisuissa @alyssawilk

Clarification for a workaround to #27702. Signed-off-by: Ali Beyad <abeyad@google.com>

…y#27706) Clarification for a workaround to envoyproxy#27702. Signed-off-by: Ali Beyad <abeyad@google.com> Signed-off-by: Ryan Eskin <ryan.eskin89@protonmail.com>

This commit addresses an issue outlined in envoyproxy#27702 where, when configuring xDS to use an ADS gRPC mux connecting to an Envoy cluster (via EnvoyGrpc), if the cluster requires asynchronous initialization (e.g. a DNS cluster that must asynchronously resolve the endpoints before it can be used), then the ADS mux attempted to initialize before the cluster was ready. This caused the underlying GrpcStream to fail to establish the connection and set off a retry timer for [0,1s). Instead, this commit ensures that we don't try initializing the ADS mux and connecting to the xDS backend until the Envoy cluster for the xDS backend is fully initiliazed and ready. As part of testing this change, we had to move some of the ClusterManagerImpl initialization functionality to a separate init() method, instead of running it all in the constructor. Running all of the ClusterManagerImpl's lifecycle from the constructor prevents derived classes from overriding any of the functionality (e.g. to override certain functions in tests), because when running in the base class's constructor, only base class methods are executed, not the derived class's. Signed-off-by: Ali Beyad <abeyad@google.com>

…30215) This commit addresses an issue outlined in #27702 where, when configuring xDS to use an ADS gRPC mux connecting to an Envoy cluster (via EnvoyGrpc), if the cluster requires asynchronous initialization (e.g. a DNS cluster that must asynchronously resolve the endpoints before it can be used), then the ADS mux attempted to initialize before the cluster was ready. This caused the underlying GrpcStream to fail to establish the connection and set off a retry timer for [0,1s). Instead, this commit ensures that we don't try initializing the ADS mux and connecting to the xDS backend until the Envoy cluster for the xDS backend is fully initiliazed and ready. As part of testing this change, we had to move some of the ClusterManagerImpl initialization functionality to a separate init() method, instead of running it all in the constructor. Running all of the ClusterManagerImpl's lifecycle from the constructor prevents derived classes from overriding any of the functionality (e.g. to override certain functions in tests), because when running in the base class's constructor, only base class methods are executed, not the derived class's. It turns out that several tests in cluster_manager_impl_test.cc depended on a derived class of ClusterManagerImpl which offered mocked behavior, but the overridden methods in the derived class were actually never getting invoked (because they were called from the base ClusterManagerImpl's constructor). We fix those tests in this commit too. Risk Level: Low Testing: unit test Docs Changes: N/A Release Notes: Fixed a bug (#27702) that caused ADS initialization to fail on the first attempt and set a back-off retry interval of up to 1 second, if ADS is using an Envoy Cluster for the backend. The issue was fixed to ensure that ADS initialization happens after the Envoy Cluster it depends upon has been properly initialized. ADS that does not depend on an Envoy Cluster (i.e. GoogleGrpc) is not affected by this change. Platform Specific Features: N/A Signed-off-by: Ali Beyad <abeyad@google.com>

abeyad added bug area/xds labels May 30, 2023

This was referenced May 30, 2023

tests: Eliminate retry backoffs in the SDS integration tests #27679

Merged

docs: Add info about xDS cluster ordering in the Bootstrap #27706

Merged

abeyad added the no stalebot Disables stalebot from closing an issue label May 31, 2023

htuch pushed a commit that referenced this issue Jun 2, 2023

docs: Add info about xDS cluster ordering in the Bootstrap (#27706)

517d2db

Clarification for a workaround to #27702. Signed-off-by: Ali Beyad <abeyad@google.com>

abeyad self-assigned this Oct 3, 2023

abeyad mentioned this issue Oct 14, 2023

xds: Ensures the xDS cluster is initialized before starting ADS mux #30215

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

xds: "Cluster not available" can cause backoff retries of [0,1s) for xDS requests using an Envoy cluster #27702

xds: "Cluster not available" can cause backoff retries of [0,1s) for xDS requests using an Envoy cluster #27702

xds: "Cluster not available" can cause backoff retries of [0,1s) for xDS requests using an Envoy cluster #27702

xds: "Cluster not available" can cause backoff retries of [0,1s) for xDS requests using an Envoy cluster #27702

Comments