
Spanner: We may not be retrying requests when WithSessionExpiryChecking() fails with UNAVAILABLE #5977

Closed
skuruppu opened this issue Feb 24, 2021 · 8 comments · Fixed by #6013, #6015 or #6147
Labels
- api: spanner (Issues related to the Spanner API)
- priority: p1 (Important issue which blocks shipping the next release. Will be fixed prior to next release.)
- 🚨 (This issue needs some love.)
- type: question (Request for information or clarification. Not an issue.)

Comments

@skuruppu
Contributor

A customer reported an issue where they are using SpannerDataReader and see the following error in their logs:

Exception details: google.Cloud.Spanner.Data.SpannerException: The service is currently unavailable. This is a most likely a transient condition.

---> Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="GOAWAY received", DebugException="Grpc.Core.Internal.CoreErrorDetailException:

There doesn't seem to be any impact on the workload from this error, other than things taking a long time when it happens. Looking at the code, an UNAVAILABLE error is caught and retried, but only if hasNext is computed and _safeToRetry is set.
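For context, that check is shaped roughly like the sketch below (a simplified, hypothetical reconstruction rather than the actual SqlResultStream code; every name except _safeToRetry and hasNext is a placeholder):

```csharp
// Simplified, hypothetical sketch of the streaming retry condition (illustration only).
while (true)
{
    try
    {
        bool hasNext = await _stream.MoveNextAsync().ConfigureAwait(false);
        _safeToRetry = true;   // a successful response re-arms retrying
        if (!hasNext)
        {
            break;
        }
        Process(_stream.Current);
    }
    catch (RpcException e) when (e.StatusCode == StatusCode.Unavailable && _safeToRetry)
    {
        _safeToRetry = false;  // a second consecutive failure will propagate
        await RecreateStreamFromResumeTokenAsync().ConfigureAwait(false);
    }
}
```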

Based on the stack trace below, I don't think it's getting that far, because I see WithExpirationChecking().

(stack trace screenshot omitted)

My question is: will the RPC be retried elsewhere even if the UNAVAILABLE error is not caught in this loop?

CC @olavloite

@skuruppu skuruppu added the api: spanner Issues related to the Spanner API. label Feb 24, 2021
@skuruppu skuruppu added type: question Request for information or clarification. Not an issue. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. labels Feb 24, 2021
@skuruppu
Contributor Author

Maybe we need to implement logic similar to what @olavloite worked on for Node.js a few months ago.

@jskeet
Collaborator
jskeet commented Feb 25, 2021

I'll try to take a look today.

@jskeet
Collaborator
jskeet commented Feb 25, 2021

No, if the RPC isn't retried there, I don't think there's anywhere else it would be retried. I don't think anything else would have enough information to retry it sensibly. Note that retrying streaming RPCs is inherently somewhat different to retrying regular ones.

I don't think this has anything to do with WithSessionExpiryChecking though - the only impact of that method should be to mark the session expired if the exception is "Session not found" (which it isn't in this case). All exceptions are just passed through, so it shouldn't affect retry at all.
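In other words, the described behaviour amounts to something like this (a hypothetical sketch based purely on the description above, not the actual implementation; MarkSessionExpired is a placeholder name):

```csharp
// Hypothetical sketch of the described pass-through behaviour (illustration only).
try
{
    return await task.ConfigureAwait(false);
}
catch (RpcException e) when (e.StatusCode == StatusCode.NotFound &&
                             e.Status.Detail.Contains("Session not found"))
{
    MarkSessionExpired(); // the only side effect; every exception still propagates
    throw;
}
```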

The fact that this exception propagated suggests that either:

  • We received more than 512 partial result sets since the last resume token (so we no longer have a valid resume token) and then an error occurred
  • Two RPCs failed in a row

The first of these is complicated to go through - I'd need to look much more closely at the code. It seems much more likely that we've received two RPC failures in a row, despite backing off for as long as the first RPC said we ought to.

The part about hasNext is a red herring, I believe - _safeToRetry starts off as true. So even if the very first request fails, we'll still retry - but if that second request fails as well, that won't be retried.
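To illustrate the first bullet, the resume-token bookkeeping is roughly of this shape (a hypothetical sketch; the 512 limit is the one mentioned above, and the member names other than _safeToRetry are invented):

```csharp
// Hypothetical sketch of resume-token tracking in the stream (illustration only).
private const int MaxResponsesWithoutToken = 512; // the limit mentioned above
private ByteString _resumeToken = ByteString.Empty;
private int _responsesSinceToken;

private void OnPartialResultSet(PartialResultSet response)
{
    if (!response.ResumeToken.IsEmpty)
    {
        _resumeToken = response.ResumeToken; // a safe point to restart the stream from
        _responsesSinceToken = 0;
    }
    else if (++_responsesSinceToken > MaxResponsesWithoutToken)
    {
        _safeToRetry = false; // no valid restart point, so a failure now must propagate
    }
}
```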

@skuruppu
Contributor Author

Thanks @jskeet, that makes sense. I missed the fact that _safeToRetry is true to begin with.

I think you're right that it's way more likely that they received two failures in a row.

I will mark this as closed since there's no action we can take. I might reopen it if more information comes up.

@skuruppu skuruppu reopened this Mar 3, 2021
@skuruppu
Contributor Author
skuruppu commented Mar 3, 2021

@olavloite did some investigation and was able to reproduce the issue by having two consecutive UNAVAILABLE errors returned.

In his experience, no other Spanner client library has this restriction. His recommendation was to remove _maxConsecutiveErrors and retry until the total timeout is reached.

@jskeet, is there a reason that the C# client chose this path? Do you see any problems with changing this logic?

This issue is still impacting the customer.

@jskeet
Collaborator
jskeet commented Mar 3, 2021

@skuruppu: I suspect when I implemented SqlResultStream, I was either taking my cue from the Java code at the time (which may well have changed since then) or from the .NET code at the time (which we were fixing, basically).

I think if we retry until the total timeout is reached, we should probably add some sort of minimum sleep time in there. At the moment, if the failure doesn't indicate how long to wait for, we don't wait at all - which could lead to us basically hammering a failed node (or whatever) as hard as we could. The "max consecutive errors" prevents that, so if we remove that protection we should add something else in.
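Concretely, that would mean a floor on the wait, something like this (a hypothetical sketch; GetServerSuggestedDelay is a placeholder for however the suggested wait is read from the failure, and the 250 ms floor is an assumption):

```csharp
// Hypothetical minimum-delay rule (illustration only).
TimeSpan ComputeDelay(RpcException exception)
{
    TimeSpan minimumDelay = TimeSpan.FromMilliseconds(250);    // assumed floor
    TimeSpan? suggested = GetServerSuggestedDelay(exception);  // placeholder helper
    return suggested is TimeSpan s && s > minimumDelay ? s : minimumDelay;
}
```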

@skuruppu
Contributor Author
skuruppu commented Mar 3, 2021

Ah OK, I was thinking we'd implement it with exponential backoff, so that should address your concern.
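Something along these lines, presumably (a minimal sketch under assumed parameters, not the settings the library actually ships with; MoveNextOnStreamAsync and RecreateStreamAsync are placeholder names):

```csharp
// Hypothetical deadline-bounded retry with exponential backoff (illustration only).
async Task<bool> MoveNextWithRetryAsync(DateTime deadline)
{
    TimeSpan delay = TimeSpan.FromMilliseconds(250);  // assumed minimum delay
    TimeSpan maxDelay = TimeSpan.FromSeconds(32);     // assumed cap
    while (true)
    {
        try
        {
            return await MoveNextOnStreamAsync().ConfigureAwait(false);
        }
        catch (RpcException e) when (e.StatusCode == StatusCode.Unavailable)
        {
            if (DateTime.UtcNow + delay >= deadline)
            {
                throw; // deadline exceeded: stop retrying
            }
            await Task.Delay(delay).ConfigureAwait(false); // never a zero wait
            delay = TimeSpan.FromTicks(Math.Min(delay.Ticks * 2, maxDelay.Ticks));
            await RecreateStreamAsync().ConfigureAwait(false);
        }
    }
}
```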

@olavloite would you be able to take this on? For some reason, I still can't assign you issues.

@skuruppu skuruppu assigned olavloite and unassigned jskeet Mar 3, 2021
@skuruppu
Contributor Author
skuruppu commented Mar 3, 2021

Ah figured it out!

@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Mar 4, 2021
olavloite added a commit to olavloite/google-cloud-dotnet that referenced this issue Mar 8, 2021
Retry consecutive errors that are retryable until either the call
succeeds or until the deadline of the call has been exceeded.

Fixes googleapis#5977
olavloite added a commit to olavloite/google-cloud-dotnet that referenced this issue Mar 9, 2021
Retry consecutive retryable errors for the same resume token in a stream
of PartialResultSets. This will prevent SpannerDataReader from throwing
Unavailable exceptions while reading data.
The ExecuteSqlStreaming RPC will be retried at most 115 times for the
same resume token. This value is chosen as it will cause retries to
be attempted for roughly the same amount of time as the default timeout
of ExecuteStreamingSql (1 hour).

Fixes googleapis#5977
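As a back-of-the-envelope check of where a number like 115 can come from (the backoff parameters below are assumptions, not necessarily the library's actual settings): with a 250 ms initial delay, a 1.5x multiplier and a 32 s cap, the delay hits the cap after about a dozen attempts, so almost every retry sleeps ~32 s, and 115 retries add up to roughly an hour.

```csharp
// Hypothetical backoff parameters; prints ~3,360 seconds, i.e. close to the
// 1 hour default timeout of ExecuteStreamingSql mentioned above.
double delaySeconds = 0.25, totalSeconds = 0;
for (int attempt = 0; attempt < 115; attempt++)
{
    totalSeconds += delaySeconds;
    delaySeconds = Math.Min(delaySeconds * 1.5, 32);
}
Console.WriteLine(totalSeconds);
```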
jskeet pushed a commit that referenced this issue Mar 10, 2021 (same commit message as olavloite's Mar 9 commit above)
olavloite added a commit that referenced this issue Mar 17, 2021
* fix: retry errors in stream until timeout

Retry consecutive errors that are retryable until either the call
succeeds or until the deadline of the call has been exceeded.

Fixes #5977

* fix: use deadline instead of timeout
jskeet added a commit that referenced this issue Mar 17, 2021
Changes in Google.Cloud.Spanner.Data version 3.6.0:

- [Commit 69c83e4](69c83e4):
  - fix: retry errors in stream until timeout ([issue 6013](#6013))
  - Also fixes [issue 5977](#5977)
- [Commit fa5641d](fa5641d): fix: retry consecutive retryable errors in sql stream. Fixes [issue 5977](#5977)
- [Commit a86b6ea](a86b6ea): feat: add `optimizer_statistics_package` field in `QueryOptions`
- [Commit ef02e74](ef02e74): feat: add CMEK fields to backup and database

Packages in this release:
- Release Google.Cloud.Spanner.Admin.Database.V1 version 3.6.0
- Release Google.Cloud.Spanner.Admin.Instance.V1 version 3.6.0
- Release Google.Cloud.Spanner.Common.V1 version 3.6.0
- Release Google.Cloud.Spanner.Data version 3.6.0
- Release Google.Cloud.Spanner.V1 version 3.6.0