
Spanner: We may not be retrying requests when WithSessionExpiryChecking() fails with UNAVAILABLE #5977

Closed
skuruppu opened this issue Feb 24, 2021 · 8 comments · Fixed by #6013, #6015 or #6147
Labels
- api: spanner (Issues related to the Spanner API)
- priority: p1 (Important issue which blocks shipping the next release. Will be fixed prior to next release.)
- 🚨 (This issue needs some love.)
- type: question (Request for information or clarification. Not an issue.)

Comments

@skuruppu
Contributor

A customer reported an issue where they are using SpannerDataReader and see the following error in their logs:

Exception details: google.Cloud.Spanner.Data.SpannerException: The service is currently unavailable. This is a most likely a transient condition.

---> Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="GOAWAY received", DebugException="Grpc.Core.Internal.CoreErrorDetailException:

There doesn't seem to be any impact on the workload from this error, other than things taking a long time when it happens. Looking at the code, an UNAVAILABLE error is caught and retried, but only if hasNext is computed and _safeToRetry is set.
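For context, that check is shaped roughly like the sketch below (a simplified, hypothetical reconstruction rather than the actual SqlResultStream code; every name except _safeToRetry and hasNext is a placeholder):

```csharp
// Simplified, hypothetical sketch of the streaming retry condition (illustration only).
while (true)
{
    try
    {
        bool hasNext = await _stream.MoveNextAsync().ConfigureAwait(false);
        _safeToRetry = true;   // a successful response re-arms retrying
        if (!hasNext)
        {
            break;
        }
        Process(_stream.Current);
    }
    catch (RpcException e) when (e.StatusCode == StatusCode.Unavailable && _safeToRetry)
    {
        _safeToRetry = false;  // a second consecutive failure will propagate
        await RecreateStreamFromResumeTokenAsync().ConfigureAwait(false);
    }
}
```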

Based on the stack trace below, I don't think it's getting that far, because I see WithExpirationChecking().

(stack trace screenshot omitted)

My question is: will the RPC be retried elsewhere even if the UNAVAILABLE error is not caught in this loop?

CC @olavloite

@skuruppu skuruppu added the api: spanner Issues related to the Spanner API. label Feb 24, 2021
@skuruppu skuruppu added type: question Request for information or clarification. Not an issue. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. labels Feb 24, 2021
@skuruppu
Contributor Author

Maybe we need to implement logic similar to what @olavloite worked on for Node.js a few months ago.

@jskeet
Collaborator
jskeet commented Feb 25, 2021

I'll try to take a look today.

@jskeet
Collaborator
jskeet commented Feb 25, 2021

No, if the RPC isn't retried there, I don't think there's anywhere else it would be retried. I don't think anything else would have enough information to retry it sensibly. Note that retrying streaming RPCs is inherently somewhat different to retrying regular ones.

I don't think this has anything to do with WithSessionExpiryChecking though - the only impact of that method should be to mark the session expired if the exception is "Session not found" (which it isn't in this case). All exceptions are just passed through, so it shouldn't affect retry at all.
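In other words, the described behaviour amounts to something like this (a hypothetical sketch based purely on the description above, not the actual implementation; MarkSessionExpired is a placeholder name):

```csharp
// Hypothetical sketch of the described pass-through behaviour (illustration only).
try
{
    return await task.ConfigureAwait(false);
}
catch (RpcException e) when (e.StatusCode == StatusCode.NotFound &&
                             e.Status.Detail.Contains("Session not found"))
{
    MarkSessionExpired(); // the only side effect; every exception still propagates
    throw;
}
```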

The fact that this exception propagated suggests that either:

  • We received more than 512 partial result sets since the last resume token (so we no longer have a valid resume token) and then an error occurred
  • Two RPCs failed in a row

The first of these is complicated to go through - I'd need to look much more closely at the code. It seems much more likely that we've received two RPC failures in a row, despite backing off for as long as the first RPC said we ought to.

The part about hasNext is a red herring, I believe - _safeToRetry starts off as true. So even if the very first request fails, we'll still retry - but if that second request fails as well, that won't be retried.
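To illustrate the first bullet, the resume-token bookkeeping is roughly of this shape (a hypothetical sketch; the 512 limit is the one mentioned above, and the member names other than _safeToRetry are invented):

```csharp
// Hypothetical sketch of resume-token tracking in the stream (illustration only).
private const int MaxResponsesWithoutToken = 512; // the limit mentioned above
private ByteString _resumeToken = ByteString.Empty;
private int _responsesSinceToken;

private void OnPartialResultSet(PartialResultSet response)
{
    if (!response.ResumeToken.IsEmpty)
    {
        _resumeToken = response.ResumeToken; // a safe point to restart the stream from
        _responsesSinceToken = 0;
    }
    else if (++_responsesSinceToken > MaxResponsesWithoutToken)
    {
        _safeToRetry = false; // no valid restart point, so a failure now must propagate
    }
}
```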

@skuruppu
Contributor Author

Thanks @jskeet, that makes sense. I missed the fact that _safeToRetry is true to begin with.

I think you're right that it's way more likely that they received two failures in a row.

I will mark this as closed since there's no action we can take. I might reopen it if more information comes up.

@skuruppu skuruppu reopened this Mar 3, 2021
@skuruppu
Contributor Author
skuruppu commented Mar 3, 2021

@olavloite did some investigation and was able to reproduce the issue by having two consecutive UNAVAILABLE errors returned.

In his experience, no other Spanner client library has this restriction. His recommendation was to remove _maxConsecutiveErrors and retry until the total timeout is reached.

@jskeet, is there a reason that the C# client chose this path? Do you see any problems with changing this logic?

This issue is still impacting the customer.

@jskeet
Collaborator
jskeet commented Mar 3, 2021

@skuruppu: I suspect when I implemented SqlResultStream, I was either taking my cue from the Java code at the time (which may well have changed since then) or from the .NET code at the time (which we were fixing, basically).

I think if we retry until the total timeout is reached, we should probably add some sort of minimum sleep time in there. At the moment, if the failure doesn't indicate how long to wait for, we don't wait at all - which could lead to us basically hammering a failed node (or whatever) as hard as we could. The "max consecutive errors" prevents that, so if we remove that protection we should add something else in.
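Concretely, that would mean a floor on the wait, something like this (a hypothetical sketch; GetServerSuggestedDelay is a placeholder for however the suggested wait is read from the failure, and the 250 ms floor is an assumption):

```csharp
// Hypothetical minimum-delay rule (illustration only).
TimeSpan ComputeDelay(RpcException exception)
{
    TimeSpan minimumDelay = TimeSpan.FromMilliseconds(250);    // assumed floor
    TimeSpan? suggested = GetServerSuggestedDelay(exception);  // placeholder helper
    return suggested is TimeSpan s && s > minimumDelay ? s : minimumDelay;
}
```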

@skuruppu
Contributor Author
skuruppu commented Mar 3, 2021

Ah OK, I was thinking we'd implement it with exponential backoff, so that should address your concern.
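Something along these lines, presumably (a minimal sketch under assumed parameters, not the settings the library actually ships with; MoveNextOnStreamAsync and RecreateStreamAsync are placeholder names):

```csharp
// Hypothetical deadline-bounded retry with exponential backoff (illustration only).
async Task<bool> MoveNextWithRetryAsync(DateTime deadline)
{
    TimeSpan delay = TimeSpan.FromMilliseconds(250);  // assumed minimum delay
    TimeSpan maxDelay = TimeSpan.FromSeconds(32);     // assumed cap
    while (true)
    {
        try
        {
            return await MoveNextOnStreamAsync().ConfigureAwait(false);
        }
        catch (RpcException e) when (e.StatusCode == StatusCode.Unavailable)
        {
            if (DateTime.UtcNow + delay >= deadline)
            {
                throw; // deadline exceeded: stop retrying
            }
            await Task.Delay(delay).ConfigureAwait(false); // never a zero wait
            delay = TimeSpan.FromTicks(Math.Min(delay.Ticks * 2, maxDelay.Ticks));
            await RecreateStreamAsync().ConfigureAwait(false);
        }
    }
}
```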

@olavloite would you be able to take this on? For some reason, I still can't assign you issues.

@skuruppu skuruppu assigned olavloite and unassigned jskeet Mar 3, 2021
@skuruppu
Contributor Author
skuruppu commented Mar 3, 2021

Ah figured it out!

@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Mar 4, 2021
olavloite added a commit to olavloite/google-cloud-dotnet that referenced this issue Mar 8, 2021
Retry consecutive errors that are retryable until either the call
succeeds or until the deadline of the call has been exceeded.

Fixes googleapis#5977
olavloite added a commit to olavloite/google-cloud-dotnet that referenced this issue Mar 9, 2021
Retry consecutive retryable errors for the same resume token in a stream
of PartialResultSets. This will prevent SpannerDataReader from throwing
Unavailable exceptions while reading data.
The ExecuteSqlStreaming RPC will be retried at most 115 times for the
same resume token. This value is chosen as it will cause retries to
be attempted for roughly the same amount of time as the default timeout
of ExecuteStreamingSql (1 hour).

Fixes googleapis#5977
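As a back-of-the-envelope check of where a number like 115 can come from (the backoff parameters below are assumptions, not necessarily the library's actual settings): with a 250 ms initial delay, a 1.5x multiplier and a 32 s cap, the delay hits the cap after about a dozen attempts, so almost every retry sleeps ~32 s, and 115 retries add up to roughly an hour.

```csharp
// Hypothetical backoff parameters; prints ~3,360 seconds, i.e. close to the
// 1 hour default timeout of ExecuteStreamingSql mentioned above.
double delaySeconds = 0.25, totalSeconds = 0;
for (int attempt = 0; attempt < 115; attempt++)
{
    totalSeconds += delaySeconds;
    delaySeconds = Math.Min(delaySeconds * 1.5, 32);
}
Console.WriteLine(totalSeconds);
```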
jskeet pushed a commit that referenced this issue Mar 10, 2021 (same commit message as olavloite's Mar 9 commit above)
olavloite added a commit that referenced this issue Mar 17, 2021
* fix: retry errors in stream until timeout

Retry consecutive errors that are retryable until either the call
succeeds or until the deadline of the call has been exceeded.

Fixes #5977

* fix: use deadline instead of timeout
jskeet added a commit that referenced this issue Mar 17, 2021
Changes in Google.Cloud.Spanner.Data version 3.6.0:

- [Commit 69c83e4](69c83e4):
  - fix: retry errors in stream until timeout ([issue 6013](#6013))
  - Also fixes [issue 5977](#5977)
- [Commit fa5641d](fa5641d): fix: retry consecutive retryable errors in sql stream. Fixes [issue 5977](#5977)
- [Commit a86b6ea](a86b6ea): feat: add `optimizer_statistics_package` field in `QueryOptions`
- [Commit ef02e74](ef02e74): feat: add CMEK fields to backup and database

Packages in this release:
- Release Google.Cloud.Spanner.Admin.Database.V1 version 3.6.0
- Release Google.Cloud.Spanner.Admin.Instance.V1 version 3.6.0
- Release Google.Cloud.Spanner.Common.V1 version 3.6.0
- Release Google.Cloud.Spanner.Data version 3.6.0
- Release Google.Cloud.Spanner.V1 version 3.6.0