-
Notifications
You must be signed in to change notification settings - Fork 362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spanner: We may not be retrying requests when WithSessionExpiryChecking() fails with UNAVAILABLE #5977
Comments
Maybe we need to implement similar logic to that of Node.js that @olavloite worked on a few months ago. |
I'll try to take a look today. |
No, if the RPC isn't retried there, I don't think there's anywhere else it would be retried. I don't think anything else would have enough information to retry it sensibly. Note that retrying streaming RPCs is inherently somewhat different to retrying regular ones. I don't think this has anything to do with This exception propagating suggests that either:
The first of these is complicated to go through - I'd need to look much more closely at the code. It seems much more likely that we've received two RPC failures in a row, despite backing off for as long as the first RPC said we ought to. The part about |
Thanks @jskeet, that makes sense. I missed the fact that I think you're right that it's way more likely that they received two failures in a row. I will mark this as closed since there's no action we can take. I might reopen it if more information comes up. |
@olavloite did some investigations and he was able to reproduce the issue with two consecutive UNAVAILABLE errors being returned. From his experience no other Spanner client libraries has this restriction. His recommendation was to remove @jskeet, is there a reason that the C# client chose this path? Do you see any problems with changing this logic? This issue is still impacting the customer. |
@skuruppu: I suspect when I implemented SqlResultStream, I was either taking my cue from the Java code at the time (which may well have changed since then) or from the .NET code at the time (which we were fixing, basically). I think if we retry until the total timeout is reached, we should probably add some sort of minimum sleep time in there. At the moment, if the failure doesn't indicate how long to wait for, we don't wait at all - which could lead to us basically hammering a failed node (or whatever) as hard as we could. The "max consecutive errors" prevents that, so if we remove that protection we should add something else in. |
Ah ok, so I was thinking we'll implement it with exp backoff so that should address your concern. @olavloite would you be able to take this on? For some reason, I still can't assign you issues. |
Ah figured it out! |
Retry consecutive errors that are retryable until either the call succeeds or until the deadline of the call has been exceeded. Fixes googleapis#5977
Retry consecutive retryable errors for the same resume token in a stream of PartialResultSets. This will prevent SpannerDataReader from throwing Unavailable exceptions while reading data. The ExecuteSqlStreaming RPC will be retried at most 115 times for the same resume token. This value is chosen as it will cause retries to be attempted for roughly the same amount of time as the default timeout of ExecuteStreamingSql (1 hour). Fixes googleapis#5977
Retry consecutive retryable errors for the same resume token in a stream of PartialResultSets. This will prevent SpannerDataReader from throwing Unavailable exceptions while reading data. The ExecuteSqlStreaming RPC will be retried at most 115 times for the same resume token. This value is chosen as it will cause retries to be attempted for roughly the same amount of time as the default timeout of ExecuteStreamingSql (1 hour). Fixes #5977
* fix: retry errors in stream until timeout Retry consecutive errors that are retryable until either the call succeeds or until the deadline of the call has been exceeded. Fixes #5977 * fix: use deadline instead of timeout
Changes in Google.Cloud.Spanner.Data version 3.6.0: - [Commit 69c83e4](googleapis@69c83e4): - fix: retry errors in stream until timeout ([issue 6013](googleapis#6013)) - Also fixes [issue 5977](googleapis#5977) - [Commit fa5641d](googleapis@fa5641d): fix: retry consecutive retryable errors in sql stream. Fixes [issue 5977](googleapis#5977) - [Commit a86b6ea](googleapis@a86b6ea): feat: add `optimizer_statistics_package` field in `QueryOptions` - [Commit ef02e74](googleapis@ef02e74): feat: add CMEK fields to backup and database Packages in this release: - Release Google.Cloud.Spanner.Admin.Database.V1 version 3.6.0 - Release Google.Cloud.Spanner.Admin.Instance.V1 version 3.6.0 - Release Google.Cloud.Spanner.Common.V1 version 3.6.0 - Release Google.Cloud.Spanner.Data version 3.6.0 - Release Google.Cloud.Spanner.V1 version 3.6.0
Changes in Google.Cloud.Spanner.Data version 3.6.0: - [Commit 69c83e4](googleapis@69c83e4): - fix: retry errors in stream until timeout ([issue 6013](googleapis#6013)) - Also fixes [issue 5977](googleapis#5977) - [Commit fa5641d](googleapis@fa5641d): fix: retry consecutive retryable errors in sql stream. Fixes [issue 5977](googleapis#5977) - [Commit a86b6ea](googleapis@a86b6ea): feat: add `optimizer_statistics_package` field in `QueryOptions` - [Commit ef02e74](googleapis@ef02e74): feat: add CMEK fields to backup and database Packages in this release: - Release Google.Cloud.Spanner.Admin.Database.V1 version 3.6.0 - Release Google.Cloud.Spanner.Admin.Instance.V1 version 3.6.0 - Release Google.Cloud.Spanner.Common.V1 version 3.6.0 - Release Google.Cloud.Spanner.Data version 3.6.0 - Release Google.Cloud.Spanner.V1 version 3.6.0
Changes in Google.Cloud.Spanner.Data version 3.6.0: - [Commit 69c83e4](69c83e4): - fix: retry errors in stream until timeout ([issue 6013](#6013)) - Also fixes [issue 5977](#5977) - [Commit fa5641d](fa5641d): fix: retry consecutive retryable errors in sql stream. Fixes [issue 5977](#5977) - [Commit a86b6ea](a86b6ea): feat: add `optimizer_statistics_package` field in `QueryOptions` - [Commit ef02e74](ef02e74): feat: add CMEK fields to backup and database Packages in this release: - Release Google.Cloud.Spanner.Admin.Database.V1 version 3.6.0 - Release Google.Cloud.Spanner.Admin.Instance.V1 version 3.6.0 - Release Google.Cloud.Spanner.Common.V1 version 3.6.0 - Release Google.Cloud.Spanner.Data version 3.6.0 - Release Google.Cloud.Spanner.V1 version 3.6.0
We got a customer issue reported where they are using the
SpannerDataReader
and they see an error in their logs:There doesn't seem to be any impact on the workload from this error other than it taking a long time when this happens. Looking at the code, an UNAVAILABLE error is caught and retried. But that's only if
hasNext
is computed and_safeToRetry
is set.I don't think it's getting this far based on the below stack trace because I see
WithExpirationChecking()
.My question is will the RPC be retried elsewhere even if the
UNAVAILBLE
error is not caught in this loop?CC @olavloite
The text was updated successfully, but these errors were encountered: