pubsub: Significant redelivery while existing messages are still being "processed" #10437
Comments
Thanks as always for the detailed logs.
I think this might be the other way around, where more pulled messages means more modacks issued at the time they are pulled in (to acknowledge receipt of those messages).
There's no guarantee that increasing the number of streams means you'll hit another server. On top of that, this level of processing shouldn't be OOMing the servers, so that doesn't seem right either. I'll check to see if there are any changes around then that might've resulted in this. Just to confirm, this region was processing the data at the same speed (around 400k messages per client outstanding) without issue prior to 6/20?
So it seems like the issue is because of messages expiring. Specifically, messages are being pulled into the client library but not being processed (because the concurrency control, MaxOutstandingMessages, is reached). This is supported by the fact that your end-to-end latency is very high (>1 hour) around the time of these expirations. I think server-side flow control is working most of the time, but when the stream is torn down and recreated, the server is erroneously delivering more messages. My current hypothesis goes something like this:
These steps happen randomly, and eventually you build up a large number of outstanding messages in your client that can't be processed in time. I'll try to craft a version of the library that includes a separate setting for #callbacks, as promised in #9727, which should let you set num_callbacks to > 400k so your application won't get bogged down by messages that are received but not processed in these off cases. The long-term solution is to fix the behavior in step 4, but that requires a lot more planning.
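Roughly, the idea would look something like the sketch below; note that the MaxOutstandingCallbacks field is purely hypothetical (it is not part of the released library) and only illustrates decoupling callback concurrency from the flow-control limit:

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	sub := client.Subscription("my-subscription") // placeholder subscription ID

	// Today the flow-control limit also bounds how many received messages can be
	// handed to callbacks; anything pulled beyond it sits unprocessed.
	sub.ReceiveSettings.MaxOutstandingMessages = 400000

	// Hypothetical, illustration only: a separate, higher cap on concurrent
	// callbacks (e.g. 500k) so messages that were pulled anyway still reach the
	// handler instead of expiring.
	// sub.ReceiveSettings.MaxOutstandingCallbacks = 500000

	_ = sub
}
```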
I was thinking something like this was happening but I shrugged it off initially because I assumed the new server would not have any messages to deliver because all of the existing messages in the subscription were already leased by a client (and being modack'd). Or does the stream being killed/restarted invalidate all of the leases being held by the outstanding messages delivered on that stream?
I'm also a bit confused about why it works for several hours of the day and then suddenly decides to redeliver a bunch of messages despite the stream (seemingly) being restarted constantly throughout the day. Could it be that it's reconnecting to the same forwarder most of the time, but when it connects to a different one (in step 4) that causes the issue? I'm also trying to see if it's somehow related to
I was assuming you have a constant flow of messages coming into your subscription. Is your publish throughput less sustained and done more in bursts?
Yeah, that's what I'm thinking is happening. I'm not sure if increasing MaxOutstandingMessages will help here. I think it might be best to try the new version of the library that tweaks callbacks independently (maybe setting that to 500k) and see if this issue improves. I'll update here once that's ready.
@jameshartig When you get the chance, can you pull in this version of the library: This introduces a new setting to For your case, I recommend keeping
Client
PubSub v1.39.0
Environment
CentOS on GCE
Go Environment
N/A
Code
I'll try to include the relevant code below as it pertains to this issue, e.g.:
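A minimal sketch of the receive setup (project and subscription IDs are placeholders, the concrete limits are assumed from the numbers mentioned below, and the real handler also cancels an existing pending message when a duplicate arrives):

```go
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()

	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	sub := client.Subscription("my-subscription") // placeholder subscription ID

	// There are usually 400k-450k unacked messages, so the flow-control limit
	// is assumed to be in that range.
	sub.ReceiveSettings.MaxOutstandingMessages = 450000
	// We hold messages for a long time, so extend leases in 10 minute steps
	// instead of modacking constantly.
	sub.ReceiveSettings.MinExtensionPeriod = 10 * time.Minute
	// Assumed: the default MaxExtension (60 minutes) would be too short for a
	// ~2 hour hold, so it is raised here.
	sub.ReceiveSettings.MaxExtension = 3 * time.Hour

	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		// Hold the message in memory until it is ~2 hours old, then "process" it.
		if wait := 2*time.Hour - time.Since(m.PublishTime); wait > 0 {
			select {
			case <-time.After(wait):
			case <-ctx.Done():
				m.Nack()
				return
			}
		}
		if err := process(m.Data); err != nil { // "processing" usually takes <10s
			m.Nack()
			return
		}
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}

// process stands in for the actual work; the real implementation isn't shown here.
func process(data []byte) error {
	return nil
}
```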
Expected behavior
I expect there to be minimal redeliveries while a message is being handled and is still being modacked by the client.
Actual behavior
We see messages being redelivered in spikes frequently. Since we only keep track of "already processing" messages on a per-VM basis, we can't accurately track when we get redeliveries to different VMs.
Screenshots
There are around 400k-450k unacked messages at any given time.
That said, the messages are each pretty small and only total 50MB-60MB:
We let messages get as old as ~2 hours before we start to actually "process" them:
We are generally "processing" the message in less than 10 seconds at the 90th percentile:
This is a chart of all of the times we had to cancel an existing "pending" message because we received another one.
You'll notice it spikes as high as 300k which means we are getting redeliveries of most of the outstanding messages.
This chart shows the total messages being held in memory per datacenter:
You'll notice that it's pretty normal until 6/22/2024 at 20:00 UTC, when it starts to increase dramatically. It returned to normal on 6/24 but then started to be erratic again on 6/24 at 20:00 UTC. Another thing to notice is that it spikes above 400k, which means we have significantly more messages being held in memory than are even outstanding in the subscription.
Even worse is if you look at the messages being held by the Pub/Sub client itself (this comes from the OutstandingMessagesView) (green is the sum across 2-3 servers and yellow is the average):

When things were normal the client was holding onto 400k-500k messages, but when things go wrong the client is holding onto 1 million+ messages (presumably in flow control), which seems insane since there shouldn't be more than 400k messages in the subscription.
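For reference, a rough sketch of how these client-side OpenCensus views can be registered and exported; the Prometheus exporter and port here are just an example, not necessarily our exact setup:

```go
package main

import (
	"log"
	"net/http"

	"cloud.google.com/go/pubsub"
	"contrib.go.opencensus.io/exporter/prometheus"
	"go.opencensus.io/stats/view"
)

func main() {
	// Register the subscriber views exported by the client library, including
	// OutstandingMessagesView (messages currently held/leased by the client).
	if err := view.Register(
		pubsub.OutstandingMessagesView,
		pubsub.OutstandingBytesView,
		pubsub.PullCountView,
		pubsub.ModAckCountView,
	); err != nil {
		log.Fatal(err)
	}

	// Expose the metrics over HTTP for scraping (exporter choice is an example).
	pe, err := prometheus.NewExporter(prometheus.Options{Namespace: "pubsub_client"})
	if err != nil {
		log.Fatal(err)
	}
	http.Handle("/metrics", pe)
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```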
We set MinExtensionPeriod to 10 minutes since we expect to hold onto the messages for a long time and don't need to modack all of the time. This seems to be working since we see a spike in modacks every 5 minutes.

What's odd, though, is that around when we submit the modacks we see the spike in new pulled messages and the spike in duplicate messages:
Additional context
NumGoroutines was previously set to 2, but after the issues on 6/22 we lowered it to 1 without much of a difference. WithGRPCConnectionPool was previously set to 2 (the default) but we raised it to 4 at the same time as lowering NumGoroutines, and again there wasn't much difference (see the sketch below). We also tried lowering the number of VMs supporting this workload from 3 to 2, but that seemed to have made things worse.

The Acknowledgement deadline on the subscription is set to the default (10 seconds). We haven't changed this but could if you think that would help.

We were previously "abusing" the backoff delay and kept nacking messages until the ~2 hours was up, but after talking to a Pub/Sub PM we were told that the backoff delay isn't a guarantee and that only a certain number of messages will respect it. This approach (holding the messages in memory in our handler) was suggested as a better approach. Things were working pretty normally until recently, so I'm not sure if there was some change on the Pub/Sub side. We didn't make any changes to our code on 6/22. It's also a bit interesting that things were okay on 6/24 until 20:00 UTC again, like on 6/22.
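Concretely, those two knobs are set roughly like this (project and subscription IDs are placeholders):

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
	"google.golang.org/api/option"
)

func main() {
	ctx := context.Background()

	// Raised from the default of 2 to 4 at the same time NumGoroutines was lowered.
	client, err := pubsub.NewClient(ctx, "my-project",
		option.WithGRPCConnectionPool(4),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	sub := client.Subscription("my-subscription")
	// Lowered from 2 to 1 after the issues on 6/22; controls the number of
	// StreamingPull streams the subscriber opens.
	sub.ReceiveSettings.NumGoroutines = 1

	_ = sub // sub.Receive(...) as in the sketch in the Code section above.
}
```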
I'm wondering if there's some memory/message limit on the forwarders that we might be hitting, which causes them to crash and then another forwarder redelivers the messages. My assumption is based on us not having issues when the total number of undelivered messages stays under ~420k, because on 6/22, 6/23, and 6/24 we hit new peaks of ~430k, ~451k, and ~445k outstanding messages, respectively. If that's the case, then would increasing the number of goroutines actually help so that messages could potentially be spread across more forwarders? We could also try to decrease the size of the messages, but they're already pretty small.
Finally, related to above, I limited this to only europe-west1 because we have the most traffic there and the other regions seem relatively unaffected.
@hongalex I apologize in advance for the complicated issues.