
Special:Homepage is rendered much slower (<1 sec to 2+ sec)
Closed, Resolved · Public

Description

Looking at our performance metrics as part of our chores, I noticed that about 7 days ago the rendering times for the homepage went up and stayed high: https://grafana.wikimedia.org/d/vGq7hbnMz/special-homepage-and-suggested-edits?orgId=1&from=1718582400000&to=now

image.png (305×779 px, 71 KB)

Acceptance Criteria:

  • figure out why homepage performance got much worse on June 18th / 19th.

Note:
This seems to have happened between the Group 0 and Group 1 deploys, so I somewhat doubt that it is related to the train per se. But maybe it was a caching config change or something similar?

Excimer profiles:

XHGui profiles:

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jun 25 2024, 1:44 PM

The degradation seems to have started around midnight between June 18th and June 19th.

The closest relevant entries in SAL are:

Hi, I just checked on a bare-metal debug server (mwdebug1001), and it takes 1.9 s there, so I doubt it's k8s-related.

In a different conversation, @kostajh pointed out that the requests to the AnalyticsQueryService, added in T235810 to show the site edits in the last day, might be a contributing factor. They are in fact still in use as a fallback in \GrowthExperiments\HomepageModules\SuggestedEdits::getMobileSummaryBody, and that else branch also logs a warning if we do not get something useful back from the AnalyticsQueryService. That warning is seeing a suspicious uptick as well:

image.png (395×847 px, 40 KB)
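For illustration, the fallback described above follows roughly this shape (a minimal, hypothetical sketch: the helper methods, log message, and control flow are simplified stand-ins, not the actual GrowthExperiments code):

```php
// Hypothetical, simplified sketch of the fallback described above;
// not the actual GrowthExperiments implementation.
private function getMobileSummaryBody(): string {
	// Preferred path: use the locally cached per-user data if available.
	$editCount = $this->getCachedYesterdayEditCount();

	if ( $editCount === null ) {
		// Fallback: a synchronous request to the AnalyticsQueryService (AQS)
		// for the number of site edits in the last day. Because this happens
		// during page rendering, AQS latency or timeouts add directly to the
		// Special:Homepage render time.
		$editCount = $this->fetchYesterdayEditCountFromAqs();

		if ( $editCount === null ) {
			// The warning whose uptick is visible in the screenshot above.
			$this->logger->warning( 'Could not fetch site edit count from AQS' );
			$editCount = 0;
		}
	}

	return $this->formatMobileSummary( $editCount );
}
```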


So I think a good next place to look would be to ask AQS folks if anything there changed on June 18.

I would still consider removing the edit count call in getActionData for the desktop version of the module, since it is not used there (AFAICT).

Yep, that's the plan. Also, the mobile version needs some work: we rarely show that counter there, but when we do and it is not available, it renders like this:

image.png (453×1 px, 49 KB)

which is not great either. (I'll probably make a separate task for that.)
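As a rough sketch of the change discussed above (and implemented in the patch further down), only paying for the AQS-backed count where it is actually shown could look like this — getActionData is the method name mentioned above, but the constant, helpers, and structure are hypothetical:

```php
// Hypothetical sketch of gating the edit-count lookup by rendering mode;
// only getActionData is taken from the discussion above, the rest is
// illustrative.
protected function getActionData(): array {
	$data = [
		'taskCount' => $this->getTaskCount(),
	];

	// Only the mobile summary ever shows the "edits in the last day"
	// counter, so only pay for the (potentially slow) AQS request there.
	if ( $this->getMode() === self::RENDERING_MODE_MOBILE ) {
		$data['yesterdayEditCount'] = $this->getYesterdayEditCount();
	}

	return $data;
}
```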


I'm sure the developers would be best positioned to say whether anything has changed, but as far as the AQS services themselves are concerned it doesn't seem like there have been any significant increases in latency: service-level view, REST gateway level view (easiest to read if you filter out the proton metrics)

Adding the Data-Platform team tag given that maybe something on AQS changed. (And if we can rule that out, that would also be useful to know.)

Change #1049973 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@master] Homepage: log rendering time for each module and each wiki

https://gerrit.wikimedia.org/r/1049973
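For context, a minimal sketch of what such per-module timing could look like, assuming MediaWiki's buffering StatsD data factory (the metric key, variable names, and render call here are illustrative assumptions, not the contents of the actual patch):

```php
use MediaWiki\MediaWikiServices;

// Illustrative sketch: time each homepage module render and report it as a
// StatsD timing metric, keyed by module and wiki. $modules and $mode are
// assumed to come from the surrounding special-page code.
$stats = MediaWikiServices::getInstance()->getStatsdDataFactory();
$wikiId = WikiMap::getCurrentWikiId(); // e.g. "enwiki"
$html = '';

foreach ( $modules as $name => $module ) {
	$start = microtime( true );
	$html .= $module->render( $mode );
	$elapsedMs = ( microtime( true ) - $start ) * 1000;

	// Metric key is an assumption, not the one used by the actual patch.
	$stats->timing(
		"growthExperiments.specialHomepage.modules.$name.$wikiId",
		$elapsedMs
	);
}
```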

Change #1049974 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@master] Homepage: don't load yesterdays edits on desktop

https://gerrit.wikimedia.org/r/1049974

Change #1049973 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Homepage: log rendering time for each module and each wiki

https://gerrit.wikimedia.org/r/1049973

Change #1049974 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Homepage: don't load yesterdays edits on desktop

https://gerrit.wikimedia.org/r/1049974

Change #1050002 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@wmf/1.43.0-wmf.11] Homepage: log rendering time for each module and each wiki

https://gerrit.wikimedia.org/r/1050002

Change #1050005 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@wmf/1.43.0-wmf.11] Homepage: don't load yesterdays edits on desktop

https://gerrit.wikimedia.org/r/1050005

Change #1050002 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.43.0-wmf.11] Homepage: log rendering time for each module and each wiki

https://gerrit.wikimedia.org/r/1050002

Mentioned in SAL (#wikimedia-operations) [2024-06-26T20:51:50Z] <cjming@deploy1002> Started scap: Backport for [[gerrit:1050002|Homepage: log rendering time for each module and each wiki (T368405)]]

Mentioned in SAL (#wikimedia-operations) [2024-06-26T20:55:16Z] <cjming@deploy1002> cjming, migr: Backport for [[gerrit:1050002|Homepage: log rendering time for each module and each wiki (T368405)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-06-26T21:05:51Z] <cjming@deploy1002> Finished scap: Backport for [[gerrit:1050002|Homepage: log rendering time for each module and each wiki (T368405)]] (duration: 14m 01s)

Change #1050005 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.43.0-wmf.11] Homepage: don't load yesterdays edits on desktop

https://gerrit.wikimedia.org/r/1050005

Mentioned in SAL (#wikimedia-operations) [2024-06-26T21:29:40Z] <cjming@deploy1002> Started scap: Backport for [[gerrit:1050005|Homepage: don't load yesterdays edits on desktop (T368405)]]

Mentioned in SAL (#wikimedia-operations) [2024-06-26T21:32:21Z] <cjming@deploy1002> cjming, migr: Backport for [[gerrit:1050005|Homepage: don't load yesterdays edits on desktop (T368405)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-06-26T21:38:29Z] <cjming@deploy1002> Finished scap: Backport for [[gerrit:1050005|Homepage: don't load yesterdays edits on desktop (T368405)]] (duration: 08m 48s)

Looking at my own early tracking (https://grafana.wikimedia.org/d/ff15559c-b4a2-4363-94c8-190a086b3315/michael-s-playground?from=now-24h&orgId=1&to=now&viewPanel=1), it does not look like the change backported last night had the desired effect; if it had, I would expect basically no events above 1 second. But let's see if the picture changes when the train rolls forward to Group 2.

Looking into this further, I was able to trigger a run actually taking a long time (3.57 seconds) in SpecialHomepage::execute: https://performance.wikimedia.org/xhgui/run/view?id=667d521c1e076ad57073b11e. This happened after I unselected a topic from a splatter of selected ones.

Other runs, which were just simple reloads, spend only a little time (0.115 seconds) in SpecialHomepage::execute; instead, almost all of the nominal wall time (4.1 seconds) comes from the deferred updates (resetting the cache for tasks): https://performance.wikimedia.org/xhgui/run/view?id=667d51e229fe83b0bc29881b
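For context, the deferred-update pattern in question looks roughly like this (a minimal sketch using MediaWiki's DeferredUpdates facility; the cache-refresh callable and its variables are hypothetical):

```php
use MediaWiki\Deferred\DeferredUpdates;

// Sketch: push the expensive suggested-edits cache refresh to after the
// response has been sent, so it does not block rendering the page itself.
// The refresh logic below is a hypothetical stand-in.
DeferredUpdates::addCallableUpdate( static function () use ( $taskSuggester, $user ) {
	// This is roughly where the ~4 s of wall time in the profile above goes:
	// re-running the task search and repopulating the cache.
	$taskSuggester->refreshCacheFor( $user );
}, DeferredUpdates::POSTSEND );
```

This would explain why the cost shows up as deferred-update wall time rather than inside SpecialHomepage::execute itself.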

So this suggests a few possible lines of inquiry to me:

  1. Why don't we regenerate the cache when changed filters are saved, rather than only when Special:Homepage is reloaded?
  2. Did the search get slower somehow?

I don't think anything about the first item changed last Tuesday, so checking in with the Search team about whether something changed on their end might be worthwhile.

That aside, I have created T368616: Special:Homepage performance issues to collect the opportunities to improve Special:Homepage performance as I notice them.

Now that the train has rolled forward to Group 2, we can tell with high confidence that it was those requests to AQS after all. As implemented in Homepage: don't load yesterdays edits on desktop, rendering times went down on desktop (and now look much more consistently lower than before):

image.png (309×778 px, 70 KB)

And stayed high on mobile:

image.png (310×779 px, 75 KB)

(or maybe even got a little bit worse there)

So tackling the "structured tasks mobile preview"-fallback seems like the main priority to fix this now.

Pinging @KStoller-WMF because that requires a product decision for how and when exactly to do that.

(Still, I continue to be unsure about what caused this in the first place. I inquired, and apparently nothing changed on the side of AQS, so something in the transport layer in between must be different since last week Tuesday.)


Thanks! I've created T368750: Newcomer Homepage: Suggested Edits (mobile preview) empty state when there are no suggested edits to document a decision and design.

@Michael there was a change on AQS explained in T366851: gocql startup times have increased between v1.2.0 and v1.6.0. After upgrading the gocql library, it seems that startup times for all services increased.

Are you still experiencing delays?

CC @Sfaci and @Milimetric

Regarding what @VirginiaPoundstone has mentioned, I wanted to add that the delay was confirmed only when the AQS service starts.

Taking a look at the exact dates around the issue you mentioned on this ticket: we merged something related to "cross-DC Cassandra client connection" on June 17 (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1043195), but the startup delay had already been there since May 29 (it was introduced by this change, where we had to set the initialDelaySeconds property to 30 seconds to give the service enough time to start -> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1037033), and we never saw a performance degradation due to that. So I would say that the delay itself is not related to this issue.

Thanks @VirginiaPoundstone and @Sfaci for looking into this!

As far as I can tell, the issues we're experiencing (and they're still ongoing) started on the 18th of May around 15:00 UTC plus-minus a few hours.

One noticeable, and also still ongoing, side-effect was that we see a lot more timeouts of our requests to AQS. You should be able to see them in this Logstash snapshot: https://logstash.wikimedia.org/goto/a28714fe427acb5915ab263e51ff88c1

Can you correlate those events to events in the instrumentation on your side?

Note that we're preparing a change to drop this dependency, so soon-ish those events are expected to stop because the code that makes the requests will be removed.

I have been exploring the changes around the 18th of May a bit more, and there is no change that we can correlate to these events. The service code hasn't been changed this year, and the only changes we have made are related to the Kubernetes configuration, as I mentioned before. The closest change, time-wise, is https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1033405, where some network policies were changed on 24th May (I guess it was deployed after that date), but I can't say whether that change could be related to this. @BTullis any idea here?

Taking a look at the Grafana dashboard for edit-analytics for the last two months, the latency seems pretty stable (there are a couple of peaks, but the rest of the chart is fine). Considering that, I don't know what is happening, but I would say that these events are not related to the service itself. Regarding the timeouts you mentioned, I'm just wondering if there is something preventing your app from reaching the service.

I think this can be closed.

With the train releasing the fix for mobile to the Group 2 wikis, we can also see the statistics for mobile improve:

image.png (310×781 px, 61 KB)

And while the effect is less stark, it is also clearly there in our client-side metrics:

image.png (313×779 px, 70 KB)

image.png (312×778 px, 89 KB)

The subtask can stay open for a little while longer to track the creation of Grafana panels that show the additional metrics that were added.


Thank you for looking into this! I guess we have to conclude that something in the transport layer in between has changed. But that is way outside of my area of expertise.