
Special:Homepage is rendered much slower (<1 sec to 2+ sec)
Closed, Resolved · Public

Description

Looking at our performance metrics as part of our chores, I noticed that about 7 days ago the rendering times for the homepage went up and stayed high: https://grafana.wikimedia.org/d/vGq7hbnMz/special-homepage-and-suggested-edits?orgId=1&from=1718582400000&to=now

image.png (305×779 px, 71 KB)

Acceptance Criteria:

  • figure out why homepage performance got much worse on June 18th / 19th.

Note:
This seems to have happened between the Group 0 and Group 1 deploys, so I somewhat doubt that it is related to the train per se. But maybe it was a caching config change or something similar?

Excimer profiles:

XHGui profiles:

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jun 25 2024, 1:44 PM

The degradation seems to have started around midnight between June 18th and June 19th.

The closest relevant entries in SAL are:

Hi, I just checked on a bare-metal debug server (mwdebug1001), and it takes 1.9 s there, so I doubt it's k8s-related.

In a different conversation, @kostajh pointed out that the requests to the AnalyticsQueryService, added in T235810 to show the site edits in the last day, might be a contributing factor. They are in fact still in use as a fallback in \GrowthExperiments\HomepageModules\SuggestedEdits::getMobileSummaryBody, and that else branch also logs a warning if we do not get something useful back from the AnalyticsQueryService. That warning is seeing a suspicious uptick as well:

image.png (395×847 px, 40 KB)
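For illustration, the fallback described above follows roughly this shape (a minimal, hypothetical sketch: the helper methods, log message, and control flow are simplified stand-ins, not the actual GrowthExperiments code):

```php
// Hypothetical, simplified sketch of the fallback described above;
// not the actual GrowthExperiments implementation.
private function getMobileSummaryBody(): string {
	// Preferred path: use the locally cached per-user data if available.
	$editCount = $this->getCachedYesterdayEditCount();

	if ( $editCount === null ) {
		// Fallback: a synchronous request to the AnalyticsQueryService (AQS)
		// for the number of site edits in the last day. Because this happens
		// during page rendering, AQS latency or timeouts add directly to the
		// Special:Homepage render time.
		$editCount = $this->fetchYesterdayEditCountFromAqs();

		if ( $editCount === null ) {
			// The warning whose uptick is visible in the screenshot above.
			$this->logger->warning( 'Could not fetch site edit count from AQS' );
			$editCount = 0;
		}
	}

	return $this->formatMobileSummary( $editCount );
}
```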


So I think a good next place to look would be to ask AQS folks if anything there changed on June 18.

I would still consider removing the edit count call in getActionData for the desktop version of the module, since it is not used there (AFAICT).

Yep, that's the plan. Also, the mobile version needs some work: we rarely show that counter there, but when we do and it is not available, it renders like this:

image.png (453×1 px, 49 KB)

which is not great either. (I'll probably make a separate task for that.)
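As a rough sketch of the change discussed above (and implemented in the patch further down), only paying for the AQS-backed count where it is actually shown could look like this — getActionData is the method name mentioned above, but the constant, helpers, and structure are hypothetical:

```php
// Hypothetical sketch of gating the edit-count lookup by rendering mode;
// only getActionData is taken from the discussion above, the rest is
// illustrative.
protected function getActionData(): array {
	$data = [
		'taskCount' => $this->getTaskCount(),
	];

	// Only the mobile summary ever shows the "edits in the last day"
	// counter, so only pay for the (potentially slow) AQS request there.
	if ( $this->getMode() === self::RENDERING_MODE_MOBILE ) {
		$data['yesterdayEditCount'] = $this->getYesterdayEditCount();
	}

	return $data;
}
```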


I'm sure the developers would be best positioned to say whether anything has changed, but as far as the AQS services themselves are concerned it doesn't seem like there have been any significant increases in latency: service-level view, REST gateway level view (easiest to read if you filter out the proton metrics)

Adding the Data-Platform team tag given that maybe something on AQS changed. (And if we can rule that out, that would also be useful to know.)

Change #1049973 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@master] Homepage: log rendering time for each module and each wiki

https://gerrit.wikimedia.org/r/1049973
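For context, a minimal sketch of what such per-module timing could look like, assuming MediaWiki's buffering StatsD data factory (the metric key, variable names, and render call here are illustrative assumptions, not the contents of the actual patch):

```php
use MediaWiki\MediaWikiServices;

// Illustrative sketch: time each homepage module render and report it as a
// StatsD timing metric, keyed by module and wiki. $modules and $mode are
// assumed to come from the surrounding special-page code.
$stats = MediaWikiServices::getInstance()->getStatsdDataFactory();
$wikiId = WikiMap::getCurrentWikiId(); // e.g. "enwiki"
$html = '';

foreach ( $modules as $name => $module ) {
	$start = microtime( true );
	$html .= $module->render( $mode );
	$elapsedMs = ( microtime( true ) - $start ) * 1000;

	// Metric key is an assumption, not the one used by the actual patch.
	$stats->timing(
		"growthExperiments.specialHomepage.modules.$name.$wikiId",
		$elapsedMs
	);
}
```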

Change #1049974 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@master] Homepage: don't load yesterdays edits on desktop

https://gerrit.wikimedia.org/r/1049974

Change #1049973 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Homepage: log rendering time for each module and each wiki

https://gerrit.wikimedia.org/r/1049973

Change #1049974 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Homepage: don't load yesterdays edits on desktop

https://gerrit.wikimedia.org/r/1049974

Change #1050002 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@wmf/1.43.0-wmf.11] Homepage: log rendering time for each module and each wiki

https://gerrit.wikimedia.org/r/1050002

Change #1050005 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@wmf/1.43.0-wmf.11] Homepage: don't load yesterdays edits on desktop

https://gerrit.wikimedia.org/r/1050005

Change #1050002 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.43.0-wmf.11] Homepage: log rendering time for each module and each wiki

https://gerrit.wikimedia.org/r/1050002

Mentioned in SAL (#wikimedia-operations) [2024-06-26T20:51:50Z] <cjming@deploy1002> Started scap: Backport for [[gerrit:1050002|Homepage: log rendering time for each module and each wiki (T368405)]]

Mentioned in SAL (#wikimedia-operations) [2024-06-26T20:55:16Z] <cjming@deploy1002> cjming, migr: Backport for [[gerrit:1050002|Homepage: log rendering time for each module and each wiki (T368405)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-06-26T21:05:51Z] <cjming@deploy1002> Finished scap: Backport for [[gerrit:1050002|Homepage: log rendering time for each module and each wiki (T368405)]] (duration: 14m 01s)

Change #1050005 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.43.0-wmf.11] Homepage: don't load yesterdays edits on desktop

https://gerrit.wikimedia.org/r/1050005

Mentioned in SAL (#wikimedia-operations) [2024-06-26T21:29:40Z] <cjming@deploy1002> Started scap: Backport for [[gerrit:1050005|Homepage: don't load yesterdays edits on desktop (T368405)]]

Mentioned in SAL (#wikimedia-operations) [2024-06-26T21:32:21Z] <cjming@deploy1002> cjming, migr: Backport for [[gerrit:1050005|Homepage: don't load yesterdays edits on desktop (T368405)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-06-26T21:38:29Z] <cjming@deploy1002> Finished scap: Backport for [[gerrit:1050005|Homepage: don't load yesterdays edits on desktop (T368405)]] (duration: 08m 48s)

Looking at my own early tracking (https://grafana.wikimedia.org/d/ff15559c-b4a2-4363-94c8-190a086b3315/michael-s-playground?from=now-24h&orgId=1&to=now&viewPanel=1), it does not look like the change backported last night had the desired effect; if it had, I would expect basically no events above 1 second. But let's see if the picture changes when the train rolls forward to Group 2.

Looking into this further, I was able to trigger a run actually taking a long time (3.57 seconds) in SpecialHomepage::execute: https://performance.wikimedia.org/xhgui/run/view?id=667d521c1e076ad57073b11e. This happened after I unselected a topic from a splatter of selected ones.

Other runs, which were just simple reloads, spend only a little time (0.115 seconds) in SpecialHomepage::execute; instead, almost all of the nominal wall time (4.1 seconds) comes from the deferred updates (resetting the cache for tasks): https://performance.wikimedia.org/xhgui/run/view?id=667d51e229fe83b0bc29881b
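For context, the deferred-update pattern in question looks roughly like this (a minimal sketch using MediaWiki's DeferredUpdates facility; the cache-refresh callable and its variables are hypothetical):

```php
use MediaWiki\Deferred\DeferredUpdates;

// Sketch: push the expensive suggested-edits cache refresh to after the
// response has been sent, so it does not block rendering the page itself.
// The refresh logic below is a hypothetical stand-in.
DeferredUpdates::addCallableUpdate( static function () use ( $taskSuggester, $user ) {
	// This is roughly where the ~4 s of wall time in the profile above goes:
	// re-running the task search and repopulating the cache.
	$taskSuggester->refreshCacheFor( $user );
}, DeferredUpdates::POSTSEND );
```

This would explain why the cost shows up as deferred-update wall time rather than inside SpecialHomepage::execute itself.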

So this suggests a few possible lines of inquiry to me:

  1. Why don't we regenerate the cache when changed filters are saved, rather than only when Special:Homepage is reloaded?
  2. Did the search get slower somehow?

I don't think anything about the first item changed last Tuesday, so checking in with the Search team about whether something changed on their end might be worthwhile.

That aside, I have created T368616: Special:Homepage performance issues to collect the opportunities to improve Special:Homepage performance as I notice them.

Now that the train has rolled forward to Group 2, we can tell with high confidence that it was those requests to AQS after all. As implemented in Homepage: don't load yesterdays edits on desktop, rendering times went down on desktop (and now look much more consistently lower than before):

image.png (309×778 px, 70 KB)

And stayed high on mobile:

image.png (310×779 px, 75 KB)

(or maybe even got a little bit worse there)

So tackling the "structured tasks mobile preview"-fallback seems like the main priority to fix this now.

Pinging @KStoller-WMF because that requires a product decision for how and when exactly to do that.

(Still, I continue to be unsure about what caused this in the first place. I inquired, and apparently nothing changed on the side of AQS, so something in the transport layer in between must be different since last week Tuesday.)


Thanks! I've created T368750: Newcomer Homepage: Suggested Edits (mobile preview) empty state when there are no suggested edits to document a decision and design.

@Michael there was a change on AQS explained in T366851: gocql startup times have increased between v1.2.0 and v1.6.0. After upgrading the gocql library, it seems that startup times for all services increased.

Are you still experiencing delays?

CC @Sfaci and @Milimetric

Regarding what @VirginiaPoundstone has mentioned, I wanted to add that the delay was confirmed only when the AQS service starts.

Taking a look at the exact dates around the issue you mentioned on this ticket: we merged something related to "cross-DC Cassandra client connection" on June 17 (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1043195), but the startup delay had already been there since May 29 (it was introduced by this change, where we had to set the initialDelaySeconds property to 30 seconds to give the service enough time to start -> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1037033), and we never saw a performance degradation due to that. So I would say that the delay itself is not related to this issue.

Thanks @VirginiaPoundstone and @Sfaci for looking into this!

As far as I can tell, the issues we're experiencing (and they're still ongoing) started on the 18th of May around 15:00 UTC plus-minus a few hours.

One noticeable, and also still ongoing, side-effect was that we see a lot more timeouts of our requests to AQS. You should be able to see them in this Logstash snapshot: https://logstash.wikimedia.org/goto/a28714fe427acb5915ab263e51ff88c1

Can you correlate those events to events in the instrumentation on your side?

Note that we're preparing a change to drop this dependency, so soon-ish those events are expected to stop because the code that makes the requests will be removed.

I have been exploring the changes around the 18th of May a bit more, and there is no change that we can correlate to these events. The service code hasn't been changed this year, and the only changes we have made are related to the Kubernetes configuration, as I mentioned before. The closest change, time-wise, is https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1033405, where some network policies were changed on 24th May (I guess it was deployed after that date), but I can't say whether that change could be related to this. @BTullis any idea here?

Taking a look at the Grafana dashboard for edit-analytics for the last two months, the latency seems pretty stable (there are a couple of peaks, but the rest of the chart is fine). Considering that, I don't know what is happening, but I would say that these events are not related to the service itself. Regarding the timeouts you mentioned, I'm just wondering if there is something preventing your app from reaching the service.

I think this can be closed.

With the train releasing the fix for mobile to the Group 2 wikis, we can also see the statistics for mobile improve:

image.png (310×781 px, 61 KB)

And while the effect is less stark, it is also clearly there in our client-side metrics:

image.png (313×779 px, 70 KB)

image.png (312×778 px, 89 KB)

The subtask can stay open for a little while longer to track the creation of Grafana panels that show the additional metrics that were added.


Thank you for looking into this! I guess we have to conclude that something in the transport layer in between has changed. But that is way outside of my area of expertise.