[go: nahoru, domu]

Page MenuHomePhabricator

Reduce Lilypond shellouts from VisualEditor
Closed, ResolvedPublic

Description

We got paged for shellbox unavailability again today (previously tracked in T310557), and this time concluded (based on Shellbox\ShellboxError entries in logstash) that all the shellbox requests were associated with a single score edit: https://en.wikipedia.org/w/index.php?title=Pictures_at_an_Exhibition&diff=1096844863&oldid=1096808949&diffmode=source Note the shellbox activity lasted for about 23 minutes, and the edit timestamp is at the end of that interval.

The reason appears to be background parsing associated with VisualEditor. The MWExtensionDialog as used in Score has the default 0.25s debounce preview, meaning we're shelling out to Lilypond through Shellbox every quarter-second while the user is typing -- regardless of whether an existing shellout is in flight. That's reasonable for lots of parsing applications that take much less time than that, but for something as heavy as these score parses, we should extend that interval, which would have the effect of cutting down on the request rate to shellbox.

(One point of clarification: I originally thought the debounce value meant "we'll parse after the user stops typing for 250 ms." That still seemed like something we'd plausibly run into here: I imagined a user with a musical score on their desk, transcribing one or two notes at a time and then looking back at the score. But @Legoktm experimented with typing continuously, only a little slower than normal, and generated 400 requests to shellbox -- so it seems like these may be fired off every 250 ms, whether the user is still typing or not.)

We might also want to experiment with changing from a fixed rate limit to a fixed concurrency limit: that is, start a parse only if another parse isn't already in progress. That would limit the effect on the infrastructure: no matter how fast or slow a user types, their background parses tie up at most one shellbox replica at a time. (We'd want those requests to have a fairly short deadline, so that if the parse fails for whatever reason, it doesn't mean that background parsing stops for a full 60 seconds, or worse, indefinitely.)

@TheresNoTime also noted Real Time Preview will have a similar phenomenon, tracked separately at T312318.

Thanks to @jhathaway @Krinkle @Legoktm @Perryprog @TheresNoTime for their work digging into this.

Event Timeline

Legoktm renamed this task from Limit Lilypond shellouts from VisualEditor to Reduce Lilypond shellouts from VisualEditor.Jul 8 2022, 8:03 AM
Legoktm removed a subscriber: Familyfirst407.
Legoktm triaged this task as High priority.Jul 8 2022, 8:05 AM
Legoktm added a subscriber: Esanders.

I'm triaging this as high priority because it is causing temporary outages of Shellbox that are triggering pages.

@Esanders since you touched this most recently, how difficult would it be to increase the debounce time to something like 2 seconds for the score dialog?

CDanis added a subscriber: VPuffetMichel.

@Esanders @VPuffetMichel hello from SRE, just wanted to make sure this task was on your radar for a quick patch soon -- we've had multiple outages associated with this recently

One point of clarification: I originally thought the debounce value meant "we'll parse after the user stops typing for 250 ms."

It does, but 250ms is not a lot of time when you are typing. It uses OO.ui.debounce (with immediate=false):

* Return a function, that, as long as it continues to be invoked, will not
* be triggered. The function will be called after it stops being called for
* N milliseconds. If `immediate` is passed, trigger the function on the
* leading edge, instead of the trailing.

Change 813924 had a related patch set uploaded (by Esanders; author: Esanders):

[mediawiki/extensions/Score@master] Use a slower debounce for updating the live preview

https://gerrit.wikimedia.org/r/813924

Change 813924 merged by jenkins-bot:

[mediawiki/extensions/Score@master] Use a slower debounce for updating the live preview

https://gerrit.wikimedia.org/r/813924

It seems it's deployed now. Shall we close this?

I'd like to redo @Legoktm's manual test first, and make sure I can't reproduce a spike -- I'll do that later today, and close afterward.

RLazarus claimed this task.

Yep, looks much better! Closing.

Change 816873 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] shellbox: Restore replicas to 8, now that T312319 is resolved.

https://gerrit.wikimedia.org/r/816873

Change 816873 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox: Restore replicas to 8, now that T312319 is resolved.

https://gerrit.wikimedia.org/r/816873