
WMF-JobQueueComponent
Active · Public

Description

The infrastructure used by Wikimedia Foundation for storage and execution of the MediaWiki job queue.

As of July 2018, the MediaWiki JobQueue infrastructure (at WMF) in a nutshell:

  • Jobs are submitted from MediaWiki web servers to Kafka using EventBus.
  • Jobs are scheduled using ChangeProp.
  • Jobs are executed using the rpc/RunSingleJob endpoint in wmf-config, on a dedicated "jobrunner" pool of MediaWiki app servers.

Maintained by: MediaWiki-Platform-Team

Recent Activity

Yesterday

daniel moved T175146: JobQueue: Unify JobRunner entry points from Needs Further Discussion to Backlog (Triaged and Ready) on the MW-Interfaces-Team board.
Thu, Jul 18, 2:56 PM · Patch-For-Review, Security, MW-Interfaces-Team, Platform Team Workboards (Initiatives), WMF-JobQueue, TechCom-RFC (TechCom-RFC-Closed), MediaWiki-Core-JobQueue, MediaWiki-Configuration

Mon, Jul 15

Jonathanischoice added a comment to T368364: Transcodes of audio-only samples are not running for new uploads.

Hi, strangely, some of my MIDI uploads have not been transcoded (Quarry) e.g. in my list of Commons uploads, the first six are fine, but File:6-Z46B_set_class_on_C.mid and on are still not, after 3 days?

Mon, Jul 15, 2:19 AM · WMF-JobQueue, Regression, TimedMediaHandler-Transcode

Mon, Jul 1

bvibber added a comment to T368364: Transcodes of audio-only samples are not running for new uploads.

Seems to have stabilized:

Mon, Jul 1, 8:25 PM · WMF-JobQueue, Regression, TimedMediaHandler-Transcode

Fri, Jun 28

bvibber added a comment to T368364: Transcodes of audio-only samples are not running for new uploads.

Hmm, it's down under 4k entries but still high.

Fri, Jun 28, 5:01 PM · WMF-JobQueue, Regression, TimedMediaHandler-Transcode

Thu, Jun 27

Jonathanischoice added a comment to T368364: Transcodes of audio-only samples are not running for new uploads.

Hi, apologies that I don't know how to help, and not sure if it's the same problem; the manual intervention here worked for files I uploaded to Commons on Monday (e.g. File:3-5B set class on C.mid), but files I uploaded yesterday are still waiting, e.g. File:4-3 set class on C.mid. I made a quarry query for these.

Thu, Jun 27, 10:57 PM · WMF-JobQueue, Regression, TimedMediaHandler-Transcode

Wed, Jun 26

TheDJ added a comment to T368364: Transcodes of audio-only samples are not running for new uploads.

The queue was shrinking since yesterday, but has been climbing again since this morning. This doesn't make any sense. With those high-res jobs disabled, we should have more than enough capacity to catch up, should we not?

Wed, Jun 26, 5:42 PM · WMF-JobQueue, Regression, TimedMediaHandler-Transcode

Tue, Jun 25

bvibber added a comment to T368364: Transcodes of audio-only samples are not running for new uploads.

Ok, 1440p and 2160p transcodes are temporarily disabled for now until better fixes, and we did a kill of the old stuck processes. Might still take a bit to shake everything out; I'm trying to flush through all the missing audio.

Tue, Jun 25, 8:17 PM · WMF-JobQueue, Regression, TimedMediaHandler-Transcode
bvibber added a comment to T368364: Transcodes of audio-only samples are not running for new uploads.

I'm seriously considering bringing back my "chunked" scheme that would at least produce smaller, standalone jobs that encode say 10 seconds worth of video, then reassemble the final into a single video at the end. :P Main reason I haven't is that the logic needs to be able to handle missing chunks if individual ones time out or fail and that sounds like a pain, but it'll be a lot friendlier to the job queue infrastructure.

Tue, Jun 25, 6:01 PM · WMF-JobQueue, Regression, TimedMediaHandler-Transcode
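
As an illustration of the "chunked" scheme described in the comment above (this is not code from TimedMediaHandler), the following minimal sketch splits a media duration into fixed-length chunks and reports which chunks still need to be re-run before the final file can be reassembled. The names `Chunk`, `plan_chunks`, and `missing_chunks` are hypothetical, and the 10-second default only mirrors the figure mentioned in the comment.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Chunk:
    """One standalone encoding unit (hypothetical structure)."""
    index: int
    start: float  # seconds from the beginning of the source
    end: float    # seconds from the beginning of the source


def plan_chunks(duration: float, chunk_len: float = 10.0) -> List[Chunk]:
    """Split [0, duration) into consecutive chunks of at most chunk_len seconds."""
    chunks: List[Chunk] = []
    start = 0.0
    index = 0
    while start < duration:
        end = min(start + chunk_len, duration)
        chunks.append(Chunk(index, start, end))
        start = end
        index += 1
    return chunks


def missing_chunks(planned: List[Chunk], completed: Set[int]) -> List[Chunk]:
    """Return planned chunks whose index is not in the completed set.

    Reassembly should only proceed once this list is empty; otherwise the
    returned chunks are the ones to requeue after a timeout or failure.
    """
    return [c for c in planned if c.index not in completed]
```

Under these assumptions, a 25-second clip yields three chunks, and if only chunks 0 and 2 finished, `missing_chunks` returns chunk 1 as the single job to requeue.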
TheDJ added a comment to T368364: Transcodes of audio-only samples are not running for new uploads.

Funnily enough, that's exactly the kind of problem I was also concerned would pop up with the new k8s cluster T357309#9561624
Didn't know we were so susceptible to it on the old setup as well.

Tue, Jun 25, 5:58 PM · WMF-JobQueue, Regression, TimedMediaHandler-Transcode
bvibber added a comment to T368364: Transcodes of audio-only samples are not running for new uploads.

Looks like we've got a couple problems with high-res videos:

  • a bunch of 4K videos got uploaded at once and they all queued up
  • some of them are stuck! they should be timing out
  • it's also possible the audio clips are going to the wrong queue, I have to double-check this
Tue, Jun 25, 5:16 PM · WMF-JobQueue, Regression, TimedMediaHandler-Transcode
bvibber added a comment to T368364: Transcodes of audio-only samples are not running for new uploads.

The list of "active" (may or may not actually be active) includes a number of 2160p high-res videos hitting since June 21. We've also gotten reports before about certain kinds of AV1 videos slowing down the input handling, which I haven't checked for.

Tue, Jun 25, 4:36 PM · WMF-JobQueue, Regression, TimedMediaHandler-Transcode
bvibber added a comment to T368364: Transcodes of audio-only samples are not running for new uploads.

I'm bulk-adding the missing audio transcodes which should force them to run through as fast as possible between other jobs, and hopefully will handle the prioritized queue split better.

Tue, Jun 25, 4:34 PM · WMF-JobQueue, Regression, TimedMediaHandler-Transcode
bvibber added a comment to T368364: Transcodes of audio-only samples are not running for new uploads.

Live system thinks it has 9,223 items queued on commons and requeue is throttling there for now.... occasionally it goes down an item and moves on.

Tue, Jun 25, 4:29 PM · WMF-JobQueue, Regression, TimedMediaHandler-Transcode
Stashbot added a comment to T368364: Transcodes of audio-only samples are not running for new uploads.

Mentioned in SAL (#wikimedia-operations) [2024-06-25T16:23:55Z] <bvibber> running requeueTranscodes for missing audio files on commons (mwmaint1002) cf T368364

Tue, Jun 25, 4:23 PM · WMF-JobQueue, Regression, TimedMediaHandler-Transcode
bvibber added a comment to T368364: Transcodes of audio-only samples are not running for new uploads.

Batch requeueTranscodes failed on June 22 with this error:

Tue, Jun 25, 4:21 PM · WMF-JobQueue, Regression, TimedMediaHandler-Transcode
bvibber added a comment to T368364: Transcodes of audio-only samples are not running for new uploads.

Could be a backfill run but that shouldn't be interfering with anything... I'll check on it

Tue, Jun 25, 4:16 PM · WMF-JobQueue, Regression, TimedMediaHandler-Transcode
TheDJ added a comment to T368364: Transcodes of audio-only samples are not running for new uploads.

So pretty early on the 20th around 01:20 (last Thursday) it started rising. So it wasn't the train, as the train arrived on Commons on Wednesday. In SAL, I only see db maintenance T367856 going on around that time.

Tue, Jun 25, 12:59 PM · WMF-JobQueue, Regression, TimedMediaHandler-Transcode
Aklapper added a comment to T368364: Transcodes of audio-only samples are not running for new uploads.

See also T368333: No sound in MP4 video files downloaded from Commons (which is likely not related; still mentioning it for the paper trail).

Tue, Jun 25, 12:27 PM · WMF-JobQueue, Regression, TimedMediaHandler-Transcode
Ciencia_Al_Poder added a project to T368364: Transcodes of audio-only samples are not running for new uploads: WMF-JobQueue.

I see a surge in backlog for WebVideoTranscodePrioritized

Tue, Jun 25, 9:12 AM · WMF-JobQueue, Regression, TimedMediaHandler-Transcode

Jun 5 2024

Tgr added a comment to T354042: Forward X-Wikimedia-Debug header to MediaWiki jobs.

Please do open a new Phab for this, if appropriate.

Jun 5 2024, 4:21 PM · Observability-Logging, WMF-JobQueue
Wbm1058 added a comment to T354042: Forward X-Wikimedia-Debug header to MediaWiki jobs.

I enhanced one of my bots to check for, and report, errors it encounters.

Jun 5 2024, 1:23 PM · Observability-Logging, WMF-JobQueue

May 23 2024

daniel added a comment to T175146: JobQueue: Unify JobRunner entry points.

But maybe this is for RunJobs (which WMF doesn't use) rather than RunSingleJob (which I thought this task is mainly about). I very much like the idea of replacing Special:RunJobs with a (non-private) REST replacement with signed headers. Is that planned as part of this task?

May 23 2024, 8:55 PM · Patch-For-Review, Security, MW-Interfaces-Team, Platform Team Workboards (Initiatives), WMF-JobQueue, TechCom-RFC (TechCom-RFC-Closed), MediaWiki-Core-JobQueue, MediaWiki-Configuration
Krinkle added a comment to T175146: JobQueue: Unify JobRunner entry points.
  • The patches under T365752 add generic support for private modules as a new feature in REST. It seems non-trivial to maintain and support in a generic way. A simpler approach might be a feature flag specific to JobRunner, in which case one could communicate disablement via an HTTP status code from RunSingleJob, rather than by being absent from the registry. This might make sense since that way the manifest isn't variable by user, which would risk/complicate internals around caching, and e.g. generating Swagger specs as we'd need to be careful not to let the cache be infected by a jobrunner. For special pages and API modules, we don't vary their registration by permission. The pages that are installed are always existent, but they may return a permission error. Anyway, more than the registry aspect pro/con, my main question here is about cost/benefit of an early abstraction now (which delays the main work of this task, and means there will be multiple moving pieces when we deploy/test this) vs later if/when there is more than 1 user for it (which would contain the business logic in a single class, and allow us to move more rapidly).
May 23 2024, 7:33 PM · Patch-For-Review, Security, MW-Interfaces-Team, Platform Team Workboards (Initiatives), WMF-JobQueue, TechCom-RFC (TechCom-RFC-Closed), MediaWiki-Core-JobQueue, MediaWiki-Configuration

May 18 2024

gerritbot added a comment to T175146: JobQueue: Unify JobRunner entry points.

Change #1033227 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] REST: add support for private modules

https://gerrit.wikimedia.org/r/1033227

May 18 2024, 4:12 PM · Patch-For-Review, Security, MW-Interfaces-Team, Platform Team Workboards (Initiatives), WMF-JobQueue, TechCom-RFC (TechCom-RFC-Closed), MediaWiki-Core-JobQueue, MediaWiki-Configuration

May 16 2024

hnowlan closed T246389: Enable MW REST API on job runners and video scalers (for the new rest.php job executor) as Resolved.

Resolved as part of k8s migration

May 16 2024, 4:37 PM · Platform Team Workboards (Platform Engineering Reliability), serviceops, WMF-JobQueue, MediaWiki-Core-JobQueue
hnowlan closed T246389: Enable MW REST API on job runners and video scalers (for the new rest.php job executor), a subtask of T175146: JobQueue: Unify JobRunner entry points, as Resolved.
May 16 2024, 4:36 PM · Patch-For-Review, Security, MW-Interfaces-Team, Platform Team Workboards (Initiatives), WMF-JobQueue, TechCom-RFC (TechCom-RFC-Closed), MediaWiki-Core-JobQueue, MediaWiki-Configuration
Aklapper placed T246389: Enable MW REST API on job runners and video scalers (for the new rest.php job executor) up for grabs.

@hnowlan: Removing task assignee as this open task has been assigned for more than two years - see the email sent to all task assignees on 2024-04-15.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!

May 16 2024, 4:33 PM · Platform Team Workboards (Platform Engineering Reliability), serviceops, WMF-JobQueue, MediaWiki-Core-JobQueue
daniel claimed T175146: JobQueue: Unify JobRunner entry points.
May 16 2024, 3:14 PM · Patch-For-Review, Security, MW-Interfaces-Team, Platform Team Workboards (Initiatives), WMF-JobQueue, TechCom-RFC (TechCom-RFC-Closed), MediaWiki-Core-JobQueue, MediaWiki-Configuration

May 7 2024

lbowmaker moved T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" from Incoming (new tickets) to Event Platform Backlog on the Data-Engineering board.
May 7 2024, 1:12 PM · MediaWiki-Engineering, Data-Engineering, Unstewarded-production-error, User-brennen, serviceops, WMF-JobQueue, Wikimedia-production-error

Apr 28 2024

Aklapper edited Description on WMF-JobQueue.
Apr 28 2024, 8:25 PM

Apr 25 2024

FJoseph-WMF moved T175146: JobQueue: Unify JobRunner entry points from Incoming (Needs Triage) to Needs Further Discussion on the MW-Interfaces-Team board.
Apr 25 2024, 3:45 PM · Patch-For-Review, Security, MW-Interfaces-Team, Platform Team Workboards (Initiatives), WMF-JobQueue, TechCom-RFC (TechCom-RFC-Closed), MediaWiki-Core-JobQueue, MediaWiki-Configuration
daniel added a comment to T175146: JobQueue: Unify JobRunner entry points.

Quick summary of a conversation with Timo:

  • the primary protection mechanism should be "off per default, enabled on internal cluster"
  • to prevent 3rd parties from accidentally making this endpoint public, there should be a second line of defense, like a list of IPs
  • signing requests, like SpecialRunJobs does, is a good mechanism as well. As long as the signature is created by MW, it's easy. It's inconvenient to try and generate a signature outside MW.
  • relying on the host header for picking the wiki that needs to process a job should be fine, but we need to actually start setting this header in changeprop
  • we can probably just use the RunSingleJob REST endpoint that exists in the EventBus extension. Should we move it into core? 3rd parties are very unlikely to ever need it.
  • we may want to establish best practices for "private" endpoints in general
Apr 25 2024, 3:16 PM · Patch-For-Review, Security, MW-Interfaces-Team, Platform Team Workboards (Initiatives), WMF-JobQueue, TechCom-RFC (TechCom-RFC-Closed), MediaWiki-Core-JobQueue, MediaWiki-Configuration
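
To make the two defenses from the summary above concrete (an IP allowlist plus signed requests), here is a minimal, generic sketch. It does not reproduce how SpecialRunJobs or the RunSingleJob endpoint actually compute or check signatures; the secret, the network range, the parameter names, and the functions `sign_request` and `verify_request` are all hypothetical placeholders.

```python
import hashlib
import hmac
import ipaddress
from urllib.parse import urlencode

# Hypothetical shared secret and internal network; placeholders only.
SECRET_KEY = b"example-secret"
ALLOWED_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]


def sign_request(params: dict, key: bytes = SECRET_KEY) -> str:
    """Return a hex HMAC-SHA256 over the sorted, URL-encoded parameters."""
    canonical = urlencode(sorted(params.items()))
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_request(params: dict, signature: str, client_ip: str,
                   key: bytes = SECRET_KEY) -> bool:
    """Accept a request only if it comes from an allowed network
    and carries a signature matching its parameters."""
    ip = ipaddress.ip_address(client_ip)
    if not any(ip in net for net in ALLOWED_NETWORKS):
        return False  # second line of defense: caller is not on the allowlist
    expected = sign_request(params, key)
    # Constant-time comparison avoids leaking signature prefixes via timing.
    return hmac.compare_digest(expected, signature)
```

In this sketch the signature is easy to produce for any caller that holds the secret and awkward for anyone else, which matches the observation that signing is simple as long as MW itself creates the signature.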

Apr 24 2024

MSantos moved T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" from Needs Input (waiting) to Cross-team / Strategic on the MediaWiki-Engineering board.
Apr 24 2024, 2:02 PM · MediaWiki-Engineering, Data-Engineering, Unstewarded-production-error, User-brennen, serviceops, WMF-JobQueue, Wikimedia-production-error

Apr 23 2024

daniel updated subscribers of T175146: JobQueue: Unify JobRunner entry points.

Notes to self:

Apr 23 2024, 4:03 PM · Patch-For-Review, Security, MW-Interfaces-Team, Platform Team Workboards (Initiatives), WMF-JobQueue, TechCom-RFC (TechCom-RFC-Closed), MediaWiki-Core-JobQueue, MediaWiki-Configuration

Apr 22 2024

Maintenance_bot added a project to T263301: Old image unexpectedly overwritten by a revision several years later (after Internal server error): Commons.
Apr 22 2024, 9:30 PM · Commons, MediaWiki-File-management, Unstewarded-production-error, Platform Team Workboards (Clinic Duty Team), WMF-JobQueue, Wikimedia-production-error, MediaWiki-Uploading
Krinkle added a project to T263301: Old image unexpectedly overwritten by a revision several years later (after Internal server error): MediaWiki-File-management.
Apr 22 2024, 9:01 PM · Commons, MediaWiki-File-management, Unstewarded-production-error, Platform Team Workboards (Clinic Duty Team), WMF-JobQueue, Wikimedia-production-error, MediaWiki-Uploading

Apr 19 2024

Ottomata added a comment to T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable".

I fear I read that task, the way it is written at least, differently.

Apr 19 2024, 9:02 PM · MediaWiki-Engineering, Data-Engineering, Unstewarded-production-error, User-brennen, serviceops, WMF-JobQueue, Wikimedia-production-error
Clement_Goubert closed T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes as Resolved.

Marking this resolved as you just confirmed a big file upload going through correctly. Thanks for your help in debugging this!

Apr 19 2024, 4:25 PM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management
Bawolff added a comment to T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes.

Seems like this is fixed! We just had File:OBR_Hafignnover_6-2024.webm (4.88 GB) successfully uploaded (req-id: c1a5aab7-6e6b-4ef5-b20d-f9ddb095577f ). The job took 5 minutes 20 seconds to complete, so went beyond the previous 202 second limit.

Apr 19 2024, 4:22 PM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management

Apr 18 2024

akosiaris added a comment to T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable".

For replicating state changes (T120242) [...]

Why though? Why is 99.9999% (or 99.999999% or 99.99%) not enough?

There is a "Why do we need this?" section in T120242's description. Let's keep this discussion there?

Apr 18 2024, 3:37 PM · MediaWiki-Engineering, Data-Engineering, Unstewarded-production-error, User-brennen, serviceops, WMF-JobQueue, Wikimedia-production-error
Ottomata added a comment to T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable".

Replied at T120242#9726131

Apr 18 2024, 12:36 PM · MediaWiki-Engineering, Data-Engineering, Unstewarded-production-error, User-brennen, serviceops, WMF-JobQueue, Wikimedia-production-error
Ladsgroup added a comment to T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable".

search index not getting updated in 0.001% of edits

Search is probably fine.

Apr 18 2024, 12:23 PM · MediaWiki-Engineering, Data-Engineering, Unstewarded-production-error, User-brennen, serviceops, WMF-JobQueue, Wikimedia-production-error
Clement_Goubert added a comment to T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes.

[...]
So it seems like two separate issues.

I guess sometimes the job runner pod gets terminated in the middle of a job. That would be fine if something like https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1008403 got merged.

Apr 18 2024, 11:58 AM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management
Clement_Goubert added a comment to T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes.

request_terminate_timeout for mw-jobrunner should now be set to 86400, as it was on bare metal.

Apr 18 2024, 11:33 AM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management
gerritbot added a comment to T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes.

Change #1021427 merged by jenkins-bot:

[operations/deployment-charts@master] mw-debug: fix php.timeout

https://gerrit.wikimedia.org/r/1021427

Apr 18 2024, 11:31 AM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management
Ottomata added a comment to T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable".

search index not getting updated in 0.001% of edits

Apr 18 2024, 11:30 AM · MediaWiki-Engineering, Data-Engineering, Unstewarded-production-error, User-brennen, serviceops, WMF-JobQueue, Wikimedia-production-error
Stashbot added a comment to T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes.

Mentioned in SAL (#wikimedia-operations) [2024-04-18T11:29:19Z] <cgoubert@deploy1002> Finished scap: Redeploy mw-on-k8s with full rebuild - Fix setting php.timeout - T358308 (duration: 37m 04s)

Apr 18 2024, 11:29 AM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management
gerritbot added a comment to T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes.

Change #1021427 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-debug: fix php.timeout

https://gerrit.wikimedia.org/r/1021427

Apr 18 2024, 11:02 AM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management
Stashbot added a comment to T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes.

Mentioned in SAL (#wikimedia-operations) [2024-04-18T10:52:15Z] <cgoubert@deploy1002> Started scap: Redeploy mw-on-k8s with full rebuild - Fix setting php.timeout - T358308

Apr 18 2024, 10:52 AM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management
gerritbot added a comment to T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes.

Change #1021418 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Fix php.timeout

https://gerrit.wikimedia.org/r/1021418

Apr 18 2024, 10:50 AM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management