Dumps-Generation (Component)
Active · Public

Members (10)

Details

Description

This project covers generation of XML and SQL dumps of Wikimedia wikis, plus RDF and JSON dumps of Wikidata. For currently available HTML dumps, see Wikimedia Enterprise.

This project does not cover web server issues, bandwidth problems, formatting of HTML files, rsyncing them elsewhere, and so on. It also does not cover other files archived and served on the dataset hosts.

Example issues that fall under this project: problems with dump content, dump runs that do not start or are incomplete, discussions of new formats, or a redesign of the whole architecture.

See also: Datasets-General-or-Unknown

Recent Activity

Today

JJMC89 added a comment to T368098: Dumps generation without prefetch cause disruption to the production environment.

Even when it doesn't page, the increased lag causes bots that respect a reasonable maxlag to not be able to function. T368098#9914664
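
For readers unfamiliar with maxlag: a client can send `maxlag=N` with a MediaWiki Action API request, and when replication lag exceeds N seconds the API answers with error code `maxlag` and a `Retry-After` header instead of doing the work. The following is only a minimal sketch of the client-side decision, assuming the response body has already been parsed as JSON; the function name `maxlag_retry_delay` and the 5-second default are illustrative assumptions, not taken from any particular bot framework.

```python
from typing import Optional


def maxlag_retry_delay(
    body: dict, retry_after: Optional[str], default: float = 5.0
) -> Optional[float]:
    """Return how many seconds to wait if the API refused the request
    because of replication lag, or None if the request was not refused
    for that reason.

    body        -- parsed JSON response from the Action API
    retry_after -- value of the Retry-After response header, if any
    default     -- hypothetical fallback delay when the header is
                   missing or not a plain number of seconds
    """
    error = body.get("error")
    if not isinstance(error, dict) or error.get("code") != "maxlag":
        return None
    if retry_after is None:
        return default
    try:
        return max(float(retry_after), 0.0)
    except ValueError:
        # Header present but not numeric: fall back to the default.
        return default
```

A bot built this way keeps sleeping and retrying while the lag stays above its threshold, which is why sustained lag stops it from making edits at all.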

Mon, Sep 2, 5:56 PM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Patch-For-Review, Dumps-Generation, SRE
Ladsgroup added a comment to T368098: Dumps generation without prefetch cause disruption to the production environment.
| 417670674 | wikiadmin2023   | 10.64.0.157:44926    | enwiki | Query     |       3 | Creating sort index                            | SELECT /* WikiExporter::dumpPages  */  /*! STRAIGHT_JOIN */ rev_id,rev_page>
| 417670695 | wikiadmin2023   | 10.64.0.157:44978    | enwiki | Sleep     |       1 |                                                | NULL                                                                       >
| 417670696 | wikiadmin2023   | 10.64.0.157:44992    | enwiki | Query     |       1 | Creating sort index                            | SELECT /* WikiExporter::dumpPages  */  /*! STRAIGHT_JOIN */ rev_id,rev_page>
| 417670805 | wikiadmin2023   | 10.64.0.157:43284    | enwiki | Sleep     |      11 |                                                | NULL                                                                       >
| 417670807 | wikiadmin2023   | 10.64.0.157:43292    | enwiki | Query     |      11 | Creating sort index                            | SELECT /* WikiExporter::dumpPages  */  /*! STRAIGHT_JOIN */ rev_id,rev_page>
| 417670842 | wikiadmin2023   | 10.64.0.157:43328    | enwiki | Sleep     |       8 |                                                | NULL                                                                       >
| 417670843 | wikiadmin2023   | 10.64.0.157:43334    | enwiki | Query     |       8 | Creating sort index                            | SELECT /* WikiExporter::dumpPages  */  /*! STRAIGHT_JOIN */ rev_id,rev_page>
| 417670858 | cumin2024       | 10.64.48.98:48540    | NULL   | Query     |       0 | starting                                       | show processlist                                                           >
| 417670962 | wikiadmin2023   | 10.64.0.157:43366    | enwiki | Sleep     |       2 |                                                | NULL                                                                       >
| 417670963 | wikiadmin2023   | 10.64.0.157:43376    | enwiki | Query     |       2 | Creating sort index                            | SELECT /* WikiExporter::dumpPages  */  /*! STRAIGHT_JOIN */ rev_id,rev_page>
| 417670973 | wikiadmin2023   | 10.64.0.157:58688    | enwiki | Sleep     |       1 |
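
A pasted process list like the one above can be tallied by command and state to see how many dump queries are running at once. This is a rough sketch that assumes the standard `SHOW PROCESSLIST` column order (Id, User, Host, db, Command, Time, State, Info) and tolerates rows cut off by the pager; the function name `summarize_processlist` is hypothetical.

```python
from collections import Counter


def summarize_processlist(text: str) -> Counter:
    """Count rows of pasted SHOW PROCESSLIST output by (Command, State).

    Rows are expected to look like '| Id | User | Host | db | Command |
    Time | State | Info ...'. Header lines, separators, and rows that are
    truncated before the Time column are skipped.
    """
    counts: Counter = Counter()
    for line in text.splitlines():
        if not line.startswith("|"):
            continue
        # Limit the split so that any '|' inside the Info column is kept.
        parts = [p.strip() for p in line.split("|", 8)]
        if len(parts) < 7 or not parts[1].isdigit():
            continue
        command = parts[5]
        state = parts[7] if len(parts) > 7 else ""
        counts[(command, state)] += 1
    return counts
```

Applied to the excerpt above, the `("Query", "Creating sort index")` key counts the concurrent `WikiExporter::dumpPages` queries.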
Mon, Sep 2, 5:54 PM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Patch-For-Review, Dumps-Generation, SRE
Ladsgroup added a comment to T368098: Dumps generation without prefetch cause disruption to the production environment.

It just caused a page

Mon, Sep 2, 5:51 PM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Patch-For-Review, Dumps-Generation, SRE
Ladsgroup added a comment to T368098: Dumps generation without prefetch cause disruption to the production environment.

Hi, this has caused ~12 alerts just since this weekend (https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-data-persistence/20240901.txt and https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-data-persistence/20240902.txt). The prefetch is fully done so that's not really the issue here. Can you do something about it?

Mon, Sep 2, 4:58 PM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Patch-For-Review, Dumps-Generation, SRE

Sat, Aug 31

Reedy added a subtask for T319432: Migrate WMF production from PHP 7.4 to PHP 8.1: T373752: Build php-uuid package, and add to WMF production and CI.
Sat, Aug 31, 9:24 PM · Dumps-Generation, MediaWiki-Platform-Team, serviceops

Fri, Aug 30

karapayneWMDE edited projects for T197090: [CLIENT][SW] Wikidata qid of articles is not present in export/dump, added: Wikidata Integration in Wikimedia projects; removed wmde-wikidata-tech.
Fri, Aug 30, 9:01 AM · Wikidata Integration in Wikimedia projects, wmde-wikidata-tech, Wikidata, MediaWiki-extensions-WikibaseClient, MediaWiki-Core-Snapshots, Dumps-Generation

Wed, Aug 28

VirginiaPoundstone moved T352650: Migrate current-generation dumps to run from our containerized images from NEEDS DISCUSSION to Radar (other teams) on the Data Products board.
Wed, Aug 28, 5:56 PM · Data Products, Data-Platform-SRE, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops

Wed, Aug 21

Bugreporter edited Description on Dumps-Generation.
Wed, Aug 21, 5:50 AM
ABran-WMF closed T372961: db1206 depooled, high replication lag, a subtask of T368098: Dumps generation without prefetch cause disruption to the production environment, as Resolved.
Wed, Aug 21, 5:42 AM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Patch-For-Review, Dumps-Generation, SRE
Marostegui added a subtask for T368098: Dumps generation without prefetch cause disruption to the production environment: T372961: db1206 depooled, high replication lag.
Wed, Aug 21, 4:18 AM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Patch-For-Review, Dumps-Generation, SRE
Marostegui added a comment to T368098: Dumps generation without prefetch cause disruption to the production environment.

This has caused another page in production T372961

Wed, Aug 21, 4:01 AM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Patch-For-Review, Dumps-Generation, SRE

Wed, Aug 14

Niharika added a comment to T365693: Provide attribute to indicate that user is temporary account in exported content.

@VirginiaPoundstone @lbowmaker who owns the decision on this task?

Wed, Aug 14, 4:36 PM · Temporary accounts (Blockers to pilot wiki deployment), Data Pipelines, Data-Engineering, Dumps-Generation, Data Products, MediaWiki-Core-Snapshots
kostajh edited projects for T365693: Provide attribute to indicate that user is temporary account in exported content, added: Temporary accounts (Blockers to pilot wiki deployment); removed Temporary accounts.
Wed, Aug 14, 4:17 PM · Temporary accounts (Blockers to pilot wiki deployment), Data Pipelines, Data-Engineering, Dumps-Generation, Data Products, MediaWiki-Core-Snapshots
Gehel moved T352650: Migrate current-generation dumps to run from our containerized images from Incoming to Watching on the Data-Platform-SRE board.
Wed, Aug 14, 8:43 AM · Data Products, Data-Platform-SRE, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops

Tue, Aug 13

Maintenance_bot removed a project from T265056: Make Cirrus Search dump script more resilient to failures (elasticsearch restarts): Patch-For-Review.
Tue, Aug 13, 8:30 PM · MW-1.40-notes (1.40.0-wmf.19; 2023-01-16), Discovery-Search (Current work), CirrusSearch, Dumps-Generation
gerritbot added a comment to T265056: Make Cirrus Search dump script more resilient to failures (elasticsearch restarts).

Change #856655 merged by Ryan Kemper:

[operations/puppet@production] snapshot: Remove absented cirrus dump job

https://gerrit.wikimedia.org/r/856655

Tue, Aug 13, 8:07 PM · MW-1.40-notes (1.40.0-wmf.19; 2023-01-16), Discovery-Search (Current work), CirrusSearch, Dumps-Generation

Mon, Aug 12

xcollazo added a comment to T352650: Migrate current-generation dumps to run from our containerized images.
  • If I'm understanding correctly, people are thinking that we cutover from bare metal execution to use of container entrypoints for the jobs, and that we aren't necessarily running bare metal side-by-side with containers. This seems like a risk, at least if we consider the possibility that new stuff may be more prone to breakage and that such breakage could lead to lagged data delivery which can have a compounding effect in the infrastructure.
Mon, Aug 12, 5:49 PM · Data Products, Data-Platform-SRE, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops

Thu, Aug 8

dr0ptp4kt updated subscribers of T352650: Migrate current-generation dumps to run from our containerized images.

Following up on some discussions:

Thu, Aug 8, 3:51 PM · Data Products, Data-Platform-SRE, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops

Aug 2 2024

Marostegui added a comment to T368098: Dumps generation without prefetch cause disruption to the production environment.

We just had another lag spike caused by dumps on enwiki: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-job=All&var-server=db1206&var-port=9104&viewPanel=6&from=1722603665043&to=1722604675864
It wasn't as impactful as before but it did cause some of our alerts to fire up as it went over a minute of lag, although it recovered quickly

Aug 2 2024, 1:20 PM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Patch-For-Review, Dumps-Generation, SRE

Aug 1 2024

dr0ptp4kt added a comment to T352650: Migrate current-generation dumps to run from our containerized images.

To confirm understanding, did we have a leaning on whether the containerized version would pin to an older version of MediaWiki versus whether it would need to keep getting MediaWiki updates?

Aug 1 2024, 9:26 PM · Data Products, Data-Platform-SRE, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
Lucas_Werkmeister_WMDE closed T370050: Some Wikidata + MediaInfo dumps missing for week of 2024-07-08 as Resolved.
Aug 1 2024, 1:13 PM · Wikidata Dev Team (Wikidata.org Slice), Data-Platform, wmde-wikidata-tech, Wikidata, Dumps-Generation

Jul 29 2024

Milimetric added a comment to T352650: Migrate current-generation dumps to run from our containerized images.

Just for the record, we met and discussed @Joe's proposal (this task's description) and were in general agreement that it's the best way forward. We have follow-up discussions to have and coordination to do, but we're aligned on the idea.

Jul 29 2024, 3:04 PM · Data Products, Data-Platform-SRE, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops

Jul 26 2024

VirginiaPoundstone moved T352650: Migrate current-generation dumps to run from our containerized images from Incoming to NEEDS DISCUSSION on the Data Products board.
Jul 26 2024, 4:08 PM · Data Products, Data-Platform-SRE, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops

Jul 24 2024

bd808 added a subtask for T319432: Migrate WMF production from PHP 7.4 to PHP 8.1: T370934: Build and publish multiple MediaWiki production images for a given set of PHP versions.
Jul 24 2024, 7:47 PM · Dumps-Generation, MediaWiki-Platform-Team, serviceops
Ottomata added a comment to T352650: Migrate current-generation dumps to run from our containerized images.

I think it will be quite a while before we are fully able to decom Dumps 1. This task will unblock SRE's MW to k8s migration, and will allow them to remove all the complicated puppet and scap code supporting the bare metal MW deployment.

Jul 24 2024, 4:50 PM · Data Products, Data-Platform-SRE, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
WDoranWMF edited projects for T368098: Dumps generation without prefetch cause disruption to the production environment, added: Dumps 2.0; removed Data Products (Data Products Sprint 16), Data-Engineering.
Jul 24 2024, 4:15 PM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Patch-For-Review, Dumps-Generation, SRE

Jul 23 2024

Krinkle updated the task description for T319432: Migrate WMF production from PHP 7.4 to PHP 8.1.
Jul 23 2024, 9:49 PM · Dumps-Generation, MediaWiki-Platform-Team, serviceops
Krinkle updated the task description for T319432: Migrate WMF production from PHP 7.4 to PHP 8.1.
Jul 23 2024, 9:46 PM · Dumps-Generation, MediaWiki-Platform-Team, serviceops
Krinkle removed a subtask for T319432: Migrate WMF production from PHP 7.4 to PHP 8.1: T290536: Serve production traffic via Kubernetes.
Jul 23 2024, 9:41 PM · Dumps-Generation, MediaWiki-Platform-Team, serviceops
Krinkle updated the task description for T319432: Migrate WMF production from PHP 7.4 to PHP 8.1.
Jul 23 2024, 9:41 PM · Dumps-Generation, MediaWiki-Platform-Team, serviceops
Krinkle attached a referenced file: F56618452: Screenshot 2024-07-23 at 14.46.28.png.
Jul 23 2024, 1:47 PM · Dumps-Generation, MediaWiki-Platform-Team, serviceops
Krinkle attached a referenced file: F56618453: Screenshot 2024-07-23 at 14.47.10.png.
Jul 23 2024, 1:47 PM · Dumps-Generation, MediaWiki-Platform-Team, serviceops

Jul 22 2024

PeterBowman added a comment to T365155: Text id verification makes dumps skip many good rows.

Got it, thank you for the clarification!

Jul 22 2024, 6:38 PM · Data Products (Data Products Sprint 16), Dumps-Generation
xcollazo added a comment to T365155: Text id verification makes dumps skip many good rows.

Hello. This is an empty text entry in wikidatawiki-20240701-pages-articles-multistream6.xml-p4469005p5969004.bz2 as downloaded from https://dumps.wikimedia.org/wikidatawiki/20240701/:

<page>
  <title>Q6157973</title>
  <ns>0</ns>
  <id>5952866</id>
  <revision>
    <id>2136476381</id>
    <parentid>2045646165</parentid>
    <timestamp>2024-04-24T18:13:17Z</timestamp>
    <contributor>
      <username>William Avery Bot</username>
      <id>2964320</id>
    </contributor>
    <comment>/* wbeditentity-update:0| */ Changing runeberg.org URLs to https (×7). See [[Wikidata:Requests_for_permissions/Bot/William_Avery_Bot_11|request for permission]]</comment>
    <origin>2136476381</origin>
    <model>wikibase-item</model>
    <format>application/json</format>
    <text bytes="90348" sha1="aimrgeqjzqp6d3oz5qc8awzr5nhi9da" />
    <sha1>aimrgeqjzqp6d3oz5qc8awzr5nhi9da</sha1>
  </revision>
</page>

Shouldn't this be already fixed with the deployed patch?
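
One way to look for other entries like the one above is to stream a dump file and flag revisions whose `<text>` element is empty while its `bytes` attribute is greater than zero. The sketch below uses only the Python standard library; the function name `empty_text_revisions` is hypothetical, and the check is an assumption about what counts as a bad row, not the verification logic used by the dumps code.

```python
import io
import xml.etree.ElementTree as ET


def _local(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def empty_text_revisions(xml_source):
    """Yield (page_title, revision_id, declared_bytes) for every revision
    whose <text> element has no content although its 'bytes' attribute
    is greater than zero.

    xml_source may be a file name or a binary file object.
    """
    title = None
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        name = _local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            rev_id = None
            text_elem = None
            for child in elem:  # direct children only
                child_name = _local(child.tag)
                if child_name == "id" and rev_id is None:
                    rev_id = child.text
                elif child_name == "text":
                    text_elem = child
            if text_elem is not None:
                declared = int(text_elem.get("bytes", "0"))
                if declared > 0 and not (text_elem.text or "").strip():
                    yield (title, rev_id, declared)
        elif name == "page":
            elem.clear()  # release the finished page's children


if __name__ == "__main__":
    sample = io.BytesIO(
        b"<mediawiki><page><title>Q6157973</title><id>5952866</id>"
        b"<revision><id>2136476381</id>"
        b'<text bytes="90348" sha1="aimrgeqjzqp6d3oz5qc8awzr5nhi9da" />'
        b"</revision></page></mediawiki>"
    )
    print(list(empty_text_revisions(sample)))
```

For a compressed dump, a file object from `bz2.open(path, "rb")` can be passed in place of the in-memory sample.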

Jul 22 2024, 6:18 PM · Data Products (Data Products Sprint 16), Dumps-Generation
Joe added a comment to T352650: Migrate current-generation dumps to run from our containerized images.

Sorry to ask this very basic question, but I found a bunch of others didn't know: how exactly is Dumps blocking the php 8 upgrade? Like, if we leave everything exactly as-is and just upgrade PHP, would it not run the way it is currently set up? On surface I see no big difference between the current setup and a containerized MW running on the same servers, so I'm curious about the nuance I'm missing here.

Jul 22 2024, 3:00 PM · Data Products, Data-Platform-SRE, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
Ottomata added a comment to T352650: Migrate current-generation dumps to run from our containerized images.

IIUC, the PHP 8 issue will be the same with containerized MW. I also don't know exactly what it is. :)

Jul 22 2024, 12:55 PM · Data Products, Data-Platform-SRE, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
Milimetric added a comment to T352650: Migrate current-generation dumps to run from our containerized images.

Sorry to ask this very basic question, but I found a bunch of others didn't know: how exactly is Dumps blocking the php 8 upgrade? Like, if we leave everything exactly as-is and just upgrade PHP, would it not run the way it is currently set up? On surface I see no big difference between the current setup and a containerized MW running on the same servers, so I'm curious about the nuance I'm missing here.

Jul 22 2024, 12:50 PM · Data Products, Data-Platform-SRE, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops

Jul 21 2024

PeterBowman added a comment to T365155: Text id verification makes dumps skip many good rows.

Hello. This is an empty text entry in wikidatawiki-20240701-pages-articles-multistream6.xml-p4469005p5969004.bz2 as downloaded from https://dumps.wikimedia.org/wikidatawiki/20240701/:

Jul 21 2024, 7:09 PM · Data Products (Data Products Sprint 16), Dumps-Generation

Jul 18 2024

VirginiaPoundstone added a project to T352650: Migrate current-generation dumps to run from our containerized images: Data Products.
Jul 18 2024, 3:15 PM · Data Products, Data-Platform-SRE, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
VirginiaPoundstone added a comment to T352650: Migrate current-generation dumps to run from our containerized images.

@Joe thanks for the ping. Just to keep you posted about our progress: Will is out this week but he and Data SRE are in conversation to see what we can accomplish. Should know more mid next week.

Jul 18 2024, 3:14 PM · Data Products, Data-Platform-SRE, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
Maintenance_bot moved T370050: Some Wikidata + MediaInfo dumps missing for week of 2024-07-08 from [DOT] Prioritized to Ongoing on the wmde-wikidata-tech board.
Jul 18 2024, 1:29 PM · Wikidata Dev Team (Wikidata.org Slice), Data-Platform, wmde-wikidata-tech, Wikidata, Dumps-Generation
ItamarWMDE moved T370050: Some Wikidata + MediaInfo dumps missing for week of 2024-07-08 from Incoming to [DOT] Prioritized on the wmde-wikidata-tech board.
Jul 18 2024, 1:03 PM · Wikidata Dev Team (Wikidata.org Slice), Data-Platform, wmde-wikidata-tech, Wikidata, Dumps-Generation
Lucas_Werkmeister_WMDE added a comment to T370050: Some Wikidata + MediaInfo dumps missing for week of 2024-07-08.

Looks like the dumps are starting to come back \o/

Jul 18 2024, 12:59 PM · Wikidata Dev Team (Wikidata.org Slice), Data-Platform, wmde-wikidata-tech, Wikidata, Dumps-Generation
Lucas_Werkmeister_WMDE moved T370050: Some Wikidata + MediaInfo dumps missing for week of 2024-07-08 from In Task Breakdown to Ready for Tech Verification on the Wikidata Dev Team (Wikidata.org Slice) board.
Jul 18 2024, 12:58 PM · Wikidata Dev Team (Wikidata.org Slice), Data-Platform, wmde-wikidata-tech, Wikidata, Dumps-Generation
Lucas_Werkmeister_WMDE added a project to T370050: Some Wikidata + MediaInfo dumps missing for week of 2024-07-08: Wikidata Dev Team (Wikidata.org Slice).
Jul 18 2024, 12:58 PM · Wikidata Dev Team (Wikidata.org Slice), Data-Platform, wmde-wikidata-tech, Wikidata, Dumps-Generation

Jul 17 2024

VirginiaPoundstone moved T368098: Dumps generation without prefetch cause disruption to the production environment from In Process to Paused on the Data Products (Data Products Sprint 16) board.
Jul 17 2024, 4:17 PM · Dumps 2.0, MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Patch-For-Review, Dumps-Generation, SRE
Gehel added a project to T352650: Migrate current-generation dumps to run from our containerized images: Data-Platform-SRE.
Jul 17 2024, 2:09 PM · Data Products, Data-Platform-SRE, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops

Jul 16 2024

VirginiaPoundstone moved T365693: Provide attribute to indicate that user is temporary account in exported content from Metrics Platform Backlog to Pipelines Backlog on the Data Products board.
Jul 16 2024, 4:35 PM · Temporary accounts (Blockers to pilot wiki deployment), Data Pipelines, Data-Engineering, Dumps-Generation, Data Products, MediaWiki-Core-Snapshots
VirginiaPoundstone moved T365693: Provide attribute to indicate that user is temporary account in exported content from Pipelines Backlog to Metrics Platform Backlog on the Data Products board.
Jul 16 2024, 4:33 PM · Temporary accounts (Blockers to pilot wiki deployment), Data Pipelines, Data-Engineering, Dumps-Generation, Data Products, MediaWiki-Core-Snapshots

Jul 15 2024

BTullis added a comment to T370050: Some Wikidata + MediaInfo dumps missing for week of 2024-07-08.

I can give a status update here, which I hope will be useful.

Jul 15 2024, 5:15 PM · Wikidata Dev Team (Wikidata.org Slice), Data-Platform, wmde-wikidata-tech, Wikidata, Dumps-Generation