WDQS graph split: load data from dumps into new hosts
Closed, Resolved, Public

Description

In order to proceed on T337013, we need to do a full data reload on wdqs1022-1024, similar to T323096.

Creating this ticket to track that work.

I started the reload with the following command:

/usr/bin/python3 /usr/local/bin/test-cookbook -c 968346 sre.wdqs.data-reload --reload-data graph_split --no-depool --reason T349011 --wikidata-dump=/mnt/nfs/dumps-clouddumps1001.wikimedia.org/wikidatawiki/entities/20231016/wikidata-20231016-all-BETA.ttl.bz2 --lexemes-dump=/mnt/nfs/dumps-clouddumps1001.wikimedia.org/wikidatawiki/entities/20231013/wikidata-20231013-lexemes-BETA.ttl.bz2 wdqs1022.eqiad.wmnet

Event Timeline

@dcausse a couple of questions:

  • Are we OK to start the data load as soon as these hosts are puppetized, or are there other steps we need to do first?
  • Does each host need its data loaded, or can we load on one and data-transfer to the others?

@bking only one host has to be loaded with the full dataset.
The loading process can be started as soon as possible, but there are a few constraints:

  • once we settle on a dump to import, we will have to stick to it (if the load fails we have to continue using this same dump unless we all agree to pick a new one)
  • to stay aligned with the dumps that are imported into HDFS we must select a particular set of files: when selecting an "-all" file, the immediately preceding "lexemes" file must be taken, e.g. if wikidata-20230925-all-BETA.ttl.bz2 is taken, then wikidata-20230922-lexemes-BETA.ttl.bz2 must be taken (a small sketch of this pairing rule follows this comment)
  • the updater must never be started after the reload

More details related to dump loading are in T325114.
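For illustration, here is a minimal sketch of that pairing rule as a standalone helper (hypothetical, not part of the cookbook), assuming dump file names follow the wikidata-YYYYMMDD-{all,lexemes}-BETA.ttl.bz2 pattern:

  #!/usr/bin/env python3
  """Pick the lexemes dump that must accompany a given "-all" dump.

  Hypothetical helper, not part of the reload cookbook: given the date of the
  "-all" dump and the dates of the available lexemes dumps, choose the most
  recent lexemes dump that is not newer than the "-all" dump.
  """
  from datetime import datetime

  def matching_lexemes_dump(all_date: str, lexemes_dates: list[str]) -> str:
      """Return the lexemes date to pair with wikidata-<all_date>-all-BETA.ttl.bz2."""
      all_day = datetime.strptime(all_date, "%Y%m%d").date()
      candidates = [
          d for d in lexemes_dates
          if datetime.strptime(d, "%Y%m%d").date() <= all_day
      ]
      if not candidates:
          raise ValueError(f"no lexemes dump precedes the -all dump {all_date}")
      # YYYYMMDD strings sort chronologically, so max() is the latest preceding dump.
      return max(candidates)

  # Example from the constraint above:
  print(matching_lexemes_dump("20230925", ["20230915", "20230922", "20230929"]))
  # -> 20230922, i.e. wikidata-20230922-lexemes-BETA.ttl.bz2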

bking updated the task description.
bking moved this task from Incoming to In Progress on the Data-Platform-SRE board.

I started a data reload for hosts wdqs1022-1024. These are running in a tmux window under my user on cumin1001.

Based on T323096, we expect this process to fail multiple times, which is why we're running it on three hosts. As long as one completes, we will be able to transfer the data to the other hosts.
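For context, a rough illustration of what that transfer amounts to. This is not the production procedure (in practice it would go through an SRE data-transfer cookbook); the journal path and service name below are assumptions:

  #!/usr/bin/env python3
  """Rough illustration of copying a completed load to the remaining hosts.

  Not the production procedure: it only sketches the idea of stopping the
  service on each target, copying the Blazegraph journal from the host that
  finished the reload, then restarting. Paths and unit names are assumptions.
  """
  import subprocess

  SOURCE = "wdqs1022.eqiad.wmnet"
  TARGETS = ["wdqs1023.eqiad.wmnet", "wdqs1024.eqiad.wmnet"]
  JOURNAL = "/srv/wdqs/wikidata.jnl"  # assumed journal location
  SERVICE = "wdqs-blazegraph"         # assumed systemd unit name

  for target in TARGETS:
      # Stop Blazegraph on the target so the journal is not written to mid-copy.
      subprocess.run(["ssh", target, "sudo", "systemctl", "stop", SERVICE], check=True)
      # Copy the journal from the source host directly to the target host.
      subprocess.run(
          ["ssh", SOURCE, "rsync", "--archive", "--compress", JOURNAL, f"{target}:{JOURNAL}"],
          check=True,
      )
      subprocess.run(["ssh", target, "sudo", "systemctl", "start", SERVICE], check=True)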

Apologies for not catching this earlier:

  • to stay aligned with the dumps that are imported into HDFS we must select a particular set of files: when selecting an "-all" file, the immediately preceding "lexemes" file must be taken, e.g. if wikidata-20230925-all-BETA.ttl.bz2 is taken, then wikidata-20230922-lexemes-BETA.ttl.bz2 must be taken

This will require a change to the cookbook; I've created T349011 for this purpose.

Progress report:
wdqs1022: started reload 2023-10-24 0000 UTC. Munging finished 2023-10-26 0003 UTC. So far, we've processed 409/1104 munged files, which works out to ~37% complete over a period of ~1 week total, or ~5 days if we don't count the munging step. Assuming nothing goes wrong, we should expect this to complete in ~9 days.
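For reference, that figure is a simple linear extrapolation from the munged-file counter; a small sketch of the arithmetic, using the timestamps above and an assumed report time of 2023-10-31 0000 UTC:

  #!/usr/bin/env python3
  """Linear extrapolation of the reload ETA from the munged-file counter."""
  from datetime import datetime, timedelta

  munging_done = datetime(2023, 10, 26, 0, 3)  # munging finished (UTC)
  now = datetime(2023, 10, 31, 0, 0)           # assumed time of this report (UTC)
  processed, total = 409, 1104

  elapsed = now - munging_done
  rate = processed / elapsed.total_seconds()                 # files per second
  remaining = timedelta(seconds=(total - processed) / rate)

  print(f"{processed / total:.0%} complete")                 # 37% complete
  print(f"~{remaining.days} days remaining at this rate")    # ~8-9 days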

Gehel triaged this task as High priority. Nov 3 2023, 10:28 AM

@bking thanks for triggering the import, could you update the task description with the dump files you used? (needed because we have to explicitly keep the corresponding partition in hdfs).

Updated description as requested.

Another progress report: We are 80% (869/1104) done on the leading host (wdqs1022).

Mentioned in SAL (#wikimedia-operations) [2023-11-13T20:51:15Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on wdqs1022.eqiad.wmnet with reason: T347504

Mentioned in SAL (#wikimedia-operations) [2023-11-13T20:51:29Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on wdqs1022.eqiad.wmnet with reason: T347504

Mentioned in SAL (#wikimedia-operations) [2023-11-13T20:52:02Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on wdqs1024.eqiad.wmnet with reason: T347504

Mentioned in SAL (#wikimedia-operations) [2023-11-13T20:52:06Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on wdqs1024.eqiad.wmnet with reason: T347504

Mentioned in SAL (#wikimedia-operations) [2023-11-13T20:52:30Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on wdqs1023.eqiad.wmnet with reason: T347504

Mentioned in SAL (#wikimedia-operations) [2023-11-13T20:52:54Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on wdqs1023.eqiad.wmnet with reason: T347504

Update: The Wikidata dump finished loading on wdqs1022 ("Wikidata dump loaded in 25 days, 13:32:17.263762"). It's processing lexemes now...

Looks like the data reload for lexemes completed. @dcausse, are you able to check the data from the reload and make sure it's usable? Let me know if I can help.
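As a possible starting point for that check, a minimal sanity query against the host's local Blazegraph endpoint. The endpoint URL is an assumption based on the standard WDQS setup, and the real validation criteria are of course up to @dcausse:

  #!/usr/bin/env python3
  """Quick sanity check against the freshly loaded Blazegraph instance.

  Run on the reloaded host itself; the endpoint URL is an assumption based on
  the standard WDQS Blazegraph setup.
  """
  import json
  import urllib.parse
  import urllib.request

  ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"  # assumed local endpoint
  # A well-known entity (Douglas Adams, Q42) should have triples in a usable load.
  QUERY = "SELECT ?p ?o WHERE { <http://www.wikidata.org/entity/Q42> ?p ?o } LIMIT 5"

  url = ENDPOINT + "?" + urllib.parse.urlencode({"query": QUERY, "format": "json"})
  with urllib.request.urlopen(url, timeout=60) as resp:
      bindings = json.load(resp)["results"]["bindings"]

  # If the endpoint answers and the entity is present, the load is at least readable.
  print(f"got {len(bindings)} bindings for Q42")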

Moving to "blocked/waiting" until we have confirmation on the reload data.