WDQS graph split: load data from dumps into new hosts
Closed, Resolved, Public

Description

In order to proceed on T337013, we need to do a full data reload on wdqs1022-1024, similar to T323096.

Creating this ticket to track that work.

I started the reload with the following command:

/usr/bin/python3 /usr/local/bin/test-cookbook -c 968346 sre.wdqs.data-reload --reload-data graph_split --no-depool --reason T349011 --wikidata-dump=/mnt/nfs/dumps-clouddumps1001.wikimedia.org/wikidatawiki/entities/20231016/wikidata-20231016-all-BETA.ttl.bz2 --lexemes-dump=/mnt/nfs/dumps-clouddumps1001.wikimedia.org/wikidatawiki/entities/20231013/wikidata-20231013-lexemes-BETA.ttl.bz2 wdqs1022.eqiad.wmnet

Event Timeline

@dcausse a couple of questions:

  • Are we OK to start the data load as soon as these hosts are puppetized, or are there other steps we need to do first?
  • Does each host need its data loaded, or can we load on one and data-transfer to the others?

@bking only one host has to be loaded with the full dataset.
The loading process can be started as soon as possible, but there are a few constraints:

  • once we settle on a dump to import, we will have to stick to it (if the load fails we have to continue using this same dump unless we all agree to pick a new one)
  • to stay aligned with the dumps that are imported into HDFS we must select a particular set of files: when selecting an "-all" file, the immediately preceding "lexemes" file must be taken, e.g. if wikidata-20230925-all-BETA.ttl.bz2 is taken, then wikidata-20230922-lexemes-BETA.ttl.bz2 must be taken (a small sketch of this pairing rule follows this comment)
  • the updater must never be started after the reload

More details related to dump loading are in T325114.
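For illustration, here is a minimal sketch of that pairing rule as a standalone helper (hypothetical, not part of the cookbook), assuming dump file names follow the wikidata-YYYYMMDD-{all,lexemes}-BETA.ttl.bz2 pattern:

  #!/usr/bin/env python3
  """Pick the lexemes dump that must accompany a given "-all" dump.

  Hypothetical helper, not part of the reload cookbook: given the date of the
  "-all" dump and the dates of the available lexemes dumps, choose the most
  recent lexemes dump that is not newer than the "-all" dump.
  """
  from datetime import datetime

  def matching_lexemes_dump(all_date: str, lexemes_dates: list[str]) -> str:
      """Return the lexemes date to pair with wikidata-<all_date>-all-BETA.ttl.bz2."""
      all_day = datetime.strptime(all_date, "%Y%m%d").date()
      candidates = [
          d for d in lexemes_dates
          if datetime.strptime(d, "%Y%m%d").date() <= all_day
      ]
      if not candidates:
          raise ValueError(f"no lexemes dump precedes the -all dump {all_date}")
      # YYYYMMDD strings sort chronologically, so max() is the latest preceding dump.
      return max(candidates)

  # Example from the constraint above:
  print(matching_lexemes_dump("20230925", ["20230915", "20230922", "20230929"]))
  # -> 20230922, i.e. wikidata-20230922-lexemes-BETA.ttl.bz2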

bking updated the task description.
bking moved this task from Incoming to In Progress on the Data-Platform-SRE board.

I started a data reload for hosts wdqs1022-1024. These are running in a tmux window under my user on cumin1001.

Based on T323096, we expect this process to fail multiple times, which is why we're running it on three hosts. As long as one completes, we will be able to transfer the data to the other hosts.
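For context, a rough illustration of what that transfer amounts to. This is not the production procedure (in practice it would go through an SRE data-transfer cookbook); the journal path and service name below are assumptions:

  #!/usr/bin/env python3
  """Rough illustration of copying a completed load to the remaining hosts.

  Not the production procedure: it only sketches the idea of stopping the
  service on each target, copying the Blazegraph journal from the host that
  finished the reload, then restarting. Paths and unit names are assumptions.
  """
  import subprocess

  SOURCE = "wdqs1022.eqiad.wmnet"
  TARGETS = ["wdqs1023.eqiad.wmnet", "wdqs1024.eqiad.wmnet"]
  JOURNAL = "/srv/wdqs/wikidata.jnl"  # assumed journal location
  SERVICE = "wdqs-blazegraph"         # assumed systemd unit name

  for target in TARGETS:
      # Stop Blazegraph on the target so the journal is not written to mid-copy.
      subprocess.run(["ssh", target, "sudo", "systemctl", "stop", SERVICE], check=True)
      # Copy the journal from the source host directly to the target host.
      subprocess.run(
          ["ssh", SOURCE, "rsync", "--archive", "--compress", JOURNAL, f"{target}:{JOURNAL}"],
          check=True,
      )
      subprocess.run(["ssh", target, "sudo", "systemctl", "start", SERVICE], check=True)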

Apologies for not catching this earlier:

  • to stay aligned with the dumps that are imported into HDFS we must select a particular set of files: when selecting an "-all" file, the immediately preceding "lexemes" file must be taken, e.g. if wikidata-20230925-all-BETA.ttl.bz2 is taken, then wikidata-20230922-lexemes-BETA.ttl.bz2 must be taken

This will require a change to the cookbook; I've created T349011 for this purpose.

Progress report:
wdqs1022: started reload 2023-10-24 0000 UTC. Munging finished 2023-10-26 0003 UTC. So far, we've processed 409/1104 munged files, which works out to ~37% complete over a period of ~1 week total, or ~5 days if we don't count the munging step. Assuming nothing goes wrong, we should expect this to complete in ~9 days.
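For reference, that figure is a simple linear extrapolation from the munged-file counter; a small sketch of the arithmetic, using the timestamps above and an assumed report time of 2023-10-31 0000 UTC:

  #!/usr/bin/env python3
  """Linear extrapolation of the reload ETA from the munged-file counter."""
  from datetime import datetime, timedelta

  munging_done = datetime(2023, 10, 26, 0, 3)  # munging finished (UTC)
  now = datetime(2023, 10, 31, 0, 0)           # assumed time of this report (UTC)
  processed, total = 409, 1104

  elapsed = now - munging_done
  rate = processed / elapsed.total_seconds()                 # files per second
  remaining = timedelta(seconds=(total - processed) / rate)

  print(f"{processed / total:.0%} complete")                 # 37% complete
  print(f"~{remaining.days} days remaining at this rate")    # ~8-9 days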

Gehel triaged this task as High priority. Nov 3 2023, 10:28 AM

@bking thanks for triggering the import, could you update the task description with the dump files you used? (needed because we have to explicitly keep the corresponding partition in hdfs).

Updated description as requested.

Another progress report: We are 80% (869/1104) done on the leading host (wdqs1022).

Mentioned in SAL (#wikimedia-operations) [2023-11-13T20:51:15Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on wdqs1022.eqiad.wmnet with reason: T347504

Mentioned in SAL (#wikimedia-operations) [2023-11-13T20:51:29Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on wdqs1022.eqiad.wmnet with reason: T347504

Mentioned in SAL (#wikimedia-operations) [2023-11-13T20:52:02Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on wdqs1024.eqiad.wmnet with reason: T347504

Mentioned in SAL (#wikimedia-operations) [2023-11-13T20:52:06Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on wdqs1024.eqiad.wmnet with reason: T347504

Mentioned in SAL (#wikimedia-operations) [2023-11-13T20:52:30Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on wdqs1023.eqiad.wmnet with reason: T347504

Mentioned in SAL (#wikimedia-operations) [2023-11-13T20:52:54Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on wdqs1023.eqiad.wmnet with reason: T347504

Update: The Wikidata dump finished loading on wdqs1022 ("Wikidata dump loaded in 25 days, 13:32:17.263762"). It's processing lexemes now...

Looks like the data reload for lexemes completed. @dcausse, are you able to check the data from the reload and make sure it's usable? Let me know if I can help.
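As a possible starting point for that check, a minimal sanity query against the host's local Blazegraph endpoint. The endpoint URL is an assumption based on the standard WDQS setup, and the real validation criteria are of course up to @dcausse:

  #!/usr/bin/env python3
  """Quick sanity check against the freshly loaded Blazegraph instance.

  Run on the reloaded host itself; the endpoint URL is an assumption based on
  the standard WDQS Blazegraph setup.
  """
  import json
  import urllib.parse
  import urllib.request

  ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"  # assumed local endpoint
  # A well-known entity (Douglas Adams, Q42) should have triples in a usable load.
  QUERY = "SELECT ?p ?o WHERE { <http://www.wikidata.org/entity/Q42> ?p ?o } LIMIT 5"

  url = ENDPOINT + "?" + urllib.parse.urlencode({"query": QUERY, "format": "json"})
  with urllib.request.urlopen(url, timeout=60) as resp:
      bindings = json.load(resp)["results"]["bindings"]

  # If the endpoint answers and the entity is present, the load is at least readable.
  print(f"got {len(bindings)} bindings for Q42")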

Moving to "blocked/waiting" until we have confirmation on the reload data.