Paste P64016

Testing wdqs.data-reload with HDFS
ActivePublic

Authored by dcausse on Jun 4 2024, 3:24 PM.
# The cookbook requires a new option to loadData.sh, and thus https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/1038848 must be deployed on all wdqs nodes (or at least on the node used for the test).
# The code of the cookbook is at https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1031933
# The cookbook options should be:
cookbook sre.wdqs.data-reload \
--task-id T349069 \
--reason "Test wdqs reload based on HDFS" \
--reload-data wikidata_full \
--from-hdfs hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ \
--stat-host stat1009.eqiad.wmnet \
wdqs_host

Event Timeline

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 250, in _run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/wdqs/data-reload.py", line 268, in run
    self.preparation_step.run()
  File "/srv/deployment/spicerack/cookbooks/sre/wdqs/data-reload.py", line 458, in run
    self._extract_from_hdfs(tmpdir)
  File "/srv/deployment/spicerack/cookbooks/sre/wdqs/data-reload.py", line 415, in _extract_from_hdfs
    size = self._get_dump_size_from_hdfs()
  File "/srv/deployment/spicerack/cookbooks/sre/wdqs/data-reload.py", line 408, in _get_dump_size_from_hdfs
    return int(re.sub(r"^(\d+)\s+.*$", next(lines), r"\1"))
ValueError: invalid literal for int() with base 10: '\\1'
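
The failure comes from the `hdfs dfs -du -s` output parsing: re.sub takes (pattern, repl, string), but the replacement r"\1" and the input line were passed in each other's places, so re.sub left the string '\1' untouched and int() choked on it. A minimal sketch of the corrected call (signature simplified from the cookbook's method; only the re.sub fix matters here):

import re

def _get_dump_size_from_hdfs(lines):
    # The first line of `hdfs dfs -du -s <path>` output looks like
    # "289435916  hdfs:///wmf/..."; keep only the leading byte count.
    # Note the argument order: re.sub(pattern, repl, string). The broken
    # call had repl and string swapped, handing int() the literal '\1'.
    return int(re.sub(r"^(\d+)\s+.*$", r"\1", next(lines)))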

@RKemper for testing I created a smaller folder at hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/; it has only two chunks, so it should help iterate a bit faster on this. The command should become:

cookbook sre.wdqs.data-reload \
 --task-id T349069 \
 --reason "Test wdqs reload based on HDFS" \
 --reload-data wikidata_full \
 --from-hdfs hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ \
 --stat-host stat1009.eqiad.wmnet \
 wdqs2023.codfw.wmnet

I checked out your changes in my home directory on cumin2002 so that I could test them.

brouberol@cumin2002:~$ test-cookbook --change 1038904
INFO:__main__:Change exists in project operations/cookbooks with latest patch set being 24
INFO:__main__:Setting up Cookbooks change 1038904 patch set 24 for testing
INFO:__main__:Checkout of change 1038904 not found, cloning the repo
INFO:__main__:Executing command /usr/bin/git clone --depth 10 https://gerrit.wikimedia.org/r/operations/cookbooks /home/brouberol/cookbooks_testing/cookbooks-1038904
Cloning into '/home/brouberol/cookbooks_testing/cookbooks-1038904'...
remote: Counting objects: 296, done
remote: Finding sources: 100% (296/296)
remote: Getting sizes: 100% (257/257)
remote: Compressing objects: 100% (642787/642787)
remote: Total 296 (delta 54), reused 147 (delta 29)
Receiving objects: 100% (296/296), 310.23 KiB | 2.75 MiB/s, done.
Resolving deltas: 100% (54/54), done.
INFO:__main__:Executing command /usr/bin/git -C /home/brouberol/cookbooks_testing/cookbooks-1038904 status --porcelain
INFO:__main__:No local modification found, fetching change from Gerrit
INFO:__main__:Executing command /usr/bin/git -C /home/brouberol/cookbooks_testing/cookbooks-1038904 fetch https://gerrit.wikimedia.org/r/operations/cookbooks refs/changes/04/1038904/24
remote: Counting objects: 8552, done
remote: Finding sources: 100% (8552/8552)
remote: Getting sizes: 100% (1572/1572)
remote: Compressing objects: 100% (29460/29460)
remote: Total 8552 (delta 5684), reused 8537 (delta 5679)
Receiving objects: 100% (8552/8552), 1.90 MiB | 9.47 MiB/s, done.
Resolving deltas: 100% (5684/5684), done.
From https://gerrit.wikimedia.org/r/operations/cookbooks
 * branch            refs/changes/04/1038904/24 -> FETCH_HEAD
INFO:__main__:Executing command /usr/bin/git -C /home/brouberol/cookbooks_testing/cookbooks-1038904 rev-parse --verify change-1038904-24
fatal: Needed a single revision
INFO:__main__:Checking out the patch set into branch change-1038904-24
INFO:__main__:Executing command /usr/bin/git -C /home/brouberol/cookbooks_testing/cookbooks-1038904 checkout -b change-1038904-24 FETCH_HEAD
Switched to a new branch 'change-1038904-24'
INFO:__main__:==================================================
INFO:__main__:Executing: sudo cookbook -c /home/brouberol/cookbooks_testing/config.yaml
INFO:__main__:==================================================
#--- cookbooks args=[] ---#
[0/137] sre: SRE Cookbooks
q - Quit
h - Help
>>> q
brouberol@cumin2002:~$ sudo cookbook -c /home/brouberol/cookbooks_testing/config.yaml  sre.wdqs.data-reload  --task-id T349069  --reason "Test wdqs reload based on HDFS"  --reload-data wikidata_full  --from-hdfs hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/  --stat-host stat1009.eqiad.wmnet  wdqs2023.codfw.wmnet
Acquired lock for key /spicerack/locks/cookbooks/sre.wdqs.data-reload:wdqs2023.codfw.wmnet: {'concurrency': 1, 'created': '2024-06-12 08:14:58.809865', 'owner': 'brouberol@cumin2002 [3637379]', 'ttl': 2419200}
START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)
Creating stat1009.eqiad.wmnet:/srv/analytics-search/wdqs_reload_temp_folder and setting analytics-search as owner
----- OUTPUT of 'mkdir -p /srv/an.../dumps_from_hdfs' -----
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.39hosts/s]
FAIL |                                                                                                                                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'mkdir -p /srv/an.../dumps_from_hdfs'.
----- OUTPUT of 'chown -R analyti...load_temp_folder' -----
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.37hosts/s]
FAIL |                                                                                                                                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'chown -R analyti...load_temp_folder'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Extracting dumps from hdfs hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ to stat1009.eqiad.wmnet:/srv/analytics-search/wdqs_reload_temp_folder/reload.3637379.1718180098/dumps_from_hdfs
----- OUTPUT of 'sudo -u analytic...k-test-T349069/"' -----
289435916  hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:02<00:00,  2.44s/hosts]
FAIL |                                                                                                                                                                                                        |   0% (0/1) [00:02<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -u analytic...k-test-T349069/"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of 'set -o pipefail;..._hdfs' | tail -1' -----
19123612299264 3971724787712
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.40hosts/s]
FAIL |                                                                                                                                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'set -o pipefail;..._hdfs' | tail -1'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of 'sudo -u analytic...dumps_from_hdfs"' -----
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:07<00:00,  7.46s/hosts]
FAIL |                                                                                                                                                                                                        |   0% (0/1) [00:07<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -u analytic...dumps_from_hdfs"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Cleaning/creating target data wdqs2023.codfw.wmnet:/srv/dump/dumps_from_hdfs
----- OUTPUT of 'rm -rf /srv/dump/dumps_from_hdfs' -----
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  2.98hosts/s]
FAIL |                                                                                                                                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'rm -rf /srv/dump/dumps_from_hdfs'.
----- OUTPUT of 'mkdir -p /srv/dump' -----
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.77hosts/s]
FAIL |                                                                                                                                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'mkdir -p /srv/dump'.
----- OUTPUT of 'test -d /srv/dump' -----
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.83hosts/s]
FAIL |                                                                                                                                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'test -d /srv/dump'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Copying dumps from stat1009.eqiad.wmnet:/srv/analytics-search/wdqs_reload_temp_folder/reload.3637379.1718180098/dumps_from_hdfs to wdqs2023.codfw.wmnet:/srv/dump/dumps_from_hdfs
About to transfer /srv/analytics-search/wdqs_reload_temp_folder/reload.3637379.1718180098/dumps_from_hdfs from stat1009.eqiad.wmnet to ['wdqs2023.codfw.wmnet']:['/srv/dump'] (289440003 bytes)
Cleaning up....
Cleaning up stat1009.eqiad.wmnet:/srv/analytics-search/wdqs_reload_temp_folder/reload.3637379.1718180098/dumps_from_hdfs
----- OUTPUT of 'find /srv/analyt...*.gz' | xargs rm' -----
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.28hosts/s]
FAIL |                                                                                                                                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'find /srv/analyt...*.gz' | xargs rm'.
----- OUTPUT of 'rmdir /srv/analy.../dumps_from_hdfs' -----
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.40hosts/s]
FAIL |                                                                                                                                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'rmdir /srv/analy.../dumps_from_hdfs'.
----- OUTPUT of 'rmdir /srv/analy...37379.1718180098' -----
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.40hosts/s]
FAIL |                                                                                                                                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'rmdir /srv/analy...37379.1718180098'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Exception raised while executing cookbook sre.wdqs.data-reload:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 250, in _run
    raw_ret = runner.run()
  File "/home/brouberol/cookbooks_testing/cookbooks/cookbooks/sre/wdqs/data-reload.py", line 270, in run
    self.preparation_step.run()
  File "/home/brouberol/cookbooks_testing/cookbooks/cookbooks/sre/wdqs/data-reload.py", line 492, in run
    self._transfer_dump(tmpdir)
  File "/home/brouberol/cookbooks_testing/cookbooks/cookbooks/sre/wdqs/data-reload.py", line 467, in _transfer_dump
    ret = transfer.run()
  File "/usr/lib/python3/dist-packages/transferpy/Transferer.py", line 584, in run
    port = firewall_handler.open(self.source_host, self.options['port'])
KeyError: 'port'
Released lock for key /spicerack/locks/cookbooks/sre.wdqs.data-reload:wdqs2023.codfw.wmnet: {'concurrency': 1, 'created': '2024-06-12 08:14:58.809865', 'owner': 'brouberol@cumin2002 [3637379]', 'ttl': 2419200}
END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)
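
The KeyError above is transferpy's Transferer.run() reading self.options['port'] when the cookbook never set it. A minimal sketch of the kind of fix needed on the cookbook side, assuming the cookbook builds the options dict itself (the helper name and the default values below are illustrative assumptions, not transferpy's documented defaults):

# Hypothetical helper: fill in any transferpy option the cookbook did not
# set before calling Transferer.run(). 'port' mirrors the traceback; the
# other keys and their values are assumptions to be checked against
# transferpy's own defaults.
DEFAULT_TRANSFER_OPTIONS = {
    'port': 0,        # assumption: 0 lets transferpy pick a free port
    'compress': True,
    'encrypt': True,
    'checksum': False,
}

def with_transfer_defaults(options):
    # Return a copy of `options` with any missing keys filled in.
    merged = dict(DEFAULT_TRANSFER_OPTIONS)
    merged.update(options)
    return merged
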
[extract_kafka_timestamp_from_sparql] found null
Exception raised while executing cookbook sre.wdqs.data-reload:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 250, in _run
    raw_ret = runner.run()
  File "/home/brouberol/cookbooks_testing/cookbooks/cookbooks/sre/wdqs/data-reload.py", line 274, in run
    self._reload_wikibase()
  File "/home/brouberol/cookbooks_testing/cookbooks/cookbooks/sre/wdqs/data-reload.py", line 317, in _reload_wikibase
    self.postload_step.run()
  File "/home/brouberol/cookbooks_testing/cookbooks/cookbooks/sre/wdqs/data-reload.py", line 344, in run
    timestamp = self._extract_kafka_timestamp_from_sparql()
  File "/home/brouberol/cookbooks_testing/cookbooks/cookbooks/sre/wdqs/data-reload.py", line 366, in _extract_kafka_timestamp_from_sparql
    return parse_iso_dt(timestamp)
  File "/home/brouberol/cookbooks_testing/cookbooks/cookbooks/sre/wdqs/data-reload.py", line 612, in parse_iso_dt
    dt = datetime.fromisoformat(timestamp)
ValueError: Invalid isoformat string: 'null'
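
This last failure is parse_iso_dt() being handed the literal string 'null' when the SPARQL query finds no kafka timestamp (see the "[extract_kafka_timestamp_from_sparql] found null" line above). A sketch of a guard, assuming that returning None when no timestamp has been recorded yet is an acceptable behavior for the post-load step:

from datetime import datetime
from typing import Optional

def parse_iso_dt(timestamp: str) -> Optional[datetime]:
    # Assumption: the SPARQL query yields the literal "null" (or an empty
    # string) when the triple store holds no kafka timestamp; treat that
    # as "no timestamp" instead of letting fromisoformat() raise.
    if not timestamp or timestamp == 'null':
        return None
    return datetime.fromisoformat(timestamp)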

@RKemper I think we should now do a full import to measure the time it takes, so we have a rough estimate to answer T367409.
To do a full run we need to re-enable the updater on wdqs2023 (which I think will be done with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1042965).
The command to run should be (using the latest dumps):

cookbook sre.wdqs.data-reload \
 --task-id T349069 \
 --reason "Test wdqs reload based on HDFS" \
 --reload-data wikidata_full \
 --from-hdfs hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603/ \
 --stat-host stat1009.eqiad.wmnet \
 wdqs2023.codfw.wmnet

We really want to use wdqs2023 because it is currently the only machine where I have deployed a quick backport fixing a problem in the loadData.sh script (https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/1042254).