Create separate pybal pools for wdqs graph split (main vs scholarly)
Closed, Resolved (Public)

Assigned To: Stevemunene

Authored By: RKemper, May 7 2024, 6:43 AM

Description

Context

Currently we have one pybal pool per DC for public wdqs (https://config-master.wikimedia.org/pybal/eqiad/wdqs), corresponding to query.wikidata.org, and a separate one for wdqs-internal.

We'll ultimately want to split the public wdqs into two pybal pools: wdqs-main and wdqs-scholarly. Among other things, the separate pools will allow us to shift hosts over from one type of graph split host to the other in response to evolving usage.

AC

Pools exist for wdqs-main and wdqs-scholarly instead of there just being a single monolithic public wdqs
Tacked on: Decide what to do about wdqs-internal, i.e. can wdqs-internal hosts just contain the journal for the wdqs-main graph but not wdqs-scholarly?

Details

Other Assignee: RKemper

Subject	Repo	Branch	Lines +/-
wdqs-main, wdqs-scholarly: use HTTPS for health check	operations/puppet	production	+2 -2
wdqs-main, wdqs-scholarly: use TLS for pybal pools	operations/puppet	production	+4 -4
wdqs: new -main, -scholarly services	operations/puppet	production	+2 -0
wdqs: move -main and -scholarly to production	operations/puppet	production	+2 -2
wdqs: Prepare to configure the load balancers	operations/puppet	production	+2 -2
wdqs: -main and -scholarly are different services	operations/puppet	production	+8 -8
wdqs: create wdqs split pybal pools	operations/puppet	production	+2 -0
wdqs: add wdqs2024 to scholarly pool	operations/puppet	production	+1 -0
wdqs: add graph split hosts to conftool_data	operations/puppet	production	+11 -0
wdqs graph-split: temp remove main/scholarly pools	operations/puppet	production	+0 -12
wdqs: update scap wdqs hostlist	operations/puppet	production	+4 -1
wdqs: add main and scholarly role assignments	operations/puppet	production	+33 -38
[WIP] wdqs: create wdqs split pybal pools	operations/puppet	production	+6 -0
wdqs graph split: fix tab alignment	operations/puppet	production	+2 -2
wdqs: add main and scholarly puppet config	operations/puppet	production	+299 -10

Related Objects

Status	Assigned	Task
Open	None	T335067 Epic: Wikidata Query Service stabilization
Open	None	T337013 [Epic] Splitting the graph in WDQS
Open	None	T364363 [Epic] Productionize federated wdqs graph-split endpoints
Resolved	Stevemunene	T364368 Create separate pybal pools for wdqs graph split (main vs scholarly)
Resolved	Stevemunene	T372919 Bring wqds2024 back into service
Open	RKemper	T373145 Create new service catalog entries for wdqs-main and wdqs-scholarly

Event Timeline


Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.06.17 - 2024.07.07) board. Jun 18 2024, 9:01 AM

Gehel edited projects, added Data-Platform-SRE (2024.07.08 - 2024.07.28); removed Data-Platform-SRE (2024.06.17 - 2024.07.07). Jul 8 2024, 6:33 PM

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.07.08 - 2024.07.28) board.

@Stevemunene and I paired on deciding some initial host/pybal allocations. The following numbers are assuming we need 2 hosts per pool to keep pybal happy, but can be adjusted if we're fine starting with just 1 host for scholarly.

EQIAD
    main
        1021 (current wdqs-public)
        1022 (current test host)
    scholarly
        1023 (current test host)
        1024 (current test host)

CODFW
    main
        2021 (current wdqs-public)
        2022 (current wdqs-public)
    scholarly
        2023 (current test host)
        2024 (current wdqs-public)
    test
        2025 (current wdqs-public)

That would leave us with the following numbers for public wdqs:

eqiad-public: 7 hosts
codfw-public: 11 hosts
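
The proposed allocation can be written down as data and checked against the two-hosts-per-pool assumption. A minimal Python sketch, not how pybal or conftool store pools: `ALLOCATION`, `MIN_HOSTS_PER_POOL`, `LOAD_BALANCED_POOLS`, and `undersized_pools` are illustrative names, and treating `test` as exempt from the minimum is an assumption.

```python
# Proposed allocation from the comment above (numbers are wdqs host suffixes).
ALLOCATION = {
    "eqiad": {
        "main": [1021, 1022],
        "scholarly": [1023, 1024],
    },
    "codfw": {
        "main": [2021, 2022],
        "scholarly": [2023, 2024],
        "test": [2025],
    },
}

MIN_HOSTS_PER_POOL = 2  # assumption stated in the comment above
LOAD_BALANCED_POOLS = ("main", "scholarly")  # assumption: "test" is not a pybal pool


def undersized_pools(allocation, minimum=MIN_HOSTS_PER_POOL):
    """Return (dc, pool) pairs that have fewer than `minimum` hosts."""
    return [
        (dc, pool)
        for dc, pools in allocation.items()
        for pool in LOAD_BALANCED_POOLS
        if len(pools.get(pool, [])) < minimum
    ]


print(undersized_pools(ALLOCATION))  # prints []
```

Lowering `minimum` to 1 models the "just 1 host for scholarly" alternative mentioned above.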

Change #1054342 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] wdqs: add main and scholarly role assignments

https://gerrit.wikimedia.org/r/1054342

Change #1054520 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] [WIP] wdqs: create wdqs split pybal pools

https://gerrit.wikimedia.org/r/1054520

Stevemunene updated Other Assignee, added: RKemper. Jul 22 2024, 6:18 AM

Change #1046123 merged by Ryan Kemper:

[operations/puppet@production] wdqs: add main and scholarly puppet config

https://gerrit.wikimedia.org/r/1046123

Change #1056230 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs graph split: fix tab alignment

https://gerrit.wikimedia.org/r/1056230

Change #1056230 merged by Ryan Kemper:

[operations/puppet@production] wdqs graph split: fix tab alignment

https://gerrit.wikimedia.org/r/1056230

Change #1046120 abandoned by Stevemunene:

[operations/puppet@production] [WIP] wdqs: create wdqs split pybal pools

Reason:

Duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1054520

https://gerrit.wikimedia.org/r/1046120

bking edited projects, added Data-Platform-SRE (2024.07.29 - 2024.08.16); removed Data-Platform-SRE (2024.07.08 - 2024.07.28). Jul 31 2024, 3:02 PM

bking moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.07.29 - 2024.08.16) board.

Depooled the relevant hosts that will no longer be in wdqs-public:

sudo -E cumin 'wdqs1021*,wdqs2021*,wdqs2022*,wdqs2024*,wdqs2025*' 'depool'
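
The host list in that command matches the hosts marked "current wdqs-public" in the allocation comment earlier in this task. A small Python sketch of deriving the cumin host query from that data; `PREVIOUS_ROLE` and `depool_query` are hypothetical names used only for illustration, and nothing here talks to cumin or conftool.

```python
# Hosts moving to the new pools, tagged with their role before the move
# (copied from the allocation comment earlier in this task).
PREVIOUS_ROLE = {
    "wdqs1021": "wdqs-public",
    "wdqs1022": "test",
    "wdqs1023": "test",
    "wdqs1024": "test",
    "wdqs2021": "wdqs-public",
    "wdqs2022": "wdqs-public",
    "wdqs2023": "test",
    "wdqs2024": "wdqs-public",
    "wdqs2025": "wdqs-public",
}


def depool_query(previous_role):
    """Build a comma-separated host glob list for hosts currently in wdqs-public."""
    hosts = sorted(h for h, role in previous_role.items() if role == "wdqs-public")
    return ",".join(f"{h}*" for h in hosts)


print(depool_query(PREVIOUS_ROLE))
# prints wdqs1021*,wdqs2021*,wdqs2022*,wdqs2024*,wdqs2025*
```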

Change #1054342 merged by Ryan Kemper:

[operations/puppet@production] wdqs: add main and scholarly role assignments

https://gerrit.wikimedia.org/r/1054342

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs1021.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2024-08-03T00:54:18Z] <ryankemper@cumin2002> START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on wdqs[2021-2022,2024-2025].codfw.wmnet with reason: T364368 rejiggering hosts

Mentioned in SAL (#wikimedia-operations) [2024-08-03T00:54:38Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on wdqs[2021-2022,2024-2025].codfw.wmnet with reason: T364368 rejiggering hosts

Mentioned in SAL (#wikimedia-operations) [2024-08-03T01:15:18Z] <ryankemper@cumin2002> START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on 9 hosts with reason: T364368 rejiggering hosts

Mentioned in SAL (#wikimedia-operations) [2024-08-03T01:15:33Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on 9 hosts with reason: T364368 rejiggering hosts

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs1023.eqiad.wmnet with OS bullseye

Change #1059441 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs graph-split: temp remove main/scholarly pools

https://gerrit.wikimedia.org/r/1059441

Change #1059441 merged by Ryan Kemper:

[operations/puppet@production] wdqs graph-split: temp remove main/scholarly pools

https://gerrit.wikimedia.org/r/1059441

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs1023.eqiad.wmnet with OS bullseye executed with errors:

wdqs1023 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs1023.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs1021.eqiad.wmnet with OS bullseye executed with errors:

wdqs1021 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408030114_ryankemper_950847_wdqs1021.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs1021.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs1022.eqiad.wmnet with OS bullseye executed with errors:

wdqs1022 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408030153_ryankemper_994613_wdqs1022.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs1022.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs1023.eqiad.wmnet with OS bullseye executed with errors:

wdqs1023 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408051728_ryankemper_680045_wdqs1023.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs1023.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs1023.eqiad.wmnet with OS bullseye executed with errors:

wdqs1023 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs1023.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs1024.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs2021.codfw.wmnet with OS bullseye

Change #1060902 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: update scap wdqs hostlist

https://gerrit.wikimedia.org/r/1060902

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs1024.eqiad.wmnet with OS bullseye executed with errors:

wdqs1024 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs1024.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs2021.codfw.wmnet with OS bullseye executed with errors:

wdqs2021 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs2021.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Change #1060902 merged by Ryan Kemper:

[operations/puppet@production] wdqs: update scap wdqs hostlist

https://gerrit.wikimedia.org/r/1060902

Mentioned in SAL (#wikimedia-operations) [2024-08-09T04:40:09Z] <ryankemper@cumin2002> START - Cookbook sre.hosts.downtime for 15:00:00 on 9 hosts with reason: T364368 non-prod hosts

Mentioned in SAL (#wikimedia-operations) [2024-08-09T04:40:35Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 15:00:00 on 9 hosts with reason: T364368 non-prod hosts

Mentioned in SAL (#wikimedia-operations) [2024-08-15T07:31:01Z] <ryankemper@cumin2002> START - Cookbook sre.hosts.downtime for 3 days, 10:00:00 on 9 hosts with reason: T364368 non-prod hosts

Mentioned in SAL (#wikimedia-operations) [2024-08-15T07:31:16Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 10:00:00 on 9 hosts with reason: T364368 non-prod hosts

Gehel edited projects, added Data-Platform-SRE (2024.08.17 - 2024.09.06); removed Data-Platform-SRE (2024.07.29 - 2024.08.16). Fri, Aug 16, 9:43 AM

Gehel moved this task from Backlog - project to In Progress on the Data-Platform-SRE (2024.08.17 - 2024.09.06) board.

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs2022.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs2023.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs2024.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs2025.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs2024.codfw.wmnet with OS bullseye executed with errors:

wdqs2024 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs2024.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs2022.codfw.wmnet with OS bullseye executed with errors:

wdqs2022 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs2022.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs2025.codfw.wmnet with OS bullseye executed with errors:

wdqs2025 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs2025.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs2023.codfw.wmnet with OS bullseye executed with errors:

wdqs2023 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs2023.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host wdqs2024.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host wdqs2024.codfw.wmnet with OS bullseye executed with errors:

wdqs2024 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wdqs2024.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Mentioned in SAL (#wikimedia-operations) [2024-08-20T06:43:40Z] <ryankemper@cumin2002> START - Cookbook sre.hosts.downtime for 18:00:00 on wdqs[2021-2023,2025].codfw.wmnet with reason: T364368 non-prod hosts

Mentioned in SAL (#wikimedia-operations) [2024-08-20T06:43:43Z] <ryankemper@cumin2002> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 18:00:00 on wdqs[2021-2023,2025].codfw.wmnet with reason: T364368 non-prod hosts

bking closed subtask T372919: Bring wqds2024 back into service as Resolved. Tue, Aug 20, 10:11 PM

In T364368#9971052, @RKemper wrote:
@Stevemunene and I paired on deciding some initial host/pybal allocations. The following numbers are assuming we need 2 hosts per pool to keep pybal happy, but can be adjusted if we're fine starting with just 1 host for scholarly.
EQIAD
    main
        1021 (current wdqs-public)
        1022 (current test host)
    scholarly
        1023 (current test host)
        1024 (current test host)

CODFW
    main
        2021 (current wdqs-public)
        2022 (current wdqs-public)
    scholarly
        2023 (current test host)
        2024 (current wdqs-public)
    test
        2025 (current wdqs-public)
That would leave us with the following numbers for public wdqs:

eqiad-public: 7 hosts
codfw-public: 11 hosts

Considering T371833, should we remove wdqs2025 as a test host and reassign it, or do we plan on retaining one test instance/endpoint, say query-full-experimental?

Change #1064473 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: add graph split hosts to conftool_data

https://gerrit.wikimedia.org/r/1064473

Change #1064473 merged by Ryan Kemper:

[operations/puppet@production] wdqs: add graph split hosts to conftool_data

https://gerrit.wikimedia.org/r/1064473

Change #1064479 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: new -main, -scholarly services

https://gerrit.wikimedia.org/r/1064479

Change #1064829 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: add wdqs2024 to scholarly pool

https://gerrit.wikimedia.org/r/1064829

Mentioned in SAL (#wikimedia-operations) [2024-08-22T19:01:50Z] <ryankemper> T364368 Pooled all wdqs main/scholarly hosts except wdqs2024, which won't be ready for another hour

Change #1064829 merged by Ryan Kemper:

[operations/puppet@production] wdqs: add wdqs2024 to scholarly pool

https://gerrit.wikimedia.org/r/1064829

Mentioned in SAL (#wikimedia-operations) [2024-08-22T19:31:13Z] <ryankemper> T364368 Pooled wdqs2024 (its data transfer has completed successfully)

Change #1064840 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: -main and -scholarly are different services

https://gerrit.wikimedia.org/r/1064840

Change #1064843 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: Prepare to configure the load balancers

https://gerrit.wikimedia.org/r/1064843

Change #1064848 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: move -main and -scholarly to production

https://gerrit.wikimedia.org/r/1064848

dr0ptp4kt mentioned this in T371833: Tear down wdqs graph split experimental endpoints. Fri, Aug 23, 10:52 AM

Change #1054520 abandoned by Ryan Kemper:

[operations/puppet@production] wdqs: create wdqs split pybal pools

Reason:

duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1064479

https://gerrit.wikimedia.org/r/1054520

Change #1064840 merged by Ryan Kemper:

[operations/puppet@production] wdqs: -main and -scholarly are different services

https://gerrit.wikimedia.org/r/1064840

Change #1064843 merged by Ryan Kemper:

[operations/puppet@production] wdqs: Prepare to configure the load balancers

https://gerrit.wikimedia.org/r/1064843

Mentioned in SAL (#wikimedia-operations) [2024-08-26T19:23:43Z] <ryankemper> T364368 [codfw] sudo ipvsadm -L -n on lvs secondary looks good, proceeding

Mentioned in SAL (#wikimedia-operations) [2024-08-26T19:24:05Z] <ryankemper> T364368 [codfw] Restarted lvs primary: sudo cumin 'A:lvs-low-traffic-codfw' 'systemctl restart pybal.service'

Mentioned in SAL (#wikimedia-operations) [2024-08-26T19:25:24Z] <ryankemper> T364368 [codfw] sudo ipvsadm -L -n on lvs primary looks good, all done with lvs restarts

LVS restarts have completed successfully, following the steps below:

sudo cumin 'A:lvs and (A:eqiad or A:codfw)' 'disable-puppet "adding new services wdqs-main & wdqs-scholarly"'
!log T364368 Disabled puppet on all lvs hosts in preparation for rolling restart
(merge patch)

[EQIAD]

sudo cumin 'A:lvs and A:eqiad' 'run-puppet-agent --enable "adding new services wdqs-main & wdqs-scholarly"'
!log T364368 [eqiad] enabled puppet on eqiad lvs hosts, expecting alerts soon

ack alerts

sudo cumin 'A:lvs-secondary-eqiad' 'systemctl restart pybal.service'
!log T364368 [eqiad] Restarted lvs secondary: `sudo cumin 'A:lvs-secondary-eqiad' 'systemctl restart pybal.service'`


sudo cumin 'A:lvs-secondary-eqiad' 'ipvsadm -L -n'
# wait 120s while looking at https://icinga.wikimedia.org/alerts
!log T364368 [eqiad] `sudo ipvsadm -L -n` on lvs secondary looks good, proceeding


sudo cumin 'A:lvs-low-traffic-eqiad' 'systemctl restart pybal.service'
!log T364368 [eqiad] Restarted lvs primary: `sudo cumin 'A:lvs-low-traffic-eqiad' 'systemctl restart pybal.service'`


sudo cumin 'A:lvs-low-traffic-eqiad' 'ipvsadm -L -n'
# wait 120s while looking at https://icinga.wikimedia.org/alerts
!log T364368 [eqiad] `sudo ipvsadm -L -n` on lvs primary looks good, proceeding

curl -v -k http://wdqs-main.svc.eqiad.wmnet:80/
curl -v -k http://wdqs-scholarly.svc.eqiad.wmnet:80/

[CODFW]

sudo cumin 'A:lvs and A:codfw' 'run-puppet-agent --enable "adding new services wdqs-main & wdqs-scholarly"'
!log T364368 [codfw] ran puppet on codfw lvs hosts, expecting alerts soon

ack alerts

sudo cumin 'A:lvs-secondary-codfw' 'systemctl restart pybal.service'
!log T364368 [codfw] Restarted lvs secondary: `sudo cumin 'A:lvs-secondary-codfw' 'systemctl restart pybal.service'`


sudo cumin 'A:lvs-secondary-codfw' 'ipvsadm -L -n'
# wait 120s while looking at https://icinga.wikimedia.org/alerts
!log T364368 [codfw] `sudo ipvsadm -L -n` on lvs secondary looks good, proceeding


sudo cumin 'A:lvs-low-traffic-codfw' 'systemctl restart pybal.service'
!log T364368 [codfw] Restarted lvs primary: `sudo cumin 'A:lvs-low-traffic-codfw' 'systemctl restart pybal.service'`

sudo cumin 'A:lvs-low-traffic-codfw' 'ipvsadm -L -n'
# wait 120s while looking at https://icinga.wikimedia.org/alerts
!log T364368 [codfw] `sudo ipvsadm -L -n` on lvs primary looks good, all done with lvs restarts

curl -v -k http://wdqs-main.svc.codfw.wmnet:80/
curl -v -k http://wdqs-scholarly.svc.codfw.wmnet:80/
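
The per-datacenter part of the procedure above follows one pattern: enable puppet, restart and check the secondary, restart and check the primary, then probe both service endpoints. A minimal Python sketch that only generates the command strings in that order and executes nothing; `lvs_rollout_steps` is a hypothetical helper, and the stated reason for doing the secondary first is my reading of the procedure rather than something the steps say.

```python
def lvs_rollout_steps(dc, reason="adding new services wdqs-main & wdqs-scholarly"):
    """Return the ordered shell commands for one datacenter (nothing is executed)."""
    steps = [f"sudo cumin 'A:lvs and A:{dc}' 'run-puppet-agent --enable \"{reason}\"'"]
    # Secondary before primary (A:lvs-low-traffic-*), presumably so that a bad
    # config shows up before the host carrying live traffic is restarted.
    for alias in (f"A:lvs-secondary-{dc}", f"A:lvs-low-traffic-{dc}"):
        steps.append(f"sudo cumin '{alias}' 'systemctl restart pybal.service'")
        steps.append(f"sudo cumin '{alias}' 'ipvsadm -L -n'")
    # Final smoke test against both new service endpoints.
    for service in ("wdqs-main", "wdqs-scholarly"):
        steps.append(f"curl -v -k http://{service}.svc.{dc}.wmnet:80/")
    return steps


for dc in ("eqiad", "codfw"):
    print(f"[{dc.upper()}]")
    for step in lvs_rollout_steps(dc):
        print(step)
```

The manual waits ("wait 120s while looking at alerts"), the alert acknowledgement, and the `!log` entries are deliberately left out, since they need a human in the loop.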

Change #1064848 merged by Ryan Kemper:

[operations/puppet@production] wdqs: move -main and -scholarly to production

https://gerrit.wikimedia.org/r/1064848

Mentioned in SAL (#wikimedia-operations) [2024-08-26T19:42:27Z] <ryankemper> T364368 [codfw] sudo ipvsadm -L -n on lvs primary looks good, all done with lvs restarts

Mentioned in SAL (#wikimedia-operations) [2024-08-26T19:43:15Z] <ryankemper> T364368 Merged patch to move lvs state to production for wdqs-main and wdqs-scholarly (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1064848) and ran puppet on all LVS hosts

In T364368#10093919, @Stashbot wrote:

Mentioned in SAL (#wikimedia-operations) [2024-08-26T19:42:27Z] <ryankemper> T364368 [codfw] sudo ipvsadm -L -n on lvs primary looks good, all done with lvs restarts

Copy-paste error: this log message can be ignored, as this step was already performed earlier.

Mentioned in SAL (#wikimedia-operations) [2024-08-26T19:45:09Z] <ryankemper> T364368 Merged patch to add dns discovery resources for wdqs-main and wdqs-scholarly (https://gerrit.wikimedia.org/r/c/operations/dns/+/1064831), and ran puppet on all DNS hosts

Mentioned in SAL (#wikimedia-operations) [2024-08-26T19:48:18Z] <ryankemper> T364368 Manually adding dns discovery resources to etcd corresponding to https://wikitech.wikimedia.org/wiki/LVS#Add_the_DNS_Discovery_Record

Change #1064479 merged by Ryan Kemper:

[operations/puppet@production] wdqs: new -main, -scholarly services

https://gerrit.wikimedia.org/r/1064479

Change #1067383 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs-main, wdqs-scholarly: use TLS for pybal pools

https://gerrit.wikimedia.org/r/1067383

Change #1067383 merged by Bking:

[operations/puppet@production] wdqs-main, wdqs-scholarly: use TLS for pybal pools

https://gerrit.wikimedia.org/r/1067383

Mentioned in SAL (#wikimedia-operations) [2024-08-27T17:08:46Z] <ryankemper> T364368 Disabled puppet on all lvs hosts in preparation for rolling restart

Mentioned in SAL (#wikimedia-operations) [2024-08-27T17:13:50Z] <ryankemper> T364368 Ran puppet on A:lvs-secondary-eqiad and restarted pybal.service

Change #1067388 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs-main, wdqs-scholarly: use HTTPS for health check

https://gerrit.wikimedia.org/r/1067388

Change #1067388 merged by Bking:

[operations/puppet@production] wdqs-main, wdqs-scholarly: use HTTPS for health check

https://gerrit.wikimedia.org/r/1067388

Mentioned in SAL (#wikimedia-operations) [2024-08-27T17:24:54Z] <ryankemper> T364368 ryankemper@cumin2002:~$ sudo cumin 'A:lvs-secondary-eqiad' 'systemctl status pybal.service'

Mentioned in SAL (#wikimedia-operations) [2024-08-27T17:37:18Z] <ryankemper> T364368 Ran puppet on A:lvs-low-traffic-eqiad and restarted pybal.service

Mentioned in SAL (#wikimedia-operations) [2024-08-27T17:40:26Z] <ryankemper> T364368 Cleared away old ipvs entries for 10.2.2.33:80 and 10.2.2.36:80

Mentioned in SAL (#wikimedia-operations) [2024-08-27T17:47:45Z] <ryankemper> T364368 Ran puppet on A:lvs-secondary-codfw, restarted pybal.service, and cleared away old ipvs entries for 10.2.1.33:80 and 10.2.1.36:80

Mentioned in SAL (#wikimedia-operations) [2024-08-27T17:50:55Z] <ryankemper> T364368 Ran puppet on A:lvs-low-traffic-codfw, restarted pybal.service, and cleared away old ipvs entries for 10.2.1.33:80 and 10.2.1.36:80
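
The old entries mentioned in these log lines are port-80 virtual services still present in the IPVS table, presumably left over after the pools moved to TLS. One way to spot such entries is to scan `ipvsadm -L -n` output for virtual-service lines. A minimal Python sketch under stated assumptions: `SAMPLE` is hand-written illustrative output (the scheduler, the port-443 entry, and the 192.0.2.x real-server addresses are placeholders, not real data), and `virtual_services` is a hypothetical helper. It does not reproduce whatever commands were actually used to clear the entries.

```python
# Illustrative output only; not captured from a real LVS host.
SAMPLE = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.2.2.33:80 wrr
  -> 192.0.2.10:80                Route   10     0          0
TCP  10.2.2.33:443 wrr
  -> 192.0.2.10:443               Route   10     0          0
TCP  10.2.2.36:80 wrr
"""


def virtual_services(ipvsadm_output):
    """Return (address, port) for each TCP virtual-service line."""
    services = []
    for line in ipvsadm_output.splitlines():
        fields = line.split()
        # Virtual services start with the protocol; real servers start with "->".
        if len(fields) >= 2 and fields[0] == "TCP":
            address, _, port = fields[1].rpartition(":")
            services.append((address, int(port)))
    return services


stale = [(addr, port) for addr, port in virtual_services(SAMPLE) if port == 80]
print(stale)  # prints [('10.2.2.33', 80), ('10.2.2.36', 80)]
```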

Mentioned in SAL (#wikimedia-operations) [2024-08-27T17:54:10Z] <ryankemper> T364368 Our LVS operation is done; I've enabled/ran puppet on the remaining lvs hosts

Stevemunene moved this task from In Progress to Done on the Data-Platform-SRE (2024.08.17 - 2024.09.06) board. Tue, Sep 3, 8:50 AM

Stevemunene closed this task as Resolved. Wed, Sep 4, 9:18 AM

Stevemunene updated the task description.
