
Spin down api_appserver and appserver clusters
Closed, Resolved (Public)

Description

Now that all traffic has been migrated to MW-on-K8s, we can spin down and remove the api_appserver and appserver clusters.

  • Remove mw-on-k8s.lua from traffic path
  • Remove mw-on-k8s.lua logic
  • Remove LVS and discovery services
  • Cleanup/update references in cookbooks
  • Cleanup cumin aliases

Details

Other Assignee
Scott_French
Repo | Branch | Lines +/-
operations/puppet | production | +0 -10
operations/puppet | production | +0 -6
operations/puppet | production | +24 -0
operations/puppet | production | +2 -6
operations/puppet | production | +2 -2
operations/cookbooks | master | +0 -4
operations/cookbooks | master | +59 -20
operations/cookbooks | master | +8 -5
operations/cookbooks | master | +0 -2
operations/puppet | production | +26 -48
operations/puppet | production | +15 -0
operations/puppet | production | +32 -64
operations/puppet | production | +10 -11
operations/puppet | production | +4 -4
operations/puppet | production | +0 -110
operations/puppet | production | +1 -16
operations/puppet | production | +2 -30
operations/puppet | production | +2 -2
operations/dns | master | +0 -14
operations/cookbooks | master | +40 -25
mediawiki/services/wikifeeds | master | +0 -103
operations/dns | master | +2 -1
operations/dns | master | +7 -1
operations/puppet | production | +10 -18
operations/puppet | production | +5 -6
analytics/refinery | master | +1 -2
analytics/refinery | master | +5 -1
analytics/gobblin-wmf | main | +55 -1
mediawiki/services/mobileapps | master | +0 -119
operations/puppet | production | +1 -1
operations/puppet | production | +2 -2
operations/puppet | production | +3 -3
operations/puppet | production | +2 -2
operations/puppet | production | +2 -2
analytics/refinery/source | master | +3 -0
analytics/refinery/source | master | +1 -1
wikimedia-event-utilities | master | +1 -1
operations/puppet | production | +1 -1
operations/puppet | production | +1 -1
operations/deployment-charts | master | +6 -6
operations/deployment-charts | master | +4 -4
operations/deployment-charts | master | +2 -2
operations/deployment-charts | master | +6 -6
operations/puppet | production | +0 -1
operations/deployment-charts | master | +2 -2
operations/software/spicerack | master | +6 -2
mediawiki/services/example-node-api | master | +1 -1
operations/puppet | production | +1 -1
operations/puppet | production | +1 -1
operations/puppet | production | +6 -27
operations/mediawiki-config | master | +0 -4
operations/puppet | production | +0 -406
operations/puppet | production | +2 -1
operations/puppet | production | +11 -1
operations/deployment-charts | master | +1 -1
operations/puppet | production | +5 -17
operations/puppet | production | +8 -20

Event Timeline

Change #1054623 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Update refinery_version for canary_events, test refine, and test refine_sanitize.pp

https://gerrit.wikimedia.org/r/1054623

Mentioned in SAL (#wikimedia-operations) [2024-07-16T17:39:40Z] <swfrench@cumin2002> conftool action : set/pooled=false; selector: dnsdisc=appservers-ro,name=codfw [reason: Depooling ahead of turndown - T367949]

Change #1054623 merged by Ottomata:

[operations/puppet@production] Update refinery_version for canary_events, test refine and refine_sanitize

https://gerrit.wikimedia.org/r/1054623

Mentioned in SAL (#wikimedia-operations) [2024-07-16T17:40:11Z] <swfrench@cumin2002> conftool action : set/pooled=false; selector: dnsdisc=api-ro,name=codfw [reason: Depooling ahead of turndown - T367949]

Mentioned in SAL (#wikimedia-operations) [2024-07-16T17:43:58Z] <swfrench@cumin2002> conftool action : set/pooled=false; selector: dnsdisc=appservers-rw,name=eqiad [reason: Depooling ahead of turndown - T367949]

Mentioned in SAL (#wikimedia-operations) [2024-07-16T17:44:15Z] <swfrench@cumin2002> conftool action : set/pooled=false; selector: dnsdisc=api-rw,name=eqiad [reason: Depooling ahead of turndown - T367949]

Mentioned in SAL (#wikimedia-operations) [2024-07-16T17:46:11Z] <swfrench-wmf> appservers-rw and api-rw now resolve to failoid - T367949

Change #1054629 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Update refinery_version for refine and refine_sanitize

https://gerrit.wikimedia.org/r/1054629

Change #1054629 merged by Ottomata:

[operations/puppet@production] Update refinery_version for refine and refine_sanitize

https://gerrit.wikimedia.org/r/1054629

Change #1054633 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Disable produce_canary_events systemd timer

https://gerrit.wikimedia.org/r/1054633

Change #1054633 merged by Ottomata:

[operations/puppet@production] Disable produce_canary_events systemd timer

https://gerrit.wikimedia.org/r/1054633

Change #1054652 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/gobblin-wmf@main] Bump wikimedia-event-utilities version to 1.3.6

https://gerrit.wikimedia.org/r/1054652

Mentioned in SAL (#wikimedia-operations) [2024-07-16T19:24:42Z] <swfrench-wmf> depooling appservers-ro in eqiad, which is not used by remaining analytics workloads - T367949

Mentioned in SAL (#wikimedia-operations) [2024-07-16T19:25:24Z] <swfrench@cumin2002> conftool action : set/pooled=false; selector: dnsdisc=appservers-ro,name=eqiad [reason: Depooling ahead of turndown - T367949]

Current status:

  • appservers-rw and api-rw are depooled everywhere, and resolve to failoid as of 17:45 UTC
  • api-ro is serving only from eqiad as of 17:40 UTC
  • appservers-ro is depooled everywhere as of 19:25 UTC
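
The status above can be cross-checked against the SAL entries quoted in this task. The following is a minimal sketch, not an existing tool: `replay_pool_state` and `CONFTOOL_RE` are hypothetical names, and the regular expression assumes the conftool log format stays exactly as it appears in the entries above.

```python
import re

# Assumed format, taken from the SAL entries quoted in this task, e.g.
# "conftool action : set/pooled=false; selector: dnsdisc=api-ro,name=codfw"
CONFTOOL_RE = re.compile(
    r"conftool action : set/pooled=(?P<pooled>true|false); "
    r"selector: dnsdisc=(?P<service>[\w-]+),name=(?P<dc>\w+)"
)


def replay_pool_state(sal_lines, initial=None):
    """Apply conftool SAL entries in order; return {(service, dc): pooled}."""
    state = dict(initial or {})
    for line in sal_lines:
        match = CONFTOOL_RE.search(line)
        if match is None:
            continue  # not a pooled/depooled conftool entry
        key = (match.group("service"), match.group("dc"))
        state[key] = match.group("pooled") == "true"
    return state


# The five depool actions logged on 2024-07-16, in chronological order.
sal = [
    "[2024-07-16T17:39:40Z] <swfrench@cumin2002> conftool action : set/pooled=false; selector: dnsdisc=appservers-ro,name=codfw [reason: Depooling ahead of turndown - T367949]",
    "[2024-07-16T17:40:11Z] <swfrench@cumin2002> conftool action : set/pooled=false; selector: dnsdisc=api-ro,name=codfw [reason: Depooling ahead of turndown - T367949]",
    "[2024-07-16T17:43:58Z] <swfrench@cumin2002> conftool action : set/pooled=false; selector: dnsdisc=appservers-rw,name=eqiad [reason: Depooling ahead of turndown - T367949]",
    "[2024-07-16T17:44:15Z] <swfrench@cumin2002> conftool action : set/pooled=false; selector: dnsdisc=api-rw,name=eqiad [reason: Depooling ahead of turndown - T367949]",
    "[2024-07-16T19:25:24Z] <swfrench@cumin2002> conftool action : set/pooled=false; selector: dnsdisc=appservers-ro,name=eqiad [reason: Depooling ahead of turndown - T367949]",
]

state = replay_pool_state(sal)
for (service, dc), pooled in sorted(state.items()):
    print(f"{service:15s} {dc:6s} pooled={pooled}")
```

Keys that do not appear in the result were not touched by these entries. For example, api-ro in eqiad has no entry, which is consistent with it still serving there.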

Interestingly, with both LVS endpoints marked DOWN on appservers-ro, gdnsd falls back to plain geoip behavior following the discovery-map. As a result, appservers-ro behaves as if both DCs were pooled. That makes sense given the design intention per bblack (i.e., always serve something, and assume all-depooled is a mistake).

We may want to revert to having a single DC pooled so all requests go to one place.

Edit: Done (appservers-ro is repooled in eqiad)
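
The fail-open behavior described above can be illustrated with a toy model. This is only a sketch of the rule as observed in this task, not gdnsd's actual implementation: `serving_datacenters` is a hypothetical helper, and it ignores the client-location mapping that the real geoip lookup performs.

```python
def serving_datacenters(pooled_by_dc):
    """Return the DCs a discovery record may resolve to.

    Assumed rule (from the observation above): if at least one DC is
    pooled, only pooled DCs are candidates; if none are pooled, every DC
    is treated as a candidate, as though all were pooled.
    """
    up = sorted(dc for dc, pooled in pooled_by_dc.items() if pooled)
    if up:
        return up
    # Fail open: all-depooled is assumed to be a mistake, so serve from anywhere.
    return sorted(pooled_by_dc)


# Both DCs depooled: traffic is spread as if both were pooled.
all_depooled = serving_datacenters({"eqiad": False, "codfw": False})

# Repooling one DC concentrates all clients there.
eqiad_only = serving_datacenters({"eqiad": True, "codfw": False})

print(all_depooled, eqiad_only)
```

Under this model, repooling a single DC is what sends all remaining clients to one place, which matches the repool of appservers-ro in eqiad.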

We are leaving api-ro pooled in eqiad while the remaining analytics workloads are investigated / fixed (though noting, per the above, that depooling would have no effect availability-wise), as it sounds like breakage there would be rather disruptive.

Many thanks to @BTullis and @Ottomata for the investigation and fixes today.

@Ottomata - When you get a chance, if you could estimate how long it might take to clean up the remaining WikimediaDefaults use cases, that would be greatly appreciated.

Mentioned in SAL (#wikimedia-operations) [2024-07-16T20:12:10Z] <swfrench@cumin2002> conftool action : set/pooled=true; selector: dnsdisc=appservers-ro,name=eqiad [reason: Repooling to concentrate clients in eqiad - T367949]

Change #1053803 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] mobileapps: remove out-of-date prod config example

https://gerrit.wikimedia.org/r/1053803

Change #1053802 merged by jenkins-bot:

[mediawiki/services/wikifeeds@master] wikifeeds: remove out-of-date prod config example

https://gerrit.wikimedia.org/r/1053802

Change #1054652 merged by jenkins-bot:

[analytics/gobblin-wmf@main] Bump wikimedia-event-utilities version to 1.3.6

https://gerrit.wikimedia.org/r/1054652

Change #1054898 had a related patch set uploaded (by Gmodena; author: Gmodena):

[analytics/refinery@master] artifacts: add gobblin-wmf 1.0.2

https://gerrit.wikimedia.org/r/1054898

Change #1054898 merged by Ottomata:

[analytics/refinery@master] artifacts: add gobblin-wmf 1.0.2

https://gerrit.wikimedia.org/r/1054898

Change #1054904 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] Fix location of gobblin-wmf-core-jar-with-dependencies.jar symlink

https://gerrit.wikimedia.org/r/1054904

Change #1054904 merged by Ottomata:

[analytics/refinery@master] Fix location of gobblin-wmf-core-jar-with-dependencies.jar symlink

https://gerrit.wikimedia.org/r/1054904

Change #1054906 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] refinery::job::test::gobblin - use gobbin-wmf 1.0.2

https://gerrit.wikimedia.org/r/1054906

Change #1054906 merged by Ottomata:

[operations/puppet@production] refinery::job::test::gobblin - use gobbin-wmf 1.0.2

https://gerrit.wikimedia.org/r/1054906

Change #1054908 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] refinery::job::gobblin - use gobbin-wmf 1.0.2

https://gerrit.wikimedia.org/r/1054908

Change #1054908 merged by Ottomata:

[operations/puppet@production] refinery::job::gobblin - use gobbin-wmf 1.0.2

https://gerrit.wikimedia.org/r/1054908

Change #1055256 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/dns@master] wmnet: direct appservers-ro DYNA record to failoid

https://gerrit.wikimedia.org/r/1055256

Change #1055256 merged by Scott French:

[operations/dns@master] wmnet: direct appservers-ro DYNA record to failoid

https://gerrit.wikimedia.org/r/1055256

Mentioned in SAL (#wikimedia-operations) [2024-07-18T18:03:11Z] <swfrench-wmf> appservers-ro.discovery.wmnet now resolves to failoid - T367949

Change #1055268 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/dns@master] wmnet: direct api-ro DYNA record to failoid

https://gerrit.wikimedia.org/r/1055268

Change #1055268 merged by Scott French:

[operations/dns@master] wmnet: direct api-ro DYNA record to failoid

https://gerrit.wikimedia.org/r/1055268

Mentioned in SAL (#wikimedia-operations) [2024-07-18T18:17:09Z] <swfrench-wmf> api-ro.discovery.wmnet now resolves to failoid - T367949

appservers-ro.discovery.wmnet and api-ro.discovery.wmnet now resolve to failoid, by way of manually updating their DYNA records in the wmnet zone template to point to geoip!disc-failoid.

In other words, they are now behaving the same as their -rw counterparts, which resolve to failoid by way of being fully-depooled active / passive discovery services.
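
One way to sanity-check this state is to compare what each discovery name resolves to against failoid's address. The sketch below is an assumption-laden illustration rather than an existing check: `resolve_ipv4` and `names_not_on_failoid` are hypothetical helpers, and failoid's address is not given in this task, so it must be supplied by the caller.

```python
import socket


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses for hostname via the system resolver."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return {info[4][0] for info in infos}


def names_not_on_failoid(names, failoid_addrs, resolver=resolve_ipv4):
    """Return the names that resolve to anything other than failoid_addrs.

    An empty list means every name resolves only to failoid addresses.
    The resolver is injectable so the check can be exercised without DNS.
    """
    allowed = set(failoid_addrs)
    mismatched = []
    for name in names:
        addrs = set(resolver(name))
        if not addrs or not addrs <= allowed:
            mismatched.append(name)
    return mismatched


# Offline example with a stub resolver and documentation-range addresses
# (192.0.2.0/24); these are placeholders, not real production values.
stub = {
    "appservers-ro.discovery.wmnet": {"192.0.2.10"},
    "api-ro.discovery.wmnet": {"192.0.2.99"},
}
leftover = names_not_on_failoid(list(stub), {"192.0.2.10"}, resolver=stub.__getitem__)
print(leftover)
```

Against real DNS, the same function could be called with the default resolver and the actual failoid address from a host inside the production network.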

Rollback: If this needs to be rolled back, revert both patches above and update DNS [0].

As long as no issues are encountered in the meantime, we can plan to turn down the actual LVS services next week.

[0] https://wikitech.wikimedia.org/wiki/DNS#Deploying_DNS_changes

Change #1053823 abandoned by Scott French:

[operations/cookbooks@master] switchdc: prepare mediawiki cache warmup for bare-metal turndown

Reason:

Superseded by Ic48417e5acb0a64cd6af1c66a2b25853a8c2a5ef and follow-on changes.

https://gerrit.wikimedia.org/r/1053823

Silenced ProbeDown for api-https:443 and appservers-https:443 for 24h:

  • f6f67d8d-6381-43b3-9262-9a8cf58f2b19
  • ed0d352b-fb83-4bd4-a586-142b100ca6e5

Change #1050304 merged by Scott French:

[operations/dns@master] Remove legacy appserver and api records

https://gerrit.wikimedia.org/r/1050304

Mentioned in SAL (#wikimedia-operations) [2024-07-23T17:06:43Z] <swfrench-wmf> ran authdns-update on dns1004 to pick up removal of appservers / api records - T367949

Change #1050381 merged by Scott French:

[operations/puppet@production] service.yaml: Switch api and appserver to lvs_setup 1/3

https://gerrit.wikimedia.org/r/1050381

Mentioned in SAL (#wikimedia-operations) [2024-07-23T17:17:09Z] <swfrench-wmf> run-puppet-agent on A:dnsbox to pick up switch to lvs_setup - T367949

Change #1050382 merged by Scott French:

[operations/puppet@production] Remove legacy appservers from profile::lvs::realserver::pools 2/3

https://gerrit.wikimedia.org/r/1050382

Mentioned in SAL (#wikimedia-operations) [2024-07-23T17:28:04Z] <swfrench-wmf> run-puppet-agent on O:lvs::balancer to pick up switch to service_setup, removal of profile::lvs::realserver::pools - T367949

Mentioned in SAL (#wikimedia-operations) [2024-07-23T17:33:51Z] <swfrench@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-codfw (T367949)

Mentioned in SAL (#wikimedia-operations) [2024-07-23T17:40:55Z] <swfrench@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-codfw (T367949)

Change #1056212 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] Remove appserver tests

https://gerrit.wikimedia.org/r/1056212

Mentioned in SAL (#wikimedia-operations) [2024-07-23T17:44:41Z] <swfrench-wmf> sudo cumin 'A:lvs-low-traffic-codfw' 'systemctl restart pybal.service' - T367949

Change #1056212 merged by Clément Goubert:

[operations/puppet@production] Remove appserver tests

https://gerrit.wikimedia.org/r/1056212

Mentioned in SAL (#wikimedia-operations) [2024-07-23T17:51:15Z] <swfrench-wmf> sudo cumin 'A:lvs-secondary-eqiad' 'systemctl restart pybal.service' - T367949

Mentioned in SAL (#wikimedia-operations) [2024-07-23T17:58:25Z] <swfrench-wmf> sudo cumin 'A:lvs-low-traffic-eqiad' 'systemctl restart pybal.service' - T367949

Mentioned in SAL (#wikimedia-operations) [2024-07-23T18:11:53Z] <swfrench-wmf> sudo cumin 'A:lvs-secondary-eqiad or A:lvs-low-traffic-eqiad' 'ipvsadm --delete-service --tcp-service 10.2.2.22:443' (api-https eqiad) - T367949

Mentioned in SAL (#wikimedia-operations) [2024-07-23T18:13:29Z] <swfrench-wmf> sudo cumin 'A:lvs-secondary-eqiad or A:lvs-low-traffic-eqiad' 'ipvsadm --delete-service --tcp-service 10.2.2.1:443' (appservers-https eqiad) - T367949

Change #1050383 merged by Scott French:

[operations/puppet@production] Remove conftool-data and service catalog for legacy appservers 3/3

https://gerrit.wikimedia.org/r/1050383

Mentioned in SAL (#wikimedia-operations) [2024-07-23T18:42:19Z] <mutante> puppetmaster1001/puppetmaster2001 - rm /var/run/confd-template/_srv_config-master_pybal_codfw_api-https.err to clear pybal icinga alerts after T367949

Mentioned in SAL (#wikimedia-operations) [2024-07-23T18:45:07Z] <mutante> puppetmaster1001/puppetmaster2001 - rm /var/run/confd-template/*.err to clear pybal icinga alerts after T367949

Change #1056231 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] Remove has_lvs: true from appserver / api_appserver hosts

https://gerrit.wikimedia.org/r/1056231

Change #1056231 merged by Scott French:

[operations/puppet@production] Set has_lvs: false on appserver / api_appserver hosts

https://gerrit.wikimedia.org/r/1056231

Volans subscribed.

I took the liberty of adding a cleanup item to the task description. If that should be part of another task, feel free to move it around.

Change #1056239 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] Remove references to turned down service in spec files

https://gerrit.wikimedia.org/r/1056239

Change #1056239 merged by Scott French:

[operations/puppet@production] Remove references to turned down service in spec files

https://gerrit.wikimedia.org/r/1056239

Many thanks, all who helped get this out the door.

At this point, the LVS service turndown is done, and we've shaken out a handful of surprises (mainly puppet related, in addition to some confd check file cleanups on config-master hosts and such).

One notable item for future reference is setting has_lvs: false in hieradata (https://gerrit.wikimedia.org/r/1056231), to avoid the affected appservers (in this case, including mwdebug servers) failing puppet runs when instantiating profile::lvs::realserver with no service IPs.

Other changes involved updating test fixtures that depend on the now-changed state of production configuration.

Note: As a quick fix, https://gerrit.wikimedia.org/r/1056239 pointed profile_lvs_realserver_spec.rb at jobrunner, so once those appservers go away, we'll need to solve that properly.

With the UTC-late backport window now also past us without issue, I'm optimistic that we're in a good state now.

Icinga downtime and Alertmanager silence (ID=52c5c269-d4e9-4489-a397-00874b75eb1c) set by cgoubert@cumin1002 for 21 days, 0:00:00 on 16 host(s) and their services with reason: Legacy appserver spindown

mw[2261-2262,2268-2277,2299,2441].codfw.wmnet,mw[1364,1398].eqiad.wmnet

Change #1056467 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] site.pp: Put legacy api and appservers insetup

https://gerrit.wikimedia.org/r/1056467

Change #1056470 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/cookbooks@master] sre.mediawiki.restart-appservers: Remove legacy

https://gerrit.wikimedia.org/r/1056470

Change #1056471 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/cookbooks@master] sre.mediawiki.route-traffic: Use switchdc defined services

https://gerrit.wikimedia.org/r/1056471

Change #1056472 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/cookbooks@master] sre.switchdc.mediawiki: No-op formatting change

https://gerrit.wikimedia.org/r/1056472

Change #1056473 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/cookbooks@master] sre.switchdc.mediawiki: Remove legacy services

https://gerrit.wikimedia.org/r/1056473

Change #1056467 merged by Clément Goubert:

[operations/puppet@production] site.pp: Put legacy api and appservers insetup

https://gerrit.wikimedia.org/r/1056467

Change #1056481 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] Don't force puppet 7 on legacy appservers

https://gerrit.wikimedia.org/r/1056481

Change #1056481 merged by Clément Goubert:

[operations/puppet@production] Don't force puppet 7 on legacy appservers

https://gerrit.wikimedia.org/r/1056481

Change #1053819 abandoned by Scott French:

[operations/puppet@production] mediawiki-cache-warmup: prepare for bare-metal turndown

Reason:

Superseded by Ic48417e5acb0a64cd6af1c66a2b25853a8c2a5ef.

https://gerrit.wikimedia.org/r/1053819

Change #1056889 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mwdebug: Add hosts to testerver lvs pool

https://gerrit.wikimedia.org/r/1056889

Change #1056895 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] Cleanup old config

https://gerrit.wikimedia.org/r/1056895

Change #1056470 merged by jenkins-bot:

[operations/cookbooks@master] sre.mediawiki.restart-appservers: Remove legacy

https://gerrit.wikimedia.org/r/1056470

Change #1056471 merged by jenkins-bot:

[operations/cookbooks@master] sre.mediawiki.route-traffic: Use switchdc defined services

https://gerrit.wikimedia.org/r/1056471

Change #1056472 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.mediawiki: No-op formatting change

https://gerrit.wikimedia.org/r/1056472

Change #1056473 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.mediawiki: Remove legacy services

https://gerrit.wikimedia.org/r/1056473

Change #1057010 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] P:mediawiki::php::restarts: fix no-LVS case

https://gerrit.wikimedia.org/r/1057010

Change #1057010 merged by Scott French:

[operations/puppet@production] P:mediawiki::php::restarts: fix no-LVS case

https://gerrit.wikimedia.org/r/1057010

Change #1057841 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] cumin: Remove mw-api aliases

https://gerrit.wikimedia.org/r/1057841

Change #1057841 merged by Clément Goubert:

[operations/puppet@production] cumin: Remove mw-api aliases

https://gerrit.wikimedia.org/r/1057841

Change #1056889 merged by Clément Goubert:

[operations/puppet@production] mwdebug: Add logstash and otelcol config

https://gerrit.wikimedia.org/r/1056889

Clement_Goubert updated Other Assignee, added: Scott_French.
Clement_Goubert updated the task description.

Change #1056895 merged by Clément Goubert:

[operations/puppet@production] Cleanup old config

https://gerrit.wikimedia.org/r/1056895

Change #1058099 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] scap: Remove legacy appserver clusters

https://gerrit.wikimedia.org/r/1058099

Change #1058099 merged by Clément Goubert:

[operations/puppet@production] scap: Remove legacy appserver clusters

https://gerrit.wikimedia.org/r/1058099