
Migrate deployment-prep away from Debian Buster to Bullseye/Bookworm
Open, Needs Triage, Public

Description

< T278641: Migrate deployment-prep away from Debian Stretch to Buster/Bullseye | NOTYETCREATED >

With Stretch mostly removed now it's time to start removing Buster from deployment-prep, by either migrating services to newer Debian versions or by removing unused services entirely.

Tracking task for production migrations: T291916: Tracking task for Bullseye migrations in production

Instances to migrate

(live report)

  • deployment-acme-chief03 (replaced by deployment-acme-chief05)
  • deployment-acme-chief04 (replaced by deployment-acme-chief06)
  • deployment-cache-text07 (replaced by deployment-cache-text08)
  • deployment-cache-upload07 (replaced by deployment-cache-upload08)
  • deployment-cumin T361380
  • deployment-deploy03
  • deployment-docker-api-gateway01
  • deployment-docker-changeprop01 T369913
  • deployment-docker-cpjobqueue01 T369914
  • deployment-docker-mobileapps01 T369915
  • deployment-docker-proton01 T369916
  • deployment-echostore02 T361383
  • deployment-etcd02
  • deployment-eventlog08 T369918
  • deployment-ircd02 T369919
  • deployment-jobrunner04 T370487
  • deployment-kafka-jumbo-[5, 8-9] T361382
  • deployment-kafka-logging01 T361382
  • deployment-kafka-main-[5-6] T361382
  • deployment-maps-master01 T361381
  • deployment-mediawiki[11-12] T361387
  • deployment-memc[08-10] T361384
  • deployment-mwlog01 T369263
  • deployment-mwmaint02 T370582
  • deployment-ores02 T361385
  • deployment-parsoid12 T361386
  • deployment-poolcounter06 T370458 - blocked on packaging poolcounter-prometheus-exporter for Bookworm and/or Bullseye (see also T332015 for production)
  • deployment-puppetdb[03-04] (replaced by deployment-puppetdb05)
  • deployment-puppetmaster04 (replaced by deployment-puppetserver-1)
  • deployment-push-notifications01 T370459
  • deployment-restbase04 T370460
  • deployment-sessionstore04 T370461
  • deployment-shellbox T370462
  • deployment-snapshot03 T370465
  • deployment-urldownloader03 T370466
  • deployment-xhgui03 T370467

Related Objects

Status      Assigned
Open        None
Resolved    jhathaway
Open        hnowlan
Resolved    herron
Open        Eevans
Resolved    jijiki
Resolved    elukey
Resolved    Southparkfan
Resolved    Southparkfan
Resolved    rook
Resolved    Andrew
Resolved    fgiunchedi
Invalid     None
Resolved    Andrew
Resolved    Jgiannelos
Resolved    Jgiannelos
Resolved    BTullis
Resolved    Southparkfan
Open        Southparkfan
Open        None
Resolved    Southparkfan
Resolved    Eevans
Resolved    Southparkfan
Resolved    Southparkfan
Resolved    BTullis
Resolved    Southparkfan
Resolved    Andrew
Resolved    Southparkfan
Resolved    Southparkfan

Event Timeline


Change 933463 merged by jenkins-bot:

[operations/mediawiki-config@master] [beta] Update wgCdnServersNoPurge to remove unused cache servers

https://gerrit.wikimedia.org/r/933463

Mentioned in SAL (#wikimedia-cloud) [2023-06-28T09:54:41Z] <fabfur> removed (text|upload) instance references from wgCdnServersNoPurge (T327742)

Andrew updated the task description.

We talked about this a bit in the Release-Engineering-Team meeting on Wednesday and discussed whether we had the ability to help. I don't think we're well positioned to do all of this work, as it encompasses a clone of most of Wikimedia's production footprint (where SRE has the most expertise).

I think one path forward here would be for multiple teams to pitch in and fix the instances they know how to fix.


Looking at the instance list, I do wonder whether all of them are needed in beta (echostore?). A place to start might be a set of subtasks verifying, for each instance:

  1. Needed for beta? (i.e., critical to some team's workflow; if so, whose?)
  2. Ported to Bullseye in production? If not, any foreseeable blockers?
  3. Can you help?

Other notes:

  • Quota: the deployment-prep project is currently at its vCPU quota, so it'd need some extra headroom to spin up new Bullseye instances before shutting down the "working" ones.
  • Production Debian packages: T291916: Tracking task for Bullseye migrations in production may be a blocker in some cases: deployment-prep depends on the same Debian packages used in production, and if a package isn't yet built for Bullseye, there's no way to get a Bullseye instance running with the production puppet code.

The average instance on deployment-prep is ~2 vCPU, 20 GB disk, 4 GB RAM.

deployment-prep is almost at its RAM quota and at its CPU quota. Bumping those up to allow 5 new instances should let folks at least get started. So 10 vCPU / 20 GB RAM? Possible? It'll probably be possible to reclaim some of that post-migration.
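
To make the arithmetic behind that request explicit, here is a minimal sketch using the averages quoted above; the figures are estimates taken from this comment, not a live quota report:

```
# Back-of-the-envelope headroom estimate for running replacement VMs
# alongside the old ones. Per-instance averages come from the comment above.
AVG_VCPU_PER_INSTANCE = 2
AVG_RAM_GB_PER_INSTANCE = 4
PARALLEL_REPLACEMENTS = 5   # new VMs running in parallel with the old ones

extra_vcpu = PARALLEL_REPLACEMENTS * AVG_VCPU_PER_INSTANCE      # 10 vCPU
extra_ram_gb = PARALLEL_REPLACEMENTS * AVG_RAM_GB_PER_INSTANCE  # 20 GB RAM
print(f"Requested quota bump: {extra_vcpu} vCPU / {extra_ram_gb} GB RAM")
```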

Andrew updated the task description.
> Quota: the deployment-prep project is currently at its vCPU quota, so it'd need some extra headroom to spin up new Bullseye instances before shutting down the "working" ones.
>
> The average instance on deployment-prep is ~2 vCPU, 20 GB disk, 4 GB RAM.
>
> deployment-prep is almost at its RAM quota and at its CPU quota. Bumping those up to allow 5 new instances should let folks at least get started. So 10 vCPU / 20 GB RAM? Possible? It'll probably be possible to reclaim some of that post-migration.

Added as subtask: T361477: requests to increase quotas deployment-prep

I have asked on IRC in the #wikimedia-cloud channel about a way to speed up the process. In the puppet repo we have dist-upgrade.sh, a script that should be able to do all the safe steps to upgrade the OS in place. If it is OK to use it, we'd avoid re-creating VMs and having to copy stateful data around (for example Kafka, databases, etc.). The only downside would be for the Cloud team, which would need to fix the VM's metadata (like the reported OS) after the dist-upgrade.
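
For context, a minimal sketch of what such an in-place upgrade generally involves is below. This is not the actual dist-upgrade.sh from operations/puppet (whose exact steps and safety checks may differ); it only illustrates the generic Debian procedure, and it assumes root on the instance.

```
#!/usr/bin/env python3
# Minimal sketch of an in-place Debian upgrade (Buster -> Bullseye).
# NOT operations/puppet's dist-upgrade.sh; for illustration only.
import os
import re
import subprocess
from pathlib import Path

OLD, NEW = "buster", "bullseye"

def switch_apt_sources(old: str, new: str) -> None:
    """Point every APT sources list at the new release."""
    lists = [Path("/etc/apt/sources.list"),
             *Path("/etc/apt/sources.list.d").glob("*.list")]
    for path in lists:
        path.write_text(re.sub(rf"\b{old}\b", new, path.read_text()))

def apt(*args: str) -> None:
    """Run apt-get non-interactively and fail loudly on errors."""
    env = {**os.environ, "DEBIAN_FRONTEND": "noninteractive"}
    subprocess.run(["apt-get", "-y", *args], check=True, env=env)

if __name__ == "__main__":
    switch_apt_sources(OLD, NEW)
    apt("update")
    apt("upgrade")        # minimal upgrade first, as the release notes suggest
    apt("dist-upgrade")   # then the full upgrade to the new release
    apt("autoremove", "--purge")
    # A reboot is still required to boot the new kernel, and the instance's
    # reported OS / base-image metadata stays stale (the downside noted above).
```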

I have further increased the core quota for this project from 220 to 240.

> I have asked on IRC in the #wikimedia-cloud channel about a way to speed up the process. In the puppet repo we have dist-upgrade.sh, a script that should be able to do all the safe steps to upgrade the OS in place. If it is OK to use it, we'd avoid re-creating VMs and having to copy stateful data around (for example Kafka, databases, etc.). The only downside would be for the Cloud team, which would need to fix the VM's metadata (like the reported OS) after the dist-upgrade.

There are a few reasons I don't like in-place upgrades.

One is the metadata issue that you mentioned: it's not as simple as changing a setting, it's more like rewriting history since 'what base image was used for this VM' is a question that's of use in various places. It used to be /very/ important as part of VM migration due to copy-on-write issues; I'd need to retest things to make sure that's still the case.
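
(As an illustration of where that question comes up, here is a hedged sketch using openstacksdk; the cloud entry name and server name are illustrative assumptions, not real configuration.)

```
# Hedged sketch: reading which Glance base image a VM was created from.
import openstack

conn = openstack.connect(cloud="deployment-prep")        # assumed clouds.yaml entry
server = conn.compute.find_server("deployment-memc10")   # example instance name

if server is not None and server.image:
    # server.image references the Glance image the VM was originally built
    # from. After an in-place dist-upgrade it still points at the old Buster
    # base image, which is exactly the metadata that would need rewriting.
    image = conn.image.get_image(server.image["id"])
    print(f"{server.name} was built from base image: {image.name}")
else:
    print("Server not found, or booted from volume (no image reference).")
```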

Another is that deployment-prep is meant to resemble production, and we mostly don't do in-place upgrades in production. For example with T361384, the old memc hosts were still running redis and had a ton of leftover puppet and hiera config related to supporting redis. Only a full rebuild got us memc servers that actually resemble real production servers without a bunch of miscellaneous cruft.

And, of course, proper puppet design means that it isn't hard to rebuild hosts. If it /is/ hard we probably want to fix that :)

>> I have asked on IRC in the #wikimedia-cloud channel about a way to speed up the process. In the puppet repo we have dist-upgrade.sh, a script that should be able to do all the safe steps to upgrade the OS in place. If it is OK to use it, we'd avoid re-creating VMs and having to copy stateful data around (for example Kafka, databases, etc.). The only downside would be for the Cloud team, which would need to fix the VM's metadata (like the reported OS) after the dist-upgrade.

> There are a few reasons I don't like in-place upgrades.

> One is the metadata issue that you mentioned: it's not as simple as changing a setting, it's more like rewriting history since 'what base image was used for this VM' is a question that's of use in various places. It used to be /very/ important as part of VM migration due to copy-on-write issues; I'd need to retest things to make sure that's still the case.

Ok thanks for the explanation!

> Another is that deployment-prep is meant to resemble production, and we mostly don't do in-place upgrades in production. For example with T361384, the old memc hosts were still running redis and had a ton of leftover puppet and hiera config related to supporting redis. Only a full rebuild got us memc servers that actually resemble real production servers without a bunch of miscellaneous cruft.

To be fair, deployment-prep is really far from production, from its puppet config to the hosts/VMs that run the software. I can certainly agree that it's good to clean up stale/old packages every now and then, but I don't think it's strictly needed for deployment-prep (my 2c).

> And, of course, proper puppet design means that it isn't hard to rebuild hosts. If it /is/ hard we probably want to fix that :)

I don't think this has much to do with Puppet design; it's more a matter of system complexity and of the usability of deployment-prep and Horizon/OpenStack. In production we are able to reimage a node in place, with the possibility of keeping its data/state around, something that (IIUC) is not possible on Horizon/OpenStack unless you use tricks like the in-place OS dist-upgrade. This makes the creation of new VMs painful, since it is not just a matter of running puppet on the new node with the correct role; you also have to keep any distributed system consistent with its ensemble (for example Kafka/Cassandra/etc. clusters). In turn, this makes upgrading deployment-prep something that is done only when strictly needed, and by the people who volunteer to do it.

We have been talking about a real staging environment for production since I started working here, but we never really got anywhere close. The deployment-prep setup is incredibly useful but far from production, and we should find better and less painful ways to keep it in line with production, ways that encourage people to upgrade rather than avoid it completely.

To be clear: I know you are doing this work because it is needed and because moving away from old OSes is good practice. I'm not blaming you for the deployment-prep situation, just stating some thoughts out loud so we can discuss them :)

Andrew updated the task description.

Change #1053956 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] deployment-prep: replace deploy03 with deploy04

https://gerrit.wikimedia.org/r/1053956

Change #1053956 merged by Andrew Bogott:

[operations/puppet@production] deployment-prep: replace deploy03 with deploy04

https://gerrit.wikimedia.org/r/1053956