⚓ T327742 Migrate deployment-prep away from Debian Buster to Bullseye/Bookworm

Subject	Repo	Branch	Lines +/-
deployment-prep: replace deploy03 with deploy04	operations/puppet	production	+10 -12
[beta] Update wgCdnServersNoPurge to remove unused cache servers	operations/mediawiki-config	master	+0 -4
hiera: Removed unused cache instances from deployment-prep	operations/puppet	production	+0 -3
hiera: Added new bullseye instance for cache-text in deployment-prep	operations/puppet	production	+3 -0
[beta] Update wgCdnServersNoPurge for new cache server	operations/mediawiki-config	master	+2 -0
hiera: Added new bullseye instance for cache-upload in deployment-prep	operations/puppet	production	+1 -1
[beta] Update wgCdnServersNoPurge	operations/mediawiki-config	master	+2 -0

Status	Assigned	Task
Open	None	T327742 Migrate deployment-prep away from Debian Buster to Bullseye/Bookworm
Resolved	jhathaway	T361380 Replace deployment-cumin with Bullseye or Bookworm host
Open	hnowlan	T361381 Replace deployment-maps-master01 with a Bullseye or Bookworm instance
Resolved	herron	T361382 Replace deployment-prep kafka hosts with Bullseye or Bookworm
Open	Eevans	T361383 Replace or remove deployment-echostore02.deployment-prep.eqiad1.wikimedia.cloud
Resolved	jijiki	T361384 Replace deployment-memc[08-10] with Bullseye or Bookworm
Resolved	elukey	T361385 Replace deployment-ores02
Resolved	Southparkfan	T361386 Remove or replace deployment-parsoid12.deployment-prep.eqiad1.wikimedia.cloud
Resolved	Southparkfan	T361387 Replace or delete deployment-mediawiki[11-12].deployement-prep.eqiad1.wikimedia.cloud
Resolved	rook	T361477 requests to increase quotas deployment-prep
Resolved	Andrew	T361536 Launching new bullseye deployment-prep instances fails, no sudo access
Resolved	fgiunchedi	T369263 Upgrade deployment-mwlog01.deployment-prep.eqiad1.wikimedia.cloud to bullseye or bookworm
Invalid	None	T369913 Rebuild or delete deployment-docker-changeprop01
Resolved	Andrew	T369914 Rebuild or delete deployment-docker-cpjobqueue01
Resolved	Jgiannelos	T369915 Rebuild or delete deployment-docker-mobileapps01
Resolved	Jgiannelos	T369916 Rebuild or delete deployment-docker-proton01
Resolved	BTullis	T369918 Replace deployment-eventlog08 with Bullseye or Bookworm host
Resolved	Southparkfan	T369919 Replace deployment-ircd02 with a Bullseye or Bookworm host
Open	Southparkfan	T370458 Remove or replace poolcounter06.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation)
Open	None	T332015 Migrate poolcounter hosts to bullseye
Resolved	Southparkfan	T370459 Remove or replace deployment-push-notifications01.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation)
Resolved	Eevans	T370460 Remove or replace deployment-restbase04.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation)
Resolved	Southparkfan	T370461 Remove or replace deployment-sessionstore04.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation)
Resolved	Southparkfan	T370462 Remove or replace deployment-shellbox.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation)
Resolved	BTullis	T370465 Remove or replace deployment-snapshot03.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation)
Resolved	Southparkfan	T370466 Remove or replace deployment-urldownloader03.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation)
Resolved	Andrew	T370467 Remove or replace deployment-xhgui03.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation)
Resolved	Southparkfan	T370487 Remove or replace deployment-jobrunner04.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation)
Resolved	Southparkfan	T370582 Remove or replace deployment-mwmaint02.deployment-prep.eqiad1.wikimedia.cloud

Change 933463 merged by jenkins-bot:

[operations/mediawiki-config@master] [beta] Update wgCdnServersNoPurge to remove unused cache servers

https://gerrit.wikimedia.org/r/933463

Mentioned in SAL (#wikimedia-cloud) [2023-06-28T09:54:41Z] <fabfur> removed (text|upload) instance references from wgCdnServersNoPurge (T327742)

Maintenance_bot removed a project: Patch-For-Review.Jun 28 2023, 10:10 AM

Daimona mentioned this in T345566: error.log is not rotated in beta.Sep 5 2023, 12:17 AM

Andrew updated the task description. (Show Details)Mar 29 2024, 4:32 PM

Andrew updated the task description. (Show Details)

Andrew updated the task description. (Show Details)Mar 29 2024, 4:42 PM

Andrew updated the task description. (Show Details)Mar 29 2024, 4:47 PM

Andrew updated the task description. (Show Details)Mar 29 2024, 4:57 PM

We talked about this a bit in the Release-Engineering-Team team meeting on Wednesday—we discussed whether we had the ability to help. And I don't think we're well-positioned to do all this work as this encompasses a clone of most of Wikimedia's production footprint (where SRE has the most expertise).

I think one path forward here would be for multiple teams pitch-in to fix the instances they know how to fix.

Looking at the instance list, I do wonder if they're needed in beta (echostore?). A place to start might be a bunch of subtasks verifying:

Needed for beta? (i.e., critical to some team's workflow, if so: who?)
Ported to Bullseye in production? If not, any foreseeable blockers?
Can you help?

Other notes:

Quota: deployment-prep project is currently at its quota for vCPU, so it'd need some extra headroom to spin up new Bullseye instances before shutting down the "working" ones.
Production debian packages: T291916: Tracking task for Bullseye migrations in production may be a blocker in some cases deployment-prep depends on the same Debian packages used for production, and if they're not yet packaged for Bullseye, then there's no way to get a Bullseye image to run using production puppet code.

Andrew updated the task description. (Show Details)Mar 29 2024, 5:05 PM

In T327742#9673367, @thcipriani wrote:

Quota: deployment-prep project is currently at its quota for vCPU, so it'd need some extra headroom to spin up new Bullseye instances before shutting down the "working" ones.

The average instance on deployment-prep is ~ 2vCPU, 20GB disk, 4GB ram.

deployment-prep is almost at RAM quota and at CPU quota. Bumping those up to allow 5 new instance should allow folks to at least get started. So 10vCPU/20GB ram? Possible? It'll probably possible to reclaim some of that post-migration.

Andrew updated the task description. (Show Details)Mar 29 2024, 6:02 PM

Andrew updated the task description. (Show Details)Mar 29 2024, 6:07 PM

Andrew updated the task description. (Show Details)

In T327742#9673515, @thcipriani wrote:

In T327742#9673367, @thcipriani wrote:

Quota: deployment-prep project is currently at its quota for vCPU, so it'd need some extra headroom to spin up new Bullseye instances before shutting down the "working" ones.

The average instance on deployment-prep is ~ 2vCPU, 20GB disk, 4GB ram.

deployment-prep is almost at RAM quota and at CPU quota. Bumping those up to allow 5 new instance should allow folks to at least get started. So 10vCPU/20GB ram? Possible? It'll probably possible to reclaim some of that post-migration.

Added as subtask: T361477: requests to increase quotas deployment-prep

rook closed subtask T361477: requests to increase quotas deployment-prep as Resolved.Apr 1 2024, 7:45 PM

elukey mentioned this in T360595: beta-scap-sync-world fails: logstash_checker.py: KeyError: 'aggregations'.Apr 2 2024, 12:57 PM

Andrew closed subtask T361385: Replace deployment-ores02 as Resolved.Apr 2 2024, 1:19 PM

I have asked something on IRC's wikimedia-cloud about a way to speed up the process. In the puppet repo we have dist-upgrade.sh, a script that should be able to do all the safe steps to upgrade the OS in place. If it is ok to use it, we'd avoid the re-creation of new VMs with the issue of copying stateful data around (for example, Kafka/databases/etc..). The only downside would be for the Cloud team, that would need to fix the VM's metadata (like reported OS) after the dist-upgrade.

Andrew closed subtask T361536: Launching new bullseye deployment-prep instances fails, no sudo access as Resolved.Jun 14 2024, 5:31 PM

hnowlan updated the task description. (Show Details)Jun 26 2024, 10:47 AM

I have further increased the core quota for this project from 220 to 240.

In T327742#9812917, @elukey wrote:

I have asked something on IRC's wikimedia-cloud about a way to speed up the process. In the puppet repo we have dist-upgrade.sh, a script that should be able to do all the safe steps to upgrade the OS in place. If it is ok to use it, we'd avoid the re-creation of new VMs with the issue of copying stateful data around (for example, Kafka/databases/etc..). The only downside would be for the Cloud team, that would need to fix the VM's metadata (like reported OS) after the dist-upgrade.

There are a few reasons I don't like in-place upgrade.

One is the metadata issue that you mentioned: it's not as simple as changing a setting, it's more like rewriting history since 'what base image was used for this VM' is a question that's of use in various places. It used to be /very/ important as part of VM migration due to copy-on-write issues; I'd need to retest things to make sure that's still the case.

Another is that deployment-prep is meant to resemble production, and we mostly don't do in-place upgrades in production. For example with T361384, the old memc hosts were still running redis and had a ton of puppet and hiera leftover related to supporting redis. Only a new rebuild got us memc servers that actually resemble real production servers without a bunch of miscellaneous cruft.

And, of course, proper puppet design means that it isn't hard to rebuild hosts. If it /is/ hard we probably want to fix that :)

In T327742#9949679, @Andrew wrote:

In T327742#9812917, @elukey wrote:

I have asked something on IRC's wikimedia-cloud about a way to speed up the process. In the puppet repo we have dist-upgrade.sh, a script that should be able to do all the safe steps to upgrade the OS in place. If it is ok to use it, we'd avoid the re-creation of new VMs with the issue of copying stateful data around (for example, Kafka/databases/etc..). The only downside would be for the Cloud team, that would need to fix the VM's metadata (like reported OS) after the dist-upgrade.

There are a few reasons I don't like in-place upgrade.

One is the metadata issue that you mentioned: it's not as simple as changing a setting, it's more like rewriting history since 'what base image was used for this VM' is a question that's of use in various places. It used to be /very/ important as part of VM migration due to copy-on-write issues; I'd need to retest things to make sure that's still the case.

Ok thanks for the explanation!

Another is that deployment-prep is meant to resemble production, and we mostly don't do in-place upgrades in production. For example with T361384, the old memc hosts were still running redis and had a ton of puppet and hiera leftover related to supporting redis. Only a new rebuild got us memc servers that actually resemble real production servers without a bunch of miscellaneous cruft.

To be really fair, deployment-prep is really far from production, from its puppet config to the hosts/VM that runs the software. I can surely agree on the point that it is good every now and then to clean up stale/old packages, but I don't think it is strictly needed for deployment-prep (my 2c).

And, of course, proper puppet design means that it isn't hard to rebuild hosts. If it /is/ hard we probably want to fix that :)

I don't think that this has anything to do with Puppet design, it is just a matter of system complexity and the usability of deployment-prep and horizon/open-stack. In production we are able to reimage in place a node, with the possibility of keeping its data/state around, something that IIUC it is not possible on horizon/open-stack unless you use tricks like the OS in place dist-upgrade. This makes the creation of new VMs painful, since it is not just a matter of running puppet on the new node with the correct role, but also to keep any distributed system consistent with its ensemble (for example, Kafka/Cassandra/etc.. clusters). And in turn this makes the upgrade of deployment-prep something that it is done only when strictly needed, and from people that volunteer to do it. We have been talking about a real staging environment for production since I started working here, but we never really got anywhere close. The deployment-prep setup is incredibly useful but far from production, and we should find better and painless ways to keep it up with production that could encourage people to upgrade rather than avoiding it completely.

To be clear - I know that you are doing this work because it is needed and because moving away from old OSes is a good practice, not blaming you for the deployment-prep situation, just stating some thoughts out loud so we can discuss about them :)

fgiunchedi added a subtask: T369263: Upgrade deployment-mwlog01.deployment-prep.eqiad1.wikimedia.cloud to bullseye or bookworm.Jul 4 2024, 9:35 AM

fgiunchedi updated the task description. (Show Details)

fgiunchedi closed subtask T369263: Upgrade deployment-mwlog01.deployment-prep.eqiad1.wikimedia.cloud to bullseye or bookworm as Resolved.Jul 5 2024, 1:24 PM

fgiunchedi updated the task description. (Show Details)

Andrew updated the task description. (Show Details)Jul 5 2024, 2:38 PM

Andrew closed subtask T361384: Replace deployment-memc[08-10] with Bullseye or Bookworm as Resolved.Jul 6 2024, 3:15 PM

Andrew closed subtask T361380: Replace deployment-cumin with Bullseye or Bookworm host as Resolved.Jul 12 2024, 4:08 PM

Andrew closed subtask T361382: Replace deployment-prep kafka hosts with Bullseye or Bookworm as Resolved.

Andrew updated the task description. (Show Details)Jul 12 2024, 4:20 PM

Andrew updated the task description. (Show Details)Jul 12 2024, 4:25 PM

Andrew updated the task description. (Show Details)

Andrew updated the task description. (Show Details)Jul 12 2024, 4:28 PM