Cloud-VPS
Component · Active · Public

Details

Description

Bugs related to Cloud VPS infrastructure and not a specific Cloud VPS project (see VPS-Projects for that).

Issues which are related to Toolforge should go in Toolforge instead.

Request new projects by filing a task in Cloud-VPS (Project-requests).

Request increased quota for existing projects by filing a task in Cloud-VPS (Quota-requests).

Recent Activity

Today

gerritbot added a comment to T371393: Cloud VPS: extend tofu-infra to cover projects, users and roles.

Change #1069994 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/wmcs-cookbooks@main] wmcs.vps.create_project: replace logic with message about deprecation

https://gerrit.wikimedia.org/r/1069994

Mon, Sep 2, 11:07 AM · Patch-For-Review, Cloud-VPS, User-aborrero, cloud-services-team
aborrero added a comment to T370660: tofu-infra: investigate S3 spurious endpoint errors.

had this again today.

Mon, Sep 2, 10:56 AM · Cloud-VPS, User-aborrero, Epic, cloud-services-team
CodeReviewBot added a comment to T371393: Cloud VPS: extend tofu-infra to cover projects, users and roles.

aborrero merged https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/28

Mon, Sep 2, 10:54 AM · Patch-For-Review, Cloud-VPS, User-aborrero, cloud-services-team
dcaro added a comment to T348643: cloudcephosd1021-1034: hard drive sector errors increasing.

Hi @dcaro - just following up on this to see if you were ok with shipping these WMCS drives, with data on them, back to Dell to identify the root cause? In Dell's last email a couple of weeks ago, they stated that they have an NDA with Hynix, along with the NDA with Wikimedia, which should cover any security concerns. To ensure we don't lose momentum, during my call with Dell today I asked them to provide the number of drives they need and a shipping label for where to send them. Let us know, though, if you feel comfortable with sending the disks. Thanks, Willy

Mon, Sep 2, 9:46 AM · cloud-services-team (FY2024/2025-Q1-Q2), SRE, ops-eqiad, DC-Ops, Cloud-VPS
aborrero moved T371391: Cloud VPS: extend tofu-infra to cover quotas from Backlog to Next on the User-aborrero board.
Mon, Sep 2, 9:40 AM · Cloud-VPS, User-aborrero, cloud-services-team
aborrero moved T364725: Migrate Cloud VPS instances to VXLAN based networks from Next to Doing on the User-aborrero board.
Mon, Sep 2, 9:39 AM · cloud-services-team (FY2024/2025-Q1-Q2), User-aborrero, Cloud-VPS
aborrero closed T370414: tofu-infra: create a cookbook automation to run tofu, a subtask of T370037: Cloud VPS: extend tofu-infra coverage, as Resolved.
Mon, Sep 2, 9:33 AM · Cloud-VPS, User-aborrero, Epic, cloud-services-team
aborrero closed T370414: tofu-infra: create a cookbook automation to run tofu as Resolved.
Mon, Sep 2, 9:33 AM · Cloud-VPS, User-aborrero, Epic, cloud-services-team

Thu, Aug 29

wiki_willy added a comment to T348643: cloudcephosd1021-1034: hard drive sector errors increasing.

Hi @dcaro - just following up on this to see if you were ok with shipping these WMCS drives, with data on them, back to Dell to identify the root cause? In Dell's last email a couple of weeks ago, they stated that they have an NDA with Hynix, along with the NDA with Wikimedia, which should cover any security concerns. To ensure we don't lose momentum, during my call with Dell today I asked them to provide the number of drives they need and a shipping label for where to send them. Let us know, though, if you feel comfortable with sending the disks. Thanks, Willy

Thu, Aug 29, 7:58 PM · cloud-services-team (FY2024/2025-Q1-Q2), SRE, ops-eqiad, DC-Ops, Cloud-VPS
Andrew closed T372821: wmcs.ceph.osd.bootstrap_and_add cookbook should add fewer osds at once as Resolved.
Thu, Aug 29, 6:35 PM · Patch-For-Review, Cloud-VPS, cloud-services-team
gerritbot added a comment to T372821: wmcs.ceph.osd.bootstrap_and_add cookbook should add fewer osds at once.

Change #1068839 merged by jenkins-bot:

[cloud/wmcs-cookbooks@main] ceph.osd.bootstrap_and_add: default to only adding 2 osds at once

https://gerrit.wikimedia.org/r/1068839

Thu, Aug 29, 6:23 PM · Patch-For-Review, Cloud-VPS, cloud-services-team
gerritbot added a project to T372821: wmcs.ceph.osd.bootstrap_and_add cookbook should add fewer osds at once: Patch-For-Review.
Thu, Aug 29, 6:15 PM · Patch-For-Review, Cloud-VPS, cloud-services-team
gerritbot added a comment to T372821: wmcs.ceph.osd.bootstrap_and_add cookbook should add fewer osds at once.

Change #1068839 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[cloud/wmcs-cookbooks@main] ceph.osd.bootstrap_and_add: default to only adding 2 osds at once

https://gerrit.wikimedia.org/r/1068839

Thu, Aug 29, 6:15 PM · Patch-For-Review, Cloud-VPS, cloud-services-team
Andrew closed T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade as Resolved.
Thu, Aug 29, 3:13 PM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS
Andrew claimed T372821: wmcs.ceph.osd.bootstrap_and_add cookbook should add fewer osds at once.
Thu, Aug 29, 2:53 PM · Patch-For-Review, Cloud-VPS, cloud-services-team

Mon, Aug 26

bd808 added a parent task for T369044: Upgrade cloud-vps openstack to version 'Caracal': T373093: Update cluster to 1.26.
Mon, Aug 26, 6:16 PM · Patch-For-Review, cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS
bd808 added a parent task for T369044: Upgrade cloud-vps openstack to version 'Caracal': T373360: magnum to 1.27.
Mon, Aug 26, 6:15 PM · Patch-For-Review, cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS
bd808 added a parent task for T369044: Upgrade cloud-vps openstack to version 'Caracal': T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade.
Mon, Aug 26, 6:15 PM · Patch-For-Review, cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS
bd808 added a subtask for T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade: T369044: Upgrade cloud-vps openstack to version 'Caracal'.
Mon, Aug 26, 6:15 PM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS
rook added a comment to T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade.

Looks like some of the images for the 1.26 deploy of magnum k8s vanished from upstream, so tofu sees the deploy as good, but kube-system has a few image pull failures. 1.27 seems to work now. There is a little note in T373360 and a little discussion in T373093 on the issue. At any rate, you should probably update. Quarry is updated to 1.27, so its code is good to replicate from. PAWS has a PR open for updating, but it hasn't gone out yet: https://github.com/toolforge/paws/pull/451

Mon, Aug 26, 5:53 PM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS
bd808 assigned T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade to Andrew.
Mon, Aug 26, 5:09 PM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS
bd808 added a comment to T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade.

I didn't do a rabbitmq overhaul. This is silly, but I restarted all the magnum-conductor agents and the cert signing became reliable. And, as far as I can tell, that was it -- tofu can now create a k8s cluster.

So the conductor was wedged in a way that was invisible to monitoring, I guess.

In the meantime, bd808 can you retest?

Mon, Aug 26, 5:08 PM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS
SLyngshede-WMF claimed T306788: Update offboard-user script to use Keystone API.
Mon, Aug 26, 3:11 PM · Cloud-VPS, Infrastructure-Foundations, SRE-tools, User-jbond

Sun, Aug 25

Andrew added a comment to T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade.

I've applied the (messy) patch suggested as part of the upstream bug. It seems to be helping.

Sun, Aug 25, 11:26 PM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS
gerritbot added a comment to T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade.

Change #1065988 merged by Andrew Bogott:

[operations/puppet@production] Magnum: hack in fixes to sqlalchemy use

https://gerrit.wikimedia.org/r/1065988

Sun, Aug 25, 10:01 PM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS
gerritbot added a comment to T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade.

Change #1065934 abandoned by Andrew Bogott:

[operations/puppet@production] Openstack magnum: override default db settings to increase connection count

Reason:

This delays but does not resolve the problem.

https://gerrit.wikimedia.org/r/1065934

Sun, Aug 25, 9:20 PM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS
gerritbot added a comment to T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade.

Change #1065988 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Magnum: hack in fixes to sqlalchemy use

https://gerrit.wikimedia.org/r/1065988

Sun, Aug 25, 9:19 PM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS
Andrew added a comment to T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade.

The attached bug delays, but does not resolve the problem. It's a connection leak of some sort, affecting many users: https://bugs.launchpad.net/magnum/+bug/2067345

Sun, Aug 25, 8:22 PM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS
gerritbot added a project to T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade: Patch-For-Review.
Sun, Aug 25, 7:18 PM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS
gerritbot added a comment to T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade.

Change #1065934 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Openstack magnum: override default db settings to increase connection count

https://gerrit.wikimedia.org/r/1065934

Sun, Aug 25, 7:18 PM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS

Sat, Aug 24

Andrew added a comment to T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade.

I didn't do a rabbitmq overhaul. This is silly, but I restarted all the magnum-conductor agents and the cert signing became reliable. And, as far as I can tell, that was it -- tofu can now create a k8s cluster.

Sat, Aug 24, 2:51 AM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS
Andrew added a comment to T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade.

On the server side, a failed curl looks like this:

Sat, Aug 24, 2:10 AM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS
Andrew added a comment to T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade.

Testing by hand, that curl works some of the time and fails some of the time. So maybe we're just seeing a partial service outage plus a very touchy agent.

Sat, Aug 24, 2:04 AM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS
Andrew added a comment to T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade.

Deep in the guts of the heat agent it is trying to get json containing a cert, and failing to parse what it gets:

Sat, Aug 24, 1:39 AM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS

Fri, Aug 23

bd808 added a parent task for T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade: T372498: Figure out how to provision a Kubernetes cluster using Magnum and OpenTofu.
Fri, Aug 23, 9:55 PM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS
bd808 renamed T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade from Provisioning of Kubernetes cluster via Magnum stopped working around time time of OpenStack upgrade to Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade.
Fri, Aug 23, 9:34 PM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS
bd808 added a comment to T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade.

https://docs.openstack.org/magnum/2024.1/admin/troubleshooting-guide.html#heat-stacks

$ openstack stack list | grep beta
| e93cb3ea-9a9e-47e8-a435-a475706bf948 | beta-v126-zzqihxfjnjjb                | CREATE_FAILED | 2024-08-23T18:03:44Z | None         |
$ openstack stack resource list beta-v126-zzqihxfjnjjb | grep FAILED
| kube_masters                            | dc43e1f5-2f7b-4e69-bc2b-537561edba46 | OS::Heat::ResourceGroup                                                            | CREATE_FAILED   | 2024-08-23T18:04:19Z |
$ openstack stack resource show beta-v126-zzqihxfjnjjb kube_masters
+------------------------+-----------------------------------------------------+
| Field                  | Value                                               |
+------------------------+-----------------------------------------------------+
| attributes             | {'removed_rsrc_list': [], 'refs_map': None, 'refs': |
|                        | None, 'attributes': None}                           |
| creation_time          | 2024-08-23T18:04:19Z                                |
| description            |                                                     |
| links                  | [{'href': 'https://openstack.eqiad1.wikimediacloud. |
|                        | org:28004/v1/deployment-prep/stacks/beta-v126-      |
|                        | zzqihxfjnjjb/e93cb3ea-9a9e-47e8-a435-               |
|                        | a475706bf948/resources/kube_masters', 'rel':        |
|                        | 'self'}, {'href': 'https://openstack.eqiad1.wikimed |
|                        | iacloud.org:28004/v1/deployment-prep/stacks/beta-   |
|                        | v126-zzqihxfjnjjb/e93cb3ea-9a9e-47e8-a435-          |
|                        | a475706bf948', 'rel': 'stack'}, {'href': 'https://o |
|                        | penstack.eqiad1.wikimediacloud.org:28004/v1/deploym |
|                        | ent-prep/stacks/beta-v126-zzqihxfjnjjb-kube_masters |
|                        | -3dcad4ye6nm6/dc43e1f5-2f7b-4e69-bc2b-              |
|                        | 537561edba46', 'rel': 'nested'}]                    |
| logical_resource_id    | kube_masters                                        |
| physical_resource_id   | dc43e1f5-2f7b-4e69-bc2b-537561edba46                |
| required_by            | ['kube_cluster_config', 'api_address_lb_switch',    |
|                        | 'kube_cluster_deploy', 'etcd_address_lb_switch']    |
| resource_name          | kube_masters                                        |
| resource_status        | CREATE_FAILED                                       |
| resource_status_reason | Error: resources.kube_masters.resources[0].resource |
|                        | s.master_config_deployment: Deployment to server    |
|                        | failed: deploy_status_code: Deployment exited with  |
|                        | non-zero status code: 1                             |
| resource_type          | OS::Heat::ResourceGroup                             |
| updated_time           | 2024-08-23T18:04:19Z                                |
+------------------------+-----------------------------------------------------+
Fri, Aug 23, 9:34 PM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS
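
The triage in the comment above (list the stacks, then narrow down to whatever is marked FAILED) can be sketched in Python. This is only a minimal sketch and not part of the original thread: `failed_rows` is a hypothetical helper, and it assumes the CLI's default ASCII table output exactly as pasted above. The sample row is copied from that output.

```python
def failed_rows(table_text):
    """Return the cells of every table row that contains a *_FAILED status."""
    rows = []
    for line in table_text.splitlines():
        line = line.strip()
        # Data rows start and end with '|'; border rows start with '+'.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if any(cell.endswith("_FAILED") for cell in cells):
            rows.append(cells)
    return rows


# Sample row copied from the `openstack stack list` output above.
sample = (
    "| e93cb3ea-9a9e-47e8-a435-a475706bf948 | beta-v126-zzqihxfjnjjb"
    "                | CREATE_FAILED | 2024-08-23T18:03:44Z | None         |\n"
)

for cells in failed_rows(sample):
    # Column 1 is the stack name, column 2 the status.
    print(cells[1], cells[2])
```

This does the same job as the `grep` calls in the comment but keeps the columns separate, so the stack name can be passed on to the next command. If the CLI's JSON formatter (`-f json`) is available, parsing that output would likely be sturdier than parsing the table.
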
bd808 updated subscribers of T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade.

@Andrew, any idea about where I should start looking for hints about what might be going wrong here?

Fri, Aug 23, 8:20 PM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS
bd808 created T373227: Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade.
Fri, Aug 23, 8:19 PM · Patch-For-Review, cloud-services-team, Beta-Cluster-Infrastructure, Cloud-VPS
gerritbot added a comment to T268175: central logging for OpenStack services.

Change #1065247 merged by Andrew Bogott:

[operations/puppet@production] Add openstack magnum and heat logs to logstash

https://gerrit.wikimedia.org/r/1065247

Fri, Aug 23, 4:26 PM · Patch-For-Review, cloud-services-team (Kanban), Cloud-VPS
gerritbot added a comment to T268175: central logging for OpenStack services.

Change #1065247 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Add openstack magnum and heat logs to logstash

https://gerrit.wikimedia.org/r/1065247

Fri, Aug 23, 4:21 PM · Patch-For-Review, cloud-services-team (Kanban), Cloud-VPS
Stashbot added a comment to T369044: Upgrade cloud-vps openstack to version 'Caracal'.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-23T14:48:42Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1062.eqiad.wmnet' (T369044)

Fri, Aug 23, 2:48 PM · Patch-For-Review, cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS
Stashbot added a comment to T369044: Upgrade cloud-vps openstack to version 'Caracal'.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-23T14:42:13Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1062.eqiad.wmnet' (T369044)

Fri, Aug 23, 2:42 PM · Patch-For-Review, cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS
gerritbot added a comment to T371573: puppet problems mounting cinder volumes (and suggested fixes).

Change #1056606 merged by Andrew Bogott:

[operations/puppet@production] cinderutils: add --allow-unattended-format when preparing volumes

https://gerrit.wikimedia.org/r/1056606

Fri, Aug 23, 12:36 AM · cloud-services-team, collaboration-services, Cloud-VPS, Patch-For-Review

Thu, Aug 22

Stashbot added a comment to T369044: Upgrade cloud-vps openstack to version 'Caracal'.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-22T21:16:19Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudweb.set_maintenance (exit_code=0) (T369044)

Thu, Aug 22, 9:16 PM · Patch-For-Review, cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS
Stashbot added a comment to T369044: Upgrade cloud-vps openstack to version 'Caracal'.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-22T21:14:17Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudweb.set_maintenance (T369044)

Thu, Aug 22, 9:14 PM · Patch-For-Review, cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS
Stashbot added a comment to T369044: Upgrade cloud-vps openstack to version 'Caracal'.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-22T20:46:08Z] <andrew@cloudcumin1001> END (ERROR) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=97) on host 'cloudvirt1047.eqiad.wmnet' (T369044)

Thu, Aug 22, 8:46 PM · Patch-For-Review, cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS
Stashbot added a comment to T369044: Upgrade cloud-vps openstack to version 'Caracal'.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-22T20:43:52Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1047.eqiad.wmnet' (T369044)

Thu, Aug 22, 8:44 PM · Patch-For-Review, cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS
Stashbot added a comment to T369044: Upgrade cloud-vps openstack to version 'Caracal'.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-22T20:43:46Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) on host 'cloudvirt1038.eqiad.wmnet' (T369044)

Thu, Aug 22, 8:43 PM · Patch-For-Review, cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS
Stashbot added a comment to T369044: Upgrade cloud-vps openstack to version 'Caracal'.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-08-22T20:36:30Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack on host 'cloudvirt1038.eqiad.wmnet' (T369044)

Thu, Aug 22, 8:36 PM · Patch-For-Review, cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS