JMeybohm
User

User Details

User Since
Apr 2 2020, 9:01 AM (230 w, 4 d)
Availability
Available
IRC Nick
jayme
LDAP User
JMeybohm
MediaWiki User
JMeybohm (WMF) [ Global Accounts ]

Recent Activity

Today

JMeybohm updated the task description for T373810: sre.hosts.reimage fails when the node is already in puppet db but has no facts (puppet never ran).
Mon, Sep 2, 2:01 PM · Infrastructure-Foundations, SRE-tools
JMeybohm created T373810: sre.hosts.reimage fails when the node is already in puppet db but has no facts (puppet never ran).
Mon, Sep 2, 1:55 PM · Infrastructure-Foundations, SRE-tools
JMeybohm added a comment to T372648: sre.hosts.reimage failing due to mkfs.ext4 taking to long.

It's probably enough to bump the default timeout as a quick fix. I'll take a look.

Mon, Sep 2, 1:51 PM · Infrastructure-Foundations, Spicerack, SRE-tools

Fri, Aug 30

JMeybohm added a comment to T337928: cfssl-issuer: Generate Kubernetes Events.

Deployed with:

helmfile -e staging-codfw -l name=cfssl-issuer-crds -l name=cfssl-issuer -i apply --context 5
Fri, Aug 30, 9:50 AM · Patch-For-Review, Prod-Kubernetes, serviceops, Kubernetes

Wed, Aug 28

JMeybohm added a comment to T373192: Improve eventgates health check/readiness probe.

@Ottomata could you please verify the above/patches to unblock the hardware refresh of kafka-main?

Wed, Aug 28, 9:50 AM · Patch-For-Review, serviceops-radar, Event-Platform, Data-Engineering
JMeybohm created T373505: Relabel codfw kubernetes nodes.
Wed, Aug 28, 8:53 AM · ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops

Tue, Aug 27

JMeybohm updated the task description for T373428: decommission kafka-main2001.codfw.wmnet.
Tue, Aug 27, 11:19 AM · ops-codfw, DC-Ops, Patch-For-Review, serviceops, decommission-hardware
JMeybohm added a subtask for T363210: kafka-main200[6789] and kafka-main2010 implementation tracking: T373428: decommission kafka-main2001.codfw.wmnet.
Tue, Aug 27, 10:37 AM · Patch-For-Review, serviceops
JMeybohm added a parent task for T373428: decommission kafka-main2001.codfw.wmnet: T363210: kafka-main200[6789] and kafka-main2010 implementation tracking.
Tue, Aug 27, 10:37 AM · ops-codfw, DC-Ops, Patch-For-Review, serviceops, decommission-hardware
JMeybohm created T373428: decommission kafka-main2001.codfw.wmnet.
Tue, Aug 27, 10:37 AM · ops-codfw, DC-Ops, Patch-For-Review, serviceops, decommission-hardware

Mon, Aug 26

JMeybohm added a comment to T373192: Improve eventgates health check/readiness probe.

To unblock the kafka hardware replacements, I've added an alternative readiness probe to the chart which will be used when .Values.main_app.conf.test_events evaluates to false. It just does a GET /_info which is what we do for other servicerunner services as well IMHO. AIUI schema loading happens async anyways, so the service should be ready as soon as the routing is set up, right?

Mon, Aug 26, 11:23 AM · Patch-For-Review, serviceops-radar, Event-Platform, Data-Engineering

Fri, Aug 23

JMeybohm updated the task description for T363210: kafka-main200[6789] and kafka-main2010 implementation tracking.
Fri, Aug 23, 1:33 PM · Patch-For-Review, serviceops
JMeybohm updated the task description for T373189: Establish a proper process for repacing kafka nodes.
Fri, Aug 23, 1:31 PM · serviceops
JMeybohm updated the task description for T373189: Establish a proper process for repacing kafka nodes.
Fri, Aug 23, 1:30 PM · serviceops
JMeybohm updated the task description for T373189: Establish a proper process for repacing kafka nodes.
Fri, Aug 23, 1:16 PM · serviceops
JMeybohm created T373192: Improve eventgates health check/readiness probe.
Fri, Aug 23, 1:09 PM · Patch-For-Review, serviceops-radar, Event-Platform, Data-Engineering
JMeybohm added a parent task for T363210: kafka-main200[6789] and kafka-main2010 implementation tracking: T373189: Establish a proper process for repacing kafka nodes.
Fri, Aug 23, 12:46 PM · Patch-For-Review, serviceops
JMeybohm added subtasks for T373189: Establish a proper process for repacing kafka nodes: T363210: kafka-main200[6789] and kafka-main2010 implementation tracking, T363214: kafka-main100[6789] and kafka-main1010 implementation tracking.
Fri, Aug 23, 12:46 PM · serviceops
JMeybohm added a parent task for T363214: kafka-main100[6789] and kafka-main1010 implementation tracking: T373189: Establish a proper process for repacing kafka nodes.
Fri, Aug 23, 12:46 PM · serviceops
JMeybohm created T373189: Establish a proper process for repacing kafka nodes.
Fri, Aug 23, 12:46 PM · serviceops

Thu, Aug 22

JMeybohm updated the task description for T359423: Migrate charts to Calico Network Policies.
Thu, Aug 22, 3:28 PM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops
JMeybohm renamed T373115: Tegola is configured with just one kafka broker URL from Tegola is configured with just one kafka broker url to Tegola is configured with just one kafka broker URL.
Thu, Aug 22, 2:05 PM · Maps
JMeybohm created T373115: Tegola is configured with just one kafka broker URL.
Thu, Aug 22, 2:04 PM · Maps
JMeybohm updated the task description for T363210: kafka-main200[6789] and kafka-main2010 implementation tracking.
Thu, Aug 22, 11:41 AM · Patch-For-Review, serviceops
JMeybohm updated the task description for T363210: kafka-main200[6789] and kafka-main2010 implementation tracking.
Thu, Aug 22, 10:21 AM · Patch-For-Review, serviceops
JMeybohm updated the task description for T363210: kafka-main200[6789] and kafka-main2010 implementation tracking.
Thu, Aug 22, 9:47 AM · Patch-For-Review, serviceops
JMeybohm updated the task description for T363210: kafka-main200[6789] and kafka-main2010 implementation tracking.
Thu, Aug 22, 9:36 AM · Patch-For-Review, serviceops
JMeybohm closed T368251: Create ValidatingAdmissionPolicies to replace mediawiki PSP as Resolved.
Thu, Aug 22, 8:58 AM · Patch-For-Review, MW-on-K8s, Kubernetes, serviceops, Prod-Kubernetes
JMeybohm closed T368251: Create ValidatingAdmissionPolicies to replace mediawiki PSP, a subtask of T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21, as Resolved.
Thu, Aug 22, 8:58 AM · Patch-For-Review, serviceops, Prod-Kubernetes

Wed, Aug 21

JMeybohm closed T368714: kafka-main replacement nodes don't fit kafka-main (storage wise) as Resolved.
Wed, Aug 21, 3:17 PM · serviceops
JMeybohm added a project to T344171: Reverse DNS for k8s pods IPs: Traffic.

For the record: The CoreDNS Pods are, like all other Pods, reachable by their Pod IP. So going through a debug Pod to resolve the PTRs is not really necessary.
Regarding the IPs of Pods not listed in Endpoints: These could be added to CoreDNS via the https://github.com/coredns/kubepods plugin as well.

Wed, Aug 21, 8:56 AM · Traffic, serviceops, Prod-Kubernetes, Kubernetes

Fri, Aug 16

JMeybohm created T372648: sre.hosts.reimage failing due to mkfs.ext4 taking to long.
Fri, Aug 16, 2:31 PM · Infrastructure-Foundations, Spicerack, SRE-tools
JMeybohm closed T371422: Install (2) 960GB SSDs each in kafka-main10[06-10] as Resolved.

All of these have been reimaged with raid10-6dev now, thanks!

Fri, Aug 16, 10:23 AM · SRE, serviceops, DC-Ops, ops-eqiad
JMeybohm closed T371422: Install (2) 960GB SSDs each in kafka-main10[06-10], a subtask of T368714: kafka-main replacement nodes don't fit kafka-main (storage wise), as Resolved.
Fri, Aug 16, 10:23 AM · serviceops
JMeybohm added a comment to T371422: Install (2) 960GB SSDs each in kafka-main10[06-10].

From T371423#10068548 I did:

Fri, Aug 16, 7:59 AM · SRE, serviceops, DC-Ops, ops-eqiad
JMeybohm closed T371423: Install (2) 960GB SSDs each in kafka-main20[06-10] as Resolved.

All of these have been reimaged with raid10-6dev now, thanks!

Fri, Aug 16, 7:54 AM · SRE, serviceops, DC-Ops, ops-codfw
JMeybohm closed T371423: Install (2) 960GB SSDs each in kafka-main20[06-10], a subtask of T368714: kafka-main replacement nodes don't fit kafka-main (storage wise), as Resolved.
Fri, Aug 16, 7:53 AM · serviceops
JMeybohm added a comment to T372445: Requesting access to <LogStash server or localhost port 9200> for <ecarg>.

I can't recommend querying OpenSearch directly. Logs-api was made for services needing log integration (e.g. deploys), and not really for one-off queries. Browsing/filtering/aggregating the data is best accomplished using OpenSearch Dashboards found here: https://logstash.wikimedia.org

Fri, Aug 16, 7:48 AM · SRE, SRE-Access-Requests
JMeybohm added a comment to T371423: Install (2) 960GB SSDs each in kafka-main20[06-10].

They all failed because the installer tried to bring up some old mdadm arrays and failed to do so. Maybe they were on the new disks, or the disk numbering changed when the new disks were added; I don't know.
What worked was zapping the GPT/MBR from a Debian installer shell and re(PXE)booting into the installer:

dd if=/dev/zero of=/dev/sda bs=512 count=34
dd if=/dev/zero of=/dev/sda bs=512 count=34 seek=$((`blockdev --getsz /dev/sda` - 34))
dd if=/dev/zero of=/dev/sdb bs=512 count=34
dd if=/dev/zero of=/dev/sdb bs=512 count=34 seek=$((`blockdev --getsz /dev/sdb` - 34))
dd if=/dev/zero of=/dev/sdc bs=512 count=34
dd if=/dev/zero of=/dev/sdc bs=512 count=34 seek=$((`blockdev --getsz /dev/sdc` - 34))
dd if=/dev/zero of=/dev/sdd bs=512 count=34
dd if=/dev/zero of=/dev/sdd bs=512 count=34 seek=$((`blockdev --getsz /dev/sdd` - 34))
dd if=/dev/zero of=/dev/sde bs=512 count=34
dd if=/dev/zero of=/dev/sde bs=512 count=34 seek=$((`blockdev --getsz /dev/sde` - 34))
dd if=/dev/zero of=/dev/sdf bs=512 count=34
dd if=/dev/zero of=/dev/sdf bs=512 count=34 seek=$((`blockdev --getsz /dev/sdf` - 34))
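
The commands above repeat one pattern per disk: zero the first 34 sectors (protective MBR plus primary GPT) and the last 34 sectors (backup GPT). A minimal sketch of that pattern as a helper follows. It is not what was run on the hosts: the function name zap_labels is hypothetical, conv=notrunc is an addition so the helper can also be tried on an image file, and it assumes 512-byte sectors as in the original commands.

```shell
# Zero the first and last 34 sectors (512 bytes each) of a target.
# Usage: zap_labels TARGET SECTORS
#   TARGET  - block device (or an image file when experimenting)
#   SECTORS - size of TARGET in 512-byte sectors
zap_labels() {
    # conv=notrunc keeps a regular file from being truncated; it makes no
    # difference on a block device.
    dd if=/dev/zero of="$1" bs=512 count=34 conv=notrunc
    dd if=/dev/zero of="$1" bs=512 count=34 seek=$(($2 - 34)) conv=notrunc
}

# Hypothetical invocation for the six disks named in the comment
# (destructive, so it is left commented out):
# for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf; do
#     zap_labels "$dev" "$(blockdev --getsz "$dev")"
# done
```

Passing the sector count as an argument keeps the helper independent of blockdev, which only works on block devices.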
Fri, Aug 16, 7:34 AM · SRE, serviceops, DC-Ops, ops-codfw

Thu, Aug 15

JMeybohm added a comment to T371422: Install (2) 960GB SSDs each in kafka-main10[06-10].

I have allocated these drives and added the SSDs to the specified servers. Please test it out and let us know if there are any issues. Thank you!

Thu, Aug 15, 7:17 AM · SRE, serviceops, DC-Ops, ops-eqiad

Wed, Aug 14

JMeybohm added a comment to T371423: Install (2) 960GB SSDs each in kafka-main20[06-10].

They all failed because the installer tried to bring up some old mdadm arrays and failed to do so. Maybe they were on the new disks, or the disk numbering changed when the new disks were added; I don't know.
What worked was zapping the GPT/MBR from a Debian installer shell and re(PXE)booting into the installer:

dd if=/dev/zero of=/dev/sda bs=512 count=34
dd if=/dev/zero of=/dev/sda bs=512 count=34 seek=$((`blockdev --getsz /dev/sda` - 34))
dd if=/dev/zero of=/dev/sdb bs=512 count=34
dd if=/dev/zero of=/dev/sdb bs=512 count=34 seek=$((`blockdev --getsz /dev/sdb` - 34))
dd if=/dev/zero of=/dev/sdc bs=512 count=34
dd if=/dev/zero of=/dev/sdc bs=512 count=34 seek=$((`blockdev --getsz /dev/sdc` - 34))
dd if=/dev/zero of=/dev/sdd bs=512 count=34
dd if=/dev/zero of=/dev/sdd bs=512 count=34 seek=$((`blockdev --getsz /dev/sdd` - 34))
dd if=/dev/zero of=/dev/sde bs=512 count=34
dd if=/dev/zero of=/dev/sde bs=512 count=34 seek=$((`blockdev --getsz /dev/sde` - 34))
dd if=/dev/zero of=/dev/sdf bs=512 count=34
dd if=/dev/zero of=/dev/sdf bs=512 count=34 seek=$((`blockdev --getsz /dev/sdf` - 34))
Wed, Aug 14, 4:58 PM · SRE, serviceops, DC-Ops, ops-codfw
JMeybohm closed T372445: Requesting access to <LogStash server or localhost port 9200> for <ecarg> as Invalid.

The errors you get suggest that the command was not quoted properly, please double check. The examples linked in the wiki assume you have SSH tunneled port 9200 to localhost, so they use localhost:9200 everywhere. Replace that with https://logs-api.svc.eqiad.wmnet, check your quoting and you should be good.

Wed, Aug 14, 2:30 PM · SRE, SRE-Access-Requests

Tue, Aug 13

JMeybohm added a comment to T371423: Install (2) 960GB SSDs each in kafka-main20[06-10].

Oh, wait. 0.9GB - I totally misread. That is obviously not okay :D Tried rescanning the drives without luck. I would assume it's broken.

Tue, Aug 13, 2:36 PM · SRE, serviceops, DC-Ops, ops-codfw
JMeybohm updated subscribers of T371423: Install (2) 960GB SSDs each in kafka-main20[06-10].

I located the missing disk and reseated it. It's showing as having a size of 0.94 GB. Not sure if it's bad or needs to be reformatted. lmk and I'll see if I can find a replacement.

Tue, Aug 13, 2:27 PM · SRE, serviceops, DC-Ops, ops-codfw
JMeybohm added a comment to T371423: Install (2) 960GB SSDs each in kafka-main20[06-10].

@Jhancock.wm could you please check kafka-main2010 again? After trying to re-image I now only see 5 disks in iDRAC.

Tue, Aug 13, 12:39 PM · SRE, serviceops, DC-Ops, ops-codfw
JMeybohm changed the status of T363210: kafka-main200[6789] and kafka-main2010 implementation tracking, a subtask of T363209: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010, from Stalled to In Progress.
Tue, Aug 13, 9:17 AM · SRE, ops-codfw, serviceops, DC-Ops
JMeybohm changed the status of T363210: kafka-main200[6789] and kafka-main2010 implementation tracking, a subtask of T368714: kafka-main replacement nodes don't fit kafka-main (storage wise), from Stalled to In Progress.
Tue, Aug 13, 9:17 AM · serviceops
JMeybohm changed the status of T363210: kafka-main200[6789] and kafka-main2010 implementation tracking from Stalled to In Progress.
Tue, Aug 13, 9:17 AM · Patch-For-Review, serviceops
JMeybohm closed T371667: Remove deprecated cloudnative-pg charts from chart-museum as Resolved.

Yes, that's perfect.

Tue, Aug 13, 9:16 AM · serviceops, Kubernetes
JMeybohm added a comment to T371422: Install (2) 960GB SSDs each in kafka-main10[06-10].

Correct. Anything that is at least as big as the ~900G of the four currently installed SSDs will be fine. Thanks!

Tue, Aug 13, 8:35 AM · SRE, serviceops, DC-Ops, ops-eqiad

Mon, Aug 12

JMeybohm added a comment to T371667: Remove deprecated cloudnative-pg charts from chart-museum.

Just to be extra sure, you want the following to be removed:

  • stable/cloudnative-pg-operator-0.2.0.tgz
  • stable/cloudnative-pg-operator-crds-0.1.0.tgz
  • stable/cluster-0.1.0.tgz
  • stable/cluster-0.1.1.tgz
  • stable/cluster-0.1.2.tgz
  • stable/cluster-0.1.3.tgz
  • stable/cluster-0.1.4.tgz
  • stable/cluster-0.1.5.tgz
  • stable/cluster-0.1.6.tgz
Mon, Aug 12, 2:08 PM · serviceops, Kubernetes
JMeybohm added a comment to T371423: Install (2) 960GB SSDs each in kafka-main20[06-10].

One thing I've noticed is that kafka-main2010 seems to have a different disk than all the others (all others are 1.7T models):

Mon, Aug 12, 12:13 PM · SRE, serviceops, DC-Ops, ops-codfw
JMeybohm claimed T337928: cfssl-issuer: Generate Kubernetes Events.
Mon, Aug 12, 9:19 AM · Patch-For-Review, Prod-Kubernetes, serviceops, Kubernetes
JMeybohm added a comment to T372241: Better visibility for throttled pods.

Generally speaking, throttling is not an issue (as long as availability/latency targets are still met) but rather a measure against processes going rogue, so it's very common and even desirable to have containers throttled. As I understand the referenced task, throttling was not the issue in that case either. So as long as there is no actual problem, I would refrain from trying to "fix" all throttling or alert on it, as it's also not clearly actionable.

Mon, Aug 12, 7:51 AM · serviceops, Kubernetes
JMeybohm added a comment to T372242: Alert on unscrapable pods.

With how the Prometheus service discovery currently works (e.g. scraping every container port by default) we do have a large number of "okay to be down" targets, so an alert like this will fire quite often. It's also pretty common for pods to go away, which might produce a flurry of alerts as well.

Mon, Aug 12, 7:43 AM · SRE Observability (FY2024/2025-Q1), serviceops, Kubernetes

Thu, Aug 8

JMeybohm awarded T358489: mw2420-mw2451 do have unnecessary raid controllers (configured) a Love token.
Thu, Aug 8, 2:23 PM · Patch-For-Review, SRE, serviceops
JMeybohm added a comment to T365361: Establish a process to periodically upgrade the CFSSL infrastructure.

For T337928: cfssl-issuer: Generate Kubernetes Events I tried to vendor our v1.6.1, which fails badly as etcd seems to have renamed a bunch of their libraries at some point. To circumvent this I've imported upstream v1.6.5 via gbp into https://gitlab.wikimedia.org/repos/sre/cfssl/-/tree/upstream/1.6.5?ref_type=tags, rebased our patches onto that in https://gitlab.wikimedia.org/repos/sre/cfssl/-/tree/merge_1.6.5 and created a branch that includes the patches (https://gitlab.wikimedia.org/repos/sre/cfssl/-/tree/wmf-1.6.5?ref_type=heads) for cfssl-issuer to include

Thu, Aug 8, 2:10 PM · CFSSL-PKI, Infrastructure-Foundations

Wed, Aug 7

JMeybohm added a comment to T371422: Install (2) 960GB SSDs each in kafka-main10[06-10].

The nodes are not in service, so no need to schedule a maint-window from our side. Feel free to choose a time that suits you best.

Wed, Aug 7, 4:03 PM · SRE, serviceops, DC-Ops, ops-eqiad
JMeybohm added a comment to T371423: Install (2) 960GB SSDs each in kafka-main20[06-10].

The nodes are not in service, so no need to schedule a maint-window from our side. Feel free to choose a time that suits you best.

Wed, Aug 7, 4:03 PM · SRE, serviceops, DC-Ops, ops-codfw
JMeybohm added a subtask for T368714: kafka-main replacement nodes don't fit kafka-main (storage wise): T371422: Install (2) 960GB SSDs each in kafka-main10[06-10].
Wed, Aug 7, 4:01 PM · serviceops
JMeybohm added a parent task for T371422: Install (2) 960GB SSDs each in kafka-main10[06-10]: T368714: kafka-main replacement nodes don't fit kafka-main (storage wise).
Wed, Aug 7, 4:01 PM · SRE, serviceops, DC-Ops, ops-eqiad
JMeybohm added a subtask for T368714: kafka-main replacement nodes don't fit kafka-main (storage wise): T371423: Install (2) 960GB SSDs each in kafka-main20[06-10].
Wed, Aug 7, 4:00 PM · serviceops
JMeybohm added a parent task for T371423: Install (2) 960GB SSDs each in kafka-main20[06-10]: T368714: kafka-main replacement nodes don't fit kafka-main (storage wise).
Wed, Aug 7, 4:00 PM · SRE, serviceops, DC-Ops, ops-codfw
JMeybohm added a comment to T371885: Gaps in Grafana graphs using Thanos.

Regarding the throttling: Maybe it would help to set GOMAXPROCS to something sensible/related to the CPU limit (see https://wikitech.wikimedia.org/wiki/Kubernetes/Resource_requests_and_limits, https://github.com/uber-go/automaxprocs/tree/master)
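
As an illustration of relating GOMAXPROCS to the CPU limit, here is a minimal shell sketch. It rests on assumptions that are not from the comment: the container runs under cgroup v2 (where the limit is exposed as "<quota> <period>" or "max <period>" in cpu.max), the value is rounded down with a minimum of 1, and the function name gomaxprocs_from_cpu_max is hypothetical.

```shell
# Print a GOMAXPROCS value for a given cgroup v2 CPU quota and period.
# Usage: gomaxprocs_from_cpu_max QUOTA PERIOD
gomaxprocs_from_cpu_max() {
    if [ "$1" = "max" ]; then
        # No CPU limit set: fall back to the number of available CPUs.
        nproc
        return
    fi
    procs=$(($1 / $2))          # integer division rounds down
    if [ "$procs" -lt 1 ]; then
        procs=1                 # a limit below one CPU still needs one thread
    fi
    echo "$procs"
}

# Hypothetical use in a container entrypoint (cgroup v2 path assumed):
# read -r quota period < /sys/fs/cgroup/cpu.max
# export GOMAXPROCS="$(gomaxprocs_from_cpu_max "$quota" "$period")"
```

With a limit of 2.5 CPUs (quota 250000, period 100000) this yields 2, so the Go scheduler would not run more threads in parallel than the limit allows.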

Wed, Aug 7, 3:59 PM · SRE Observability (FY2024/2025-Q1), serviceops, MW-on-K8s, Grafana, Observability-Metrics

Jul 19 2024

JMeybohm added a comment to T368366: Upgrade K8s docker images running in Wikimedia production on Buster to either Bullseye or Bookworm.

I used this horrible bash script to get a breakdown of image versions deployed on a given cluster:

for ns in `kubectl get ns | cut -d " " -f 1 | grep -v NAME`; do echo -e "\nnamespace: $ns\n"; kubectl get pods -n $ns -o jsonpath="{.items[*].spec['initContainers', 'containers'][*].image}" |tr -s '[[:space:]]' '\n' |sort |uniq -c; done

This is still painful but keeping a note in here anyway :)
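
For readability, the same one-liner can be spread over a few lines. This is only a restructured sketch of the command above, not a different method: the function name image_breakdown is hypothetical, and listing namespaces with -o name is a substitution for the cut/grep pipeline.

```shell
# Print, per namespace, how many running containers (init and regular) use
# each image in the cluster of the current kubectl context.
image_breakdown() {
    # "kubectl get namespaces -o name" prints lines like "namespace/foo".
    for ns in $(kubectl get namespaces -o name | cut -d/ -f2); do
        printf '\nnamespace: %s\n\n' "$ns"
        kubectl get pods -n "$ns" \
            -o jsonpath="{.items[*].spec['initContainers','containers'][*].image}" \
            | tr -s '[:space:]' '\n' | sort | uniq -c
    done
}
```

Calling image_breakdown prints one block per namespace, each line holding a count followed by an image reference.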

Jul 19 2024, 2:58 PM · serviceops, Security, Infrastructure-Foundations
JMeybohm added a comment to T368251: Create ValidatingAdmissionPolicies to replace mediawiki PSP.

Implementation is (I think) completed. There is a README which hopefully explains what my intentions were and how the policies can be used to create something PSS-like but with exceptions.

Jul 19 2024, 2:39 PM · Patch-For-Review, MW-on-K8s, Kubernetes, serviceops, Prod-Kubernetes
JMeybohm added a comment to T369607: High cardinality metrics break queries/dashboards (envoy, istio, ...).

I have updated https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s which is now loading the namespace variable via kube_namespace_created and I also flipped the variable order from (app, namespace) to (namespace, app) which greatly improves loading time of all variables after namespace (and therefore the dashboard as a whole)

Jul 19 2024, 9:45 AM · Patch-For-Review, Kubernetes, Grafana, Observability-Metrics, serviceops

Jul 18 2024

JMeybohm added a comment to T369932: sextant: support module garbage collection.

I think this would be nice to make and keep things a bit more tidy 👍

Jul 18 2024, 2:57 PM · serviceops
JMeybohm added a comment to T368523: Migrate wikibase-termbox to node20.

I was hoping to only change how the Termbox connects to Wikidata, not the other direction (Termbox being called by Wikidata) – your comment sounds like it would affect both? (Maybe it’s not possible to separate them?) Would that mean changing the wmgWikibaseSSRTermboxServerUrl as well (and maybe a new hieradata/common/service.yaml and/or hieradata/common/profile/services_proxy/envoy.yaml entry in Puppet)?

Jul 18 2024, 11:24 AM · MW-1.43-notes (1.43.0-wmf.19; 2024-08-20), Patch-For-Review, Wikidata Dev Team (Wikidata.org Slice), Wikidata, wmde-wikidata-tech

Jul 16 2024

JMeybohm updated subscribers of T368251: Create ValidatingAdmissionPolicies to replace mediawiki PSP.
Jul 16 2024, 3:54 PM · Patch-For-Review, MW-on-K8s, Kubernetes, serviceops, Prod-Kubernetes
JMeybohm added a comment to T368251: Create ValidatingAdmissionPolicies to replace mediawiki PSP.

The policy parser tool has moved to https://gitlab.wikimedia.org/repos/sre/kyverno-policy-parser

Jul 16 2024, 3:54 PM · Patch-For-Review, MW-on-K8s, Kubernetes, serviceops, Prod-Kubernetes

Jul 12 2024

JMeybohm renamed T368251: Create ValidatingAdmissionPolicies to replace mediawiki PSP from Create ValidationAdmissionPolicies to replace mediawiki PSP to Create ValidatingAdmissionPolicies to replace mediawiki PSP.
Jul 12 2024, 9:10 AM · Patch-For-Review, MW-on-K8s, Kubernetes, serviceops, Prod-Kubernetes
JMeybohm added a comment to T368523: Migrate wikibase-termbox to node20.

Something like that, yes. Although you will need to use a mesh.public_port different from the one used for the "not test" release, as the nodePort would clash otherwise. See https://wikitech.wikimedia.org/wiki/Kubernetes/Service_ports

Jul 12 2024, 9:04 AM · MW-1.43-notes (1.43.0-wmf.19; 2024-08-20), Patch-For-Review, Wikidata Dev Team (Wikidata.org Slice), Wikidata, wmde-wikidata-tech
JMeybohm created T369884: Fix/remove deployment-charts update_version.py.
Jul 12 2024, 8:11 AM · Release-Engineering-Team (Priority Backlog 📥)

Jul 11 2024

JMeybohm added a comment to T368523: Migrate wikibase-termbox to node20.

I might be missing something obvious here (sorry for not mentioning earlier), but why does termbox test not use the service mesh to connect to mw-api-int-ro.discovery.wmnet like the other termbox releases do? That would leave the TLS part completely to envoy, removing the requirement to provide NODE_EXTRA_CA_CERTS etc.

Jul 11 2024, 8:45 AM · MW-1.43-notes (1.43.0-wmf.19; 2024-08-20), Patch-For-Review, Wikidata Dev Team (Wikidata.org Slice), Wikidata, wmde-wikidata-tech

Jul 10 2024

JMeybohm added a comment to T368523: Migrate wikibase-termbox to node20.

Aha, “unable to get local issuer certificate”:

lucaswerkmeister-wmde@deploy1002 /srv/deployment-charts/helmfile.d/services/termbox $ kube_env termbox staging
lucaswerkmeister-wmde@deploy1002 /srv/deployment-charts/helmfile.d/services/termbox $ kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
termbox-staging-c74b86488-hwm58   3/3     Running   0          11m
termbox-test-d78f78c6d-k5xnb      2/2     Running   0          11m
lucaswerkmeister-wmde@deploy1002 /srv/deployment-charts/helmfile.d/services/termbox $ kubectl logs termbox-test-d78f78c6d-k5xnb termbox-test --tail=1
{"name":"wikibase-termbox","hostname":"termbox-test-d78f78c6d-k5xnb","pid":21,"level":"ERROR","message":"unable to get local issuer certificate","request":{"headers":{"Accept":"application/json, text/plain, */*","User-Agent":"wikibase-termbox/0.1.0 (The Wikidata team) axios/^0.21.1","Host":"test.wikidata.org:4446"},"url":"api.php","params":{"action":"query","meta":"allmessages","ammessages":"wikibase-edit|wikibase-publish|wikibase-cancel|wikibase-aliases-separator|wikibase-entitytermsforlanguagelistview-more|wikibase-entitytermsforlanguagelistview-less|wikibase-entitytermsview-entitytermsforlanguagelistview-toggler|wikibase-label-empty|wikibase-description-empty|wikibase-label-edit-placeholder|wikibase-description-edit-placeholder|wikibase-alias-edit-placeholder|wikibase-anonymouseditwarning-heading|wikibase-anonymouseditwarning-message|wikibase-anonymouseditnotificationtempuser-message|wikibase-anonymouseditwarning-dismiss-button|wikibase-anonymouseditwarning-dismiss-persist|pt-login|pt-createaccount|wikibase-shortcopyrightwarning-accept-persist|wikibase-shortcopyrightwarning-heading|wikibase-entity-save-error-heading|wikibase-entity-save-error-message","amlang":"en","format":"json"}},"url":"/termbox?entity=Q235006&revision=676047&language=en&editLink=%2Fwiki%2FSpecial%3ASetLabelDescriptionAliases%2FQ235006&preferredLanguages=en%7Cmul","reqId":"34fc55cc-dbfc-4c12-bc43-b64ebe2e009f","levelPath":"error/service","msg":"unable to get local issuer certificate","time":"2024-07-10T12:28:17.803Z","v":0}

I have no idea if this could be related to the new envoy version (T368366) or not, though.

Jul 10 2024, 1:27 PM · MW-1.43-notes (1.43.0-wmf.19; 2024-08-20), Patch-For-Review, Wikidata Dev Team (Wikidata.org Slice), Wikidata, wmde-wikidata-tech
JMeybohm added a comment to T369011: hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet.

I've deleted the node from the k8s API as a required istio update would not finish successfully because it was waiting for the daemonset to be scheduled on the broken node. The node should auto-join the cluster (cordoned) when it comes back online.

Hi. I notice BGP for this host is still down on the switch side? If it's likely to continue let me know and I will set the Netbox flag for BGP to off and remove the config from the switch with Homer. Thanks.

Jul 10 2024, 7:27 AM · SRE, ops-eqiad, DC-Ops, Prod-Kubernetes, serviceops
JMeybohm added a comment to T341560: Migrate mwmaint server functionality to mw-on-k8s.

We'll eventually send it to logstash too, but that hasn't happened yet.

Jul 10 2024, 7:16 AM · serviceops, MW-on-K8s

Jul 9 2024

JMeybohm added a comment to T368892: Enabling the nestedMetadata functionality on Wikifunctions.org causes evaluations to often fail with "gateway timeout" or “service unavailable”.

Another datapoint to add: During the problematic time frame alerting did catch some of the orchestrator restarts (it just alerts on restarts that are caused by OOM kills): https://logstash.wikimedia.org/goto/aac4d9745b946d83bc486d3300221658
The memory increase is mostly not visible in grafana, so I would suspect it to have happened fast (we only scrape data every 30s). Maybe that helps with tracking down the root cause.

Jul 9 2024, 4:26 PM · Abstract Wikipedia team (25Q1 (Jul–Sep))
JMeybohm added a comment to T368892: Enabling the nestedMetadata functionality on Wikifunctions.org causes evaluations to often fail with "gateway timeout" or “service unavailable”.

Again, limited understanding on what I'm looking at here, but https://logstash.wikimedia.org/goto/49bb255fe9a7c07a37389ad8e377b362 seems to assemble a portion of the problematic requests. Unfortunately not all because a 503 does not seem to always result in the zObject containing "Service Unavailable".

Jul 9 2024, 10:07 AM · Abstract Wikipedia team (25Q1 (Jul–Sep))
JMeybohm merged T367143: Miscweb K8s dashboard loading issues into T369607: High cardinality metrics break queries/dashboards (envoy, istio, ...).
Jul 9 2024, 9:56 AM · Patch-For-Review, Kubernetes, Grafana, Observability-Metrics, serviceops
JMeybohm merged task T367143: Miscweb K8s dashboard loading issues into T369607: High cardinality metrics break queries/dashboards (envoy, istio, ...).
Jul 9 2024, 9:55 AM · SRE Observability
JMeybohm added a comment to T367143: Miscweb K8s dashboard loading issues.

I've updated the variables as of T369607: High cardinality metrics break queries/dashboards (envoy, istio, ...) - dashboard is now usable again.

Jul 9 2024, 9:55 AM · SRE Observability
JMeybohm created T369607: High cardinality metrics break queries/dashboards (envoy, istio, ...).
Jul 9 2024, 9:34 AM · Patch-For-Review, Kubernetes, Grafana, Observability-Metrics, serviceops
JMeybohm added a comment to T368892: Enabling the nestedMetadata functionality on Wikifunctions.org causes evaluations to often fail with "gateway timeout" or “service unavailable”.
Jul 9 2024, 8:30 AM · Abstract Wikipedia team (25Q1 (Jul–Sep))
JMeybohm added a comment to T368892: Enabling the nestedMetadata functionality on Wikifunctions.org causes evaluations to often fail with "gateway timeout" or “service unavailable”.

If I try running the function call above in the ApiSandbox, I get a "Service Unavailable."

I'm unable to reproduce that. But judging from the data I get back it seems to be a cached response (orchestrationCpuUsage\",\"K2\":\"688.86 ms is always the same for me)

Jul 9 2024, 8:11 AM · Abstract Wikipedia team (25Q1 (Jul–Sep))

Jul 8 2024

JMeybohm added a comment to T368892: Enabling the nestedMetadata functionality on Wikifunctions.org causes evaluations to often fail with "gateway timeout" or “service unavailable”.

If the issue is reproducible without calling any evaluator, could you check whether curl'ing the orchestrator directly (e.g. bypassing the wiki) also results in an error? By bypassing the wikis you can also rule out timeout issues, as you're then in control of the timeouts (mostly, as there is still one between the TLS terminator and the orchestrator).
I would also suggest trying to figure out when the issues started and try to roll back to the release version that was active before to rule out code changes that happened in between.

Jul 8 2024, 6:00 PM · Abstract Wikipedia team (25Q1 (Jul–Sep))
JMeybohm changed the status of T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21 from Open to Stalled.
Jul 8 2024, 10:23 AM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm updated the task description for T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21.
Jul 8 2024, 10:22 AM · Patch-For-Review, serviceops, Prod-Kubernetes
JMeybohm changed the status of T273507: PodSecurityPolicies will be deprecated with Kubernetes 1.21, a subtask of T341984: Update Kubernetes clusters to >1.25, from Open to Stalled.
Jul 8 2024, 10:21 AM · Data-Platform-SRE, Kubernetes, Prod-Kubernetes, serviceops
JMeybohm updated the task description for T369491: Migrate aux cluster off of Pod Security Policies.
Jul 8 2024, 10:12 AM · Patch-For-Review, User-Elukey, Infrastructure-Foundations, Kubernetes
JMeybohm created T369493: Migrate ml-staging/ml-serve clusters off of Pod Security Policies.
Jul 8 2024, 10:12 AM · Machine-Learning-Team, Kubernetes
JMeybohm created T369492: Migrate dse cluster off of Pod Security Policies.
Jul 8 2024, 10:12 AM · Patch-For-Review, Data-Platform-SRE (2024.08.17 - 2024.09.06), Data-Engineering, Kubernetes
JMeybohm created T369491: Migrate aux cluster off of Pod Security Policies.
Jul 8 2024, 10:09 AM · Patch-For-Review, User-Elukey, Infrastructure-Foundations, Kubernetes

Jul 4 2024

JMeybohm updated subscribers of T368892: Enabling the nestedMetadata functionality on Wikifunctions.org causes evaluations to often fail with "gateway timeout" or “service unavailable”.

@cmassaro thanks for the ping. Unfortunately I have trouble finding the right data to analyze this further. The orchestrator does call the evaluator(s) directly (not via the service mesh) and does not produce metrics regarding those requests like latency, error rates etc. (at least I'm unable to find those). Looking at the logs of the components, I can't see failed requests either.

Jul 4 2024, 4:16 PM · Abstract Wikipedia team (25Q1 (Jul–Sep))
JMeybohm added a project to T368972: Use Envoy instead of LVS to route internal federation traffic for WDQS: serviceops.
Jul 4 2024, 2:52 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), wmde-wikidata-tech, serviceops, Wikidata-Query-Service, Wikidata

Jul 3 2024

JMeybohm added a comment to T365994: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad.

!log jayme@cumin1002 conftool action : set/pooled=no; selector: name=(wikikube-worker1007.eqiad.wmnet|wikikube-worker1021.eqiad.wmnet|kubernetes1060.eqiad.wmnet)

Jul 3 2024, 2:33 PM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
JMeybohm added a comment to T365994: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad.

!log jayme@cumin1002 conftool action : set/pooled=no; selector: name=(wikikube-worker1007.eqiad.wmnet|wikikube-worker1021.eqiad.wmnet|kubernetes1060.eqiad.wmnet)

Jul 3 2024, 2:00 PM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
JMeybohm updated the task description for T365994: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad.
Jul 3 2024, 1:42 PM · SRE-swift-storage, DBA, Data-Persistence, Infrastructure-Foundations, netops, SRE
JMeybohm added a comment to T369011: hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet.

I've deleted the node from the k8s API as a required istio update would not finish successfully because it was waiting for the daemonset to be scheduled on the broken node. The node should auto-join the cluster (cordoned) when it comes back online.

Jul 3 2024, 9:04 AM · SRE, ops-eqiad, DC-Ops, Prod-Kubernetes, serviceops