Clement_Goubert (claime)
Senior SRE

User Details

User Since
Jul 26 2022, 2:11 PM (103 w, 3 d)
Availability
Available
IRC Nick
claime
LDAP User
Clément Goubert
MediaWiki User
CGoubert-WMF

Recent Activity

Today

Clement_Goubert added a subtask for T356241: Move video transcoding to use Shellbox: T370527: Remove mediawiki-installation dsh group check.
Fri, Jul 19, 3:24 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
Clement_Goubert added a parent task for T370527: Remove mediawiki-installation dsh group check: T356241: Move video transcoding to use Shellbox.
Fri, Jul 19, 3:24 PM · MW-on-K8s, SRE Observability (FY2024/2025-Q1), Observability-Alerting
Clement_Goubert added a comment to T370527: Remove mediawiki-installation dsh group check.

It is still relevant for videoscalers, but we'll remove it once we're rid of them

Fri, Jul 19, 3:24 PM · MW-on-K8s, SRE Observability (FY2024/2025-Q1), Observability-Alerting
Clement_Goubert added a comment to T358489: mw2420-mw2451 do have unnecessary raid controllers (configured).

I *tried* very hard to automate it with a cookbook, but the behavior is wildly inconsistent between runs, sometimes requiring a reboot for deleting vdisks, sometimes not, sometimes starting to run the vdisk deletion only to fail without an explanation.

Fri, Jul 19, 1:38 PM · Patch-For-Review, SRE, serviceops

Wed, Jul 17

Clement_Goubert added a comment to T370258: Degraded RAID on mw2432.

Please do :) I'll leave the task open so it doesn't open a new one when I inevitably break it again. For the record, any raid issue you'll get for mw2432, mw2433, mw2438, mw2439 until T358489: mw2420-mw2451 do have unnecessary raid controllers (configured) is resolved is probably me. I will try and find out why downtiming the host didn't stop the autocreation of the ticket, but if I don't I'll watch for them being auto-opened and comment on them asap.

Wed, Jul 17, 4:46 PM · SRE, ops-codfw, DC-Ops
Clement_Goubert created P66753 (An Untitled Masterwork).
Wed, Jul 17, 4:10 PM
Clement_Goubert added a comment to T370258: Degraded RAID on mw2432.

Hi @Jhancock.wm very sorry for the noise, this is me trying to automate turning the RAID controller to HBA mode, there are no actual issues with the disk. I didn't know it would create a ticket for you automatically.

Wed, Jul 17, 2:33 PM · SRE, ops-codfw, DC-Ops
Clement_Goubert closed T370091: LDAP access to the analytics-privatedata-users group for Quiddity as Resolved.

I have merged the access change, puppet will deploy it in the next half-hour or so. I'm resolving this task, feel free to reopen should you encounter any issue.

Wed, Jul 17, 10:56 AM · SRE, Data-Engineering, LDAP-Access-Requests, SRE-Access-Requests
Clement_Goubert awarded T346690: mw-mcrouter daemonset on mw-on-k8s a Love token.
Wed, Jul 17, 10:09 AM · Patch-For-Review, MediaWiki-Platform-Team (Radar), serviceops, MW-on-K8s
Clement_Goubert added a comment to T365074: Requesting access to cassandra-staging-devs for milimetric.

@Milimetric Can you sign L3 so we can move forward with this?

Wed, Jul 17, 9:41 AM · SRE, SRE-Access-Requests
Clement_Goubert updated the task description for T370091: LDAP access to the analytics-privatedata-users group for Quiddity.
Wed, Jul 17, 9:41 AM · SRE, Data-Engineering, LDAP-Access-Requests, SRE-Access-Requests
Clement_Goubert claimed T370091: LDAP access to the analytics-privatedata-users group for Quiddity.
Wed, Jul 17, 9:40 AM · SRE, Data-Engineering, LDAP-Access-Requests, SRE-Access-Requests

Tue, Jul 16

Clement_Goubert changed the status of T369314: Requesting access to stewards-users for JJMC89 from Open to In Progress.
Tue, Jul 16, 10:23 AM · SRE, SRE-Access-Requests
Clement_Goubert assigned T370091: LDAP access to the analytics-privatedata-users group for Quiddity to KStineRowe_WMF.

Can you please read and sign the L3 document, as well as read the Data Access User Responsibilities document.

Tue, Jul 16, 10:16 AM · SRE, Data-Engineering, LDAP-Access-Requests, SRE-Access-Requests
Clement_Goubert changed the status of T370091: LDAP access to the analytics-privatedata-users group for Quiddity from Open to In Progress.
Tue, Jul 16, 10:07 AM · SRE, Data-Engineering, LDAP-Access-Requests, SRE-Access-Requests

Mon, Jul 15

Clement_Goubert assigned T369314: Requesting access to stewards-users for JJMC89 to KFrancis.

@KFrancis can you please confirm NDA status?

Mon, Jul 15, 11:46 AM · SRE, SRE-Access-Requests
Clement_Goubert closed T368566: Grant Access to analytics-privatedata-users for Sharvaniharan as Resolved.

Resolving this as it seems everything is in order. Don't hesitate to reopen should you encounter any issues.

Mon, Jul 15, 9:09 AM · SRE-Access-Requests, SRE
Clement_Goubert closed T369517: Requesting Kerberos access for xiaoxiao as Resolved.

@XiaoXiao-WMF You should have received an email with instructions on how to set your kerberos password. Please reopen this task should you encounter any issues.

Mon, Jul 15, 9:05 AM · Patch-For-Review, SRE, SRE-Access-Requests, Data-Engineering
Clement_Goubert added a project to T370018: gitlab2002: wrong network for public IPV4 and IPV6: SRE.
Mon, Jul 15, 8:49 AM · SRE, collaboration-services
Clement_Goubert edited projects for T370018: gitlab2002: wrong network for public IPV4 and IPV6, added: collaboration-services; removed serviceops.
Mon, Jul 15, 8:49 AM · SRE, collaboration-services
Clement_Goubert changed the status of T369517: Requesting Kerberos access for xiaoxiao from Open to In Progress.
Mon, Jul 15, 8:46 AM · Patch-For-Review, SRE, SRE-Access-Requests, Data-Engineering

Thu, Jul 11

Clement_Goubert updated subscribers of T369676: `mwscript-k8s --attach` seems to terminate IO after a few seconds without input.

Tagging @RLazarus as I haven't really kept up with mwscript development

Thu, Jul 11, 9:57 AM · MW-on-K8s

Wed, Jul 10

Clement_Goubert closed T354855: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy as Resolved.

We've only had one spike of job enqueuing errors since merging Restart=on-failure, I think I'll call this resolved for now, and reopen if we see problems again.

Wed, Jul 10, 2:43 PM · Infrastructure-Foundations, serviceops, SRE
Clement_Goubert moved T354855: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy from Incoming 🐫 to Doing 😎 on the serviceops board.
Wed, Jul 10, 2:39 PM · Infrastructure-Foundations, serviceops, SRE
Clement_Goubert moved T354791: Reclaim jobrunner hardware for k8s from Incoming 🐫 to this.quarter 🍕 on the serviceops board.
Wed, Jul 10, 2:37 PM · Patch-For-Review, SRE, serviceops, MW-on-K8s
Clement_Goubert moved T358489: mw2420-mw2451 do have unnecessary raid controllers (configured) from Incoming 🐫 to this.quarter 🍕 on the serviceops board.
Wed, Jul 10, 2:37 PM · Patch-For-Review, SRE, serviceops
Clement_Goubert moved T369119: sextant update should support a minimal change mode from Incoming 🐫 to Doing 😎 on the serviceops board.
Wed, Jul 10, 2:35 PM · Patch-For-Review, serviceops, Kubernetes
Clement_Goubert closed T291918: Re-think how we separate traffic to mediawiki in clusters. as Resolved.

This seems pretty well settled now with:

  • mw-api-ext for external api calls
  • mw-api-int for internal api calls
  • mw-web for external live users
  • mw-parsoid for parsoid
  • mw-jobrunner for non-video jobs
  • mw-videoscaler for videoscaling jobs
  • mw-debug for testing
  • mw-script for maintenance scripts and periodic jobs
  • mw-misc for noc.wikimedia.org
  • mw-wikifunctions dedicated for Abstract Wikipedia
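
The split above can be written down as a small lookup table. This is only an illustrative sketch: the deployment names come from the list, while the traffic-class keys and the `deployment_for` helper are hypothetical labels and not part of any Wikimedia tooling.

```python
# Deployment names are taken from the list above.
# The dictionary keys are hypothetical labels chosen for this example.
DEPLOYMENTS = {
    "external-api": "mw-api-ext",
    "internal-api": "mw-api-int",
    "web": "mw-web",
    "parsoid": "mw-parsoid",
    "jobs": "mw-jobrunner",
    "videoscaling": "mw-videoscaler",
    "debug": "mw-debug",
    "maintenance": "mw-script",
    "noc": "mw-misc",
    "wikifunctions": "mw-wikifunctions",
}


def deployment_for(traffic_class: str) -> str:
    """Return the deployment serving a traffic class, or raise KeyError."""
    return DEPLOYMENTS[traffic_class]
```
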
Wed, Jul 10, 2:34 PM · MW-on-K8s, SRE, serviceops
Clement_Goubert moved T356293: Migrate MW appservers' base images to bullseye from Backlog to Blocked on the MW-on-K8s board.
Wed, Jul 10, 2:22 PM · MW-on-K8s, Patch-For-Review, serviceops, SRE
Clement_Goubert renamed T369011: hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet from kubernetes1051.eqiad.wmnet failed to pull mediawiki images to hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet.
Wed, Jul 10, 2:05 PM · SRE, ops-eqiad, DC-Ops, Prod-Kubernetes, serviceops
Clement_Goubert added a comment to T369676: `mwscript-k8s --attach` seems to terminate IO after a few seconds without input.

I'd recommend using the dedicated REPL for this

Wed, Jul 10, 10:30 AM · MW-on-K8s
Clement_Goubert added a comment to T361935: Adapt the WDQS Streaming Updater to update multiple WDQS subgraphs.

[...]

The preferred alternative would presumably be mw-api-int-ro.discovery.wmnet:4446, but I'd like to confirm with @Clement_Goubert as well.

Wed, Jul 10, 9:40 AM · Discovery-Search (Current work), Wikidata

Mon, Jul 8

Clement_Goubert updated the task description for T290536: Serve production traffic via Kubernetes.
Mon, Jul 8, 11:41 AM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert added a comment to T290536: Serve production traffic via Kubernetes.

T355292: Port videoscaling to kubernetes should probably be a subtask of this (or maybe a subtask of T321899)? At least I’ve been told that videoscalers are blockers for the k8s migration being considered complete, and T355292 seems to be the currently active task in that area.

Mon, Jul 8, 11:37 AM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert added a parent task for T355292: Port videoscaling to kubernetes: T321786: Deploy mediawiki kubernetes services.
Mon, Jul 8, 11:36 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
Clement_Goubert added a subtask for T321786: Deploy mediawiki kubernetes services: T355292: Port videoscaling to kubernetes.
Mon, Jul 8, 11:35 AM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert added a subtask for T354532: Limit the concurrency of envoy in service mesh: T344814: mw-on-k8s tls-proxy container CPU throttling at low average load.
Mon, Jul 8, 11:34 AM · Kubernetes, Prod-Kubernetes, serviceops
Clement_Goubert added a parent task for T344814: mw-on-k8s tls-proxy container CPU throttling at low average load: T354532: Limit the concurrency of envoy in service mesh.
Mon, Jul 8, 11:34 AM · serviceops, MW-on-K8s
Clement_Goubert removed a parent task for T354532: Limit the concurrency of envoy in service mesh: T344814: mw-on-k8s tls-proxy container CPU throttling at low average load.
Mon, Jul 8, 11:34 AM · Kubernetes, Prod-Kubernetes, serviceops
Clement_Goubert removed a subtask for T344814: mw-on-k8s tls-proxy container CPU throttling at low average load: T354532: Limit the concurrency of envoy in service mesh.
Mon, Jul 8, 11:34 AM · serviceops, MW-on-K8s
Clement_Goubert closed T333120: Migrate internal traffic to k8s as Resolved.

All internal traffic has been migrated.

Mon, Jul 8, 11:34 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert closed T333269: Benchmark baremetal vs k8s mediawiki perf (2023) as Invalid.

All traffic has been migrated to MW-on-K8s

Mon, Jul 8, 11:33 AM · MediaWiki-Platform-Team, serviceops
Clement_Goubert closed T333269: Benchmark baremetal vs k8s mediawiki perf (2023), a subtask of T290536: Serve production traffic via Kubernetes, as Invalid.
Mon, Jul 8, 11:33 AM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert closed T282148: Support Canary releases on Kubernetes as Resolved.

We can get prom metrics using the release label. Boldly closing.

Mon, Jul 8, 11:33 AM · serviceops
Clement_Goubert closed T333120: Migrate internal traffic to k8s, a subtask of T290536: Serve production traffic via Kubernetes, as Resolved.
Mon, Jul 8, 11:33 AM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert closed T276487: Progressive rollout of MediaWiki deployment on Kubernetes as Resolved.

This functionality has been added to scap.

Mon, Jul 8, 11:33 AM · Release-Engineering-Team (Priority Backlog 📥), serviceops, MW-on-K8s
Clement_Goubert closed T282148: Support Canary releases on Kubernetes , a subtask of T210143: Canaries canaries canaries, as Resolved.
Mon, Jul 8, 11:32 AM · User-brennen, User-WDoran, serviceops, SRE
Clement_Goubert changed the status of T290536: Serve production traffic via Kubernetes from Open to In Progress.
Mon, Jul 8, 11:29 AM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert changed the status of T290536: Serve production traffic via Kubernetes, a subtask of T319432: Migrate WMF production from PHP 7.4 to PHP 8.1, from Open to In Progress.
Mon, Jul 8, 11:26 AM · Dumps-Generation, MediaWiki-Platform-Team, serviceops
Clement_Goubert changed the status of T290536: Serve production traffic via Kubernetes, a subtask of T356293: Migrate MW appservers' base images to bullseye, from Open to In Progress.
Mon, Jul 8, 11:26 AM · MW-on-K8s, Patch-For-Review, serviceops, SRE
Clement_Goubert closed T362323: Move 100% of external traffic to Kubernetes as Resolved.

The work this task tracked is now completed. Remaining migrations (T352650: Migrate current-generation dumps to run from our containerized images, T355292: Port videoscaling to kubernetes) and cleanup work (T367949: Spin down api_appserver and appserver clusters) will indeed be tracked through T290536.

Mon, Jul 8, 11:21 AM · Patch-For-Review, MoveComms-Support, SRE, Traffic, serviceops, MW-on-K8s
TheresNoTime awarded T362323: Move 100% of external traffic to Kubernetes a Fox token.
Mon, Jul 8, 11:20 AM · Patch-For-Review, MoveComms-Support, SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert closed T362323: Move 100% of external traffic to Kubernetes, a subtask of T290536: Serve production traffic via Kubernetes, as Resolved.
Mon, Jul 8, 11:18 AM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert updated the task description for T362323: Move 100% of external traffic to Kubernetes.
Mon, Jul 8, 11:16 AM · Patch-For-Review, MoveComms-Support, SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert moved T366752: Dumps 2.0 Phase III: Production level dumps from Incoming to Backlog on the Dumps 2.0 board.
Mon, Jul 8, 11:11 AM · Dumps 2.0, Epic
Clement_Goubert removed a subtask for T362323: Move 100% of external traffic to Kubernetes: T355292: Port videoscaling to kubernetes.
Mon, Jul 8, 11:09 AM · Patch-For-Review, MoveComms-Support, SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert removed a parent task for T355292: Port videoscaling to kubernetes: T362323: Move 100% of external traffic to Kubernetes.
Mon, Jul 8, 11:09 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
Clement_Goubert merged T321899: Create mw-videoscaler helmfile deployment into T355292: Port videoscaling to kubernetes.
Mon, Jul 8, 11:00 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops
Clement_Goubert merged task T321899: Create mw-videoscaler helmfile deployment into T355292: Port videoscaling to kubernetes.
Mon, Jul 8, 10:58 AM · Release-Engineering-Team (Seen), serviceops, MW-on-K8s
Clement_Goubert added a subtask for T290536: Serve production traffic via Kubernetes: T355292: Port videoscaling to kubernetes.
Mon, Jul 8, 10:56 AM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert added a subtask for T362323: Move 100% of external traffic to Kubernetes: T355292: Port videoscaling to kubernetes.
Mon, Jul 8, 10:56 AM · Patch-For-Review, MoveComms-Support, SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert added parent tasks for T355292: Port videoscaling to kubernetes: T290536: Serve production traffic via Kubernetes, T362323: Move 100% of external traffic to Kubernetes.
Mon, Jul 8, 10:56 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops

Thu, Jul 4

Clement_Goubert updated the task description for T367949: Spin down api_appserver and appserver clusters.
Thu, Jul 4, 2:31 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert updated the task description for T367949: Spin down api_appserver and appserver clusters.
Thu, Jul 4, 2:10 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s

Wed, Jul 3

Clement_Goubert changed the status of T367949: Spin down api_appserver and appserver clusters from Open to In Progress.
Wed, Jul 3, 2:48 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert added a comment to T351074: Move servers from the appserver/api cluster to kubernetes.

Only hosts left are:

Wed, Jul 3, 2:45 PM · serviceops, MW-on-K8s
Clement_Goubert changed the status of T367949: Spin down api_appserver and appserver clusters, a subtask of T290536: Serve production traffic via Kubernetes, from Open to In Progress.
Wed, Jul 3, 2:44 PM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert lowered the priority of T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s from High to Low.

Lowering priority now that the production deployments of mw-on-k8s have been done.

Wed, Jul 3, 11:29 AM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, MW-on-K8s, serviceops, Observability-Metrics
Clement_Goubert updated subscribers of T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s.

Tagging in @RLazarus for mw-script, I don't know how you want to handle it given the transient nature of the environment.

Wed, Jul 3, 11:28 AM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, MW-on-K8s, serviceops, Observability-Metrics
Clement_Goubert added a comment to T366819: Enable PCS to send resource change events to handle URL purges.

From inside the pod, nodejs DNS lookup returns the ipv6 for staging.svc.eqiad.wmnet

runuser@mobileapps-staging-6794bbd5c6-cp6gd:/srv/service$ nodejs 
Welcome to Node.js v18.19.0.
Type ".help" for more information.
> const dns = require('node:dns')
undefined
> dns.lookup('staging.svc.eqiad.wmnet', (err, address, family) => {
...   console.log('address: %j family: IPv%s', address, family);
... });
GetAddrInfoReqWrap {
  callback: [Function (anonymous)],
  family: 0,
  hostname: 'staging.svc.eqiad.wmnet',
  oncomplete: [Function: onlookup]
}
> address: "2620:0:861:102:10:64:16:55" family: IPv6
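
For comparison, a rough Python equivalent of the lookup above, using only the standard `socket` and `ipaddress` modules. `family_of` and `first_address` are hypothetical helper names, and the result of `first_address` depends on the resolver configuration of the host it runs on, so treat this as a sketch rather than a reproduction of the Node.js behaviour.

```python
import ipaddress
import socket


def family_of(address: str) -> str:
    """Classify a literal IP address as 'IPv4' or 'IPv6'."""
    return f"IPv{ipaddress.ip_address(address).version}"


def first_address(hostname: str) -> tuple[str, str]:
    """Resolve hostname and return (address, family) for the first result.

    Results are taken in the order the system resolver returns them.
    """
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    address = infos[0][4][0]  # sockaddr tuple, first element is the address
    return address, family_of(address)
```

Calling `first_address('staging.svc.eqiad.wmnet')` from inside the same pod would show whether Python's resolver path also prefers the IPv6 record; the hostname only resolves from within the network.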
Wed, Jul 3, 9:32 AM · Patch-For-Review, RESTBase Sunsetting, Content-Transform-Team-WIP, Wikipedia-iOS-App-Backlog, RESTBase, serviceops
Clement_Goubert added a comment to T366819: Enable PCS to send resource change events to handle URL purges.
root@kubestage1003:/home/cgoubert# curl https://staging.svc.eqiad.wmnet:4492/v1/events -H "Content-Type: application/json" -d '@data.txt' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1359  100   549  100   810   4773   7043 --:--:-- --:--:-- --:--:-- 11817
{
  "invalid": [],
  "error": [
    {
      "status": "error",
      "event": {
        "$schema": "/resource_change/1.0.0",
        "meta": {
          "request_id": "311902cc-0319-41b0-9fc6-9ca1dcb47dd7",
          "id": "68a24380-37a9-11ef-b9d3-87400a6a8e37",
          "dt": "2024-07-01T12:57:01.624Z",
          "domain": "en.wikipedia.org",
          "uri": "en.wikipedia.org/api/rest_v1/page/mobile-html/Dog",
          "stream": "resource_purge"
        },
        "tags": [
          "pcs"
        ]
      },
      "context": {
        "message": "event 68a24380-37a9-11ef-b9d3-87400a6a8e37 of schema at /resource_change/1.0.0 destined to stream resource_purge is not allowed in stream; resource_purge is not configured."
      }
    }
  ]
}
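
A minimal sketch of pulling the failure reasons out of a response shaped like the one above. It assumes only the fields visible in that output (`error`, `event.meta.id`, `event.meta.stream`, `context.message`); `summarize_errors` is a hypothetical helper, not part of the service that produced the response.

```python
import json


def summarize_errors(response_text: str) -> list[tuple[str, str, str]]:
    """Return (event id, stream, message) for each entry under "error"."""
    body = json.loads(response_text)
    summary = []
    for entry in body.get("error", []):
        meta = entry.get("event", {}).get("meta", {})
        message = entry.get("context", {}).get("message", "")
        summary.append((meta.get("id", ""), meta.get("stream", ""), message))
    return summary


# Copy of the response above, reduced to the fields this helper reads.
SAMPLE = """
{"invalid": [], "error": [{"status": "error",
  "event": {"meta": {"id": "68a24380-37a9-11ef-b9d3-87400a6a8e37",
                     "stream": "resource_purge"}},
  "context": {"message": "event 68a24380-37a9-11ef-b9d3-87400a6a8e37 of schema at /resource_change/1.0.0 destined to stream resource_purge is not allowed in stream; resource_purge is not configured."}}]}
"""

for event_id, stream, message in summarize_errors(SAMPLE):
    print(f"{stream} {event_id}: {message}")
```
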
Wed, Jul 3, 9:26 AM · Patch-For-Review, RESTBase Sunsetting, Content-Transform-Team-WIP, Wikipedia-iOS-App-Backlog, RESTBase, serviceops

Tue, Jul 2

Clement_Goubert added a comment to T369011: hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet.

Host is flapping, setting downtime until tomorrow

Tue, Jul 2, 3:44 PM · SRE, ops-eqiad, DC-Ops, Prod-Kubernetes, serviceops
Clement_Goubert added a comment to T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s.

Done for mw-misc and mw-wikifunctions.

Tue, Jul 2, 2:58 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, MW-on-K8s, serviceops, Observability-Metrics
Clement_Goubert added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

The service is up and running in staging, and can be reached at https://commons-impact-analytics.k8s-staging.discovery.wmnet:30443 internally.

Thanks @Scott_French!
Where can I reach this from? Should I ssh into a specific machine?

Tue, Jul 2, 10:54 AM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE

Mon, Jul 1

Clement_Goubert added a comment to T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s.

All main deployments of mw-on-k8s now send data to statsd-exporter.
Remaining are mw-misc, mw-wikifunctions, and the yet to be used mw-script and mw-videoscaler which I'll do this week, but won't be sending much data.

Mon, Jul 1, 2:59 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, MW-on-K8s, serviceops, Observability-Metrics

Fri, Jun 28

Clement_Goubert updated the task description for T368639: Relabel eqiad kubernetes nodes.
Fri, Jun 28, 3:11 PM · SRE, ops-eqiad, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops

Thu, Jun 27

Clement_Goubert created T368639: Relabel eqiad kubernetes nodes.
Thu, Jun 27, 4:35 PM · SRE, ops-eqiad, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert reopened T354855: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy as "Open".

This issue is biting us again, the time between a puppet run that fails to start ferm.service and the subsequent run is enough to cause issues with eventgate, leading to flurries of Can't enqueue job errors in mediawiki. It is in particular triggered when we add new kubernetes nodes, because that causes all other kubernetes nodes in the cluster to update their ferm rules.

Thu, Jun 27, 4:34 PM · Infrastructure-Foundations, serviceops, SRE
Clement_Goubert closed T283861: prometheus-apache-exporter in buster does not support -log.format json as Resolved.

New version has been deployed.

Thu, Jun 27, 12:27 PM · Patch-For-Review, serviceops, MW-on-K8s
Clement_Goubert updated subscribers of T332016: Migrate docker registry hosts to bookworm.

If we jump right to bookworm, we need to copy the python3-docker-report package to bookworm.

Thu, Jun 27, 9:36 AM · serviceops

Wed, Jun 26

Clement_Goubert added a comment to T341441: Pushing mediawiki-multiversion Docker image from deploy server takes 4 minutes.

It's possible it's to do with docker using single-threaded gzip for compression on push https://github.com/moby/moby/issues/41987

Wed, Jun 26, 4:15 PM · Release-Engineering-Team, serviceops, Scap, MW-on-K8s
Clement_Goubert updated the task description for T362323: Move 100% of external traffic to Kubernetes.
Wed, Jun 26, 2:53 PM · Patch-For-Review, MoveComms-Support, SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert added a comment to T332016: Migrate docker registry hosts to bookworm.
  • Necessary packages docker-registry and python3-docker-report are available for bullseye in the right versions
  • Summarizing from irc, the real risk is the nginx config. Tests would need to be run for:
    • publish a non-restricted image from build2001
    • publish a restricted image from deploy1002 (we test both authenticated and unauthenticated POST to a nonexistent upload path in httpbb, but a real build + push test would be better)
    • pull a non-restricted image without credentials from the public interface (already in httpbb)
    • pull a restricted image without credentials and see it fail (already in httpbb)
    • check the pipeline on gitlab still works
Wed, Jun 26, 2:28 PM · serviceops
Clement_Goubert added a comment to T368405: Special:Homepage is rendered much slower (<1 sec to 2+ sec).

Hi, I just checked on a bare-metal debug server (mwdebug1001), and it takes 1.9s, so I doubt it's k8s-related

Wed, Jun 26, 8:54 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Growth-Team (FY2024-25 Q1 Sprint 1), MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Data Products, User-Michael, Data-Platform, Performance Issue, GrowthExperiments-Homepage

Tue, Jun 25

Clement_Goubert edited projects for T341441: Pushing mediawiki-multiversion Docker image from deploy server takes 4 minutes, added: Scap, serviceops, Release-Engineering-Team; removed Deployments.

Looks like something between the deployment server and the registry:

[Figure: registry rx graph for the last long build-and-push]

[Figure: deploy tx graph for the same period]

Tue, Jun 25, 3:17 PM · Release-Engineering-Team, serviceops, Scap, MW-on-K8s
Clement_Goubert added a comment to T365655: mw-api-ext unavailability 2024-05-22 18:30 UTC .

mw-on-k8s pods should now rate limit the logs sent to mwlog at 100 messages per second per pod.
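
To illustrate what a limit of 100 messages per second means in practice, here is a minimal fixed-window limiter. It sketches the general technique only: the comment above does not say how the limit is implemented in the pods, and `FixedWindowLimiter` is a hypothetical name.

```python
class FixedWindowLimiter:
    """Allow at most `limit` messages in each one-second window."""

    def __init__(self, limit: int = 100):
        self.limit = limit
        self.window = None  # integer second of the current window
        self.count = 0      # messages already allowed in that window

    def allow(self, now: float) -> bool:
        """Return True if a message arriving at time `now` (seconds) may be sent."""
        window = int(now)
        if window != self.window:
            # A new second has started: reset the counter.
            self.window = window
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

With this scheme a burst of 150 messages inside the same second lets the first 100 through and drops the remaining 50; the counter resets when the next second begins.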

Tue, Jun 25, 12:55 PM · serviceops

Mon, Jun 24

Clement_Goubert added a comment to T368238: Wikifeeds' tls proxy cpu usage heavily increased in April.

I also tried to run perf to catch what is causing the CPU usage, but, probably due to the lack of symbols, it is not straightforward to see what's happening. The average number of threads run by envoy seems to be on the order of ~100, which could also be the cause of the throttling (namely too many threads trying to run at the same time, starving for CPU).

The other service using the restbase-for-services discovery config is mobileapps, but I don't see the same impact on the tlsproxy

Mon, Jun 24, 2:54 PM · Wikifeeds, serviceops
Clement_Goubert closed T368058: Set all appservers to pooled=inactive in scap, a subtask of T367949: Spin down api_appserver and appserver clusters, as Resolved.
Mon, Jun 24, 12:24 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert closed T368058: Set all appservers to pooled=inactive in scap as Resolved.
Mon, Jun 24, 12:24 PM · Release-Engineering-Team (Seen), SRE, serviceops, MW-on-K8s
Clement_Goubert added a comment to T368058: Set all appservers to pooled=inactive in scap.

Remaining pooled servers:

cgoubert@cumin1002:~$ sudo confctl select 'cluster=(api_appserver|appserver)' get | grep yes
{"mw2276.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2299.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
{"mw1364.eqiad.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=eqiad,cluster=appserver,service=nginx"}
{"mw1398.eqiad.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=eqiad,cluster=api_appserver,service=nginx"}
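
The filtered output above is one JSON object per line, so the still-pooled hosts can be listed with a few lines of Python. This is a sketch that assumes the line format shown above (a hostname key plus a `tags` string); `pooled_hosts` is a hypothetical helper and not part of confctl.

```python
import json


def pooled_hosts(output: str) -> list[tuple[str, str]]:
    """Return (hostname, cluster) for every line whose host has pooled == "yes"."""
    hosts = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # "tags" is a comma-separated key=value string,
        # e.g. "dc=codfw,cluster=appserver,service=nginx".
        raw_tags = record.pop("tags", "")
        tags = dict(item.split("=", 1) for item in raw_tags.split(",") if item)
        # Whatever key remains is the hostname, mapped to its state.
        for hostname, state in record.items():
            if state.get("pooled") == "yes":
                hosts.append((hostname, tags.get("cluster", "")))
    return hosts


# Two lines copied from the output above.
SAMPLE = """\
{"mw2276.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2299.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
"""
```
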
Mon, Jun 24, 12:24 PM · Release-Engineering-Team (Seen), SRE, serviceops, MW-on-K8s
Clement_Goubert updated the task description for T368058: Set all appservers to pooled=inactive in scap.
Mon, Jun 24, 12:22 PM · Release-Engineering-Team (Seen), SRE, serviceops, MW-on-K8s
Clement_Goubert updated the task description for T368058: Set all appservers to pooled=inactive in scap.
Mon, Jun 24, 12:05 PM · Release-Engineering-Team (Seen), SRE, serviceops, MW-on-K8s
Clement_Goubert changed the status of T368058: Set all appservers to pooled=inactive in scap from Open to In Progress.
Mon, Jun 24, 10:55 AM · Release-Engineering-Team (Seen), SRE, serviceops, MW-on-K8s
Clement_Goubert changed the status of T368058: Set all appservers to pooled=inactive in scap, a subtask of T367949: Spin down api_appserver and appserver clusters, from Open to In Progress.
Mon, Jun 24, 10:52 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert added a project to T368251: Create ValidatingAdmissionPolicies to replace mediawiki PSP: MW-on-K8s.
Mon, Jun 24, 9:58 AM · Patch-For-Review, MW-on-K8s, Kubernetes, serviceops, Prod-Kubernetes

Fri, Jun 21

Clement_Goubert closed T366481: registry2004 sometimes reporting: too many open files problems as Resolved.

Since the bump to 10240 open files resulted in one last spike of errors in the last 10 days then nothing, doubling it should cover regular usage.

Fri, Jun 21, 7:45 AM · serviceops, Wikimedia-production-error

Thu, Jun 20

Clement_Goubert created T368079: Relabel codfw kubernetes nodes.
Thu, Jun 20, 3:59 PM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert added a comment to T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s.

Sure, we can start next Monday around 1400UTC if that works for you?

Thu, Jun 20, 2:52 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, MW-on-K8s, serviceops, Observability-Metrics
Clement_Goubert added a comment to T367766: hw troubleshooting: firmware upgrade for mw1359.eqiad.wmnet, mw1364.eqiad.wmnet, mw1365.eqiad.wmnet, mw1412.eqiad.wmnet.

Thanks!

Thu, Jun 20, 2:49 PM · SRE, serviceops, ops-eqiad, DC-Ops
Clement_Goubert created T368058: Set all appservers to pooled=inactive in scap.
Thu, Jun 20, 2:43 PM · Release-Engineering-Team (Seen), SRE, serviceops, MW-on-K8s