User Details
- User Since: Jul 26 2022, 2:11 PM (103 w, 3 d)
- Availability: Available
- IRC Nick: claime
- LDAP User: Clément Goubert
- MediaWiki User: CGoubert-WMF
Today
It is still relevant for videoscalers, but we'll remove it once we're rid of them.
I *tried* very hard to automate it with a cookbook, but the behavior is wildly inconsistent between runs: sometimes deleting vdisks requires a reboot, sometimes it doesn't, and sometimes the vdisk deletion starts only to fail without any explanation.
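For reference, the shape of the automation I was attempting looks roughly like this; the controller tool, its syntax, and the reboot helper are assumptions for illustration, not the actual cookbook:

```
#!/bin/bash
# Hypothetical sketch: retry vdisk deletion, falling back to a reboot
# when the controller refuses. The perccli syntax and the
# reboot-and-wait helper are placeholders, not real cookbook code.
for attempt in 1 2 3; do
    if perccli /c0/vall delete force; then
        echo "vdisks deleted on attempt ${attempt}"
        exit 0
    fi
    # Sometimes the controller only accepts the deletion after a
    # reboot; this mirrors the inconsistency described above.
    echo "deletion failed, rebooting and retrying"
    reboot-and-wait  # hypothetical helper
done
echo "giving up after 3 attempts" >&2
exit 1
```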
Wed, Jul 17
Please do :) I'll leave the task open so it doesn't open a new one when I inevitably break it again. For the record, until T358489: mw2420-mw2451 do have unnecessary raid controllers (configured) is resolved, any RAID issue you get for mw2432, mw2433, mw2438, or mw2439 is probably me. I will try to find out why downtiming the host didn't stop the auto-creation of the ticket; if I can't, I'll watch for them being auto-opened and comment on them asap.
Hi @Jhancock.wm, very sorry for the noise. This is me trying to automate switching the RAID controller to HBA mode; there are no actual issues with the disk. I didn't know it would automatically create a ticket for you.
I have merged the access change; puppet will deploy it in the next half-hour or so. I'm resolving this task, feel free to reopen should you encounter any issues.
@Milimetric Can you sign L3 so we can move forward with this?
Tue, Jul 16
Can you please read and sign the L3 document, and also read the Data Access User Responsibilities document?
Mon, Jul 15
@KFrancis can you please confirm NDA status?
Resolving this as it seems everything is in order. Don't hesitate to reopen should you encounter any issues.
@XiaoXiao-WMF You should have received an email with instructions on how to set your Kerberos password. Please reopen this task should you encounter any issues.
Thu, Jul 11
Tagging @RLazarus as I haven't really kept up with mwscript development.
Wed, Jul 10
We've only had one spike of job enqueuing errors since merging Restart=on-failure, so I'll call this resolved for now and reopen if we see problems again.
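For context, the change boils down to a systemd restart policy. A minimal sketch of the equivalent drop-in, assuming it targets ferm.service (the unit name is my inference from the related ferm task, not confirmed here):

```
# Sketch: enable automatic restart on failure via a systemd drop-in.
# The unit (ferm.service) and the restart delay are assumptions.
sudo mkdir -p /etc/systemd/system/ferm.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/ferm.service.d/restart.conf
[Service]
Restart=on-failure
RestartSec=5s
EOF
sudo systemctl daemon-reload
```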
This seems pretty well settled now (see the sketch after this list) with:
- mw-api-ext for external api calls
- mw-api-int for internal api calls
- mw-web for external live users
- mw-parsoid for parsoid
- mw-jobrunner for non-video jobs
- mw-videoscaler for videoscaling jobs
- mw-debug for testing
- mw-script for maintenance scripts and periodic jobs
- mw-misc for noc.wikimedia.org
- mw-wikifunctions dedicated for Abstract Wikipedia
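A quick way to see these on a cluster, assuming each deployment above maps to a namespace of the same name:

```
# List the mw-* namespaces, then the pods of one deployment.
kubectl get namespaces | grep '^mw-'
kubectl -n mw-web get pods
```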
I'd recommend using the dedicated REPL for this
[...]
The preferred alternative would presumably be mw-api-int-ro.discovery.wmnet:4446, but I'd like to confirm with @Clement_Goubert as well.
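For a quick sanity check of that endpoint from an internal host, something like the following should work; the Host header and query are illustrative, not from this task:

```
# Sketch: query the internal read-only API endpoint named above.
# The wiki (Host header) and API query are examples.
curl -s 'https://mw-api-int-ro.discovery.wmnet:4446/w/api.php' \
  -H 'Host: en.wikipedia.org' \
  --data-urlencode 'action=query' \
  --data-urlencode 'meta=siteinfo' \
  --data-urlencode 'format=json'
```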
Mon, Jul 8
All internal traffic has been migrated.
All traffic has been migrated to MW-on-K8s.
We can get Prometheus metrics using the release label. Boldly closing.
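For example, something along these lines; the Prometheus URL and metric name are assumptions for illustration:

```
# Sketch: aggregate a metric by the release label through the
# Prometheus HTTP API. URL and metric name are placeholders.
curl -sG 'http://prometheus.svc.eqiad.wmnet/k8s/api/v1/query' \
  --data-urlencode 'query=sum by (release) (rate(container_cpu_usage_seconds_total[5m]))'
```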
This functionality has been added to scap.
The work this task tracked is now complete. The remaining migrations, T352650: Migrate current-generation dumps to run from our containerized images and T355292: Port videoscaling to kubernetes, as well as the cleanup work T367949: Spin down api_appserver and appserver clusters, will indeed be tracked through T290536.
Wed, Jul 3
The only hosts left are:
- The 5 nodes with an incorrect RAID config from T358489: mw2420-mw2451 do have unnecessary raid controllers (configured) that haven't yet been reimaged
- The codfw nodes to be decommissioned
- The 3 nodes waiting for T367949: Spin down api_appserver and appserver clusters (the fourth is part of the nodes to be decommissioned)
Lowering priority now that the production deployments of mw-on-k8s have been done.
Tagging in @RLazarus for mw-script; I don't know how you want to handle it given the transient nature of the environment.
From inside the pod, a Node.js DNS lookup returns the IPv6 address for staging.svc.eqiad.wmnet:
```
runuser@mobileapps-staging-6794bbd5c6-cp6gd:/srv/service$ nodejs
Welcome to Node.js v18.19.0.
Type ".help" for more information.
> const dns = require('node:dns')
undefined
> dns.lookup('staging.svc.eqiad.wmnet', (err, address, family) => {
...   console.log('address: %j family: IPv%s', address, family);
... });
GetAddrInfoReqWrap {
  callback: [Function (anonymous)],
  family: 0,
  hostname: 'staging.svc.eqiad.wmnet',
  oncomplete: [Function: onlookup]
}
> address: "2620:0:861:102:10:64:16:55" family: IPv6
```
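To compare which records actually exist for the name (as opposed to which one the resolver prefers), dig both address types; this is a generic check, not output captured from the pod:

```
# Node's dns.lookup with family: 0 returns whichever address
# getaddrinfo prefers; dig shows both records directly.
dig +short A staging.svc.eqiad.wmnet
dig +short AAAA staging.svc.eqiad.wmnet
```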
```
root@kubestage1003:/home/cgoubert# curl https://staging.svc.eqiad.wmnet:4492/v1/events -H "Content-Type: application/json" -d '@data.txt' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1359  100   549  100   810   4773   7043 --:--:-- --:--:-- --:--:-- 11817
{
  "invalid": [],
  "error": [
    {
      "status": "error",
      "event": {
        "$schema": "/resource_change/1.0.0",
        "meta": {
          "request_id": "311902cc-0319-41b0-9fc6-9ca1dcb47dd7",
          "id": "68a24380-37a9-11ef-b9d3-87400a6a8e37",
          "dt": "2024-07-01T12:57:01.624Z",
          "domain": "en.wikipedia.org",
          "uri": "en.wikipedia.org/api/rest_v1/page/mobile-html/Dog",
          "stream": "resource_purge"
        },
        "tags": [
          "pcs"
        ]
      },
      "context": {
        "message": "event 68a24380-37a9-11ef-b9d3-87400a6a8e37 of schema at /resource_change/1.0.0 destined to stream resource_purge is not allowed in stream; resource_purge is not configured."
      }
    }
  ]
}
```
Tue, Jul 2
Host is flapping; setting downtime until tomorrow.
Done for mw-misc and mw-wikifunctions.
Mon, Jul 1
All main deployments of mw-on-k8s now send data to statsd-exporter.
Remaining are mw-misc, mw-wikifunctions, and the yet-to-be-used mw-script and mw-videoscaler, which I'll do this week; they won't be sending much data.
Thu, Jun 27
This issue is biting us again: the time between a puppet run that fails to start ferm.service and the subsequent run is enough to cause issues with eventgate, leading to flurries of "Can't enqueue job" errors in MediaWiki. It is triggered in particular when we add new kubernetes nodes, because that causes all other kubernetes nodes in the cluster to update their ferm rules.
New version has been deployed.
If we jump right to bookworm, we need to copy the python3-docker-report package to bookworm.
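On the apt server that copy would presumably be a reprepro operation; a sketch, assuming the usual distribution naming (not a command I've run for this package):

```
# Sketch: copy python3-docker-report from the bullseye distribution
# to bookworm. Distribution names are assumptions.
sudo -i reprepro copy bookworm-wikimedia bullseye-wikimedia python3-docker-report
```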
Wed, Jun 26
It may be related to docker using single-threaded gzip for compression on push: https://github.com/moby/moby/issues/41987
- Necessary packages docker-registry and python3-docker-report are available for bullseye in the right versions
- Summarizing from IRC, the real risk is the nginx config. Tests would need to be run for (see the curl sketch after this list):
- publish a non-restricted image from build2001
- publish a restricted image from deploy1002 (we test both authenticated and unauthenticated POST to a nonexistent upload path in httpbb, but a real build + push test would be better)
- pull a non-restricted image without credentials from the public interface (already in httpbb)
- pull a restricted image without credentials and see it fail (already in httpbb)
- check the pipeline on gitlab still works
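As a sketch of the pull-side checks, plain curl is enough even before porting them to httpbb; the image paths and expected status codes here are assumptions:

```
# Sketch: verify anonymous pull behavior against the registry API.
# Image names are placeholders; substitute real public/restricted images.
# Public image without credentials: expect 200.
curl -s -o /dev/null -w '%{http_code}\n' \
  https://docker-registry.wikimedia.org/v2/wikimedia/some-public-image/tags/list
# Restricted image without credentials: expect 401 (or 403).
curl -s -o /dev/null -w '%{http_code}\n' \
  https://docker-registry.wikimedia.org/v2/restricted/some-image/tags/list
```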
Hi, I just checked on a bare-metal debug server (mwdebug1001), and it takes 1.9s, so I doubt it's k8s-related.
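For reproducibility, this is the kind of measurement I mean; the URL is a placeholder for the request under test:

```
# Time a request end-to-end from the debug host; $URL_UNDER_TEST is
# a placeholder for whatever request is being investigated.
curl -s -o /dev/null -w 'total: %{time_total}s\n' "$URL_UNDER_TEST"
```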
Tue, Jun 25
Looks like something between the deployment server and the registry:
[Graph: registry rx for the last long build-and-push]
[Graph: deploy tx for the same period]
mw-on-k8s pods should now rate limit the logs sent to mwlog at 100 messages per second per pod.
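Mechanically this is a burst-over-interval limit; a sketch using rsyslog's imuxsock rate limiting, which may not be exactly how the pods implement it:

```
# Sketch: a 100 msg/s limit on rsyslog's local socket input.
# Whether mw-on-k8s uses imuxsock for this path is an assumption.
cat <<'EOF' > /etc/rsyslog.d/10-ratelimit.conf
module(load="imuxsock"
       SysSock.RateLimit.Interval="1"
       SysSock.RateLimit.Burst="100")
EOF
```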
Mon, Jun 24
Remaining pooled servers:
```
cgoubert@cumin1002:~$ sudo confctl select 'cluster=(api_appserver|appserver)' get | grep yes
{"mw2276.codfw.wmnet": {"weight": 25, "pooled": "yes"}, "tags": "dc=codfw,cluster=appserver,service=nginx"}
{"mw2299.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"}
{"mw1364.eqiad.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=eqiad,cluster=appserver,service=nginx"}
{"mw1398.eqiad.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=eqiad,cluster=api_appserver,service=nginx"}
```
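For completeness, draining one of these is a single confctl call (host name taken from the list above):

```
# Depool one of the remaining servers via conftool.
sudo confctl select 'name=mw2276.codfw.wmnet' set/pooled=no
```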
Fri, Jun 21
Since the bump to 10240 open files, we saw one last spike of errors in the past 10 days and then nothing, so doubling it should cover regular usage.
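Doubling it is a one-line unit change; a sketch via a systemd drop-in, with the unit name as a placeholder:

```
# Sketch: double the open files limit from 10240 to 20480.
# "some.service" is a placeholder for the affected unit.
sudo mkdir -p /etc/systemd/system/some.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/some.service.d/nofile.conf
[Service]
LimitNOFILE=20480
EOF
sudo systemctl daemon-reload
sudo systemctl restart some.service
```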
Thu, Jun 20
Sure, we can start next Monday around 14:00 UTC if that works for you?