kamila (Kamila Součková)
Site Reliability Engineer - ServiceOps

User Details

User Since
Mar 16 2023, 2:18 PM (76 w, 3 d)
Availability
Available
IRC Nick
kamila_
LDAP User
Kamila Součková
MediaWiki User
KSoučková-WMF [ Global Accounts ]

Recent Activity

Thu, Aug 29

kamila updated the task description for T373591: Relabel codfw kubernetes nodes.
Thu, Aug 29, 6:55 PM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops

Tue, Aug 27

kamila created T373457: Relabel codfw kubernetes nodes.
Tue, Aug 27, 4:14 PM · ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops

Aug 2 2024

kamila added a comment to T371214: Move mw-accesslog-metrics instance to k8s.

On further thought, doing it properly, i.e. creating a k8s-aux cluster in codfw and placing it there, seems like a better idea. That's going to take more time than just moving benthos. @fgiunchedi this isn't urgent, correct?

Aug 2 2024, 11:24 AM · Observability-Metrics, serviceops, MW-on-K8s

Jul 29 2024

kamila claimed T371214: Move mw-accesslog-metrics instance to k8s.
Jul 29 2024, 2:34 PM · Observability-Metrics, serviceops, MW-on-K8s

Jul 23 2024

kamila added a comment to T367076: benthos mw-accesslog-metrics kafka lag and interpolation errors.

FTR, I have reverted the buffer patch, as it shouldn't be necessary now that we have more partitions thanks to T369256, and I cannot rule it out as a possible cause for some of the "benthos is wedged" weirdness observed last week. That means we no longer know how much reserve capacity we have. In case of trouble, revert^2 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055399 .

Jul 23 2024, 10:12 AM · Observability-Logging, SRE, serviceops, MW-on-K8s

Jul 17 2024

kamila merged task T370264: benthos mw-accesslog-metrics interpolation errors into T368417: Unexpected json-in-json from mediawiki-httpd-accesslog.
Jul 17 2024, 2:54 PM · MW-on-K8s, Observability-Logging, serviceops
kamila merged T370264: benthos mw-accesslog-metrics interpolation errors into T368417: Unexpected json-in-json from mediawiki-httpd-accesslog.
Jul 17 2024, 2:53 PM · Observability-Logging
kamila added a comment to T370264: benthos mw-accesslog-metrics interpolation errors.

AFAICT, this is only caused by extremely long URLs that result in truncated (and thus invalid) JSON -- see also T368417. These URLs are not "real" user-generated traffic, and are not really useful. I will look into options for making them go away.

Jul 17 2024, 2:44 PM · MW-on-K8s, Observability-Logging, serviceops
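
The T370264 comment above attributes the errors to very long URLs producing truncated, and therefore invalid, JSON. A minimal sketch of that failure mode follows. The field names and the 256-character limit are hypothetical, chosen only for illustration; the actual log schema and size limit are not given in the comment.

```python
import json

# Hypothetical size limit; the real limit in the logging pipeline is not stated.
MAX_LEN = 256


def is_valid_json(line: str) -> bool:
    """Return True if the line parses as JSON, False otherwise."""
    try:
        json.loads(line)
        return True
    except json.JSONDecodeError:
        return False


def make_log_line(url: str) -> str:
    """Serialize an access-log-like record (field names are illustrative only)."""
    return json.dumps({"method": "GET", "url": url, "status": 200})


short = make_log_line("/wiki/Main_Page")
long_line = make_log_line("/w/index.php?title=" + "A" * 1000)

# Cutting a line at a fixed length leaves a short record intact,
# but chops a long one in the middle of the URL string.
truncated = long_line[:MAX_LEN]

print(is_valid_json(short[:MAX_LEN]))
print(is_valid_json(truncated))
```

A record cut off inside a string value has no closing quote or brace, so any strict JSON parser rejects it regardless of which consumer reads it.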

Jul 16 2024

kamila awarded T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic a Love token.
Jul 16 2024, 10:28 AM · Observability-Logging

Jul 15 2024

kamila awarded T317794: requestctl can't act on cache hits a Love token.
Jul 15 2024, 12:25 PM · SRE-Sprint-Week-Sustainability-March2023, Patch-For-Review, Traffic, Sustainability (Incident Followup), conftool

Jul 12 2024

kamila added a comment to T368417: Unexpected json-in-json from mediawiki-httpd-accesslog.

There were two unrelated causes for these:

Jul 12 2024, 3:39 PM · Observability-Logging

Jul 3 2024

kamila closed T367076: benthos mw-accesslog-metrics kafka lag and interpolation errors as Resolved.

Increasing batch size slightly improved the situation, very slowly clearing the backlog, suggesting that this was caused by some flavour of a performance issue:

Screenshot from 2024-07-03 16-26-45.png (480×1 px, 73 KB)

Because I did not see any direct resource starvation on the host, my hunch was that this was something about the message delivery.

Jul 3 2024, 2:50 PM · Observability-Logging, SRE, serviceops, MW-on-K8s
kamila closed T367076: benthos mw-accesslog-metrics kafka lag and interpolation errors, a subtask of T276095: Keep calculating latencies for MediaWiki requests in the WikiKube environment, as Resolved.
Jul 3 2024, 2:49 PM · Observability-Logging, SRE, serviceops, MW-on-K8s

Jun 27 2024

kamila closed T340935: Some apache access logs are invalid json as Resolved.

Closing this and opening T368640, as what I'm seeing now is a tiny sub-problem of the original problem

Jun 27 2024, 4:54 PM · Observability-Logging, serviceops, MW-on-K8s
kamila triaged T368640: glogger produces invalid JSON when given input with non-printable characters as Low priority.
Jun 27 2024, 4:51 PM · Observability-Logging, serviceops, MW-on-K8s
kamila created T368640: glogger produces invalid JSON when given input with non-printable characters.
Jun 27 2024, 4:51 PM · Observability-Logging, serviceops, MW-on-K8s
kamila awarded T357309: Create a deployment for `shellbox-timedmedia` a Love token.
Jun 27 2024, 9:59 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops

Jun 26 2024

kamila added a comment to T367076: benthos mw-accesslog-metrics kafka lag and interpolation errors.

I believe the errors are unrelated: they are due to T340935, and we have had bad messages before without them causing this problem.

Jun 26 2024, 1:36 PM · Observability-Logging, SRE, serviceops, MW-on-K8s

Jun 24 2024

kamila updated the task description for T367757: Request to add mnz to analytics-research-admins.
Jun 24 2024, 5:30 PM · Patch-For-Review, SRE, SRE-Access-Requests

Jun 21 2024

kamila updated the task description for T368140: Grant Access to wmf for daphnesmit.
Jun 21 2024, 1:08 PM · SRE, LDAP-Access-Requests
kamila added a comment to T368140: Grant Access to wmf for daphnesmit.

@DSmit-WMF Are you requesting SSH access too, or just tools?

Jun 21 2024, 1:03 PM · SRE, LDAP-Access-Requests
kamila moved T367757: Request to add mnz to analytics-research-admins from Awaiting User Input to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Jun 21 2024, 12:57 PM · Patch-For-Review, SRE, SRE-Access-Requests
kamila updated subscribers of T367757: Request to add mnz to analytics-research-admins.

@KFrancis can you please make sure @MunizaA's NDA is signed? Thank you!

Jun 21 2024, 12:55 PM · Patch-For-Review, SRE, SRE-Access-Requests
kamila updated the task description for T367757: Request to add mnz to analytics-research-admins.
Jun 21 2024, 12:52 PM · Patch-For-Review, SRE, SRE-Access-Requests
kamila updated the task description for T367757: Request to add mnz to analytics-research-admins.
Jun 21 2024, 12:32 PM · Patch-For-Review, SRE, SRE-Access-Requests
kamila updated subscribers of T368027: Grant Access to analytics-privatedata-users for cwylo.

@leila could you please confirm this access request? Thank you!

Jun 21 2024, 11:25 AM · SRE-Access-Requests, Data-Engineering, SRE
kamila updated the task description for T368027: Grant Access to analytics-privatedata-users for cwylo.
Jun 21 2024, 11:21 AM · SRE-Access-Requests, Data-Engineering, SRE
kamila added a comment to T368027: Grant Access to analytics-privatedata-users for cwylo.

@cwylo Can you please confirm that you have read the Analytics Data Access User Responsibilities? Thank you!

Jun 21 2024, 11:20 AM · SRE-Access-Requests, Data-Engineering, SRE
kamila updated the task description for T368027: Grant Access to analytics-privatedata-users for cwylo.
Jun 21 2024, 11:20 AM · SRE-Access-Requests, Data-Engineering, SRE
kamila added a comment to T368027: Grant Access to analytics-privatedata-users for cwylo.

Needs approval from one of:

  • Olja Dimitrjevic
  • Dan Andreescu
  • Will Doran
  • Andreas Hoelzl
  • Andrew Otto
Jun 21 2024, 11:12 AM · SRE-Access-Requests, Data-Engineering, SRE
kamila moved T368027: Grant Access to analytics-privatedata-users for cwylo from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Jun 21 2024, 11:12 AM · SRE-Access-Requests, Data-Engineering, SRE
kamila moved T368027: Grant Access to analytics-privatedata-users for cwylo from Backlog to Manager Approval Pending on the LDAP-Access-Requests board.
Jun 21 2024, 11:12 AM · SRE-Access-Requests, Data-Engineering, SRE
kamila added a project to T368027: Grant Access to analytics-privatedata-users for cwylo: Data-Engineering.
Jun 21 2024, 11:10 AM · SRE-Access-Requests, Data-Engineering, SRE
kamila closed T366205: codfw:(3) wikikube-ctrl NIC upgrade to 10G as Resolved.

Done, thanks a lot for the help @Papaul !

Jun 21 2024, 10:34 AM · SRE-OnFire, Sustainability (Incident Followup), serviceops, ops-codfw, DC-Ops
kamila closed T366205: codfw:(3) wikikube-ctrl NIC upgrade to 10G, a subtask of T353464: Migrate wikikube control planes to hardware nodes, as Resolved.
Jun 21 2024, 10:33 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
kamila closed T366205: codfw:(3) wikikube-ctrl NIC upgrade to 10G, a subtask of T366094: k8s master capacity issues, as Resolved.
Jun 21 2024, 10:33 AM · serviceops, SRE

Jun 19 2024

kamila moved T367681: Update terms and timeline of access already granted for AndyRussG from Backlog to Awaiting User Input on the LDAP-Access-Requests board.
Jun 19 2024, 4:36 PM · SRE, LDAP-Access-Requests
kamila updated the task description for T367872: Grant Access to analytics-privatedata-users for DMburugu.
Jun 19 2024, 3:47 PM · SRE-Access-Requests, Data-Engineering, SRE
kamila added a comment to T367681: Update terms and timeline of access already granted for AndyRussG.

@AndyRussG since you're now with WMDE, we'd like to update your email, can you please let me know your @wikimedia.de email address?

Jun 19 2024, 3:40 PM · SRE, LDAP-Access-Requests
kamila added a comment to T367872: Grant Access to analytics-privatedata-users for DMburugu.

@DMburugu can you please confirm that you have read the Analytics Data Access User Responsibilities? Thanks!

Jun 19 2024, 1:02 PM · SRE-Access-Requests, Data-Engineering, SRE
kamila updated the task description for T367872: Grant Access to analytics-privatedata-users for DMburugu.
Jun 19 2024, 12:57 PM · SRE-Access-Requests, Data-Engineering, SRE
kamila closed T367914: Update "WMDE group" approvers on Wikitech as Resolved.
Jun 19 2024, 12:37 PM · wikitech.wikimedia.org, SRE-Access-Requests, SRE
kamila created T367967: Requesting content administrator access for Kamila Součková.
Jun 19 2024, 12:25 PM · wikitech.wikimedia.org
kamila closed T367184: Grant Access to ldap/wmde for Audrey Penven as Resolved.

Done, though it's my first time doing clinic duty, so let me know if it doesn't work :D

Jun 19 2024, 10:31 AM · SRE, LDAP-Access-Requests
kamila added a member for WMF-NDA: AudreyPenven_WMDE.
Jun 19 2024, 10:25 AM

Jun 18 2024

kamila updated subscribers of T367681: Update terms and timeline of access already granted for AndyRussG.

As per the process at https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMDE_Group , we need confirmation from an engineering manager. @WMDE-leszek would you mind confirming this request please?

Jun 18 2024, 6:53 PM · SRE, LDAP-Access-Requests
kamila added a comment to T367681: Update terms and timeline of access already granted for AndyRussG.

@AndyRussG Can you please confirm the new end date? I assume you meant July 30, 2024, correct?

Jun 18 2024, 5:33 PM · SRE, LDAP-Access-Requests
kamila changed the status of T367184: Grant Access to ldap/wmde for Audrey Penven from In Progress to Stalled.
Jun 18 2024, 2:58 PM · SRE, LDAP-Access-Requests

Jun 17 2024

kamila added a comment to T366205: codfw:(3) wikikube-ctrl NIC upgrade to 10G.

wikikube-ctrl2003 looks happy, thanks for the help!

Jun 17 2024, 3:33 PM · SRE-OnFire, Sustainability (Incident Followup), serviceops, ops-codfw, DC-Ops
kamila changed the status of T367295: Requesting access to private data-based dashboards for Jsn.sherman from Open to Stalled.

waiting for approval

Jun 17 2024, 1:57 PM · Data-Engineering, SRE, SRE-Access-Requests
kamila changed the status of T365074: Requesting access to cassandra-staging-devs for milimetric from In Progress to Stalled.

Stalled on @Milimetric signing L3

Jun 17 2024, 1:56 PM · SRE, SRE-Access-Requests
kamila claimed T365074: Requesting access to cassandra-staging-devs for milimetric.
Jun 17 2024, 1:52 PM · SRE, SRE-Access-Requests
kamila updated the task description for T365074: Requesting access to cassandra-staging-devs for milimetric.
Jun 17 2024, 1:51 PM · SRE, SRE-Access-Requests
kamila added a comment to T365074: Requesting access to cassandra-staging-devs for milimetric.

@Milimetric also, you provided an SSH key fingerprint, we need the public key. It should start with something like ssh-rsa AAAA...

There's an existing user already, this only needs to have the group added: L1920 in data.yaml

Jun 17 2024, 12:39 PM · SRE, SRE-Access-Requests
kamila added a comment to T365074: Requesting access to cassandra-staging-devs for milimetric.

@Milimetric also, you provided an SSH key fingerprint, we need the public key. It should start with something like ssh-rsa AAAA...

Jun 17 2024, 12:33 PM · SRE, SRE-Access-Requests
kamila added a comment to T365074: Requesting access to cassandra-staging-devs for milimetric.

@Milimetric I cannot find your signature on L3, could you please ensure that you have signed it?

Jun 17 2024, 12:27 PM · SRE, SRE-Access-Requests

Jun 14 2024

kamila awarded T367466: sre.hosts.downtime, and any other maintenance processes, should use auto-extending silences a Like token.
Jun 14 2024, 1:56 PM · SRE-tools, Infrastructure-Foundations, Spicerack, Observability-Alerting
kamila closed T366204: eqiad:(3) wikikube-ctrl NIC upgrade to 10G as Resolved.

The new NICs seem to be happy, including through overnight network testing done out of sheer paranoia, so calling it done 🎉

Jun 14 2024, 8:19 AM · SRE, Sustainability (Incident Followup), SRE-OnFire, DC-Ops, serviceops, ops-eqiad
kamila closed T366204: eqiad:(3) wikikube-ctrl NIC upgrade to 10G, a subtask of T353464: Migrate wikikube control planes to hardware nodes, as Resolved.
Jun 14 2024, 8:19 AM · Patch-For-Review, serviceops, Prod-Kubernetes, Kubernetes
kamila closed T366204: eqiad:(3) wikikube-ctrl NIC upgrade to 10G, a subtask of T366094: k8s master capacity issues, as Resolved.
Jun 14 2024, 8:19 AM · serviceops, SRE

Jun 13 2024

kamila added a comment to T366205: codfw:(3) wikikube-ctrl NIC upgrade to 10G.

@kamila your plan works for us as well. Just depool and power down the first server you want to move, in your time zone, and let us know which one. When we are on site in our time zone, we will put in the 10G NIC, move it to the new rack, do all the Netbox changes, and hand it back to you for reimaging. Once you are happy with it, we can move on to the next one.

Jun 13 2024, 10:07 AM · SRE-OnFire, Sustainability (Incident Followup), serviceops, ops-codfw, DC-Ops

Jun 12 2024

kamila reopened T340935: Some apache access logs are invalid json as "Open".

I'm seeing more bad messages that look like an encoding problem:

Jun 12 2024, 10:40 AM · Observability-Logging, serviceops, MW-on-K8s

Jun 11 2024

kamila added a comment to T366205: codfw:(3) wikikube-ctrl NIC upgrade to 10G.

@Papaul Thanks for the additional details!

Jun 11 2024, 1:46 PM · SRE-OnFire, Sustainability (Incident Followup), serviceops, ops-codfw, DC-Ops

Jun 10 2024

kamila added a comment to T366205: codfw:(3) wikikube-ctrl NIC upgrade to 10G.

@Papaul could you please let me know when would be a good time for you to do this? We don't have any specific time requirements, just that earlier would be nice. I would like to do it in two steps (first just 1 server, then the other 2) for capacity reasons. I am in CEST, so US mornings would work well for me, but I can also decom in advance and reimage the next day if you'd prefer to do it asynchronously, just let me know.

Jun 10 2024, 3:32 PM · SRE-OnFire, Sustainability (Incident Followup), serviceops, ops-codfw, DC-Ops

Jun 6 2024

kamila added a comment to T366204: eqiad:(3) wikikube-ctrl NIC upgrade to 10G.

@VRiley-WMF the reimage of wikikube-ctrl1001 was finally successful, I want to run a few more tests due to having had network(?) problems during the reimage, but I _think_ we should be good to proceed next week, if that works for you. Tuesday would be the best day for me, but I can make other days work too. I believe that capacity-wise, we can move the two remaining machines at the same time if that's easier for you.

Jun 6 2024, 6:49 PM · SRE, Sustainability (Incident Followup), SRE-OnFire, DC-Ops, serviceops, ops-eqiad

Jun 4 2024

kamila added a comment to T366204: eqiad:(3) wikikube-ctrl NIC upgrade to 10G.

@VRiley-WMF Yes, that works, thank you!

Jun 4 2024, 9:27 AM · SRE, Sustainability (Incident Followup), SRE-OnFire, DC-Ops, serviceops, ops-eqiad

May 31 2024

kamila added a comment to T366204: eqiad:(3) wikikube-ctrl NIC upgrade to 10G.

@VRiley-WMF I am in UTC+2, so US mornings are best for me. Would Tuesday work for you? Thank you!

May 31 2024, 1:25 PM · SRE, Sustainability (Incident Followup), SRE-OnFire, DC-Ops, serviceops, ops-eqiad

May 29 2024

kamila created T366205: codfw:(3) wikikube-ctrl NIC upgrade to 10G.
May 29 2024, 4:41 PM · SRE-OnFire, Sustainability (Incident Followup), serviceops, ops-codfw, DC-Ops
kamila created T366204: eqiad:(3) wikikube-ctrl NIC upgrade to 10G.
May 29 2024, 4:40 PM · SRE, Sustainability (Incident Followup), SRE-OnFire, DC-Ops, serviceops, ops-eqiad

Mar 27 2024

kamila updated the task description for T359423: Migrate charts to Calico Network Policies.
Mar 27 2024, 2:44 PM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops

Mar 25 2024

TheDJ awarded T357296: Create new flavour of shellbox for video transcoding a Party Time token.
Mar 25 2024, 1:27 PM · Video, MW-on-K8s, serviceops
kamila closed T357296: Create new flavour of shellbox for video transcoding as Resolved.

Based on some quick tests the image seems to be working \o/

Mar 25 2024, 12:23 PM · Video, MW-on-K8s, serviceops
kamila closed T357296: Create new flavour of shellbox for video transcoding, a subtask of T356241: Move video transcoding to use Shellbox, as Resolved.
Mar 25 2024, 12:23 PM · Video, TimedMediaHandler, MW-on-K8s, serviceops

Mar 21 2024

kamila added a comment to T197470: find a way to systematically update the deployment server name across all repos.

For the record, I used sudo bash -c 'find /srv/deployment -name DEPLOY_HEAD | xargs sed -i "s/git_server: deploy1002.eqiad.wmnet/git_server: deploy2002.codfw.wmnet/"' (or the other way around) on all deployment servers to work around this.

Mar 21 2024, 11:53 AM · Release-Engineering-Team (Priority Backlog 📥), Scap
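
The find/sed workaround quoted in the T197470 comment above can also be sketched in Python with a dry-run mode, which may be useful for previewing what would change. This is an illustrative equivalent, not the command that was actually run: the function name `switch_git_server` is hypothetical, the demo file contents are assumed, and `str.replace` matches the hostname literally, whereas the dots in the sed pattern are regex wildcards.

```python
from pathlib import Path
import tempfile


def switch_git_server(root: Path, old: str, new: str, dry_run: bool = True) -> list[Path]:
    """Rewrite 'git_server: <old>' to 'git_server: <new>' in every
    DEPLOY_HEAD file under root. Returns the files that would change
    (dry_run=True) or that were changed (dry_run=False)."""
    changed = []
    for path in root.rglob("DEPLOY_HEAD"):
        if not path.is_file():
            continue
        text = path.read_text()
        updated = text.replace(f"git_server: {old}", f"git_server: {new}")
        if updated != text:
            changed.append(path)
            if not dry_run:
                path.write_text(updated)
    return sorted(changed)


# Demo against a throwaway directory; the layout and file contents are assumptions.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    head = root / "example-repo" / "DEPLOY_HEAD"
    head.parent.mkdir(parents=True)
    head.write_text("git_server: deploy1002.eqiad.wmnet\n")

    old, new = "deploy1002.eqiad.wmnet", "deploy2002.codfw.wmnet"
    preview = switch_git_server(root, old, new)            # dry run, nothing written
    text_after_preview = head.read_text()
    applied = switch_git_server(root, old, new, dry_run=False)
    text_after_apply = head.read_text()

print(len(preview), len(applied))
print(text_after_apply.strip())
```

Swapping the `old` and `new` arguments covers the "other way around" case mentioned in the comment.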
kamila added a comment to T360597: Increased latency, timeouts from wikifeeds since march 10th.

Note also the increase in RX (but not really TX) traffic that coincides with these:

Screenshot from 2024-03-21 11-59-27.png (1×1 px, 217 KB)

Mar 21 2024, 11:21 AM · Content-Transform-Team-WIP, Patch-For-Review, serviceops, Content-Transform-Team

Feb 26 2024

kamila added a comment to T312067: SRE needs a logo.

Inspired by some of the above:

Screenshot_20240226_161618_Sketchbook.png (780×769 px, 174 KB)
Screenshot_20240226_161654_Sketchbook.png (641×710 px, 104 KB)
Screenshot_20240226_161719_Sketchbook.png (625×754 px, 136 KB)

Feb 26 2024, 3:23 PM · Design, Wikimedia-Design, SRE Program Management, Logos, SRE

Feb 21 2024

kamila changed the status of T357309: Create a deployment for `shellbox-timedmedia`, a subtask of T356241: Move video transcoding to use Shellbox, from Open to In Progress.
Feb 21 2024, 10:42 AM · Video, TimedMediaHandler, MW-on-K8s, serviceops
kamila changed the status of T357309: Create a deployment for `shellbox-timedmedia` from Open to In Progress.
Feb 21 2024, 10:42 AM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops

Feb 19 2024

kamila created T357907: Migrate remaining internal MW API traffic to k8s.
Feb 19 2024, 2:19 PM · MW-on-K8s, serviceops

Feb 12 2024

kamila claimed T357309: Create a deployment for `shellbox-timedmedia`.
Feb 12 2024, 6:15 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops

Feb 7 2024

kamila updated the task description for T356877: Increase visibility of kubernetes network status.
Feb 7 2024, 3:09 PM · observability, Prod-Kubernetes, serviceops, Kubernetes
kamila updated the task description for T356877: Increase visibility of kubernetes network status.
Feb 7 2024, 3:01 PM · observability, Prod-Kubernetes, serviceops, Kubernetes
kamila created T356877: Increase visibility of kubernetes network status.
Feb 7 2024, 2:59 PM · observability, Prod-Kubernetes, serviceops, Kubernetes

Feb 6 2024

kamila added a comment to T356709: Debian installer waits for input for network config during host reimage.

mw1386 seems to be going fine now, so yes, we can close this. Sorry and thanks for finding the cause @Volans <3

Feb 6 2024, 12:23 PM · Infrastructure-Foundations, serviceops

Feb 5 2024

kamila created T356709: Debian installer waits for input for network config during host reimage.
Feb 5 2024, 9:45 PM · Infrastructure-Foundations, serviceops

Feb 1 2024

kamila created P56055 POC benthos creating k8s jobs from kafka.
Feb 1 2024, 11:49 AM

Jan 23 2024

kamila closed T355187: Reimage cookbook fails to downtime hosts when run concurrently as Resolved.

I believe the above patch fixed it, so I'm closing this. I will reopen in case I see the race again.

Jan 23 2024, 12:29 PM · SRE-tools, Infrastructure-Foundations

Jan 17 2024

kamila closed T350846: Migrate mobileapps to k8s as Resolved.

All traffic is now going to k8s \o/

Jan 17 2024, 5:30 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
kamila closed T350846: Migrate mobileapps to k8s, a subtask of T333120: Migrate internal traffic to k8s, as Resolved.
Jan 17 2024, 5:30 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s

Jan 16 2024

kamila created T355187: Reimage cookbook fails to downtime hosts when run concurrently.
Jan 16 2024, 8:20 PM · SRE-tools, Infrastructure-Foundations

Jan 15 2024

kamila awarded T353836: RequestTimeoutException when attempting to visit user page of non-existent user a Love token.
Jan 15 2024, 1:28 PM · SecTeam-Processed, Performance Issue, MediaWiki-Blocks, MediaWiki-Engineering, Security, Wikimedia-production-error

Jan 10 2024

kamila closed T354413: Reboot issues for mw13[77-83].eqiad.wmnet as Resolved.

With Alex's patches, I ran 7 reimages and 20 reboots without the issue reappearing. It might be worthwhile to understand the issue better to see if that workaround is adequate, but as it appears to not be blocking us anymore, I'm closing this.

Jan 10 2024, 6:49 PM · serviceops, MW-on-K8s
kamila closed T354413: Reboot issues for mw13[77-83].eqiad.wmnet, a subtask of T351074: Move servers from the appserver/api cluster to kubernetes, as Resolved.
Jan 10 2024, 6:48 PM · serviceops, MW-on-K8s

Jan 8 2024

kamila updated the task description for T354413: Reboot issues for mw13[77-83].eqiad.wmnet.
Jan 8 2024, 5:42 PM · serviceops, MW-on-K8s
kamila triaged T354413: Reboot issues for mw13[77-83].eqiad.wmnet as High priority.
Jan 8 2024, 5:41 PM · serviceops, MW-on-K8s
kamila added a comment to T354413: Reboot issues for mw13[77-83].eqiad.wmnet.

Note a few other things that have been tried:

Jan 8 2024, 5:38 PM · serviceops, MW-on-K8s

Jan 5 2024

kamila added a comment to T354413: Reboot issues for mw13[77-83].eqiad.wmnet.

Note that this is non-deterministic: the problem seems to happen more than half the time but far from always. So several reboots may be required to reproduce. Yay!

Jan 5 2024, 5:07 PM · serviceops, MW-on-K8s
kamila added a comment to T354413: Reboot issues for mw13[77-83].eqiad.wmnet.

Is this reproducible with every reboot or just some? One thing worth doing is to connect to the serial console and then issue a reboot over Cumin. Maybe we're seeing a kernel oops during the system teardown?

Jan 5 2024, 12:24 PM · serviceops, MW-on-K8s
kamila added a comment to T354413: Reboot issues for mw13[77-83].eqiad.wmnet.

Additional findings:

  • the "watchdog: watchdog0: watchdog did not stop!" message seems to be a red herring; it's always there
  • the problem only occurs when running "nohup reboot" (which is what the cookbooks do), not when I run just "reboot"
  • it cannot be reproduced on an insetup system with overlayfs in use
Jan 5 2024, 12:22 PM · serviceops, MW-on-K8s
kamila added a comment to T354413: Reboot issues for mw13[77-83].eqiad.wmnet.

Note that we have tried updating the firmware: mw1388 is on new UEFI and iDRAC and still exhibits this problem.

Jan 5 2024, 11:43 AM · serviceops, MW-on-K8s