BTullis (Ben)
Senior SRE

User Details

User Since: Jun 29 2021, 9:56 AM (159 w, 3 d)
Availability: Available
IRC Nick: btullis
LDAP User: Btullis
MediaWiki User: BTullis (WMF)

Recent Activity

Today

BTullis added a comment to T365839: Deploy an instance of GrowthBook to Kubernetes.

Hmm. I tried patching the growthbook front-end code with this gist: https://gist.github.com/zicklag/1bb50db6c5138de347c224fda14286da
...with the intention of using setGlobalDispatcher if the http_proxy environment variable is set.

Fri, Jul 19, 5:18 PM · Patch-For-Review, Data-Platform-SRE (2024.07.08 - 2024.07.28), Data Products
BTullis added a comment to T365839: Deploy an instance of GrowthBook to Kubernetes.

Same issue, unfortunately. I also found this conversation: https://github.com/vercel/next.js/discussions/44959 which mentions being unable to prevent the downloading.

Fri, Jul 19, 4:29 PM · Patch-For-Review, Data-Platform-SRE (2024.07.08 - 2024.07.28), Data Products
BTullis added a comment to T365839: Deploy an instance of GrowthBook to Kubernetes.

I'm not sure if it will work, because the message still says Downloading swc... but then it says SWC is disabled.

image.png (226×1 px, 61 KB)

I won't know until I merge to main, anyway.

Fri, Jul 19, 4:19 PM · Patch-For-Review, Data-Platform-SRE (2024.07.08 - 2024.07.28), Data Products
BTullis added a comment to T365839: Deploy an instance of GrowthBook to Kubernetes.

A search for Next.js download SWC (from download-swc.js in that stacktrace you provided) yielded: https://nextjs.org/docs/messages/failed-loading-swc

Quoting that page:

If SWC continues to fail to load you can opt-out by disabling swcMinify in your next.config.js or by adding a .babelrc to your project with the following content:

.babelrc
{
  "presets": ["next/babel"]
}
Fri, Jul 19, 3:55 PM · Patch-For-Review, Data-Platform-SRE (2024.07.08 - 2024.07.28), Data Products
BTullis added a comment to T365839: Deploy an instance of GrowthBook to Kubernetes.

There is a slight problem with building the Growthbook image.

Fri, Jul 19, 3:16 PM · Patch-For-Review, Data-Platform-SRE (2024.07.08 - 2024.07.28), Data Products
BTullis updated the task description for T365839: Deploy an instance of GrowthBook to Kubernetes.
Fri, Jul 19, 11:06 AM · Patch-For-Review, Data-Platform-SRE (2024.07.08 - 2024.07.28), Data Products
BTullis updated the task description for T365839: Deploy an instance of GrowthBook to Kubernetes.
Fri, Jul 19, 9:18 AM · Patch-For-Review, Data-Platform-SRE (2024.07.08 - 2024.07.28), Data Products

Yesterday

BTullis updated the task description for T365839: Deploy an instance of GrowthBook to Kubernetes.
Thu, Jul 18, 5:06 PM · Patch-For-Review, Data-Platform-SRE (2024.07.08 - 2024.07.28), Data Products
BTullis closed T369278: MapReduce history server is repeatedly crashing as Resolved.

That change worked as expected to change the heap size.

image.png (471×1 px, 147 KB)

This panel confirms the size change.
image.png (922×1 px, 56 KB)

Thu, Jul 18, 3:21 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28)
BTullis added a comment to T365839: Deploy an instance of GrowthBook to Kubernetes.

I've now got the ferretdb image ready as well. I will submit requests to add both projects to the trusted runners, then start working on the helm chart for it.

Thu, Jul 18, 2:03 PM · Patch-For-Review, Data-Platform-SRE (2024.07.08 - 2024.07.28), Data Products
BTullis moved T369278: MapReduce history server is repeatedly crashing from Needs Review to To Be Deployed on the Data-Platform-SRE (2024.07.08 - 2024.07.28) board.
Thu, Jul 18, 2:00 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28)
BTullis moved T369278: MapReduce history server is repeatedly crashing from In Progress to Needs Review on the Data-Platform-SRE (2024.07.08 - 2024.07.28) board.
Thu, Jul 18, 11:37 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28)
BTullis closed T370392: an-launcher1002 is short of disk space on /srv as Resolved.
Thu, Jul 18, 8:59 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28)
BTullis added a comment to T370392: an-launcher1002 is short of disk space on /srv.
btullis@an-launcher1002:~$ sudo lvresize -l+100%FREE vg0/srv
  Size of logical volume vg0/srv changed from 74.27 GiB (19014 extents) to <111.07 GiB (28433 extents).
  Logical volume vg0/srv successfully resized.
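
As a rough cross-check of the sizes reported by lvresize above, the extent counts can be converted back to GiB. This is only a sketch: it assumes a 4 MiB physical extent size (the LVM default, not stated in the output), and the function name is illustrative.

```python
# Assumption: vg0 uses the LVM default physical extent size of 4 MiB.
EXTENT_MIB = 4


def extents_to_gib(extents: int, extent_mib: int = EXTENT_MIB) -> float:
    """Convert a count of LVM extents to a size in GiB."""
    return extents * extent_mib / 1024


# Extent counts copied from the lvresize output above.
before = extents_to_gib(19014)
after = extents_to_gib(28433)

if __name__ == "__main__":
    print(f"{before:.2f} GiB -> {after:.2f} GiB (+{after - before:.2f} GiB)")
```

Under that assumption the converted values agree with the 74.27 GiB and <111.07 GiB figures in the output, which suggests the extent size guess is right for this volume group.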
Thu, Jul 18, 8:59 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28)
BTullis added a comment to T370392: an-launcher1002 is short of disk space on /srv.

There is 37 GB of space free on the vg0 volume group.

btullis@an-launcher1002:~$ sudo vgs
  VG  #PV #LV #SN Attr   VSize   VFree 
  vg0   1   3   0 wz--n- 185.90g 37.18g

We can allocate this space to /srv to help deal with the issue.

Thu, Jul 18, 8:52 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28)
BTullis triaged T370392: an-launcher1002 is short of disk space on /srv as High priority.
Thu, Jul 18, 8:48 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28)
BTullis created T370392: an-launcher1002 is short of disk space on /srv.
Thu, Jul 18, 8:47 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28)

Wed, Jul 17

BTullis added a comment to T370122: dbstore1008:3317 (s7) crashed.

@BTullis most likely, as I forgot this host has special grants yeah. If you can apply them, that'd be good. If not, I can do it first thing tomorrow morning.

Thanks for the heads up

No problem. Thanks. I'll do it now.

Wed, Jul 17, 3:47 PM · Data-Engineering, DBA
BTullis added a comment to T370122: dbstore1008:3317 (s7) crashed.

@Marostegui - is there a chance that you might have missed some of the grants after re-cloning s7 yesterday?
We have had a sqoop failure with a message about access being denied for the research user.

Access denied for user 'research'@'10.64.21.109'

I am happy to re-apply them if needed, but it would be useful to know what process you followed. Thanks.

Wed, Jul 17, 3:35 PM · Data-Engineering, DBA
BTullis added a comment to T369278: MapReduce history server is repeatedly crashing.

Looking into this, it seems that we don't modify the amount of Java heap available to the mapreduce history server, so it picks up the default value of 1 GB.

image.png (152×1 px, 51 KB)

Wed, Jul 17, 1:45 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28)
BTullis moved T368518: decommission clouddb1021 from In Progress to Blocked / Waiting on the Data-Platform-SRE (2024.07.08 - 2024.07.28) board.

I have stopped all of the sections with the following command and confirmed a clean shutdown.

btullis@clouddb1021:~$ for i in $(seq 1 8); do sudo systemctl stop mariadb@s$i ; done
btullis@clouddb1021:~$ for i in $(seq 1 8); do systemctl status mariadb@s$i ; done
● mariadb@s1.service - mariadb database server
     Loaded: loaded (/lib/systemd/system/mariadb@.service; disabled; vendor preset: enabled)
     Active: inactive (dead)
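
The status loop above prints one block per section, so a stopped state has to be confirmed by eye. A minimal sketch of an automated check follows. It assumes the sections are s1 to s8 as in the loop, and it relies on `systemctl is-active` printing the unit state on stdout; the helper names are hypothetical.

```python
import subprocess
from typing import List


def section_units(first: int = 1, last: int = 8) -> List[str]:
    """Build the templated unit names for the MariaDB sections s<first>..s<last>."""
    return [f"mariadb@s{i}.service" for i in range(first, last + 1)]


def is_inactive(unit: str) -> bool:
    """Return True if systemd reports the unit as inactive.

    `systemctl is-active` exits non-zero for stopped units, so the exit
    status is deliberately not checked; only the printed state is used.
    """
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip() == "inactive"


def not_stopped(units: List[str]) -> List[str]:
    """List the units that are not reported as inactive."""
    return [unit for unit in units if not is_inactive(unit)]


if __name__ == "__main__":
    leftover = not_stopped(section_units())
    print("all sections stopped" if not leftover else f"not stopped: {leftover}")
```

Note that a unit in the failed state would also be listed as not stopped, which is probably the desired behaviour when confirming a clean shutdown.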
Wed, Jul 17, 8:49 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28), decommission-hardware

Tue, Jul 16

BTullis claimed T368518: decommission clouddb1021.
Tue, Jul 16, 3:48 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), decommission-hardware
BTullis moved T368518: decommission clouddb1021 from Backlog to In Progress on the Data-Platform-SRE (2024.07.08 - 2024.07.28) board.
Tue, Jul 16, 3:48 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), decommission-hardware
BTullis renamed T365449: Upgrade Airflow to 2.9.3 from Upgrade Airflow to 2.9.2 to Upgrade Airflow to 2.9.3.
Tue, Jul 16, 3:35 PM · Release-Engineering-Team (Seen), Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review, Data Pipelines, Data-Engineering
BTullis added a comment to T365424: Upgrade clouddb* hosts to Bookworm.

@BTullis I think you can proceed with your test and turn off all sections for a week. When that is done and you are confident nothing goes wrong as a result, I will proceed with the reimage. After the reimage is done, we can decommission it.

Tue, Jul 16, 10:42 AM · cloud-services-team (FY2023/2024-Q3-Q4), Data-Persistence, Data-Services
BTullis added a comment to T370122: dbstore1008:3317 (s7) crashed.

@Marostegui - please feel free to go ahead and re-clone it. Now-ish is a relatively good time of the month to do it.

Tue, Jul 16, 8:26 AM · Data-Engineering, DBA

Mon, Jul 15

BTullis added a comment to T370050: Some Wikidata + MediaInfo dumps missing for week of 2024-07-08.

I can give a status update here, which I hope will be useful.

Mon, Jul 15, 5:15 PM · Wikidata Dev Team (Wikidata.org Slice), Data-Platform, wmde-wikidata-tech, Wikidata, Dumps-Generation
BTullis updated subscribers of T365839: Deploy an instance of GrowthBook to Kubernetes.

I have created an initial blubber/kokkuri pipeline for building our own version of Growthbook, based on the upstream Dockerfile.

Mon, Jul 15, 4:42 PM · Patch-For-Review, Data-Platform-SRE (2024.07.08 - 2024.07.28), Data Products
BTullis updated the task description for T365839: Deploy an instance of GrowthBook to Kubernetes.
Mon, Jul 15, 1:58 PM · Patch-For-Review, Data-Platform-SRE (2024.07.08 - 2024.07.28), Data Products
BTullis moved T368757: Create a git-sync container image to be used with airflow from Backlog to In Progress on the Data-Platform-SRE (2024.07.08 - 2024.07.28) board.
Mon, Jul 15, 1:36 PM · Patch-For-Review, Data-Platform-SRE (2024.07.08 - 2024.07.28)
BTullis edited projects for T351731: Turnilo: invalid transforms on wmf_netflow dashboard, added: Data-Platform-SRE; removed Data-Engineering.

The problem appears to have come back after I did a full restart of the Druid cluster on July 11th.
This is how the list of dimensions appears when the problem is in effect.

turnilo-before.png (808×683 px, 43 KB)

This is how it should appear:
turnilo-after.png (887×658 px, 46 KB)

It was fixed by doing a systemctl restart turnilo.service on an-tool1007.

Mon, Jul 15, 1:15 PM · Data-Platform-SRE
BTullis updated subscribers of T362529: Create a Wikimedians of United Arab Emirates User Group Wiki.

@Zabe it seems we were missing the "storage layer" task we usually get. Anyway, this is done on our side.

@BTullis @fnegri remains the views creation. Note, I'm not sure how to take account of an-redacteddb1001 in that whole procedure. Please let me know if I have to integrate it somewhere.

Mon, Jul 15, 9:44 AM · MW-1.43-notes (1.43.0-wmf.15; 2024-07-23), Data-Services, Patch-For-Review, Wiki-Setup (Create)
BTullis moved T369278: MapReduce history server is repeatedly crashing from Backlog to In Progress on the Data-Platform-SRE (2024.07.08 - 2024.07.28) board.
Mon, Jul 15, 9:02 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28)
BTullis reopened T369278: MapReduce history server is repeatedly crashing as "Open".

Reopening, as we have seen this happening again.

Mon, Jul 15, 9:02 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28)
BTullis added a comment to T365424: Upgrade clouddb* hosts to Bookworm.

@fnegri before clouddb1021 gets decommissioned - it could be a good test for you to reimage it to bookworm and see what the process could look like for the rest of the hosts. cc @BTullis

Mon, Jul 15, 8:48 AM · cloud-services-team (FY2023/2024-Q3-Q4), Data-Persistence, Data-Services
BTullis added a comment to T365453: Bring an-redacteddb1001 into service to replace clouddb1021.

@BTullis we probably need to add an-redacteddb1001 to hieradata/regex.yaml somewhere.

Mon, Jul 15, 8:41 AM · Patch-For-Review, Data-Services, Data-Persistence, Data-Platform-SRE (2024.06.17 - 2024.07.07)

Thu, Jul 11

BTullis closed T365503: Upgrade mariadb on analytics_meta from 10.4 to 10.6 as Resolved.

I believe that this is all done now. I have swapped the roles back, so that an-mariadb1001 is the active master and an-mariadb1002 is the replica.
The backup host, db1208, is set to replicate from an-mariadb1001 again.

Thu, Jul 11, 10:57 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review

Wed, Jul 10

BTullis claimed T369737: Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic.

Hi @Kappakayala - Is the data present now, or is it still absent? i.e. Is there an ongoing incident, or is this a retrospective investigation?

Wed, Jul 10, 4:22 PM · Observability-Metrics, Sustainability (Incident Followup), Data-Engineering
BTullis added a comment to T351731: Turnilo: invalid transforms on wmf_netflow dashboard.

Thanks @ayounsi and @elukey for handling this.
Not sure if it is related, but I did a full restart (not rolling) today of both druid clusters, because of a role-swap of an-mariadb100[1-2] servers. The process was here: T365503#9965173

Wed, Jul 10, 3:55 PM · Data-Platform-SRE
BTullis closed T369116: an-presto1004 has reduced total memory size as Resolved.

Maybe the cold boot did help, after all. I just checked and the total memory is shown as 128 GB again.

image.png (921×1 px, 114 KB)

Wed, Jul 10, 3:46 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28)
BTullis added a comment to T365503: Upgrade mariadb on analytics_meta from 10.4 to 10.6.

I have now reimaged an-mariadb1001 and it has installed mariadb 10.6.

btullis@an-mariadb1001:~$ sudo mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 491
Server version: 10.6.18-MariaDB-log MariaDB Server

I'll leave this ticket open until after tomorrow's failback from an-mariadb1002.

Wed, Jul 10, 2:46 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review
BTullis updated the task description for T365503: Upgrade mariadb on analytics_meta from 10.4 to 10.6.
Wed, Jul 10, 2:43 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review
BTullis updated subscribers of T365839: Deploy an instance of GrowthBook to Kubernetes.
Wed, Jul 10, 2:37 PM · Patch-For-Review, Data-Platform-SRE (2024.07.08 - 2024.07.28), Data Products
BTullis added a comment to T365839: Deploy an instance of GrowthBook to Kubernetes.

Ok, it seems that we have an issue because of T213996: New MongoDB version is not DFSG-compatible, dropped by Debian.
Namely that MongoDB is now SSPL licensed, which means that we can't run it in production at the moment.

Wed, Jul 10, 2:36 PM · Patch-For-Review, Data-Platform-SRE (2024.07.08 - 2024.07.28), Data Products
BTullis added a comment to T368098: Dumps generation without prefetch cause disruption to the production environment.

This is the natural schedule of when the dumps that were disabled will run next.

btullis@snapshot1017:~$ systemctl list-timers | grep n/a
Wed 2024-07-10 23:00:00 UTC 9h left             n/a                         n/a                wikidatardf-truthy-dumps.timer                    wikidatardf-truthy-dumps.service
Thu 2024-07-11 05:00:00 UTC 15h left            n/a                         n/a                categoriesrdf-dump-daily.timer                    categoriesrdf-dump-daily.service
Fri 2024-07-12 09:10:00 UTC 1 day 19h left      n/a                         n/a                xlation-dumps.timer                               xlation-dumps.service
Fri 2024-07-12 23:00:00 UTC 2 days left         n/a                         n/a                wikidatardf-lexemes-dumps.timer                   wikidatardf-lexemes-dumps.service
Sat 2024-07-13 08:15:00 UTC 2 days left         n/a                         n/a                global_blocks_dump.timer                          global_blocks_dump.service
Sat 2024-07-13 08:15:00 UTC 2 days left         n/a                         n/a                growth_mentorship_dump.timer                      growth_mentorship_dump.service
Sat 2024-07-13 20:00:00 UTC 3 days left         n/a                         n/a                categoriesrdf-dump.timer                          categoriesrdf-dump.service
Sun 2024-07-14 19:00:00 UTC 4 days left         n/a                         n/a                commonsrdf-dump.timer                             commonsrdf-dump.service
Mon 2024-07-15 03:15:00 UTC 4 days left         n/a                         n/a                commonsjson-dump.timer                            commonsjson-dump.service
Mon 2024-07-15 03:15:00 UTC 4 days left         n/a                         n/a                wikidatajson-dump.timer                           wikidatajson-dump.service
Mon 2024-07-15 16:15:00 UTC 5 days left         n/a                         n/a                cirrussearch-dump-s1.timer                        cirrussearch-dump-s1.service
Mon 2024-07-15 16:15:00 UTC 5 days left         n/a                         n/a                cirrussearch-dump-s11.timer                       cirrussearch-dump-s11.service
Mon 2024-07-15 16:15:00 UTC 5 days left         n/a                         n/a                cirrussearch-dump-s2.timer                        cirrussearch-dump-s2.service
Mon 2024-07-15 16:15:00 UTC 5 days left         n/a                         n/a                cirrussearch-dump-s3.timer                        cirrussearch-dump-s3.service
Mon 2024-07-15 16:15:00 UTC 5 days left         n/a                         n/a                cirrussearch-dump-s4.timer                        cirrussearch-dump-s4.service
Mon 2024-07-15 16:15:00 UTC 5 days left         n/a                         n/a                cirrussearch-dump-s5.timer                        cirrussearch-dump-s5.service
Mon 2024-07-15 16:15:00 UTC 5 days left         n/a                         n/a                cirrussearch-dump-s6.timer                        cirrussearch-dump-s6.service
Mon 2024-07-15 16:15:00 UTC 5 days left         n/a                         n/a                cirrussearch-dump-s7.timer                        cirrussearch-dump-s7.service
Mon 2024-07-15 16:15:00 UTC 5 days left         n/a                         n/a                cirrussearch-dump-s8.timer                        cirrussearch-dump-s8.service
Mon 2024-07-15 23:00:00 UTC 5 days left         n/a                         n/a                wikidatardf-all-dumps.timer                       wikidatardf-all-dumps.service
Wed 2024-07-17 03:15:00 UTC 6 days left         n/a                         n/a                wikidatajson-lexemes-dump.timer                   wikidatajson-lexemes-dump.service

We could see if any of them should be triggered manually, or we could just wait for them to start by themselves.
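
To help decide which dumps might be worth triggering manually, the listing above can be filtered by next run time. The sketch below assumes the column layout shown (weekday, date, time, zone first, timer unit second from last) and that every timestamp is in UTC; the function names and the cutoff are illustrative only.

```python
from datetime import datetime, timezone
from typing import List, Tuple


def parse_timer_line(line: str) -> Tuple[datetime, str]:
    """Extract (next run time, timer unit) from one `systemctl list-timers` row."""
    fields = line.split()
    # fields: weekday, date, time, zone, ..., timer unit, service unit
    next_run = datetime.strptime(
        f"{fields[1]} {fields[2]}", "%Y-%m-%d %H:%M:%S"
    ).replace(tzinfo=timezone.utc)  # assumption: zone column is always UTC
    return next_run, fields[-2]


def due_before(listing: str, cutoff: datetime) -> List[str]:
    """Return the timer units whose next run is at or before the cutoff."""
    rows = [parse_timer_line(l) for l in listing.splitlines() if l.strip()]
    return [timer for next_run, timer in rows if next_run <= cutoff]


# Two rows copied from the listing above (whitespace condensed).
SAMPLE = """\
Wed 2024-07-10 23:00:00 UTC 9h left n/a n/a wikidatardf-truthy-dumps.timer wikidatardf-truthy-dumps.service
Mon 2024-07-15 03:15:00 UTC 4 days left n/a n/a wikidatajson-dump.timer wikidatajson-dump.service
"""

if __name__ == "__main__":
    cutoff = datetime(2024, 7, 12, tzinfo=timezone.utc)
    print(due_before(SAMPLE, cutoff))
```

Timers that fall after the chosen cutoff would be the candidates for a manual start, and the rest could be left to run on their natural schedule.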

Wed, Jul 10, 1:59 PM · Data Products (Data Products Sprint 16), MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Patch-For-Review, Data-Engineering, Dumps-Generation, SRE
BTullis added a comment to T368098: Dumps generation without prefetch cause disruption to the production environment.

These timers were inactive and are going to be activated by this change:

n/a                         n/a                 Mon 2024-07-08 05:00:00 UTC 2 days ago         categoriesrdf-dump-daily.timer                    categoriesrdf-dump-daily.service
n/a                         n/a                 Sat 2024-07-06 20:00:00 UTC 3 days ago         categoriesrdf-dump.timer                          categoriesrdf-dump.service
n/a                         n/a                 Mon 2024-07-01 16:15:00 UTC 1 weeks 1 days ago cirrussearch-dump-s1.timer                        cirrussearch-dump-s1.service
n/a                         n/a                 Mon 2024-07-01 16:15:00 UTC 1 weeks 1 days ago cirrussearch-dump-s11.timer                       cirrussearch-dump-s11.service
n/a                         n/a                 Mon 2024-07-01 16:15:00 UTC 1 weeks 1 days ago cirrussearch-dump-s2.timer                        cirrussearch-dump-s2.service
n/a                         n/a                 Mon 2024-07-01 16:15:00 UTC 1 weeks 1 days ago cirrussearch-dump-s3.timer                        cirrussearch-dump-s3.service
n/a                         n/a                 Mon 2024-07-01 16:15:00 UTC 1 weeks 1 days ago cirrussearch-dump-s4.timer                        cirrussearch-dump-s4.service
n/a                         n/a                 Mon 2024-07-01 16:15:00 UTC 1 weeks 1 days ago cirrussearch-dump-s5.timer                        cirrussearch-dump-s5.service
n/a                         n/a                 Mon 2024-07-01 16:15:00 UTC 1 weeks 1 days ago cirrussearch-dump-s6.timer                        cirrussearch-dump-s6.service
n/a                         n/a                 Mon 2024-07-01 16:15:00 UTC 1 weeks 1 days ago cirrussearch-dump-s7.timer                        cirrussearch-dump-s7.service
n/a                         n/a                 Mon 2024-07-01 16:15:00 UTC 1 weeks 1 days ago cirrussearch-dump-s8.timer                        cirrussearch-dump-s8.service
Wed, Jul 10, 1:46 PM · Data Products (Data Products Sprint 16), MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Patch-For-Review, Data-Engineering, Dumps-Generation, SRE
BTullis added a comment to T365503: Upgrade mariadb on analytics_meta from 10.4 to 10.6.

The role swap is done. There were a few issues, but nothing major.

Wed, Jul 10, 1:00 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review
BTullis added a comment to T365839: Deploy an instance of GrowthBook to Kubernetes.

Growthbook requires a MongoDB database to use for its metadata and experiment results.

Wed, Jul 10, 11:49 AM · Patch-For-Review, Data-Platform-SRE (2024.07.08 - 2024.07.28), Data Products
BTullis claimed T365839: Deploy an instance of GrowthBook to Kubernetes.
Wed, Jul 10, 9:01 AM · Patch-For-Review, Data-Platform-SRE (2024.07.08 - 2024.07.28), Data Products

Tue, Jul 9

BTullis claimed T369634: Decide how to do DAG logging on dse-k8s.

I'm going to assign it to myself, while I do a bit of research, if that's OK.

Tue, Jul 9, 6:53 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28)
BTullis added a comment to T369634: Decide how to do DAG logging on dse-k8s.

This is an extremely pertinent question. Thanks @bking for creating the ticket. It's got me thinking.

Tue, Jul 9, 6:52 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28)
BTullis triaged T369612: Remove obsolete databases from analytics_meta MariaDB servers as Low priority.
Tue, Jul 9, 3:47 PM · Data-Platform-SRE
BTullis moved T368033: Design a suitable DAG deployment method from Incoming to 2024.07.08 - 2024.07.28 on the Data-Platform-SRE board.
Tue, Jul 9, 3:42 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Data-Engineering
bking awarded T327259: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s cluster a Yellow Medal token.
Tue, Jul 9, 12:55 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review
BTullis added a comment to T365503: Upgrade mariadb on analytics_meta from 10.4 to 10.6.

I have scheduled a role swap for tomorrow at 10:00 UTC and I will have paused ingestions to the data lake via gobblin an hour beforehand. This is to facilitate the switch upgrade in: T348977 at 14:00 tomorrow.
We will then need to do the reverse operation on the following day, as T365996 will make an-mariadb1002 unavailable.

Tue, Jul 9, 12:15 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review
BTullis renamed T369612: Remove obsolete databases from analytics_meta MariaDB servers from Remove obsolete databases from analytics_meta MariaDB servers. to Remove obsolete databases from analytics_meta MariaDB servers.
Tue, Jul 9, 10:09 AM · Data-Platform-SRE
BTullis created T369612: Remove obsolete databases from analytics_meta MariaDB servers.
Tue, Jul 9, 10:07 AM · Data-Platform-SRE
BTullis moved T363003: Replicate airflow user group structure in LDAP from Backlog to In Progress on the Data-Platform-SRE (2024.07.08 - 2024.07.28) board.
Tue, Jul 9, 12:15 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28)
BTullis claimed T363003: Replicate airflow user group structure in LDAP.
Tue, Jul 9, 12:14 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28)
BTullis updated the task description for T362788: Migrate Airflow to the dse-k8s cluster.
Tue, Jul 9, 12:11 AM · Data-Platform-SRE, Epic
BTullis added a subtask for T362788: Migrate Airflow to the dse-k8s cluster: T369582: Enable prometheus metrics on the cephosd cluster.
Tue, Jul 9, 12:08 AM · Data-Platform-SRE, Epic
BTullis added a parent task for T369582: Enable prometheus metrics on the cephosd cluster: T362788: Migrate Airflow to the dse-k8s cluster.
Tue, Jul 9, 12:08 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28)
BTullis created T369583: Configure availability and health monitoring for the cephosd cluster.
Tue, Jul 9, 12:07 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28)

Mon, Jul 8

BTullis created T369582: Enable prometheus metrics on the cephosd cluster.
Mon, Jul 8, 11:58 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28)
BTullis updated the task description for T362788: Migrate Airflow to the dse-k8s cluster.
Mon, Jul 8, 11:03 PM · Data-Platform-SRE, Epic
BTullis closed T327259: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s cluster as Resolved.

Marking this as resolved.

Mon, Jul 8, 9:38 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review
BTullis closed T327259: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s cluster, a subtask of T364386: Validate postgres operator and Ceph integration, as Resolved.
Mon, Jul 8, 9:38 PM · Data-Platform-SRE, Epic
BTullis updated the task description for T327259: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s cluster.
Mon, Jul 8, 9:37 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review
BTullis added a comment to T327259: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s cluster.

It's working! Here is my test pod with a 1 GB ext4 file system mounted at /var/lib/www/html

root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests get pods
NAME               READY   STATUS    RESTARTS   AGE
csi-rbd-demo-pod   1/1     Running   0          15s

Entering the pod:

root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests exec -it csi-rbd-demo-pod -- bash

Showing the free space in the file system:

root@csi-rbd-demo-pod:/# df -h /var/lib/www/html/
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd0       974M   24K  958M   1% /var/lib/www/html
Mon, Jul 8, 9:37 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review
BTullis awarded P65412 (An Untitled Masterwork) a Baby Tequila token.
Mon, Jul 8, 3:37 PM
BTullis added a comment to T327259: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s cluster.

The file system based tests haven't worked yet.
I tried the following resources.

root@deploy1002:/home/btullis# cat pvc.yaml 
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc
  namespace: btullis-pvc-tests
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-rbd-ssd
root@deploy1002:/home/btullis# cat pod.yaml 
---
apiVersion: v1
kind: Pod
metadata:
  name: csi-rbd-demo-pod
  namespace: btullis-pvc-tests
spec:
  containers:
    - name: do-nothing
      image: docker-registry.discovery.wmnet/bookworm:20240630
      command: ["/bin/sh", "-c"]
      args: ["tail -f /dev/null"]
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: rbd-pvc
        readOnly: false

The pod is stuck in a container-creating state, with the following mount warnings.

root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests describe pod csi-rbd-demo-pod 
Name:         csi-rbd-demo-pod
Namespace:    btullis-pvc-tests
Priority:     0
Node:         dse-k8s-worker1006.eqiad.wmnet/10.64.132.8
Start Time:   Mon, 08 Jul 2024 13:25:52 +0000
Labels:       <none>
Annotations:  kubernetes.io/psp: privileged
Status:       Pending
IP:           
IPs:          <none>
Containers:
  do-nothing:
    Container ID:  
    Image:         docker-registry.discovery.wmnet/bookworm:20240630
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
    Args:
      tail -f /dev/null
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/www/html from mypvc (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2tjs7 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  mypvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  rbd-pvc
    ReadOnly:   false
  kube-api-access-2tjs7:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                    From                     Message
  ----     ------                  ----                   ----                     -------
  Normal   Scheduled               7m50s                  default-scheduler        Successfully assigned btullis-pvc-tests/csi-rbd-demo-pod to dse-k8s-worker1006.eqiad.wmnet
  Normal   SuccessfulAttachVolume  7m50s                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-575c9e33-4305-4da2-8e2e-ec669a637e27"
  Warning  FailedMount             5m50s                  kubelet                  MountVolume.MountDevice failed for volume "pvc-575c9e33-4305-4da2-8e2e-ec669a637e27" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             3m41s (x8 over 5m49s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-575c9e33-4305-4da2-8e2e-ec669a637e27" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0024-6d4278e1-ea45-4d29-86fe-85b44c150813-0000000000000007-82644734-3d2d-11ef-a792-be78068bb159 already exists
  Warning  FailedMount             76s (x3 over 5m48s)    kubelet                  Unable to attach or mount volumes: unmounted volumes=[mypvc], unattached volumes=[mypvc kube-api-access-2tjs7]: timed out waiting for the condition

The pv looks good, so maybe it's something as simple as not having mkfs.ext4 available in the plugin container.

root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests describe pv pvc-575c9e33-4305-4da2-8e2e-ec669a637e27 
Name:            pvc-575c9e33-4305-4da2-8e2e-ec669a637e27
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: rbd.csi.ceph.com
                 volume.kubernetes.io/provisioner-deletion-secret-name: csi-rbd-secret
                 volume.kubernetes.io/provisioner-deletion-secret-namespace: kube-system
Finalizers:      [external-provisioner.volume.kubernetes.io/finalizer kubernetes.io/pv-protection]
StorageClass:    ceph-rbd-ssd
Status:          Bound
Claim:           btullis-pvc-tests/rbd-pvc
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        1Gi
Node Affinity:   <none>
Message:         
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            rbd.csi.ceph.com
    FSType:            ext4
    VolumeHandle:      0001-0024-6d4278e1-ea45-4d29-86fe-85b44c150813-0000000000000007-82644734-3d2d-11ef-a792-be78068bb159
    ReadOnly:          false
    VolumeAttributes:      clusterID=6d4278e1-ea45-4d29-86fe-85b44c150813
                           imageFeatures=layering
                           imageName=csi-vol-82644734-3d2d-11ef-a792-be78068bb159
                           journalPool=dse-k8s-csi-ssd
                           pool=dse-k8s-csi-ssd
                           storage.kubernetes.io/csiProvisionerIdentity=1720432837686-9999-rbd.csi.ceph.com
Events:                <none>

I'll check some more logs.
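
For cross-checking the VolumeHandle against the VolumeAttributes above, here is a minimal sketch that splits the handle into its parts. The layout it assumes (4 hex digits of version, 4 hex digits giving the cluster ID length, the cluster ID, 16 hex digits of pool ID, then the image UUID) is inferred from the values in this output and is not a guaranteed format. The function and class names are hypothetical.

```python
from typing import NamedTuple


class CsiVolumeHandle(NamedTuple):
    version: int
    cluster_id: str
    pool_id: int
    image_uuid: str


def parse_volume_handle(handle: str) -> CsiVolumeHandle:
    """Split a ceph-csi style volume handle into its fields.

    Assumed layout (inferred from the describe output, not authoritative):
    <version:4 hex>-<cluster ID length:4 hex>-<cluster ID>-<pool ID:16 hex>-<UUID>
    """
    version_hex, length_hex, rest = handle.split("-", 2)
    cluster_len = int(length_hex, 16)
    cluster_id = rest[:cluster_len]
    remainder = rest[cluster_len:]
    if not remainder.startswith("-"):
        raise ValueError(f"unexpected handle layout: {handle!r}")
    pool_hex, image_uuid = remainder[1:].split("-", 1)
    return CsiVolumeHandle(
        version=int(version_hex, 16),
        cluster_id=cluster_id,
        pool_id=int(pool_hex, 16),
        image_uuid=image_uuid,
    )


if __name__ == "__main__":
    handle = (
        "0001-0024-6d4278e1-ea45-4d29-86fe-85b44c150813"
        "-0000000000000007-82644734-3d2d-11ef-a792-be78068bb159"
    )
    parsed = parse_volume_handle(handle)
    print(parsed)
    # The image name in VolumeAttributes is "csi-vol-" plus the UUID part.
    print("csi-vol-" + parsed.image_uuid)
```

With the handle above, the cluster ID and image UUID recovered this way should match the clusterID and imageName attributes, which is a quick way to confirm that the PV and the "operation already exists" error refer to the same RBD image.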

Mon, Jul 8, 1:36 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review
BTullis added a comment to T327259: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s cluster.

Deleting the pod and pvc worked as expected.

root@deploy1002:/home/btullis# kubectl delete -f raw-block-pod.yaml 
pod "pod-with-raw-block-volume" deleted
Mon, Jul 8, 1:19 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review
BTullis added a comment to T327259: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s cluster.

This is now working.

Mon, Jul 8, 1:11 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review
BTullis added a comment to T365503: Upgrade mariadb on analytics_meta from 10.4 to 10.6.

I have upgraded an-mariadb1002 to bookworm and that has successfully installed MariaDB 10.6.
Replication seems fine. I now have to prepare a few patches to switch the primary role from an-mariadb1001 to an-mariadb1002 before Wednesday's switch reboot, which is tracked in T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2.

Mon, Jul 8, 1:08 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review
BTullis updated the task description for T365503: Upgrade mariadb on analytics_meta from 10.4 to 10.6.
Mon, Jul 8, 1:05 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review
BTullis closed T369210: NEW BUG REPORT conda-analytics-clone creates environments named with a trailing brace as Resolved.
Mon, Jul 8, 12:13 PM · Data-Platform-SRE (2024.06.17 - 2024.07.07), Data-Platform
BTullis closed T369240: NEW BUG REPORT Update conda-analytics package specifications as Resolved.
Mon, Jul 8, 12:13 PM · Data-Platform-SRE (2024.06.17 - 2024.07.07), Data-Platform
BTullis added a comment to T369240: NEW BUG REPORT Update conda-analytics package specifications.

I believe that this is now complete.

btullis@stat1011:~$ source conda-analytics-activate 2024-07-08T12.10.17_btullis
(2024-07-08T12.10.17_btullis) btullis@stat1011:~$ cat ~/.conda/envs/2024-07-08T12.10.17_btullis/conda-meta/pinned
pyspark=3.1.2
numpy<1.24.0
pandas<2.0.0
pyarrow=9.0.0
jupyter_core=5.5
jupyterhub=1.5.0
jupyterhub-systemdspawner=0.15.0
jupyterhub-ldapauthenticator=1.3.2
jupyterlab_server=2.25
sqlalchemy<2.0
jupyterlab=3.4.8
(2024-07-08T12.10.17_btullis) btullis@stat1011:~$

Please do let me know if you experience any issues with the new environments.
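
For anyone who wants to check the pins in their own environment, here is a minimal sketch that reads a pinned file like the one above into (name, operator, version) tuples. It only handles the simple operator forms that appear in this listing, not the full conda match specification syntax, and the names `Pin` and `parse_pinned` are hypothetical.

```python
import re
from typing import List, NamedTuple


class Pin(NamedTuple):
    name: str
    operator: str
    version: str


# Package name, then a comparison operator, then a version string.
# Longer operators are listed first so "<=" is not read as "<".
PIN_RE = re.compile(r"^([A-Za-z0-9_.\-]+)\s*(<=|>=|==|=|<|>)\s*(\S+)$")


def parse_pinned(text: str) -> List[Pin]:
    """Parse the contents of a conda-meta/pinned file into Pin tuples."""
    pins = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = PIN_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognised pin: {line!r}")
        pins.append(Pin(*match.groups()))
    return pins


if __name__ == "__main__":
    sample = """\
pyspark=3.1.2
numpy<1.24.0
pandas<2.0.0
pyarrow=9.0.0
"""
    for pin in parse_pinned(sample):
        print(f"{pin.name:10s} {pin.operator:2s} {pin.version}")
```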

Mon, Jul 8, 12:13 PM · Data-Platform-SRE (2024.06.17 - 2024.07.07), Data-Platform
BTullis edited projects for T368518: decommission clouddb1021, added: Data-Platform-SRE (2024.06.17 - 2024.07.07); removed Data-Platform-SRE.
Mon, Jul 8, 10:41 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28), decommission-hardware
ABran-WMF awarded T368354: Modify db-mysql to connect to an-redacteddb1001 from cumin hosts a Party Time token.
Mon, Jul 8, 9:59 AM · Patch-For-Review, Data-Services, Data-Persistence, Data-Platform-SRE (2024.06.17 - 2024.07.07)
BTullis closed T368354: Modify db-mysql to connect to an-redacteddb1001 from cumin hosts as Resolved.

OK, thanks all.
I've deployed the updated package to cumin2002 as well now.

btullis@cumin2002:~$ sudo db-mysql an-redacteddb1001.eqiad.wmnet:3311
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 1616808
Server version: 10.6.18-MariaDB MariaDB Server
Mon, Jul 8, 9:57 AM · Patch-For-Review, Data-Services, Data-Persistence, Data-Platform-SRE (2024.06.17 - 2024.07.07)
BTullis closed T368354: Modify db-mysql to connect to an-redacteddb1001 from cumin hosts, a subtask of T365453: Bring an-redacteddb1001 into service to replace clouddb1021, as Resolved.
Mon, Jul 8, 9:55 AM · Patch-For-Review, Data-Services, Data-Persistence, Data-Platform-SRE (2024.06.17 - 2024.07.07)

Sat, Jul 6

BTullis awarded T365155: Text id verification makes dumps skip many good rows a Party Time token.
Sat, Jul 6, 4:49 PM · Data Products (Data Products Sprint 16), Dumps-Generation

Fri, Jul 5

BTullis added a comment to T327259: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s cluster.

I have cleaned up these unmanaged resources so that they don't cause errors.

root@deploy1002:~# kubectl delete -f raw-block-pod.yaml 
pod "pod-with-raw-block-volume" deleted
root@deploy1002:~# kubectl delete -f raw-block-pvc.yaml 
persistentvolumeclaim "raw-block-pvc" deleted

I will leave the btullis-pvc-tests namespace in place, but it is empty again.

root@deploy1002:~# kubectl -n btullis-pvc-tests get all
No resources found in btullis-pvc-tests namespace.
Fri, Jul 5, 5:44 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review
BTullis added a comment to T327259: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s cluster.

I am beginning tests with some very basic resources.

Fri, Jul 5, 5:41 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review
BTullis committed rLPRIb4ed7101852c: Add dummy keytabs for analytics-wmde on stat servers.
Fri, Jul 5, 11:17 AM
BTullis added a comment to T356230: Conda-Analytics packages incompatible with latest versions of Pandas and Numpy.

I'd like to keep this open, mainly for documentation, since it's still true that we can't use the latest versions of Pandas and Numpy because of the package versions in Conda-Analytics.

Fri, Jul 5, 10:08 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Movement-Insights
BTullis closed T369278: MapReduce history server is repeatedly crashing as Resolved.

I haven't seen any further occurrences so I'll close this ticket, but I'll be monitoring for stability.

Fri, Jul 5, 9:23 AM · Data-Platform-SRE (2024.07.08 - 2024.07.28)
BTullis moved T368354: Modify db-mysql to connect to an-redacteddb1001 from cumin hosts from Needs Review to To Be Deployed on the Data-Platform-SRE (2024.06.17 - 2024.07.07) board.
Fri, Jul 5, 9:19 AM · Patch-For-Review, Data-Services, Data-Persistence, Data-Platform-SRE (2024.06.17 - 2024.07.07)
BTullis closed T369127: NEW BUG REPORT conda-analytics-clone fails as Resolved.

This issue is resolved, although we have a new version 0.0.35 of conda-analytics to deploy on Monday, which will fix the trailing brace issue in environment names (T369210).

Fri, Jul 5, 9:17 AM · Data-Platform-SRE (2024.06.17 - 2024.07.07), Data-Platform
BTullis moved T369240: NEW BUG REPORT Update conda-analytics package specifications from Needs Review to To Be Deployed on the Data-Platform-SRE (2024.06.17 - 2024.07.07) board.
Fri, Jul 5, 9:14 AM · Data-Platform-SRE (2024.06.17 - 2024.07.07), Data-Platform
BTullis moved T369210: NEW BUG REPORT conda-analytics-clone creates environments named with a trailing brace from Needs Review to To Be Deployed on the Data-Platform-SRE (2024.06.17 - 2024.07.07) board.
Fri, Jul 5, 9:14 AM · Data-Platform-SRE (2024.06.17 - 2024.07.07), Data-Platform

Thu, Jul 4

BTullis added a comment to T369210: NEW BUG REPORT conda-analytics-clone creates environments named with a trailing brace.

This looks good.

btullis@an-test-client1002:~$ cat .conda/envs/mycoolenv/conda-meta/pinned
pyspark=3.1.2
numpy<1.24.0
pandas<2.0.0
pyarrow=9.0.0
jupyter_core=5.5
jupyterhub=1.5.0
jupyterhub-systemdspawner=0.15.0
jupyterhub-ldapauthenticator=1.3.2
jupyterlab_server=2.25
sqlalchemy<2.0
jupyterlab=3.4.8

So does this:

image.png (884×1 px, 93 KB)

Thu, Jul 4, 5:21 PM · Data-Platform-SRE (2024.06.17 - 2024.07.07), Data-Platform
BTullis added a comment to T369210: NEW BUG REPORT conda-analytics-clone creates environments named with a trailing brace.

I'm testing this now on an-test-client1002.

btullis@an-test-client1002:~$ wget https://gitlab.wikimedia.org/api/v4/projects/359/packages/generic/conda-analytics/0.0.35/conda-analytics-0.0.35_amd64.deb
--2024-07-04 16:31:52--  https://gitlab.wikimedia.org/api/v4/projects/359/packages/generic/conda-analytics/0.0.35/conda-analytics-0.0.35_amd64.deb
Resolving gitlab.wikimedia.org (gitlab.wikimedia.org)... 2620:0:860:1:208:80:153:8, 208.80.153.8
Connecting to gitlab.wikimedia.org (gitlab.wikimedia.org)|2620:0:860:1:208:80:153:8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1080788932 (1.0G) [application/octet-stream]
Saving to: ‘conda-analytics-0.0.35_amd64.deb’
Thu, Jul 4, 4:47 PM · Data-Platform-SRE (2024.06.17 - 2024.07.07), Data-Platform
BTullis updated the task description for T327259: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s cluster.
Thu, Jul 4, 4:31 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review
BTullis added a comment to T327259: Enable the Container Storage Interface (CSI) and the Ceph CSI plugin on dse-k8s cluster.

Significant progress: now that the patch "cephcsi: Grant elevated privileges to the driver-registrar container" is merged, both the nodeplugin daemonset and the provisioner deployment are stable.

root@deploy1002:~# kubectl -n kube-system -l release=ceph-csi-rbd get pods
NAME                                        READY   STATUS    RESTARTS   AGE
ceph-csi-rbd-nodeplugin-6vq2c               2/2     Running   0          5m42s
ceph-csi-rbd-nodeplugin-8jqql               2/2     Running   0          5m40s
ceph-csi-rbd-nodeplugin-ffh5h               2/2     Running   0          5m43s
ceph-csi-rbd-nodeplugin-g8dtw               2/2     Running   0          5m43s
ceph-csi-rbd-nodeplugin-hlr6v               2/2     Running   0          5m43s
ceph-csi-rbd-nodeplugin-pnwpg               2/2     Running   0          5m42s
ceph-csi-rbd-nodeplugin-tjr45               2/2     Running   0          5m43s
ceph-csi-rbd-nodeplugin-xkqdx               2/2     Running   0          5m42s
ceph-csi-rbd-provisioner-6f9fc45549-4vczx   5/5     Running   0          5m43s
ceph-csi-rbd-provisioner-6f9fc45549-cxm7l   5/5     Running   0          5m39s
ceph-csi-rbd-provisioner-6f9fc45549-t4t4m   5/5     Running   0          5m43s

Our storageClass called ceph-rbd-ssd is now available.

root@deploy1002:/srv/deployment-charts/helmfile.d/admin_ng# kubectl get storageclass
NAME           PROVISIONER        RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
ceph-rbd-ssd   rbd.csi.ceph.com   Delete          Immediate           true                   5d23h
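
As an illustration of how a claim against this storage class could be expressed, here is a minimal sketch that builds a PersistentVolumeClaim manifest as JSON. The name, namespace, size and access mode are taken from the `kubectl describe pv` output elsewhere in this feed. The helper `build_pvc_manifest` is hypothetical and this is not necessarily the manifest used in the tests.

```python
import json


def build_pvc_manifest(name, namespace, storage_class, size, volume_mode="Filesystem"):
    """Return a PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteOnce"],  # shown as RWO in kubectl output
            "volumeMode": volume_mode,
            "resources": {"requests": {"storage": size}},
            "storageClassName": storage_class,
        },
    }


if __name__ == "__main__":
    manifest = build_pvc_manifest(
        name="rbd-pvc",
        namespace="btullis-pvc-tests",
        storage_class="ceph-rbd-ssd",
        size="1Gi",
    )
    print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed output could be piped to `kubectl apply -f -`.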
Thu, Jul 4, 4:30 PM · Data-Platform-SRE (2024.07.08 - 2024.07.28), Patch-For-Review
BTullis added a comment to T368354: Modify db-mysql to connect to an-redacteddb1001 from cumin hosts.

OK, I have fixed the build, deleted the botched packages from apt1002 and replaced them.

Thu, Jul 4, 4:10 PM · Patch-For-Review, Data-Services, Data-Persistence, Data-Platform-SRE (2024.06.17 - 2024.07.07)
BTullis committed rOSWDe35bddea72e7: Release version 0.1.5 for bookworm.
Thu, Jul 4, 3:56 PM
BTullis committed rOSWD2bc69de5d64e: Release version 0.1.5 for bullseye.
Thu, Jul 4, 3:56 PM
BTullis committed rOSWDb4f4823fcad1: add badges (authored by ABran-WMF).
Thu, Jul 4, 3:56 PM