User Details
- User Since
- Jun 29 2021, 9:56 AM (159 w, 3 d)
- Availability
- Available
- IRC Nick
- btullis
- LDAP User
- Btullis
- MediaWiki User
- BTullis (WMF)
Today
Hmm. I tried patching the Growthbook front-end code with this gist: https://gist.github.com/zicklag/1bb50db6c5138de347c224fda14286da
...with the intention of using setGlobalDispatcher if the http_proxy environment variable is set.
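For reference, a minimal sketch of how that code path would be exercised at build time. Both the proxy host (our standard egress proxy) and the yarn build command are assumptions here, not the confirmed pipeline setup:

# Hypothetical invocation: export the proxy variables so that the patched
# code detects them and calls setGlobalDispatcher.
http_proxy=http://webproxy.eqiad.wmnet:8080 \
https_proxy=http://webproxy.eqiad.wmnet:8080 \
  yarn build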
Same issue, unfortunately. I also found this conversation: https://github.com/vercel/next.js/discussions/44959 which mentions that the download can't be prevented.
I'm not sure if it will work, because the message still says "Downloading swc..." but then it says that SWC is disabled.
I won't know until I merge to main, anyway.
There is a slight problem with building the Growthbook image.
Yesterday
I've now got the ferretdb image ready as well. I will submit requests to add both projects to the trusted runners, then start working on the helm chart for it.
btullis@an-launcher1002:~$ sudo lvresize -l+100%FREE vg0/srv
  Size of logical volume vg0/srv changed from 74.27 GiB (19014 extents) to <111.07 GiB (28433 extents).
  Logical volume vg0/srv successfully resized.
There is 37 GB of space free on the vg0 volume group.
btullis@an-launcher1002:~$ sudo vgs
  VG  #PV #LV #SN Attr   VSize   VFree
  vg0   1   3   0 wz--n- 185.90g 37.18g
We can allocate this space to /srv to help deal with the issue.
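For the record, a sketch of the commands this involves, assuming /srv is ext4 (the -r flag makes lvresize grow the file system in the same step; resize2fs is the manual alternative):

# Grow the LV into all remaining free space and resize the file system with it.
sudo lvresize -l +100%FREE -r vg0/srv
# Or, if the LV was grown without -r, resize the ext4 file system separately:
sudo resize2fs /dev/vg0/srv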
Wed, Jul 17
No problem. Thanks. I'll do it now.
@Marostegui - is there a chance that you might have missed some of the grants after re-cloning s7 yesterday?
We have had a Sqoop failure with a message about access being denied for the research user.
Access denied for user 'research'@'10.64.21.109'
I am happy to re-apply them if needed, but it would be useful to know what process you followed. Thanks.
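As a starting point, something like this (a sketch, run on the affected replica) would show whether the account exists and what it currently holds:

# List the account definitions for the research user, then inspect the grants
# for whichever host pattern applies (the exact host pattern is an assumption).
sudo mysql -e "SELECT user, host FROM mysql.user WHERE user = 'research';"
sudo mysql -e "SHOW GRANTS FOR 'research'@'10.64.21.109';"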
Looking into this, it seems that we don't modify the amount of Java heap available to the MapReduce history server, so it picks up the default value of 1 GB.
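If we decide to raise it, the stock Hadoop knob lives in mapred-env.sh. A sketch only, since in our case the value would be templated via puppet rather than edited in place:

# /etc/hadoop/conf/mapred-env.sh - heap for the MapReduce JobHistory server, in MB
# (the upstream default is 1000 MB when unset).
export HADOOP_JOB_HISTORYSERVER_HEAPSIZE=4096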
I have stopped all of the sections with the following command and confirmed a clean shutdown.
btullis@clouddb1021:~$ for i in $(seq 1 8); do sudo systemctl stop mariadb@s$i ; done
btullis@clouddb1021:~$ for i in $(seq 1 8); do systemctl status mariadb@s$i ; done
● mariadb@s1.service - mariadb database server
     Loaded: loaded (/lib/systemd/system/mariadb@.service; disabled; vendor preset: enabled)
     Active: inactive (dead)
Tue, Jul 16
@Marostegui - please feel free to go ahead and re-clone it. Now-ish is a relatively good time of the month to do it.
Mon, Jul 15
I can give a status update here, which I hope will be useful.
I have created an initial blubber/kokkuri pipeline for building our own version of Growthbook, based on the upstream Dockerfile.
The problem appears to have come back after I did a full restart of the Druid cluster on July 11th.
This is how the list of dimensions appears when the problem is in effect.
This is how it should appear:
It was fixed by doing a systemctl restart turnilo.service on an-tool1007.
Reopening, as we have seen this happening again.
Thu, Jul 11
I believe that this is all done now. I have swapped the roles back, so that an-mariadb1001 is the active master and an-mariadb1002 is the replica.
The backup host, db1208, is set to replicate from an-mariadb1001 again.
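For completeness, this is the kind of check I'd use to confirm that replication is healthy after a swap like this (a sketch using standard MariaDB output fields, run on the replica):

# Confirm both replication threads are running and that lag is near zero.
sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'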
Wed, Jul 10
Hi @Kappakayala - Is the data present now, or is it still absent? i.e. Is there an ongoing incident, or is this a retrospective investigation?
Thanks @ayounsi and @elukey for handling this.
Not sure if it is related, but I did a full restart (not rolling) of both Druid clusters today, because of a role swap of the an-mariadb100[1-2] servers. The process was here: T365503#9965173
Maybe the cold boot did help, after all. I just checked and the total memory is shown as 128 GB again.
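For anyone repeating the check, this is roughly what I mean (a sketch):

# Total usable memory as seen by the kernel.
free -h | awk '/^Mem:/ {print $2}'
# Per-DIMM detail from the firmware tables, to spot a module that has dropped out.
sudo dmidecode -t memory | grep 'Size:'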
I have now reimaged an-mariadb1001 and it has installed mariadb 10.6.
btullis@an-mariadb1001:~$ sudo mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 491
Server version: 10.6.18-MariaDB-log MariaDB Server
I'll leave this ticket open until after tomorrow's failback from an-mariadb1002.
Ok, it seems that we have an issue because of T213996: New MongoDB version is not DFSG-compatible, dropped by Debian.
Namely that MongoDB is now SSPL licensed, which means that we can't run it in production at the moment.
This is the natural schedule for when the dumps that were disabled will next run.
btullis@snapshot1017:~$ systemctl list-timers | grep n/a
Wed 2024-07-10 23:00:00 UTC 9h left         n/a  n/a  wikidatardf-truthy-dumps.timer   wikidatardf-truthy-dumps.service
Thu 2024-07-11 05:00:00 UTC 15h left        n/a  n/a  categoriesrdf-dump-daily.timer   categoriesrdf-dump-daily.service
Fri 2024-07-12 09:10:00 UTC 1 day 19h left  n/a  n/a  xlation-dumps.timer              xlation-dumps.service
Fri 2024-07-12 23:00:00 UTC 2 days left     n/a  n/a  wikidatardf-lexemes-dumps.timer  wikidatardf-lexemes-dumps.service
Sat 2024-07-13 08:15:00 UTC 2 days left     n/a  n/a  global_blocks_dump.timer         global_blocks_dump.service
Sat 2024-07-13 08:15:00 UTC 2 days left     n/a  n/a  growth_mentorship_dump.timer     growth_mentorship_dump.service
Sat 2024-07-13 20:00:00 UTC 3 days left     n/a  n/a  categoriesrdf-dump.timer         categoriesrdf-dump.service
Sun 2024-07-14 19:00:00 UTC 4 days left     n/a  n/a  commonsrdf-dump.timer            commonsrdf-dump.service
Mon 2024-07-15 03:15:00 UTC 4 days left     n/a  n/a  commonsjson-dump.timer           commonsjson-dump.service
Mon 2024-07-15 03:15:00 UTC 4 days left     n/a  n/a  wikidatajson-dump.timer          wikidatajson-dump.service
Mon 2024-07-15 16:15:00 UTC 5 days left     n/a  n/a  cirrussearch-dump-s1.timer       cirrussearch-dump-s1.service
Mon 2024-07-15 16:15:00 UTC 5 days left     n/a  n/a  cirrussearch-dump-s11.timer      cirrussearch-dump-s11.service
Mon 2024-07-15 16:15:00 UTC 5 days left     n/a  n/a  cirrussearch-dump-s2.timer       cirrussearch-dump-s2.service
Mon 2024-07-15 16:15:00 UTC 5 days left     n/a  n/a  cirrussearch-dump-s3.timer       cirrussearch-dump-s3.service
Mon 2024-07-15 16:15:00 UTC 5 days left     n/a  n/a  cirrussearch-dump-s4.timer       cirrussearch-dump-s4.service
Mon 2024-07-15 16:15:00 UTC 5 days left     n/a  n/a  cirrussearch-dump-s5.timer       cirrussearch-dump-s5.service
Mon 2024-07-15 16:15:00 UTC 5 days left     n/a  n/a  cirrussearch-dump-s6.timer       cirrussearch-dump-s6.service
Mon 2024-07-15 16:15:00 UTC 5 days left     n/a  n/a  cirrussearch-dump-s7.timer       cirrussearch-dump-s7.service
Mon 2024-07-15 16:15:00 UTC 5 days left     n/a  n/a  cirrussearch-dump-s8.timer       cirrussearch-dump-s8.service
Mon 2024-07-15 23:00:00 UTC 5 days left     n/a  n/a  wikidatardf-all-dumps.timer      wikidatardf-all-dumps.service
Wed 2024-07-17 03:15:00 UTC 6 days left     n/a  n/a  wikidatajson-lexemes-dump.timer  wikidatajson-lexemes-dump.service
We could see if any of them should be triggered manually, or we could just wait for them to start by themselves.
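If we do trigger any manually, starting the corresponding service unit (rather than the timer) runs the dump immediately without disturbing the schedule, e.g. with the first unit from the list above:

# Run the truthy RDF dump now; the timer's own schedule is unaffected.
sudo systemctl start wikidatardf-truthy-dumps.service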
These timers were inactive and are going to be activated by this change:
n/a  n/a  Mon 2024-07-08 05:00:00 UTC 2 days ago          categoriesrdf-dump-daily.timer  categoriesrdf-dump-daily.service
n/a  n/a  Sat 2024-07-06 20:00:00 UTC 3 days ago          categoriesrdf-dump.timer        categoriesrdf-dump.service
n/a  n/a  Mon 2024-07-01 16:15:00 UTC 1 weeks 1 days ago  cirrussearch-dump-s1.timer      cirrussearch-dump-s1.service
n/a  n/a  Mon 2024-07-01 16:15:00 UTC 1 weeks 1 days ago  cirrussearch-dump-s11.timer     cirrussearch-dump-s11.service
n/a  n/a  Mon 2024-07-01 16:15:00 UTC 1 weeks 1 days ago  cirrussearch-dump-s2.timer      cirrussearch-dump-s2.service
n/a  n/a  Mon 2024-07-01 16:15:00 UTC 1 weeks 1 days ago  cirrussearch-dump-s3.timer      cirrussearch-dump-s3.service
n/a  n/a  Mon 2024-07-01 16:15:00 UTC 1 weeks 1 days ago  cirrussearch-dump-s4.timer      cirrussearch-dump-s4.service
n/a  n/a  Mon 2024-07-01 16:15:00 UTC 1 weeks 1 days ago  cirrussearch-dump-s5.timer      cirrussearch-dump-s5.service
n/a  n/a  Mon 2024-07-01 16:15:00 UTC 1 weeks 1 days ago  cirrussearch-dump-s6.timer      cirrussearch-dump-s6.service
n/a  n/a  Mon 2024-07-01 16:15:00 UTC 1 weeks 1 days ago  cirrussearch-dump-s7.timer      cirrussearch-dump-s7.service
n/a  n/a  Mon 2024-07-01 16:15:00 UTC 1 weeks 1 days ago  cirrussearch-dump-s8.timer      cirrussearch-dump-s8.service
The role swap is done. There were a few issues, but nothing major.
Growthbook requires a MongoDB database to use for its metadata and experiment results.
Tue, Jul 9
I'm going to assign it to myself, while I do a bit of research, if that's OK.
This is an extremely pertinent question. Thanks @bking for creating the ticket. It's got me thinking.
I have scheduled a role swap for tomorrow at 10:00 UTC, and I will have paused ingestion to the data lake via Gobblin an hour beforehand. This is to facilitate the switch upgrade in T348977 at 14:00 tomorrow.
We will then need to do the reverse operation on the following day, as T365996 will make an-mariadb1002 unavailable.
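For reference, pausing ingestion amounts to stopping the Gobblin timers on the launcher host. A sketch only, since the exact unit names are an assumption here:

# On an-launcher1002: stop puppet from re-enabling the timers, then stop them.
sudo disable-puppet "pausing gobblin ingestion ahead of an-mariadb role swap - btullis"
sudo systemctl stop 'gobblin-*.timer'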
Mon, Jul 8
Marking this as resolved.
It's working! Here is my test pod with a 1 GB ext4 file system mounted at /var/lib/www/html
root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests get pods
NAME               READY   STATUS    RESTARTS   AGE
csi-rbd-demo-pod   1/1     Running   0          15s
Entering the pod:
root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests exec -it csi-rbd-demo-pod -- bash
Showing the free space in the file system:
root@csi-rbd-demo-pod:/# df -h /var/lib/www/html/
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd0       974M   24K  958M   1% /var/lib/www/html
The file system based tests haven't worked yet.
I tried the following resources.
root@deploy1002:/home/btullis# cat pvc.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc
  namespace: btullis-pvc-tests
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-rbd-ssd
root@deploy1002:/home/btullis# cat pod.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: csi-rbd-demo-pod
  namespace: btullis-pvc-tests
spec:
  containers:
    - name: do-nothing
      image: docker-registry.discovery.wmnet/bookworm:20240630
      command: ["/bin/sh", "-c"]
      args: ["tail -f /dev/null"]
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: rbd-pvc
        readOnly: false
The pod is stuck in a ContainerCreating state, with the following mount warnings.
root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests describe pod csi-rbd-demo-pod
Name:         csi-rbd-demo-pod
Namespace:    btullis-pvc-tests
Priority:     0
Node:         dse-k8s-worker1006.eqiad.wmnet/10.64.132.8
Start Time:   Mon, 08 Jul 2024 13:25:52 +0000
Labels:       <none>
Annotations:  kubernetes.io/psp: privileged
Status:       Pending
IP:
IPs:          <none>
Containers:
  do-nothing:
    Container ID:
    Image:         docker-registry.discovery.wmnet/bookworm:20240630
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
    Args:
      tail -f /dev/null
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/www/html from mypvc (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2tjs7 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  mypvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  rbd-pvc
    ReadOnly:   false
  kube-api-access-2tjs7:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                    From                     Message
  ----     ------                  ----                   ----                     -------
  Normal   Scheduled               7m50s                  default-scheduler        Successfully assigned btullis-pvc-tests/csi-rbd-demo-pod to dse-k8s-worker1006.eqiad.wmnet
  Normal   SuccessfulAttachVolume  7m50s                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-575c9e33-4305-4da2-8e2e-ec669a637e27"
  Warning  FailedMount             5m50s                  kubelet                  MountVolume.MountDevice failed for volume "pvc-575c9e33-4305-4da2-8e2e-ec669a637e27" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             3m41s (x8 over 5m49s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-575c9e33-4305-4da2-8e2e-ec669a637e27" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0024-6d4278e1-ea45-4d29-86fe-85b44c150813-0000000000000007-82644734-3d2d-11ef-a792-be78068bb159 already exists
  Warning  FailedMount             76s (x3 over 5m48s)    kubelet                  Unable to attach or mount volumes: unmounted volumes=[mypvc], unattached volumes=[mypvc kube-api-access-2tjs7]: timed out waiting for the condition
The PV looks good, so maybe it's something as simple as mkfs.ext4 not being available in the plugin container.
root@deploy1002:/home/btullis# kubectl -n btullis-pvc-tests describe pv pvc-575c9e33-4305-4da2-8e2e-ec669a637e27
Name:            pvc-575c9e33-4305-4da2-8e2e-ec669a637e27
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: rbd.csi.ceph.com
                 volume.kubernetes.io/provisioner-deletion-secret-name: csi-rbd-secret
                 volume.kubernetes.io/provisioner-deletion-secret-namespace: kube-system
Finalizers:      [external-provisioner.volume.kubernetes.io/finalizer kubernetes.io/pv-protection]
StorageClass:    ceph-rbd-ssd
Status:          Bound
Claim:           btullis-pvc-tests/rbd-pvc
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        1Gi
Node Affinity:   <none>
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            rbd.csi.ceph.com
    FSType:            ext4
    VolumeHandle:      0001-0024-6d4278e1-ea45-4d29-86fe-85b44c150813-0000000000000007-82644734-3d2d-11ef-a792-be78068bb159
    ReadOnly:          false
    VolumeAttributes:  clusterID=6d4278e1-ea45-4d29-86fe-85b44c150813
                       imageFeatures=layering
                       imageName=csi-vol-82644734-3d2d-11ef-a792-be78068bb159
                       journalPool=dse-k8s-csi-ssd
                       pool=dse-k8s-csi-ssd
                       storage.kubernetes.io/csiProvisionerIdentity=1720432837686-9999-rbd.csi.ceph.com
Events:                <none>
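One way to test the mkfs.ext4 theory would be to look inside the plugin container on the affected worker. A sketch only: the pod name below is copied from an earlier listing and will differ, and csi-rbdplugin is the upstream cephcsi container name, assumed to match our chart:

# Check whether the tool used to format new volumes exists in the nodeplugin container.
kubectl -n kube-system exec ceph-csi-rbd-nodeplugin-6vq2c -c csi-rbdplugin -- which mkfs.ext4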
I'll check some more logs.
Deleting the pod and pvc worked as expected.
root@deploy1002:/home/btullis# kubectl delete -f raw-block-pod.yaml
pod "pod-with-raw-block-volume" deleted
This is now working.
I have upgraded an-mariadb1002 to bookworm and that has successfully installed MariaDB 10.6.
Replication seems fine. I now have to prepare a few patches to switch the primary role from an-mariadb1001 to an-mariadb1002 before Wednesday's switch reboot in T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2
I believe that this is now complete.
btullis@stat1011:~$ source conda-analytics-activate 2024-07-08T12.10.17_btullis
(2024-07-08T12.10.17_btullis) btullis@stat1011:~$ cat ~/.conda/envs/2024-07-08T12.10.17_btullis/conda-meta/pinned
pyspark=3.1.2
numpy<1.24.0
pandas<2.0.0
pyarrow=9.0.0
jupyter_core=5.5
jupyterhub=1.5.0
jupyterhub-systemdspawner=0.15.0
jupyterhub-ldapauthenticator=1.3.2
jupyterlab_server=2.25
sqlalchemy<2.0
jupyterlab=3.4.8
(2024-07-08T12.10.17_btullis) btullis@stat1011:~$
Please do let me know if you experience any issues with the new environments.
OK, thanks all.
I've deployed the updated package to cumin2002 as well now.
btullis@cumin2002:~$ sudo db-mysql an-redacteddb1001.eqiad.wmnet:3311
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 1616808
Server version: 10.6.18-MariaDB MariaDB Server
Sat, Jul 6
Fri, Jul 5
I have cleaned up these unmanaged resources so that they don't cause errors.
root@deploy1002:~# kubectl delete -f raw-block-pod.yaml
pod "pod-with-raw-block-volume" deleted
root@deploy1002:~# kubectl delete -f raw-block-pvc.yaml
persistentvolumeclaim "raw-block-pvc" deleted
I will leave the btullis-pvc-tests namespace in place, but it is empty again.
root@deploy1002:~# kubectl -n btullis-pvc-tests get all
No resources found in btullis-pvc-tests namespace.
I am beginning tests with some very basic resources.
I haven't seen any further occurrences, so I'll close this ticket, but I'll keep monitoring for stability.
This issue is resolved, although we have a new version 0.0.35 of conda-analytics to deploy on Monday, which will fix the trailing brace issue in environment names. T369210
Thu, Jul 4
This looks good.
btullis@an-test-client1002:~$ cat .conda/envs/mycoolenv/conda-meta/pinned
pyspark=3.1.2
numpy<1.24.0
pandas<2.0.0
pyarrow=9.0.0
jupyter_core=5.5
jupyterhub=1.5.0
jupyterhub-systemdspawner=0.15.0
jupyterhub-ldapauthenticator=1.3.2
jupyterlab_server=2.25
sqlalchemy<2.0
jupyterlab=3.4.8
So does this:
I'm testing this now on an-test-client1002.
btullis@an-test-client1002:~$ wget https://gitlab.wikimedia.org/api/v4/projects/359/packages/generic/conda-analytics/0.0.35/conda-analytics-0.0.35_amd64.deb
--2024-07-04 16:31:52--  https://gitlab.wikimedia.org/api/v4/projects/359/packages/generic/conda-analytics/0.0.35/conda-analytics-0.0.35_amd64.deb
Resolving gitlab.wikimedia.org (gitlab.wikimedia.org)... 2620:0:860:1:208:80:153:8, 208.80.153.8
Connecting to gitlab.wikimedia.org (gitlab.wikimedia.org)|2620:0:860:1:208:80:153:8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1080788932 (1.0G) [application/octet-stream]
Saving to: ‘conda-analytics-0.0.35_amd64.deb’
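The next step is presumably a manual install of the package on the test host (a sketch, assuming we side-load it with dpkg rather than going through apt):

# Install the freshly downloaded package directly.
sudo dpkg -i conda-analytics-0.0.35_amd64.deb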
Significant progress now: with cephcsi: Grant elevated privileges to the driver-registrar container merged, we now have both the nodeplugin daemonset and the provisioner deployment stable.
root@deploy1002:~# kubectl -n kube-system -l release=ceph-csi-rbd get pods
NAME                                        READY   STATUS    RESTARTS   AGE
ceph-csi-rbd-nodeplugin-6vq2c               2/2     Running   0          5m42s
ceph-csi-rbd-nodeplugin-8jqql               2/2     Running   0          5m40s
ceph-csi-rbd-nodeplugin-ffh5h               2/2     Running   0          5m43s
ceph-csi-rbd-nodeplugin-g8dtw               2/2     Running   0          5m43s
ceph-csi-rbd-nodeplugin-hlr6v               2/2     Running   0          5m43s
ceph-csi-rbd-nodeplugin-pnwpg               2/2     Running   0          5m42s
ceph-csi-rbd-nodeplugin-tjr45               2/2     Running   0          5m43s
ceph-csi-rbd-nodeplugin-xkqdx               2/2     Running   0          5m42s
ceph-csi-rbd-provisioner-6f9fc45549-4vczx   5/5     Running   0          5m43s
ceph-csi-rbd-provisioner-6f9fc45549-cxm7l   5/5     Running   0          5m39s
ceph-csi-rbd-provisioner-6f9fc45549-t4t4m   5/5     Running   0          5m43s
Our storageClass called ceph-rbd-ssd is now available.
root@deploy1002:/srv/deployment-charts/helmfile.d/admin_ng# kubectl get storageclass
NAME           PROVISIONER        RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
ceph-rbd-ssd   rbd.csi.ceph.com   Delete          Immediate           true                   5d23h
OK, I have fixed the build, deleted the botched packages from apt1002 and replaced them.