fgiunchedi (Filippo Giunchedi)
/* No comment */

Projects (17)
View All

Grafana
Tag
netops
Component
observability
Component
Observability-Alerting
Component
ops-codfw
Component

Calendar

User Details

User Since: Oct 3 2014, 8:06 AM (510 w, 6 d)
Availability: Available
IRC Nick: godog
LDAP User: Filippo Giunchedi
MediaWiki User: FGiunchedi (WMF) [ Global Accounts ]

Recent Activity
View All

Yesterday

fgiunchedi added a comment to T351927: Decide and tweak Thanos retention.

In T351927#9994610, @MatthewVernon wrote:

Hi @fgiunchedi and sorry to bug you again (again) about this, but we have a bunch of alerts about thanos backend disk space again.

[relatedly, we're trying to expedite the two new backend servers to get them in this quarter]

Thu, Jul 18, 2:04 PM · User-fgiunchedi, Observability-Metrics

fgiunchedi added a comment to T369825: 10gbit nic option for centrallog1002.

In T369825#9994586, @Papaul wrote:

No problem just give me a heads up on when y'all want to do the changes.

Thu, Jul 18, 2:01 PM · SRE, ops-eqiad, DC-Ops

fgiunchedi added a comment to T370043: occasional OtelCollectorRefusedSpans alert for memory_limiter.

I captured otelcoll traffic on mw2310 (looks like it is always and only otel-coll running on this host that causes problems) in the hopes to understand what's the problematic trace. I've left /root/otelcoll-T370043.cap in place in case it is of interest; I could decode the http2 traffic with wireshark (select traffic with source port 4317 then "decode as" http2) though I couldn't get wireshark to decode the grpc traffic.

Thu, Jul 18, 12:12 PM · Observability-Tracing

fgiunchedi added a comment to T370043: occasional OtelCollectorRefusedSpans alert for memory_limiter.

I'm investigating this again today, and yes there's a single otel-coll pod in codfw that is problematic: https://w.wiki/AhDr

Thu, Jul 18, 9:10 AM · Observability-Tracing

fgiunchedi closed T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic as Resolved.

I'm tentatively resolving this since even though the topic isn't perfectly balanced I'm calling it good enough™. We can revisit as needed

Thu, Jul 18, 7:42 AM · Observability-Logging

fgiunchedi added a comment to T369825: 10gbit nic option for centrallog1002.

Thank you all for the clarification, I'm glad we're able to do the move without re-addressing after all!

Thu, Jul 18, 6:59 AM · SRE, ops-eqiad, DC-Ops

fgiunchedi created T370386: statograph_post errors with out of range float values.

Thu, Jul 18, 6:49 AM · observability

Wed, Jul 17

fgiunchedi added a comment to T369825: 10gbit nic option for centrallog1002.

In T369825#9990558, @VRiley-WMF wrote:

Hey @fgiunchedi Just wanted to verify, since we would have to physically move this server into another rack (and in turn, have to change the IP) this activity is no longer needed, correct? If so, I will close this ticket. Thanks!

Wed, Jul 17, 3:35 PM · SRE, ops-eqiad, DC-Ops

fgiunchedi added a comment to T368513: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts.

So I looked where the probes come from, and they are part of the generic "probe mgmt network hosts for ssh" and data comes from netbox, specifically these bits in modules/profile/manifests/prometheus/ops.pp:

Wed, Jul 17, 1:26 PM · Infrastructure-Foundations, netops, SRE

fgiunchedi removed a project from T370171: wmde-analytics-minutely.service can get stuck (RuntimeMaxSec= has no effect), resulting in missed stats: Grafana.

I'm removing grafana since turns out this issue is unrelated

Wed, Jul 17, 1:02 PM · Wikidata Dev Team (Wikidata.org Slice), Patch-For-Review, Wikidata Analytics, Wikidata

fgiunchedi created T370264: benthos mw-accesslog-metrics interpolation errors.

Wed, Jul 17, 12:46 PM · MW-on-K8s, Observability-Logging, serviceops

fgiunchedi added a comment to T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic.

I have reached out to upstream at https://github.com/redpanda-data/connect/issues/2705 to make them aware, in case they have seen this before. I'm not overly concerned just yet in the sense that IIRC this is the first time we've seen such a failure and it could be a unlucky coincidence

Wed, Jul 17, 12:41 PM · Observability-Logging

fgiunchedi added a comment to T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic.

In T369256#9989598, @cmooney wrote:

Though there didn't seem to be a problem afterwards, the timing makes me think of T365997: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad

Yeah the times in those logs for connection failures to kafka-jumbo1014 exactly match when it was offline due to the switch upgrade. So probably I'm complicating things for everyone as usual :)

Wed, Jul 17, 12:18 PM · Observability-Logging

fgiunchedi added a comment to T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic.

There's similar logs in benthos@webrequest_live on centrallog1002:

Wed, Jul 17, 8:55 AM · Observability-Logging

fgiunchedi added a comment to T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic.

Also this in logs around the time consuming stopped:

Wed, Jul 17, 8:51 AM · Observability-Logging

fgiunchedi added a comment to T369445: VRTS errors arriving to root@.

In T369445#9989260, @fgiunchedi wrote:

In T369445#9989244, @Jelto wrote:

In T369445#9987520, @Arnoldokoth wrote:

I think I found the setting that affects this. It's configured on the VRTS dashboard and was initially set to root@localhost. Updated it to our team's mailing list. Let's see where the next one goes.

I think this option solved the mail issue. As far as I can tell the mails no longer go to root@. The latest mails from VRTS Notifications (with the same error) were sent to our team mail address sre-service-ops-collab@ instead of root@.

root@ got a bunch of these emails e.g. just now

Wed, Jul 17, 8:31 AM · Znuny, vrts, collaboration-services

fgiunchedi added a comment to T369445: VRTS errors arriving to root@.

In T369445#9989244, @Jelto wrote:

In T369445#9987520, @Arnoldokoth wrote:

I think I found the setting that affects this. It's configured on the VRTS dashboard and was initially set to root@localhost. Updated it to our team's mailing list. Let's see where the next one goes.

I think this option solved the mail issue. As far as I can tell the mails no longer go to root@. The latest mails from VRTS Notifications (with the same error) were sent to our team mail address sre-service-ops-collab@ instead of root@.

Wed, Jul 17, 8:29 AM · Znuny, vrts, collaboration-services

Tue, Jul 16

fgiunchedi closed T369826: 10gbit nic option for centrallog2002 as Resolved.

This is done! We went with the procedure I suggested above, namely I took the host side configuration by logging back in via console to adjust the network configuration with the new interface name. @Jhancock.wm did the host physical move by connecting everything to its final location, and @Papaul did the netbox + network part

Tue, Jul 16, 3:33 PM · SRE, DC-Ops, ops-codfw

fgiunchedi closed T369826: 10gbit nic option for centrallog2002, a subtask of T369737: Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic, as Resolved.

Tue, Jul 16, 3:33 PM · Observability-Metrics, Sustainability (Incident Followup), Data-Engineering

fgiunchedi added a comment to T370043: occasional OtelCollectorRefusedSpans alert for memory_limiter.

Perhaps unsurprisingly, the problem is back only on a different pod. Which in my mind excludes a otel-coll problem and points to a client sending troublesome spans

Tue, Jul 16, 2:36 PM · Observability-Tracing

CDanis awarded T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic a Love token.

Tue, Jul 16, 2:28 PM · Observability-Logging

fgiunchedi added a comment to T328502: Move WMCS off of Icinga and introduce alertmanager.

A very low hanging fruit to make progress on this task is the following prometheus-based checks:

Tue, Jul 16, 1:57 PM · cloud-services-team (FY2023/2024-Q3-Q4), Toolforge, Cloud-VPS, Observability-Alerting, Goal

fgiunchedi added a comment to T369826: 10gbit nic option for centrallog2002.

Thank you @Papaul that is quite helpful!

Tue, Jul 16, 1:54 PM · SRE, DC-Ops, ops-codfw

fgiunchedi created T370157: Port lists monitoring alerts to Alertmanager.

Tue, Jul 16, 1:42 PM · SRE Observability (FY2024/2025-Q1), Observability-Alerting

fgiunchedi removed a subtask for T288622: All Prometheus based alerts move from Icinga to alert manager exclusively: T301944: Web interface to navigate Prometheus alerts and their status.

Tue, Jul 16, 1:23 PM · SRE Observability (FY2024/2025-Q1)

fgiunchedi removed a parent task for T301944: Web interface to navigate Prometheus alerts and their status: T288622: All Prometheus based alerts move from Icinga to alert manager exclusively.

Tue, Jul 16, 1:23 PM · User-herron, Observability-Metrics

fgiunchedi removed a parent task for T302639: How should we monitor for faulty memory modules?: T294564: Migrate Foundations Prometheus alerts to AlertManager.

Tue, Jul 16, 1:23 PM · SRE Observability, Infrastructure-Foundations

fgiunchedi removed a subtask for T294564: Migrate Foundations Prometheus alerts to AlertManager: T302639: How should we monitor for faulty memory modules?.

Tue, Jul 16, 1:23 PM · Infrastructure-Foundations, Observability-Alerting

fgiunchedi removed a subtask for T288622: All Prometheus based alerts move from Icinga to alert manager exclusively: T305847: Migrate SRE paging alerts off Icinga and to Alertmanager.

Tue, Jul 16, 1:22 PM · SRE Observability (FY2024/2025-Q1)

fgiunchedi added a subtask for T321808: Port most/all Icinga checks to Prometheus/Alertmanager: T305847: Migrate SRE paging alerts off Icinga and to Alertmanager.

Tue, Jul 16, 1:22 PM · SRE Observability (FY2024/2025-Q1), Observability-Alerting

fgiunchedi edited parent tasks for T305847: Migrate SRE paging alerts off Icinga and to Alertmanager, added: T321808: Port most/all Icinga checks to Prometheus/Alertmanager; removed: T288622: All Prometheus based alerts move from Icinga to alert manager exclusively.

Tue, Jul 16, 1:22 PM · Observability-Alerting, User-fgiunchedi

fgiunchedi removed a parent task for T326657: Add prometheus-https load balancer: T301944: Web interface to navigate Prometheus alerts and their status.

Tue, Jul 16, 1:22 PM · Traffic, Patch-For-Review, Observability-Metrics

fgiunchedi removed a subtask for T301944: Web interface to navigate Prometheus alerts and their status: T326657: Add prometheus-https load balancer.

Tue, Jul 16, 1:22 PM · User-herron, Observability-Metrics

fgiunchedi edited parent tasks for T225140: Icinga alerts that should open tasks instead of alerting, added: T321808: Port most/all Icinga checks to Prometheus/Alertmanager; removed: T288622: All Prometheus based alerts move from Icinga to alert manager exclusively.

Tue, Jul 16, 1:21 PM · Observability-Alerting, Patch-For-Review

fgiunchedi removed a subtask for T288622: All Prometheus based alerts move from Icinga to alert manager exclusively: T225140: Icinga alerts that should open tasks instead of alerting.

Tue, Jul 16, 1:21 PM · SRE Observability (FY2024/2025-Q1)

fgiunchedi added a subtask for T321808: Port most/all Icinga checks to Prometheus/Alertmanager: T225140: Icinga alerts that should open tasks instead of alerting.

Tue, Jul 16, 1:21 PM · SRE Observability (FY2024/2025-Q1), Observability-Alerting

fgiunchedi created T370153: Move kafka-mirror Prometheus-based alerts from Icinga to alerts.git.

Tue, Jul 16, 1:18 PM · SRE Observability (FY2024/2025-Q1)

colewhite awarded T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic a Party Time token.

Tue, Jul 16, 12:30 PM · Observability-Logging

fgiunchedi added a comment to T315866: Migrate mysql icinga alerts to alert manager.

In T315866#9984611, @fnegri wrote:

Adding a note to mention that currently Icinga alerts related to clouddb* hosts are getting tagged with team=wmcs when they are forwarded from icinga to alertmanager. I tried to figure out where that tagging happens but I haven't found it. We should aim to maintain that tagging when the alerts are migrated to alertmanager.

Tue, Jul 16, 12:07 PM · Patch-For-Review, DBA

kamila awarded T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic a Love token.

Tue, Jul 16, 10:28 AM · Observability-Logging

fgiunchedi added a comment to T352756: Gap in metrics rendered from Thanos Rules.

In T352756#9965404, @gerritbot wrote:

Change #1052784 merged by jenkins-bot:

[operations/alerts@master] istio_sli_avail: alert if metric goes absent

https://gerrit.wikimedia.org/r/1052784

Tue, Jul 16, 9:53 AM · SRE Observability (FY2024/2025-Q1), Observability-Metrics, Machine-Learning-Team

fgiunchedi closed T366308: More Benthos instances consumes slower? as Resolved.

I'm tentatively resolving this since the increase in partitions in T369256 has helped with increased concurrency

Tue, Jul 16, 9:36 AM · Observability-Logging

fgiunchedi added a comment to T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic.

I'm happy to report that increasing partition count and restarting benthos mw sampler did the trick, lag was reduced even before the restart and completely gone after restart:

Tue, Jul 16, 9:34 AM · Observability-Logging

fgiunchedi created T370129: topicmappr marshal error on kafka-logging cluster.

Tue, Jul 16, 9:15 AM · Observability-Logging

fgiunchedi updated the task description for T369258: Cleanup kafka-logging topic names.

Tue, Jul 16, 9:01 AM · Observability-Logging

fgiunchedi added a comment to T370043: occasional OtelCollectorRefusedSpans alert for memory_limiter.

This is ongoing, to exclude a otel-coll problem I've deleted main-opentelemetry-collector-agent-msvrc pod in wikikube codfw. Let's see if it happens again

Tue, Jul 16, 8:22 AM · Observability-Tracing

fgiunchedi added a comment to T369826: 10gbit nic option for centrallog2002.

In T369826#9982422, @Jhancock.wm wrote:

We won't need to move racks. But because of the way the switches are, we can't reuse the same port on the switch. we'll be moving to a different set of 4. Are you going to reimage the server?

Tue, Jul 16, 7:52 AM · SRE, DC-Ops, ops-codfw

Mon, Jul 15

fgiunchedi updated subscribers of T369826: 10gbit nic option for centrallog2002.

In T369826#9982167, @Jhancock.wm wrote:

Got the card back @fgiunchedi. I'm free to swap it anytime on Tuesday or Thursday between 8am and 4pm CDT

Mon, Jul 15, 4:02 PM · SRE, DC-Ops, ops-codfw

fgiunchedi added a comment to T354255: Alert in need of triage: AlertLintProblem (instance localhost:9123).

Thank you @LSobanski ! I'll be reaching out to the individual service owners

Mon, Jul 15, 2:44 PM · SRE Observability (FY2024/2025-Q1), sre-alert-triage

fgiunchedi updated the task description for T369737: Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic.

Mon, Jul 15, 2:37 PM · Observability-Metrics, Sustainability (Incident Followup), Data-Engineering

fgiunchedi created T370043: occasional OtelCollectorRefusedSpans alert for memory_limiter.

Mon, Jul 15, 12:10 PM · Observability-Tracing

Fri, Jul 12

fgiunchedi added a comment to T369737: Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic.

Today I've done extensive tests and tweaking of benthos@webrequest_live settings, namely:

Fri, Jul 12, 2:58 PM · Observability-Metrics, Sustainability (Incident Followup), Data-Engineering

fgiunchedi added a comment to T369826: 10gbit nic option for centrallog2002.

Thank you @Jhancock.wm that's great! Please LMK a day and time of next week that would work for you

Fri, Jul 12, 1:37 PM · SRE, DC-Ops, ops-codfw

fgiunchedi added a comment to T369825: 10gbit nic option for centrallog1002.

In T369825#9974444, @VRiley-WMF wrote:

@wiki_willy Yes, I was able to locate one. @fgiunchedi is there an estimated time and date for us to bring the server down and install the it?

Fri, Jul 12, 1:36 PM · SRE, ops-eqiad, DC-Ops

Thu, Jul 11

fgiunchedi updated subscribers of T369825: 10gbit nic option for centrallog1002.

Thu, Jul 11, 3:14 PM · SRE, ops-eqiad, DC-Ops

fgiunchedi created T369826: 10gbit nic option for centrallog2002.

Thu, Jul 11, 2:27 PM · SRE, DC-Ops, ops-codfw

fgiunchedi created T369825: 10gbit nic option for centrallog1002.

Thu, Jul 11, 2:26 PM · SRE, ops-eqiad, DC-Ops

fgiunchedi placed T369737: Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic up for grabs.

Thu, Jul 11, 2:25 PM · Observability-Metrics, Sustainability (Incident Followup), Data-Engineering

fgiunchedi added a project to T369737: Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic: Observability-Metrics.

Thu, Jul 11, 2:24 PM · Observability-Metrics, Sustainability (Incident Followup), Data-Engineering

fgiunchedi renamed T369737: Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic from Site Issue: Delayed data in the `webrequest_sampled_live` Druid table to Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic.

Thu, Jul 11, 2:21 PM · Observability-Metrics, Sustainability (Incident Followup), Data-Engineering

fgiunchedi added a comment to T369737: Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic.

A bit of context: benthos@webrequest_live normally runs on centrallog1002 and centrallog2002 consuming the webrequest firehose from jumbo-eqiad. While at steady state this is more or less fine, on sudden spikes of messages codfw consumer struggles with clearing its backlog / lag.

Thu, Jul 11, 2:19 PM · Observability-Metrics, Sustainability (Incident Followup), Data-Engineering

fgiunchedi added a comment to T369737: Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic.

I've taken a look at this as well for the benthos bits, and there was significant kafka lag for the benthos consumer group at the time, which would explain the delay in data

Thu, Jul 11, 9:57 AM · Observability-Metrics, Sustainability (Incident Followup), Data-Engineering

fgiunchedi closed T369500: Grant Access to wmf for Uniquemia as Resolved.

Thank you @EUwandu-WMF ! Everything checks out, you are now part of wmf group and therefore have access. I'm tentatively resolving, though please reopen as needed

Thu, Jul 11, 8:41 AM · SRE, LDAP-Access-Requests

fgiunchedi added a member for WMF-NDA: EUwandu-WMF.

Thu, Jul 11, 8:35 AM

fgiunchedi closed T366032: Grant Access to nda/logstash for Sohom Datta as Resolved.

Patch is merged and I've added soda to nda ldap group, tentatively resolving though please reopen if needed

Thu, Jul 11, 8:32 AM · SRE, LDAP-Access-Requests

Wed, Jul 10

fgiunchedi moved T368088: upgrade prometheus-ipmi-exporter to 1.8.0 from Inbox to FY2024/2025-Q1 on the SRE Observability board.

Wed, Jul 10, 2:11 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, Infrastructure-Foundations, Packaging

fgiunchedi edited projects for T369715: Gather all mariadb host under the same prometheus label, added: Observability-Alerting, Observability-Metrics; removed observability.

Wed, Jul 10, 2:08 PM · Observability-Metrics, Observability-Alerting, DBA

fgiunchedi removed a project from T368945: wikipedia-pl-sysop: local images fail to generate thumbnail: SRE.

Wed, Jul 10, 11:43 AM · Thumbor

Tue, Jul 9

fgiunchedi closed T352640: Fix Pontoon to bootstrap from Bookworm and Puppetserver as Resolved.

Calling this done, Pontoon now supports Puppet 7 (puppetserver) and I've updated the wikitech documentation

Tue, Jul 9, 12:33 PM · User-fgiunchedi, Pontoon

fgiunchedi closed T360703: Replace or remove Debian Buster VMs in 'monitoring' cloud-vps project as Resolved.

This is done (the monitoring project is no longer a thing)

Tue, Jul 9, 12:06 PM · Cloud-VPS (Debian Buster Deprecation), cloud-services-team

fgiunchedi removed projects from T343529: Prometheus doesn't reload or alert on expired client certificates: SRE Observability (FY2024/2025-Q1), User-fgiunchedi.

Tue, Jul 9, 12:00 PM · Prod-Kubernetes, Observability-Metrics, Kubernetes, serviceops-radar

fgiunchedi added a comment to T369348: Grant access to wmf to lferreira.

No worries at all and totally fair @Aklapper !

Tue, Jul 9, 10:38 AM · SRE, LDAP-Access-Requests

fgiunchedi added a comment to T369607: High cardinality metrics break queries/dashboards (envoy, istio, ...).

Definitely +1 to use recording rules for histograms, since they are broken anyways already. I think also switching the istio dashboard to thanos might help, as there is caching within thanos (the stateless part, thanos-query-frontend and thanos-query).

Tue, Jul 9, 9:57 AM · Patch-For-Review, Kubernetes, Grafana, Observability-Metrics, serviceops

fgiunchedi added a comment to T369348: Grant access to wmf to lferreira.

In T369348#9964576, @Aklapper wrote:

I'll bite per https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMF_Group

Tue, Jul 9, 9:37 AM · SRE, LDAP-Access-Requests

fgiunchedi added a member for WMF-NDA: Lferreira.

Tue, Jul 9, 9:37 AM

fgiunchedi closed T369348: Grant access to wmf to lferreira as Resolved.

@Lferreira you are now part of the wmf ldap group, I'm optimistically resolving the task, though please reopen if sth is amiss!

Tue, Jul 9, 9:15 AM · SRE, LDAP-Access-Requests

fgiunchedi triaged T368945: wikipedia-pl-sysop: local images fail to generate thumbnail as Medium priority.

Tue, Jul 9, 9:10 AM · Thumbor

fgiunchedi moved T368566: Grant Access to analytics-privatedata-users for Sharvaniharan from Manager/NDA Approval/Confirmation to Awaiting User Input on the SRE-Access-Requests board.

Tue, Jul 9, 9:08 AM · SRE-Access-Requests, SRE

fgiunchedi added a comment to T368566: Grant Access to analytics-privatedata-users for Sharvaniharan.

You should be able to access the dashboards in ~30 min from now, please confirm that is the case.

Tue, Jul 9, 9:02 AM · SRE-Access-Requests, SRE

fgiunchedi updated the task description for T368566: Grant Access to analytics-privatedata-users for Sharvaniharan.

Tue, Jul 9, 9:02 AM · SRE-Access-Requests, SRE

fgiunchedi moved T369500: Grant Access to wmf for Uniquemia from Backlog to Awaiting User Input on the LDAP-Access-Requests board.

Tue, Jul 9, 8:46 AM · SRE, LDAP-Access-Requests

fgiunchedi added a comment to T369500: Grant Access to wmf for Uniquemia.

In T369500#9964364, @fgiunchedi wrote:

In T369500#9963066, @EUwandu-WMF wrote:

Hello @fgiunchedi , Please can you check again to see if it works now? Here is the screenshot of my sign-in with Uniquemia on Wikitech as well if it is helpful{F56297576}

Thank you, that helped realize I made a mistake in my search! I'll do the next steps to get you access to superset

Tue, Jul 9, 8:40 AM · SRE, LDAP-Access-Requests

fgiunchedi added a comment to T369500: Grant Access to wmf for Uniquemia.

In T369500#9963066, @EUwandu-WMF wrote:

Hello @fgiunchedi , Please can you check again to see if it works now? Here is the screenshot of my sign-in with Uniquemia on Wikitech as well if it is helpful{F56297576}

Tue, Jul 9, 8:38 AM · SRE, LDAP-Access-Requests

Mon, Jul 8

fgiunchedi closed T369519: Grant Access to wmf for xiaoxiao as Resolved.

Hello @XiaoXiao-WMF; I've added you to wmf ldap group. I'm tentatively resolving the task though please reopen if sth is amiss

Mon, Jul 8, 2:49 PM · SRE, LDAP-Access-Requests

fgiunchedi added a comment to T369500: Grant Access to wmf for Uniquemia.

Hello @EUwandu-WMF, I couldn't find the uniquemia account on wikitech, or at least one with euwandu-ctr@wikimedia.org as its email, what wikitech account should we be using? thank you!

Mon, Jul 8, 2:42 PM · SRE, LDAP-Access-Requests

fgiunchedi updated the task description for T353912: Observability Bookworm upgrades.

Mon, Jul 8, 12:06 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review

fgiunchedi updated the task description for T369122: On-call batphone escalation configuration holidays FY2024/25.

Mon, Jul 8, 8:26 AM · SRE Observability (FY2024/2025-Q1)

Fri, Jul 5

fgiunchedi closed T369263: Upgrade deployment-mwlog01.deployment-prep.eqiad1.wikimedia.cloud to bullseye or bookworm as Resolved.

There is actually a mwlog02 bullseye instance which is already active, I've proceeded to delete mwlog01

Fri, Jul 5, 1:25 PM · Cloud-VPS (Debian Buster Deprecation), observability, Observability-Logging

fgiunchedi updated the task description for T327742: Migrate deployment-prep away from Debian Buster to Bullseye/Bookworm.

Fri, Jul 5, 1:24 PM · Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure

fgiunchedi closed T369263: Upgrade deployment-mwlog01.deployment-prep.eqiad1.wikimedia.cloud to bullseye or bookworm, a subtask of T327742: Migrate deployment-prep away from Debian Buster to Bullseye/Bookworm, as Resolved.

Fri, Jul 5, 1:24 PM · Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure

fgiunchedi added a comment to T369252: monitoring - MariaDB log parsing and log alerting.

Got it now, thank you @Marostegui. My recommendation in this case is to use mtail to get metrics out of logs; where such parsing should happen we can decide. My recommendation is to have mtail on db hosts that tails log files, this way you don't depend on: rsyslog delivering to centrallog, centrallog working, etc. There are some examples in puppet already for roles that use mtail (e.g. hieradata/role/common/mail/mx.yaml) which should be a good starting point, we (o11y) are of course happy to help too.

Fri, Jul 5, 9:52 AM · DBA

fgiunchedi added a comment to T369252: monitoring - MariaDB log parsing and log alerting.

I think I'm missing context here, would you mind expanding on what are you trying to achieve?

Fri, Jul 5, 9:27 AM · DBA

fgiunchedi updated the task description for T369345: NEL almost not reported anymore / very infrequently.

Fri, Jul 5, 9:06 AM · Traffic

fgiunchedi created T369345: NEL almost not reported anymore / very infrequently.

Fri, Jul 5, 8:51 AM · Traffic