fgiunchedi (Filippo Giunchedi)
/* No comment */

Projects (17)

User Details

User Since
Oct 3 2014, 8:06 AM (510 w, 6 d)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
FGiunchedi (WMF) [ Global Accounts ]

Recent Activity

Yesterday

fgiunchedi added a comment to T351927: Decide and tweak Thanos retention.

Hi @fgiunchedi and sorry to bug you again (again) about this, but we have a bunch of alerts about thanos backend disk space again.

[relatedly, we're trying to expedite the two new backend servers to get them in this quarter]

Thu, Jul 18, 2:04 PM · User-fgiunchedi, Observability-Metrics
fgiunchedi added a comment to T369825: 10gbit nic option for centrallog1002.

No problem just give me a heads up on when y'all want to do the changes.

Thu, Jul 18, 2:01 PM · SRE, ops-eqiad, DC-Ops
fgiunchedi added a comment to T370043: occasional OtelCollectorRefusedSpans alert for memory_limiter.

I captured otelcoll traffic on mw2310 (it looks like it is always and only the otel-coll running on this host that causes problems) in the hope of understanding which trace is problematic. I've left /root/otelcoll-T370043.cap in place in case it is of interest; I could decode the http2 traffic with wireshark (select traffic with source port 4317 then "decode as" http2) though I couldn't get wireshark to decode the grpc traffic.

Thu, Jul 18, 12:12 PM · Observability-Tracing
fgiunchedi added a comment to T370043: occasional OtelCollectorRefusedSpans alert for memory_limiter.

I'm investigating this again today, and yes there's a single otel-coll pod in codfw that is problematic: https://w.wiki/AhDr

Thu, Jul 18, 9:10 AM · Observability-Tracing
fgiunchedi closed T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic as Resolved.

I'm tentatively resolving this since, even though the topic isn't perfectly balanced, I'm calling it good enough™. We can revisit as needed.

Thu, Jul 18, 7:42 AM · Observability-Logging
fgiunchedi added a comment to T369825: 10gbit nic option for centrallog1002.

Thank you all for the clarification, I'm glad we're able to do the move without re-addressing after all!

Thu, Jul 18, 6:59 AM · SRE, ops-eqiad, DC-Ops
fgiunchedi created T370386: statograph_post errors with out of range float values.
Thu, Jul 18, 6:49 AM · observability

Wed, Jul 17

fgiunchedi added a comment to T369825: 10gbit nic option for centrallog1002.

Hey @fgiunchedi Just wanted to verify, since we would have to physically move this server into another rack (and in turn, have to change the IP) this activity is no longer needed, correct? If so, I will close this ticket. Thanks!

Wed, Jul 17, 3:35 PM · SRE, ops-eqiad, DC-Ops
fgiunchedi added a comment to T368513: Core router error logs: "sshd: Did not receive identification string" from prometheus hosts.

So I looked where the probes come from, and they are part of the generic "probe mgmt network hosts for ssh" and data comes from netbox, specifically these bits in modules/profile/manifests/prometheus/ops.pp:

Wed, Jul 17, 1:26 PM · Infrastructure-Foundations, netops, SRE
fgiunchedi removed a project from T370171: wmde-analytics-minutely.service can get stuck (RuntimeMaxSec= has no effect), resulting in missed stats: Grafana.

I'm removing grafana since it turns out this issue is unrelated

Wed, Jul 17, 1:02 PM · Wikidata Dev Team (Wikidata.org Slice), Patch-For-Review, Wikidata Analytics, Wikidata
fgiunchedi created T370264: benthos mw-accesslog-metrics interpolation errors.
Wed, Jul 17, 12:46 PM · MW-on-K8s, Observability-Logging, serviceops
fgiunchedi added a comment to T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic.

I have reached out to upstream at https://github.com/redpanda-data/connect/issues/2705 to make them aware, in case they have seen this before. I'm not overly concerned just yet in the sense that IIRC this is the first time we've seen such a failure and it could be an unlucky coincidence

Wed, Jul 17, 12:41 PM · Observability-Logging
fgiunchedi added a comment to T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic.

Though there didn't seem to be a problem afterwards, the timing makes me think of T365997: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f2-eqiad

Yeah the times in those logs for connection failures to kafka-jumbo1014 exactly match when it was offline due to the switch upgrade. So probably I'm complicating things for everyone as usual :)

Wed, Jul 17, 12:18 PM · Observability-Logging
fgiunchedi added a comment to T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic.

There are similar logs in benthos@webrequest_live on centrallog1002:

Wed, Jul 17, 8:55 AM · Observability-Logging
fgiunchedi added a comment to T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic.

Also this in logs around the time consuming stopped:

Wed, Jul 17, 8:51 AM · Observability-Logging
fgiunchedi added a comment to T369445: VRTS errors arriving to root@.

Screenshot 2024-07-16 at 22.13.12.png (514×3 px, 110 KB)

I think I found the setting that affects this. It's configured on the VRTS dashboard and was initially set to root@localhost. Updated it to our team's mailing list. Let's see where the next one goes.

I think this option solved the mail issue. As far as I can tell the mails no longer go to root@. The latest mails from VRTS Notifications (with the same error) were sent to our team mail address sre-service-ops-collab@ instead of root@.

root@ got a bunch of these emails e.g. just now

Wed, Jul 17, 8:31 AM · Znuny, vrts, collaboration-services
fgiunchedi added a comment to T369445: VRTS errors arriving to root@.

Screenshot 2024-07-16 at 22.13.12.png (514×3 px, 110 KB)

I think I found the setting that affects this. It's configured on the VRTS dashboard and was initially set to root@localhost. Updated it to our team's mailing list. Let's see where the next one goes.

I think this option solved the mail issue. As far as I can tell the mails no longer go to root@. The latest mails from VRTS Notifications (with the same error) were sent to our team mail address sre-service-ops-collab@ instead of root@.

Wed, Jul 17, 8:29 AM · Znuny, vrts, collaboration-services

Tue, Jul 16

fgiunchedi closed T369826: 10gbit nic option for centrallog2002 as Resolved.

This is done! We went with the procedure I suggested above, namely I took the host side configuration by logging back in via console to adjust the network configuration with the new interface name. @Jhancock.wm did the host physical move by connecting everything to its final location, and @Papaul did the netbox + network part

Tue, Jul 16, 3:33 PM · SRE, DC-Ops, ops-codfw
fgiunchedi closed T369826: 10gbit nic option for centrallog2002, a subtask of T369737: Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic, as Resolved.
Tue, Jul 16, 3:33 PM · Observability-Metrics, Sustainability (Incident Followup), Data-Engineering
fgiunchedi added a comment to T370043: occasional OtelCollectorRefusedSpans alert for memory_limiter.

Perhaps unsurprisingly, the problem is back, only on a different pod. Which in my mind excludes an otel-coll problem and points to a client sending troublesome spans

Tue, Jul 16, 2:36 PM · Observability-Tracing
CDanis awarded T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic a Love token.
Tue, Jul 16, 2:28 PM · Observability-Logging
fgiunchedi added a comment to T328502: Move WMCS off of Icinga and introduce alertmanager.

A very low hanging fruit to make progress on this task is the following prometheus-based checks:

Tue, Jul 16, 1:57 PM · cloud-services-team (FY2023/2024-Q3-Q4), Toolforge, Cloud-VPS, Observability-Alerting, Goal
fgiunchedi added a comment to T369826: 10gbit nic option for centrallog2002.

Thank you @Papaul that is quite helpful!

Tue, Jul 16, 1:54 PM · SRE, DC-Ops, ops-codfw
fgiunchedi created T370157: Port lists monitoring alerts to Alertmanager.
Tue, Jul 16, 1:42 PM · SRE Observability (FY2024/2025-Q1), Observability-Alerting
fgiunchedi removed a subtask for T288622: All Prometheus based alerts move from Icinga to alert manager exclusively: T301944: Web interface to navigate Prometheus alerts and their status.
Tue, Jul 16, 1:23 PM · SRE Observability (FY2024/2025-Q1)
fgiunchedi removed a parent task for T301944: Web interface to navigate Prometheus alerts and their status: T288622: All Prometheus based alerts move from Icinga to alert manager exclusively.
Tue, Jul 16, 1:23 PM · User-herron, Observability-Metrics
fgiunchedi removed a parent task for T302639: How should we monitor for faulty memory modules?: T294564: Migrate Foundations Prometheus alerts to AlertManager.
Tue, Jul 16, 1:23 PM · SRE Observability, Infrastructure-Foundations
fgiunchedi removed a subtask for T294564: Migrate Foundations Prometheus alerts to AlertManager: T302639: How should we monitor for faulty memory modules?.
Tue, Jul 16, 1:23 PM · Infrastructure-Foundations, Observability-Alerting
fgiunchedi removed a subtask for T288622: All Prometheus based alerts move from Icinga to alert manager exclusively: T305847: Migrate SRE paging alerts off Icinga and to Alertmanager.
Tue, Jul 16, 1:22 PM · SRE Observability (FY2024/2025-Q1)
fgiunchedi added a subtask for T321808: Port most/all Icinga checks to Prometheus/Alertmanager: T305847: Migrate SRE paging alerts off Icinga and to Alertmanager.
Tue, Jul 16, 1:22 PM · SRE Observability (FY2024/2025-Q1), Observability-Alerting
fgiunchedi edited parent tasks for T305847: Migrate SRE paging alerts off Icinga and to Alertmanager, added: T321808: Port most/all Icinga checks to Prometheus/Alertmanager; removed: T288622: All Prometheus based alerts move from Icinga to alert manager exclusively.
Tue, Jul 16, 1:22 PM · Observability-Alerting, User-fgiunchedi
fgiunchedi removed a parent task for T326657: Add prometheus-https load balancer: T301944: Web interface to navigate Prometheus alerts and their status.
Tue, Jul 16, 1:22 PM · Traffic, Patch-For-Review, Observability-Metrics
fgiunchedi removed a subtask for T301944: Web interface to navigate Prometheus alerts and their status: T326657: Add prometheus-https load balancer.
Tue, Jul 16, 1:22 PM · User-herron, Observability-Metrics
fgiunchedi edited parent tasks for T225140: Icinga alerts that should open tasks instead of alerting, added: T321808: Port most/all Icinga checks to Prometheus/Alertmanager; removed: T288622: All Prometheus based alerts move from Icinga to alert manager exclusively.
Tue, Jul 16, 1:21 PM · Observability-Alerting, Patch-For-Review
fgiunchedi removed a subtask for T288622: All Prometheus based alerts move from Icinga to alert manager exclusively: T225140: Icinga alerts that should open tasks instead of alerting.
Tue, Jul 16, 1:21 PM · SRE Observability (FY2024/2025-Q1)
fgiunchedi added a subtask for T321808: Port most/all Icinga checks to Prometheus/Alertmanager: T225140: Icinga alerts that should open tasks instead of alerting.
Tue, Jul 16, 1:21 PM · SRE Observability (FY2024/2025-Q1), Observability-Alerting
fgiunchedi created T370153: Move kafka-mirror Prometheus-based alerts from Icinga to alerts.git.
Tue, Jul 16, 1:18 PM · SRE Observability (FY2024/2025-Q1)
colewhite awarded T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic a Party Time token.
Tue, Jul 16, 12:30 PM · Observability-Logging
fgiunchedi added a comment to T315866: Migrate mysql icinga alerts to alert manager.

Adding a note to mention that currently Icinga alerts related to clouddb* hosts are getting tagged with team=wmcs when they are forwarded from icinga to alertmanager. I tried to figure out where that tagging happens but I haven't found it. We should aim to maintain that tagging when the alerts are migrated to alertmanager.

Tue, Jul 16, 12:07 PM · Patch-For-Review, DBA
kamila awarded T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic a Love token.
Tue, Jul 16, 10:28 AM · Observability-Logging
fgiunchedi added a comment to T352756: Gap in metrics rendered from Thanos Rules.

Change #1052784 merged by jenkins-bot:

[operations/alerts@master] istio_sli_avail: alert if metric goes absent

https://gerrit.wikimedia.org/r/1052784

Tue, Jul 16, 9:53 AM · SRE Observability (FY2024/2025-Q1), Observability-Metrics, Machine-Learning-Team
fgiunchedi closed T366308: More Benthos instances consumes slower? as Resolved.

I'm tentatively resolving this since the increase in partitions in T369256 has helped with increased concurrency

Tue, Jul 16, 9:36 AM · Observability-Logging
fgiunchedi added a comment to T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic.

I'm happy to report that increasing partition count and restarting benthos mw sampler did the trick, lag was reduced even before the restart and completely gone after restart:

Tue, Jul 16, 9:34 AM · Observability-Logging
fgiunchedi created T370129: topicmappr marshal error on kafka-logging cluster.
Tue, Jul 16, 9:15 AM · Observability-Logging
fgiunchedi updated the task description for T369258: Cleanup kafka-logging topic names.
Tue, Jul 16, 9:01 AM · Observability-Logging
fgiunchedi added a comment to T370043: occasional OtelCollectorRefusedSpans alert for memory_limiter.

This is ongoing; to exclude an otel-coll problem I've deleted the main-opentelemetry-collector-agent-msvrc pod in wikikube codfw. Let's see if it happens again

Tue, Jul 16, 8:22 AM · Observability-Tracing
fgiunchedi added a comment to T369826: 10gbit nic option for centrallog2002.

We won't need to move racks. But because of the way the switches are, we can't reuse the same port on the switch. We'll be moving to a different set of 4. Are you going to reimage the server?

Tue, Jul 16, 7:52 AM · SRE, DC-Ops, ops-codfw

Mon, Jul 15

fgiunchedi updated subscribers of T369826: 10gbit nic option for centrallog2002.

Got the card back @fgiunchedi. I'm free to swap it anytime on Tuesday or Thursday between 8am and 4pm CDT

Mon, Jul 15, 4:02 PM · SRE, DC-Ops, ops-codfw
fgiunchedi added a comment to T354255: Alert in need of triage: AlertLintProblem (instance localhost:9123).

Thank you @LSobanski ! I'll be reaching out to the individual service owners

Mon, Jul 15, 2:44 PM · SRE Observability (FY2024/2025-Q1), sre-alert-triage
fgiunchedi updated the task description for T369737: Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic.
Mon, Jul 15, 2:37 PM · Observability-Metrics, Sustainability (Incident Followup), Data-Engineering
fgiunchedi created T370043: occasional OtelCollectorRefusedSpans alert for memory_limiter.
Mon, Jul 15, 12:10 PM · Observability-Tracing

Fri, Jul 12

fgiunchedi added a comment to T369737: Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic.

Today I've done extensive tests and tweaking of benthos@webrequest_live settings, namely:

Fri, Jul 12, 2:58 PM · Observability-Metrics, Sustainability (Incident Followup), Data-Engineering
fgiunchedi added a comment to T369826: 10gbit nic option for centrallog2002.

Thank you @Jhancock.wm that's great! Please LMK a day and time of next week that would work for you

Fri, Jul 12, 1:37 PM · SRE, DC-Ops, ops-codfw
fgiunchedi added a comment to T369825: 10gbit nic option for centrallog1002.

@wiki_willy Yes, I was able to locate one. @fgiunchedi is there an estimated time and date for us to bring the server down and install it?

Fri, Jul 12, 1:36 PM · SRE, ops-eqiad, DC-Ops

Thu, Jul 11

fgiunchedi updated subscribers of T369825: 10gbit nic option for centrallog1002.
Thu, Jul 11, 3:14 PM · SRE, ops-eqiad, DC-Ops
fgiunchedi created T369826: 10gbit nic option for centrallog2002.
Thu, Jul 11, 2:27 PM · SRE, DC-Ops, ops-codfw
fgiunchedi created T369825: 10gbit nic option for centrallog1002.
Thu, Jul 11, 2:26 PM · SRE, ops-eqiad, DC-Ops
fgiunchedi placed T369737: Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic up for grabs.
Thu, Jul 11, 2:25 PM · Observability-Metrics, Sustainability (Incident Followup), Data-Engineering
fgiunchedi added a project to T369737: Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic: Observability-Metrics.
Thu, Jul 11, 2:24 PM · Observability-Metrics, Sustainability (Incident Followup), Data-Engineering
fgiunchedi renamed T369737: Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic from Site Issue: Delayed data in the `webrequest_sampled_live` Druid table to Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic.
Thu, Jul 11, 2:21 PM · Observability-Metrics, Sustainability (Incident Followup), Data-Engineering
fgiunchedi added a comment to T369737: Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic.

A bit of context: benthos@webrequest_live normally runs on centrallog1002 and centrallog2002 consuming the webrequest firehose from jumbo-eqiad. While at steady state this is more or less fine, on sudden spikes of messages codfw consumer struggles with clearing its backlog / lag.

Thu, Jul 11, 2:19 PM · Observability-Metrics, Sustainability (Incident Followup), Data-Engineering
fgiunchedi added a comment to T369737: Site Issue: Delayed data in the `webrequest_sampled_live` kafka topic.

I've taken a look at this as well for the benthos bits, and there was significant kafka lag for the benthos consumer group at the time, which would explain the delay in data

Thu, Jul 11, 9:57 AM · Observability-Metrics, Sustainability (Incident Followup), Data-Engineering
fgiunchedi closed T369500: Grant Access to wmf for Uniquemia as Resolved.

Thank you @EUwandu-WMF ! Everything checks out, you are now part of wmf group and therefore have access. I'm tentatively resolving, though please reopen as needed

Thu, Jul 11, 8:41 AM · SRE, LDAP-Access-Requests
fgiunchedi added a member for WMF-NDA: EUwandu-WMF.
Thu, Jul 11, 8:35 AM
fgiunchedi closed T366032: Grant Access to nda/logstash for Sohom Datta as Resolved.

Patch is merged and I've added soda to nda ldap group, tentatively resolving though please reopen if needed

Thu, Jul 11, 8:32 AM · SRE, LDAP-Access-Requests

Wed, Jul 10

fgiunchedi moved T368088: upgrade prometheus-ipmi-exporter to 1.8.0 from Inbox to FY2024/2025-Q1 on the SRE Observability board.
Wed, Jul 10, 2:11 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, Infrastructure-Foundations, Packaging
fgiunchedi edited projects for T369715: Gather all mariadb host under the same prometheus label, added: Observability-Alerting, Observability-Metrics; removed observability.
Wed, Jul 10, 2:08 PM · Observability-Metrics, Observability-Alerting, DBA
fgiunchedi removed a project from T368945: wikipedia-pl-sysop: local images fail to generate thumbnail: SRE.
Wed, Jul 10, 11:43 AM · Thumbor

Tue, Jul 9

fgiunchedi closed T352640: Fix Pontoon to bootstrap from Bookworm and Puppetserver as Resolved.

Calling this done, Pontoon now supports Puppet 7 (puppetserver) and I've updated the wikitech documentation

Tue, Jul 9, 12:33 PM · User-fgiunchedi, Pontoon
fgiunchedi closed T360703: Replace or remove Debian Buster VMs in 'monitoring' cloud-vps project as Resolved.

This is done (the monitoring project is no longer a thing)

Tue, Jul 9, 12:06 PM · Cloud-VPS (Debian Buster Deprecation), cloud-services-team
fgiunchedi removed projects from T343529: Prometheus doesn't reload or alert on expired client certificates: SRE Observability (FY2024/2025-Q1), User-fgiunchedi.
Tue, Jul 9, 12:00 PM · Prod-Kubernetes, Observability-Metrics, Kubernetes, serviceops-radar
fgiunchedi added a comment to T369348: Grant access to wmf to lferreira.

No worries at all and totally fair @Aklapper !

Tue, Jul 9, 10:38 AM · SRE, LDAP-Access-Requests
fgiunchedi added a comment to T369607: High cardinality metrics break queries/dashboards (envoy, istio, ...).

Definitely +1 to use recording rules for histograms, since they are broken anyways already. I think also switching the istio dashboard to thanos might help, as there is caching within thanos (the stateless part, thanos-query-frontend and thanos-query).

Tue, Jul 9, 9:57 AM · Patch-For-Review, Kubernetes, Grafana, Observability-Metrics, serviceops
fgiunchedi added a comment to T369348: Grant access to wmf to lferreira.
Tue, Jul 9, 9:37 AM · SRE, LDAP-Access-Requests
fgiunchedi added a member for WMF-NDA: Lferreira.
Tue, Jul 9, 9:37 AM
fgiunchedi closed T369348: Grant access to wmf to lferreira as Resolved.

@Lferreira you are now part of the wmf ldap group, I'm optimistically resolving the task, though please reopen if sth is amiss!

Tue, Jul 9, 9:15 AM · SRE, LDAP-Access-Requests
fgiunchedi triaged T368945: wikipedia-pl-sysop: local images fail to generate thumbnail as Medium priority.
Tue, Jul 9, 9:10 AM · Thumbor
fgiunchedi moved T368566: Grant Access to analytics-privatedata-users for Sharvaniharan from Manager/NDA Approval/Confirmation to Awaiting User Input on the SRE-Access-Requests board.
Tue, Jul 9, 9:08 AM · SRE-Access-Requests, SRE
fgiunchedi added a comment to T368566: Grant Access to analytics-privatedata-users for Sharvaniharan.

You should be able to access the dashboards in ~30 min from now, please confirm that is the case.

Tue, Jul 9, 9:02 AM · SRE-Access-Requests, SRE
fgiunchedi updated the task description for T368566: Grant Access to analytics-privatedata-users for Sharvaniharan.
Tue, Jul 9, 9:02 AM · SRE-Access-Requests, SRE
fgiunchedi moved T369500: Grant Access to wmf for Uniquemia from Backlog to Awaiting User Input on the LDAP-Access-Requests board.
Tue, Jul 9, 8:46 AM · SRE, LDAP-Access-Requests
fgiunchedi added a comment to T369500: Grant Access to wmf for Uniquemia.

Hello @fgiunchedi , Please can you check again to see if it works now? Here is the screenshot of my sign-in with Uniquemia on Wikitech as well if it is helpful{F56297576}

Thank you, that helped me realize I made a mistake in my search! I'll do the next steps to get you access to superset

Tue, Jul 9, 8:40 AM · SRE, LDAP-Access-Requests
fgiunchedi added a comment to T369500: Grant Access to wmf for Uniquemia.

Hello @fgiunchedi , Please can you check again to see if it works now? Here is the screenshot of my sign-in with Uniquemia on Wikitech as well if it is helpful{F56297576}

Tue, Jul 9, 8:38 AM · SRE, LDAP-Access-Requests

Mon, Jul 8

fgiunchedi closed T369519: Grant Access to wmf for xiaoxiao as Resolved.

Hello @XiaoXiao-WMF; I've added you to wmf ldap group. I'm tentatively resolving the task though please reopen if sth is amiss

Mon, Jul 8, 2:49 PM · SRE, LDAP-Access-Requests
fgiunchedi added a comment to T369500: Grant Access to wmf for Uniquemia.

Hello @EUwandu-WMF, I couldn't find the uniquemia account on wikitech, or at least one with euwandu-ctr@wikimedia.org as its email, what wikitech account should we be using? thank you!

Mon, Jul 8, 2:42 PM · SRE, LDAP-Access-Requests
fgiunchedi updated the task description for T353912: Observability Bookworm upgrades.
Mon, Jul 8, 12:06 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review
fgiunchedi updated the task description for T369122: On-call batphone escalation configuration holidays FY2024/25.
Mon, Jul 8, 8:26 AM · SRE Observability (FY2024/2025-Q1)

Fri, Jul 5

fgiunchedi closed T369263: Upgrade deployment-mwlog01.deployment-prep.eqiad1.wikimedia.cloud to bullseye or bookworm as Resolved.

There is actually a mwlog02 bullseye instance which is already active, I've proceeded to delete mwlog01

Fri, Jul 5, 1:25 PM · Cloud-VPS (Debian Buster Deprecation), observability, Observability-Logging
fgiunchedi updated the task description for T327742: Migrate deployment-prep away from Debian Buster to Bullseye/Bookworm.
Fri, Jul 5, 1:24 PM · Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure
fgiunchedi closed T369263: Upgrade deployment-mwlog01.deployment-prep.eqiad1.wikimedia.cloud to bullseye or bookworm, a subtask of T327742: Migrate deployment-prep away from Debian Buster to Bullseye/Bookworm, as Resolved.
Fri, Jul 5, 1:24 PM · Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure
fgiunchedi added a comment to T369252: monitoring - MariaDB log parsing and log alerting.

Got it now, thank you @Marostegui. My recommendation in this case is to use mtail to get metrics out of logs; where such parsing should happen we can decide. My recommendation is to have mtail on db hosts that tails log files, this way you don't depend on: rsyslog delivering to centrallog, centrallog working, etc. There are some examples in puppet already for roles that use mtail (e.g. hieradata/role/common/mail/mx.yaml) which should be a good starting point, we (o11y) are of course happy to help too.

Fri, Jul 5, 9:52 AM · DBA
fgiunchedi added a comment to T369252: monitoring - MariaDB log parsing and log alerting.

I think I'm missing context here, would you mind expanding on what you are trying to achieve?

Fri, Jul 5, 9:27 AM · DBA
fgiunchedi updated the task description for T369345: NEL almost not reported anymore / very infrequently.
Fri, Jul 5, 9:06 AM · Traffic
fgiunchedi created T369345: NEL almost not reported anymore / very infrequently.
Fri, Jul 5, 8:51 AM · Traffic

Thu, Jul 4

fgiunchedi added a comment to T366308: More Benthos instances consumes slower?.

I suspect this is related to T369256: Kafka lag for benthos-mw-accesslog-sampler and mediawiki.httpd.accesslog topic; we'll know more once that task is done

Thu, Jul 4, 12:47 PM · Observability-Logging
fgiunchedi updated the task description for T327742: Migrate deployment-prep away from Debian Buster to Bullseye/Bookworm.
Thu, Jul 4, 9:37 AM · Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure
fgiunchedi added a subtask for T327742: Migrate deployment-prep away from Debian Buster to Bullseye/Bookworm: T369263: Upgrade deployment-mwlog01.deployment-prep.eqiad1.wikimedia.cloud to bullseye or bookworm.
Thu, Jul 4, 9:35 AM · Cloud-VPS (Debian Buster Deprecation), Beta-Cluster-Infrastructure
fgiunchedi added a parent task for T369263: Upgrade deployment-mwlog01.deployment-prep.eqiad1.wikimedia.cloud to bullseye or bookworm: T327742: Migrate deployment-prep away from Debian Buster to Bullseye/Bookworm.
Thu, Jul 4, 9:35 AM · Cloud-VPS (Debian Buster Deprecation), observability, Observability-Logging
fgiunchedi added a project to T369263: Upgrade deployment-mwlog01.deployment-prep.eqiad1.wikimedia.cloud to bullseye or bookworm: Cloud-VPS (Debian Buster Deprecation).
Thu, Jul 4, 9:34 AM · Cloud-VPS (Debian Buster Deprecation), observability, Observability-Logging
fgiunchedi created T369263: Upgrade deployment-mwlog01.deployment-prep.eqiad1.wikimedia.cloud to bullseye or bookworm.
Thu, Jul 4, 9:32 AM · Cloud-VPS (Debian Buster Deprecation), observability, Observability-Logging