

herron (Keith Herron)
Site Reliability Engineer



User Details

User Since
May 30 2017, 5:25 PM (372 w, 3 d)
IRC Nick
MediaWiki User

Recent Activity


herron added a project to T369854: Occasional SLOMetricAbsent alerts: Observability-Metrics.

Had a closer look into the threshold tunables and I don't actually see a way to change this natively within Pyrra. As-is, the "for" duration of SLOMetricAbsent alerts is 6m. Pyrra has options to enable/disable the absent alert, but maybe we can configure something to hold this alert in a silence/inhibit waiting room for 30-60m before it alerts (or recovers on its own).

Fri, Jul 19, 5:40 PM · Observability-Metrics, SRE Observability (FY2024/2025-Q1)
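
The "waiting room" idea in the comment above can be sketched as a simple hold-off check. This is only an illustration of the logic, not Pyrra or Alertmanager configuration: the function name, the 30-minute default, and the timestamps are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical holding period; the comment suggests somewhere in the 30-60m range.
WAITING_ROOM = timedelta(minutes=30)


def should_notify(firing_since, now, waiting_room=WAITING_ROOM):
    """Return True once an alert has been firing for at least the holding period."""
    return now - firing_since >= waiting_room


# Illustrative timestamps only.
start = datetime(2024, 7, 18, 14, 0, tzinfo=timezone.utc)
print(should_notify(start, start + timedelta(minutes=6)))   # False: still in the waiting room
print(should_notify(start, start + timedelta(minutes=45)))  # True: held long enough to notify
```

An alert that recovers on its own before the holding period ends never reaches the notify branch, which is the behaviour the comment is after.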


herron renamed T369854: Occasional SLOMetricAbsent alerts from Occasional SLOMetricAbsent false positives to Occasional SLOMetricAbsent alerts.
Thu, Jul 18, 2:37 PM · Observability-Metrics, SRE Observability (FY2024/2025-Q1)
herron edited projects for T369854: Occasional SLOMetricAbsent alerts, added: SRE Observability; removed SRE Observability (FY2024/2025-Q1).

SLOMetricAbsent for dead_letters_hits, varnish_sli_bad, trafficserver_backend_sli_bad and haproxy_sli_bad occurred today, at two different times.

Thu, Jul 18, 2:37 PM · Observability-Metrics, SRE Observability (FY2024/2025-Q1)

Mon, Jul 15

herron updated the task description for T368088: upgrade prometheus-ipmi-exporter to 1.8.0.
Mon, Jul 15, 6:05 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, Packaging, Infrastructure-Foundations

Thu, Jul 11

herron added a comment to T369854: Occasional SLOMetricAbsent alerts.

Reviewing thanos-rule logs I'm seeing related discards with err="out of order sample"

Thu, Jul 11, 5:56 PM · Observability-Metrics, SRE Observability (FY2024/2025-Q1)
herron triaged T369854: Occasional SLOMetricAbsent alerts as Medium priority.
Thu, Jul 11, 5:54 PM · Observability-Metrics, SRE Observability (FY2024/2025-Q1)
herron created P66323 (An Untitled Masterwork).
Thu, Jul 11, 5:51 PM

Mon, Jul 8

herron added a comment to T352756: Gap in metrics rendered from Thanos Rules.

It occurred to me that deploying absent metric alerts for the metrics where we're seeing gaps would be a reasonable next step. That'd let us troubleshoot gaps in/closer to their broken state which should help toward better understanding the issue and steps to resolve manually. Plus, it'd of course help us respond faster and shrink the gaps. I'll work on a patch.

Mon, Jul 8, 2:40 PM · SRE Observability (FY2024/2025-Q1), Observability-Metrics, Machine-Learning-Team
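
As a rough sketch of what an absent-metric alert could look like, the snippet below builds a Prometheus-style alerting rule around the PromQL `absent()` function. The metric name, alert name, "for" duration, and severity are placeholders chosen for illustration and are not taken from the task or the eventual patch.

```python
import json


def absent_alert_rule(metric, for_duration="10m", severity="warning"):
    """Build a Prometheus-style alerting rule that fires when `metric` has no series."""
    return {
        "alert": "MetricAbsent",          # hypothetical alert name
        "expr": f"absent({metric})",      # absent() returns 1 when no series match
        "for": for_duration,
        "labels": {"severity": severity},
        "annotations": {"summary": f"No samples found for {metric}"},
    }


# Hypothetical recording-rule name, used only as an example.
rule = absent_alert_rule("example:recording_rule:ratio")
print(json.dumps({"groups": [{"name": "absent_metrics", "rules": [rule]}]}, indent=2))
```

The printed structure mirrors the layout of a rule file (groups containing rules), so it could be serialized to YAML for a real deployment.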

Wed, Jul 3

herron added a comment to T368168: Re-evaluate logging cluster watermark settings.

Proposal SGTM for the near-term. Thank you for organizing this!

Wed, Jul 3, 2:59 PM · Patch-For-Review, Observability-Logging
herron updated the task description for T368168: Re-evaluate logging cluster watermark settings.
Wed, Jul 3, 2:36 PM · Patch-For-Review, Observability-Logging

Mon, Jul 1

herron triaged T368953: Thanos Cache Tuning as Medium priority.
Mon, Jul 1, 5:15 PM · Observability-Metrics

Tue, Jun 25

herron added a comment to T368088: upgrade prometheus-ipmi-exporter to 1.8.0.

Buster is looking fine with this deb as well. So I've gone ahead and uploaded 1.8.0 to bookworm-wikimedia, bullseye-wikimedia, and buster-wikimedia. Up next is a small canary upgrade before fully rolling out.

Tue, Jun 25, 6:53 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, Packaging, Infrastructure-Foundations
herron updated the task description for T368088: upgrade prometheus-ipmi-exporter to 1.8.0.
Tue, Jun 25, 6:50 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, Packaging, Infrastructure-Foundations
herron added a comment to T368088: upgrade prometheus-ipmi-exporter to 1.8.0.

Nice, thanks for the pointer! It looks like export CGO_ENABLED=0 does the right thing. At least, with this set the package builds and installs successfully on my bullseye test host.

Tue, Jun 25, 2:47 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, Packaging, Infrastructure-Foundations

Mon, Jun 24

herron added a comment to T368088: upgrade prometheus-ipmi-exporter to 1.8.0.

Given that this is a Go static ELF we can also simply build on bookworm and copy the deb over to bullseye-wikimedia; we're doing this for other exporters as well. buster might be tricky due to its old libc6, but we can also ignore it: there are fewer than 150 hosts left and they can simply live with the old IPMI monitoring.

Mon, Jun 24, 5:32 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, Packaging, Infrastructure-Foundations

Fri, Jun 21

herron added a comment to T368088: upgrade prometheus-ipmi-exporter to 1.8.0.

Quick update: prometheus-ipmi-exporter-1.8.0 was a straightforward backport for bookworm https://gitlab.wikimedia.org/repos/sre/prometheus-ipmi-exporter/-/jobs/291532

Fri, Jun 21, 6:33 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, Packaging, Infrastructure-Foundations
herron updated the task description for T368088: upgrade prometheus-ipmi-exporter to 1.8.0.
Fri, Jun 21, 6:33 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, Packaging, Infrastructure-Foundations
herron added a comment to T352756: Gap in metrics rendered from Thanos Rules.

On the 6th in the thanos-rule logs I see a ton of errors while connecting to Prometheus nodes, and around 14 UTC a reload was issued on both titan active nodes.

Fri, Jun 21, 3:07 PM · SRE Observability (FY2024/2025-Q1), Observability-Metrics, Machine-Learning-Team

Thu, Jun 20

herron created T368088: upgrade prometheus-ipmi-exporter to 1.8.0.
Thu, Jun 20, 4:41 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, Packaging, Infrastructure-Foundations
herron added a comment to T253810: Alert on ECC warnings in SEL.

ipmi_exporter now has support to collect generic SEL entries and export metrics from those: https://github.com/prometheus-community/ipmi_exporter/pull/179

Thu, Jun 20, 4:36 PM · SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), User-MoritzMuehlenhoff
herron added a comment to T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s.

Sounds like a plan!

Thu, Jun 20, 2:54 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, MW-on-K8s, serviceops, Observability-Metrics
herron added a comment to T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s.

I think we may want to enable it deployment by deployment so SRE Observability can monitor the load on prometheus. @colewhite or @herron can we coordinate on this?

Thu, Jun 20, 2:49 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, MW-on-K8s, serviceops, Observability-Metrics

Jun 18 2024

herron awarded T367466: sre.hosts.downtime, and any other maintenance processes, should use auto-extending silences a Love token.
Jun 18 2024, 2:03 PM · SRE-tools, Infrastructure-Foundations, Spicerack, Observability-Alerting
herron added a comment to T367790: Detect hardware failures/automatically create tickets for DC Ops.

Looking longer-term I think it'd be generally worthwhile to support more than baseline blackbox checks on the mgmt interfaces and I'm personally open to exploring something like the redfish exporter. But I think this would be a medium to large sized project since AIUI it will involve a decent sized chunk of setup effort on the hardware itself, and offhand I'm also not sure the percentage of hw in the fleet that will support the approach today. I'm also assuming we would run into some hw vendor oddities/bugs along the way.

Jun 18 2024, 1:46 PM · DC-Ops, Data-Platform

Jun 17 2024

herron added a comment to T359879: SLO dashboards for Lift Wing showing unexpected values.

@herron let's double check, maybe we can drop the secondary rules and keep going with the "regular" ones?

Example of the fix: https://grafana.wikimedia.org/d/slo-Lift_Wing_Revert_Risk_LA/lift-wing-revert-risk-la-slo-s?orgId=1&from=2024-03-01%2000:00:00&to=2024-05-31%2023:59:59

Jun 17 2024, 2:23 PM · Machine-Learning-Team, Observability-Metrics

Jun 14 2024

herron closed T367053: Grant Access to wmf for Gonyeahialam as Resolved.


Jun 14 2024, 2:06 PM · SRE, LDAP-Access-Requests
herron added a member for WMF-NDA: gonyeahialam.
Jun 14 2024, 2:04 PM

Jun 13 2024

herron closed T367053: Grant Access to wmf for Gonyeahialam as Resolved.

Group membership has been provisioned, thanks!

Jun 13 2024, 4:11 PM · SRE, LDAP-Access-Requests

Jun 12 2024

herron moved T367295: Requesting access to private data-based dashboards for Jsn.sherman from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.

Hi @odimitrijevic @Milimetric @WDoranWMF @Ahoelzl could one of you please approve this request for analytics-privatedata-users? Thanks in advance!

Jun 12 2024, 6:12 PM · Data-Engineering, SRE, SRE-Access-Requests
herron updated the task description for T367295: Requesting access to private data-based dashboards for Jsn.sherman.
Jun 12 2024, 6:07 PM · Data-Engineering, SRE, SRE-Access-Requests
herron closed T360895: Memory upgrade request for prometheus200[56] as Resolved.

This was completed yesterday (during stashbot outage, this task unfortunately missed the !log)

Jun 12 2024, 1:45 PM · DC-Ops, SRE, ops-codfw, Observability-Metrics

Jun 11 2024

herron closed T367173: Requesting access to Kubernetes deployment for ebysans as Resolved.

The patch to provision this access has been merged, and will be fully propagated within the next 30 minutes. I'll transition this to resolved now, please reopen if any followup is needed. Thanks!

Jun 11 2024, 5:48 PM · SRE, Data-Engineering, SRE-Access-Requests
herron closed T365832: Requesting access to analytics-privatedata-users for Rae Adimer as Resolved.

The patch to provision this access has been merged and will fully propagate within the next 30 minutes. I'll transition this to resolved now, please re-open if any followup is needed. Thanks!

Jun 11 2024, 4:22 PM · SRE, SRE-Access-Requests
herron assigned T366032: Grant Access to nda/logstash for Sohom Datta to Soda.

Hi @Soda could you please coordinate obtaining a comment of support on this task from a sponsor as outlined in https://wikitech.wikimedia.org/wiki/Volunteer_NDA? Thanks!

Jun 11 2024, 4:09 PM · SRE, LDAP-Access-Requests
herron closed T365138: Grant Access to nda for Ricki Jay as Resolved.

Closing as this has been stalled for weeks. Please re-open if/when ready to proceed. Thanks!

Jun 11 2024, 4:05 PM · SRE, LDAP-Access-Requests
herron closed T366558: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE as Resolved.

The patch to provision this access has now been merged and will be fully deployed within the next 30 minutes. I'll transition this task to resolved now, please re-open if any followup is needed. Thanks!

Jun 11 2024, 3:56 PM · SRE, SRE-Access-Requests
herron updated the task description for T366558: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE .
Jun 11 2024, 3:12 PM · SRE, SRE-Access-Requests
herron updated the task description for T367173: Requesting access to Kubernetes deployment for ebysans.
Jun 11 2024, 3:02 PM · SRE, Data-Engineering, SRE-Access-Requests
herron updated the task description for T366558: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE .
Jun 11 2024, 3:01 PM · SRE, SRE-Access-Requests
herron closed T365574: Requesting access to analytics-privatedata-users for rickijay as Resolved.

The patch to provision this access has been merged, and will fully propagate within 30 minutes. I'll go ahead and transition this task to resolved now, please re-open if any followup is needed. Thanks!

Jun 11 2024, 1:18 PM · SRE, SRE-Access-Requests

Jun 10 2024

herron moved T365832: Requesting access to analytics-privatedata-users for Rae Adimer from Awaiting User Input to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.

Approving access from my end.

Jun 10 2024, 8:43 PM · SRE, SRE-Access-Requests
herron updated the task description for T365832: Requesting access to analytics-privatedata-users for Rae Adimer.
Jun 10 2024, 7:53 PM · SRE, SRE-Access-Requests
herron updated the task description for T365574: Requesting access to analytics-privatedata-users for rickijay.
Jun 10 2024, 7:48 PM · SRE, SRE-Access-Requests
herron assigned T366351: Requesting access to analytics-privatedata-users for Tchanders to JayCano.

Hi @JayCano, assigning to you for approval. Thanks!

Jun 10 2024, 7:47 PM · SRE, SRE-Access-Requests
herron moved T365832: Requesting access to analytics-privatedata-users for Rae Adimer from Manager/NDA Approval/Confirmation to Awaiting User Input on the SRE-Access-Requests board.

(SSH key verification email sent)

Jun 10 2024, 7:47 PM · SRE, SRE-Access-Requests
herron updated the task description for T366351: Requesting access to analytics-privatedata-users for Tchanders.
Jun 10 2024, 7:45 PM · SRE, SRE-Access-Requests
herron updated the task description for T365832: Requesting access to analytics-privatedata-users for Rae Adimer.
Jun 10 2024, 7:37 PM · SRE, SRE-Access-Requests
herron closed T364715: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer as Resolved.

Resolving as the access looks to have been provisioned, please reopen if any followup is needed. Thanks!

Jun 10 2024, 7:35 PM · Data-Engineering, SRE, SRE-Access-Requests
herron closed T364801: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for HNordeen as Resolved.

The patch to provision this access has been merged and will be propagated fully within the next 30 minutes. I'll transition this to resolved now, please reopen if any followup is needed. Thanks!

Jun 10 2024, 7:25 PM · SRE, SRE-Access-Requests
herron updated the task description for T364801: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for HNordeen.
Jun 10 2024, 6:57 PM · SRE, SRE-Access-Requests
herron moved T366558: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE from Manager/NDA Approval/Confirmation to Awaiting User Input on the SRE-Access-Requests board.

Hi @Ifrahkhanyaree_WMDE I see the SSH key in the description is in use already. Could you please generate a fresh ssh key for production use and update the task description with it? Details/rationale for this are outlined in https://wikitech.wikimedia.org/wiki/SRE/Production_access#Generating_your_SSH_key

Jun 10 2024, 6:56 PM · SRE, SRE-Access-Requests
herron updated the task description for T366558: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE .
Jun 10 2024, 6:55 PM · SRE, SRE-Access-Requests
herron updated the task description for T366558: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE .
Jun 10 2024, 6:55 PM · SRE, SRE-Access-Requests
herron added a comment to T360895: Memory upgrade request for prometheus200[56].

I got the ram out of the servers. I can do the addition tomorrow morning at 8am CDT (1600 UTC) or another time that works for you.

Jun 10 2024, 3:32 PM · DC-Ops, SRE, ops-codfw, Observability-Metrics

May 24 2024

herron updated the task description for T356386: Move all o11y services to discovery.wmnet.
May 24 2024, 3:23 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, Observability-Metrics

May 22 2024

herron added a comment to T356386: Move all o11y services to discovery.wmnet.

(from the task description)
pyrra (includes slo/slos)
I believe we could point these to thanos-query.discovery.wmnet right away; what do you think @herron ?

May 22 2024, 5:33 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, Observability-Metrics

May 14 2024

herron updated the task description for T302995: Transition to Pyrra for SLO Visualization and Management.
May 14 2024, 6:49 PM · Patch-For-Review, User-herron, Observability-Metrics
herron closed T364645: The pyrra-filesystem-notify-thanos.path fails on titan1001 after reaching the start limit as Resolved.

Deployed an SLO change to Pyrra just now and looking much better -- No alerts, and thanos rule was automatically reloaded shortly after pyrra filesystem 😎

May 14 2024, 6:19 PM · SRE Observability
herron closed T364645: The pyrra-filesystem-notify-thanos.path fails on titan1001 after reaching the start limit, a subtask of T302995: Transition to Pyrra for SLO Visualization and Management, as Resolved.
May 14 2024, 6:18 PM · Patch-For-Review, User-herron, Observability-Metrics
herron added a comment to T364645: The pyrra-filesystem-notify-thanos.path fails on titan1001 after reaching the start limit.

Thank you, indeed that's probably the issue I was vaguely remembering.

To clarify, I'm referring to passing --prometheus-url http://localhost:17902/rule/ to pyrra filesystem, not pyrra api; AFAICS the prometheus client in filesystem is used only for reload, so it should work as expected. The reload will work on titan hosts that currently run thanos-rule, and will fail on hosts that don't run it, which is benign.

May 14 2024, 2:42 PM · SRE Observability
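
A minimal sketch of the reload behaviour described above, assuming the target exposes the Prometheus-style `/-/reload` management endpoint under the given base URL. The helper names are hypothetical and this is not Pyrra's own code; it only shows how a failed reload on a host without thanos-rule can be treated as benign.

```python
from urllib.error import URLError
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def reload_url(base_url):
    """Join a base URL such as http://localhost:17902/rule/ with the reload path."""
    base = base_url if base_url.endswith("/") else base_url + "/"
    return urljoin(base, "-/reload")


def try_reload(base_url, timeout=5.0):
    """POST to the reload endpoint; return False instead of raising on failure."""
    request = Request(reload_url(base_url), data=b"", method="POST")
    try:
        with urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (URLError, OSError):
        # Nothing listening (for example a host that does not run thanos-rule):
        # treated as benign here, matching the comment above.
        return False


print(reload_url("http://localhost:17902/rule/"))  # http://localhost:17902/rule/-/reload
```

Returning a boolean rather than raising keeps the caller simple: a deployment script can log the result and carry on regardless of which hosts actually run thanos-rule.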

May 13 2024

herron added a comment to T364645: The pyrra-filesystem-notify-thanos.path fails on titan1001 after reaching the start limit.

A bit of a different route, though I can't remember if we have tried telling pyrra filesystem about "prometheus" being thanos-rule on localhost? i.e. --prometheus-url http://localhost:17902/rule/ ? In other words let pyrra filesystem effectively do the reload

May 13 2024, 2:48 PM · SRE Observability

May 10 2024

herron updated the task description for T302995: Transition to Pyrra for SLO Visualization and Management.
May 10 2024, 7:06 PM · Patch-For-Review, User-herron, Observability-Metrics
herron added a comment to T364645: The pyrra-filesystem-notify-thanos.path fails on titan1001 after reaching the start limit.

This started happening because I added a grouping workaround in the parent task where essentially the grouping is done by puppet instead of pyrra itself. It results in pyrra generating more output config files, e.g. now slo-$site.yaml vs what used to be slo.yaml

May 10 2024, 6:35 PM · SRE Observability
herron added a subtask for T302995: Transition to Pyrra for SLO Visualization and Management: T364645: The pyrra-filesystem-notify-thanos.path fails on titan1001 after reaching the start limit.
May 10 2024, 6:02 PM · Patch-For-Review, User-herron, Observability-Metrics
herron added a parent task for T364645: The pyrra-filesystem-notify-thanos.path fails on titan1001 after reaching the start limit: T302995: Transition to Pyrra for SLO Visualization and Management.
May 10 2024, 6:02 PM · SRE Observability
herron updated the task description for T364645: The pyrra-filesystem-notify-thanos.path fails on titan1001 after reaching the start limit.
May 10 2024, 6:00 PM · SRE Observability

May 9 2024

herron updated the task description for T302995: Transition to Pyrra for SLO Visualization and Management.
May 9 2024, 6:53 PM · Patch-For-Review, User-herron, Observability-Metrics

May 6 2024

herron updated the task description for T302995: Transition to Pyrra for SLO Visualization and Management.
May 6 2024, 8:12 PM · Patch-For-Review, User-herron, Observability-Metrics
herron updated the task description for T302995: Transition to Pyrra for SLO Visualization and Management.
May 6 2024, 5:00 PM · Patch-For-Review, User-herron, Observability-Metrics
herron updated the task description for T302995: Transition to Pyrra for SLO Visualization and Management.
May 6 2024, 5:00 PM · Patch-For-Review, User-herron, Observability-Metrics
herron updated the task description for T302995: Transition to Pyrra for SLO Visualization and Management.
May 6 2024, 3:07 PM · Patch-For-Review, User-herron, Observability-Metrics
herron renamed T302995: Transition to Pyrra for SLO Visualization and Management from Explore Pyrra for SLO Visualization and Management to Transition to Pyrra for SLO Visualization and Management.
May 6 2024, 3:06 PM · Patch-For-Review, User-herron, Observability-Metrics

May 3 2024

herron moved T355461: Add perccli support to smart_data_dump from Inbox to Radar on the SRE Observability board.
May 3 2024, 4:24 PM · Infrastructure-Foundations, SRE Observability
herron triaged T362137: Icinga secondary host is not monitored as Low priority.
May 3 2024, 4:23 PM · SRE Observability, Icinga, observability
herron moved T357630: Improve automation for the vendor maintenance calendar from Inbox to Radar on the observability board.
May 3 2024, 4:12 PM · SRE-tools, DC-Ops, observability, Infrastructure-Foundations
herron closed T363532: prometheus1006 prometheus-k8s logical volume running out of disk space as Resolved.

Reviewed fs utilization on codfw/eqiad prom hosts and grew their k8s and ops filesystems targeting ~85% free space each

May 3 2024, 4:09 PM · observability

Apr 30 2024

herron added a comment to T362239: Reformat IRC alerts to be more useful.

Still seeing two spaces after the status (e.g. FIRING:), although I'm not seeing a clear cause for that.

Apr 30 2024, 7:55 PM · Patch-For-Review, Observability-Alerting

Apr 29 2024

herron triaged T363753: Only select o11y-owned datasources on the Grafana Datasource utilization dashboard as Low priority.

Overall this dashboard is meant to show graphite utilization for the whole installation, so I think the thing to do is add filters to drill down as needed.

Apr 29 2024, 9:32 PM · Observability-Metrics
herron added a subtask for T350591: Audit legacy mediawiki stats used in production dashboards: T363753: Only select o11y-owned datasources on the Grafana Datasource utilization dashboard.
Apr 29 2024, 8:30 PM · SRE Observability (FY2023/2024-Q3), Patch-For-Review, Observability-Metrics
herron added a parent task for T363753: Only select o11y-owned datasources on the Grafana Datasource utilization dashboard: T350591: Audit legacy mediawiki stats used in production dashboards.
Apr 29 2024, 8:30 PM · Observability-Metrics
herron closed T363754: Update Grafana Graphite Datasource Utilization dashboard to indicate migration progress as Resolved.

Added two panels at the bottom of the dashboard to display count over time details using the time picker

Apr 29 2024, 8:02 PM · Observability-Metrics

Apr 26 2024

herron added a comment to T363532: prometheus1006 prometheus-k8s logical volume running out of disk space.

FWIW we recently added disk capacity to these hosts, with ~1T free in the VG. I've also made a note to discuss/plan next week with the team how best to allocate the additional space for the long term. In the meantime, should it fire again, it is safe to grow the LV again, though hopefully it won't be necessary.

Apr 26 2024, 2:45 PM · observability

Apr 25 2024

herron added a comment to P61217 difference in envoy config between prometheus2005 and prometheus3003.

fwiw this looks quite similar to the diff from the experimental patch where the global cert name was changed to a discovery.wmnet domain:

Apr 25 2024, 3:35 PM

Apr 23 2024

herron closed T360687: Memory upgrade request for prometheus100[56] as Resolved.

Thanks! Looks good!

Apr 23 2024, 5:11 PM · SRE, ops-eqiad, Observability-Metrics
herron added a comment to T360687: Memory upgrade request for prometheus100[56].

Prometheus1005 is down and depooled, any time works!

Apr 23 2024, 2:47 PM · SRE, ops-eqiad, Observability-Metrics

Apr 19 2024

herron reopened T360687: Memory upgrade request for prometheus100[56] as "Open".

Reopening -- today we experienced a memory issue on prometheus1005 which presumably relates to this maintenance. Could we arrange to swap the faulty DIMM outlined in T362990? Thanks in advance!

Apr 19 2024, 4:01 PM · SRE, ops-eqiad, Observability-Metrics

Apr 16 2024

herron added a comment to T362239: Reformat IRC alerts to be more useful.

FWIW I think the current alert text makes sense based on the premise that all alert recipients will/should know about how alerting system internals are structured.

Apr 16 2024, 3:35 PM · Patch-For-Review, Observability-Alerting

Apr 15 2024

herron awarded T246998: Enable SSO for Kibana a Party Time token.
Apr 15 2024, 5:06 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), Observability-Logging, SRE

Apr 11 2024

herron awarded T361251: titan100[12] ram/ssd upgrade coordination a Party Time token.
Apr 11 2024, 4:48 PM · DC-Ops, SRE Observability (FY2023/2024-Q4), SRE, observability, ops-eqiad

Apr 10 2024

herron updated subscribers of T362239: Reformat IRC alerts to be more useful.

While considering this I'd also like to propose moving the (alert name) to the end of message at the same time. For example:

Apr 10 2024, 4:43 PM · Patch-For-Review, Observability-Alerting

Apr 9 2024

herron awarded P60144 Bash function: open puppet repo path in Gerrit web UI a Cup of Joe token.
Apr 9 2024, 7:38 PM · SRE
herron added a comment to T361251: titan100[12] ram/ssd upgrade coordination.

Hey @VRiley-WMF, I'll help out with this one for the o11y side.

Apr 9 2024, 6:42 PM · DC-Ops, SRE Observability (FY2023/2024-Q4), SRE, observability, ops-eqiad

Apr 8 2024

herron updated subscribers of T349521: Prometheus/Pyrra: establish backfill process for recording rules.

With T352756, T359879 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1017873 in mind, I think it'd be worth spending some time here to work out a strategy for bringing backfilled metrics into production.

Apr 8 2024, 8:41 PM · Patch-For-Review, User-herron, Observability-Metrics

Apr 4 2024

herron moved T352756: Gap in metrics rendered from Thanos Rules from FY2023/2024-Q3 to FY2023/2024-Q4 on the SRE Observability board.
Apr 4 2024, 6:35 PM · SRE Observability (FY2024/2025-Q1), Observability-Metrics, Machine-Learning-Team
herron closed T353691: Reload thanos-rule on new pyrra rules deployed as Resolved.

I think we're in good shape here, please reopen if anything else is needed

Apr 4 2024, 6:33 PM · SRE Observability (FY2023/2024-Q3), User-herron, Observability-Metrics
herron closed T353691: Reload thanos-rule on new pyrra rules deployed, a subtask of T302995: Transition to Pyrra for SLO Visualization and Management, as Resolved.
Apr 4 2024, 6:33 PM · Patch-For-Review, User-herron, Observability-Metrics
herron added a comment to T359879: SLO dashboards for Lift Wing showing unexpected values.

@herron something really strange: https://w.wiki/9bMW

I compared the recording rule with the actual metric, trying to aggregate with the same labels, and the results of the recording rule are strange. I see both eqiad and codfw traffic reported, while the original metric shows mostly eqiad traffic. Anything that comes up to mind on Thanos that could cause this?

Apr 4 2024, 6:29 PM · Machine-Learning-Team, Observability-Metrics
herron added a comment to T361229: titan200[12] RAM/SSD upgrade coordination.

SSD and RAM upgrades have been installed thanks @Jhancock.wm!

Apr 4 2024, 3:55 PM · DC-Ops, SRE Observability (FY2023/2024-Q4), SRE, observability, ops-codfw
herron updated the task description for T361229: titan200[12] RAM/SSD upgrade coordination.
Apr 4 2024, 3:53 PM · DC-Ops, SRE Observability (FY2023/2024-Q4), SRE, observability, ops-codfw
herron updated the task description for T361229: titan200[12] RAM/SSD upgrade coordination.
Apr 4 2024, 3:11 PM · DC-Ops, SRE Observability (FY2023/2024-Q4), SRE, observability, ops-codfw