User Details
- User Since: May 30 2017, 5:25 PM (372 w, 3 d)
- Availability: Available
- IRC Nick: herron
- LDAP User: Herron
- MediaWiki User: Unknown
Today
Had a closer look into the threshold tunables and I don't actually see a way to change this natively within Pyrra. As-is, the "for" duration of SLOMetricAbsent alerts is 6m. Pyrra has options to enable/disable the absent alert, but maybe we can configure something to put this alert in a silence/inhibit waiting room for 30-60m before it alerts (or recovers on its own).
Yesterday
SLOMetricAbsent for dead_letters_hits, varnish_sli_bad, trafficserver_backend_sli_bad and haproxy_sli_bad occurred today, at two different times.
Mon, Jul 15
Thu, Jul 11
Reviewing thanos-rule logs, I'm seeing related discards with err="out of order sample".
Mon, Jul 8
It occurred to me that deploying absent metric alerts for the metrics where we're seeing gaps would be a reasonable next step. That'd let us troubleshoot gaps in/closer to their broken state which should help toward better understanding the issue and steps to resolve manually. Plus, it'd of course help us respond faster and shrink the gaps. I'll work on a patch.
Wed, Jul 3
Proposal SGTM for the near-term. Thank you for organizing this!
Mon, Jul 1
Tue, Jun 25
Buster is looking fine with this deb as well. So I've gone ahead and uploaded 1.8.0 to bookworm-wikimedia, bullseye-wikimedia, and buster-wikimedia. Up next is a small canary upgrade before fully rolling out.
Nice, thanks for the pointer! It looks like export CGO_ENABLED=0 does the right thing. At least, with this set the package builds and installs successfully on my bullseye test host.
Mon, Jun 24
Fri, Jun 21
Quick update: prometheus-ipmi-exporter-1.8.0 was a straightforward backport for bookworm https://gitlab.wikimedia.org/repos/sre/prometheus-ipmi-exporter/-/jobs/291532
Thu, Jun 20
Sounds like a plan!
Jun 18 2024
Looking longer-term, I think it'd be generally worthwhile to support more than baseline blackbox checks on the mgmt interfaces, and I'm personally open to exploring something like the redfish exporter. But I think this would be a medium to large sized project since AIUI it will involve a decent sized chunk of setup effort on the hardware itself, and offhand I'm also not sure what percentage of hw in the fleet will support the approach today. I'm also assuming we would run into some hw vendor oddities/bugs along the way.
Jun 17 2024
Jun 14 2024
done
Jun 13 2024
Group membership has been provisioned, thanks!
Jun 12 2024
Hi @odimitrijevic @Milimetric @WDoranWMF @Ahoelzl could one of you please approve this request for analytics-privatedata-users? Thanks in advance!
This was completed yesterday (during stashbot outage, this task unfortunately missed the !log)
Jun 11 2024
The patch to provision this access has been merged, and will be fully propagated within the next 30 minutes. I'll transition this to resolved now, please reopen if any followup is needed. Thanks!
The patch to provision this access has been merged and will fully propagate within the next 30 minutes. I'll transition this to resolved now, please re-open if any followup is needed. Thanks!
Hi @Soda could you please coordinate obtaining a comment of support on this task from a sponsor as outlined in https://wikitech.wikimedia.org/wiki/Volunteer_NDA? Thanks!
Closing as this has been stalled for weeks. Please re-open if/when ready to proceed. Thanks!
The patch to provision this access has now been merged and will be fully deployed within the next 30 minutes. I'll transition this task to resolved now, please re-open if any followup is needed. Thanks!
The patch to provision this access has been merged, and will fully propagate within 30 minutes. I'll go ahead and transition this task to resolved now, please re-open if any followup is needed. Thanks!
Jun 10 2024
Hi @JayCano, assigning to you for approval. Thanks!
(SSH key verification email sent)
Resolving as the access looks to have been provisioned, please reopen if any followup is needed. Thanks!
The patch to provision this access has been merged and will be propagated fully within the next 30 minutes. I'll transition this to resolved now, please reopen if any followup is needed. Thanks!
Hi @Ifrahkhanyaree_WMDE I see the SSH key in the description is in use already. Could you please generate a fresh ssh key for production use and update the task description with it? Details/rationale for this are outlined in https://wikitech.wikimedia.org/wiki/SRE/Production_access#Generating_your_SSH_key
May 24 2024
May 22 2024
(from the task description)
pyrra (includes slo/slos)
I believe we could point these to thanos-query.discovery.wmnet right away; what do you think @herron?
May 14 2024
Deployed an SLO change to Pyrra just now and looking much better -- No alerts, and thanos rule was automatically reloaded shortly after pyrra filesystem 😎
May 13 2024
May 10 2024
This started happening because I added a grouping workaround in the parent task where essentially the grouping is done by puppet instead of pyrra itself. It results in pyrra generating more output config files, e.g. now slo-$site.yaml vs what used to be slo.yaml
May 9 2024
May 6 2024
May 3 2024
Reviewed fs utilization on codfw/eqiad prom hosts and grew their k8s and ops filesystems targeting ~85% free space each
Apr 30 2024
Still seeing two spaces after the status (e.g. FIRING:), although not seeing a clear cause for that.
Apr 29 2024
Overall this dashboard is meant to show graphite utilization for the whole installation, so I think the thing to do is add filters to drill down as needed.
Added two panels at the bottom of the dashboard to display count over time details using the time picker
Apr 26 2024
FWIW we recently added disk capacity to these hosts, with ~1T free in the VG. I've also made a note to discuss/plan next week with the team how best to allocate the additional space for the long term. In the meantime, should it fire again it is safe to grow the LV again, though hopefully it won't be necessary.
Apr 25 2024
fwiw this looks quite similar to the diff from the experimental patch where the global cert name was changed to a discovery.wmnet domain:
Apr 23 2024
Thanks! Looks good!
Prometheus1005 is down and depooled, any time works!
Apr 19 2024
Reopening -- today we experienced a memory issue on prometheus1005 which presumably relates to this maintenance. Could we arrange to swap the faulty DIMM outlined in T362990? Thanks in advance!
Apr 16 2024
FWIW I think the current alert text makes sense based on the premise that all alert recipients will/should know about how alerting system internals are structured.
Apr 15 2024
Apr 11 2024
Apr 10 2024
While considering this I'd also like to propose moving the (alert name) to the end of message at the same time. For example:
Apr 9 2024
Hey @VRiley-WMF, I'll help out with this one for the o11y side.
Apr 8 2024
With T352756, T359879 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1017873 in mind, I think it'd be worth spending some time here to work out a strategy for bringing backfilled metrics into production.
Apr 4 2024
I think we're in good shape here, please reopen if anything else is needed.
SSD and RAM upgrades have been installed, thanks @Jhancock.wm!