Port most/all Icinga checks to Prometheus/Alertmanager
Open, Needs Triage, Public

Description

This is a tracking task for the general work of moving alerts from Icinga to Prometheus/Alertmanager.

Note that the title says "most" because, while the ideal end goal is to migrate all alerts (and thus shut down Icinga), that might be impractical and/or too much effort relative to the gains.

On a pragmatic level, though, what we can do is reduce Icinga's scope over time and turn it into a "backend" component. In this scenario, for example, we would stop using Icinga's web UI for all/most operations and delegate all functionality to AM / alerts.w.o

Event Timeline

Have we thought about creating a small middleware that converts Nagios check output into Prometheus-scrapable metrics, perhaps with a memory or disk cache for long-running checks or checks that should only run once per hour? I checked and could not find anything existing that does that. Note that I am talking about Nagios checks, without Icinga; I know a scraper for the Icinga service itself is already in place.

While this would not be ideal in many cases (native solutions would be preferred), it would avoid headaches like T315866#8194791, where a far inferior metrics-based solution is proposed as a substitute for a proper, long-standing Icinga check. Instead, the check's specific logic would be reused on a better management system. A single job would scrape all Icinga-based checks for a host and aggregate them into Prometheus metrics, including the error text. That would allow us to fully replace Icinga itself while keeping the custom alert logic, all consolidated in Prometheus. It would also solve the problem of many upstream solutions having no room for a few custom WMF-specific checks, leading the way to an Icinga-free WMF.
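To make the idea concrete, here is a minimal sketch of the conversion step only, assuming a check follows the standard Nagios plugin conventions (exit code 0-3, first output line of the form "text | perfdata"). The metric names nagios_check_state, nagios_check_info and nagios_check_perfdata and the function names are hypothetical, chosen for illustration; they are not taken from nrpe_exporter or any existing WMF tooling. Running the check and caching its result are left out.

```python
import re
import shlex

# Standard Nagios plugin exit codes.
NAGIOS_STATES = {0: "ok", 1: "warning", 2: "critical", 3: "unknown"}

# Leading numeric part of a perfdata value, e.g. "2643" in "2643MB;5948;5958".
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def escape_label(value):
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def parse_perfdata(perfdata):
    """Return {label: numeric string} from a Nagios perfdata string."""
    try:
        # shlex handles quoted labels such as 'disk usage'=5MB
        tokens = shlex.split(perfdata)
    except ValueError:
        tokens = perfdata.split()
    values = {}
    for token in tokens:
        label, sep, rest = token.partition("=")
        match = NUMBER_RE.match(rest)
        if sep and label and match:
            values[label] = match.group(0)
    return values


def nagios_to_prometheus(check, exit_code, output):
    """Render one check result as Prometheus text exposition lines.

    Metric names here are hypothetical, not an existing convention.
    """
    stripped = output.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    text, _, perfdata = first_line.partition("|")
    # Anything outside 0-3 is treated as UNKNOWN.
    state = exit_code if exit_code in NAGIOS_STATES else 3
    name = escape_label(check)
    lines = [
        f'nagios_check_state{{check="{name}"}} {state}',
        # The error text travels as a label on an info-style metric.
        f'nagios_check_info{{check="{name}",output="{escape_label(text.strip())}"}} 1',
    ]
    for label, value in parse_perfdata(perfdata).items():
        lines.append(
            f'nagios_check_perfdata{{check="{name}",label="{escape_label(label)}"}} {value}'
        )
    return "\n".join(lines) + "\n"
```

One caveat with this sketch: carrying the error text as a label value creates a new time series every time the text changes, so a real implementation would probably need to truncate or normalize it.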

The closest I can think of is nrpe_exporter (https://github.com/canonical/nrpe_exporter), which is certainly something we can consider!

Nice. I see this still uses the NRPE daemon.

I think in general this shouldn't be plan A for any migration. But for complex cases like the one I mention (WMF-specific behaviour potentially not found upstream), or others where there are no longer maintainers around to do the right thing, I am sure we could use this or something similar and migrate the Puppet class to it, eventually getting rid of Icinga itself (which I think we all agree is not a great alert manager).

For example, when I wrote check_bacula.py from scratch, I implemented both the Nagios output format and a Prometheus exporter daemon, but I guess there may be very old, small checks that would take a lot of time to migrate to proper exporters.
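As an illustration of the dual-output approach, the sketch below evaluates a check once and renders the same result either as a Nagios status line with exit code or as Prometheus text. It is not the actual check_bacula.py code: the backup-freshness logic, the backup_check metric prefix and all function names are hypothetical, and a real exporter daemon would serve the Prometheus text over HTTP rather than return a string.

```python
from dataclasses import dataclass, field
from typing import Dict

STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


@dataclass
class CheckResult:
    name: str      # metric prefix, hypothetical
    status: int    # Nagios exit code, 0-3
    message: str
    values: Dict[str, int] = field(default_factory=dict)


def evaluate_backup_freshness(fresh, stale):
    """Hypothetical check logic: any stale backup is critical."""
    if stale > 0:
        status, message = 2, f"{stale} stale backups"
    else:
        status, message = 0, "all backups fresh"
    return CheckResult("backup_check", status, message,
                       {"fresh": fresh, "stale": stale})


def render_nagios(result):
    """Return (status line, exit code) in Nagios plugin format."""
    line = f"{STATUS_NAMES[result.status]}: {result.message}"
    perfdata = " ".join(f"{k}={v}" for k, v in sorted(result.values.items()))
    if perfdata:
        line += f" | {perfdata}"
    return line, result.status


def render_prometheus(result):
    """Return the same result in Prometheus text exposition format."""
    lines = [f"{result.name}_status {result.status}"]
    for key, value in sorted(result.values.items()):
        lines.append(f"{result.name}_{key} {value}")
    return "\n".join(lines) + "\n"
```

Keeping the evaluation separate from both renderers means the alert logic lives in one place, so the Nagios entry point can be dropped later without touching it.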

Change 991801 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] icinga: remove legacy check_nagios_paging

https://gerrit.wikimedia.org/r/991801

Change 991801 merged by Filippo Giunchedi:

[operations/puppet@production] icinga: remove legacy check_nagios_paging

https://gerrit.wikimedia.org/r/991801