Gap in metrics rendered from Thanos Rules
Open, Medium · Public · 3 Estimated Story Points

Description

Hi folks!

While reviewing the Grizzly dashboards I noticed that sometimes the SLO/SLI graphs show big "holes" in the visualization, and I tracked them down to (what appear to be) missing datapoints in the recording rules themselves:

Screenshot from 2023-12-05 13-02-26.png

The above graphs are related to the Lift Wing Ores Legacy dashboard.

From https://w.wiki/8PJd it seems that the original istio metric for total requests (a counter) is populated, while the recording rule itself shows a gap of several days.

The performance of the recording rules seems good, so it shouldn't be a matter of the titan hosts lagging (in theory): https://thanos.wikimedia.org/rules#istio_slos
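For reference, a quick way to check where the recorded series actually has samples is to count datapoints per hour over the suspect window via the query API. This is a rough sketch, assuming the standard Prometheus query_range API is reachable on thanos.wikimedia.org and that the recorded series is named istio_sli_availability_requests_total:increase5m (substitute the actual recording rule output name):

# Count samples per hour for the recorded series; steps with no value in the output
# correspond to hours in which the recording rule produced no samples at all.
THANOS=https://thanos.wikimedia.org
RECORDED='istio_sli_availability_requests_total:increase5m'
curl -sG "${THANOS}/api/v1/query_range" \
  --data-urlencode "query=count_over_time(${RECORDED}[1h])" \
  --data-urlencode "start=$(date -d 2023-11-15 +%s)" \
  --data-urlencode "end=$(date -d 2023-12-06 +%s)" \
  --data-urlencode "step=1h" | jq '.data.result[0].values | length'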

Any idea?

Event Timeline

calbon set the point value for this task to 3.
calbon moved this task from Unsorted to Ready To Go on the Machine-Learning-Team board.

@elukey could the gaps possibly be attributed to changes in the queries/recording rules? I know we've been through a few iterations for these SLOs, is it possible the recording rule simply wasn't in place for that gap period?

Also, while not the root cause here, T349521 is worth linking since it will be helpful for remediating these sorts of gaps.
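For context, one generic way to remediate gaps like these (not necessarily what T349521 proposes) is to re-evaluate the recording rules over the gap window against the still-intact raw counters and backfill the result with promtool. A sketch only; the rule file path, endpoint URL, and dates are placeholders:

# Produce TSDB blocks by evaluating the recording rules over the gap window against
# an endpoint that still has the raw istio_requests_total data.
promtool tsdb create-blocks-from rules \
  --start 2023-11-20T00:00:00Z \
  --end 2023-12-06T00:00:00Z \
  --url http://localhost:9090 \
  --output-dir ./backfill-data \
  /srv/prometheus/rules/istio_slos.yaml
# The blocks under ./backfill-data then have to be made available to the query path
# (e.g. imported into a Prometheus TSDB or uploaded to the Thanos object store);
# the details depend on the setup.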

@herron this is the list of changes:

The last two happened on Nov 20th, and they may line up with the start of the hole for ores-legacy. No idea, though, why the recording rules started producing data again in December...

It is interesting that other istio-based dashboard SLIs show similar, but not identical, behavior:

https://w.wiki/8QbA

image.png

In this case the hole starts from the 22nd (more or less), but on the 20th there is a change in pattern (which was expected, since the last couple of changes were aimed at reducing the number of time series returned).

Reloads on titan1001:

elukey@titan1001:~$ sudo journalctl -u thanos-rule.service | grep -i 'msg="reload rule files"'| awk '{print $1" "$2" "$3}'
Nov 13 09:56:48
Nov 14 15:25:41
Nov 14 16:07:25
Nov 14 17:26:29
Nov 15 16:18:43
Nov 15 16:21:05
Nov 16 10:09:06
Nov 16 10:26:26
Nov 16 17:26:25
Nov 17 13:26:42
Nov 20 13:18:25
Nov 20 13:18:40
Nov 20 13:56:22
Nov 20 15:26:47
Nov 23 08:53:48
Nov 23 09:26:48
Nov 23 14:56:21
Nov 23 15:26:42
Nov 24 08:56:19
Nov 29 08:56:56
Nov 29 10:26:34
Nov 29 11:56:00
Nov 29 12:26:57
Nov 29 14:15:16
Nov 29 14:57:01
Nov 29 17:01:48
Nov 29 17:03:19
Nov 30 09:27:08
Nov 30 15:26:38
Nov 30 15:58:53
Nov 30 16:09:15
Nov 30 16:30:26
Nov 30 17:41:52
Dec 01 16:26:42
Dec 04 13:26:21
Dec 06 14:27:00

I tried to compare the various graphs for ores-legacy in https://w.wiki/8QkW

I am very puzzled by the increase() one, since I am not sure why we don't have any data visualized before the 23rd. The underlying counters seem to be fine, with no gaps in them.
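To narrow down whether this is a rule-evaluation problem or a data problem, it may help to run the same kind of increase() expression ad hoc over the affected window and compare it with the recorded series. A rough sketch; the query endpoint and the destination_canonical_service value are assumptions, so adjust them to the actual ores-legacy query:

# Ad-hoc evaluation of the raw counter over the window where the recorded series is empty;
# a non-zero result count means the gap is specific to the recording rule output.
THANOS=https://thanos.wikimedia.org
curl -sG "${THANOS}/api/v1/query_range" \
  --data-urlencode 'query=sum by (site) (increase(istio_requests_total{destination_canonical_service="ores-legacy",kubernetes_namespace="istio-system"}[5m]))' \
  --data-urlencode "start=$(date -d 2023-11-15 +%s)" \
  --data-urlencode "end=$(date -d 2023-11-23 +%s)" \
  --data-urlencode "step=5m" | jq '.data.result | length'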

I'm seeing a gap between Nov 27 and Dec 10.

Within that time frame on titan1001 I do see several variations of query errors like this (still occurring):

Dec 12 11:21:25 titan1001 thanos-rule[965]: level=error ts=2023-12-12T11:21:25.980107596Z caller=rule.go:833 component=rules err="read query instant response: expected 2xx response, got 422. Body: {\"status\":\"error\",\"errorType\":\"execution\",\"error\":\"expanding series: proxy Series(): rpc error: code = Aborted desc = receive series from Addr: prometheus2006:29906 LabelSets: {prometheus=\\\"k8s\\\", replica=\\\"b\\\", site=\\\"codfw\\\"} Mint: 1701084061647 Maxt: 9223372036854775807: rpc error: code = Unknown desc = query Prometheus: request failed with code 503 Service Unavailable; msg Service Unavailable\"}\n" query="sum by (site) (rate(istio_requests_total{destination_canonical_service=\"enwiki-articlequality-predictor-default\",kubernetes_namespace=\"istio-system\",response_code=~\"5..\"}[12d])) / sum by (site) (rate(istio_requests_total{destination_canonical_service=\"enwiki-articlequality-predictor-default\",kubernetes_namespace=\"istio-system\"}[12d]))"

and discards like this:

name=istio_request_duration_milliseconds:increase12w index=1 msg="Rule evaluation result discarded" err="duplicate sample for timestamp"
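To see whether these errors line up with the gap, one option is to count them per day directly on the titan hosts, using the same journalctl approach as above (a sketch):

# Count discarded rule results and failed instant queries per day in the thanos-rule logs.
sudo journalctl -u thanos-rule.service --since "2023-11-20" \
  | grep -Ei 'Rule evaluation result discarded|read query instant response' \
  | awk '{print $1" "$2}' | sort | uniq -c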

And we also had a change to thanos-rule via pyrra on the 27th with https://gerrit.wikimedia.org/r/c/operations/puppet/+/977738

Offhand I'm thinking our changes to thanos-rule introduced these errors/drops and we didn't catch them right away.

Thanos rule hasn't logged a discard error since Dec 1st (re: correlation, does that date ring any bells?), but I do still see query errors occurring even today, I think due to long-running queries.

Change 983179 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::pyrra::filesystem: remove Lift Wing Pilot

https://gerrit.wikimedia.org/r/983179

Change 983179 merged by Elukey:

[operations/puppet@production] profile::pyrra::filesystem: remove Lift Wing Pilot

https://gerrit.wikimedia.org/r/983179

@herron I removed the Pyrra configs to see if anything changes, but IIRC we had similar gaps even before the Pyrra pilots, so I suspect something else is going on. I rechecked https://w.wiki/8QkW and I see more holes in the recording rules...

Change 983421 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::thanos: remove Pyrra recording rule for Istio

https://gerrit.wikimedia.org/r/983421

Change 983421 merged by Elukey:

[operations/puppet@production] profile::thanos: remove Pyrra recording rule for Istio

https://gerrit.wikimedia.org/r/983421

Summary: we removed the Pyrra config (related to Lift Wing) that caused some errors in the Thanos logs, to eliminate a variable and see if the gaps disappear. There is currently no strong lead on what may be causing them, so we'll have to keep investigating. I'll be away for a couple of months, so I won't be able to help more :(

Rechecking this: after the usual time window update for the quarter, I see that we don't have metrics from June 1st to approximately the 8th:

https://grafana.wikimedia.org/dashboard/snapshot/MEfeYAWsKvOk8rA5SY4wrfWY27d1j7s1?orgId=1

Not sure why; I'll keep checking whether the holes appear in other timeframes as well.

I also checked if the _v2 recording rules were making a difference, but so far it seems the issue is the same in both versions.

I tried to add more granularity to the Thanos Rule Grafana UI, to be able to drill down into each rule separately. This is the link for the istio-slo rule group.

Up to the 14th it seems that there were a lot of missed rule evaluations, and the rule group last duration was significantly reduced over the last month. In the dashboards related to istio SLOs (like the one linked in my previous comment) I do see a big hole of several days, from roughly the 3rd to the 14th of this month, but I can't find a clear clue in the Thanos dashboard metrics.
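For the record, the "rule evaluations missed" and "rule group last duration" panels presumably map to the standard rules-manager metrics exposed by thanos-rule; something along these lines should show them for the istio group specifically. A sketch; the rule_group label value is an assumption, so it is matched loosely:

# Missed iterations and last evaluation duration for rule groups whose name contains "istio".
THANOS=https://thanos.wikimedia.org
for METRIC in prometheus_rule_group_iterations_missed_total prometheus_rule_group_last_duration_seconds; do
  curl -sG "${THANOS}/api/v1/query" \
    --data-urlencode "query=${METRIC}{rule_group=~\".*istio.*\"}" | jq '.data.result'
done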

@herron something interesting:

Screenshot from 2024-06-21 15-18-24.png

Screenshot from 2024-06-21 15-18-53.png

Screenshot from 2024-06-21 15-18-37.png

I checked multiple recording rules, and the metrics gap is consistent across them, starting on June 6th at around 11:37 UTC. I checked various daemons on the titan nodes and nothing clearly pops up; there are some errors, but I don't find clear correlations.
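For reference, a sketch of the kind of check that can be run on the titan hosts to pull what thanos-rule logged around the start of the gap (times in UTC; the same can be repeated for the other prometheus/thanos units):

# Warnings and errors logged by thanos-rule in a two-hour window around the gap start.
sudo journalctl -u thanos-rule.service --utc \
  --since "2024-06-06 11:00" --until "2024-06-06 13:00" \
  | grep -iE 'level=(warn|error)'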

On the 6th, in the thanos-rule logs, I see a ton of errors while connecting to the Prometheus nodes, and around 14:00 UTC a reload was issued on both active titan nodes.

On titan1001 these are the reloads for thanos-rule:

Jun 06 13:57:03 titan1001 systemd[1]: Reloading thanos-rule.service - Thanos rule...
Jun 11 14:56:06 titan1001 systemd[1]: Reloading thanos-rule.service - Thanos rule...
Jun 12 09:26:39 titan1001 systemd[1]: Reloading thanos-rule.service - Thanos rule...
Jun 13 09:57:03 titan1001 systemd[1]: Reloading thanos-rule.service - Thanos rule...
Jun 13 21:27:11 titan1001 systemd[1]: Reloading thanos-rule.service - Thanos rule...
Jun 14 07:56:10 titan1001 systemd[1]: Reloading thanos-rule.service - Thanos rule...
Jun 17 11:56:38 titan1001 systemd[1]: Reloading thanos-rule.service - Thanos rule...
Jun 19 09:26:25 titan1001 systemd[1]: Reloading thanos-rule.service - Thanos rule...
Jun 21 13:56:23 titan1001 systemd[1]: Reloading thanos-rule.service - Thanos rule...

Multiple reloads happened, so I doubt that a single reload is sufficient to clear the issue (if the issue is an inconsistent state).

Having a look through the SAL for changes in that time frame, Jun 6th was also when all prom and thanos-be nodes were rebooted for updates. That should explain the flurry of errors in the Thanos logs, but I'm still not clear on the cause of a multi-day gap.

It occurred to me that deploying absent-metric alerts for the metrics where we're seeing gaps would be a reasonable next step. That would let us troubleshoot gaps in (or closer to) their broken state, which should help us better understand the issue and the steps to resolve it manually. Plus, it would of course help us respond faster and shrink the gaps. I'll work on a patch.
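To illustrate the idea (the actual rule will live in operations/alerts and go through CI/pint; the metric name, duration, and labels below are placeholders, not the final patch):

# Hypothetical absent-metric alert sketch, validated locally with promtool.
cat > /tmp/istio_sli_absent_sketch.yaml <<'EOF'
groups:
  - name: istio_sli_absent
    rules:
      - alert: IstioSLIMetricAbsent
        # Fires if the recording rule stops producing samples for an hour.
        expr: absent(istio_sli_availability_requests_total:increase5m)
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "istio SLI recording rule has stopped producing samples"
EOF
promtool check rules /tmp/istio_sli_absent_sketch.yaml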

Change #1052784 had a related patch set uploaded (by Herron; author: Herron):

[operations/alerts@master] istio_sli_avail: alert if metric goes absent

https://gerrit.wikimedia.org/r/1052784

Change #1052784 merged by jenkins-bot:

[operations/alerts@master] istio_sli_avail: alert if metric goes absent

https://gerrit.wikimedia.org/r/1052784

This has been causing alerts since deployment:

2024-07-16-115144_1044x909_scrot.png

And pint validation failures too, via the AlertLintProblem alert:

description: Pint reporter promql/series found problem(s) in /srv/alerts/ops/team-o11y_sli.yaml: prometheus "ops" at http://127.0.0.1:9900/ops didn't have any series for "istio_sli_availability_requests_total:increase5m" metric in the last 1w
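For reference, the same promql/series check can be reproduced locally with pint against the ops Prometheus instance; a sketch, assuming pint is installed and its config file points a prometheus block at the URL from the error above (the config path here is a placeholder):

# Lint the SLI alert file; the promql/series check queries the configured Prometheus
# (http://127.0.0.1:9900/ops in the error above) for the recorded metric.
pint --config /etc/pint.hcl lint /srv/alerts/ops/team-o11y_sli.yaml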