Gap in metrics rendered from Thanos Rules
Open, Medium · Public · 3 Estimated Story Points

Description

Hi folks!

While reviewing the Grizzly dashboards I noticed that sometimes the SLO/SLI graphs show big "holes" in the visualization, and I tracked them down to (what appear to be) missing datapoints in the recording rules themselves:

Screenshot from 2023-12-05 13-02-26.png

The above graphs are related to the Lift Wing Ores Legacy dashboard.

From https://w.wiki/8PJd it seems that the original istio metric for total requests (a counter) is populated, while the recording rule itself shows a gap of several days.

The performance of the recording rules seems good, so it shouldn't be a matter of the titan hosts lagging (in theory): https://thanos.wikimedia.org/rules#istio_slos
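For reference, a quick way to check where the recorded series actually has samples is to count datapoints per hour over the suspect window via the query API. This is a rough sketch, assuming the standard Prometheus query_range API is reachable on thanos.wikimedia.org and that the recorded series is named istio_sli_availability_requests_total:increase5m (substitute the actual recording rule output name):

# Count samples per hour for the recorded series; steps with no value in the output
# correspond to hours in which the recording rule produced no samples at all.
THANOS=https://thanos.wikimedia.org
RECORDED='istio_sli_availability_requests_total:increase5m'
curl -sG "${THANOS}/api/v1/query_range" \
  --data-urlencode "query=count_over_time(${RECORDED}[1h])" \
  --data-urlencode "start=$(date -d 2023-11-15 +%s)" \
  --data-urlencode "end=$(date -d 2023-12-06 +%s)" \
  --data-urlencode "step=1h" | jq '.data.result[0].values | length'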

Any idea?

Event Timeline

calbon set the point value for this task to 3.
calbon moved this task from Unsorted to Ready To Go on the Machine-Learning-Team board.

@elukey could the gaps possibly be attributed to changes in the queries/recording rules? I know we've been through a few iterations for these SLOs, is it possible the recording rule simply wasn't in place for that gap period?

Also, while not the root cause here, T349521 is worth linking since it will be helpful for remediating these sorts of gaps.
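For context, one generic way to remediate gaps like these (not necessarily what T349521 proposes) is to re-evaluate the recording rules over the gap window against the still-intact raw counters and backfill the result with promtool. A sketch only; the rule file path, endpoint URL, and dates are placeholders:

# Produce TSDB blocks by evaluating the recording rules over the gap window against
# an endpoint that still has the raw istio_requests_total data.
promtool tsdb create-blocks-from rules \
  --start 2023-11-20T00:00:00Z \
  --end 2023-12-06T00:00:00Z \
  --url http://localhost:9090 \
  --output-dir ./backfill-data \
  /srv/prometheus/rules/istio_slos.yaml
# The blocks under ./backfill-data then have to be made available to the query path
# (e.g. imported into a Prometheus TSDB or uploaded to the Thanos object store);
# the details depend on the setup.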

@herron this is the list of changes:

The last two happened on Nov 20th, and they may line up with the start of the hole for ores-legacy. No idea, though, why the recording rules started producing data again in December...

It is interesting that other istio-based dashboard SLIs show similar, but not identical, behavior:

https://w.wiki/8QbA

image.png

In this case the hole starts from the 22nd (more or less), but on the 20th there is a change in pattern (which was expected, since the last couple of changes were aimed at reducing the number of time series returned).

Reloads on titan1001:

elukey@titan1001:~$ sudo journalctl -u thanos-rule.service | grep -i 'msg="reload rule files"'| awk '{print $1" "$2" "$3}'
Nov 13 09:56:48
Nov 14 15:25:41
Nov 14 16:07:25
Nov 14 17:26:29
Nov 15 16:18:43
Nov 15 16:21:05
Nov 16 10:09:06
Nov 16 10:26:26
Nov 16 17:26:25
Nov 17 13:26:42
Nov 20 13:18:25
Nov 20 13:18:40
Nov 20 13:56:22
Nov 20 15:26:47
Nov 23 08:53:48
Nov 23 09:26:48
Nov 23 14:56:21
Nov 23 15:26:42
Nov 24 08:56:19
Nov 29 08:56:56
Nov 29 10:26:34
Nov 29 11:56:00
Nov 29 12:26:57
Nov 29 14:15:16
Nov 29 14:57:01
Nov 29 17:01:48
Nov 29 17:03:19
Nov 30 09:27:08
Nov 30 15:26:38
Nov 30 15:58:53
Nov 30 16:09:15
Nov 30 16:30:26
Nov 30 17:41:52
Dec 01 16:26:42
Dec 04 13:26:21
Dec 06 14:27:00

I tried to compare the various graphs for ores-legacy in https://w.wiki/8QkW

I am very puzzled by the increase() one, since I am not sure why we don't have any data visualized before the 23rd. The underlying counters seem to be fine, with no gaps in them.
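To narrow down whether this is a rule-evaluation problem or a data problem, it may help to run the same kind of increase() expression ad hoc over the affected window and compare it with the recorded series. A rough sketch; the query endpoint and the destination_canonical_service value are assumptions, so adjust them to the actual ores-legacy query:

# Ad-hoc evaluation of the raw counter over the window where the recorded series is empty;
# a non-zero result count means the gap is specific to the recording rule output.
THANOS=https://thanos.wikimedia.org
curl -sG "${THANOS}/api/v1/query_range" \
  --data-urlencode 'query=sum by (site) (increase(istio_requests_total{destination_canonical_service="ores-legacy",kubernetes_namespace="istio-system"}[5m]))' \
  --data-urlencode "start=$(date -d 2023-11-15 +%s)" \
  --data-urlencode "end=$(date -d 2023-11-23 +%s)" \
  --data-urlencode "step=5m" | jq '.data.result | length'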

I'm seeing a gap between Nov 27 and Dec 10.

Within that time frame on titan1001 I do see several variations of query errors like this (still occurring):

Dec 12 11:21:25 titan1001 thanos-rule[965]: level=error ts=2023-12-12T11:21:25.980107596Z caller=rule.go:833 component=rules err="read query instant response: expected 2xx response, got 422. Body: {\"status\":\"error\",\"errorType\":\"execution\",\"error\":\"expanding series: proxy Series(): rpc error: code = Aborted desc = receive series from Addr: prometheus2006:29906 LabelSets: {prometheus=\\\"k8s\\\", replica=\\\"b\\\", site=\\\"codfw\\\"} Mint: 1701084061647 Maxt: 9223372036854775807: rpc error: code = Unknown desc = query Prometheus: request failed with code 503 Service Unavailable; msg Service Unavailable\"}\n" query="sum by (site) (rate(istio_requests_total{destination_canonical_service=\"enwiki-articlequality-predictor-default\",kubernetes_namespace=\"istio-system\",response_code=~\"5..\"}[12d])) / sum by (site) (rate(istio_requests_total{destination_canonical_service=\"enwiki-articlequality-predictor-default\",kubernetes_namespace=\"istio-system\"}[12d]))"

and discards like this:

name=istio_request_duration_milliseconds:increase12w index=1 msg="Rule evaluation result discarded" err="duplicate sample for timestamp"
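To see whether these errors line up with the gap, one option is to count them per day directly on the titan hosts, using the same journalctl approach as above (a sketch):

# Count discarded rule results and failed instant queries per day in the thanos-rule logs.
sudo journalctl -u thanos-rule.service --since "2023-11-20" \
  | grep -Ei 'Rule evaluation result discarded|read query instant response' \
  | awk '{print $1" "$2}' | sort | uniq -c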

And we also had a change to thanos-rule via pyrra on the 27th with https://gerrit.wikimedia.org/r/c/operations/puppet/+/977738

Offhand I'm thinking our changes to thanos-rule introduced these errors/drops and we didn't catch them right away.

Thanos rule hasn't logged a discard error since Dec 1st (re: correlation, does that date ring any bells?), but I do still see query errors occurring even today, I think due to long-running queries.

Change 983179 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::pyrra::filesystem: remove Lift Wing Pilot

https://gerrit.wikimedia.org/r/983179

Change 983179 merged by Elukey:

[operations/puppet@production] profile::pyrra::filesystem: remove Lift Wing Pilot

https://gerrit.wikimedia.org/r/983179

@herron I removed the Pyrra configs to see if anything changes, but IIRC we had similar gaps even before the Pyrra pilots, so I suspect something else is going on. I rechecked https://w.wiki/8QkW and I see more holes in the recording rules...

Change 983421 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::thanos: remove Pyrra recording rule for Istio

https://gerrit.wikimedia.org/r/983421

Change 983421 merged by Elukey:

[operations/puppet@production] profile::thanos: remove Pyrra recording rule for Istio

https://gerrit.wikimedia.org/r/983421

Summary: we removed the Pyrra config (related to Lift Wing) that caused some errors in the Thanos logs, to eliminate a variable and see if the gaps disappear. There is currently no strong lead on what may be causing them, so we'll have to keep investigating. I'll be away for a couple of months, so I won't be able to help more :(

Rechecking this: after the usual time window update for the quarter, I see that we don't have metrics from June 1st to approximately the 8th:

https://grafana.wikimedia.org/dashboard/snapshot/MEfeYAWsKvOk8rA5SY4wrfWY27d1j7s1?orgId=1

Not sure why; I'll keep checking whether the holes appear in other timeframes as well.

I also checked if the _v2 recording rules were making a difference, but so far it seems the issue is the same in both versions.

I tried to add more granularity to the Thanos Rule Grafana UI, to be able to drill down into each rule separately. This is the link for the istio-slo rule group.

Up to the 14th it seems that there were a lot of missed rule evaluations, and the rule group last duration was significantly reduced over the last month. In the dashboards related to istio SLOs (like the one linked in my previous comment) I do see a big hole of several days, from roughly the 3rd to the 14th of this month, but I can't find a clear clue in the Thanos dashboard metrics.
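For the record, the "rule evaluations missed" and "rule group last duration" panels presumably map to the standard rules-manager metrics exposed by thanos-rule; something along these lines should show them for the istio group specifically. A sketch; the rule_group label value is an assumption, so it is matched loosely:

# Missed iterations and last evaluation duration for rule groups whose name contains "istio".
THANOS=https://thanos.wikimedia.org
for METRIC in prometheus_rule_group_iterations_missed_total prometheus_rule_group_last_duration_seconds; do
  curl -sG "${THANOS}/api/v1/query" \
    --data-urlencode "query=${METRIC}{rule_group=~\".*istio.*\"}" | jq '.data.result'
done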

@herron something interesting:

Screenshot from 2024-06-21 15-18-24.png

Screenshot from 2024-06-21 15-18-53.png

Screenshot from 2024-06-21 15-18-37.png

I checked multiple recording rules, and the metrics gap is consistent across them, starting on June 6th at around 11:37 UTC. I checked various daemons on the titan nodes and nothing clearly pops up; there are some errors, but I don't find clear correlations.
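For reference, a sketch of the kind of check that can be run on the titan hosts to pull what thanos-rule logged around the start of the gap (times in UTC; the same can be repeated for the other prometheus/thanos units):

# Warnings and errors logged by thanos-rule in a two-hour window around the gap start.
sudo journalctl -u thanos-rule.service --utc \
  --since "2024-06-06 11:00" --until "2024-06-06 13:00" \
  | grep -iE 'level=(warn|error)'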

On the 6th, in the thanos-rule logs, I see a ton of errors while connecting to the Prometheus nodes, and around 14:00 UTC a reload was issued on both active titan nodes.

On titan1001 these are the reloads for thanos-rule:

Jun 06 13:57:03 titan1001 systemd[1]: Reloading thanos-rule.service - Thanos rule...
Jun 11 14:56:06 titan1001 systemd[1]: Reloading thanos-rule.service - Thanos rule...
Jun 12 09:26:39 titan1001 systemd[1]: Reloading thanos-rule.service - Thanos rule...
Jun 13 09:57:03 titan1001 systemd[1]: Reloading thanos-rule.service - Thanos rule...
Jun 13 21:27:11 titan1001 systemd[1]: Reloading thanos-rule.service - Thanos rule...
Jun 14 07:56:10 titan1001 systemd[1]: Reloading thanos-rule.service - Thanos rule...
Jun 17 11:56:38 titan1001 systemd[1]: Reloading thanos-rule.service - Thanos rule...
Jun 19 09:26:25 titan1001 systemd[1]: Reloading thanos-rule.service - Thanos rule...
Jun 21 13:56:23 titan1001 systemd[1]: Reloading thanos-rule.service - Thanos rule...

Multiple reloads happened, so I doubt that a single reload is sufficient to clear the issue (if the issue is an inconsistent state).

Having a look through the SAL for changes in that time frame, Jun 6th was also when all prom and thanos-be nodes were rebooted for updates. That should explain the flurry of errors in the Thanos logs, but I'm still not clear on the cause of a multi-day gap.

It occurred to me that deploying absent-metric alerts for the metrics where we're seeing gaps would be a reasonable next step. That would let us troubleshoot gaps in (or closer to) their broken state, which should help us better understand the issue and the steps to resolve it manually. Plus, it would of course help us respond faster and shrink the gaps. I'll work on a patch.
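To illustrate the idea (the actual rule will live in operations/alerts and go through CI/pint; the metric name, duration, and labels below are placeholders, not the final patch):

# Hypothetical absent-metric alert sketch, validated locally with promtool.
cat > /tmp/istio_sli_absent_sketch.yaml <<'EOF'
groups:
  - name: istio_sli_absent
    rules:
      - alert: IstioSLIMetricAbsent
        # Fires if the recording rule stops producing samples for an hour.
        expr: absent(istio_sli_availability_requests_total:increase5m)
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "istio SLI recording rule has stopped producing samples"
EOF
promtool check rules /tmp/istio_sli_absent_sketch.yaml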

Change #1052784 had a related patch set uploaded (by Herron; author: Herron):

[operations/alerts@master] istio_sli_avail: alert if metric goes absent

https://gerrit.wikimedia.org/r/1052784

Change #1052784 merged by jenkins-bot:

[operations/alerts@master] istio_sli_avail: alert if metric goes absent

https://gerrit.wikimedia.org/r/1052784

This has been causing alerts since deployment:

2024-07-16-115144_1044x909_scrot.png

And pint validation failures too, via the AlertLintProblem alert:

description: Pint reporter promql/series found problem(s) in /srv/alerts/ops/team-o11y_sli.yaml: prometheus "ops" at http://127.0.0.1:9900/ops didn't have any series for "istio_sli_availability_requests_total:increase5m" metric in the last 1w
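For reference, the same promql/series check can be reproduced locally with pint against the ops Prometheus instance; a sketch, assuming pint is installed and its config file points a prometheus block at the URL from the error above (the config path here is a placeholder):

# Lint the SLI alert file; the promql/series check queries the configured Prometheus
# (http://127.0.0.1:9900/ops in the error above) for the recorded metric.
pint --config /etc/pint.hcl lint /srv/alerts/ops/team-o11y_sli.yaml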