
statograph_post errors with out of range float values since 2024-07-16
Closed, ResolvedPublic

Description

Starting on 2024-07-16, coinciding with a benthos issue, statograph_post has failed to complete:

Jul 18 06:41:00 alert1001 systemd[1]: Starting statograph_post.service - Runs statograph to publish data to statuspage.io...
Jul 18 06:41:02 alert1001 statograph[971144]: ERROR:statograph.__main__:Exception encountered when processing lyfcttm2lhw4; continuing
Jul 18 06:41:02 alert1001 statograph[971144]: Traceback (most recent call last):
Jul 18 06:41:02 alert1001 statograph[971144]:   File "/usr/lib/python3/dist-packages/requests/models.py", line 511, in prepare_body
Jul 18 06:41:02 alert1001 statograph[971144]:     body = complexjson.dumps(json, allow_nan=False)
Jul 18 06:41:02 alert1001 statograph[971144]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jul 18 06:41:02 alert1001 statograph[971144]:   File "/usr/lib/python3.11/json/__init__.py", line 238, in dumps
Jul 18 06:41:02 alert1001 statograph[971144]:     **kw).encode(obj)
Jul 18 06:41:02 alert1001 statograph[971144]:           ^^^^^^^^^^^
Jul 18 06:41:02 alert1001 statograph[971144]:   File "/usr/lib/python3.11/json/encoder.py", line 200, in encode
Jul 18 06:41:02 alert1001 statograph[971144]:     chunks = self.iterencode(o, _one_shot=True)
Jul 18 06:41:02 alert1001 statograph[971144]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jul 18 06:41:02 alert1001 statograph[971144]:   File "/usr/lib/python3.11/json/encoder.py", line 258, in iterencode
Jul 18 06:41:02 alert1001 statograph[971144]:     return _iterencode(o, 0)
Jul 18 06:41:02 alert1001 statograph[971144]:            ^^^^^^^^^^^^^^^^^
Jul 18 06:41:02 alert1001 statograph[971144]: ValueError: Out of range float values are not JSON compliant
Jul 18 06:41:02 alert1001 statograph[971144]: During handling of the above exception, another exception occurred:
Jul 18 06:41:02 alert1001 statograph[971144]: Traceback (most recent call last):
Jul 18 06:41:02 alert1001 statograph[971144]:   File "/usr/lib/python3/dist-packages/statograph/__main__.py", line 115, in main
Jul 18 06:41:02 alert1001 statograph[971144]:     uploader.upload_one_batch(datasource, configured_metrics[metric_id]['query'],
Jul 18 06:41:02 alert1001 statograph[971144]:   File "/usr/lib/python3/dist-packages/statograph/uploader.py", line 86, in upload_one_batch
Jul 18 06:41:02 alert1001 statograph[971144]:     statuspage.add_datapoints({self.metric.id: data})
Jul 18 06:41:02 alert1001 statograph[971144]:   File "/usr/lib/python3/dist-packages/statograph/statuspage.py", line 119, in add_datapoints
Jul 18 06:41:02 alert1001 statograph[971144]:     return self._request('POST', self._url("metrics/data"), json=body)
Jul 18 06:41:02 alert1001 statograph[971144]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jul 18 06:41:02 alert1001 statograph[971144]:   File "/usr/lib/python3/dist-packages/statograph/statuspage.py", line 70, in _request
Jul 18 06:41:02 alert1001 statograph[971144]:     resp = self._session.request(method, url, timeout=self._timeout_secs, **kwargs)
Jul 18 06:41:02 alert1001 statograph[971144]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jul 18 06:41:02 alert1001 statograph[971144]:   File "/usr/lib/python3/dist-packages/requests/sessions.py", line 573, in request
Jul 18 06:41:02 alert1001 statograph[971144]:     prep = self.prepare_request(req)
Jul 18 06:41:02 alert1001 statograph[971144]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
Jul 18 06:41:02 alert1001 statograph[971144]:   File "/usr/lib/python3/dist-packages/requests/sessions.py", line 484, in prepare_request
Jul 18 06:41:02 alert1001 statograph[971144]:     p.prepare(
Jul 18 06:41:02 alert1001 statograph[971144]:   File "/usr/lib/python3/dist-packages/requests/models.py", line 371, in prepare
Jul 18 06:41:02 alert1001 statograph[971144]:     self.prepare_body(data, files, json)
Jul 18 06:41:02 alert1001 statograph[971144]:   File "/usr/lib/python3/dist-packages/requests/models.py", line 513, in prepare_body
Jul 18 06:41:02 alert1001 statograph[971144]:     raise InvalidJSONError(ve, request=self)
Jul 18 06:41:02 alert1001 statograph[971144]: requests.exceptions.InvalidJSONError: Out of range float values are not JSON compliant

Even though the benthos issue was resolved (cf. https://phabricator.wikimedia.org/T369256#9987295), statograph continued to fail. From the IRC backlog I'm assuming this is because invalid floats (likely NaN) are still present in the results from Prometheus. cc @CDanis
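The error is reproducible outside statograph: requests serializes `json=` payloads with `allow_nan=False`, so a single NaN datapoint aborts serialization before any request is sent. A minimal sketch:

```python
import json

# Minimal reproduction (outside statograph): requests calls
# json.dumps(..., allow_nan=False) when preparing a json= body,
# so one NaN value in the payload raises immediately.
payload = {"data": [{"timestamp": 1721143020, "value": float("nan")}]}

try:
    json.dumps(payload, allow_nan=False)
except ValueError as exc:
    print(exc)  # Out of range float values are not JSON compliant
```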

Event Timeline

I've confirmed that the data to be posted contains quite a few NaN values:

{'timestamp': 1721142960, 'value': 0.23723366}, {'timestamp': 1721143020, 'value': nan}, {'timestamp': 1721143080, 'value': nan}, ...

They should probably be excluded either by the Prometheus query or by the software itself.
I'm sending a tentative patch.
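The filtering itself is small; a sketch of the approach (hypothetical helper name and data shape, not the actual patch):

```python
import math

def drop_non_finite(datapoints):
    """Hypothetical helper (not the actual patch): keep only datapoints
    whose value is a finite float. math.isfinite() rejects NaN as well
    as +/-inf, all of which json.dumps(allow_nan=False) refuses."""
    return [dp for dp in datapoints if math.isfinite(dp["value"])]

points = [
    {"timestamp": 1721142960, "value": 0.23723366},
    {"timestamp": 1721143020, "value": float("nan")},
]
print(drop_non_finite(points))  # only the finite datapoint survives
```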

Change #1055875 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/statograph@master] Filter out NaN data from Prometheus

https://gerrit.wikimedia.org/r/1055875

I believe this is responsible for "Wiki response time" not showing up on https://www.wikimediastatus.net/ since 16 July (mentioning here for benefit of search and discovery within Phab).

bd808 renamed this task from statograph_post errors with out of range float values to statograph_post errors with out of range float values since 2024-07-16.Jul 22 2024, 8:34 PM
bd808 updated the task description.

I've hot-tested the above patch and it doesn't solve the problem, because all the data is seen as NaN.
A quick check of the configuration file shows:

graphite: https://graphite.wikimedia.org/
query: MediaWiki.timing.editResponseTime.sample_rate

And a quick check on graphite confirms that we do have that metric:
https://graphite.wikimedia.org/?width=586&height=308&target=MediaWiki.timing.editResponseTime.sample_rate

But apparently statograph is not getting the data; this needs further investigation. Of note: this is the only query that goes to graphite, all the other data uses different datasources.

In the meantime, while hot-testing, I've unblocked statograph from uploading all the other data, and it now runs fine without erroring out. But we're still missing data for the response time.

Ok I spoke too soon :D Status page indexing is slower than I expected.

Actually, with the hot-fix applied, once I had backfilled all the data it started working fine on its own (without the hotfix).

All data has been backfilled and the graph looks fully populated.

@CDanis when you're back I think we should merge and deploy the patch anyway, to prevent statograph from failing entirely if one metric has no data. I'll leave the mypy CI issues, unrelated to this change, to you.

Change #1055875 merged by jenkins-bot:

[operations/software/statograph@master] Filter out NaN data from Prometheus

https://gerrit.wikimedia.org/r/1055875

Thanks for handling this while I was out!

@CDanis when you're back I think we should merge and deploy the patch anyway, to prevent statograph from failing entirely if one metric has no data.

Agreed it's a good patch, +2'd.

But unless I'm misunderstanding, we already prevent statograph from failing entirely if one metric has no data? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/statograph/+/master/statograph/__main__.py#129

I'll leave the mypy CI issues, unrelated to this change, to you.

yep done and merged :)

Thanks for handling this while I was out!

No prob :)

@CDanis when you're back I think we should merge and deploy the patch anyway, to prevent statograph from failing entirely if one metric has no data.

Agreed it's a good patch, +2'd.

Thanks

But unless I'm misunderstanding, we already prevent statograph from failing entirely if one metric has no data? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/statograph/+/master/statograph/__main__.py#129

You're right, it was skipping that metric; what it wasn't able to do was recover on its own once new data was there. A single NaN in the dataset to be filled or backfilled would prevent the metric from working, and statograph was unable to get unstuck even if the following datapoints contained normal data.

I'll leave the mypy CI issues, unrelated to this change, to you.

yep done and merged :)

Great!

CDanis assigned this task to Volans.

You're right, it was skipping that metric; what it wasn't able to do was recover on its own once new data was there. A single NaN in the dataset to be filled or backfilled would prevent the metric from working, and statograph was unable to get unstuck even if the following datapoints contained normal data.

Yeah this is unfortunately as designed -- I don't think you can pick both "be completely stateless" and also "never leave gaps", while also picking "have a simple implementation".
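A simplified illustration of that tradeoff (assumed behavior, not statograph's actual code): a stateless uploader derives its next window from the last datapoint the API accepted, so a batch rejected for a NaN is re-fetched and re-rejected on every run.

```python
import math

def upload_window(fetch, last_accepted_ts, now):
    """Sketch of the stuck-metric failure mode (assumed behavior, not
    statograph's actual code). The window starts at the last accepted
    datapoint; if it contains a NaN the whole batch is rejected, the
    cursor never advances, and every later run retries the same window,
    even after good data resumes."""
    points = fetch(last_accepted_ts, now)
    if any(math.isnan(p["value"]) for p in points):
        raise ValueError("Out of range float values are not JSON compliant")
    return points[-1]["timestamp"]  # cursor advances only on success
```

Filtering non-finite values before the upload lets the cursor move past the bad region instead of retrying it forever.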

A previous issue like this is also why the --start-timestamp flag/kludge exists (T298619: "User-reported connectivity errors" (NEL data) not being posted to statuspage since 1 Jan 00:00 UTC).

I think it's been roughly once every two years that we've had a weird issue with bad data causing the graphs to get 'stuck', and we do have monitoring for this situation, so I think this is the right tradeoff for now.