Prometheus doesn't reload or alert on expired client certificates
Open, High, Public

Description

After discovering a gap in the k8s apiserver metrics, @fgiunchedi and I investigated and found that new PKI certs had been deployed to Prometheus but never picked up. The expired certificates were still being used, so queries for metrics were answered with 401.

Smoking gun from kube-apiserver:

Aug 04 12:34:46 kubemaster1001 kube-apiserver[152161]: E0804 12:34:46.650786  152161 authentication.go:63] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time 2023-08-04T12:34:46Z is after 2023-08-02T08:44:00Z, verifying certificate SN=701251950718436174693962379298597088894617122879, SKID=5F:4D:28:59:E7:F3:A7:B3:9B:9F:F7:65:A0:44:C4:39:BE:A1:82:85, AKID=06:94:D5:26:9E:07:DF:85:0D:DF:92:AC:80:03:53:CC:88:A3:EC:49 failed: x509: certificate has expired or is not yet valid: current time 2023-08-04T12:34:46Z is after 2023-08-02T08:44:00Z]"

A simple reload didn't fix it, so a restart of both prometheus@k8s instances in eqiad was done.

12:32:26         godog │ !log bounce prometheus@k8s on prometheus100[56] to test failure to reload certs

Prometheus should restart on a new certificate deployment, or at least alert on unhealthy jobs caused by 401s.

Event Timeline

The certificates of the wikikube staging clusters have an expiry time of 3 days (and I've tested the hot reloading initially) so this works in general. Maybe some other configuration issue prevented prometheus from reloading when the certificate changed?

Yes, I think something went wrong with Prometheus and it couldn't reload the certs for whatever reason. In terms of alerting, I'm thinking of errors on service discovery on the Prometheus side, and certainly errors related to k8s service discovery.

This happened again on prometheus100[56]

/var/log/syslog.1
Aug 20 15:18:33 prometheus1006 puppet-agent[2698049]: (Cfssl::Cert[wikikube_staging__prometheus]) Scheduling refresh of Exec[prometheus@k8s-staging-reload]
Aug 20 15:18:33 prometheus1006 puppet-agent[2698049]: (/Stage[main]/Profile::Prometheus::K8s/Prometheus::Server[k8s-staging]/Exec[prometheus@k8s-staging-reload]) Triggered 'refresh' from 1 event

/var/log/prometheus/server.log.1
Aug 20 15:18:33 prometheus1006 prometheus@k8s-staging[1040]: level=info ts=2023-08-20T15:18:33.435Z caller=main.go:879 msg="Loading configuration file" filename=/srv/prometheus/k8s-staging/prometheus.yml
Aug 20 15:18:33 prometheus1006 prometheus@k8s-staging[1040]: level=info ts=2023-08-20T15:18:33.503Z caller=main.go:910 msg="Completed loading of configuration file" filename=/srv/prometheus/k8s-staging/prometheus.yml totalDuration=67.781527ms remote_storage=5.325µs web_handler=1.113µs query_engine=1.796µs scrape=19.67689ms scrape_sd=4.486601ms notify=15.939µs notify_sd=34.816µs rules=16.998914ms



/var/log/syslog.1
Aug 20 15:17:28 prometheus1005 puppet-agent[2889137]: (Cfssl::Cert[wikikube_staging__prometheus]) Scheduling refresh of Exec[prometheus@k8s-staging-reload]
Aug 20 15:17:28 prometheus1005 puppet-agent[2889137]: (/Stage[main]/Profile::Prometheus::K8s/Prometheus::Server[k8s-staging]/Exec[prometheus@k8s-staging-reload]) Triggered 'refresh' from 1 event

/var/log/prometheus/server.log.1 
Aug 20 15:17:28 prometheus1005 prometheus@k8s-staging[1046]: level=info ts=2023-08-20T15:17:28.221Z caller=main.go:879 msg="Loading configuration file" filename=/srv/prometheus/k8s-staging/prometheus.yml
Aug 20 15:17:28 prometheus1005 prometheus@k8s-staging[1046]: level=info ts=2023-08-20T15:17:28.254Z caller=main.go:910 msg="Completed loading of configuration file" filename=/srv/prometheus/k8s-staging/prometheus.yml totalDuration=32.465262ms remote_storage=4.208µs web_handler=2.386µs query_engine=2.047µs scrape=3.352087ms scrape_sd=5.627388ms notify=13.886µs notify_sd=20.155µs rules=15.314358ms

Mentioned in SAL (#wikimedia-operations) [2023-08-21T09:51:11Z] <jayme> restarted prometheus@k8s on prometheus100[56] - T343529

Sigh, sorry this fell off my radar. I'll implement alerting first so at least we have notifications

Change 951526 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: add bandaid alert for prometheus not reloading its k8s certs

https://gerrit.wikimedia.org/r/951526

Change 951526 merged by Filippo Giunchedi:

[operations/alerts@master] sre: add bandaid alert for prometheus not reloading its k8s certs

https://gerrit.wikimedia.org/r/951526

Change 952301 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: move KubernetesAPINotScrapable to k8s-specific alerts

https://gerrit.wikimedia.org/r/952301

Change 952301 merged by Filippo Giunchedi:

[operations/alerts@master] sre: move KubernetesAPINotScrapable to k8s-specific alerts

https://gerrit.wikimedia.org/r/952301

Mentioned in SAL (#wikimedia-operations) [2023-10-09T11:51:26Z] <godog> restart k8s-aux in eqiad to pick up new certs - T343529

Mentioned in SAL (#wikimedia-operations) [2023-10-26T08:02:52Z] <godog> restart prometheus k8s k8s-aux - T343529

Mentioned in SAL (#wikimedia-operations) [2023-11-13T08:55:46Z] <godog> bounce prometheus eqiad for k8s / k8s-aux - T343529

Since this issue keeps recurring, we'll have to upgrade Prometheus (something we need to do anyway at this point).

I gave a quick try at building unstable's prometheus on bullseye (what prometheus hosts run) and it isn't straightforward (due to the dependencies that would need to be backported too). Building for Bookworm seems more straightforward, though we'll also need to upgrade Prometheus hosts to Bookworm (in place) first

Mentioned in SAL (#wikimedia-operations) [2023-11-27T08:41:33Z] <godog> restart prometheus/k8s-staging in eqiad - T343529

Mentioned in SAL (#wikimedia-operations) [2024-01-02T08:27:23Z] <jayme> restart prometheus@k8s prometheus@k8s-aux in eqiad - T343529

Prometheus was upgraded as part of T354399: Prometheus @ k8s OOM loop so this task will need monitoring for reoccurrence (hopefully fixed though)

Despite the upgrade, this just happened again on k8s / k8s-aux in eqiad, so more investigation is needed

Mentioned in SAL (#wikimedia-operations) [2024-02-05T14:28:57Z] <godog> bounce prometheus@k8s and @k8s-aux in eqiad - T343529

And again, I just bumped eqiad prometheus@k8s-aux.

Mentioned in SAL (#wikimedia-operations) [2024-02-22T09:03:39Z] <jayme> restart prometheus@k8s in eqiad - T343529

Mentioned in SAL (#wikimedia-operations) [2024-03-07T16:06:31Z] <claime> bouncing prometheus@k8s.service - T343529

Mentioned in SAL (#wikimedia-operations) [2024-03-07T16:29:52Z] <cdanis> T343529 ✔ cdanis@prometheus2005.codfw.wmnet ~ 🕦☕sudo systemctl restart thanos-sidecar@k8s.service

Mentioned in SAL (#wikimedia-operations) [2024-03-27T13:59:25Z] <godog> bounce prometheus@k8s-aux in eqiad - T343529

Mentioned in SAL (#wikimedia-operations) [2024-04-15T07:48:54Z] <jayme> restarting k8s-mlstaging and k8s-staging prometheus instances - T343529

Mentioned in SAL (#wikimedia-operations) [2024-04-15T10:31:44Z] <godog> bounce prometheus@k8s-staging in eqiad - T343529

Mentioned in SAL (#wikimedia-operations) [2024-04-30T08:08:28Z] <godog> bounce prometheus@k8s in eqiad - T343529

Change #1025682 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: use longer-expiration pki client certs for k8s

https://gerrit.wikimedia.org/r/1025682

I have spent some time investigating this issue and I believe this is a case of https://github.com/prometheus/common/issues/598 . Specifically, Prometheus does reload certs from disk, but they are used only for new connections, not for existing ones! Existing connections are recycled only if they have been idle for > 5 minutes; if that doesn't happen, they keep using the existing (possibly expired) certificates.

I have verified this is the case by forcibly resetting existing connections to k8s-aux via iptables, and verified that new connections do indeed present the renewed certs.
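
To make the failure mode above easier to follow, here is a toy Python model of it. This is only a sketch, not Prometheus code: `ScrapeClient`, `simulate`, the "old"/"new" certificate labels and the scrape intervals are all made up for illustration. The only detail taken from the investigation is the 5 minute idle timeout. It assumes the certificate is read from disk only when a connection is opened.

```python
from dataclasses import dataclass
from typing import List, Optional

IDLE_TIMEOUT_S = 300  # "idle for > 5 minutes", as described above


@dataclass
class Connection:
    cert: str         # certificate presented when the connection was opened
    last_used: float  # time of the last request, in seconds


class ScrapeClient:
    """Hypothetical client that keeps one connection to its target alive."""

    def __init__(self, cert_on_disk: str) -> None:
        self.cert_on_disk = cert_on_disk
        self.conn: Optional[Connection] = None

    def rotate_cert(self, new_cert: str) -> None:
        # Renewing the file on disk does not touch connections already open.
        self.cert_on_disk = new_cert

    def reset_connections(self) -> None:
        # Stand-in for a restart, or for forcibly resetting connections.
        self.conn = None

    def scrape(self, now: float) -> str:
        """Return the certificate the target sees for a scrape at `now`."""
        conn = self.conn
        if conn is None or now - conn.last_used > IDLE_TIMEOUT_S:
            # New connection: the current on-disk certificate is presented.
            conn = Connection(cert=self.cert_on_disk, last_used=now)
            self.conn = conn
        conn.last_used = now
        return conn.cert


def simulate(interval_s: float, rotate_at: float, until: float) -> List[str]:
    """Scrape every `interval_s` seconds, renewing the cert at `rotate_at`."""
    client = ScrapeClient("old")
    seen: List[str] = []
    now = 0.0
    rotated = False
    while now <= until:
        if not rotated and now >= rotate_at:
            client.rotate_cert("new")
            rotated = True
        seen.append(client.scrape(now))
        now += interval_s
    return seen
```

In this model, `simulate(60, 120, 600)` keeps returning "old" for every scrape because the connection is never idle long enough to be recycled, while `simulate(600, 120, 1200)` switches to "new" on the first scrape after the renewal. Calling `reset_connections()` plays the role of the restarts logged in this task.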

Change #1025682 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: use longer-expiration pki client certs for k8s

https://gerrit.wikimedia.org/r/1025682

Mentioned in SAL (#wikimedia-operations) [2024-05-20T10:18:44Z] <godog> bounce prometheus@k8s in eqiad - T343529

FTR, this didn't work, in the sense that it introduced a change/renewal at every puppet run. I've reverted the patch here https://gerrit.wikimedia.org/r/c/operations/puppet/+/1030019 though I didn't have much time to investigate further

Change #1034048 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] pki: add temporary profile for prometheus + k8s

https://gerrit.wikimedia.org/r/1034048

Change #1034050 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: use 'prometheus' profile for k8s certs

https://gerrit.wikimedia.org/r/1034050

Change #1034048 merged by Filippo Giunchedi:

[operations/puppet@production] pki: add temporary profile for prometheus + k8s

https://gerrit.wikimedia.org/r/1034048

Change #1034050 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: use 'prometheus' profile for k8s certs

https://gerrit.wikimedia.org/r/1034050

The change is deployed. It's not a permanent fix, though at least the ongoing toil is reduced now.