Prometheus doesn't reload or alert on expired client certificates
Open, High, Public

Assigned To

Authored By

	Clement_Goubert
	Aug 4 2023, 12:44 PM

Description

After discovering a gap in k8s apiserver metrics, @fgiunchedi and I investigated and found that new PKI certs had been deployed to Prometheus but never picked up. The expired certificates were still in use, so metrics queries were answered with 401s.

Smoking gun from kube-apiserver:

Aug 04 12:34:46 kubemaster1001 kube-apiserver[152161]: E0804 12:34:46.650786  152161 authentication.go:63] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time 2023-08-04T12:34:46Z is after 2023-08-02T08:44:00Z, verifying certificate SN=701251950718436174693962379298597088894617122879, SKID=5F:4D:28:59:E7:F3:A7:B3:9B:9F:F7:65:A0:44:C4:39:BE:A1:82:85, AKID=06:94:D5:26:9E:07:DF:85:0D:DF:92:AC:80:03:53:CC:88:A3:EC:49 failed: x509: certificate has expired or is not yet valid: current time 2023-08-04T12:34:46Z is after 2023-08-02T08:44:00Z]"
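For reference, a minimal Python sketch of how long the certificate had been expired according to that log line. The regex and the function name `expired_for` are illustrative assumptions, not part of any existing tooling; the pattern only covers the message wording shown above.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Matches the expiry wording seen in the kube-apiserver log line above.
EXPIRY_RE = re.compile(
    r"current time (?P<now>\S+) is after (?P<not_after>[0-9T:\-]+Z)"
)
TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def expired_for(log_line: str) -> Optional[timedelta]:
    """Return how long the cert had been expired, or None if the line doesn't match."""
    match = EXPIRY_RE.search(log_line)
    if match is None:
        return None
    now = datetime.strptime(match.group("now"), TS_FORMAT)
    not_after = datetime.strptime(match.group("not_after"), TS_FORMAT)
    return now - not_after
```

Applied to the line above, this gives a bit over two days between the certificate's expiry and the failed request.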

A simple reload didn't fix it, so a restart of both prometheus@k8s instances in eqiad was done.

12:32:26         godog │ !log bounce prometheus@k8s on prometheus100[56] to test failure to reload certs

Prometheus should restart on a new certificate deployment, or at least alert on unhealthy jobs caused by 401s.
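As a rough illustration of the "alert on 401s" idea, here is a Python sketch that filters the JSON returned by Prometheus' `/api/v1/targets` endpoint for targets that are down with a 401. The function name is hypothetical, and matching on the substring "401" in `lastError` is an assumption about the error wording, not something confirmed in this task.

```python
from typing import Any, Dict, List, Tuple


def targets_failing_auth(targets_payload: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (scrapePool, scrapeUrl) for active targets that are down with a 401."""
    failing = []
    active = targets_payload.get("data", {}).get("activeTargets", [])
    for target in active:
        if target.get("health") != "down":
            continue
        # Assumption: the scrape error text contains the HTTP status code.
        if "401" in target.get("lastError", ""):
            failing.append(
                (target.get("scrapePool", ""), target.get("scrapeUrl", ""))
            )
    return failing
```

The payload would come from an HTTP GET against the relevant Prometheus instance; the host, port and path prefix are deployment-specific and not shown here.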

Details

Subject	Repo	Branch	Lines +/-
prometheus: use 'prometheus' profile for k8s certs	operations/puppet	production	+3 -1
pki: add temporary profile for prometheus + k8s	operations/puppet	production	+48 -0
prometheus: use longer-expiration pki client certs for k8s	operations/puppet	production	+2 -1
sre: move KubernetesAPINotScrapable to k8s-specific alerts	operations/alerts	master	+44 -35
sre: add bandaid alert for prometheus not reloading its k8s certs	operations/alerts	master	+35 -0


Related Objects

Mentioned In: T354399: Prometheus @ k8s OOM loop
Mentioned Here: T354399: Prometheus @ k8s OOM loop

Event Timeline

Clement_Goubert created this task. Aug 4 2023, 12:44 PM

Restricted Application added a subscriber: Aklapper. Aug 4 2023, 12:44 PM

Clement_Goubert updated the task description. Aug 4 2023, 12:47 PM

JMeybohm subscribed. Aug 7 2023, 5:20 PM

The certificates of the wikikube staging clusters have an expiry time of 3 days (and I've tested the hot reloading initially) so this works in general. Maybe some other configuration issue prevented prometheus from reloading when the certificate changed?

JMeybohm added a project: Kubernetes. Aug 7 2023, 5:31 PM

Yes, I think something went wrong with Prometheus and it couldn't reload the certs for whatever reason. In terms of alerting, I'm thinking of errors on service discovery on the Prometheus side, and certainly errors related to k8s service discovery.

fgiunchedi added a project: User-fgiunchedi. Aug 15 2023, 1:40 PM

lmata triaged this task as High priority. Aug 16 2023, 2:42 PM

lmata edited projects, added Observability-Metrics, SRE Observability (FY2023/2024-Q1); removed observability.

lmata moved this task from Inbox to Up next on the SRE Observability (FY2023/2024-Q1) board.

This happened again on prometheus100[56]

/var/log/syslog.1
Aug 20 15:18:33 prometheus1006 puppet-agent[2698049]: (Cfssl::Cert[wikikube_staging__prometheus]) Scheduling refresh of Exec[prometheus@k8s-staging-reload]
Aug 20 15:18:33 prometheus1006 puppet-agent[2698049]: (/Stage[main]/Profile::Prometheus::K8s/Prometheus::Server[k8s-staging]/Exec[prometheus@k8s-staging-reload]) Triggered 'refresh' from 1 event

/var/log/prometheus/server.log.1
Aug 20 15:18:33 prometheus1006 prometheus@k8s-staging[1040]: level=info ts=2023-08-20T15:18:33.435Z caller=main.go:879 msg="Loading configuration file" filename=/srv/prometheus/k8s-staging/prometheus.yml
Aug 20 15:18:33 prometheus1006 prometheus@k8s-staging[1040]: level=info ts=2023-08-20T15:18:33.503Z caller=main.go:910 msg="Completed loading of configuration file" filename=/srv/prometheus/k8s-staging/prometheus.yml totalDuration=67.781527ms remote_storage=5.325µs web_handler=1.113µs query_engine=1.796µs scrape=19.67689ms scrape_sd=4.486601ms notify=15.939µs notify_sd=34.816µs rules=16.998914ms



/var/log/syslog.1
Aug 20 15:17:28 prometheus1005 puppet-agent[2889137]: (Cfssl::Cert[wikikube_staging__prometheus]) Scheduling refresh of Exec[prometheus@k8s-staging-reload]
Aug 20 15:17:28 prometheus1005 puppet-agent[2889137]: (/Stage[main]/Profile::Prometheus::K8s/Prometheus::Server[k8s-staging]/Exec[prometheus@k8s-staging-reload]) Triggered 'refresh' from 1 event

/var/log/prometheus/server.log.1 
Aug 20 15:17:28 prometheus1005 prometheus@k8s-staging[1046]: level=info ts=2023-08-20T15:17:28.221Z caller=main.go:879 msg="Loading configuration file" filename=/srv/prometheus/k8s-staging/prometheus.yml
Aug 20 15:17:28 prometheus1005 prometheus@k8s-staging[1046]: level=info ts=2023-08-20T15:17:28.254Z caller=main.go:910 msg="Completed loading of configuration file" filename=/srv/prometheus/k8s-staging/prometheus.yml totalDuration=32.465262ms remote_storage=4.208µs web_handler=2.386µs query_engine=2.047µs scrape=3.352087ms scrape_sd=5.627388ms notify=13.886µs notify_sd=20.155µs rules=15.314358ms

Mentioned in SAL (#wikimedia-operations) [2023-08-21T09:51:11Z] <jayme> restarted prometheus@k8s on prometheus100[56] - T343529

Sigh, sorry this fell off my radar. I'll implement alerting first so at least we have notifications

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. Aug 22 2023, 12:40 PM

Change 951526 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: add bandaid alert for prometheus not reloading its k8s certs

https://gerrit.wikimedia.org/r/951526

gerritbot added a project: Patch-For-Review. Aug 22 2023, 3:17 PM

Change 951526 merged by Filippo Giunchedi:

[operations/alerts@master] sre: add bandaid alert for prometheus not reloading its k8s certs

https://gerrit.wikimedia.org/r/951526

Maintenance_bot removed a project: Patch-For-Review. Aug 24 2023, 10:30 AM

Change 952301 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: move KubernetesAPINotScrapable to k8s-specific alerts

https://gerrit.wikimedia.org/r/952301

gerritbot added a project: Patch-For-Review. Aug 25 2023, 7:01 AM

Change 952301 merged by Filippo Giunchedi:

[operations/alerts@master] sre: move KubernetesAPINotScrapable to k8s-specific alerts

https://gerrit.wikimedia.org/r/952301

Maintenance_bot removed a project: Patch-For-Review. Aug 25 2023, 10:10 AM

JMeybohm added a project: Prod-Kubernetes. Sep 22 2023, 9:00 AM

Mentioned in SAL (#wikimedia-operations) [2023-10-09T11:51:26Z] <godog> restart k8s-aux in eqiad to pick up new certs - T343529

lmata edited projects, added SRE Observability (FY2023/2024-Q2); removed SRE Observability (FY2023/2024-Q1). Oct 9 2023, 4:28 PM

lmata moved this task from Inbox to Up next on the SRE Observability (FY2023/2024-Q2) board.

Mentioned in SAL (#wikimedia-operations) [2023-10-26T08:02:52Z] <godog> restart prometheus k8s k8s-aux - T343529

Mentioned in SAL (#wikimedia-operations) [2023-11-13T08:55:46Z] <godog> bounce prometheus eqiad for k8s / k8s-aux - T343529

Since this issue keeps reoccurring, we'll have to upgrade Prometheus (something we need to do anyway at this point).

I gave a quick try at building unstable's prometheus on bullseye (which is what the Prometheus hosts run) and it isn't straightforward, due to the dependencies that would need to be backported too. Building for Bookworm seems more straightforward, though we'll also need to upgrade the Prometheus hosts to Bookworm (in place) first.

Mentioned in SAL (#wikimedia-operations) [2023-11-27T08:41:33Z] <godog> restart prometheus/k8s-staging in eqiad - T343529

lmata edited projects, added SRE Observability (FY2023/2024-Q3); removed SRE Observability (FY2023/2024-Q2). Dec 6 2023, 3:24 PM

Mentioned in SAL (#wikimedia-operations) [2024-01-02T08:27:23Z] <jayme> restart prometheus@k8s prometheus@k8s-aux in eqiad - T343529

Prometheus was upgraded as part of T354399: Prometheus @ k8s OOM loop so this task will need monitoring for reoccurrence (hopefully fixed though)

fgiunchedi moved this task from Doing to Radar on the User-fgiunchedi board. Jan 22 2024, 9:19 AM

Despite the upgrade, this just happened again on k8s / k8s-aux in eqiad, so more investigation is needed

fgiunchedi moved this task from Radar to Up next on the User-fgiunchedi board. Feb 5 2024, 2:27 PM

Mentioned in SAL (#wikimedia-operations) [2024-02-05T14:28:57Z] <godog> bounce prometheus@k8s and @k8s-aux in eqiad - T343529

And again, I just bumped eqiad prometheus@k8s-aux.

Mentioned in SAL (#wikimedia-operations) [2024-02-22T09:03:39Z] <jayme> restart prometheus@k8s in eqiad - T343529

CDanis subscribed. Mar 7 2024, 4:04 PM

Mentioned in SAL (#wikimedia-operations) [2024-03-07T16:06:31Z] <claime> bouncing prometheus@k8s.service - T343529

Mentioned in SAL (#wikimedia-operations) [2024-03-07T16:29:52Z] <cdanis> T343529 ✔ cdanis@prometheus2005.codfw.wmnet ~ 🕦☕sudo systemctl restart thanos-sidecar@k8s.service

colewhite mentioned this in T354399: Prometheus @ k8s OOM loop. Mar 7 2024, 5:47 PM

Mentioned in SAL (#wikimedia-operations) [2024-03-11T08:29:46Z] <godog> bounce prometheus@aux-k8s - T343529

fgiunchedi edited projects, added SRE Observability (FY2023/2024-Q4); removed SRE Observability (FY2023/2024-Q3). Mar 26 2024, 2:57 PM

Mentioned in SAL (#wikimedia-operations) [2024-03-27T13:59:25Z] <godog> bounce prometheus@k8s-aux in eqiad - T343529

Mentioned in SAL (#wikimedia-operations) [2024-04-15T07:48:54Z] <jayme> restarting k8s-mlstaging and k8s-staging prometheus instances - T343529

Mentioned in SAL (#wikimedia-operations) [2024-04-15T10:31:44Z] <godog> bounce prometheus@k8s-staging in eqiad - T343529

Mentioned in SAL (#wikimedia-operations) [2024-04-30T08:08:28Z] <godog> bounce prometheus@k8s in eqiad - T343529

Change #1025682 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: use longer-expiration pki client certs for k8s

https://gerrit.wikimedia.org/r/1025682

gerritbot added a project: Patch-For-Review. Apr 30 2024, 8:29 AM

I have spent some time investigating this issue and I believe this is a case of https://github.com/prometheus/common/issues/598 . Specifically, prometheus does reload certs from disk, but they are only used for new connections, not existing ones! Existing connections are recycled if they are idle for > 5 minutes. If that doesn't happen, they keep using the existing (possibly expired) certificates.

I have verified this is the case by forcibly resetting existing connections to k8s-aux via iptables, and verified that new connections do indeed present the renewed certs.
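To make the failure mode concrete, here is a toy Python model of the behaviour described above. It is not Prometheus code: the `Client` and `Connection` classes, the 60-second scrape interval, and the "old"/"new" cert labels are all illustrative assumptions. The only thing taken from the investigation is that the certificate is picked up on a new connection, and that connections are recycled only after being idle for more than 5 minutes.

```python
from dataclasses import dataclass
from typing import Optional

IDLE_TIMEOUT_S = 300  # the "> 5 minutes" idle recycling described above


@dataclass
class Connection:
    cert: str         # client certificate presented when the connection was set up
    last_used: float  # time of the most recent request, in seconds


class Client:
    """Toy keep-alive client: the cert on 'disk' is read only for a new connection."""

    def __init__(self) -> None:
        self.cert_on_disk = "old"
        self.conn: Optional[Connection] = None

    def request(self, now: float) -> str:
        """Send a request at time `now` and return the cert the server sees."""
        idle_too_long = (
            self.conn is not None
            and now - self.conn.last_used > IDLE_TIMEOUT_S
        )
        if self.conn is None or idle_too_long:
            # New connection: whatever is on disk right now gets presented.
            self.conn = Connection(cert=self.cert_on_disk, last_used=now)
        self.conn.last_used = now
        return self.conn.cert

    def reset_connections(self) -> None:
        """Drop the connection, as a restart or a forced reset would."""
        self.conn = None
```

In this model, a client that sends a request every 60 seconds never hits the idle timeout, so it keeps presenting the old cert after `cert_on_disk` changes, until `reset_connections()` is called. That matches the observation that a reload alone did not help while a restart or a forced connection reset did.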

Change #1025682 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: use longer-expiration pki client certs for k8s

https://gerrit.wikimedia.org/r/1025682

Maintenance_bot removed a project: Patch-For-Review. May 7 2024, 10:30 AM

Mentioned in SAL (#wikimedia-operations) [2024-05-20T10:18:44Z] <godog> bounce prometheus@k8s in eqiad - T343529

In T343529#9776787, @gerritbot wrote:

Change #1025682 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: use longer-expiration pki client certs for k8s

https://gerrit.wikimedia.org/r/1025682

FTR, this didn't work, in the sense that it introduced a change/renewal at every puppet run. I've reverted the patch here https://gerrit.wikimedia.org/r/c/operations/puppet/+/1030019 though I didn't have very much time to investigate further.

Change #1034048 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] pki: add temporary profile for prometheus + k8s

https://gerrit.wikimedia.org/r/1034048

gerritbot added a project: Patch-For-Review. May 20 2024, 10:37 AM

Change #1034050 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: use 'prometheus' profile for k8s certs

https://gerrit.wikimedia.org/r/1034050

fgiunchedi moved this task from Up next to Doing on the User-fgiunchedi board. May 20 2024, 12:18 PM

Change #1034048 merged by Filippo Giunchedi:

[operations/puppet@production] pki: add temporary profile for prometheus + k8s

https://gerrit.wikimedia.org/r/1034048

Change #1034050 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: use 'prometheus' profile for k8s certs

https://gerrit.wikimedia.org/r/1034050

Maintenance_bot removed a project: Patch-For-Review. May 23 2024, 9:30 AM

The change is deployed. It's not a permanent fix, though at least the ongoing toil is reduced now.

fgiunchedi edited projects, added SRE Observability (FY2024/2025-Q1); removed SRE Observability (FY2023/2024-Q4). Jul 3 2024, 8:10 AM

fgiunchedi removed projects: SRE Observability (FY2024/2025-Q1), User-fgiunchedi. Jul 9 2024, 12:00 PM
