Configure availability and health monitoring for the cephosd cluster
Open, High, Public

Assigned To

Authored By

	BTullis
	Jul 9 2024, 12:07 AM

Description

Currently we have only basic host-level monitoring for the Data-Platform team's Ceph cluster.

We will receive alerts if systemd services fail, which is helpful, but we should also configure alerts based on the health check metrics exposed by the prometheus module of the Ceph mgr daemon.

These include:

OSD flags being set, such as noout
PGs (placement groups) being degraded
OSDs being marked as down

We should receive appropriate alerts based on these conditions.
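
As a rough illustration of the three conditions above, here is a minimal Python sketch that scans Prometheus text-format output and reports each one. The metric names (ceph_osd_flag_noout, ceph_pg_degraded, ceph_osd_up) and the label names are assumptions based on what the mgr prometheus module typically exports, and the sample data is invented, so both should be checked against the real /metrics output of the cephosd cluster. The production alerts would presumably be written as Prometheus alerting rules rather than a script like this; the functions parse_metrics and find_problems are hypothetical helpers used only to show the logic.

```python
import re

# Invented sample of mgr prometheus module output. The metric and label
# names are assumptions; verify them against the cluster's /metrics endpoint.
SAMPLE_METRICS = """\
# HELP ceph_osd_flag_noout OSD Flag noout
ceph_osd_flag_noout 1.0
ceph_pg_degraded{pool_id="1"} 0.0
ceph_pg_degraded{pool_id="2"} 3.0
ceph_osd_up{ceph_daemon="osd.0"} 1.0
ceph_osd_up{ceph_daemon="osd.1"} 0.0
"""

# Matches "name value" or "name{labels} value" sample lines.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)$'
)


def parse_metrics(text):
    """Return (name, labels, value) tuples, skipping comments and blanks."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        samples.append(
            (match["name"], match["labels"] or "", float(match["value"]))
        )
    return samples


def find_problems(samples):
    """Report the three conditions listed in the task description."""
    problems = []
    for name, labels, value in samples:
        if name == "ceph_osd_flag_noout" and value == 1:
            problems.append("OSD flag noout is set")
        elif name == "ceph_pg_degraded" and value > 0:
            problems.append(f"{int(value)} degraded PGs ({labels})")
        elif name == "ceph_osd_up" and value == 0:
            problems.append(f"OSD down ({labels})")
    return problems


if __name__ == "__main__":
    for problem in find_problems(parse_metrics(SAMPLE_METRICS)):
        print(problem)
```

In a real alerting rule each condition would normally also need to hold for some duration before firing, so that a brief noout during planned maintenance does not page anyone.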

Details

	Subject	Repo	Branch	Lines +/-
	Configure prometheus metrics on the cephosd cluster	operations/puppet	production	+32 -0


Related Objects

Status	Assigned	Task
Open	None	T362788 Migrate Airflow to the dse-k8s cluster
Open	Stevemunene	T369582 Enable prometheus metrics on the cephosd cluster
Open	Stevemunene	T369583 Configure availability and health monitoring for the cephosd cluster

Event Timeline

BTullis created this task. Jul 9 2024, 12:07 AM

Gehel triaged this task as High priority. Jul 9 2024, 8:02 AM

Gehel moved this task from Incoming to Scratch on the Data-Platform-SRE board.

Gehel edited projects, added Data-Platform-SRE (2024.07.08 - 2024.07.28); removed Data-Platform-SRE. Jul 9 2024, 8:06 AM

bking edited projects, added Data-Platform-SRE (2024.07.29 - 2024.08.16); removed Data-Platform-SRE (2024.07.08 - 2024.07.28). Jul 31 2024, 2:44 PM

Stevemunene claimed this task. Thu, Aug 15, 8:21 AM

Gehel edited projects, added Data-Platform-SRE (2024.08.17 - 2024.09.06); removed Data-Platform-SRE (2024.07.29 - 2024.08.16). Fri, Aug 16, 9:45 AM

Change #1070142 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Configure prometheus metrics on the cephosd cluster

https://gerrit.wikimedia.org/r/1070142

gerritbot added a project: Patch-For-Review. Tue, Sep 3, 7:29 AM

Stevemunene moved this task from Backlog - project to In Progress on the Data-Platform-SRE (2024.08.17 - 2024.09.06) board. Tue, Sep 3, 9:07 AM
