Configure availability and health monitoring for the cephosd cluster
Open, High, Public

Description

Currently we have only basic host-level monitoring for the Data-Platform team's Ceph cluster.

We already receive alerts if systemd services fail, which is helpful, but we should also configure alerts based on the health check metrics exposed by the prometheus module of the mgr daemon.

These include:

  • OSD flags being set, such as noout
  • PGs (placement groups) being degraded
  • OSDs being marked as down

We should receive appropriate alerts based on these conditions.
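
As a rough illustration of the three conditions listed above, here is a minimal Python sketch that scans a scrape of the mgr prometheus endpoint and reports which conditions are active. The metric names (`ceph_osd_flag_*`, `ceph_pg_degraded`, `ceph_osd_up`) are assumptions about what the module exports and should be checked against the actual /metrics output of the cephosd cluster; the function names are hypothetical. In practice these checks would more likely be written as Prometheus alerting rules, so this only shows the logic.

```python
import re
from collections import defaultdict

# ASSUMPTION: these metric names are what the mgr prometheus module exports.
# Verify them against the cluster's real /metrics output before relying on them.
OSD_FLAG_PREFIX = "ceph_osd_flag_"
PG_DEGRADED = "ceph_pg_degraded"
OSD_UP = "ceph_osd_up"

# Matches "name{labels} value" or "name value" in the text exposition format.
_SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)")


def parse_samples(text):
    """Parse Prometheus text format into {metric_name: [(labels, value), ...]}."""
    samples = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        name, labels, value = match.groups()
        samples[name].append((labels or "", float(value)))
    return samples


def find_alerts(text):
    """Return a list of human-readable alert strings (hypothetical helper)."""
    samples = parse_samples(text)
    alerts = []

    # 1. OSD flags being set, such as noout
    for name, series in sorted(samples.items()):
        if name.startswith(OSD_FLAG_PREFIX) and any(v > 0 for _, v in series):
            alerts.append("OSD flag set: " + name[len(OSD_FLAG_PREFIX):])

    # 2. PGs being degraded (summed across all pools)
    degraded = sum(v for _, v in samples.get(PG_DEGRADED, []))
    if degraded > 0:
        alerts.append("degraded PGs: %d" % degraded)

    # 3. OSDs being marked as down (up == 0)
    down = [labels for labels, v in samples.get(OSD_UP, []) if v == 0]
    if down:
        alerts.append("OSDs down: %d" % len(down))

    return alerts
```

Calling `find_alerts()` on the text of a scrape returns an empty list when none of the three conditions hold, which makes it easy to check which thresholds an alert rule would need before writing the rule itself.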