
Setup monitoring for the MPIC applications
Closed, Resolved · Public · 5 Estimated Story Points

Description

Goal

We need to expose some MPIC metrics through Prometheus.

This is something that should be tackled by the Metrics Platform team, with help from the Data Platform SREs, as a way to share knowledge and experience.

AC

  • We have determined which metrics we want to expose
  • We have configured the prometheus client library
  • We have configured monitoring in the chart/helmfiles
  • mpic is exposing metrics properly

Notes

Event Timeline

Sfaci updated the task description.
Sfaci removed Sfaci as the assignee of this task. (May 14 2024, 11:44 AM)
Sfaci subscribed.
Sfaci set the point value for this task to 5. (May 20 2024, 7:37 PM)
VirginiaPoundstone raised the priority of this task from Medium to High. (Jul 18 2024, 7:31 PM)
VirginiaPoundstone lowered the priority of this task from High to Medium.
VirginiaPoundstone raised the priority of this task from Medium to High. (Aug 30 2024, 3:18 PM)

After asking the SREs for help and guidance (https://wikimedia.slack.com/archives/C055QGPTC69/p1725291450945039), we have some extra context about this task:

@Sfaci @cjming Related MR for the prom-client setup. As per our discussion, it sets up the following:

  • The prom-client collectDefaultMetrics() function, which automatically collects standard metrics such as CPU and memory usage.
  • A histogram that records the duration of each HTTP request.
  • A /metrics endpoint.
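
To make the histogram and /metrics items more concrete, here is a minimal, dependency-free sketch of the bookkeeping behind a request-duration histogram and of the Prometheus text format a /metrics endpoint would serve. This is not the code from the MR: in MPIC, prom-client does this work. The class name `DurationHistogram`, the metric name `http_request_duration_seconds`, and the bucket bounds are all assumptions chosen for illustration.

```typescript
// Illustrative sketch only. prom-client provides the real implementation;
// DurationHistogram and the metric name used below are hypothetical.
class DurationHistogram {
  private readonly name: string;
  private readonly help: string;
  private readonly buckets: number[]; // upper bounds in seconds, ascending
  private readonly counts: number[];  // per-bucket (non-cumulative) counters
  private sum = 0;
  private count = 0;

  constructor(name: string, help: string, buckets: number[]) {
    this.name = name;
    this.help = help;
    this.buckets = [...buckets].sort((a, b) => a - b);
    this.counts = this.buckets.map(() => 0);
  }

  // Record one request duration, in seconds.
  observe(seconds: number): void {
    this.sum += seconds;
    this.count += 1;
    // First bucket whose upper bound covers the observation, if any.
    const i = this.buckets.findIndex((le) => seconds <= le);
    if (i >= 0) {
      this.counts[i] += 1;
    }
  }

  // Render in the Prometheus text exposition format.
  // Bucket values are cumulative, and the +Inf bucket equals the total count.
  render(): string {
    const lines: string[] = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} histogram`,
    ];
    let cumulative = 0;
    this.buckets.forEach((le, i) => {
      cumulative += this.counts[i];
      lines.push(`${this.name}_bucket{le="${le}"} ${cumulative}`);
    });
    lines.push(`${this.name}_bucket{le="+Inf"} ${this.count}`);
    lines.push(`${this.name}_sum ${this.sum}`);
    lines.push(`${this.name}_count ${this.count}`);
    return lines.join("\n") + "\n";
  }
}
```

In a service, `observe()` would be called once per finished HTTP request with the elapsed time, and the /metrics handler would return the `render()` output as plain text for Prometheus to scrape.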

Change #1070649 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[operations/deployment-charts@master] Metrics Platform Instrument Configurator: Enabling prometheus monitoring for MPIC

https://gerrit.wikimedia.org/r/1070649

For tracking purposes, here is the MR where we implemented the Prometheus client library change for MPIC: https://gitlab.wikimedia.org/repos/data-engineering/mpic/-/merge_requests/96. It is already merged.

Change #1070649 merged by jenkins-bot:

[operations/deployment-charts@master] Metrics Platform Instrument Configurator: Enabling prometheus monitoring for MPIC

https://gerrit.wikimedia.org/r/1070649

Change #1070869 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[operations/deployment-charts@master] MPIC: Deploying to staging a new release (v0.1.4)

https://gerrit.wikimedia.org/r/1070869

Change #1070869 merged by jenkins-bot:

[operations/deployment-charts@master] MPIC: Deploying to staging a new release (v0.1.4)

https://gerrit.wikimedia.org/r/1070869

The monitoring code is already implemented and deployed, and the dashboard is already live, but some tuning is needed to display all the metrics properly. We based the current dashboard on an existing one for AQS, and the metric names are quite different. We are working on that now.

According to some comments that Ben left after https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1070649 was merged, we should move the monitoring configuration from the chart to the helmfiles for the next deployment. That is also pending.

The current dashboard is available at https://grafana-rw.wikimedia.org/d/ee2057f3-eb34-45a7-a48b-489e3ff0b2ec/mpic?orgId=1
@SGupta-WMF Could you take a look at it? I have been tuning the expressions and queries a bit, and everything seems to be working now.

Once it has been reviewed, we will be able to deploy to production and adjust the Kubernetes monitoring configuration as Ben suggested. That change is already prepared (see the gerritbot message below).

Change #1070977 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[operations/deployment-charts@master] MPIC: Moving monitoring configuration from chart to helmfiles

https://gerrit.wikimedia.org/r/1070977

SGupta-WMF updated Other Assignee, added: Sfaci; removed: SGupta-WMF.

Change #1070977 merged by jenkins-bot:

[operations/deployment-charts@master] MPIC: Moving monitoring configuration from chart to helmfiles

https://gerrit.wikimedia.org/r/1070977

Change #1071580 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[operations/deployment-charts@master] MPIC: Deploying a new release v0.1.5 to staging

https://gerrit.wikimedia.org/r/1071580

Change #1071580 merged by jenkins-bot:

[operations/deployment-charts@master] MPIC: Deploying a new release v0.1.5 to staging

https://gerrit.wikimedia.org/r/1071580

@Sfaci The dashboard looks good to me. I mostly reviewed the mpic-next service. Thanks!

SGupta-WMF updated Other Assignee, added: SGupta-WMF; removed: Sfaci.
SGupta-WMF updated the task description.

Cool! Thanks @SGupta-WMF
The next and final step is deploying MPIC to production to enable this new feature there (and preparing the dashboard for that environment as well). I'll get on with it.

Change #1073152 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[operations/deployment-charts@master] MPIC: New deployment (v0.1.5) to production

https://gerrit.wikimedia.org/r/1073152

Change #1073152 merged by jenkins-bot:

[operations/deployment-charts@master] MPIC: New deployment (v0.1.5) to production

https://gerrit.wikimedia.org/r/1073152

@brouberol Just in case you need to know, monitoring is now enabled for MPIC in both the staging and production environments. The dashboard is available at https://grafana-rw.wikimedia.org/d/ee2057f3-eb34-45a7-a48b-489e3ff0b2ec/mpic?orgId=1