[go: nahoru, domu]

Page MenuHomePhabricator

[GRAFMIGR] Migrate MediaWiki.wikibase.client.pageupdates.* to statslib
Closed, ResolvedPublic

Description

Follow the migration process as outlined below.

Secure/Conduct code review(s).
Deploy the changes to production via the train (https://wikitech.wikimedia.org/wiki/Deployments/Train).
Verify that the changes have been successfully implemented.
Update the relevant dashboard(s) by replacing the old Graphite metric(s) with the new Prometheus metric(s).
Please follow the guidelines and standards outlined in the provided documentation:

https://www.mediawiki.org/wiki/Manual:Stats for detailed guidance on the conversion process.
https://drive.google.com/file/d/12yQEuOapkML1vb9MgCaX1QzbLBdXE6X2/view for a video tutorial on the conversion process.
https://docs.google.com/presentation/d/1SZWf_D3mWNX-XHN8PHYI84LDZr6GUQC2AMhZ9mQXCI0/edit#slide=id.g2795460c956_0_23 for slides on the best practices for converting metrics to statslib.

  • MediaWiki.wikibase.client.pageupdates.*.discardedTitles.rate
  • MediaWiki.wikibase.client.pageupdates.InjectRCRecords.delay.p99
  • MediaWiki.wikibase.client.pageupdates.InjectRCRecords.discardedTitles.rate
  • MediaWiki.wikibase.client.pageupdates.InjectRCRecords.incompleteChanges.rate
  • MediaWiki.wikibase.client.pageupdates.InjectRCRecords.titles.rate
  • MediaWiki.wikibase.client.pageupdates.*.jobs.rate
  • MediaWiki.wikibase.client.pageupdates.RefreshLinks.jobs.rate
  • MediaWiki.wikibase.client.pageupdates.*.titles.rate

source lines (292,298,301)
source

Event Timeline

colewhite renamed this task from Migrate MediaWiki.wikibase.client.* to statslib to Migrate MediaWiki.wikibase.client.pageupdates.* to statslib.Apr 26 2024, 3:40 PM
colewhite updated the task description. (Show Details)

Change #1048470 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] WIP: InjectRCRecordsJob: Use statslib (StatsFactory)

https://gerrit.wikimedia.org/r/1048470

Note to self and others: be very careful not to use the same metric name for two different metrics (probably, each ->getCounter(), ->getGauge() and ->getTiming() call in the code should have a unique literal string argument, except perhaps in cases like T359253). I thought I knew this already (it’s mentioned here and there in the statslib docs), and yet I still accidentally messed this up (see PS2..3 of the above change), and if I hadn’t tested locally that the jobs emit the same metrics before and after statslib migration, I never would’ve noticed that part of the stats was being quietly dropped.

Change #1052737 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] DNM: testing gerrit commit message formatting

https://gerrit.wikimedia.org/r/1052737

Change #1052737 abandoned by Lucas Werkmeister (WMDE):

[mediawiki/extensions/Wikibase@master] DNM: testing gerrit commit message formatting

Reason:

https://gerrit.wikimedia.org/r/1052737

Oops, disregard the above DNM change – I forgot to remove the Bug line there ^^

Otherwise, assuming CI passes, I think the change should be ready for review now.

Change #1048470 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Use statslib (StatsFactory) in change dispatching

https://gerrit.wikimedia.org/r/1048470

Looks like I messed up the time units – immediately after group1 to wmf.14 (8:13 UTC), the metric shot up and has been alerting intermittently since then:

image.png (787×1 px, 95 KB)

Okay, I think this was actually a bug in the old code? The old code did:

		$this->recordDelay( $change->getAge() );
// ...
	private function recordDelay( int $delay ): void {
		if ( $this->stats ) {
			$this->stats->timing( 'wikibase.client.pageupdates.InjectRCRecords.delay', $delay );

But ChangeRow::getAge() returns seconds, while StatsdDataFactoryInterface::timing() takes milliseconds. And the Grafana dashboard was set up to show the metric as seconds rather than milliseconds.

Can we make Grafana use two different time scales, somehow? Divide the actual metric by 1000 before 8:13 UTC today? Or should we just update Grafana to use milliseconds and leave a note that this shows historical information incorrectly? (I think we could live with that – I don’t think the historical information is super interesting in this case.)


Also, a comment for @lmata or anyone else interested: IMHO this is still pretty confusing in the new system. Manual:Stats (permalink) says:

A metric: […]

  • should use base units (e.g. seconds, bytes, meters, etc.)
  • should have a suffix describing the unit in plural form. (i.e. _seconds, _bytes, _total, etc.)

And yet, immediately below:

$statsFactory->getTiming('foo')
	->observe( ( microtime( true ) - $startTime ) * 1000 );  # expects ms

What I did in this migration was call the metric _seconds and observe milliseconds, and I hope this is right, but it seems quite confusing. Given that TimingMetric is our own class (and not from any library or standard), could we not call the method observeMilliseconds() (and also have observeSeconds()) to make the unit more obvious?

For now, I’ve set the dashboard unit to ms and raised the alerting threshold from 3600 to 3600000, so it should at least stop alerting. If anyone knows how to make the panel show historical data accurately, please let me know (or edit it directly ^^).

Can we make Grafana use two different time scales, somehow? Divide the actual metric by 1000 before 8:13 UTC today?

Doesn’t sound like it (I asked in wikimedia-observability). In that case I think this can go back into tech verification.

Also, a comment for @lmata or anyone else interested: IMHO this is still pretty confusing in the new system. Manual:Stats (permalink) says:

A metric: […]

  • should use base units (e.g. seconds, bytes, meters, etc.)
  • should have a suffix describing the unit in plural form. (i.e. _seconds, _bytes, _total, etc.)

And yet, immediately below:

$statsFactory->getTiming('foo')
	->observe( ( microtime( true ) - $startTime ) * 1000 );  # expects ms

What I did in this migration was call the metric _seconds and observe milliseconds, and I hope this is right, but it seems quite confusing. Given that TimingMetric is our own class (and not from any library or standard), could we not call the method observeMilliseconds() (and also have observeSeconds()) to make the unit more obvious?

This is fantastic feedback, thank you.

I agree - this is something StatsLib should help guide the user to correct use. We have a task and I think it's time we work on it: T364240: Investigate making TimingMetric unit-aware

Lucas_Werkmeister_WMDE claimed this task.

I agree - this is something StatsLib should help guide the user to correct use. We have a task and I think it's time we work on it: T364240: Investigate making TimingMetric unit-aware

Thanks, what I see there looks good so far. I think we can close this task then.

karapayneWMDE renamed this task from Migrate MediaWiki.wikibase.client.pageupdates.* to statslib to [GRAFMIGR] Migrate MediaWiki.wikibase.client.pageupdates.* to statslib.Thu, Aug 29, 8:13 AM