Analytics-Data-Problem (Tag)
Active · Public

Members (2)

Watchers (3)

Details

Description

Specific issues where an Analytics dataset has incorrect, missing, or malformed data or shows an anomaly which might be caused by such data. Not for general work on data quality processes or monitoring.

(Project tag requested in T362839.)

Recent Activity

Yesterday

CodeReviewBot added a project to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report: Patch-For-Review.

milimetric opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/800

Tue, Aug 13, 2:52 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki

Mon, Aug 12

Krinkle added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

Hm.. it seems the "Other" bucket has grown slightly larger than our prediction of 0.26% at T342267#9998984. That could be fine, but I wanted to share it in case it's surprising:

Mon, Aug 12, 10:59 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
Milimetric added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

The new graphs are up. The pivot transformation failed for all the browser family reports, so I'm still fixing that. But, for example, we can now compare these two:

Mon, Aug 12, 7:14 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
Maintenance_bot removed a project from T342267: Investigate surprising "10% Other" portion of Analytics Browsers report: Patch-For-Review.
Mon, Aug 12, 4:31 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
gerritbot added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

Change #1062044 merged by Milimetric:

[analytics/refinery@master] Remove scripts related to old hive version

https://gerrit.wikimedia.org/r/1062044

Mon, Aug 12, 3:54 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
gerritbot added a project to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report: Patch-For-Review.
Mon, Aug 12, 3:53 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
gerritbot added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

Change #1062044 had a related patch set uploaded (by Milimetric; author: Milimetric):

[analytics/refinery@master] Remove scripts related to old hive version

https://gerrit.wikimedia.org/r/1062044

Mon, Aug 12, 3:53 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki

Fri, Aug 9

Milimetric added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

The backfill job should be done sometime this weekend, and I'll rerun the weekly job then.

Fri, Aug 9, 1:23 AM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki

Wed, Aug 7

Milimetric added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

For reference, I cleared all the tasks that this dag ran, and that will refresh data for 2 years. We can decide then if we want to do the full history:

Wed, Aug 7, 9:27 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
Milimetric added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

Status update on this: the new job is running, I'm going to keep it here until we vet the data. But new data should start showing up right away, and we can compare dashboards side by side and day by day:

Wed, Aug 7, 9:09 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
Maintenance_bot removed a project from T342267: Investigate surprising "10% Other" portion of Analytics Browsers report: Patch-For-Review.
Wed, Aug 7, 4:31 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
gerritbot added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

Change #1049281 merged by Milimetric:

[analytics/refinery@master] Implement new way to aggregate browser statistics

https://gerrit.wikimedia.org/r/1049281

Wed, Aug 7, 4:29 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki

Mon, Aug 5

gerritbot added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

Change #1059914 merged by Milimetric:

[analytics/analytics.wikimedia.org@master] Add temporary dashboard pointing to old data

https://gerrit.wikimedia.org/r/1059914

Mon, Aug 5, 4:58 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
gerritbot added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

Change #1059914 had a related patch set uploaded (by Milimetric; author: Milimetric):

[analytics/analytics.wikimedia.org@master] Add temporary dashboard pointing to old data

https://gerrit.wikimedia.org/r/1059914

Mon, Aug 5, 4:58 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki

Thu, Aug 1

Dusan_Krehel updated the task description for T370108: Missed pageview data over API.
Thu, Aug 1, 2:26 PM · Analytics-Data-Problem, Data Products, Pageviews-API, Data-Engineering
Dusan_Krehel updated the task description for T370108: Missed pageview data over API.
Thu, Aug 1, 11:40 AM · Analytics-Data-Problem, Data Products, Pageviews-API, Data-Engineering

Wed, Jul 31

Mayakp.wiki added a project to T370108: Missed pageview data over API: Analytics-Data-Problem.
Wed, Jul 31, 6:55 PM · Analytics-Data-Problem, Data Products, Pageviews-API, Data-Engineering

Mon, Jul 29

Milimetric moved T342267: Investigate surprising "10% Other" portion of Analytics Browsers report from Sprint Backlog to To Deploy on the Data Products (Data Products Sprint 17) board.
Mon, Jul 29, 4:07 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
Milimetric edited projects for T342267: Investigate surprising "10% Other" portion of Analytics Browsers report, added: Data Products (Data Products Sprint 17); removed Data Products (Data Products Sprint 16).
Mon, Jul 29, 3:24 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
Milimetric moved T342267: Investigate surprising "10% Other" portion of Analytics Browsers report from Sign Off to To Deploy on the Data Products (Data Products Sprint 16) board.

great, moving this to get deployed. Steps will be:

Mon, Jul 29, 3:22 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
Krinkle added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

👍 This is great!

Mon, Jul 29, 3:11 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
Snwachukwu added a comment to T369851: NEW BUG REPORT Mediawiki_history contains duplicate rows for some revisions.

Next steps would be to rerun any affected downstream jobs.

Mon, Jul 29, 2:57 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Movement-Insights, Analytics-Data-Problem, Data-Platform
Snwachukwu added a comment to T369851: NEW BUG REPORT Mediawiki_history contains duplicate rows for some revisions.

So I reran the mediawiki_history_denormalize Airflow DAG to regenerate the snapshot for 2024-06 and also reran mediawiki_history_check_denormalize. I did a check using the same query @nshahquinn-wmf ran. We don't have any more duplicates.

Mon, Jul 29, 2:56 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Movement-Insights, Analytics-Data-Problem, Data-Platform

Fri, Jul 26

gerritbot added a project to T371099: No longer use removed cuc_actiontext column in analytics/refinery: Patch-For-Review.
Fri, Jul 26, 2:42 PM · Data Products (Data Products Sprint 17), Data-Engineering, Patch-For-Review
gerritbot added a comment to T371099: No longer use removed cuc_actiontext column in analytics/refinery.

Change #1057221 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[analytics/refinery@master] Don't select cuc_actiontext from cu_changes for sqoop

https://gerrit.wikimedia.org/r/1057221

Fri, Jul 26, 2:42 PM · Data Products (Data Products Sprint 17), Data-Engineering, Patch-For-Review
Dreamy_Jazz added a subtask for T371099: No longer use removed cuc_actiontext column in analytics/refinery: T324907: Create separate tables for log events in CheckUser.
Fri, Jul 26, 2:42 PM · Data Products (Data Products Sprint 17), Data-Engineering, Patch-For-Review
Dreamy_Jazz created T371099: No longer use removed cuc_actiontext column in analytics/refinery.
Fri, Jul 26, 2:40 PM · Data Products (Data Products Sprint 17), Data-Engineering, Patch-For-Review

Thu, Jul 25

amastilovic moved T369851: NEW BUG REPORT Mediawiki_history contains duplicate rows for some revisions from Next Up to In progress on the Data-Engineering (Q1 2024 July 1st - September 30th) board.
Thu, Jul 25, 4:27 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Movement-Insights, Analytics-Data-Problem, Data-Platform
Snwachukwu added a comment to T369851: NEW BUG REPORT Mediawiki_history contains duplicate rows for some revisions.

So I dug further by looking at the Airflow job to see if it ran twice for any reason, and I think I found the culprit. The job did indeed run twice.
The first run reported an error that we've seen before in ticket T342911. The Skein job ran and failed with this error:

Thu, Jul 25, 12:38 AM · Data-Engineering (Q1 2024 July 1st - September 30th), Movement-Insights, Analytics-Data-Problem, Data-Platform
Snwachukwu added a comment to T369851: NEW BUG REPORT Mediawiki_history contains duplicate rows for some revisions.

I tried to rerun the job for one of the small wiki_dbs (tetwiki) that has a duplicate revision record, using this command:

Thu, Jul 25, 12:24 AM · Data-Engineering (Q1 2024 July 1st - September 30th), Movement-Insights, Analytics-Data-Problem, Data-Platform

Wed, Jul 24

DVrandecic moved T364872: Unique devices per country spikes on wikifunctions from To triage to No current plans / External on the Abstract Wikipedia team board.
Wed, Jul 24, 4:37 PM · Abstract Wikipedia team, Movement-Insights, Analytics-Data-Problem, Data-Platform
mforns moved T342267: Investigate surprising "10% Other" portion of Analytics Browsers report from To Deploy to Sign Off on the Data Products (Data Products Sprint 16) board.
Wed, Jul 24, 4:06 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki

Mon, Jul 22

Milimetric added a comment to T369851: NEW BUG REPORT Mediawiki_history contains duplicate rows for some revisions.
  • wmf.mediawiki_history: duplicate revision/create records indeed exist, some have 4 copies and some 2 copies but all spot-checked duplicates come in even numbers
  • wmf_raw.mediawiki_revision: does not show the same duplication
  • analytics mysql replicas: the pages those revisions belong to were moved and had some delete/restore and delete/revision actions in the logging table
  • cloud replicas: agrees with analytics replicas
Mon, Jul 22, 9:59 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Movement-Insights, Analytics-Data-Problem, Data-Platform
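
The spot check described in the bullets above (some revisions with 2 copies, some with 4, always an even number) can be sketched roughly as follows. This is an illustration only, not the query used on the task: the field names `wiki_db` and `revision_id` and the sample rows are assumptions, and a real check would run against wmf.mediawiki_history rather than an in-memory list.

```python
from collections import Counter

def duplicate_counts(rows):
    """Return {(wiki_db, revision_id): copies} for keys seen more than once."""
    counts = Counter((r["wiki_db"], r["revision_id"]) for r in rows)
    return {key: n for key, n in counts.items() if n > 1}

def all_even(dupes):
    """True when every duplicated key appears an even number of times."""
    return all(n % 2 == 0 for n in dupes.values())

# Hypothetical sample rows; the revision ids are made up.
sample_rows = (
    [{"wiki_db": "tetwiki", "revision_id": 101}] * 2
    + [{"wiki_db": "tetwiki", "revision_id": 102}]
    + [{"wiki_db": "tetwiki", "revision_id": 103}] * 4
)

dupes = duplicate_counts(sample_rows)
print(dupes, all_even(dupes))
```

Running the same count against the raw revision table would be one way to confirm that the duplication is introduced downstream of it, as the second bullet suggests.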
Ottomata updated subscribers of T369851: NEW BUG REPORT Mediawiki_history contains duplicate rows for some revisions.
Mon, Jul 22, 2:49 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Movement-Insights, Analytics-Data-Problem, Data-Platform

Fri, Jul 19

Milimetric moved T342267: Investigate surprising "10% Other" portion of Analytics Browsers report from Code Review / Tech Input to To Deploy on the Data Products (Data Products Sprint 16) board.

ok, moving to ready to deploy. I'm going to ping @Krinkle one more time for data review. I executed this as I was testing and the results are available in milimetric.browser_general_test. You can query this like this:

Fri, Jul 19, 4:29 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
Jdforrester-WMF added a project to T364872: Unique devices per country spikes on wikifunctions : Abstract Wikipedia team.

There are…a lot of "pageviews" coming from just 2 IP addresses that day.

Special:GlobalUsage on Wikifunctions is particularly utilized:

[…]

Fri, Jul 19, 3:45 PM · Abstract Wikipedia team, Movement-Insights, Analytics-Data-Problem, Data-Platform
mpopov added a comment to T364872: Unique devices per country spikes on wikifunctions .
select normalized_host.project_class, ip, count(1) as view_count
from pageview_actor 
where year = 2024 and month = 6 and day = 28
  and http_status = '301'
  and agent_type = 'user'
  and uri_path = '/w/index.php'
  and regexp_like(uri_query, 'title=Special%3AGlobalUsage')
  and (is_redirect_to_pageview or is_pageview)
group by 1, 2
order by view_count desc
limit 1000
Fri, Jul 19, 2:21 PM · Abstract Wikipedia team, Movement-Insights, Analytics-Data-Problem, Data-Platform
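
For readers without access to pageview_actor, the filter-and-count logic of the query above can be mirrored in plain Python as a rough sketch. The dictionary keys follow the column names in the query; the helper `make_request`, the sample rows, and the IP addresses (taken from a documentation-only range) are made up, and the year/month/day partition filter is omitted.

```python
from collections import Counter

def top_global_usage_clients(requests, limit=1000):
    """Count matching requests per (project_class, ip), highest count first."""
    counts = Counter()
    for r in requests:
        if (
            r["http_status"] == "301"
            and r["agent_type"] == "user"
            and r["uri_path"] == "/w/index.php"
            # The pattern has no regex metacharacters, so a substring test
            # behaves like regexp_like here.
            and "title=Special%3AGlobalUsage" in r["uri_query"]
            and (r["is_redirect_to_pageview"] or r["is_pageview"])
        ):
            counts[(r["project_class"], r["ip"])] += 1
    return counts.most_common(limit)

def make_request(project_class, ip,
                 uri_query="title=Special%3AGlobalUsage&target=Foo",
                 http_status="301"):
    """Build one hypothetical request row for the example."""
    return {
        "project_class": project_class,
        "ip": ip,
        "http_status": http_status,
        "agent_type": "user",
        "uri_path": "/w/index.php",
        "uri_query": uri_query,
        "is_redirect_to_pageview": True,
        "is_pageview": False,
    }

sample_requests = (
    [make_request("wikifunctions", "192.0.2.1")] * 3
    + [make_request("wikifunctions", "192.0.2.2")] * 2
    + [make_request("wikipedia", "192.0.2.1", uri_query="title=Main_Page")]
    + [make_request("wikifunctions", "192.0.2.3", http_status="200")]
)

print(top_global_usage_clients(sample_requests))
```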
mpopov added a comment to T364872: Unique devices per country spikes on wikifunctions .

It also means there's nobody to ask to fix the behavior. I believe this requires engineering help from DPE.

Fri, Jul 19, 1:48 PM · Abstract Wikipedia team, Movement-Insights, Analytics-Data-Problem, Data-Platform
mpopov added a comment to T364872: Unique devices per country spikes on wikifunctions .

@Mayakp.wiki: Special:GlobalUsage comes from Extension:GlobalUsage (GlobalUsage), which is a volunteer-authored extension.

Fri, Jul 19, 1:47 PM · Abstract Wikipedia team, Movement-Insights, Analytics-Data-Problem, Data-Platform

Thu, Jul 18

mforns added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

Heya @Milimetric, sorry for taking so long to review this.
I left a comment and a +1, I think that the code looks great and that we can deploy this 👍
This new query is so cool! Kudos :-)

Thu, Jul 18, 6:56 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
Ottomata moved T369851: NEW BUG REPORT Mediawiki_history contains duplicate rows for some revisions from In progress to Next Up on the Data-Engineering (Q1 2024 July 1st - September 30th) board.
Thu, Jul 18, 4:45 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Movement-Insights, Analytics-Data-Problem, Data-Platform

Wed, Jul 17

VirginiaPoundstone moved T342267: Investigate surprising "10% Other" portion of Analytics Browsers report from In Process to Code Review / Tech Input on the Data Products (Data Products Sprint 16) board.
Wed, Jul 17, 4:15 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki

Tue, Jul 16

VirginiaPoundstone moved T359004: arywiki view stats too low for agent = user? from Radar (other teams) to To be discussed on the Data Products board.
Tue, Jul 16, 4:42 PM · Analytics-Data-Problem, Movement-Insights, Data Products, Data-Engineering, Data-Engineering-Wikistats

Mon, Jul 15

lbowmaker moved T369851: NEW BUG REPORT Mediawiki_history contains duplicate rows for some revisions from Incoming (new tickets) to Q1 2024 July 1st - September 30th on the Data-Engineering board.
Mon, Jul 15, 2:33 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Movement-Insights, Analytics-Data-Problem, Data-Platform
lbowmaker moved T369851: NEW BUG REPORT Mediawiki_history contains duplicate rows for some revisions from Next Up to In progress on the Data-Engineering (Q1 2024 July 1st - September 30th) board.
Mon, Jul 15, 2:32 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Movement-Insights, Analytics-Data-Problem, Data-Platform

Jul 12 2024

lbowmaker moved T369851: NEW BUG REPORT Mediawiki_history contains duplicate rows for some revisions from Backlog to Data Engineering on the Data-Platform board.
Jul 12 2024, 2:59 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Movement-Insights, Analytics-Data-Problem, Data-Platform
lbowmaker assigned T369851: NEW BUG REPORT Mediawiki_history contains duplicate rows for some revisions to Snwachukwu.
Jul 12 2024, 2:59 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Movement-Insights, Analytics-Data-Problem, Data-Platform
Milimetric added a comment to T342267: Investigate surprising "10% Other" portion of Analytics Browsers report.

Ok, sent updated code, it's fast now due to a CACHE statement, but that doesn't change the query plan which is still absolutely nuts, check this out:

Jul 12 2024, 2:46 PM · Patch-For-Review, Data Products (Data Products Sprint 17), Analytics-Data-Problem, MediaWiki-Platform-Team (Radar), Data-Engineering, Data-Engineering-Dashiki
OSefu-WMF moved T369851: NEW BUG REPORT Mediawiki_history contains duplicate rows for some revisions from Incoming to Waiting on others on the Movement-Insights board.
Jul 12 2024, 1:57 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Movement-Insights, Analytics-Data-Problem, Data-Platform

Jul 11 2024

nshahquinn-wmf added a comment to T369851: NEW BUG REPORT Mediawiki_history contains duplicate rows for some revisions.

This may help in diagnosing the problem: looking at the snapshot, the number of duplicates is not uniform across event_timestamp. There are almost none until 2014, and then the number generally increases until the most recent month.

Jul 11 2024, 9:50 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Movement-Insights, Analytics-Data-Problem, Data-Platform
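
The distribution described in the comment above (almost no duplicates before 2014, then generally rising) could be tabulated along these lines. This is a minimal sketch under stated assumptions: rows are keyed by `revision_id` and a string `event_timestamp` beginning with the four-digit year, and the sample data is invented purely to show the shape of the output.

```python
from collections import Counter

def surplus_copies_by_year(rows):
    """Count extra copies (copies beyond the first) per event_timestamp year."""
    copies = Counter((r["revision_id"], r["event_timestamp"]) for r in rows)
    per_year = Counter()
    for (_revision_id, timestamp), n in copies.items():
        if n > 1:
            per_year[timestamp[:4]] += n - 1
    return dict(sorted(per_year.items()))

# Hypothetical rows; ids and timestamps are made up.
sample_history = (
    [{"revision_id": 1, "event_timestamp": "2013-05-01 00:00:00"}]
    + [{"revision_id": 2, "event_timestamp": "2016-03-10 12:00:00"}] * 2
    + [{"revision_id": 3, "event_timestamp": "2024-06-02 08:30:00"}] * 4
    + [{"revision_id": 4, "event_timestamp": "2024-06-15 09:00:00"}] * 2
)

print(surplus_copies_by_year(sample_history))
```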