Specific issues where an Analytics dataset has incorrect, missing, or malformed data or shows an anomaly which might be caused by such data. Not for general work on data quality processes or monitoring.
(Project tag requested in T362839.)
Hm.. it seems the "Other" bucket has grown slightly larger than our prediction of 0.26% at T342267#9998984. That could be fine, but I wanted to share it in case it's surprising:
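(For anyone who wants to re-check, the share can be recomputed along these lines — a rough sketch, where the table and column names are placeholders rather than the actual report schema:)

  -- Rough sketch: share of the 'Other' bucket in a browser report.
  -- browser_report, browser_family, view_count, and the date are hypothetical.
  select sum(case when browser_family = 'Other' then view_count else 0 end)
         / sum(view_count) as other_share
  from browser_report
  where day = '2024-08-01';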
the new graphs are up. The pivot transformation failed for all the browser family reports, so I'm still fixing that. But, for example, we can now compare these two:
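(For context, the failing pivot step is roughly of this shape in Spark SQL — a hypothetical sketch with placeholder names, not the actual job:)

  -- Hypothetical sketch: long-format rows (day, browser_family, view_count)
  -- widened into one column per browser family. All names are placeholders.
  select *
  from (select day, browser_family, view_count from browser_report)
  pivot (sum(view_count) for browser_family in ('Chrome', 'Firefox', 'Safari', 'Other'));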
Change #1062044 merged by Milimetric:
[analytics/refinery@master] Remove scripts related to old hive version
Change #1062044 had a related patch set uploaded (by Milimetric; author: Milimetric):
[analytics/refinery@master] Remove scripts related to old hive version
The backfill job should be done sometime this weekend, and I'll rerun the weekly job then.
For reference, I cleared all the tasks that this DAG ran, and that will refresh data for 2 years. We can decide then if we want to do the full history:
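(For the record, clearing task instances so Airflow re-runs them looks roughly like this — the DAG id and date range below are placeholders, not the exact command used:)

  # Placeholder DAG id and dates; 'airflow tasks clear' marks the matching
  # task instances for re-execution by the scheduler.
  airflow tasks clear browser_general -s 2022-08-01 -e 2024-08-01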
Status update on this: the new job is running, I'm going to keep it here until we vet the data. But new data should start showing up right away, and we can compare dashboards side by side and day by day:
Change #1049281 merged by Milimetric:
[analytics/refinery@master] Implement new way to aggregate browser statistics
Change #1059914 merged by Milimetric:
[analytics/analytics.wikimedia.org@master] Add temporary dashboard pointing to old data
Change #1059914 had a related patch set uploaded (by Milimetric; author: Milimetric):
[analytics/analytics.wikimedia.org@master] Add temporary dashboard pointing to old data
great, moving this to get deployed. Steps will be:
👍 This is great!
Next steps would be to rerun any affected downstream jobs.
So I reran the mediawiki_history_denormalize Airflow DAG to regenerate the snapshot for 2024-06 and also reran mediawiki_history_check_denormalize. I did a check using the same query @nshahquinn-wmf ran. We don't have any more duplicates.
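(The check was along these lines — a sketch assuming the standard wmf.mediawiki_history schema, not necessarily the exact query:)

  -- Sketch of a duplicate-revision check against the 2024-06 snapshot.
  -- Assumes the standard wmf.mediawiki_history columns.
  select wiki_db, revision_id, count(1) as copies
  from wmf.mediawiki_history
  where snapshot = '2024-06'
    and event_entity = 'revision'
  group by 1, 2
  having count(1) > 1
  limit 100;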
Change #1057221 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):
[analytics/refinery@master] Don't select cuc_actiontext from cu_changes for sqoop
So I dug further by looking at the Airflow job to see if it ran twice for any reason, and I think I found the culprit: the job indeed ran twice.
The first run reported an error which we've seen before in ticket T342911: the Skein job ran and failed with this error:
I tried to rerun the job for one of the small wikis with a duplicate revision record (wiki_db tetwiki), using this command:
ok, moving to ready to deploy. I'm going to ping @Krinkle one more time for data review. I executed this as I was testing, and the results are available in milimetric.browser_general_test. You can query it like this:
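For instance, a minimal peek at the rows:

  -- Minimal example query against the test table mentioned above;
  -- adapt the projection once you know the schema you care about.
  select * from milimetric.browser_general_test limit 10;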
In T364872#9998611, @mpopov wrote: There are…a lot of "pageviews" coming from just 2 IP addresses that day.
Special:GlobalUsage on Wikifunctions is particularly utilized:
[…]
  select
    normalized_host.project_class,
    ip,
    count(1) as view_count
  from pageview_actor
  where year = 2024
    and month = 6
    and day = 28
    and http_status = '301'
    and agent_type = 'user'
    and uri_path = '/w/index.php'
    and regexp_like(uri_query, 'title=Special%3AGlobalUsage')
    and (is_redirect_to_pageview or is_pageview)
  group by 1, 2
  order by view_count desc
  limit 1000
It also means there's nobody to ask to fix the behavior. I believe this requires engineering help from DPE.
@Mayakp.wiki: Special:GlobalUsage comes from Extension:GlobalUsage, which is a volunteer-authored extension.
Heya @Milimetric, sorry for taking so long to review this.
I left a comment and a +1, I think that the code looks great and that we can deploy this 👍
This new query is so cool! Kudos :-)
Ok, sent updated code. It's fast now thanks to a CACHE statement, but that doesn't change the query plan, which is still absolutely nuts. Check this out:
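(For context, the CACHE statement is of this shape in Spark SQL; the table name below is a placeholder:)

  -- Spark SQL: keep the relation in memory so repeated scans in this session
  -- reuse it instead of recomputing. Table name is a hypothetical placeholder.
  cache table browser_general_intermediate;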
This may help in diagnosing the problem: looking at the snapshot, the number of duplicates is not uniform across event_timestamp. There are almost none until 2014, and then the number generally increases until the most recent month.
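(A sketch of how that distribution can be bucketed by month — again assuming the standard wmf.mediawiki_history schema, not the exact query used:)

  -- Sketch: count groups of duplicated revision rows per month of event_timestamp.
  select substr(event_timestamp, 1, 7) as month, count(1) as duplicate_groups
  from (
    select wiki_db, revision_id, min(event_timestamp) as event_timestamp
    from wmf.mediawiki_history
    where snapshot = '2024-06'
      and event_entity = 'revision'
    group by wiki_db, revision_id
    having count(1) > 1
  ) dups
  group by 1
  order by 1;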