Antoine_Quhen (aqu)
User

User Details

User Since
Jan 4 2022, 1:16 PM (139 w, 1 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
AQuhen (WMF)

Recent Activity

Mon, Aug 26

Antoine_Quhen moved T365659: Implement automatic sync of refinery HQL files to HDFS from Blocked/Paused to In progress on the Data-Engineering (Q1 2024 July 1st - September 30th) board.
Mon, Aug 26, 2:17 PM · Data-Engineering (Q1 2024 July 1st - September 30th)
Antoine_Quhen moved T354694: [Maintenance] Safeguard VarnishKafka to HAProxy analytics transition from Blocked/Paused to To be estimated on the Data-Engineering (Q1 2024 July 1st - September 30th) board.
Mon, Aug 26, 2:11 PM · Data-Engineering
Antoine_Quhen moved T363587: [Event Platform] Instrument EventBus with prometheus MW Statslib from In Review to Ready to Deploy on the Data-Engineering (Q1 2024 July 1st - September 30th) board.
Mon, Aug 26, 2:10 PM · MW-1.43-notes (1.43.0-wmf.22; 2024-09-10), Patch-For-Review, Data-Engineering (Q1 2024 July 1st - September 30th), Dumps 2.0 (Kanban Board), Event-Platform
Antoine_Quhen moved T372768: [BUG] MediawikiPageContentChangeEnrichAvailability is firing from In Review to Done on the Data-Engineering (Q1 2024 July 1st - September 30th) board.
Mon, Aug 26, 2:08 PM · Patch-For-Review, Dumps 2.0 (Kanban Board), Event-Platform, Data-Engineering (Q1 2024 July 1st - September 30th)
Antoine_Quhen updated the task description for T365563: Timeout hive-metastore locks.
Mon, Aug 26, 9:23 AM · Patch-For-Review, Data-Engineering
Antoine_Quhen updated the task description for T365563: Timeout hive-metastore locks.
Mon, Aug 26, 9:22 AM · Patch-For-Review, Data-Engineering
Antoine_Quhen closed T365223: Fix generation of _IMPORTED flags by Gobblin as Resolved.
Mon, Aug 26, 8:56 AM · Data-Engineering, Data Pipelines, Patch-For-Review
Antoine_Quhen closed T365223: Fix generation of _IMPORTED flags by Gobblin, a subtask of T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation, as Resolved.
Mon, Aug 26, 8:55 AM · Data-Engineering (Q1 2024 July 1st - September 30th), Patch-For-Review

Thu, Aug 15

Antoine_Quhen added a comment to T370665: Handle Late-Arrived Events from Gobblin into Airflow triggered Refine.

I've done a short study of our late events, which I detect from the timestamp of the file created by Gobblin.

Thu, Aug 15, 4:53 PM · Data-Engineering
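
As a rough illustration of detecting late events from file timestamps, the sketch below flags a file as late when it was created after the end of the hour it belongs to, plus a grace period. The function name, the hourly partition size, and the grace period are assumptions made for this example; they are not details taken from the study.

```python
from datetime import datetime, timedelta


# Hypothetical helper: the name, the hourly partitioning and the default
# grace period are illustrative assumptions, not the actual study's logic.
def is_late_file(partition_hour: datetime,
                 file_created: datetime,
                 grace: timedelta = timedelta(minutes=30)) -> bool:
    """Return True if the file landed after its hourly partition should be complete."""
    partition_end = partition_hour + timedelta(hours=1)
    return file_created > partition_end + grace


hour = datetime(2024, 8, 15, 10)
# Created 5 minutes after the hour closed: still inside the grace period.
on_time = is_late_file(hour, datetime(2024, 8, 15, 11, 5))
# Created 2 hours after the hour closed: counted as a late arrival.
late = is_late_file(hour, datetime(2024, 8, 15, 13, 0))
```

The grace period is the main tuning knob here: a longer one reports fewer late files but delays when a partition can be considered complete.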

Thu, Aug 8

mpopov awarded T371373: airflow-dags: Mutualization of _IMPORTED flag sensors creations a Like token.
Thu, Aug 8, 7:40 PM · Data-Engineering (Q1 2024 July 1st - September 30th)

Aug 5 2024

Antoine_Quhen added a subtask for T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation: T371803: Refine optimizations on output and parallelization.
Aug 5 2024, 12:27 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Patch-For-Review
Antoine_Quhen added a parent task for T371803: Refine optimizations on output and parallelization: T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation.
Aug 5 2024, 12:27 PM · Data-Engineering
Antoine_Quhen added a comment to T371803: Refine optimizations on output and parallelization.

The test code: https://gitlab.wikimedia.org/-/snippets/149

Aug 5 2024, 12:26 PM · Data-Engineering
Antoine_Quhen added a comment to T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation.

I've isolated the optimization here https://phabricator.wikimedia.org/T371803

Aug 5 2024, 12:18 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Patch-For-Review
Antoine_Quhen created T371803: Refine optimizations on output and parallelization.
Aug 5 2024, 12:16 PM · Data-Engineering

Jul 31 2024

Jelto awarded T365449: Upgrade Airflow to 2.9.3 a Like token.
Jul 31 2024, 8:23 AM · Patch-For-Review, Release-Engineering-Team (Radar), Data-Platform-SRE (2024.07.29 - 2024.08.16), collaboration-services, Data Pipelines, Data-Engineering

Jul 30 2024

Antoine_Quhen created T371373: airflow-dags: Mutualization of _IMPORTED flag sensors creations.
Jul 30 2024, 12:52 PM · Data-Engineering (Q1 2024 July 1st - September 30th)

Jul 29 2024

Antoine_Quhen added a comment to T366627: [MPIC] Analyse risk of potential performance issues with static approach to stream configuration.

We have prepared some work to Refine raw events directly into Iceberg tables.

Jul 29 2024, 1:23 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Data Products, Metrics Platform

Jul 22 2024

Antoine_Quhen added a subtask for T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation: T370665: Handle Late-Arrived Events from Gobblin into Airflow triggered Refine.
Jul 22 2024, 4:03 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Patch-For-Review
Antoine_Quhen added a parent task for T370665: Handle Late-Arrived Events from Gobblin into Airflow triggered Refine: T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation.
Jul 22 2024, 4:03 PM · Data-Engineering
Antoine_Quhen created T370665: Handle Late-Arrived Events from Gobblin into Airflow triggered Refine.
Jul 22 2024, 4:02 PM · Data-Engineering

Jun 27 2024

Antoine_Quhen added a comment to T367134: [Refine Refactoring] Changes to EventStreamConfig needed for scheduling Refine via airflow.

As discussed in the meeting, I'm OK with minimal changes to ESC: it would be nice to add the hourly_computing_scale and have a boolean ~needs_to_be_refined.

Jun 27 2024, 3:45 PM · MW-1.43-notes (1.43.0-wmf.15; 2024-07-23), Data-Engineering (Q1 2024 July 1st - September 30th)
Antoine_Quhen added a comment to T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation.

To check the output of the new Refine process, I've been working with diff from https://github.com/G-Research/spark-extension/ to verify that my output matches what was generated by the old Refine process. Here are the problems I've encountered and circumvented:

Jun 27 2024, 3:30 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Patch-For-Review

Jun 14 2024

Antoine_Quhen added a comment to T362594: Update the editor_month table with an Airflow job.

Following the discussion on Slack, I've created the table and changed the dir owner to analytics-product. The long-term discussion about HDFS dir architecture and ownership is here: T367243

Jun 14 2024, 3:57 PM · Movement-Insights

Jun 5 2024

Antoine_Quhen added a comment to T360922: [Status Store] [SPIKE] Investigate and document approach for Iceberg Sensors.

We have been working with @amastilovic on a Wikitech page to describe the reasoning behind choosing Airflow itself to store Dataset status.

Jun 5 2024, 10:11 AM · Data-Engineering (Q1 2024 July 1st - September 30th), Dumps 2.0 (Kanban Board), Spike
Antoine_Quhen assigned T360922: [Status Store] [SPIKE] Investigate and document approach for Iceberg Sensors to amastilovic.
Jun 5 2024, 10:04 AM · Data-Engineering (Q1 2024 July 1st - September 30th), Dumps 2.0 (Kanban Board), Spike

May 22 2024

Antoine_Quhen added a subtask for T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation: T365563: Timeout hive-metastore locks.
May 22 2024, 8:46 AM · Data-Engineering (Q1 2024 July 1st - September 30th), Patch-For-Review
Antoine_Quhen added a parent task for T365563: Timeout hive-metastore locks: T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation.
May 22 2024, 8:46 AM · Patch-For-Review, Data-Engineering
Antoine_Quhen created T365563: Timeout hive-metastore locks.
May 22 2024, 8:45 AM · Patch-For-Review, Data-Engineering
Antoine_Quhen added a subtask for T365449: Upgrade Airflow to 2.9.3: Unknown Object (Task).
May 22 2024, 8:41 AM · Patch-For-Review, Release-Engineering-Team (Radar), Data-Platform-SRE (2024.07.29 - 2024.08.16), collaboration-services, Data Pipelines, Data-Engineering

May 21 2024

Antoine_Quhen added a subtask for T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation: T365449: Upgrade Airflow to 2.9.3.
May 21 2024, 9:46 AM · Data-Engineering (Q1 2024 July 1st - September 30th), Patch-For-Review
Antoine_Quhen added a parent task for T365449: Upgrade Airflow to 2.9.3: T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation.
May 21 2024, 9:46 AM · Patch-For-Review, Release-Engineering-Team (Radar), Data-Platform-SRE (2024.07.29 - 2024.08.16), collaboration-services, Data Pipelines, Data-Engineering
Antoine_Quhen created T365449: Upgrade Airflow to 2.9.3.
May 21 2024, 9:45 AM · Patch-For-Review, Release-Engineering-Team (Radar), Data-Platform-SRE (2024.07.29 - 2024.08.16), collaboration-services, Data Pipelines, Data-Engineering

May 17 2024

Antoine_Quhen added a subtask for T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation: T365223: Fix generation of _IMPORTED flags by Gobblin.
May 17 2024, 8:12 AM · Data-Engineering (Q1 2024 July 1st - September 30th), Patch-For-Review
Antoine_Quhen added a parent task for T365223: Fix generation of _IMPORTED flags by Gobblin: T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation.
May 17 2024, 8:12 AM · Data-Engineering, Data Pipelines, Patch-For-Review
Antoine_Quhen created T365223: Fix generation of _IMPORTED flags by Gobblin.
May 17 2024, 8:12 AM · Data-Engineering, Data Pipelines, Patch-For-Review

May 16 2024

Antoine_Quhen added a comment to T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation.

Moreover, another setting that makes a DAG execute in a depth-first manner is weight_rule="upstream", added to its default_args (or to the arguments of individual tasks).

May 16 2024, 2:42 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Patch-For-Review
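
To show why the weight_rule="upstream" setting mentioned in the comment above favors depth-first execution, here is a minimal pure-Python sketch that does not import Airflow. It assumes the behavior described in the Airflow documentation: with the default "downstream" rule a task's effective priority is its own weight plus the weights of all its downstream tasks, while with "upstream" it is its own weight plus the weights of all its upstream tasks. The task names and the helper function are hypothetical.

```python
from collections import defaultdict


def effective_weights(tasks, edges, rule, base_weight=1):
    """Compute an effective priority weight per task for a given weight rule.

    Simplified model (assumption): every task has the same base weight, and
    the effective weight adds the weights of all transitive relatives in the
    direction selected by the rule.
    """
    if rule not in ("upstream", "downstream"):
        raise ValueError(f"unsupported rule: {rule}")
    children, parents = defaultdict(set), defaultdict(set)
    for up, down in edges:
        children[up].add(down)
        parents[down].add(up)
    graph = parents if rule == "upstream" else children

    def relatives(task):
        # Walk the graph to collect every transitive relative once.
        seen, stack = set(), list(graph[task])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(graph[current])
        return seen

    return {t: base_weight * (1 + len(relatives(t))) for t in tasks}


# Hypothetical three-step chain, repeated once per dataset in a real DAG.
tasks = ["extract", "refine", "publish"]
edges = [("extract", "refine"), ("refine", "publish")]

downstream = effective_weights(tasks, edges, "downstream")
upstream = effective_weights(tasks, edges, "upstream")
```

Under "downstream" the first task of every chain gets the highest weight, so a scheduler that picks the highest weight first starts all chains before finishing any of them. Under "upstream" the last task has the highest weight, so chains that are already started tend to be completed first.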

May 13 2024

Antoine_Quhen added a comment to T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation.

I've encountered an issue with our current production Airflow setup where the scheduler is not executing tasks in a depth-first manner as expected. Upon investigation, I found that the depth-first execution feature isn't supported in our current Airflow version (2.7.3). This functionality was introduced in Airflow 2.8.0, as per this pull request: Apache Airflow PR #27827 (I've tried it locally with success).

May 13 2024, 9:06 AM · Data-Engineering (Q1 2024 July 1st - September 30th), Patch-For-Review
Antoine_Quhen updated the task description for T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation.
May 13 2024, 8:51 AM · Data-Engineering (Q1 2024 July 1st - September 30th), Patch-For-Review

Apr 4 2024

Antoine_Quhen renamed T356762: [Refine refactoring] Refine jobs should be scheduled by Airflow: implementation from [NEEDS GROOMING][SPIKE] Extract refine schema management into a dedicated tool to Extract refine schema management into a dedicated tool.
Apr 4 2024, 3:57 PM · Data-Engineering (Q1 2024 July 1st - September 30th), Patch-For-Review

Mar 27 2024

Antoine_Quhen added a comment to T360967: [Developer Experience] Implement CI hql Linting.

https://docs.sqlfluff.com/en/stable/dialects.html#hive

Mar 27 2024, 11:08 AM · Data-Engineering (Q1 2024 July 1st - September 30th)

Mar 26 2024

Antoine_Quhen closed T356192: [Refine refactoring] Refactor and migrate navigationtiming to Airflow as Resolved.
Mar 26 2024, 4:18 PM · Data-Engineering (Sprint 9), Data Pipelines
Antoine_Quhen closed T356192: [Refine refactoring] Refactor and migrate navigationtiming to Airflow, a subtask of T307505: Refine jobs should be scheduled by Airflow, as Resolved.
Mar 26 2024, 4:17 PM · Data-Engineering, Data Pipelines
Antoine_Quhen moved T356360: [Refine Refactoring] Orchestrate Airflow execution of navigationtiming from config store from In progress to Done on the Data-Engineering (Sprint 9) board.

Five datasets are being refined as a POC on the prod cluster, and two on the test cluster.

Mar 26 2024, 4:17 PM · Data-Engineering (Sprint 9)
Antoine_Quhen added a comment to T357430: Airflow mapped tasks UI & metrics.

The Airflow PR has been merged and should be released in Airflow 2.9 in April.

Mar 26 2024, 4:15 PM · Data-Engineering (Q1 2024 July 1st - September 30th)

Feb 22 2024

Antoine_Quhen closed T311111: Improve speed of Gitlab CI as Resolved.

Done. The last version was done with Blubber:

Feb 22 2024, 3:14 PM · Data-Engineering, GitLab (CI & Job Runners), Performance Issue

Feb 21 2024

Antoine_Quhen added a comment to T357873: Mediawiki_wikitext_history job often has long gaps between stages.

Some research:

  • Each XML dumps snapshot may represent ~5.5TB (including ~1.8TB for wikidata and 1.4TB for enwiki)
  • The Airflow sensor may take ~19 days to turn green. It waits until the last dump has been processed (_IMPORTED flag). Most dumps are generated in a matter of days (~4 on average, maybe). Enwiki may take 7 days. And they all wait for the wikidata dump (~19 days).
  • When the sensor turns green, a heavy Spark job is launched to convert all the compressed XML to Parquet. ~5.5TB (compressed) takes ~4.5 days to process.
  • The perceived gaps are due to the non-parallelism of the DAG combined with very long jobs: one heavy job prevents the others from running because of the retries (thanks for the pointer, @JAllemandou). Other symptoms of what I think is the same problem: https://phabricator.wikimedia.org/T342911
Feb 21 2024, 4:53 PM · Data Products, Data-Engineering, Movement-Insights

Feb 13 2024

Antoine_Quhen added a subtask for T356360: [Refine Refactoring] Orchestrate Airflow execution of navigationtiming from config store: T357430: Airflow mapped tasks UI & metrics.
Feb 13 2024, 3:39 PM · Data-Engineering (Sprint 9)
Antoine_Quhen added a parent task for T357430: Airflow mapped tasks UI & metrics: T356360: [Refine Refactoring] Orchestrate Airflow execution of navigationtiming from config store.
Feb 13 2024, 3:39 PM · Data-Engineering (Q1 2024 July 1st - September 30th)
Antoine_Quhen created T357430: Airflow mapped tasks UI & metrics.
Feb 13 2024, 3:39 PM · Data-Engineering (Q1 2024 July 1st - September 30th)

Feb 8 2024

Antoine_Quhen moved T352672: [Iceberg Migration] Migrate session length tables to Iceberg from In Review to Done on the Data-Engineering (Sprint 8) board.
Feb 8 2024, 5:10 PM · Data-Engineering (Sprint 8)

Feb 6 2024

Antoine_Quhen closed T356364: [Maintenance] Migrate Gitlab CI to blubber as Resolved.
Feb 6 2024, 1:24 PM · Data-Engineering (Sprint 8)
Antoine_Quhen closed T351792: Unblock Dockerfile syntax to build images with Gitlab trusted runner as Resolved.
Feb 6 2024, 1:24 PM · Patch-For-Review, collaboration-services, Release-Engineering-Team, GitLab (CI & Job Runners), Data-Engineering
Antoine_Quhen closed T351792: Unblock Dockerfile syntax to build images with Gitlab trusted runner, a subtask of T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs, as Resolved.
Feb 6 2024, 1:24 PM · Data-Engineering (Sprint 6), Patch-For-Review

Feb 2 2024

Antoine_Quhen moved T356362: [Refine Refactoring] [Spike] Define a concept and provide a PoC for dynamic DAG execution in Airflow from Next Up to In progress on the Data-Engineering (Sprint 8) board.
Feb 2 2024, 10:29 AM · Data-Engineering (Q1 2024 July 1st - September 30th)
Antoine_Quhen changed the status of T351792: Unblock Dockerfile syntax to build images with Gitlab trusted runner from Open to In Progress.
Feb 2 2024, 10:08 AM · Patch-For-Review, collaboration-services, Release-Engineering-Team, GitLab (CI & Job Runners), Data-Engineering
Antoine_Quhen changed the status of T351792: Unblock Dockerfile syntax to build images with Gitlab trusted runner, a subtask of T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs, from Open to In Progress.
Feb 2 2024, 10:08 AM · Data-Engineering (Sprint 6), Patch-For-Review
Antoine_Quhen added a comment to T356364: [Maintenance] Migrate Gitlab CI to blubber.

Linked with: https://phabricator.wikimedia.org/T351792

Feb 2 2024, 10:08 AM · Data-Engineering (Sprint 8)

Jan 31 2024

Antoine_Quhen renamed T356192: [Refine refactoring] Refactor and migrate navigationtiming to Airflow from Refactor and migrate navigationtiming to Airflow to [Refine refactoring] Refactor and migrate navigationtiming to Airflow.
Jan 31 2024, 3:52 PM · Data-Engineering (Sprint 9), Data Pipelines
Antoine_Quhen added a comment to T356030: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15.

wmf.wikidata_item_page_link/snapshot=2024-01-15 looked fine from a row count perspective but 2024-01-22 snapshot was unexpectedly available and had zero rows:

select count(*) from wmf.wikidata_item_page_link where snapshot='2024-01-22';
...
+------+
| _c0  |
+------+
| 0    |
+------+
Jan 31 2024, 12:59 PM · Discovery-Search (Current work), Data-Engineering (Sprint 8), Image-Suggestions
Antoine_Quhen moved T356030: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15 from Next Up to Done on the Data-Engineering (Sprint 8) board.
Jan 31 2024, 12:56 PM · Discovery-Search (Current work), Data-Engineering (Sprint 8), Image-Suggestions
Antoine_Quhen changed the status of T352672: [Iceberg Migration] Migrate session length tables to Iceberg from Open to In Progress.
Jan 31 2024, 12:54 PM · Data-Engineering (Sprint 8)
Antoine_Quhen changed the status of T352672: [Iceberg Migration] Migrate session length tables to Iceberg, a subtask of T333013: [Iceberg Migration] Apache Iceberg Migration, from Open to In Progress.
Jan 31 2024, 12:53 PM · Data-Engineering, Epic

Jan 30 2024

Antoine_Quhen updated Other Assignee for T356030: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15, added: Antoine_Quhen.
Jan 30 2024, 11:12 AM · Discovery-Search (Current work), Data-Engineering (Sprint 8), Image-Suggestions
Antoine_Quhen updated the task description for T356030: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15.
Jan 30 2024, 11:12 AM · Discovery-Search (Current work), Data-Engineering (Sprint 8), Image-Suggestions
Antoine_Quhen added a comment to T356030: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15.

I've added the missing partitions in the source of wikidata_item_page_link and its missing snapshot is now generated.

Jan 30 2024, 11:11 AM · Discovery-Search (Current work), Data-Engineering (Sprint 8), Image-Suggestions
Antoine_Quhen edited projects for T356030: Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-01-15, added: Data-Engineering (Sprint 7); removed Data-Engineering.
Jan 30 2024, 9:49 AM · Discovery-Search (Current work), Data-Engineering (Sprint 8), Image-Suggestions

Jan 29 2024

Antoine_Quhen moved T352672: [Iceberg Migration] Migrate session length tables to Iceberg from Next Up to In progress on the Data-Engineering (Sprint 7) board.
Jan 29 2024, 2:46 PM · Data-Engineering (Sprint 8)
Antoine_Quhen claimed T352672: [Iceberg Migration] Migrate session length tables to Iceberg.
Jan 29 2024, 2:46 PM · Data-Engineering (Sprint 8)
Antoine_Quhen moved T343232: Configure Airflow to send metrics to Prometheus from In Review to Done on the Data-Engineering (Sprint 7) board.
Jan 29 2024, 2:45 PM · Data-Engineering (Sprint 7), Data-Platform-SRE (2024.01.01 - 2024.01.21), Patch-For-Review, Observability-Metrics
Antoine_Quhen moved T354695: [Iceberg Migration] Define sensor concept and implementation plan from Radar (External Teams) to In Review on the Data-Engineering (Sprint 7) board.

https://docs.google.com/document/d/1upAje5lMawu4X6seRxI8Lx7YN-oHEzcEcm2fO6E5OH0/edit

Jan 29 2024, 1:55 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Data-Engineering (Sprint 7)
Antoine_Quhen added a comment to T338065: [Iceberg Migration] Implement mechanism for automatic Iceberg table maintenance.

I would like to add rewrite_manifests to the list of maintenance actions:

Jan 29 2024, 10:25 AM · Dumps 2.0 (Kanban Board), Data-Engineering

Jan 25 2024

Antoine_Quhen closed T347879: [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables as Resolved.
Jan 25 2024, 2:58 PM · Data-Engineering (Sprint 7)
Antoine_Quhen closed T347879: [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables, a subtask of T333013: [Iceberg Migration] Apache Iceberg Migration, as Resolved.
Jan 25 2024, 2:58 PM · Data-Engineering, Epic
Antoine_Quhen closed T355391: Fix refinery-source.refinery-core.Utilities::getValueForKey as Resolved.
Jan 25 2024, 2:57 PM · Data-Engineering (Sprint 7)
Antoine_Quhen moved T355391: Fix refinery-source.refinery-core.Utilities::getValueForKey from In Review to Done on the Data-Engineering (Sprint 7) board.
Jan 25 2024, 2:57 PM · Data-Engineering (Sprint 7)

Jan 22 2024

Antoine_Quhen updated subscribers of T354695: [Iceberg Migration] Define sensor concept and implementation plan.

@BTullis , @brouberol , @Stevemunene I would like your feedback on this subject:

Jan 22 2024, 10:02 AM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Data-Engineering (Sprint 7)

Jan 19 2024

Antoine_Quhen claimed T355391: Fix refinery-source.refinery-core.Utilities::getValueForKey.
Jan 19 2024, 2:53 PM · Data-Engineering (Sprint 7)

Jan 17 2024

Antoine_Quhen moved T354695: [Iceberg Migration] Define sensor concept and implementation plan from Next Up to Radar (External Teams) on the Data-Engineering (Sprint 7) board.
Jan 17 2024, 11:24 AM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Data-Engineering (Sprint 7)

Jan 16 2024

Antoine_Quhen moved T347879: [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables from In progress to In Review on the Data-Engineering (Sprint 7) board.
Jan 16 2024, 2:20 PM · Data-Engineering (Sprint 7)

Jan 15 2024

Antoine_Quhen added a comment to T343232: Configure Airflow to send metrics to Prometheus.

Following the Grafana dashboard review, I've made some changes to it:

  • I distributed the graphs into 4 sections: Failures, Durations, Counts, Scheduling
  • I added missing parameterization of variables (e.g., the list of operators changes when the instance is selected)
  • I updated the TTLs in statsd-exporter so that Prometheus reflects the rate of the metrics sent by Airflow (e.g., when Airflow emits a metric saying that 1 task has failed, it makes 1 call, and we would like to see that single point in Prometheus)
  • I isolated task failures into their own graph
Jan 15 2024, 1:37 PM · Data-Engineering (Sprint 7), Data-Platform-SRE (2024.01.01 - 2024.01.21), Patch-For-Review, Observability-Metrics

Jan 9 2024

Antoine_Quhen created T354703: analytics/refinery scap deploy on test cluster fails with permission error.
Jan 9 2024, 9:17 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Data-Engineering
Antoine_Quhen added a comment to T351792: Unblock Dockerfile syntax to build images with Gitlab trusted runner.

1/ Splitting the CI

Jan 9 2024, 4:51 PM · Patch-For-Review, collaboration-services, Release-Engineering-Team, GitLab (CI & Job Runners), Data-Engineering

Dec 21 2023

Antoine_Quhen added a comment to T347879: [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables.

In this ticket, I needed to find a way to detect when an Iceberg table has some data in it. This would replace the Hive partition sensor when migrating a table to Iceberg.
We chose to launch a Spark application running an SQL count. It's now implemented here:

A drawback of this solution is that it generates a FAILED Spark application each time the sensor does not find any data in the interval. When monitoring our Spark applications, we want to avoid artificially growing the FAILED counts.

Dec 21 2023, 3:39 PM · Data-Engineering (Sprint 7)
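
The SQL-count check described in the comment above can be sketched roughly as follows. This is only an illustration of the idea, not the implementation referenced in the ticket: the table name, the time column, and the run_count callable (which stands in for whatever submits the query to Spark) are all hypothetical.

```python
from datetime import datetime
from typing import Callable


def build_count_query(table: str, time_column: str,
                      start: datetime, end: datetime) -> str:
    """Build a count query restricted to the half-open interval [start, end)."""
    return (
        f"SELECT COUNT(1) FROM {table} "
        f"WHERE {time_column} >= TIMESTAMP '{start:%Y-%m-%d %H:%M:%S}' "
        f"AND {time_column} < TIMESTAMP '{end:%Y-%m-%d %H:%M:%S}'"
    )


def interval_has_data(run_count: Callable[[str], int], table: str,
                      time_column: str, start: datetime, end: datetime) -> bool:
    """Return True when at least one row exists in the interval.

    An empty interval is reported as False rather than raised as an error,
    so that "no data yet" stays distinguishable from a real failure.
    """
    return run_count(build_count_query(table, time_column, start, end)) > 0


# Stand-in for the Spark call (assumption): always reports an empty interval.
def fake_empty_runner(query: str) -> int:
    return 0


query = build_count_query("example_db.example_table", "day",
                          datetime(2024, 1, 1), datetime(2024, 1, 2))
ready = interval_has_data(fake_empty_runner, "example_db.example_table", "day",
                          datetime(2024, 1, 1), datetime(2024, 1, 2))
```

Returning a boolean for the empty case is one way to keep "data not there yet" separate from genuine errors. Whether that avoids the FAILED application status depends on how the Spark job itself exits, which this sketch does not model.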

Dec 20 2023

Antoine_Quhen moved T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs from In Review to Done on the Data-Engineering (Sprint 6) board.
Dec 20 2023, 3:19 PM · Data-Engineering (Sprint 6), Patch-For-Review
Antoine_Quhen added a comment to T353806: Airflow scheduler monitoring is broken since the most recent deploy.

https://github.com/wikimedia/operations-puppet/blob/f7c3eb56a9417571792b7636367f3c13e850bc83/modules/profile/manifests/airflow.pp#L199

Dec 20 2023, 2:37 PM · Data-Platform-SRE (2023.12.01 - 2023.12.31), Data-Engineering
Antoine_Quhen added a comment to T353806: Airflow scheduler monitoring is broken since the most recent deploy.

I think it's because the airflow-analytics jobs check should be run like the other commands: with a custom PYTHONPATH=/path/to/root/of/airflow-dags/ as an env variable.

Dec 20 2023, 2:35 PM · Data-Platform-SRE (2023.12.01 - 2023.12.31), Data-Engineering

Dec 18 2023

tchin awarded T336739: Post Oozie -> Airflow migration refactorings a Barnstar token.
Dec 18 2023, 3:12 PM · Patch-For-Review, Data-Engineering, Epic, Data Pipelines
Antoine_Quhen moved T347879: [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables from In progress to In Review on the Data-Engineering (Sprint 6) board.

I have the first version of the code in review.

Dec 18 2023, 2:46 PM · Data-Engineering (Sprint 7)

Dec 5 2023

Antoine_Quhen claimed T347879: [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables.
Dec 5 2023, 3:31 PM · Data-Engineering (Sprint 7)
Antoine_Quhen moved T347879: [Airflow Migration] Migrate Airflow Druid Jobs to Unique Devices Iceberg tables from Next Up to In progress on the Data-Engineering (Sprint 6) board.
Dec 5 2023, 3:31 PM · Data-Engineering (Sprint 7)

Dec 4 2023

Antoine_Quhen added a comment to T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs.

In this puppet patch, we are adding configuration to send more Airflow metrics to Prometheus, and to customize them.

Dec 4 2023, 8:55 AM · Data-Engineering (Sprint 6), Patch-For-Review

Nov 30 2023

Antoine_Quhen moved T343232: Configure Airflow to send metrics to Prometheus from In progress to In Review on the Data-Engineering (Sprint 5) board.
Nov 30 2023, 5:04 PM · Data-Engineering (Sprint 7), Data-Platform-SRE (2024.01.01 - 2024.01.21), Patch-For-Review, Observability-Metrics
Antoine_Quhen moved T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs from In progress to In Review on the Data-Engineering (Sprint 5) board.
Nov 30 2023, 5:04 PM · Data-Engineering (Sprint 6), Patch-For-Review

Nov 22 2023

Antoine_Quhen added a comment to T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs.

The workaround for our GitLab CI problem is here: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/537

Nov 22 2023, 3:04 PM · Data-Engineering (Sprint 6), Patch-For-Review
Antoine_Quhen added a subtask for T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs: T351792: Unblock Dockerfile syntax to build images with Gitlab trusted runner.
Nov 22 2023, 9:37 AM · Data-Engineering (Sprint 6), Patch-For-Review
Antoine_Quhen added a parent task for T351792: Unblock Dockerfile syntax to build images with Gitlab trusted runner: T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs.
Nov 22 2023, 9:37 AM · Patch-For-Review, collaboration-services, Release-Engineering-Team, GitLab (CI & Job Runners), Data-Engineering
Antoine_Quhen added a comment to T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs.

The puppet configuration is now merged, and statsd_exporter is running on an-test-client1002. Analytics Prometheus is scraping from it, as it should.

Nov 22 2023, 9:35 AM · Data-Engineering (Sprint 6), Patch-For-Review
Antoine_Quhen created T351792: Unblock Dockerfile syntax to build images with Gitlab trusted runner.
Nov 22 2023, 9:16 AM · Patch-For-Review, collaboration-services, Release-Engineering-Team, GitLab (CI & Job Runners), Data-Engineering

Nov 16 2023

Antoine_Quhen moved T349532: [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs from Blocked/Paused to In progress on the Data-Engineering (Sprint 5) board.
Nov 16 2023, 10:15 AM · Data-Engineering (Sprint 6), Patch-For-Review