Make sure our delta algorithm doesn't depend on successful past runs
Closed, Resolved (Public)

Description

While discussing T330686, we figured that our delta algorithm assumes that the previous run was successful.

In this task we want to change this so that the delta is computed against whichever snapshot is the latest one available in Hive. This will make the job robust against intermittent failures.
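As a rough illustration of the idea, here is a minimal sketch of picking the previous snapshot from whatever is actually available, rather than assuming the prior scheduled run succeeded. The function name and the hard-coded partition listing are hypothetical; the real change lives in the merge requests listed under Details, and it may work differently.

```python
from datetime import date
from typing import Iterable, Optional


def latest_previous_snapshot(available: Iterable[str], current: str) -> Optional[str]:
    """Return the most recent snapshot strictly older than `current`, or None.

    Snapshots are ISO dates (YYYY-MM-DD), as in snapshot=2023-02-20.
    """
    current_date = date.fromisoformat(current)
    older = [s for s in available if date.fromisoformat(s) < current_date]
    return max(older, key=date.fromisoformat) if older else None


# Hypothetical partition listing with a gap where one weekly run is missing.
snapshots = ["2023-01-30", "2023-02-06", "2023-02-20"]
previous = latest_previous_snapshot(snapshots, "2023-02-20")
print(previous)
```

With a listing like this one, the delta for 2023-02-20 would be computed against 2023-02-06 even though no snapshot exists for the week in between.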

Details

Title | Reference | Author | Source Branch | Dest Branch
Make previous_weekly be the last successful DAG run of image_suggestions | repos/data-engineering/airflow-dags!287 | xcollazo | redo-robust-deltas | main
Hotfix: Make previous_weekly a varprop on image_suggestions DAG. | repos/data-engineering/airflow-dags!262 | xcollazo | hotfix-image-suggestions | main
Calculate previous snapshots for deltas automatically. | repos/structured-data/image-suggestions!11 | xcollazo | T330688-robust-deltas | main

Event Timeline

One issue we have is that the pipeline ran for snapshot=2023-02-20 while we were working on this task.

This means that the data in image_suggestions_search_index_delta is now corrupted.

After discussions with @Cparle and @mfossati, we decided to do the following to fix this:

  • merge changes related to this task (done)
  • drop the 2023-02-20 snapshot data from:
      • image_suggestions_search_index_full
      • image_suggestions_search_index_delta
      • image_suggestions_lead_image_data
      • image_suggestions_wikidata_data
  • re-run commonswiki_file.py with snapshot=2023-02-20 and previous_snapshot=2023-02-06 (previous_snapshot will now be automatically calculated)
  • re-run search_indices.py with snapshot=2023-02-20 and previous_snapshot=2023-02-06 (previous_snapshot will now be automatically calculated)
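
For reference, the drop step in the plan above could be sketched as follows. This only builds the Hive statements and does not execute anything. It assumes the four tables are partitioned by a column named snapshot and live in the analytics_platform_eng schema named in the SAL entry on this task; the helper name is hypothetical, and for external tables the underlying files may need separate cleanup.

```python
from datetime import date
from typing import List

# Tables listed in the cleanup plan.
TABLES = [
    "image_suggestions_search_index_full",
    "image_suggestions_search_index_delta",
    "image_suggestions_lead_image_data",
    "image_suggestions_wikidata_data",
]


def drop_snapshot_statements(schema: str, tables: List[str], snapshot: str) -> List[str]:
    """Build one ALTER TABLE ... DROP PARTITION statement per table."""
    date.fromisoformat(snapshot)  # raises ValueError on a malformed snapshot
    return [
        f"ALTER TABLE {schema}.{table} "
        f"DROP IF EXISTS PARTITION (snapshot='{snapshot}');"
        for table in tables
    ]


statements = drop_snapshot_statements("analytics_platform_eng", TABLES, "2023-02-20")
for statement in statements:
    print(statement)
```

Printing the statements first, rather than running them directly, leaves room to review exactly which partitions will be removed before the re-runs.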

Mentioned in SAL (#wikimedia-analytics) [2023-03-03T16:48:12Z] <xcollazo> Deleted snapshot=2023-02-20 for tables image_suggestions_search_index_full, image_suggestions_search_index_delta, image_suggestions_lead_image_data and image_suggestions_wikidata_data from the analytics_platform_eng schema. This data will be regenerated. See https://phabricator.wikimedia.org/T330688.

xcollazo updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/287

Make previous_weekly be the last successful DAG run of image_suggestions

xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/287

Make previous_weekly be the last successful DAG run of image_suggestions

Deployment is now blocked by the Airflow 2.5.1 upgrade (See T332031). We could just branch out for this deployment, but since the upgrade is slated for Thu Mar 16, it doesn't make much sense to pay the branching penalty for just one deployment.

This was deployed as part of T332031. Closing.