Because of unmet dependencies while working on T328641, we could not modify the DAGs so that they only start once all their dependencies are met.
In this task we should:
- Resolve any dependencies still needed.
- Add sensors accordingly (see the sketch after this list).
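For reference, the gating pattern looks roughly like this. A minimal sketch only: the DAG id, schedule, and HDFS path are made up, and the production DAGs use the repo's own URLSensor rather than the generic WebHdfsSensor shown here.

```python
# Sketch (hypothetical names/paths): gate the DAG on the upstream
# dataset's _SUCCESS flag before any compute task runs.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.providers.apache.hdfs.sensors.web_hdfs import WebHdfsSensor

with DAG(
    dag_id="image_suggestions_example",
    start_date=datetime(2023, 2, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    # Poke every 10 minutes, for up to a day, until the upstream job
    # has written its success flag for this data interval.
    wait_for_upstream = WebHdfsSensor(
        task_id="wait_for_upstream_dataset",
        filepath=(
            "/wmf/data/hypothetical/upstream_dataset/"
            "snapshot={{ data_interval_start | ds }}/_SUCCESS"
        ),
        poke_interval=60 * 10,
        timeout=60 * 60 * 24,
    )

    # Placeholder for the real pipeline tasks.
    run_pipeline = EmptyOperator(task_id="run_pipeline")

    wait_for_upstream >> run_pipeline
```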
Title | Reference | Author | Source Branch | Dest Branch
---|---|---|---|---
Make section_image_suggestions.py work in prod | repos/structured-data/image-suggestions!24 | xcollazo | hotfix-remove-dev-flags | main
Update DAGs to generate section-level image suggestions | repos/data-engineering/airflow-dags!327 | cparle | T330667 | main
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | None | | T311814 [EPIC] Section-level image suggestions data pipeline
Resolved | | Cparle | T330667 [M] Make sure DAGs are run in the correct order
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/tree/T330667
Haven't made an MR yet because I want to wait for the old DAG to run with the new image-suggestions code. Will make one next week.
cparle opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/327
Update DAGs to generate section-level image suggestions
xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/327
Update DAGs to generate section-level image suggestions
Fixed some file path issues and bumped image_suggestions via the same merge request above.
Tested on an airflow test instance to make sure the new URLSensors were good.
Deployed to prod.
Re-ran the section_alignment_image_suggestions DAG on prod for data_interval_start=2023-02-01 to regenerate the data in the location where we now expect it.
image_suggestions DAG is now waiting until weekly sensors trigger for data_interval_start=2023-03-20.
We are good.
The run failed with:
/venv/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.AnalysisException: The column number of the existing table analytics_platform_eng.image_suggestions_suggestions
(struct<page_id:bigint,id:string,image:string,origin_wiki:string,confidence:int,found_on:array<string>,kind:array<string>,page_rev:bigint,snapshot:string,wiki:string>)
doesn't match the data schema
(struct<page_id:bigint,id:string,image:string,origin_wiki:string,confidence:int,found_on:array<string>,kind:array<string>,page_rev:bigint,section_heading:string,snapshot:string,wiki:string>)
This confirms our suspicion that saveAsTable() with mode=append will not automatically ALTER the table. Thus we need to ALTER by hand:
$ hostname -f
an-airflow1004.eqiad.wmnet

$ sudo -u analytics-platform-eng spark3-sql

spark-sql (default)> use analytics_platform_eng;
Response code
Time taken: 2.896 seconds

spark-sql (default)> ALTER TABLE `image_suggestions_suggestions` ADD COLUMNS (
                   >   `section_heading` string
                   > );
Response code
Time taken: 0.795 seconds

spark-sql (default)> describe formatted image_suggestions_suggestions;
col_name          data_type       comment
page_id           bigint          Uniquely identifying primary key along with wiki. This value is preserved across edits, renames, and, as of MediaWiki 1.27, deletions, via an analogous field in the archive table (introduced in MediaWiki 1.11). For example, for this page, page_id = 10501.
id                string          NULL
image             string          The sanitized page title, without the namespace, with a maximum of 255 characters (binary). It is stored as text, with spaces replaced by underscores. The real title shown in articles is just this title with underscores (_) converted to spaces ( ). For exa
origin_wiki       string          NULL
confidence        int             NULL
found_on          array<string>   NULL
kind              array<string>   NULL
page_rev          bigint          NULL
section_heading   string          NULL
snapshot          string          NULL
wiki              string          The wiki_db project

# Partition Information
# col_name        data_type       comment
snapshot          string          NULL
wiki              string          The wiki_db project

# Detailed Table Information
Database            analytics_platform_eng
Table               image_suggestions_suggestions
Owner               analytics-platform-eng
Created Time        Fri May 20 13:21:14 UTC 2022
Last Access         UNKNOWN
Created By          Spark 2.4.4
Type                MANAGED
Provider            parquet
Location            hdfs://analytics-hadoop/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_suggestions
Serde Library       org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat         org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat        org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Partition Provider  Catalog
Time taken: 0.17 seconds, Fetched 30 row(s)
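For context, here is a minimal sketch of the failure mode and the fix (the table name here is hypothetical): saveAsTable() in append mode validates the DataFrame schema against the catalog schema but never evolves it, so the ALTER has to happen first.

```python
# Sketch (hypothetical table name): appending a DataFrame that carries an
# extra column to an existing Hive table fails until the table schema is
# evolved by hand.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.createDataFrame(
    [(1, "Some heading")],
    "page_id bigint, section_heading string",
)

# This would raise AnalysisException ("column number ... doesn't match"):
# df.write.mode("append").saveAsTable("analytics_platform_eng.some_table")

# Evolve the table schema first, then the append succeeds.
spark.sql("""
    ALTER TABLE analytics_platform_eng.some_table
    ADD COLUMNS (`section_heading` string)
""")
df.write.mode("append").saveAsTable("analytics_platform_eng.some_table")
```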
The article_level_suggestions Airflow node is now successful. Now we have a different issue:
Traceback (most recent call last):
  File "/var/lib/hadoop/data/d/yarn/local/usercache/analytics-platform-eng/appcache/application_1678266962370_111183/container_e75_1678266962370_111183_01_000001/venv/lib/python3.10/site-packages/image_suggestions/section_image_suggestions.py", line 611, in <module>
    from wmfdata.spark import create_custom_session  # type: ignore
ModuleNotFoundError: No module named 'wmfdata'
This is because we forgot to comment out some dev-only code. Have a fix at https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/24. Testing it out.
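A minimal sketch of the kind of guard involved (the environment flag is hypothetical; see the MR for the actual change): the wmfdata import is a dev-only convenience, so it should never execute in the production conda env, which does not ship wmfdata.

```python
# Sketch (hypothetical flag): only import the dev-only wmfdata helper when
# explicitly running locally; in prod, fall back to the plain session.
import os

if os.environ.get("IMAGE_SUGGESTIONS_DEV") == "1":
    from wmfdata.spark import create_custom_session  # dev-only dependency
    spark = create_custom_session(master="yarn")
else:
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()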
Next issue:
File "/var/lib/hadoop/data/d/yarn/local/usercache/analytics-platform-eng/appcache/application_1678266962370_111381/container_e75_1678266962370_111381_01_000001/venv/lib/python3.10/site-packages/image_suggestions/section_image_suggestions.py", line 407, in prune_suggestions suggestions = prune_non_illustratable_sections(spark, suggestions) File "/var/lib/hadoop/data/d/yarn/local/usercache/analytics-platform-eng/appcache/application_1678266962370_111381/container_e75_1678266962370_111381_01_000001/venv/lib/python3.10/site-packages/image_suggestions/section_image_suggestions.py", line 431, in prune_non_illustratable_sections sections_to_exclude_dict = json.load(open( FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/hadoop/data/j/yarn/local/usercache/analytics-platform-eng/appcache/application_1678266962370_111381/filecache/10/image-suggestions-0.9.0.dev0-hotfix-remove-dev-flags.conda.tgz/lib/python3.10/site-packages/image_suggestions/../data/section_titles_denylist.json'
We need a production friendly way of loading files from the artifact. Added fix to https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/24. Testing it out.
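One production-friendly option (a sketch only, not necessarily what the MR does) is to resolve data files relative to the installed package via importlib.resources instead of a hard-coded "../data/..." path:

```python
# Sketch: load a data file shipped inside the artifact relative to the
# installed package, rather than via a path computed with "../data/...".
import json
from importlib.resources import files

# Assumes the denylist is packaged under image_suggestions/data/ as
# package data; the exact location in the real artifact may differ.
denylist_text = (
    files("image_suggestions")
    .joinpath("data/section_titles_denylist.json")
    .read_text()
)
sections_to_exclude_dict = json.loads(denylist_text)
```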
Success!
https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/24 up for review. A dev version of that MR is currently set on prod via conda_env varprop.
For some reason all Cassandra uploads took longer than usual; in particular the hive_to_cassandra_suggestions node, which typically takes 30-40 minutes, took 4 hours 45 minutes. Perhaps it was just bad timing with other jobs? We should monitor subsequent runs.
xcollazo updated https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/24
Make section_image_suggestions.py work in prod
xcollazo merged https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/24
Make section_image_suggestions.py work in prod
Pushed https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/24 to production via scap deploy.
We are finally done here! 🎉