
[M] Make sure DAGs are run in the correct order
Closed, ResolvedPublic

Description

While working on T328641, unmet dependencies prevented us from modifying the DAGs so that they only start once all of their dependencies are met.

In this task we should:

  • Resolve any dependencies still needed.
  • Add sensors accordingly.
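A minimal sketch of what the sensor step buys us (names are hypothetical; the real DAGs use the repo's URLSensor): the downstream task polls a check until the upstream data exists, instead of starting immediately.

```python
import time

def wait_for_dependency(check, poke_interval=1.0, timeout=60.0):
    """Poll check() until it returns True, like an Airflow sensor poke loop.

    The task downstream of the sensor only starts once check()
    succeeds, i.e. once the upstream data actually exists.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(poke_interval)
    raise TimeoutError("upstream dependency not met within timeout")
```

In the real DAGs the check is an HDFS/URL existence probe against the upstream dataset's success flag.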

Details

Title	Reference	Author	Source Branch	Dest Branch
Make section_image_suggestions.py work in prod	repos/structured-data/image-suggestions!24	xcollazo	hotfix-remove-dev-flags	main
Update DAGs to generate section-level image suggestions	repos/data-engineering/airflow-dags!327	cparle	T330667	main

Event Timeline

CBogen renamed this task from Make sure DAGs are run in the correct order to [M] Make sure DAGs are run in the correct order. Mar 8 2023, 5:21 PM
CBogen updated the task description. (Show Details)

https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/tree/T330667

Haven't made an MR yet because I want to wait for the old DAG to run with the new image-suggestions code. Will make one next week.

Fixed some file path issues and bumped image_suggestions via the same merge request above.

Tested on an airflow test instance to make sure the new URLSensors were good.

Deployed to prod.

Re-ran section_alignment_image_suggestions DAG on prod for data_interval_start=2023-02-01 to regenerate data in the location we now expect it to be in.

image_suggestions DAG is now waiting until weekly sensors trigger for data_interval_start=2023-03-20.

We are good.

The run failed with:

/venv/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.AnalysisException: The column number of the existing table analytics_platform_eng.image_suggestions_suggestions(struct<page_id:bigint,id:string,image:string,origin_wiki:string,confidence:int,found_on:array<string>,kind:array<string>,page_rev:bigint,snapshot:string,wiki:string>) doesn't match the data schema(struct<page_id:bigint,id:string,image:string,origin_wiki:string,confidence:int,found_on:array<string>,kind:array<string>,page_rev:bigint,section_heading:string,snapshot:string,wiki:string>)
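Diffing the two column lists from the error message makes the mismatch obvious: the table on disk lacks the new section_heading column that the incoming data carries.

```python
# Columns of the existing Hive table, per the AnalysisException above.
existing = ["page_id", "id", "image", "origin_wiki", "confidence",
            "found_on", "kind", "page_rev", "snapshot", "wiki"]

# Columns of the DataFrame the DAG is trying to append.
incoming = ["page_id", "id", "image", "origin_wiki", "confidence",
            "found_on", "kind", "page_rev", "section_heading",
            "snapshot", "wiki"]

missing_from_table = [c for c in incoming if c not in existing]
print(missing_from_table)  # ['section_heading']
```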

This confirms our suspicion that saveAsTable() with mode=append will not automatically ALTER the table schema. Thus we need to ALTER it by hand:

 hostname -f
an-airflow1004.eqiad.wmnet

sudo -u analytics-platform-eng spark3-sql

spark-sql (default)> use analytics_platform_eng;
Response code
Time taken: 2.896 seconds

spark-sql (default)> ALTER TABLE `image_suggestions_suggestions` ADD COLUMNS (
                   >     `section_heading` string
                   > );
Response code
Time taken: 0.795 seconds

spark-sql (default)> describe formatted image_suggestions_suggestions;
col_name	data_type	comment
page_id	bigint	Uniquely identifying primary key along with wiki. This value is preserved across edits, renames, and, as of MediaWiki 1.27, deletions, via an analogous field in the archive table (introduced in MediaWiki 1.11). For example, for this page, page_id = 10501. 
id	string	NULL
image	string	The sanitized page title, without the namespace, with a maximum of 255 characters (binary). It is stored as text, with spaces replaced by underscores. The real title shown in articles is just this title with underscores (_) converted to spaces ( ). For exa
origin_wiki	string	NULL
confidence	int	NULL
found_on	array<string>	NULL
kind	array<string>	NULL
page_rev	bigint	NULL
section_heading	string	NULL
snapshot	string	NULL
wiki	string	The wiki_db project
# Partition Information		
# col_name	data_type	comment
snapshot	string	NULL
wiki	string	The wiki_db project
		
# Detailed Table Information		
Database	analytics_platform_eng	
Table	image_suggestions_suggestions	
Owner	analytics-platform-eng	
Created Time	Fri May 20 13:21:14 UTC 2022	
Last Access	UNKNOWN	
Created By	Spark 2.4.4	
Type	MANAGED	
Provider	parquet	
Location	hdfs://analytics-hadoop/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_suggestions	
Serde Library	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe	
InputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat	
OutputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat	
Partition Provider	Catalog	
Time taken: 0.17 seconds, Fetched 30 row(s)

(Reopened. Will keep open until a successful run.)

The article_level_suggestions Airflow node is now successful. Now we have a different issue:

Traceback (most recent call last):
  File "/var/lib/hadoop/data/d/yarn/local/usercache/analytics-platform-eng/appcache/application_1678266962370_111183/container_e75_1678266962370_111183_01_000001/venv/lib/python3.10/site-packages/image_suggestions/section_image_suggestions.py", line 611, in <module>
    from wmfdata.spark import create_custom_session  # type: ignore
ModuleNotFoundError: No module named 'wmfdata'

This is because we forgot to comment out some dev code. Have a fix at https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/24. Testing it out.
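Rather than relying on remembering to comment dev code in and out, a fallback import would degrade gracefully: wmfdata exists on the stat machines but not inside the conda artifact shipped to YARN. A minimal sketch (the actual fix in MR !24 may differ):

```python
import importlib

def import_first_available(*module_names):
    """Return the first module that imports successfully.

    Lets a dev-only dependency (e.g. wmfdata on the stat machines)
    fall back gracefully in prod, where the conda artifact shipped
    to YARN does not include it, instead of crashing the job with
    ModuleNotFoundError.
    """
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ModuleNotFoundError:
            continue
    raise ModuleNotFoundError(f"none of {module_names} could be imported")
```

Hypothetical usage: `import_first_available("wmfdata.spark", "pyspark.sql")` would pick wmfdata in dev and fall through to plain pyspark in prod.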

Next issue:

  File "/var/lib/hadoop/data/d/yarn/local/usercache/analytics-platform-eng/appcache/application_1678266962370_111381/container_e75_1678266962370_111381_01_000001/venv/lib/python3.10/site-packages/image_suggestions/section_image_suggestions.py", line 407, in prune_suggestions
    suggestions = prune_non_illustratable_sections(spark, suggestions)
  File "/var/lib/hadoop/data/d/yarn/local/usercache/analytics-platform-eng/appcache/application_1678266962370_111381/container_e75_1678266962370_111381_01_000001/venv/lib/python3.10/site-packages/image_suggestions/section_image_suggestions.py", line 431, in prune_non_illustratable_sections
    sections_to_exclude_dict = json.load(open(
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/hadoop/data/j/yarn/local/usercache/analytics-platform-eng/appcache/application_1678266962370_111381/filecache/10/image-suggestions-0.9.0.dev0-hotfix-remove-dev-flags.conda.tgz/lib/python3.10/site-packages/image_suggestions/../data/section_titles_denylist.json'

We need a production-friendly way of loading data files from the artifact. Added fix to https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/24. Testing it out.
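One standard approach (a sketch; the actual fix in MR !24 may differ) is importlib.resources, which resolves a packaged data file wherever the package is installed, instead of building a path from __file__ that breaks inside the unpacked conda artifact on YARN:

```python
from importlib import resources

def read_packaged_resource(package: str, resource: str) -> str:
    """Read a data file shipped inside an installed package.

    Unlike a __file__-relative path, this works regardless of how the
    package is installed (wheel, editable install, conda artifact
    unpacked into a YARN container filecache).
    """
    return resources.files(package).joinpath(resource).read_text()

# Hypothetical usage, assuming the denylist ships inside the package:
# denylist = json.loads(read_packaged_resource(
#     "image_suggestions", "section_titles_denylist.json"))
```

This requires the data file to be declared as package data so it actually lands inside the artifact.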

Success!

https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/24 up for review. A dev version of that MR is currently set on prod via conda_env varprop.

For some reason all Cassandra uploads took longer than usual; in particular the hive_to_cassandra_suggestions node, which typically takes 30-40 minutes, took 4 hours 45 minutes. Perhaps it was just contention with other jobs? We should monitor subsequent runs.

CC @Cparle @mfossati