Because of unmet dependencies while working on T328641, we could not modify the DAGs so that they only start once all their dependencies are met.
In this task we should:
- Resolve any dependencies still needed.
- Add sensors accordingly (see the sketch after this list).
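For reference, the gating pattern looks roughly like this. A minimal sketch only: the DAG id, schedule, and HDFS path are made up, and the production DAGs use the repo's own URLSensor rather than the generic WebHdfsSensor shown here.

```python
# Sketch (hypothetical names/paths): gate the DAG on the upstream
# dataset's _SUCCESS flag before any compute task runs.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.providers.apache.hdfs.sensors.web_hdfs import WebHdfsSensor

with DAG(
    dag_id="image_suggestions_example",
    start_date=datetime(2023, 2, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    # Poke every 10 minutes, for up to a day, until the upstream job
    # has written its success flag for this data interval.
    wait_for_upstream = WebHdfsSensor(
        task_id="wait_for_upstream_dataset",
        filepath=(
            "/wmf/data/hypothetical/upstream_dataset/"
            "snapshot={{ data_interval_start | ds }}/_SUCCESS"
        ),
        poke_interval=60 * 10,
        timeout=60 * 60 * 24,
    )

    # Placeholder for the real pipeline tasks.
    run_pipeline = EmptyOperator(task_id="run_pipeline")

    wait_for_upstream >> run_pipeline
```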
Title | Reference | Author | Source Branch | Dest Branch
---|---|---|---|---
Make section_image_suggestions.py work in prod | repos/structured-data/image-suggestions!24 | xcollazo | hotfix-remove-dev-flags | main
Update DAGs to generate section-level image suggestions | repos/data-engineering/airflow-dags!327 | cparle | T330667 | main
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | None | | T311814 [EPIC] Section-level image suggestions data pipeline
Resolved | | Cparle | T330667 [M] Make sure DAGs are run in the correct order
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/tree/T330667
Haven't made an MR yet because I want to wait for the old DAG to run with the new image-suggestions code. Will make one next week.
cparle opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/327
Update DAGs to generate section-level image suggestions
xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/327
Update DAGs to generate section-level image suggestions
Fixed some file path issues and bumped image_suggestions via the same merge request above.
Tested on an airflow test instance to make sure the new URLSensors were good.
Deployed to prod.
Re-ran the section_alignment_image_suggestions DAG on prod for data_interval_start=2023-02-01 to regenerate the data in the location where we now expect it.
image_suggestions DAG is now waiting until weekly sensors trigger for data_interval_start=2023-03-20.
We are good.
The run failed with:
/venv/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.AnalysisException: The column number of the existing table analytics_platform_eng.image_suggestions_suggestions
(struct<page_id:bigint,id:string,image:string,origin_wiki:string,confidence:int,found_on:array<string>,kind:array<string>,page_rev:bigint,snapshot:string,wiki:string>)
doesn't match the data schema
(struct<page_id:bigint,id:string,image:string,origin_wiki:string,confidence:int,found_on:array<string>,kind:array<string>,page_rev:bigint,section_heading:string,snapshot:string,wiki:string>)
This confirms our suspicion that saveAsTable() with mode=append will not automatically ALTER the table. Thus we need to ALTER by hand:
$ hostname -f
an-airflow1004.eqiad.wmnet

$ sudo -u analytics-platform-eng spark3-sql

spark-sql (default)> use analytics_platform_eng;
Response code
Time taken: 2.896 seconds

spark-sql (default)> ALTER TABLE `image_suggestions_suggestions` ADD COLUMNS (
                   >   `section_heading` string
                   > );
Response code
Time taken: 0.795 seconds

spark-sql (default)> describe formatted image_suggestions_suggestions;
col_name          data_type       comment
page_id           bigint          Uniquely identifying primary key along with wiki. This value is preserved across edits, renames, and, as of MediaWiki 1.27, deletions, via an analogous field in the archive table (introduced in MediaWiki 1.11). For example, for this page, page_id = 10501.
id                string          NULL
image             string          The sanitized page title, without the namespace, with a maximum of 255 characters (binary). It is stored as text, with spaces replaced by underscores. The real title shown in articles is just this title with underscores (_) converted to spaces ( ). For exa
origin_wiki       string          NULL
confidence        int             NULL
found_on          array<string>   NULL
kind              array<string>   NULL
page_rev          bigint          NULL
section_heading   string          NULL
snapshot          string          NULL
wiki              string          The wiki_db project

# Partition Information
# col_name        data_type       comment
snapshot          string          NULL
wiki              string          The wiki_db project

# Detailed Table Information
Database            analytics_platform_eng
Table               image_suggestions_suggestions
Owner               analytics-platform-eng
Created Time        Fri May 20 13:21:14 UTC 2022
Last Access         UNKNOWN
Created By          Spark 2.4.4
Type                MANAGED
Provider            parquet
Location            hdfs://analytics-hadoop/user/hive/warehouse/analytics_platform_eng.db/image_suggestions_suggestions
Serde Library       org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat         org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat        org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Partition Provider  Catalog
Time taken: 0.17 seconds, Fetched 30 row(s)
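For context, here is a minimal sketch of the failure mode and the fix (the table name here is hypothetical): saveAsTable() in append mode validates the DataFrame schema against the catalog schema but never evolves it, so the ALTER has to happen first.

```python
# Sketch (hypothetical table name): appending a DataFrame that carries an
# extra column to an existing Hive table fails until the table schema is
# evolved by hand.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.createDataFrame(
    [(1, "Some heading")],
    "page_id bigint, section_heading string",
)

# This would raise AnalysisException ("column number ... doesn't match"):
# df.write.mode("append").saveAsTable("analytics_platform_eng.some_table")

# Evolve the table schema first, then the append succeeds.
spark.sql("""
    ALTER TABLE analytics_platform_eng.some_table
    ADD COLUMNS (`section_heading` string)
""")
df.write.mode("append").saveAsTable("analytics_platform_eng.some_table")
```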
The article_level_suggestions Airflow node is now successful. Now we have a different issue:
Traceback (most recent call last):
  File "/var/lib/hadoop/data/d/yarn/local/usercache/analytics-platform-eng/appcache/application_1678266962370_111183/container_e75_1678266962370_111183_01_000001/venv/lib/python3.10/site-packages/image_suggestions/section_image_suggestions.py", line 611, in <module>
    from wmfdata.spark import create_custom_session  # type: ignore
ModuleNotFoundError: No module named 'wmfdata'
This is because we forgot to comment out some dev-only code. Have a fix at https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/24. Testing it out.
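A minimal sketch of the kind of guard involved (the environment flag is hypothetical; see the MR for the actual change): the wmfdata import is a dev-only convenience, so it should never execute in the production conda env, which does not ship wmfdata.

```python
# Sketch (hypothetical flag): only import the dev-only wmfdata helper when
# explicitly running locally; in prod, fall back to the plain session.
import os

if os.environ.get("IMAGE_SUGGESTIONS_DEV") == "1":
    from wmfdata.spark import create_custom_session  # dev-only dependency
    spark = create_custom_session(master="yarn")
else:
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()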
Next issue:
File "/var/lib/hadoop/data/d/yarn/local/usercache/analytics-platform-eng/appcache/application_1678266962370_111381/container_e75_1678266962370_111381_01_000001/venv/lib/python3.10/site-packages/image_suggestions/section_image_suggestions.py", line 407, in prune_suggestions suggestions = prune_non_illustratable_sections(spark, suggestions) File "/var/lib/hadoop/data/d/yarn/local/usercache/analytics-platform-eng/appcache/application_1678266962370_111381/container_e75_1678266962370_111381_01_000001/venv/lib/python3.10/site-packages/image_suggestions/section_image_suggestions.py", line 431, in prune_non_illustratable_sections sections_to_exclude_dict = json.load(open( FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/hadoop/data/j/yarn/local/usercache/analytics-platform-eng/appcache/application_1678266962370_111381/filecache/10/image-suggestions-0.9.0.dev0-hotfix-remove-dev-flags.conda.tgz/lib/python3.10/site-packages/image_suggestions/../data/section_titles_denylist.json'
We need a production friendly way of loading files from the artifact. Added fix to https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/24. Testing it out.
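One production-friendly option (a sketch only, not necessarily what the MR does) is to resolve data files relative to the installed package via importlib.resources instead of a hard-coded "../data/..." path:

```python
# Sketch: load a data file shipped inside the artifact relative to the
# installed package, rather than via a path computed with "../data/...".
import json
from importlib.resources import files

# Assumes the denylist is packaged under image_suggestions/data/ as
# package data; the exact location in the real artifact may differ.
denylist_text = (
    files("image_suggestions")
    .joinpath("data/section_titles_denylist.json")
    .read_text()
)
sections_to_exclude_dict = json.loads(denylist_text)
```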
Success!
https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/24 up for review. A dev version of that MR is currently set on prod via conda_env varprop.
For some reason all Cassandra uploads took longer than usual; in particular the hive_to_cassandra_suggestions node, which typically takes 30-40 minutes, took 4 hours 45 minutes. Perhaps it was just bad timing with other jobs? We should monitor subsequent runs.
xcollazo updated https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/24
Make section_image_suggestions.py work in prod
xcollazo merged https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/24
Make section_image_suggestions.py work in prod
Pushed https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/24 to production via scap deploy.
We are finally done here! 🎉