[L] Build experimental dataset
Closed, Resolved (Public)

Assigned To

Authored By

	mfossati
	Aug 23 2022, 9:48 AM

Description

Blend image suggestions based on section alignment with those based on visual topics to form an initial proof-of-concept dataset for manual evaluation.

ask Research to share their datasets: available on HDFS at /user/mnz/imagerecs/recs-2022-06-07
join them with the Structured Data dataset, see section 4
filter out obvious sections like references, see also T311730: [L] Exclude certain sections from having generated image suggestions
compute suggestion counts per language
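
The last two steps can be sketched in plain Python on toy rows. This is only an illustration: the row layout, the sample rows, and the EXCLUDED_HEADINGS set are assumptions, and the actual exclusion list is the subject of T311730.

```python
from collections import Counter

# Hypothetical exclusion list; the actual one is defined in T311730.
EXCLUDED_HEADINGS = {"references", "see also"}

# Toy rows: (wiki_db, section_heading, suggested_image). Not real data.
rows = [
    ("enwiki", "History", "A.jpg"),
    ("enwiki", "References", "B.jpg"),
    ("enwiki", "Geography", "C.jpg"),
    ("frwiki", "See also", "D.jpg"),
    ("frwiki", "Histoire", "E.jpg"),
]


def filter_sections(rows, excluded=EXCLUDED_HEADINGS):
    """Drop rows whose heading (stripped, case-insensitive) is excluded."""
    return [r for r in rows if r[1].strip().lower() not in excluded]


def counts_per_wiki(rows):
    """Count the remaining suggestions per wiki."""
    return Counter(wiki for wiki, _, _ in rows)


kept = filter_sections(rows)
counts = counts_per_wiki(kept)
```

On the real data the same two steps would be a filter followed by a group-by on the wiki column.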

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		None	T311814 [EPIC] Section-level image suggestions data pipeline
		Resolved		Cparle	T315976 [L] Build experimental dataset

Event Timeline

mfossati created this task. Aug 23 2022, 9:48 AM

mfossati updated the task description.

mfossati added a parent task: T311814: [EPIC] Section-level image suggestions data pipeline. Aug 23 2022, 9:50 AM

mfossati updated the task description. Aug 23 2022, 1:51 PM

mfossati updated the task description. Aug 24 2022, 8:45 AM

CBogen mentioned this in T316149: [L] Create tool for manual evaluation of section-level image suggestions. Aug 24 2022, 6:56 PM

CBogen mentioned this in T311829: [XL] Combine suggestions based on section topics with section alignment ones and convert notebook code into idiomatic data pipeline code. Aug 25 2022, 12:38 PM

CBogen edited projects, added Structured-Data-Backlog (Current Work); removed Structured-Data-Backlog, Section-Level-Image-Suggestions. Aug 29 2022, 4:39 PM

CBogen moved this task from Incoming to Ready for Estimation on the Structured-Data-Backlog (Current Work) board.

mfossati added a project: Section-Level-Image-Suggestions. Sep 2 2022, 9:59 AM

CBogen renamed this task from Build experimental dataset to [L] Build experimental dataset. Sep 27 2022, 3:37 PM

CBogen moved this task from Ready for Estimation to Ready for Development on the Structured-Data-Backlog (Current Work) board.

lbowmaker mentioned this in T320831: Section Level Image Suggestions - Data Persistence Request. Oct 14 2022, 5:45 PM

kostajh subscribed. Oct 25 2022, 1:22 PM

lbowmaker subscribed. Oct 28 2022, 12:26 PM

mfossati claimed this task. Nov 28 2022, 5:26 PM

mfossati changed the task status from Open to In Progress. Dec 5 2022, 2:17 PM

mfossati moved this task from Ready for Development to Doing on the Structured-Data-Backlog (Current Work) board.

A note that @Cparle is currently working on the second task.

CBogen updated Other Assignee, added: Cparle. Dec 8 2022, 3:38 PM

Sample dataset for enwiki

https://docs.google.com/spreadsheets/d/17H8eHrGJlfpgG9hJYkCujtrkPByyn9cShSKosgBHDFI/edit#gid=1787708191

Section image suggestions counts:

based on section alignment: 248289
based on section topics and p18 image property: 50358726

More to follow ...

wiki	section-alignment suggestions	section-topics-plus-p18 suggestions	intersection
enwiki	248035	50337151	14536
ptwiki*	148838	147934	584
idwiki	75618	1677378	2103
ruwiki	267413	11865098	7743
arwiki	97886	3226347	2828
bnwiki	28796	406662	213
eswiki	215593	11747916	10621
cswiki	124834	3901333	4644
frwiki	259604	16446381	10244

FWIW, here's the notebook I used to gather the data: https://gitlab.wikimedia.org/cparle/notebooks/-/blob/main/section_image_suggestions_data.ipynb

*The ratio of section-alignment suggestions to section-topics suggestions is very different for ptwiki. This is because in section-topics we exclude a lot of ptwiki sections that we suspect might have been parsed incorrectly.
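
To make the ptwiki outlier concrete, here is a small sketch that computes two ratios from the counts in the table above. The numbers are copied from the table; the function and key names are made up for illustration.

```python
# Counts copied from the table above, per wiki:
# (section-alignment, section-topics-plus-p18, intersection)
counts = {
    "enwiki": (248035, 50337151, 14536),
    "ptwiki": (148838, 147934, 584),
    "idwiki": (75618, 1677378, 2103),
    "ruwiki": (267413, 11865098, 7743),
    "arwiki": (97886, 3226347, 2828),
    "bnwiki": (28796, 406662, 213),
    "eswiki": (215593, 11747916, 10621),
    "cswiki": (124834, 3901333, 4644),
    "frwiki": (259604, 16446381, 10244),
}


def ratios(section_alignment, section_topics, intersection):
    """Return the two ratios discussed in the footnote (hypothetical names)."""
    return {
        # How large section alignment is relative to section topics.
        "sa_to_st": section_alignment / section_topics,
        # Share of section-alignment suggestions also found via section topics.
        "intersection_share_of_sa": intersection / section_alignment,
    }


summary = {wiki: ratios(*c) for wiki, c in counts.items()}
# The wiki with the highest section-alignment to section-topics ratio.
outlier = max(summary, key=lambda w: summary[w]["sa_to_st"])
```

With these counts, ptwiki is the only wiki where the first ratio is above 1; for every other wiki it stays below 0.1.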

Just for completeness, here's the notebook code I used to calculate the intersections:

from pyspark.sql import functions as F

# `spark` is the SparkSession provided by the notebook environment.
sis = spark.read.parquet("/user/cparle/section-image-suggestions/all_sugggestions_pruned.2022-01-12")
# createOrReplaceTempView replaces the deprecated registerTempTable.
sis.createOrReplaceTempView("section_image_suggestions")
# Lowercase the section headings so that the join below is case-insensitive on them.
sa = spark.sql('select wiki_db, target_qid, LOWER(target_section_heading) as target_section_heading, suggested_image from section_image_suggestions where suggestion_origin="section_alignment"')
p18 = spark.sql('select wiki_db, target_qid, LOWER(target_section_heading) as target_section_heading, suggested_image from section_image_suggestions where suggestion_origin="p18"')

# Keep only the suggestions produced by both approaches, then count them per wiki.
intersection = sa.join(
    p18,
    on=[
        'wiki_db',
        'target_qid',
        'target_section_heading',
        'suggested_image'
    ],
    how='inner'
).groupBy(
    'wiki_db'
).agg(
    F.count('target_qid').alias('target_intersection_count')
).select(
    'wiki_db', 'target_intersection_count'
)

Cparle claimed this task. Dec 9 2022, 1:04 PM

Cparle updated the task description.

Cparle updated Other Assignee, removed: Cparle.

Cparle moved this task from Doing to Code Review on the Structured-Data-Backlog (Current Work) board.

@Cparle, I reviewed your notebook and I think there’s a missing piece in the section topics suggestions algorithm.
Here's what we agreed with Research: project visual topics from all but the given wiki into the given wiki. See section 4 (Eureka!) of my notebook and this slide.
In a nutshell:

image links of all but the given wiki are joined with Commons on image titles, then with Wikidata p18/p373 on page IDs
section topics of the given wiki are joined with the previous dataset on QIDs

What's missing is the join with image links.
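
As a rough illustration of the two steps above, here is a toy sketch in plain Python. All data structures, names, and values are hypothetical stand-ins for the real tables; it follows the "all but the given wiki" rule as stated and only uses p18.

```python
# Toy inputs; every name and value below is hypothetical.
# Image links: (wiki_db, image_title) for images used in articles.
image_links = [
    ("frwiki", "Eiffel.jpg"),
    ("dewiki", "Louvre.jpg"),
    ("enwiki", "Seine.jpg"),
]
# Commons: image title -> Commons page ID.
commons_pages = {"Eiffel.jpg": 101, "Louvre.jpg": 102, "Seine.jpg": 103}
# Wikidata p18: Commons page ID -> QID whose p18 points at that image.
p18_by_page_id = {101: "Q243", 102: "Q19675", 103: "Q1471"}
# Section topics: (wiki_db, page QID, section heading, topic QID).
section_topics = [
    ("enwiki", "Q90", "Landmarks", "Q243"),
    ("enwiki", "Q90", "Museums", "Q19675"),
    ("enwiki", "Q90", "Geography", "Q1471"),
]


def project(given_wiki):
    """Project visual topics from all other wikis into given_wiki."""
    # Step 1: image links of all but the given wiki, joined with Commons
    # on image titles, then with Wikidata p18 on page IDs.
    images_by_qid = {}
    for wiki, title in image_links:
        if wiki == given_wiki:
            continue
        qid = p18_by_page_id.get(commons_pages.get(title))
        if qid is not None:
            images_by_qid.setdefault(qid, set()).add(title)
    # Step 2: section topics of the given wiki, joined with step 1 on QIDs.
    suggestions = []
    for wiki, page_qid, heading, topic_qid in section_topics:
        if wiki != given_wiki:
            continue
        for title in sorted(images_by_qid.get(topic_qid, ())):
            suggestions.append((wiki, page_qid, heading, title))
    return suggestions


suggestions = project("enwiki")
```

In this toy data the "Geography" section gets no suggestion, because its only matching image is linked from the given wiki itself and is therefore skipped in step 1.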

Review integrated at https://gitlab.wikimedia.org/cparle/notebooks/-/commit/d35c8a524f5300c8ac1dfedcd62d75a1263fed4d?view=parallel&w=1

We also discussed the following points:

images linked via Wikidata p373 (Commons categories) are noisy, so we should only use p18, which is a direct triple (QID, p18, Commons image). That's implemented.
we shouldn't exclude the given wiki from the projection, otherwise we may lose relevant suggestions.

So ... can we count this as done?

Yep, closing.

Actually I forgot that the current notebook makes a union of the two suggestion approaches, while we agreed with Research to intersect them instead.
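
The difference between the two ways of combining can be shown on toy data. This is a sketch with made-up rows; the key uses the same four columns as the notebook join (wiki_db, target_qid, lowercased target_section_heading, suggested_image).

```python
def key(row):
    """Build the comparison key, lowercasing the section heading."""
    wiki_db, target_qid, heading, image = row
    return (wiki_db, target_qid, heading.lower(), image)


# Toy suggestions from each approach. Not real data.
section_alignment = [
    ("enwiki", "Q90", "History", "A.jpg"),
    ("enwiki", "Q90", "Geography", "B.jpg"),
]
section_topics_p18 = [
    ("enwiki", "Q90", "history", "A.jpg"),
    ("enwiki", "Q90", "Culture", "C.jpg"),
]

sa_keys = {key(r) for r in section_alignment}
p18_keys = {key(r) for r in section_topics_p18}

# Union: every suggestion produced by either approach.
union = sa_keys | p18_keys
# Intersection: only suggestions that both approaches agree on.
intersection = sa_keys & p18_keys
```

Here the union keeps three suggestions while the intersection keeps only the one both approaches produced, which is why the intersection counts are so much smaller than either input.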

Added numbers for intersections to the table above (https://phabricator.wikimedia.org/T315976#8456730), so I think this can be closed now, @mfossati?

Looks good to me! Closing.
