[L] Build experimental dataset
Closed, Resolved · Public


Blend image suggestions based on section alignment with those based on visual topics to form an initial proof-of-concept dataset for manual evaluation.

Event Timeline

CBogen renamed this task from Build experimental dataset to [L] Build experimental dataset. Sep 27 2022, 3:37 PM
mfossati changed the task status from Open to In Progress. Dec 5 2022, 2:17 PM

A note that @Cparle is currently working on the second task.

Sample dataset for enwiki


Section image suggestions counts:

  • based on section alignment: 248289
  • based on section topics and p18 image property: 50358726

More to follow ...

section-alignment suggestions | section-topics-plus-p18 suggestions | intersection

FWIW, here's the notebook I used to gather the data: https://gitlab.wikimedia.org/cparle/notebooks/-/blob/main/section_image_suggestions_data.ipynb

* The ratio of section-alignment suggestions to section-topics suggestions is very different for ptwiki. This is because in section-topics we exclude many ptwiki sections that we suspect were parsed incorrectly.

Just for completeness, here's the notebook code I used to calculate the intersections:

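# Load the pruned suggestions (`spark` is the notebook's SparkSession).
sis = spark.read.parquet("/user/cparle/section-image-suggestions/all_sugggestions_pruned.2022-01-12")
# Assumption: the view registration is not in this snippet; it is added here so the SQL below can resolve the table name.
sis.createOrReplaceTempView("section_image_suggestions")
# Lower-case the headings so both origins compare case-insensitively.
sa = spark.sql('select wiki_db, target_qid, LOWER(target_section_heading) as target_section_heading, suggested_image from section_image_suggestions where suggestion_origin="section_alignment"')
p18 = spark.sql('select wiki_db, target_qid, LOWER(target_section_heading) as target_section_heading, suggested_image from section_image_suggestions where suggestion_origin="p18"')

# The rest of this statement is cut off in the snippet.
intersection = sa.join(
    'wiki_db', 'target_intersection_count'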
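
Since the listing above breaks off mid-join, here is a rough plain-Python sketch of what a per-wiki intersection count could look like. This is not the notebook code: the function names `normalise` and `intersection_counts` and the tuple layout `(wiki_db, target_qid, target_section_heading, suggested_image)` are assumptions made for illustration, and set intersection stands in for the Spark inner join.

```python
from collections import Counter


def normalise(rows):
    """Lower-case the section heading, mirroring LOWER(...) in the SQL queries."""
    return {
        (wiki_db, qid, heading.lower(), image)
        for wiki_db, qid, heading, image in rows
    }


def intersection_counts(sa_rows, p18_rows):
    """Count suggestions present in both origins, grouped by wiki.

    Hypothetical helper: rows are (wiki_db, target_qid,
    target_section_heading, suggested_image) tuples.
    """
    common = normalise(sa_rows) & normalise(p18_rows)
    return Counter(wiki_db for wiki_db, _, _, _ in common)
```

A suggestion only counts when wiki, QID, lower-cased heading and image all match, which is the same equality an inner join on those four columns would apply.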
Cparle updated the task description.
Cparle updated Other Assignee, removed: Cparle.
Cparle moved this task from Doing to Code Review on the Structured-Data-Backlog (Current Work) board.

@Cparle, I reviewed your notebook and I think there's a missing piece in the section-topics suggestions algorithm.
Here's what we agreed with Research: project visual topics from all but the given wiki into the given wiki. See section 4 (Eureka!) of my notebook and this slide.
In a nutshell:

  • image links of all but the given wiki are joined with Commons on image titles, then with Wikidata p18/p373 on page IDs
  • section topics of the given wiki are joined with the previous dataset on QIDs

What's missing is the join with image links.
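
To make the two joins above concrete, here is a minimal plain-Python sketch of the projection, under stated assumptions. It is not the notebook or pipeline code: the function name `project_visual_topics`, the input shapes, and the `include_target_wiki` flag are all hypothetical, and dictionaries and sets stand in for the Spark joins. Only p18 is modelled here; p373 is left out.

```python
from collections import defaultdict


def project_visual_topics(target_wiki, image_links, commons_pages,
                          wikidata_p18, section_topics,
                          include_target_wiki=False):
    """Project visual topics from other wikis into target_wiki.

    Assumed (hypothetical) input shapes:
      image_links:    iterable of (wiki_db, image_title)
      commons_pages:  dict image_title -> Commons page ID
      wikidata_p18:   iterable of (qid, Commons page ID of the p18 image)
      section_topics: iterable of (wiki_db, page_title, section_heading, topic_qid)
    """
    # Join 1: image links of the source wikis with Commons on image title.
    used_page_ids = set()
    for wiki_db, image_title in image_links:
        if wiki_db == target_wiki and not include_target_wiki:
            continue
        page_id = commons_pages.get(image_title)
        if page_id is not None:
            used_page_ids.add(page_id)

    # Join 2: with Wikidata p18 on Commons page ID, giving QID -> images.
    title_by_page_id = {pid: title for title, pid in commons_pages.items()}
    images_by_qid = defaultdict(set)
    for qid, page_id in wikidata_p18:
        if page_id in used_page_ids:
            images_by_qid[qid].add(title_by_page_id[page_id])

    # Join 3: section topics of the target wiki with the result on QID.
    suggestions = set()
    for wiki_db, page_title, section_heading, topic_qid in section_topics:
        if wiki_db != target_wiki:
            continue
        for image in images_by_qid.get(topic_qid, ()):
            suggestions.add((page_title, section_heading, image))
    return suggestions
```

With the default `include_target_wiki=False`, an image linked only from the target wiki never reaches the output, which is the "all but the given wiki" behaviour; setting the flag to `True` keeps those links in the projection.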

We also discussed the following points:

  • images linked via Wikidata p373 (Commons categories) are noisy, so we should only use p18, which is a direct triple (QID, p18, Commons image). That's already implemented.
  • we shouldn't exclude the given wiki from the projection, otherwise we may lose relevant suggestions.

So ... can we count this as done?

Actually I forgot that the current notebook makes a union of the two suggestion approaches, while we agreed with Research to intersect them instead.

Added numbers for intersections to the table above (https://phabricator.wikimedia.org/T315976#8456730), so I think this can be closed now, @mfossati?

Looks good to me! Closing.