[L] Build experimental dataset
Closed, Resolved · Public

Description

Blend image suggestions based on section alignment with those based on visual topics to form an initial proof-of-concept dataset for manual evaluation.

Event Timeline

CBogen renamed this task from Build experimental dataset to [L] Build experimental dataset. Sep 27 2022, 3:37 PM
mfossati changed the task status from Open to In Progress. Dec 5 2022, 2:17 PM

A note that @Cparle is currently working on the second task.

Sample dataset for enwiki

https://docs.google.com/spreadsheets/d/17H8eHrGJlfpgG9hJYkCujtrkPByyn9cShSKosgBHDFI/edit#gid=1787708191

Section image suggestions counts:

  • based on section alignment: 248289
  • based on section topics and p18 image property: 50358726

More to follow ...

section-alignment suggestions | section-topics-plus-p18 suggestions | intersection
enwiki 2480355033715114536
ptwiki* 148838147934584
idwiki 7561816773782103
ruwiki 267413118650987743
arwiki 9788632263472828
bnwiki 28796406662213
eswiki 2155931174791610621
cswiki 12483439013334644
frwiki 2596041644638110244

FWIW, here's the notebook I used to gather the data: https://gitlab.wikimedia.org/cparle/notebooks/-/blob/main/section_image_suggestions_data.ipynb


* The ratio of section-alignment suggestions to section-topics suggestions is very different for ptwiki. This is because section-topics excludes many ptwiki sections that we suspect were parsed incorrectly.


Just for completeness, here's the notebook code I used to calculate the intersections:

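from pyspark.sql import functions as F

# `spark` is the SparkSession provided by the notebook environment.
# Load the pruned suggestions and expose them to Spark SQL.
sis = spark.read.parquet("/user/cparle/section-image-suggestions/all_sugggestions_pruned.2022-01-12")
sis.createOrReplaceTempView("section_image_suggestions")  # replaces the deprecated registerTempTable
# Lowercase the section headings so the join below matches them case-insensitively.
sa = spark.sql('select wiki_db, target_qid, LOWER(target_section_heading) as target_section_heading, suggested_image from section_image_suggestions where suggestion_origin="section_alignment"')
p18 = spark.sql('select wiki_db, target_qid, LOWER(target_section_heading) as target_section_heading, suggested_image from section_image_suggestions where suggestion_origin="p18"')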

intersection = sa.join(
    p18,
    on=[
        'wiki_db',
        'target_qid',
        'target_section_heading',
        'suggested_image'
    ],
    how='inner'
).groupBy(
    'wiki_db'
).agg(
    F.count('target_qid').alias('target_intersection_count')
).select(
    'wiki_db', 'target_intersection_count'
)
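
For readers without a Spark session at hand, the same counting logic can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `intersection_counts` and the toy rows are hypothetical, and unlike the Spark inner join above, this version counts each distinct (wiki, QID, heading, image) key once even if it appears several times in the input.

```python
from collections import Counter


def intersection_counts(rows):
    """Count suggestions shared by both origins, per wiki.

    rows: iterable of (wiki_db, target_qid, target_section_heading,
                       suggested_image, suggestion_origin) tuples.
    """
    keys = {"section_alignment": set(), "p18": set()}
    for wiki_db, qid, heading, image, origin in rows:
        if origin in keys:
            # Lowercase the heading, mirroring LOWER(...) in the SQL above.
            keys[origin].add((wiki_db, qid, heading.lower(), image))
    shared = keys["section_alignment"] & keys["p18"]
    return Counter(key[0] for key in shared)


# Hypothetical toy rows, not taken from the real dataset.
rows = [
    ("enwiki", "Q1", "History", "A.jpg", "section_alignment"),
    ("enwiki", "Q1", "history", "A.jpg", "p18"),
    ("enwiki", "Q1", "History", "B.jpg", "section_alignment"),
    ("frwiki", "Q2", "Histoire", "C.jpg", "p18"),
]
counts = intersection_counts(rows)
print(dict(counts))
```

Only the `A.jpg` suggestion is produced by both origins here, and it matches only because the headings are compared in lowercase.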
Cparle updated the task description.
Cparle updated Other Assignee, removed: Cparle.
Cparle moved this task from Doing to Code Review on the Structured-Data-Backlog (Current Work) board.

@Cparle, I reviewed your notebook and I think a piece of the section-topics suggestions algorithm is missing.
Here's what we agreed with Research: project visual topics from all but the given wiki into the given wiki. See section 4 (Eureka!) of my notebook and this slide.
In a nutshell:

  • image links of all but the given wiki are joined with Commons on image titles, then with Wikidata p18/p373 on page IDs
  • section topics of the given wiki are joined with the previous dataset on QIDs

What's missing is the join with image links.

We also discussed the following points:

  • images linked via Wikidata p373 (Commons categories) are noisy, so we should only use p18, which is a direct triple (QID, p18, Commons image). That's already implemented.
  • we shouldn't exclude the given wiki from the projection, otherwise we may lose relevant suggestions.
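
The projection described above can be sketched in plain Python as a chain of joins. This is a minimal illustration, not the notebook's implementation: the function `project_visual_topics`, its parameters, the input shapes, and the toy rows are all assumptions made for the example. It uses p18 only, and the `include_target_wiki` flag shows what is lost when the given wiki's own image links are left out.

```python
from collections import defaultdict


def project_visual_topics(image_links, commons_pages, p18, section_topics,
                          target_wiki, include_target_wiki=True):
    """Suggest images for sections of target_wiki via shared topic QIDs.

    image_links:    iterable of (wiki_db, image_title)
    commons_pages:  dict mapping image_title -> Commons page ID
    p18:            iterable of (qid, Commons page ID) pairs
    section_topics: iterable of (wiki_db, page_qid, section_heading, topic_qid)
    """
    # Join image links with Commons on image titles.
    linked = {}
    for wiki_db, title in image_links:
        if wiki_db == target_wiki and not include_target_wiki:
            continue
        if title in commons_pages:
            linked[commons_pages[title]] = title

    # Join the result with Wikidata p18 on Commons page IDs.
    images_by_qid = defaultdict(set)
    for qid, page_id in p18:
        if page_id in linked:
            images_by_qid[qid].add(linked[page_id])

    # Join the target wiki's section topics with the above on QIDs.
    suggestions = set()
    for wiki_db, page_qid, heading, topic_qid in section_topics:
        if wiki_db != target_wiki:
            continue
        for image in images_by_qid.get(topic_qid, ()):
            suggestions.add((wiki_db, page_qid, heading, image))
    return suggestions


# Hypothetical toy inputs.
image_links = [("enwiki", "Cat.jpg"), ("frwiki", "Dog.jpg"), ("cswiki", "Tree.jpg")]
commons_pages = {"Cat.jpg": 1, "Dog.jpg": 2, "Tree.jpg": 3}
p18 = [("Q146", 1), ("Q144", 2), ("Q10884", 3)]
section_topics = [
    ("cswiki", "Q1", "fauna", "Q146"),
    ("cswiki", "Q1", "flora", "Q10884"),
    ("enwiki", "Q1", "fauna", "Q144"),
]

with_target = project_visual_topics(
    image_links, commons_pages, p18, section_topics, "cswiki")
without_target = project_visual_topics(
    image_links, commons_pages, p18, section_topics, "cswiki",
    include_target_wiki=False)
print(sorted(with_target))
print(sorted(without_target))
```

In this toy data, `Tree.jpg` is linked only from cswiki, so excluding the given wiki drops the "flora" suggestion, which is the kind of loss the second point warns about.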

So ... can we count this as done?

Actually I forgot that the current notebook makes a union of the two suggestion approaches, while we agreed with Research to intersect them instead.
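
The difference between the two blending strategies can be shown with a small sketch. The helper `blend` and the toy suggestion tuples are hypothetical; the point is only that the intersection keeps suggestions both approaches agree on, while the union keeps everything either one produced.

```python
def blend(section_alignment, section_topics, how="intersection"):
    """Blend two sets of (wiki_db, qid, heading, image) suggestions."""
    if how == "intersection":
        return section_alignment & section_topics
    if how == "union":
        return section_alignment | section_topics
    raise ValueError(f"unknown blending strategy: {how}")


# Hypothetical toy suggestions.
alignment = {
    ("enwiki", "Q1", "history", "A.jpg"),
    ("enwiki", "Q1", "history", "B.jpg"),
}
topics = {
    ("enwiki", "Q1", "history", "A.jpg"),
    ("enwiki", "Q1", "geography", "C.jpg"),
}

both = blend(alignment, topics, how="intersection")
either = blend(alignment, topics, how="union")
print(len(both), len(either))
```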

Added numbers for intersections to the table above (https://phabricator.wikimedia.org/T315976#8456730), so I think this can be closed now, @mfossati?

Looks good to me! Closing.