This task is to adapt either
https://media-search-signal-test.toolforge.org/synonyms_bak.html
or
https://image-recommendation-test.toolforge.org/
for manual evaluation of section-level image suggestions once T315976 is complete.
There will be a follow-up task to do the manual testing itself.
Acceptance Criteria:
- Determine which of the above tools is better/easier to adapt for this purpose.
- The tool will evaluate results in English, Portuguese, Indonesian, Russian, Arabic, Czech, Bengali, French and Spanish Wikipedias
- The tool will allow the user to choose which wiki/language they want to evaluate
- The tool will evaluate 500 random section-level image suggestions across 500 random different articles, per wiki
- The tool will display and evaluate the output (both a preview of the section text and the image), similar to https://media-search-signal-test.toolforge.org/
- The tool will allow testers to manually decide whether the match is good or bad for each result for each unillustrated article.
- The tool will show the user information about the source of the match -- whether it's from section alignment (e.g. this image was used on X wiki), or from visual topics, or an intersection of both
- The tool will show suggestions from section alignment, visual topics, and at the intersection of both, but will prioritize suggestions that intersect
- The tool will output the results into a spreadsheet, showing how many good and bad matches were produced for each article, and what the confidence score for each of those matches was
- We will remove images that are not .jpgs from the evaluation dataset so that we remove potential icons
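The sampling criteria above (JPG-only, at most one suggestion per article, intersections preferred) could be sketched roughly as follows. The record fields (`article`, `image`, `source`), the source labels, and the function names are assumptions for illustration, not the actual data schema or the tool's code.

```python
import random

# Hypothetical source labels; a lower value means the source is preferred.
SOURCE_PRIORITY = {"intersection": 0, "alignment": 1, "topics": 2}


def is_jpg(image_name):
    """Return True for .jpg file names (case-insensitive)."""
    return image_name.lower().endswith(".jpg")


def sample_suggestions(suggestions, n=500, seed=None):
    """Pick up to n suggestions, at most one per article, preferring intersections."""
    rng = random.Random(seed)
    best = {}
    for s in suggestions:
        if not is_jpg(s["image"]):
            continue  # drop non-.jpg files, which are likely icons
        current = best.get(s["article"])
        if current is None or SOURCE_PRIORITY[s["source"]] < SOURCE_PRIORITY[current["source"]]:
            best[s["article"]] = s
    articles = sorted(best)
    # Sample distinct articles so that no article appears twice.
    chosen = rng.sample(articles, min(n, len(articles)))
    return [best[a] for a in chosen]
```

This would be run once per wiki. Whether `.jpeg` file names should also pass the filter is not stated in the task, so the sketch only accepts `.jpg`.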
Update
We are going to do another round of evaluation to see if we can improve the percentage of good topic-based suggestions and the number of available intersection suggestions.
Plan
Data
- Investigate more images linked through Wikidata via T311832 and T311831 - update: Structured Data on Commons depicts statements can increase the number of images for topics
- @mfossati to spend 1 day on both spikes to see how it goes; then include them in the updated evaluation data if it goes well
- Reverse-lookup depicts statements
- Use the updated section topics data set (fewer tables and lists, media items, dates) in the updated evaluation data set
- Half of the data evaluation set should use section topics with a relevance score at the section level over 10
- Half of the data evaluation set should use section topics with a relevance score at the article level over 3.725 - update: threshold computed via "recursive" percentiles 🤓
- Run the pipeline to generate intersection-based suggestions
- Do not include alignment-based suggestions in the updated data set -- only include section-topics and intersection-based suggestions
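As a rough sketch of the half-and-half split described above, assuming each section topic record carries hypothetical `section_score` and `article_score` fields (the real column names may differ), and reading "over" as strictly greater than the threshold:

```python
SECTION_THRESHOLD = 10     # section-level relevance score, from the plan
ARTICLE_THRESHOLD = 3.725  # article-level relevance score, from the plan


def split_evaluation_set(topics, size):
    """Return (by_section, by_article), each holding about half of `size` topics."""
    half = size // 2
    by_section = [t for t in topics if t["section_score"] > SECTION_THRESHOLD][:half]
    # Avoid putting the same topic record into both halves.
    used = {id(t) for t in by_section}
    by_article = [
        t for t in topics
        if t["article_score"] > ARTICLE_THRESHOLD and id(t) not in used
    ][: size - half]
    return by_section, by_article
```

The sketch takes the first matching records in input order; a real run would presumably shuffle or sample first. It does not show how the 3.725 threshold itself was derived.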
Tool
- Add an explanation for users when the source is a depicts statement (e.g., here is an image we think might fit the article section, because: This image has the depicts statement X, and an article about that item is linked from the section.)
- Remove suggestions from the queue when the evaluator clicks “unsure” so that we cycle through more suggestions
- Keep the new evaluation data set separate from the previous one by adding a 'dataset_id' field to ratedSuggestions
- Before switching over to the new data set, put the updated results from the old data set in the ticket
- Just like in the first data set, remove images that are not .jpgs from the evaluation dataset so that we remove potential icons
- Once these updates are made, run another evaluation, just amongst ourselves in our languages
- If it goes well, we can do one more round with ambassadors
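The queue and `dataset_id` changes above might look something like this minimal in-memory sketch. Only the names `ratedSuggestions` and `dataset_id` come from this task; the value of `CURRENT_DATASET_ID`, the record layout, and the choice to still store "unsure" ratings are assumptions.

```python
from collections import deque

CURRENT_DATASET_ID = 2  # hypothetical id for the new evaluation data set


def record_rating(queue, rated_suggestions, rating):
    """Rate the suggestion at the front of the queue and remove it from the queue.

    The suggestion is removed for every rating, including "unsure",
    so that evaluators cycle through more suggestions.
    """
    suggestion = queue.popleft()
    rated_suggestions.append(
        {**suggestion, "rating": rating, "dataset_id": CURRENT_DATASET_ID}
    )
    return suggestion
```

Tagging each stored rating with `dataset_id` lets results from the old and new data sets be reported separately without deleting the earlier ratings.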
Round 1 results
wiki | % good intersection | % good alignment | % good p18 topics | total rated suggestions |
arwiki | 100 | 71 | 41 | 343 |
bnwiki | 50 | 40 | 13 | 55 |
cswiki | 38 | 38 | 16 | 206 |
enwiki | 82 | 68 | 44 | 178 |
eswiki | 89 | 78 | 29 | 398 |
frwiki | 85 | 71 | 18 | 344 |
idwiki | 88 | 95 | 71 | 530 |
ptwiki | 100 | 83 | 69 | 398 |
ruwiki | 77 | 75 | 28 | 966 |
overall | 79 | 69 | 37 | 3418 |
As of Feb 16.
Round 2 internal results
wiki | % good intersection | % good p18 topics | total rated suggestions |
enwiki | 100 | 71 | 329 |
eswiki | 100 | 85 | 101 |
frwiki | 75 | 57 | 70 |
ptwiki | 100 | 75 | 94 |
ruwiki | 56 | 62 | 139 |
overall | 86 | 70 | 733 |
As of Feb 28.
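For reference, percentages like the ones in the tables above could be derived from raw ratings along these lines. This is only a sketch: the field names (`wiki`, `source`, `rating`) are hypothetical, and it assumes that "unsure" ratings are left out of the percentage, which the ticket does not state.

```python
from collections import defaultdict


def percent_good(ratings):
    """Return {(wiki, source): rounded percentage of 'good' among good/bad ratings}."""
    counts = defaultdict(lambda: [0, 0])  # (wiki, source) -> [good, total]
    for r in ratings:
        if r["rating"] not in ("good", "bad"):
            continue  # assumption: "unsure" does not count towards the percentage
        key = (r["wiki"], r["source"])
        counts[key][1] += 1
        if r["rating"] == "good":
            counts[key][0] += 1
    return {key: round(100 * good / total) for key, (good, total) in counts.items()}
```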