

Write and deploy an A/B Test on enwiki using TextCat for Language Identification
Closed, Resolved · Public

Description

Do an A/B test on enwiki (or A/B/C test vs the ES Plugin) using the best TextCat config determined in T118287. (A/B test depends on T121538; A/B/C test could benefit from T121540)


Event Timeline

TJones raised the priority of this task from to Needs Triage.
TJones updated the task description. (Show Details)
TJones added a project: CirrusSearch.
TJones subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.

Although not an absolute blocker, we think that T123537: Generate wikitext-based and query-based language models for TextCat should be done before this. Maybe it should block this?

Deskana moved this task from Needs triage to On Sprint Board on the Discovery-ARCHIVED board.
Deskana subscribed.

I don't think T123537 is a blocker for doing A/B tests on enwiki. We have query-based models for the languages relevant to enwiki, so we don't need the wikitext-based models.

Change 268048 had a related patch set uploaded (by EBernhardson):
A/B/C test of control vs textcat vs accept-lang textcat

https://gerrit.wikimedia.org/r/268048

I was looking at this after a comment Stas made about Italian, and I realized that the set of languages currently in LM-query under TextCat is not the ideal one for this test.

Portuguese and Japanese are missing—a minor issue, since there are not many Portuguese or Japanese queries on enwiki. Hebrew, Armenian, Georgian, Tamil, and Telugu are present; they won't do much, but they won't hurt.

However, French and German are present, and they both tend to get many more false positives than true positives. Not a ton, but they will bring the overall performance down on enwiki.

As I understand it, the PHP version of TextCat doesn't yet have the ability to specify/limit languages, other than by what's in the requested directory.

Do we want to patch LM-query/ before the A/B test?
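To make the limitation concrete, here is a minimal sketch of TextCat-style language identification (rank-order n-gram profiles, after Cavnar & Trenkle) with an explicit language whitelist. The function and parameter names are hypothetical, not the PHP TextCat API; at the time, the PHP port could only be restricted by choosing which model files sat in the language-model directory.

```python
# Illustrative TextCat-style n-gram language guesser with a whitelist.
# Hypothetical sketch; not the actual CirrusSearch/TextCat PHP code.
from collections import Counter

def ngram_profile(text, max_n=3, top_k=400):
    """Build a rank-order profile: n-grams (1..max_n) ranked by frequency."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    ranked = [g for g, _ in counts.most_common(top_k)]
    return {g: rank for rank, g in enumerate(ranked)}

def classify(query, models, allowed=None):
    """Score the query against each model by out-of-place distance
    (lower wins); `allowed` restricts scoring to a subset of languages."""
    qprof = ngram_profile(query)
    langs = allowed if allowed is not None else list(models)
    best_lang, best_dist = None, float("inf")
    for lang in langs:
        model = models[lang]
        penalty = len(model)  # cost for an n-gram absent from the model
        dist = sum(abs(rank - model[g]) if g in model else penalty
                   for g, rank in qprof.items())
        if dist < best_dist:
            best_lang, best_dist = lang, dist
    return best_lang
```

With a whitelist, dropping French and German from consideration would be a one-line change in the caller rather than a change to the model directory.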

New languages have been added in T121539.

Deskana raised the priority of this task from Medium to High. Apr 12 2016, 10:08 PM

Increasing priority, as this is a Q4 goal for the Search Team.

Deskana changed the task status from Open to Stalled. Apr 12 2016, 10:16 PM
Deskana added a subscriber: mpopov.

The primary outstanding question for this task is how to measure the effectiveness of the test. This was discussed briefly in a sprint planning meeting today, but @EBernhardson and @mpopov didn't come to a conclusion. @mpopov will schedule a meeting to discuss this. Marking as stalled until that's done.

Met & discussed: https://lists.wikimedia.org/pipermail/discovery/2016-April/001043.html

Erik, Trey, David, Kevin, and I met this morning to discuss how we're going to handle data collection for the upcoming TextCat test. A big problem in this particular case is that the system wasn't designed/engineered in a way that's conducive to cross-wiki logging / session tracking. And we recently lost the ability to use referrer info to see which page a user came from when moving between wikis. (I was told this was done for user privacy reasons.)

Erik said he had recently implemented a click event in the TestSearchSatisfaction2 schema that we might be able to hook into to measure clickthrough rate for users who are eligible for TextCat language detection and are shown results in the language their non-English query was probably written in. Whether we use this, and how much we rely on it as a measure of TextCat's success (beyond just measuring its impact on the zero results rate), depends on validating the click events and comparing them to page visit events (which cannot be fired in an interwiki context).

We also discussed an alternative approach that uses web requests, with the caveat that if a user is selected for the test once, they'll be selected every time. So if a particular IP+UA combination is part of the test and performs 2 million searches (as is sometimes the case), we'll have to do some very careful filtering, which will also exclude some completely valid use cases (e.g. a computer lab in a school, or a country with only 2 public IP addresses). But we're shooting for being able to use TestSearchSatisfaction2.
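The stickiness caveat above follows directly from how request-based sampling is usually done: bucketing on a hash of IP + User-Agent is deterministic, so the same combination lands in the same bucket on every request. A minimal sketch under assumed bucket names and an assumed 1-in-10 sampling rate (not the production values):

```python
# Hypothetical request-based bucketing sketch; arm names and
# sampling rate are illustrative, not the actual test configuration.
import hashlib

BUCKETS = ["control", "textcat", "textcat-accept-lang"]

def assign_bucket(ip, user_agent, sample_rate=10):
    """Deterministically assign 1 in `sample_rate` IP+UA pairs to a test arm.
    Returns None for requests outside the sample."""
    digest = hashlib.sha256(f"{ip}|{user_agent}".encode()).digest()
    value = int.from_bytes(digest[:8], "big")
    if value % sample_rate != 0:
        return None  # not sampled into the test
    return BUCKETS[(value // sample_rate) % len(BUCKETS)]
```

Because the assignment is a pure function of IP and UA, a school lab behind one IP with identical browsers is either entirely in the test or entirely out of it, which is exactly the filtering problem described above.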

Will add validation of click events in TestSearchSatisfaction2 (T132706) as a blocker.

Deskana renamed this task from Do an A/B Test on enwiki using TextCat for Language Identification to Write and deploy an A/B Test on enwiki using TextCat for Language Identification. May 3 2016, 9:49 PM

I changed the task title slightly to more accurately reflect the sequence of events.

After discussion, the way we are running this test has changed slightly. The patch above runs a backend-only test, which doesn't collect as much data as we would like for analysis. I will rework it to run the test through our frontend search satisfaction schema.

Change 287674 had a related patch set uploaded (by EBernhardson):
Allowing triggering user tests from query parameter

https://gerrit.wikimedia.org/r/287674

@mpopov What additional metrics should we collect into the satisfaction schema for users in the textcat test?

Some I'm thinking might be useful:

  • # of interwiki results provided
  • boolean indicating if click event was interwiki or not

These might be unnecessary, though; aren't we mostly just looking at whether users click through more or not?
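The metric in question reduces to a per-bucket clickthrough rate, optionally split by whether the clicked result was interwiki. A toy computation over invented event records (real data would come from the TestSearchSatisfaction2 EventLogging schema):

```python
# Toy per-bucket CTR computation; the event record shape is invented
# for illustration, not the TestSearchSatisfaction2 schema.
from collections import defaultdict

def ctr_by_bucket(events):
    """events: dicts with 'bucket', 'clicked' (bool), 'interwiki' (bool).
    Returns {bucket: (overall_ctr, interwiki_ctr)}."""
    totals = defaultdict(lambda: [0, 0, 0])  # searches, clicks, interwiki clicks
    for e in events:
        t = totals[e["bucket"]]
        t[0] += 1
        if e["clicked"]:
            t[1] += 1
            if e["interwiki"]:
                t[2] += 1
    return {b: (c / n, iw / n) for b, (n, c, iw) in totals.items()}
```

If only overall clickthrough matters, the two extra fields proposed above could indeed be dropped, at the cost of not being able to attribute any CTR change to the interwiki results specifically.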

Change 287677 had a related patch set uploaded (by EBernhardson):
Add textcat subtest

https://gerrit.wikimedia.org/r/287677

Change 287674 merged by jenkins-bot:
Allowing triggering user tests from query parameter

https://gerrit.wikimedia.org/r/287674

Change 288313 had a related patch set uploaded (by EBernhardson):
Adjust textcat data collection for AB test

https://gerrit.wikimedia.org/r/288313

This should now be ready to go. Ideally we want to ship this in the Thursday afternoon SWAT window.

Change 288313 merged by jenkins-bot:
Adjust textcat data collection for AB test

https://gerrit.wikimedia.org/r/288313

Change 288465 had a related patch set uploaded (by EBernhardson):
Adjust textcat data collection for AB test

https://gerrit.wikimedia.org/r/288465

Change 287677 merged by jenkins-bot:
Add textcat subtest

https://gerrit.wikimedia.org/r/287677

Change 288509 had a related patch set uploaded (by EBernhardson):
Add textcat subtest

https://gerrit.wikimedia.org/r/288509

Change 268048 merged by jenkins-bot:
A/B/C test of control vs textcat vs accept-lang textcat

https://gerrit.wikimedia.org/r/268048

Change 288465 merged by jenkins-bot:
Adjust textcat data collection for AB test

https://gerrit.wikimedia.org/r/288465

Mentioned in SAL [2016-05-12T23:54:13Z] <dereckson@tin> Synchronized php-1.28.0-wmf.1/extensions/CirrusSearch/includes/CirrusSearch.php: Adjust textcat data collection for AB test (T121542) (duration: 00m 26s)

Change 288509 merged by jenkins-bot:
Add textcat subtest

https://gerrit.wikimedia.org/r/288509

debt subscribed.

Looks like this is resolved - closing.