

Write and deploy an A/B Test on enwiki using TextCat for Language Identification
Closed, Resolved · Public

Description

Do an A/B test on enwiki (or A/B/C test vs the ES Plugin) using the best TextCat config determined in T118287. (A/B test depends on T121538; A/B/C test could benefit from T121540)


Event Timeline

TJones raised the priority of this task from to Needs Triage.
TJones updated the task description. (Show Details)
TJones added a project: CirrusSearch.
TJones subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.

Although not an absolute blocker, we think that T123537: Generate wikitext-based and query-based language models for TextCat should be done before this. Maybe it should block this?

Deskana moved this task from Needs triage to On Sprint Board on the Discovery-ARCHIVED board.
Deskana subscribed.

I don't think T123537 is a blocker for doing A/B tests on enwiki. We have query-based models for the languages relevant to enwiki, so we don't need the wikitext-based models.

Change 268048 had a related patch set uploaded (by EBernhardson):
A/B/C test of control vs textcat vs accept-lang textcat

https://gerrit.wikimedia.org/r/268048

I was looking at this after a comment Stas made about Italian, and I realized that the set of languages currently in LM-query under TextCat is not the ideal one for this test.

Portuguese and Japanese are missing—a minor issue, since there are not many Portuguese or Japanese queries on enwiki. Hebrew, Armenian, Georgian, Tamil, and Telugu are present; they won't do much, but they won't hurt.

However, French and German are present, and they both tend to get many more false positives than true positives. Not a ton, but they will bring the overall performance down on enwiki.

As I understand it, the PHP version of TextCat doesn't yet have the ability to specify/limit languages, other than by what's in the requested directory.

Do we want to patch LM-query/ before the A/B test?
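To make the limitation concrete, here is a minimal sketch of TextCat-style language identification (rank-order n-gram profiles, after Cavnar & Trenkle) with an explicit language whitelist. The function and parameter names are hypothetical, not the PHP TextCat API; at the time, the PHP port could only be restricted by choosing which model files sat in the language-model directory.

```python
# Illustrative TextCat-style n-gram language guesser with a whitelist.
# Hypothetical sketch; not the actual CirrusSearch/TextCat PHP code.
from collections import Counter

def ngram_profile(text, max_n=3, top_k=400):
    """Build a rank-order profile: n-grams (1..max_n) ranked by frequency."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    ranked = [g for g, _ in counts.most_common(top_k)]
    return {g: rank for rank, g in enumerate(ranked)}

def classify(query, models, allowed=None):
    """Score the query against each model by out-of-place distance
    (lower wins); `allowed` restricts scoring to a subset of languages."""
    qprof = ngram_profile(query)
    langs = allowed if allowed is not None else list(models)
    best_lang, best_dist = None, float("inf")
    for lang in langs:
        model = models[lang]
        penalty = len(model)  # cost for an n-gram absent from the model
        dist = sum(abs(rank - model[g]) if g in model else penalty
                   for g, rank in qprof.items())
        if dist < best_dist:
            best_lang, best_dist = lang, dist
    return best_lang
```

With a whitelist, dropping French and German from consideration would be a one-line change in the caller rather than a change to the model directory.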

New languages have been added in T121539.

Deskana raised the priority of this task from Medium to High. Apr 12 2016, 10:08 PM

Increasing priority, as this is a Q4 goal for the Search Team.

Deskana changed the task status from Open to Stalled. Apr 12 2016, 10:16 PM
Deskana added a subscriber: mpopov.

The primary outstanding question for this task is how to measure the effectiveness of the test. This was discussed briefly in a sprint planning meeting today, but @EBernhardson and @mpopov didn't come to a conclusion. @mpopov will schedule a meeting to discuss this. Marking as stalled until that's done.

Met & discussed: https://lists.wikimedia.org/pipermail/discovery/2016-April/001043.html

Erik, Trey, David, Kevin, and I met this morning to discuss how we're going to handle data collection for the upcoming TextCat test. A big problem in this particular case is that the system wasn't designed/engineered in a way that's conducive to cross-wiki logging / session tracking. And we recently lost the ability to use referrer info to see which page a user came from when moving between wikis. (I was told this was done for user privacy reasons.)

Erik said he had recently implemented a click event in the TestSearchSatisfaction2 schema that we might be able to hook into to measure clickthrough rate for users who are eligible for TextCat language detection and are shown results in the language their non-English query was probably written in. Whether we use this, and how much we rely on it as a measure of TextCat's success (beyond just measuring its impact on the zero results rate), depends on validating the click events and comparing them to page visit events (which cannot be fired in an interwiki context).

We also discussed an alternative approach that uses web requests, with the caveat that if a user is selected for the test once, they'll be selected every time. So if a particular IP+UA combination is part of the test and performs 2 million searches (as is sometimes the case), we'll have to do some very careful filtering, which will also exclude some completely valid use cases (e.g. a computer lab in a school, or a country with only 2 public IP addresses). But we're shooting for being able to use TestSearchSatisfaction2.
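The stickiness caveat above follows directly from how request-based sampling is usually done: bucketing on a hash of IP + User-Agent is deterministic, so the same combination lands in the same bucket on every request. A minimal sketch under assumed bucket names and an assumed 1-in-10 sampling rate (not the production values):

```python
# Hypothetical request-based bucketing sketch; arm names and
# sampling rate are illustrative, not the actual test configuration.
import hashlib

BUCKETS = ["control", "textcat", "textcat-accept-lang"]

def assign_bucket(ip, user_agent, sample_rate=10):
    """Deterministically assign 1 in `sample_rate` IP+UA pairs to a test arm.
    Returns None for requests outside the sample."""
    digest = hashlib.sha256(f"{ip}|{user_agent}".encode()).digest()
    value = int.from_bytes(digest[:8], "big")
    if value % sample_rate != 0:
        return None  # not sampled into the test
    return BUCKETS[(value // sample_rate) % len(BUCKETS)]
```

Because the assignment is a pure function of IP and UA, a school lab behind one IP with identical browsers is either entirely in the test or entirely out of it, which is exactly the filtering problem described above.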

Will add validation of click events in TestSearchSatisfaction2 (T132706) as a blocker.

Deskana renamed this task from Do an A/B Test on enwiki using TextCat for Language Identification to Write and deploy an A/B Test on enwiki using TextCat for Language Identification. May 3 2016, 9:49 PM

I changed the task title slightly to more accurately reflect the sequence of events.

After discussion, the way we are running this test has changed slightly. The patch above runs a backend-only test, which doesn't collect as much data as we would like for analysis. I will rework it to run the test through our frontend search satisfaction schema.

Change 287674 had a related patch set uploaded (by EBernhardson):
Allowing triggering user tests from query parameter

https://gerrit.wikimedia.org/r/287674

@mpopov What additional metrics should we collect into the satisfaction schema for users in the textcat test?

Some I'm thinking might be useful:

  • # of interwiki results provided
  • boolean indicating if click event was interwiki or not

These might be unnecessary, though; aren't we mostly just looking at whether users click through more or not?
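The metric in question reduces to a per-bucket clickthrough rate, optionally split by whether the clicked result was interwiki. A toy computation over invented event records (real data would come from the TestSearchSatisfaction2 EventLogging schema):

```python
# Toy per-bucket CTR computation; the event record shape is invented
# for illustration, not the TestSearchSatisfaction2 schema.
from collections import defaultdict

def ctr_by_bucket(events):
    """events: dicts with 'bucket', 'clicked' (bool), 'interwiki' (bool).
    Returns {bucket: (overall_ctr, interwiki_ctr)}."""
    totals = defaultdict(lambda: [0, 0, 0])  # searches, clicks, interwiki clicks
    for e in events:
        t = totals[e["bucket"]]
        t[0] += 1
        if e["clicked"]:
            t[1] += 1
            if e["interwiki"]:
                t[2] += 1
    return {b: (c / n, iw / n) for b, (n, c, iw) in totals.items()}
```

If only overall clickthrough matters, the two extra fields proposed above could indeed be dropped, at the cost of not being able to attribute any CTR change to the interwiki results specifically.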

Change 287677 had a related patch set uploaded (by EBernhardson):
Add textcat subtest

https://gerrit.wikimedia.org/r/287677

Change 287674 merged by jenkins-bot:
Allowing triggering user tests from query parameter

https://gerrit.wikimedia.org/r/287674

Change 288313 had a related patch set uploaded (by EBernhardson):
Adjust textcat data collection for AB test

https://gerrit.wikimedia.org/r/288313

This should now be ready to go. Ideally we want to ship this in the Thursday afternoon SWAT window.

Change 288313 merged by jenkins-bot:
Adjust textcat data collection for AB test

https://gerrit.wikimedia.org/r/288313

Change 288465 had a related patch set uploaded (by EBernhardson):
Adjust textcat data collection for AB test

https://gerrit.wikimedia.org/r/288465

Change 287677 merged by jenkins-bot:
Add textcat subtest

https://gerrit.wikimedia.org/r/287677

Change 288509 had a related patch set uploaded (by EBernhardson):
Add textcat subtest

https://gerrit.wikimedia.org/r/288509

Change 268048 merged by jenkins-bot:
A/B/C test of control vs textcat vs accept-lang textcat

https://gerrit.wikimedia.org/r/268048

Change 288465 merged by jenkins-bot:
Adjust textcat data collection for AB test

https://gerrit.wikimedia.org/r/288465

Mentioned in SAL [2016-05-12T23:54:13Z] <dereckson@tin> Synchronized php-1.28.0-wmf.1/extensions/CirrusSearch/includes/CirrusSearch.php: Adjust textcat data collection for AB test (T121542) (duration: 00m 26s)

Change 288509 merged by jenkins-bot:
Add textcat subtest

https://gerrit.wikimedia.org/r/288509

debt subscribed.

Looks like this is resolved - closing.