Fix Glent M2 CJK suggestion tokenization
Closed, Resolved · Public · 3 Estimated Story Points

Description

User story: As a user of a CJK Wikipedia, I want suggestions with proper token breaks so the suggestions get the right results.

Notes: While reviewing Chinese data for T244800, I noticed that tokens in suggestions were being run together. This is sub-optimal for Chinese queries because users intentionally break up words to prevent incorrect tokenization (a holdover from the bigram days). As an example, a query like AB CD cannot be incorrectly tokenized as A BC D, but ABCD could. If the user searches for AB Cd, then AB CD (with a space) is a better suggestion than ABCD (without a space).

This is particularly terrible for Latin and other non-CJK tokens. Instead of a suggestion like john smith 探险家, we are generating johnsmith探险家, the Latin part of which will definitely not be tokenized correctly.
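For illustration, here's a minimal Java sketch (not the actual Glent code) of the difference between the buggy rejoin and a separator-preserving one:

```java
// Minimal sketch of the bug: rejoining analyzed tokens without their
// separators vs. keeping a space between them. Not Glent's actual code.
import java.util.List;

public class JoinDemo {
    // Buggy behavior: all tokens run together.
    static String naiveJoin(List<String> tokens) {
        return String.join("", tokens);
    }

    // Desired behavior: token breaks preserved.
    static String spacedJoin(List<String> tokens) {
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("john", "smith", "探险家");
        System.out.println(naiveJoin(tokens));  // johnsmith探险家
        System.out.println(spacedJoin(tokens)); // john smith 探险家
    }
}
```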

Update: After not finding the problem in the expected place (the Chinese analysis/tokenizer), I discovered that it's a problem for Japanese and Korean, too; I just didn't find it in my sample:

  • TOFU BEST ~ウチらのトーフビーツ~ → tofubestウチらのトーフビーツ
  • 이아코바 이탈렐리(Iakoba Taeia Italeli) → 이아코바이탈렐리iakobataeiaitaleli

It's also concatenating tokens across punctuation and not just spaces.

Acceptance criteria: The CJK M2 analysis chain passes unit tests showing that Latin and other non-CJK tokens are not run together.

NB: This should probably be completed and deployed before any A/B test is run for M2, though the problem is not super common.

Event Timeline

TJones renamed this task from Review Chinese Analysis Chain for Glent M2 to Fix Chinese Analysis Chain for Glent M2. Oct 19 2020, 5:17 PM
Gehel triaged this task as High priority. Oct 28 2020, 1:28 PM
TJones renamed this task from Fix Chinese Analysis Chain for Glent M2 to Fix Glent M2 CJK suggestion tokenization. Mar 3 2021, 11:40 PM
TJones updated the task description.

Change 670257 had a related patch set uploaded (by Tjones; owner: Tjones):
[search/glent@master] Fix Glent M2 CJK suggestion tokenization

https://gerrit.wikimedia.org/r/670257

I added a param to the tokenizer to preserve token separation when creating M2 suggestions. That broke suggestion creation because all single-character tokens were being considered. I limited that to CJK characters, which also prevents trying to use Latin and other non-CJK characters in suggestions. (Oddly, the Chinese analyzer splits Greek and Cyrillic words into individual letters, so this also prevents trying to get suggestions out of в, и, к, и, п, е, д, и, ю, for example.)
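As a rough sketch of that check (the names here are hypothetical, not Glent's actual API):

```java
// Hypothetical illustration of keeping single-character suggestion
// tokens only when the character is CJK. Not the merged Glent code.
public class CjkTokenFilter {
    static boolean isCjkChar(int codePoint) {
        Character.UnicodeScript s = Character.UnicodeScript.of(codePoint);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.HANGUL;
    }

    // Single-character tokens are kept only when CJK; this also drops the
    // single Greek/Cyrillic letters the Chinese analyzer emits.
    static boolean keepAsCandidate(String token) {
        return token.codePointCount(0, token.length()) > 1
            || isCjkChar(token.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(keepAsCandidate("探"));    // true: single CJK char
        System.out.println(keepAsCandidate("в"));    // false: single Cyrillic letter
        System.out.println(keepAsCandidate("john")); // true: multi-character token
    }
}
```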

Added a bunch of tests to GlentUtilsTest to cover the relevant cases and refactored it a bit. Added some new tests elsewhere because I thought they might be the source of the problem, even though they weren't. More tests is better tests, eh?

Also added some logic to prevent duplicate suggestion token candidates from being created. It ain't much but it's honest work.
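Roughly, the dedupe amounts to something like this (a hypothetical sketch, not the merged code):

```java
// Hypothetical sketch: drop duplicate suggestion token candidates while
// preserving first-seen order. Not the actual Glent implementation.
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class CandidateDedupe {
    static List<String> dedupe(List<String> candidates) {
        // LinkedHashSet removes repeats but keeps insertion order.
        return new ArrayList<>(new LinkedHashSet<>(candidates));
    }

    public static void main(String[] args) {
        System.out.println(dedupe(List.of("探险家", "john", "探险家")));
        // [探险家, john]
    }
}
```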

Change 670257 merged by jenkins-bot:
[search/glent@master] Fix Glent M2 CJK suggestion tokenization

https://gerrit.wikimedia.org/r/670257

Seems we are about ready. Should I run a release on Glent and update Airflow with the new jar?

Yeah, I think so!