User story: As a user of a CJK Wikipedia, I want suggestions with proper token breaks so the suggestions get the right results.
Notes: While reviewing Chinese data for T244800, I noticed that tokens in suggestions were being run together. This is sub-optimal for Chinese queries because users intentionally break up words to prevent incorrect tokenization (a holdover from the bigram days). As an example, a query like AB CD cannot be incorrectly tokenized as A BC D, but ABCD could. If the user searches for AB Cd, then AB CD (with a space) is a better suggestion than ABCD (without a space).
This is particularly terrible for Latin and other non-CJK tokens. Instead of a suggestion like john smith 探险家, we are generating johnsmith探险家, the Latin part of which will definitely not be tokenized correctly.
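To illustrate the AB CD versus ABCD point above, here is a minimal sketch that treats whitespace as a hard token boundary and enumerates every way the remaining unbroken runs could be split. The function names are hypothetical and this is not the production tokenizer; it only shows why a space rules out the A BC D reading.

```python
from itertools import product

def segmentations(run):
    """Return every way to split one unbroken run of characters into tokens."""
    if not run:
        return [[]]
    results = []
    for i in range(1, len(run) + 1):
        head = run[:i]
        for rest in segmentations(run[i:]):
            results.append([head] + rest)
    return results

def possible_tokenizations(query):
    """Whitespace is a hard boundary; only the runs between spaces can be split further."""
    per_run = [segmentations(run) for run in query.split()]
    # Combine one segmentation per run, flattening each combination into a single token list.
    return [sum(combo, []) for combo in product(*per_run)]

bad_split = ["A", "BC", "D"]
print(bad_split in possible_tokenizations("ABCD"))   # the run-together form allows it
print(bad_split in possible_tokenizations("AB CD"))  # the space rules it out
```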
Update: After not finding the problem in the expected place (the Chinese analysis/tokenizer), I discovered that it's a problem for Japanese and Korean, too; I just didn't find it in my sample:
- TOFU BEST ~ウチらのトーフビーツ~ → tofubestウチちのトーフビーツ
- 이아코바 이탈렐리(Iakoba Taeia Italeli) → 이아코바이탈랠리iakobataeiaitaleli
It's also concatenating tokens across punctuation and not just spaces.
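The shape of the bug and of the desired behavior can be sketched as follows. This assumes a hypothetical stand-in tokenizer (a regex that lowercases and splits on whitespace and punctuation) rather than the real CJK M2 analysis chain, and `tokenize` and `build_suggestion` are illustrative names only. The point is just that rebuilding a suggestion with an empty separator loses the token breaks, while a space separator keeps them.

```python
import re

def tokenize(text):
    """Hypothetical stand-in tokenizer: lowercase, then split on whitespace and punctuation."""
    return re.findall(r"\w+", text.lower())

def build_suggestion(tokens, separator=" "):
    """Rebuild a suggestion string from tokens; the separator decides whether breaks survive."""
    return separator.join(tokens)

tokens = tokenize("John Smith 探险家")
print(build_suggestion(tokens, separator=""))  # tokens run together, as in the bug
print(build_suggestion(tokens))                # token breaks preserved

# Punctuation acts as a token boundary here too, not just spaces.
print(build_suggestion(tokenize("AB(CD)")))
```

A unit test for the acceptance criteria could take roughly this form: feed in mixed CJK and Latin input and assert that the Latin tokens in the output are still separated.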
Acceptance criteria: The CJK M2 analysis chain passes unit tests confirming that Latin and/or other non-CJK tokens are not run together.
NB: This should probably be completed and deployed before any A/B test is run for M2, though the problem is not super common.