User story: As a user of a CJK Wikipedia, I want only suggestions with proper token breaks so the suggestions get the right results.
T265081 fixes the M2 tokenization problems, but old suggestions with poor tokenization are still in the database tables. It's not entirely clear what the best way to fix them is. Some options:
- Find some way to remove the old suggestions and repopulate the tables with freshly generated suggestions that use the new code; this may require database surgery or other excess cleverness.
- Double-check that we have 90 days of data for each of the CJK languages, then delete all existing suggestions and start over collecting them, using the last 90 days' worth of data.
- Do something extra clever and only delete queries and suggestions with spaces, or with spaces between non-CJK characters.
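To make the last option more concrete, here is a minimal sketch of how one might flag strings that contain a space between two non-CJK characters. Everything in it is an assumption for illustration: the function names, the (query, suggestion) pair shape, and the simplified set of Unicode ranges treated as CJK (Hiragana, Katakana, common Han, Hangul syllables). It is not the actual M2 code or table schema.

```python
import re

# Assumption: a simplified notion of "CJK character" covering Hiragana and
# Katakana (U+3040-U+30FF), Han Extension A (U+3400-U+4DBF), the main Han
# block (U+4E00-U+9FFF), and Hangul syllables (U+AC00-U+D7AF).
CJK_RANGES = "\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af"

# A non-space, non-CJK character, then whitespace, then another
# non-space, non-CJK character.
SPACE_BETWEEN_NON_CJK = re.compile(rf"[^\s{CJK_RANGES}]\s+[^\s{CJK_RANGES}]")


def has_space_between_non_cjk(text: str) -> bool:
    """Return True if text has whitespace between two non-CJK characters."""
    return SPACE_BETWEEN_NON_CJK.search(text) is not None


def pairs_to_delete(pairs):
    """Hypothetical filter: keep the (query, suggestion) pairs whose query
    contains a space between non-CJK characters, i.e. deletion candidates."""
    return [(q, s) for q, s in pairs if has_space_between_non_cjk(q)]
```

Note that this treats punctuation as non-CJK, so a real cleanup would probably need a more careful character classification before deleting anything.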
Acceptance Criteria:
- The M2 suggestion table no longer contains suggestions in which the spaces between non-CJK tokens (Latin, Cyrillic, etc.) have been removed.