Luganda language extension #10847
Conversation
You can click on the red X to see why a build failed, though in this case the most recent build succeeded so you're fine. We'll take a look at this, thanks for the submission!
Thank you, I am looking forward to your feedback.
Thanks for the PR, this is a great start! I added some comments about how to configure the LIKE_NUM modifications for LEX_ATTRS.
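For context, spaCy language modules expose LIKE_NUM through a LEX_ATTRS dict, following the pattern in spacy/lang/en/lex_attrs.py. A minimal sketch of that pattern is below; the Luganda number words are illustrative placeholders, not a vetted list:

```python
# Sketch of spaCy's LEX_ATTRS pattern for LIKE_NUM, modeled on
# spacy/lang/en/lex_attrs.py. The Luganda number words below are
# illustrative placeholders, not a vetted list.
_num_words = ["emu", "bbiri", "ssatu"]

def like_num(text: str) -> bool:
    """Heuristic: does this token text look like a number?"""
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text.lower() in _num_words

# In the language module this is then exported as, e.g.:
#   from ...attrs import LIKE_NUM
#   LEX_ATTRS = {LIKE_NUM: like_num}
```

The function itself is plain Python, so the digit/fraction handling can be shared across languages and only `_num_words` needs to be language-specific.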
spacy/lang/lg/stop_words.py
Outdated
@@ -0,0 +1,20 @@
# stop words as a whitespace-separated list
Do you have a source (link or citation) for the stop words?
Aside from the ordinal English number issue mentioned above, I think this PR is looking good for initial support for Luganda. Were you planning on adding any custom tokenizer settings? I think it would be nice to have a few example sentences, as in https://github.com/explosion/spaCy/blob/master/spacy/lang/en/examples.py
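For reference, an examples.py module in spaCy is just a module-level list of sentences. A minimal sketch of the convention, using a Luganda sentence quoted elsewhere in this thread:

```python
# Sketch of the spacy/lang/<code>/examples.py convention:
# a plain module-level list of example sentences for the language.
sentences = [
    "Abooluganda ab’emmamba ababiri",  # sentence quoted elsewhere in this thread
]
```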
Alright, let me do that.
…On Mon, Jul 4, 2022 at 5:09 PM Sofie Van Landeghem ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In spacy/lang/lg/stop_words.py
<#10847 (comment)>:
> +STOP_WORDS=set(
+ "wa lwa si ebyo nti anti nanti okutuusa tu wandi wa kiki kki dda"
+ "a singa oluvannyuma neera yenna nze ne kyonna ba nga ku beera kubanga"
+ "byombi naye osobola buli okuva kuva teyalina talina bayina byonna yonna byaffe be"
+ "bombi tebaalina tayina bonna zonna tayina tebaalina teyayina tetulina alina wano bimu abadde waliwo"
+ "bangi wakati ejja omuli ebyo nabo balina kuwa kyaffe olwekyo"
+ "buva bwaffe yonna ddala liryo yaffe terina kennyini ye bwonna bokka abalala bulungi kirungi ebweru"
+ "obulungi leero bya kikye yina atya munda ziba byabwe tewali erimu engeri ffenna lyange okudda kudda ebiri twafuna nnyingi lyabwe"
+ "zaabwe mu endala lyaffe kye nnyini tebayina yennyini ga bibye ayinza ali kikino nandi"
+ "ye nyinza ateekeddwa tetuteekeddwa neetaaga seetaaga nedda edda kati ku gumu gujja oba ekirala wabweru waggulu"
+ "nnina byebimu n'olwekyo ekyo bo abava bingi abangi ojja bangi waliyo bino bwabwe bandi bajja ajja wansi bulijjo kaseera ba"
+ "balina kino ebyo ku nnyo ennyo okutuusa bwayo yabadde ffe tu-yina kyekimu"
+ "oyo babadde baali tebaali ki kiki ddi wa ani lwaki ne gwe wandi oli oyina kikyo e mu wange ku bwe wa bajja"
+ "newankubade sinakindi n'olwekyo okuggyako gunno guno bateekeddwa oba gwe mwe"
+ "gyabwe erina tolina ebimu mingi zijja ffe nanti anti naye ate"
+ "wamu awamu baweebwa aweebwa weebwa era wadde mpozzi ekyo oyo kati kyekyo oluvannyuma kwegamba nandiyagadde wadde kubanga"
+ "olwokuba wabula nnyo nnyini nnyinza tuyina tulina tayina balina bali okuwa twetaaga okugenda bayina alina mulina"
+ "oyina olina abamu bano ye otya ki ono gwa nabadde mbadde".split()
+)
I would format this as
STOP_WORDS = set(
"""
wa lwa si ...
...
... nabadde mbadde
""".split()
)
and sort this alphabetically
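The suggested triple-quoted form also avoids a real pitfall in the diff above: adjacent Python string literals concatenate with no separator, so the last word of one literal silently fuses with the first word of the next. A two-line toy version of the pattern:

```python
# Adjacent string literals concatenate with no separator, so the
# multi-literal form silently fuses words at the join points.
fused = set(
    "wa lwa dda"
    "a singa".split()
)
# "wa lwa dda" + "a singa" == "wa lwa ddaa singa"

safe = set(
    """
    wa lwa dda
    a singa
    """.split()
)
print(sorted(fused))  # ['ddaa', 'lwa', 'singa', 'wa']
print(sorted(safe))   # ['a', 'dda', 'lwa', 'singa', 'wa']
```

In the fused version, 'dda' and 'a' never make it into the set at all; the triple-quoted form keeps every word intact.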
Thanks for the updates, this is looking good! In a second I'll try to make a few minor edits and reformat so this is ready to merge...
Actually, one more question: what is the intended tokenization of strings like 'ab’emmamba' and "ky'ebyenjigiriza"? When I try out the examples (thanks for adding a few!), I get the tokenization:
Abooluganda ab’emmamba ababiri ['Abooluganda', 'ab’emmamba', 'ababiri']
Ekisaawe ky'ebyenjigiriza kya mugaso nnyo ['Ekisaawe', "ky'ebyenjigiriza", 'kya', 'mugaso', 'nnyo']
From the stop words, it looks like you're expecting "ky'" to be a separate token? If I know what the tokenization is intended to be, I can add a few tokenizer tests and help adjust the tokenizer settings.
Thank you for your feedback. First, about tokenization: I have consulted the
language expert, and if we take the sentence below:
sentence: Abooluganda ab’emmamba ababiri
we can tokenize it as
['Abooluganda', 'ab’emmamba', 'ababiri']
About the 'ky'', I will remove it and update the repo. Otherwise, thank
you for the guidance.
Regards
…On Thu, Jul 14, 2022 at 10:46 AM Adriane Boyd ***@***.***> wrote:
Actually, one more question: what is the intended tokenization of strings
like 'ab’emmamba' and "ky'ebyenjigiriza"? When I try out the examples
(thanks for adding a few!), I get the tokenization:
Abooluganda ab’emmamba ababiri ['Abooluganda', 'ab’emmamba', 'ababiri']
Ekisaawe ky'ebyenjigiriza kya mugaso nnyo ['Ekisaawe', "ky'ebyenjigiriza", 'kya', 'mugaso', 'nnyo']
From the stop words, it looks like you're expecting "ky'" to be a separate
token?
If I know what the tokenization is intended to be, I can add a few
tokenizer tests and help adjust the tokenizer settings.
Do you have a source for the stop words?
I'm still a bit confused about the tokenizer settings vs. stop words. Is ky' ever a separate token and not just a prefix? With the current tokenizer settings, none of the stop words with ' will end up as separate tokens, so the stop words with apostrophes might not make sense.
For example:
import spacy
nlp = spacy.blank("lg")
doc = nlp("Ekiwandiiko ky'olunaku")
print([t.text for t in doc])  # ['Ekiwandiiko', "ky'olunaku"]
I will add some basic tokenizer tests in a minute with the example above.
You're right, the ky' is not a separate token; it is a prefix. According to the
discussion with the Luganda experts, they indicated that the word should
remain "ky'olunaku" when tokenized.
As for the source of the stop words, it is not yet published; it was a list
generated by the experts here.
…On Wed, Jul 27, 2022 at 10:24 AM Adriane Boyd ***@***.***> wrote:
Do you have a source for the stop words?
I'm still a bit confused about the tokenizer settings vs. stop words.
Is ky' ever a separate token and not just a prefix? With the current
tokenizer settings, none of the stop words with ' will end up as separate
tokens, so the stop words with apostrophes might not make sense.
For example:
import spacy
nlp = spacy.blank("lg")
doc = nlp("Ekiwandiiko ky'olunaku")
print([t.text for t in doc])  # ['Ekiwandiiko', "ky'olunaku"]
I will add some basic tokenizer tests in a minute with the example above.
I'm worried that users will be confused in the future because "ky'" is a stop word but never a separate token that could be marked as a stop word. Does it make sense to remove all these stop words?
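The concern above can be sketched as a quick check; the lists here are short illustrative subsets, not the full lists from the PR:

```python
# Quick check (illustrative subsets, not the full lists): stop words ending
# in an apostrophe can never match a token if the tokenizer keeps forms
# like ky'olunaku as a single token, so they are unreachable as stop words.
contractions = ["b'", "bw'", "ky'", "n'"]
stop_words = ["wa", "lwa", "ky'", "nga", "n'"]

unreachable = sorted(w for w in stop_words if w in contractions)
print(unreachable)  # ["ky'", "n'"]
```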
I sent a new stop words list in the latest push which does not include
"ky'" and "b'". Those words were moved to the contractions.
I suggest that the stop words should stand, since the contractions are
distinct.
Kind regards
…On Wed, Aug 3, 2022 at 1:41 PM Adriane Boyd ***@***.***> wrote:
I'm worried that users will be confused in the future because "ky'" is a
stop word but never a separate token that could be marked as a stop word.
Does it make sense to remove all these stop words?
contractions = [
"b'",
"bw'",
"by'",
"eky'",
"ey'",
"ez'",
"g'",
"gw'",
"gy'",
"ky'",
"lw'",
"ly'",
"n'",
"ng'",
"olw'",
"ow'",
"w'",
"y'",
"z'",
]
Sorry for the delay, I thought I should wait on an update because in the current version the contractions are still added to the stop words. If the contractions are removed, then I think this is fine to merge. Let me go ahead and do that... We're actually planning to remove the default stop word lists for v4, but I was hoping to leave all the stop words in v3 as a useful reference for users.
Thanks again for the PR! We'll mention Luganda in the release notes for the next release (probably v3.4.2).
This is an initiative to add the Luganda language, from Uganda, East Africa, to spaCy.