Luganda language extension #10847
Conversation
You can click on the red X to see why a build failed, though in this case the most recent build succeeded so you're fine. We'll take a look at this, thanks for the submission!
Thank you, I am looking forward to your feedback.
Thanks for the PR, this is a great start! I added some comments about how to configure the LIKE_NUM modifications for LEX_ATTRS.
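For context, spaCy language modules expose LIKE_NUM through a LEX_ATTRS dict, following the pattern in spacy/lang/en/lex_attrs.py. A minimal sketch of that pattern is below; the Luganda number words are illustrative placeholders, not a vetted list:

```python
# Sketch of spaCy's LEX_ATTRS pattern for LIKE_NUM, modeled on
# spacy/lang/en/lex_attrs.py. The Luganda number words below are
# illustrative placeholders, not a vetted list.
_num_words = ["emu", "bbiri", "ssatu"]

def like_num(text: str) -> bool:
    """Heuristic: does this token text look like a number?"""
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text.lower() in _num_words

# In the language module this is then exported as, e.g.:
#   from ...attrs import LIKE_NUM
#   LEX_ATTRS = {LIKE_NUM: like_num}
```

The function itself is plain Python, so the digit/fraction handling can be shared across languages and only `_num_words` needs to be language-specific.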
spacy/lang/lg/stop_words.py
Outdated
@@ -0,0 +1,20 @@
# stop words as a whitespace-separated list
Do you have a source (link or citation) for the stop words?
Aside from the ordinal English number issue mentioned above, I think this PR is looking good for initial support for Luganda. Were you planning on adding any custom tokenizer settings? I think it would be nice to have a few example sentences, as in https://github.com/explosion/spaCy/blob/master/spacy/lang/en/examples.py
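For reference, an examples.py module in spaCy is just a module-level list of sentences. A minimal sketch of the convention, using a Luganda sentence quoted elsewhere in this thread:

```python
# Sketch of the spacy/lang/<code>/examples.py convention:
# a plain module-level list of example sentences for the language.
sentences = [
    "Abooluganda ab’emmamba ababiri",  # sentence quoted elsewhere in this thread
]
```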
Alright, let me do that.
…On Mon, Jul 4, 2022 at 5:09 PM Sofie Van Landeghem ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In spacy/lang/lg/stop_words.py
<#10847 (comment)>:
> +STOP_WORDS=set(
+ "wa lwa si ebyo nti anti nanti okutuusa tu wandi wa kiki kki dda"
+ "a singa oluvannyuma neera yenna nze ne kyonna ba nga ku beera kubanga"
+ "byombi naye osobola buli okuva kuva teyalina talina bayina byonna yonna byaffe be"
+ "bombi tebaalina tayina bonna zonna tayina tebaalina teyayina tetulina alina wano bimu abadde waliwo"
+ "bangi wakati ejja omuli ebyo nabo balina kuwa kyaffe olwekyo"
+ "buva bwaffe yonna ddala liryo yaffe terina kennyini ye bwonna bokka abalala bulungi kirungi ebweru"
+ "obulungi leero bya kikye yina atya munda ziba byabwe tewali erimu engeri ffenna lyange okudda kudda ebiri twafuna nnyingi lyabwe"
+ "zaabwe mu endala lyaffe kye nnyini tebayina yennyini ga bibye ayinza ali kikino nandi"
+ "ye nyinza ateekeddwa tetuteekeddwa neetaaga seetaaga nedda edda kati ku gumu gujja oba ekirala wabweru waggulu"
+ "nnina byebimu n'olwekyo ekyo bo abava bingi abangi ojja bangi waliyo bino bwabwe bandi bajja ajja wansi bulijjo kaseera ba"
+ "balina kino ebyo ku nnyo ennyo okutuusa bwayo yabadde ffe tu-yina kyekimu"
+ "oyo babadde baali tebaali ki kiki ddi wa ani lwaki ne gwe wandi oli oyina kikyo e mu wange ku bwe wa bajja"
+ "newankubade sinakindi n'olwekyo okuggyako gunno guno bateekeddwa oba gwe mwe"
+ "gyabwe erina tolina ebimu mingi zijja ffe nanti anti naye ate"
+ "wamu awamu baweebwa aweebwa weebwa era wadde mpozzi ekyo oyo kati kyekyo oluvannyuma kwegamba nandiyagadde wadde kubanga"
+ "olwokuba wabula nnyo nnyini nnyinza tuyina tulina tayina balina bali okuwa twetaaga okugenda bayina alina mulina"
+ "oyina olina abamu bano ye otya ki ono gwa nabadde mbadde".split()
+)
I would format this as
STOP_WORDS = set(
"""
wa lwa si ...
...
... nabadde mbadde
""".split()
)
and sort this alphabetically
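The suggested triple-quoted form also avoids a real pitfall in the diff above: adjacent Python string literals concatenate with no separator, so the last word of one literal silently fuses with the first word of the next. A two-line toy version of the pattern:

```python
# Adjacent string literals concatenate with no separator, so the
# multi-literal form silently fuses words at the join points.
fused = set(
    "wa lwa dda"
    "a singa".split()
)
# "wa lwa dda" + "a singa" == "wa lwa ddaa singa"

safe = set(
    """
    wa lwa dda
    a singa
    """.split()
)
print(sorted(fused))  # ['ddaa', 'lwa', 'singa', 'wa']
print(sorted(safe))   # ['a', 'dda', 'lwa', 'singa', 'wa']
```

In the fused version, 'dda' and 'a' never make it into the set at all; the triple-quoted form keeps every word intact.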
Thanks for the updates, this is looking good! In a second I'll try to make a few minor edits and reformat so this is ready to merge...
Actually, one more question: what is the intended tokenization of strings like 'ab’emmamba' and "ky'ebyenjigiriza"? When I try out the examples (thanks for adding a few!), I get the tokenization:
Abooluganda ab’emmamba ababiri ['Abooluganda', 'ab’emmamba', 'ababiri']
Ekisaawe ky'ebyenjigiriza kya mugaso nnyo ['Ekisaawe', "ky'ebyenjigiriza", 'kya', 'mugaso', 'nnyo']
From the stop words, it looks like you're expecting "ky'" to be a separate token? If I know what the tokenization is intended to be, I can add a few tokenizer tests and help adjust the tokenizer settings.
Thank you for your feedback. First, about tokenization: I have consulted the
language expert, and if we take the sentence below:
sentence: Abooluganda ab’emmamba ababiri
we can tokenize it as
['Abooluganda', 'ab’emmamba', 'ababiri']
About the 'ky'', I will remove it and update the repo. Otherwise, thank
you for the guidance.
Regards
…On Thu, Jul 14, 2022 at 10:46 AM Adriane Boyd ***@***.***> wrote:
Actually, one more question: what is the intended tokenization of strings
like 'ab’emmamba' and "ky'ebyenjigiriza"? When I try out the examples
(thanks for adding a few!), I get the tokenization:
Abooluganda ab’emmamba ababiri ['Abooluganda', 'ab’emmamba', 'ababiri']
Ekisaawe ky'ebyenjigiriza kya mugaso nnyo ['Ekisaawe', "ky'ebyenjigiriza", 'kya', 'mugaso', 'nnyo']
From the stop words, it looks like you're expecting "ky'" to be a separate
token?
If I know what the tokenization is intended to be, I can add a few
tokenizer tests and help adjust the tokenizer settings.
Do you have a source for the stop words?
I'm still a bit confused about the tokenizer settings vs. stop words. Is ky' ever a separate token and not just a prefix? With the current tokenizer settings, none of the stop words with ' will end up as separate tokens, so the stop words with apostrophes might not make sense.
For example:
import spacy
nlp = spacy.blank("lg")
doc = nlp("Ekiwandiiko ky'olunaku")
print([t.text for t in doc])  # ['Ekiwandiiko', "ky'olunaku"]
I will add some basic tokenizer tests in a minute with the example above.
You're right, the ky' is not a separate token; it is a prefix. According to the
discussion with the Luganda experts, they indicated that the word should
remain "ky'olunaku" when tokenized.
As for the source of the stop words, it is not yet published; it was a list
generated by the experts here.
…On Wed, Jul 27, 2022 at 10:24 AM Adriane Boyd ***@***.***> wrote:
Do you have a source for the stop words?
I'm still a bit confused about the tokenizer settings vs. stop words.
Is ky' ever a separate token and not just a prefix? With the current
tokenizer settings, none of the stop words with ' will end up as separate
tokens, so the stop words with apostrophes might not make sense.
For example:
import spacy
nlp = spacy.blank("lg")
doc = nlp("Ekiwandiiko ky'olunaku")
print([t.text for t in doc])  # ['Ekiwandiiko', "ky'olunaku"]
I will add some basic tokenizer tests in a minute with the example above.
I'm worried that users will be confused in the future because "ky'" is a stop word but never a separate token that could be marked as a stop word. Does it make sense to remove all these stop words?
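The concern above can be sketched as a quick check; the lists here are short illustrative subsets, not the full lists from the PR:

```python
# Quick check (illustrative subsets, not the full lists): stop words ending
# in an apostrophe can never match a token if the tokenizer keeps forms
# like ky'olunaku as a single token, so they are unreachable as stop words.
contractions = ["b'", "bw'", "ky'", "n'"]
stop_words = ["wa", "lwa", "ky'", "nga", "n'"]

unreachable = sorted(w for w in stop_words if w in contractions)
print(unreachable)  # ["ky'", "n'"]
```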
I sent a new stop words list in the latest push which does not include
"ky'" and "b'". Those words were moved to the contractions.
I suggest that the stop words should stand, since the contractions are
distinct.
Kind regards
…On Wed, Aug 3, 2022 at 1:41 PM Adriane Boyd ***@***.***> wrote:
I'm worried that users will be confused in the future because "ky'" is a
stop word but never a separate token that could be marked as a stop word.
Does it make sense to remove all these stop words?
contractions = [
"b'",
"bw'",
"by'",
"eky'",
"ey'",
"ez'",
"g'",
"gw'",
"gy'",
"ky'",
"lw'",
"ly'",
"n'",
"ng'",
"olw'",
"ow'",
"w'",
"y'",
"z'",
]
Sorry for the delay, I thought I should wait on an update because in the current version the contractions are still added to the stop words. If the contractions are removed, then I think this is fine to merge. Let me go ahead and do that... We're actually planning to remove the default stop word lists for v4, but I was hoping to leave all the stop words in v3 as a useful reference for users.
Thanks again for the PR! We'll mention Luganda in the release notes for the next release (probably v3.4.2).
This is an initiative to add the Luganda language, from Uganda, East Africa, to spaCy.