luganda language extension #10847

Merged: 13 commits merged on Aug 23, 2022

tobiusaolo
Contributor

This is an initiative to add Luganda, a language spoken in Uganda, East Africa, to spaCy.

@polm added the "enhancement" label (Feature requests and improvements) on May 26, 2022
@polm
Contributor
polm commented May 26, 2022

You can click on the red X to see why a build failed, though in this case the most recent build succeeded so you're fine. We'll take a look at this, thanks for the submission!

@tobiusaolo
Contributor Author

Thank you, I am looking forward to your feedback.

Contributor
@adrianeboyd left a comment

Thanks for the PR, this is a great start! I added some comments about how to configure the LIKE_NUM modifications for LEX_ATTRS.
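For readers unfamiliar with the pattern, here is a minimal sketch of a `like_num` function in the style used by other spaCy languages. The number words below are a small illustrative subset chosen for this example (an assumption), not the list from this PR.

```python
# Illustrative subset of Luganda cardinals (an assumption, not the PR's list).
_num_words = ["emu", "bbiri", "ssatu", "kkumi"]


def like_num(text):
    """Return True if the token text looks like a number."""
    # Strip a leading sign or approximation marker.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands separators and decimal points.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Accept simple fractions such as 3/4.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Accept spelled-out number words, case-insensitively.
    return text.lower() in _num_words


print(like_num("1,000"), like_num("3/4"), like_num("Bbiri"), like_num("nnyo"))
```

In a spaCy language package, a function like this is usually exposed as `LEX_ATTRS = {LIKE_NUM: like_num}`, with `LIKE_NUM` imported from `spacy.attrs`, and then assigned to `lex_attr_getters` in the language defaults.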

Review threads:
spacy/lang/lg/lex_attrs.py (outdated)
spacy/lang/lg/lex_attrs.py
spacy/lang/lg/__init__.py (outdated)
@svlandeg added the "lang / lg" label (Luganda language data and models) on Jun 10, 2022
@@ -0,0 +1,20 @@
#stopwords as whitespace-seperated list
Contributor

Do you have a source (link or citation?) for the stop words?

@adrianeboyd
Contributor

Aside from the ordinal English number issue mentioned above, I think this PR is looking good for initial support for Luganda.

Were you planning on adding any custom tokenizer settings (in punctuation.py or tokenizer_exceptions.py) or do the current defaults work well enough for now?

I think it would be nice to have a few example sentences in examples.py. You can choose your own sentences or translate sentences from another language like the English examples:

https://github.com/explosion/spaCy/blob/master/spacy/lang/en/examples.py
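As a rough sketch of what such a file could contain, assuming it mirrors the English file's layout (a module-level `sentences` list), the two Luganda sentences quoted later in this thread would look like this:

```python
# Hypothetical layout for spacy/lang/lg/examples.py, modeled on the English
# examples file linked above. The sentences are the two quoted in this thread.
sentences = [
    "Abooluganda ab’emmamba ababiri",
    "Ekisaawe ky'ebyenjigiriza kya mugaso nnyo",
]

# Quick sanity check: show a plain whitespace split of each sentence.
for sentence in sentences:
    print(sentence.split())
```

The whitespace split is only a sanity check here; it is not a substitute for running the sentences through the actual tokenizer.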

@tobiusaolo
Contributor Author
tobiusaolo commented Jul 4, 2022 via email

@adrianeboyd
Contributor

Thanks for the updates, this is looking good! In a second I'll try to make a few minor edits and reformat so this is ready to merge...

@adrianeboyd
Contributor
adrianeboyd commented Jul 14, 2022

Actually, one more question: what is the intended tokenization of strings like 'ab’emmamba' and "ky'ebyenjigiriza"? When I try out the examples (thanks for adding a few!), I get the tokenization:

Abooluganda ab’emmamba ababiri ['Abooluganda', 'ab’emmamba', 'ababiri']
Ekisaawe ky'ebyenjigiriza kya mugaso nnyo ['Ekisaawe', "ky'ebyenjigiriza", 'kya', 'mugaso', 'nnyo']

From the stop words, it looks like you're expecting "ky'" to be a separate token?

If I know what the tokenization is intended to be, I can add a few tokenizer tests and help adjust the tokenizer settings.

@tobiusaolo
Contributor Author
tobiusaolo commented Jul 15, 2022 via email

@polm added the "new language" label (Adding support for new languages to spaCy.) on Jul 24, 2022
@adrianeboyd
Contributor

Do you have a source for the stop words?

I'm still a bit confused about the tokenizer settings vs. stop words.

Is ky' ever a separate token and not just a prefix? With the current tokenizer settings, none of the stop words with ' will end up as separate tokens, so the stop words with apostrophes might not make sense.

For example:

import spacy

nlp = spacy.blank("lg")

doc = nlp("Ekiwandiiko ky'olunaku")
print([t.text for t in doc]) # ['Ekiwandiiko', "ky'olunaku"]

I will add some basic tokenizer tests in a minute with the example above.

@tobiusaolo
Contributor Author
tobiusaolo commented Jul 29, 2022 via email

@adrianeboyd
Contributor

I'm worried that users will be confused in the future because "ky'" is a stop word but never a separate token that could be marked as a stop word. Does it make sense to remove all these stop words?

contractions = [
    "b'",
    "bw'",
    "by'",
    "eky'",
    "ey'",
    "ez'",
    "g'",
    "gw'",
    "gy'",
    "ky'",
    "lw'",
    "ly'",
    "n'",
    "ng'",
    "olw'",
    "ow'",
    "w'",
    "y'",
    "z'",
]
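To make the concern concrete, here is a small plain-Python sketch (no spaCy required) that compares the contraction list with the tokenizer output quoted earlier in this thread. The token list is copied from that output; the variable names are illustrative.

```python
# Contractions from the list above.
contractions = [
    "b'", "bw'", "by'", "eky'", "ey'", "ez'", "g'", "gw'", "gy'", "ky'",
    "lw'", "ly'", "n'", "ng'", "olw'", "ow'", "w'", "y'", "z'",
]

# Tokens produced by the current tokenizer settings for "Ekiwandiiko ky'olunaku".
tokens = ["Ekiwandiiko", "ky'olunaku"]

# A stop word can only be flagged if a whole token equals it.
exact = [t for t in tokens if t.lower() in contractions]

# The contractions only ever show up as prefixes of longer tokens.
prefixed = [t for t in tokens if t.lower().startswith(tuple(contractions))]

print(exact)     # []
print(prefixed)  # ["ky'olunaku"]
```

Because no token ever equals a contraction exactly, a stop-word lookup on the token text would never match these entries with the current tokenizer settings.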

@tobiusaolo
Contributor Author
tobiusaolo commented Aug 16, 2022 via email

@adrianeboyd
Contributor

Sorry for the delay, I thought I should wait on an update because in the current version the contractions are still added to the stop words. If the contractions are removed, then I think this is fine to merge. Let me go ahead and do that...

We're actually planning to remove the default stop word lists for v4, but I was hoping to leave all the stop words in v3 as a useful reference for users.

@adrianeboyd
Contributor

Thanks again for the PR! We'll mention Luganda in the release notes for the next release (probably v3.4.2).

@adrianeboyd merged commit c09d2fa into explosion:master on Aug 23, 2022