Add levenshtein from polyleven #11418

Merged

adrianeboyd
Contributor

Description

Add a simple Levenshtein distance function, using the implementation from the polyleven library, as `spacy.matcher.levenshtein`.
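
For readers unfamiliar with the metric, here is a minimal pure-Python sketch of Levenshtein distance with an optional threshold `k`. It is not the polyleven or spaCy implementation (those are compiled and considerably faster); the function name `levenshtein_reference` is hypothetical, and the "return `k + 1` once the distance exceeds `k`" behaviour is an assumption modelled on how polyleven describes its threshold argument.

```python
def levenshtein_reference(a: str, b: str, k: int = -1) -> int:
    """Edit distance between a and b (insert, delete, substitute; cost 1 each).

    Assumption: when k >= 0 and the true distance exceeds k, return k + 1.
    A negative k means "no threshold".
    """
    if len(a) < len(b):
        a, b = b, a  # distance is symmetric; keep the inner row short
    # The distance is at least the length difference, so bail out early.
    if k >= 0 and len(a) - len(b) > k:
        return k + 1
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or free match
            ))
        prev = curr
    dist = prev[-1]
    if k >= 0 and dist > k:
        return k + 1
    return dist
```

A threshold is what makes this useful for fuzzy matching: a caller that only cares whether two strings are within `k` edits can stop caring about the exact value beyond that point, which is also why the `k=1` to `k=3` variants are cheaper in practice for optimized implementations.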

Types of change

Enhancement.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

Contributor Author
adrianeboyd commented Aug 31, 2022

This started as a sketch of what this could look like for use with a FUZZY Matcher operator (#11359).

Using polyleven's Python benchmark, the spaCy version appears to be slightly faster (1.453 s vs. 1.849 s with no threshold):

$ python benchmark.py 
System: Python 3.8.10 on Linux (x86_64)
Words : 99171
Sample: 100
Total : 9917100 calls

#                              TIME[sec]       SPEED[calls/s]
polyleven.levenshtein          1.849           5362263
polyleven.levenshtein (k=3)    1.310           7569508
polyleven.levenshtein (k=2)    1.185           8367156
polyleven.levenshtein (k=1)    1.125           8811527
spacy.levenshtein              1.453           6824286
spacy.levenshtein (k=3)        1.173           8455878
spacy.levenshtein (k=2)        1.051           9438319
spacy.levenshtein (k=1)        1.019           9734970
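
The totals above (99171 words × 100 sample words = 9917100 calls) suggest that every sample word is compared against every word in the list. Under that assumption, a timing loop of the same shape can be sketched as follows. This is not polyleven's actual `benchmark.py`; the function name `benchmark` and its parameters are hypothetical.

```python
import time
from typing import Callable, Sequence, Tuple


def benchmark(
    fn: Callable[[str, str], int],
    words: Sequence[str],
    sample: Sequence[str],
) -> Tuple[float, int, float]:
    """Call fn on every (sample word, word) pair.

    Returns (elapsed seconds, number of calls, calls per second).
    """
    calls = 0
    start = time.perf_counter()
    for s in sample:
        for w in words:
            fn(s, w)
            calls += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very small inputs.
    speed = calls / elapsed if elapsed > 0 else float("inf")
    return elapsed, calls, speed


if __name__ == "__main__":
    # Stand-in distance function so the sketch runs without extra packages.
    def length_gap(a: str, b: str) -> int:
        return abs(len(a) - len(b))

    elapsed, calls, speed = benchmark(length_gap, ["spacy", "matcher", "fuzzy"], ["space", "match"])
    print(f"{calls} calls in {elapsed:.6f} s ({speed:.0f} calls/s)")
```

To compare two implementations this way, pass each one as `fn` over the same `words` and `sample`; a threshold variant can be wrapped with `lambda a, b: fn(a, b, 3)`.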

svlandeg added the labels enhancement (Feature requests and improvements) and feat / matcher (Feature: Token, phrase and dependency matcher) on Aug 31, 2022
Contributor
polm left a comment

This looks good; I was able to run it locally without issue. I also ran a benchmark and noticed that the spaCy version was slightly faster, similar to polyleven with k=3. Not sure why that would happen.

@adrianeboyd
Contributor Author

My guess is that the `cpdef` interface is somehow slightly faster than the standard Python one in polyleven, but I don't actually know for sure.

Member
svlandeg left a comment

Looks good!

svlandeg merged commit 7c98245 into explosion:master on Sep 14, 2022