
Is there any example of removing duplicate docs using MinHash? #188

Open · zyh3826 opened this issue Jun 17, 2022 · 4 comments

zyh3826 commented Jun 17, 2022

Is there any example of removing duplicate docs using MinHash?

ekzhu (Owner) commented Jun 20, 2022

Not yet. But maybe you can create one and add it as a pull request :)

I would start by creating a MinHash of each normalized (tokenized, lowercased, truncated, etc.) document. Once you have N MinHash for N documents, you have two choices:

  1. Use brute force to compute the Jaccard similarity of all pairs of MinHash to find documents with very high similarity (e.g., >0.95 Jaccard).
  2. Use a MinHashLSH index. Insert all the MinHash into the index, then query with each MinHash to find highly similar candidates (excluding the query itself), compute their Jaccard similarities (either exactly or estimated from the MinHash), and keep the pairs with very high similarity.

The second option is faster; the first option is more accurate.
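
Here is a minimal sketch of the second option using datasketch's MinHash and MinHashLSH; the whitespace tokenizer, the sample documents, and the 0.9/0.95 thresholds are illustrative placeholders, not recommendations:

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128

def minhash_doc(text, num_perm=NUM_PERM):
    # Normalize, tokenize, then hash each unique token into the MinHash.
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf8"))
    return m

docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "over the lazy dog the quick brown fox jumps",
    "d3": "an entirely different document about something else",
}
minhashes = {key: minhash_doc(text) for key, text in docs.items()}

# Index every document; use an LSH threshold a bit below the final
# cutoff so borderline candidates are not missed.
lsh = MinHashLSH(threshold=0.9, num_perm=NUM_PERM)
for key, m in minhashes.items():
    lsh.insert(key, m)

# Query each MinHash, skip the document itself, and verify candidates
# with the estimated Jaccard similarity before calling them duplicates.
duplicates = set()
for key, m in minhashes.items():
    for cand in lsh.query(m):
        if cand != key and m.jaccard(minhashes[cand]) > 0.95:
            duplicates.add(tuple(sorted((key, cand))))
print(duplicates)
```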

rupeshkumaar (Contributor) commented Oct 27, 2023

Hi @ekzhu, I would like to work on this. I have already built something similar for my own use case, where I had to deduplicate a huge corpus of almost 100M documents. I am using the first approach. I also tried the second one, using multiprocessing to achieve parallelism, but I was not able to merge the MinHashLSH objects created in different processes into one. So I would like to know which approach we should move ahead with for this one.

ekzhu (Owner) commented Dec 1, 2023

Sounds good. I believe this also addresses #205. You can submit a PR and we can go from there.

rupeshkumaar (Contributor) commented Mar 12, 2024

I am planning to work on this project in my free time, so a few questions:

  1. Do we need to add it as a class method attached to the MinHash class? Since it will be a util kind of method (good to have), could we keep it separate?
  2. Which method do you think we should go for? Since Merging (Identically Specified) MinHashLSH objects #205 is implemented, we could choose either of the methods (a sketch of the parallel LSH route follows below); it is going to be a tradeoff between speed and accuracy.

If you have any other suggestions, please let me know. @ekzhu
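
For reference, here is a rough sketch of the parallel build for the second method, assuming the merge support added for #205 (the indexes must be identically specified); the chunking, tokenizer, corpus, and parameters are illustrative placeholders:

```python
from multiprocessing import Pool
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128
THRESHOLD = 0.9

def build_partial_index(chunk):
    # Each worker builds its own MinHashLSH over a chunk of (key, text)
    # pairs. Keys must be unique across all chunks.
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    for key, text in chunk:
        m = MinHash(num_perm=NUM_PERM)
        for token in set(text.lower().split()):
            m.update(token.encode("utf8"))
        lsh.insert(key, m)
    return lsh

if __name__ == "__main__":
    corpus = [(f"doc{i}", f"some text for document {i}") for i in range(1000)]
    chunks = [corpus[i::4] for i in range(4)]  # four interleaved chunks

    with Pool(processes=4) as pool:
        partials = pool.map(build_partial_index, chunks)

    # Fold the identically specified partial indexes into one
    # (the merge method from #205 mutates the left-hand index).
    merged = partials[0]
    for other in partials[1:]:
        merged.merge(other)
```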
