[go: nahoru, domu]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merging (Identically Specified) MinHashLSH objects #205

Closed
hsicsa opened this issue Mar 21, 2023 · 11 comments
Closed

Merging (Identically Specified) MinHashLSH objects #205

hsicsa opened this issue Mar 21, 2023 · 11 comments

Comments

@hsicsa
Copy link
hsicsa commented Mar 21, 2023

Background:

When using a dataflow pipeline, the input stream (e.g. of documents) can be split to increase parallelism & throughput across multiple machines.
In order to do so, when computing the MinHashLSH of the stream, the MinHashLSH objects in two or more sub-streams must be merged.

Generally speaking, LSH implementations do not support merging two or more LSH states into a single LSH state.

However, MinHashLSH (as implemented in datasketch) does appear to be capable of doing so when all the parameters of the objects to be merged are identical. Beyond the threshold, # permutations of the minHash values, ad weights, when the number of hash tables and the associated set of hashvalue ranges are identical, and it seems reasonable to support the merging of the underlying hash tables.

Proposal:

Create a mergeMinHashLSH function that does the following:

  • Takes a list of MinHashLSH objects
  • Asserts that the initialization parameters are identical for all objects
  • Performs a binary merge (i.e. O(log N)) of the objects underlying hash tables, while updating the keys and hashvalues.
  • returns a newly instantiated MinHashLSH object

Additionally, a mergeMinHashLSH method could be added to the MinHashLSH class to perform merging in place.

@ekzhu
Copy link
Owner
ekzhu commented Mar 24, 2023

Thank! I think this is a good idea. It will be a useful classmethod for MinHashLSH class.

@ekzhu
Copy link
Owner
ekzhu commented Mar 24, 2023

Are you interested in submitting a PR for this?

@hsicsa
Copy link
Author
hsicsa commented Mar 24, 2023

I'm interested but won't be able to do so until mid-April at the earliest.

@ekzhu
Copy link
Owner
ekzhu commented Mar 24, 2023 via email

@rupeshkumaar
Copy link
Contributor

Hi @ekzhu. Let me understand this when @hsicsa suggesting to merge identical MinHashLSH objects, it is combining all the keys from similar objects into one object and if the same key appears in multiple objects then in that case should it be merged?

@ekzhu
Copy link
Owner
ekzhu commented Dec 1, 2023

Sorry I missed this... I am not quite following can you elaborate?

I believe the goal is to simply merge the hash tables of a list of MinHashLSH objects. Of course, I would also have a procedure to make sure the sets of keys of the MinHashLSH objects are disjoint, aka., the keys were partitioned when creating the list of MinHashLSH objects in the first place.

@rupeshkumaar
Copy link
Contributor
rupeshkumaar commented Jan 14, 2024

I have taken a look at it, I wanted to understand what initialization parameters are we going to compare. I have implemented the eq and compared the type, threshold, and num_perms do I need to compare any other init parameters too? @ekzhu

@ekzhu
Copy link
Owner
ekzhu commented Jan 14, 2024

You probably want to have an option to check if the keys stored in the hashtables are disjoint before merging them. It will be an expensive operation, so give the caller an option to choose.

@rupeshkumaar
Copy link
Contributor
rupeshkumaar commented Jan 14, 2024

Okay so let me understand. There are actually two keys. One that is given by the user while inserting MinHash into the MinHashLSH object along with the MinHash object and the other key is Hashed one which is getting created by byteswap. Please can you tell me which one you are referring to? And the key can be found in the hashtables in the DictSetStorage. Now, I have already added a check where it checks for the key present in second MinHashLSH object should not be in the first MinHashLSH. if the key exist it will throw an appropriate exception but do I need to merge the keys if user wants it to be? Is that what you are referring to? I can submit a PR in sometime for you to look into.

@ekzhu
Copy link
Owner
ekzhu commented Jan 15, 2024

Okay so let me understand. There are actually two keys. One that is given by the user while inserting MinHash into the MinHashLSH object along with the MinHash object and the other key is Hashed one which is getting created by byteswap. Please can you tell me which one you are referring to?

The user provided key for each MinHash object.

if the key exist it will throw an appropriate exception but do I need to merge the keys if user wants it to be?

Going back a bit. The scenario that a merge could go wrong is when the two MinHashLSH are created with different key --> minhash mappings. But in the subcase when the keys --> minhash mappings are different but the keys are disjoint, the user can still rely on the retrieved keys to decide on whether to filter it out -- it won't affect the recall. But if the keys are not disjoint and the key-->minhash mappings are different, then the user would have no way to tell whether the retrieved keys is what they want.

So, in a typical case we can just go ahead and merge the hash tables if the user does not request for a disjoint check. But in some cases when the user is given a MinHashLSH object they didn't produce, they might want to run a disjoint check to make sure they are not incorrectly merging the keys -- we should raise an exception.

So the disjoint check should only happen when the user is specifically asking for it, and by default it should be turned off.

rupeshkumaar added a commit to rupeshkumaar/datasketch that referenced this issue Jan 20, 2024
rupeshkumaar added a commit to rupeshkumaar/datasketch that referenced this issue Jan 21, 2024
rupeshkumaar added a commit to rupeshkumaar/datasketch that referenced this issue Jan 22, 2024
rupeshkumaar added a commit to rupeshkumaar/datasketch that referenced this issue Mar 12, 2024
@ekzhu ekzhu closed this as completed in a532f06 Mar 12, 2024
@hsicsa
Copy link
Author
hsicsa commented Mar 12, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants