[go: nahoru, domu]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should we add an is_in function? #517

Open
westonpace opened this issue Jul 7, 2023 · 5 comments
Open

Should we add an is_in function? #517

westonpace opened this issue Jul 7, 2023 · 5 comments

Comments

@westonpace
Copy link
Member

Currently, in Acero, we've been mapping is_in to SingularOrList. The latter is more generic, and so this is safe, but it doesn't round trip well and is_in is more efficient.

In other words, given something like:

is_in(f0, [7, 3, 4, 6]) we round trip to f0 == 7 || f0 == 3 || f0 == 4 || f0 == 6.

It is possible to recognize that an or-list collapses to is_in but then everyone needs to repeat this optimization and there can be some tricky nuance in case someone provides something like f0 == 7 || 3 == f0 (f0 on both left and right is ok but can confuse a simplistic optimizer routine).

It seems like having is_in is the same thing as having both JoinRel and HashJoinRel. It's just that we're not usually used to seeing the logical/physical relationship play out across expressions in addition to relations.

@EpsilonPrime
Copy link
Member

If we aren't going to have standard optimizations and/or a library for making them any time soon, the argument for having the alternative is more compelling.

One advantage of the optimization approach is that it yields faster results without sacrificing language simplicity. If someone was to present something that was an or list the optimization approach would pick that up and use is_in. A SingularOrList is much easier for a naive optimizer to pick up than a series of nested or statements which I would also like to see appropriately converted.

But ultimately it comes down to timing for me. If it's a year or more that we expect to see a generic optimization library/tool for Substrait it's probably not worth waiting for (and we already have multiples of similar concepts today).

@jacques-n
Copy link
Contributor
jacques-n commented Jul 8, 2023

I'm missing something. How is is_in less generic than the or list and why do you not map it back to is in on the reverse? They seem logically equivalent to me.

@westonpace
Copy link
Member Author
westonpace commented Jul 8, 2023

An or list could be something like (f0 == f1) || (f0 == f2)

is_in requires that all of the choices be literals and is often implemented with something like a hash set.

@westonpace
Copy link
Member Author

Though, now that I think about it, substrait has no way of restricting function arguments to be literals. So is_in couldn't be well described as a function. You could state that the arguments are constant, which I suppose invalidates an earlier claim I made that "constant" doesn't make sense for scalar functions. I seem to have pretzeled myself.

@jacques-n
Copy link
Contributor

Ah. An earlier version of or list restricted all the option expressions to literals... I guess we generalized it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants