Should we add an is_in function?
#517
Comments
If we aren't going to have standard optimizations and/or a library for making them any time soon, the argument for having the alternative is more compelling. One advantage of the optimization approach is that it yields faster results without sacrificing language simplicity. If someone were to present something that was an or list, the optimization approach would pick that up and use is_in. A SingularOrList is much easier for a naive optimizer to pick up than a series of nested or statements, which I would also like to see appropriately converted. But ultimately it comes down to timing for me. If we expect it to be a year or more before we see a generic optimization library/tool for Substrait, it's probably not worth waiting for (and we already have multiple similar concepts today).
I'm missing something. How is is_in less generic than the or list, and why do you not map it back to is_in on the reverse? They seem logically equivalent to me.
An or list could be something like
|
Though, now that I think about it, substrait has no way of restricting function arguments to be literals. So |
Ah. An earlier version of or list restricted all the option expressions to literals... I guess we generalized it.
Currently, in Acero, we've been mapping is_in to SingularOrList. The latter is more generic, and so this is safe, but it doesn't round trip well and is_in is more efficient.

In other words, given something like:

is_in(f0, [7, 3, 4, 6])

we round trip to f0 == 7 || f0 == 3 || f0 == 4 || f0 == 6.

It is possible to recognize that an or-list collapses to is_in, but then everyone needs to repeat this optimization, and there can be some tricky nuance in case someone provides something like f0 == 7 || 3 == f0 (f0 on both left and right is ok but can confuse a simplistic optimizer routine).

It seems like having is_in is the same thing as having both JoinRel and HashJoinRel. It's just that we're not usually used to seeing the logical/physical relationship play out across expressions in addition to relations.