-
Notifications
You must be signed in to change notification settings - Fork 147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
What are the semantics of the ANTI join? #325
Comments
Judging by the query syntax, I'm guessing that the intention was for anti-join to be null-aware, i.e. I also agree that null semantics need to be specified better for relations and the built-in expression types. In many cases the documentation skirts over this completely. I lack the database background to trust myself with this, but documentation improvement PRs are very welcome for these sorts of things! |
@jvanstraten Thank you for the reply.
I'm seeing that there is a way to define both IN and EXISTS subqueries, hence, I assume these can be combined with NOT to support NOT IN and NOT EXISTS. The limitation of using these would be that they define a logical not physical plan. It seems to me we need to define anti join types for both NOT IN and NOT EXISTS semantics. What's the process for deciding what do add? Who should we bring into the discussion? |
@jvanstraten, The null semantics of |
In general, at least in protobuf, Substrait only really defines logical plans thus far. Some physical relations are defined on the website, but AFAIK they are incomplete, and they are probably outdated here and there.
To be honest, the process right now is that there isn't much of a process yet. It's pretty much just a matter of proposing something via PR and seeing what people have to say about it, if anything. For simple stuff that I'm personally comfortable reviewing this can go pretty quick, but this is not one of those things. The right person to ping is @jacques-n.
You could remove the "of NOT IN" in that sentence and I'd still wholeheartedly agree. I'd honestly love to toss it all in the bin and start over with a consistent way of treating null, but you basically lose compatibility with SQL at that point, and I don't think that's really an option. |
The NULL semantics of Anyway, the question was about the anti-join relational operator, which is a relational operator, and whose semantics are fairly easy to define. Give your users a well-defined anti-join, and someone else can implement the mess that is |
Please share what you would consider the obvious definition to be, because I honestly don't know how the typical data scientist even expects null to behave anymore. If it were entirely up to me, short of throwing the concept of null out the window entirely, I'd define null == null & null != any other value (so containment and set comparisons actually generalize rather than requiring special cases for every single thing involving nulls), null < any other value (to get rid of the nulls first/nulls last special cases for sorts, though I'd be more inclined to just say that nullable types do not have a defined ordering in the first place), have all functions except for a select few (among which the comparison operators and things that explicitly deal with nulls) return null if any inputs are null (rather than all of a sudden ignoring nulls in aggregates, for example; I mean, filter them out first if you don't want them), and call it a day. With that definition, sure, defining a sane anti-join would be easy, because you don't need any special cases for null anymore at all. But I'm pretty sure my definition would ruffle just about everyone's feathers in some context or another, so I try to stay out of these discussions and only point out things that don't make sense to me in the context of what's already defined in Substrait. ETA: Or, better yet, define generalized sum types and let users figure out for themselves how they want to treat unknown values, special values that are equal to all or none of the other values, empty aggregate markers, error markers, etc. |
It seems as if Substrait has null semantics for scalars, so A few relational operators need scalar expressions: project has a list of expressions of any type, filter has a boolean expression, join (including variants like anti-join, semi-join) has a boolean expression. For those boolean expressions, true is treated as "succeeds" and false and unknown (aka null) are treated as "fails". Equivalently, you can have the operators implicitly wrap their predicate in If you were to implement
evaluates the sub-query to return a list of integers (some of which may be The asymmetric conversion of 3-valued logic to 2-valued logic is why People writing |
Thanks, this is super helpful! I had already pieced a good part of this together from context, but it's great to see those things confirmed as a "by definition" thing rather than as a coincidence. The missing links for me were that set containment also returns a nullable boolean and treats "null" as "undefined" rather than as just a special value (the validator is then probably wrong about the return type of containment subqueries, actually), and that for predicates null is treated as false (I already assumed this, but have received mixed signals about it). I'd like to say that these things should have been obvious, but aggregates don't treat null that way, so... Just to phrase the problem as I now understand it: both semi and anti joins are ill-defined because they are really a filter-by-containment-subquery in disguise, but without control over what happens when the containment subquery returns null, because the expression that the subquery is contained in is implicit. I have to say that I don't really understand why they're special-cased as joins then in the first place, but, fine. As for a solution, I suppose we could just extend the join type enumeration with variants of anti and semi join that treat a null from the implicit containment expression as true instead of false? If I understand correctly, an anti join with that behavior is the same as the |
I think of SEMI and ANTI joins as optimizations, hence, part of physical query plan. In the logical query plan, set operations you mentioned earlier are sufficient.
What is "OP"? I'm not sure I understand the proposal. Are you suggesting to introduce null-aware anti join with NOT IN semantics and regular anti join with NOT EXISTS semantics or something else? Also, while there is a difference in null handling for anti joins (NOT IN vs. NOT EXISTS), I don't think there is anything similar to that for semi joins (IN vs. EXISTS). |
Folks, FYI I'm adding documentation about ANTI JOIN semantics to Velox: facebookincubator/velox#2527 |
In SQL,
They return the same result, but they do it by different means. Suppose one of the employees in department 40 has a NULL That Unlike |
@julianhyde This makes sense. Thank you for a nice explanation. Since SEMI JOIN is a physical operator used to implement some (not all) IN and EXISTS queries, then it feels to me there is no need to define 2 different SEMI JOIN operators. A single implementation that returns only left-side-rows with a match on the right side can be used for both cases (NULLs on either side never match). That said, probably not all queries with IN or EXISTS can be planned using SEMI JOIN. Specifically, the examples you provide may not be executable using SEMI JOIN. In this way, I think SEMI joins are different from ANTI joins, i.e. there is a single way of handling nulls that works for all use cases for SEMI joins, while there are two different ways for handing nulls in case of an ANTI join. |
That's fair enough I suppose. JoinRel is technically supposed to be a logical relation, but also having stuff that sits between a purely logical or purely physical plan works for me.
Original Post or Original Poster I think; either way I meant the first comment of the issue.
That's fair enough because I think I made a mistake somewhere... I've been working the problem from the other direction (i.e. what Substrait's behavior would currently be by following the spec to the letter) and have thoroughly confused myself about the Current Substrait anti join (NOT EXISTS, apparently)
If we follow the current spec to the letter...
... and treat the join expression as a normal predicate (where null is treated as false), then the following assertions are true and equivalent:
It looks like this actually boils down to Current Substrait semi join
... we get following assertion:
Therefore:
"NOT IN"?I figured that putting a trinary not() function in there somewhere would get you the other behavior, and that would have generalized to semi join just the same. Hence why I suggested both. But I'm not sure how to apply the set containment logic to the way Substrait defines these joins, i.e. with a single boolean (or nullable boolean) join expression and then saying something about whether that ever or never returns true. I thought you'd get the right thing if you would instead say "ever or never returns true or null", but I got nonsense when I tried working that out. |
@jvanstraten +1 |
There is a new attempt to address this in #585 |
Folks,
I'm looking at the join descriptions and in particular at ANTI join.
https://substrait.io/relations/logical_relations/#join-types
I believe there are 2 kinds of ANTI join. One is used for NOT IN queries, the other for NOT EXISTS. They differ in how they handle nulls.
vs.
Here are some examples to illustrate the differences.
Null aware anti join (NOT IN):
NULL on build side:
No nulls on build side:
EMPTY build side
Anti join (NOT EXISTS):
NULL on build side:
No nulls on build side
EMPTY build side
I'm wondering which of these semantics are assumed for Substrait ANTI join? Would it make sense to define two anti joins? Null-aware anti join and regular anti-join?
CC: @rui-mo @jacques-n
The text was updated successfully, but these errors were encountered: