[go: nahoru, domu]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What are the semantics of the ANTI join? #325

Open
mbasmanova opened this issue Sep 9, 2022 · 15 comments
Open

What are the semantics of the ANTI join? #325

mbasmanova opened this issue Sep 9, 2022 · 15 comments

Comments

@mbasmanova
Copy link

Folks,

I'm looking at the join descriptions and in particular at ANTI join.

https://substrait.io/relations/logical_relations/#join-types

Return records from the left input. These are returned only if the records do not have a join partner on the right side.

I believe there are 2 kinds of ANTI join. One is used for NOT IN queries, the other for NOT EXISTS. They differ in how they handle nulls.

  • Null aware anti join (NOT IN) returns no rows if build side has null in the join key.
  • Null aware anti join (NOT IN) returns rows with null in the probe side join key only if build side empty.

vs.

  • Anti join (NOT EXISTS) ignores nulls on the build side.
  • Anti join (NOT EXISTS) always returns rows with null in the probe side join key.

Here are some examples to illustrate the differences.

Null aware anti join (NOT IN):

NULL on build side:

with a as (select * from unnest(array[null, 1, 2]) as _t(id)),
b as (select * from unnest(array[null, 2, 3]) as _t(id))
select * from a where a.id not in (select id from b)

 id
----
(0 rows)

No nulls on build side:

with a as (select * from unnest(array[null, 1, 2]) as _t(id)),
b as (select * from unnest(array[2, 3]) as _t(id))
select * from a where a.id not in (select id from b)

 id
----
  1
(1 row)

EMPTY build side

with a as (select * from unnest(array[null, 1, 2]) as _t(id)),
b as (select * from unnest(array[]) as _t(id))
select * from a where a.id not in (select id from b)

  id
------
    1
    2
 NULL
(3 rows)

Anti join (NOT EXISTS):

NULL on build side:

with a as (select * from unnest(array[null, 1, 2]) as _t(id)),
b as (select * from unnest(array[null, 2, 3]) as _t(id))
select * from a where not exists (select * from b where b.id = a.id)

  id
------
 NULL
    1
(2 rows)

No nulls on build side

with a as (select * from unnest(array[null, 1, 2]) as _t(id)),
b as (select * from unnest(array[2, 3]) as _t(id))
select * from a where not exists (select * from b where b.id = a.id)

  id
------
    1
 NULL
(2 rows)

EMPTY build side

with a as (select * from unnest(array[null, 1, 2]) as _t(id)),
b as (select * from unnest(array[]) as _t(id))
select * from a where not exists (select * from b where b.id = a.id)

  id
------
    2
    1
 NULL
(3 rows)

I'm wondering which of these semantics are assumed for Substrait ANTI join? Would it make sense to define two anti joins? Null-aware anti join and regular anti-join?

CC: @rui-mo @jacques-n

@mbasmanova mbasmanova changed the title What are the semantics of the ANTI join What are the semantics of the ANTI join? Sep 9, 2022
@jvanstraten
Copy link
Contributor

Judging by the query syntax, I'm guessing that the intention was for anti-join to be null-aware, i.e. NOT IN. I say this only because the NOT EXISTS version is structured more like a subquery, and this subquery type is already supported (when combined with a not function call, anyway). However, if this is the only corner case (or one of a few) that JoinRel isn't covering right now, I would also not be opposed to just adding more join types. We just need to be careful that the list doesn't explode.

I also agree that null semantics need to be specified better for relations and the built-in expression types. In many cases the documentation skirts over this completely. I lack the database background to trust myself with this, but documentation improvement PRs are very welcome for these sorts of things!

@mbasmanova
Copy link
Author

@jvanstraten Thank you for the reply.

I say this only because the NOT EXISTS version is structured more like a subquery, and this subquery type is already supported (when combined with a not function call, anyway).

I'm seeing that there is a way to define both IN and EXISTS subqueries, hence, I assume these can be combined with NOT to support NOT IN and NOT EXISTS. The limitation of using these would be that they define a logical not physical plan. It seems to me we need to define anti join types for both NOT IN and NOT EXISTS semantics.

What's the process for deciding what do add? Who should we bring into the discussion?

@julianhyde
Copy link

@jvanstraten, The null semantics of NOT IN in SQL are messed up so you should stay clear of that.

@jvanstraten
Copy link
Contributor

The limitation of using these would be that they define a logical not physical plan.

In general, at least in protobuf, Substrait only really defines logical plans thus far. Some physical relations are defined on the website, but AFAIK they are incomplete, and they are probably outdated here and there.

What's the process for deciding what do add? Who should we bring into the discussion?

To be honest, the process right now is that there isn't much of a process yet. It's pretty much just a matter of proposing something via PR and seeing what people have to say about it, if anything. For simple stuff that I'm personally comfortable reviewing this can go pretty quick, but this is not one of those things. The right person to ping is @jacques-n.

The null semantics of NOT IN in SQL are messed up so you should stay clear of that.

You could remove the "of NOT IN" in that sentence and I'd still wholeheartedly agree. I'd honestly love to toss it all in the bin and start over with a consistent way of treating null, but you basically lose compatibility with SQL at that point, and I don't think that's really an option.

@julianhyde
Copy link

The NULL semantics of NOT IN are worse than the rest, believe me.

Anyway, the question was about the anti-join relational operator, which is a relational operator, and whose semantics are fairly easy to define. Give your users a well-defined anti-join, and someone else can implement the mess that is NOT IN on top of that.

@jvanstraten
Copy link
Contributor
jvanstraten commented Sep 12, 2022

whose semantics are fairly easy to define

Please share what you would consider the obvious definition to be, because I honestly don't know how the typical data scientist even expects null to behave anymore. If it were entirely up to me, short of throwing the concept of null out the window entirely, I'd define null == null & null != any other value (so containment and set comparisons actually generalize rather than requiring special cases for every single thing involving nulls), null < any other value (to get rid of the nulls first/nulls last special cases for sorts, though I'd be more inclined to just say that nullable types do not have a defined ordering in the first place), have all functions except for a select few (among which the comparison operators and things that explicitly deal with nulls) return null if any inputs are null (rather than all of a sudden ignoring nulls in aggregates, for example; I mean, filter them out first if you don't want them), and call it a day. With that definition, sure, defining a sane anti-join would be easy, because you don't need any special cases for null anymore at all. But I'm pretty sure my definition would ruffle just about everyone's feathers in some context or another, so I try to stay out of these discussions and only point out things that don't make sense to me in the context of what's already defined in Substrait.

ETA: Or, better yet, define generalized sum types and let users figure out for themselves how they want to treat unknown values, special values that are equal to all or none of the other values, empty aggregate markers, error markers, etc.

@julianhyde
Copy link

It seems as if Substrait has null semantics for scalars, so x + 1 returns null, and x > 1 returns null (in SQL the boolean null is generally known as unknown, but they really are the same thing). So I'll proceed on that assumption.

A few relational operators need scalar expressions: project has a list of expressions of any type, filter has a boolean expression, join (including variants like anti-join, semi-join) has a boolean expression. For those boolean expressions, true is treated as "succeeds" and false and unknown (aka null) are treated as "fails".

Equivalently, you can have the operators implicitly wrap their predicate in is true so that they only have to deal with true and false. Either way, I think that's a sane semantics for nullable scalar expressions in relational operators.

If you were to implement not in, you can approach the semantics the same way: the query

select * from emp
where job = 'Clerk' 
or sal not in (select min(sal)
    from emp
    where location = 'San Francisco'
    group by deptno)

evaluates the sub-query to return a list of integers (some of which may be null), then for a given row evaluates sal in subQuery (to true, false or null), then converts that using 3-valued not, then combines that with the result of job = 'Clerk' using the 3-valued or, and if that evaluates to true then the row is returned.

The asymmetric conversion of 3-valued logic to 2-valued logic is why not in is so confusing. 5 in (null, 10, 20) evaluates to null in SQL, but if you're using it in where clause the effect is as if it evaluated to false, which is what you want. 5 not in (null, 10, 20) also evaluates to null (because that null represents an unknown value that might be 5).

People writing x not in (subQuery) in SQL almost always want (x in subQuery) is false, and they are very surprised when they don't get it. Furthermore, query optimizers that are trying to give them what they asked for have to do things like counting the rows in subQuery and counting the number of nulls in subQuery, which results in a much less efficient plan than what they wanted.

@jvanstraten
Copy link
Contributor

Thanks, this is super helpful! I had already pieced a good part of this together from context, but it's great to see those things confirmed as a "by definition" thing rather than as a coincidence. The missing links for me were that set containment also returns a nullable boolean and treats "null" as "undefined" rather than as just a special value (the validator is then probably wrong about the return type of containment subqueries, actually), and that for predicates null is treated as false (I already assumed this, but have received mixed signals about it). I'd like to say that these things should have been obvious, but aggregates don't treat null that way, so...

Just to phrase the problem as I now understand it: both semi and anti joins are ill-defined because they are really a filter-by-containment-subquery in disguise, but without control over what happens when the containment subquery returns null, because the expression that the subquery is contained in is implicit. I have to say that I don't really understand why they're special-cased as joins then in the first place, but, fine.

As for a solution, I suppose we could just extend the join type enumeration with variants of anti and semi join that treat a null from the implicit containment expression as true instead of false? If I understand correctly, an anti join with that behavior is the same as the NOT EXISTS variant described by the OP?

@mbasmanova
Copy link
Author

I have to say that I don't really understand why they're special-cased as joins then in the first place, but, fine.

I think of SEMI and ANTI joins as optimizations, hence, part of physical query plan. In the logical query plan, set operations you mentioned earlier are sufficient.

we could just extend the join type enumeration with variants of anti and semi join that treat a null from the implicit containment expression as true instead of false? If I understand correctly, an anti join with that behavior is the same as the NOT EXISTS variant described by the OP?

What is "OP"? I'm not sure I understand the proposal. Are you suggesting to introduce null-aware anti join with NOT IN semantics and regular anti join with NOT EXISTS semantics or something else?

Also, while there is a difference in null handling for anti joins (NOT IN vs. NOT EXISTS), I don't think there is anything similar to that for semi joins (IN vs. EXISTS).

@mbasmanova
Copy link
Author

Folks, FYI I'm adding documentation about ANTI JOIN semantics to Velox: facebookincubator/velox#2527

@julianhyde
Copy link

I don't think there is anything similar to that for semi joins (IN vs. EXISTS).

In SQL, IN and EXISTS do have different null behaviors, but you have to look really closely to see it. Consider these queries ('find all employees who have the same salary as an employee in department 40'):

SELECT *
FROM Emp AS e
WHERE EXISTS (
  SELECT *
  FROM Emp AS e2
  WHERE e2.deptno = 40
  AND e2.sal = e.sal)

SELECT *
FROM Emp AS e
WHERE e.sal IN (SELECT e2.sal
    FROM Emp AS e2
    WHERE e2.deptno = 40)

They return the same result, but they do it by different means. Suppose one of the employees in department 40 has a NULL sal value. For any given e record, the EXISTS returns either TRUE or FALSE, but IN returns either TRUE or UNKNOWN.

That UNKNOWN value goes straight into the WHERE and is treated the same as FALSE, so most SQL users don't notice the difference. But if that UNKNOWN had gone into an OR or NOT or an IS FALSE then the difference would have become clear.

Unlike IN, EXISTS doesn't have a way to say 'I'm not sure whether there is an employee with this salary in department 40' because it has to either return a row or not. It collapses the uncertainty earlier, in its own WHERE clause.

@mbasmanova
Copy link
Author

@julianhyde This makes sense. Thank you for a nice explanation.

Since SEMI JOIN is a physical operator used to implement some (not all) IN and EXISTS queries, then it feels to me there is no need to define 2 different SEMI JOIN operators. A single implementation that returns only left-side-rows with a match on the right side can be used for both cases (NULLs on either side never match). That said, probably not all queries with IN or EXISTS can be planned using SEMI JOIN. Specifically, the examples you provide may not be executable using SEMI JOIN.

In this way, I think SEMI joins are different from ANTI joins, i.e. there is a single way of handling nulls that works for all use cases for SEMI joins, while there are two different ways for handing nulls in case of an ANTI join.

@jvanstraten
Copy link
Contributor

I think of SEMI and ANTI joins as optimizations, hence, part of physical query plan. In the logical query plan, set operations you mentioned earlier are sufficient.

That's fair enough I suppose. JoinRel is technically supposed to be a logical relation, but also having stuff that sits between a purely logical or purely physical plan works for me.

What is "OP"?

Original Post or Original Poster I think; either way I meant the first comment of the issue.

I'm not sure I understand the proposal. Are you suggesting to introduce null-aware anti join with NOT IN semantics and regular anti join with NOT EXISTS semantics or something else?

That's fair enough because I think I made a mistake somewhere... I've been working the problem from the other direction (i.e. what Substrait's behavior would currently be by following the spec to the letter) and have thoroughly confused myself about the NOT IN thing. I've been at it for a while so I'll just post what I have now.

Current Substrait anti join (NOT EXISTS, apparently)

JoinRel {
  left: lhs,
  right: rhs,
  join expression: a.id == b.id
  type: anti
}

If we follow the current spec to the letter...

Return records from the left input. These are returned only if the records do not have a join partner on the right side.

... and treat the join expression as a normal predicate (where null is treated as false), then the following assertions are true and equivalent:

  • a record from lhs is returned if and only if none of the join expressions for that record and the records from rhs return true;
  • a record from lhs is returned if and only if all of the join expressions for that record and the records from rhs return false or null.

a.id == b.id yields null if either side is null. Therefore:

  • applied to lhs = [null, 1, 2] and rhs = [null, 2, 3], you would get [null, 1]. The null is included because the left-hand side of the equality expression is null, therefore all join expressions for the row will be null, which means that none of the join expressions return true. The 1 is included because you get [null, false, false] from the equality expression, none of which are true.
  • applied to lhs = [null, 1, 2] and rhs = [], you would get [null, 1, 2], because applying "all" to an empty set always returns true.

It looks like this actually boils down to NOT EXISTS semantics when you work it all out (unless I made a reasoning error somewhere), rather than the NOT IN thing I figured it'd be modelling. I guess that's a good thing...?

Current Substrait semi join

JoinRel {
  left: lhs,
  right: rhs,
  join expression: a.id == b.id
  type: semi
}

Returns records from the left input. These are returned only if the records have a join partner on the right side.

... we get following assertion:

  • a record from lhs is returned if and only if any of the join expressions for that record and the records from rhs return true;

Therefore:

  • applied to lhs = [null, 1, 2] and rhs = [null, 2, 3], you would get [2].
  • applied to lhs = [null, 1, 2] and rhs = [], you would get [], because applying "any" to an empty set always returns false.

"NOT IN"?

I figured that putting a trinary not() function in there somewhere would get you the other behavior, and that would have generalized to semi join just the same. Hence why I suggested both. But I'm not sure how to apply the set containment logic to the way Substrait defines these joins, i.e. with a single boolean (or nullable boolean) join expression and then saying something about whether that ever or never returns true. I thought you'd get the right thing if you would instead say "ever or never returns true or null", but I got nonsense when I tried working that out.

@julianhyde
Copy link

@jvanstraten +1

@westonpace
Copy link
Member

There is a new attempt to address this in #585

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants