Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I'm going to assume here that we're talking about a subquery here (SELECT * FROM t1 WHERE x NOT IN ( SELECT x FROM t2 )). If you're just talking about a static list, then the basic problem is the amount of data you get back. :-)

The biggest problem with NOT IN is that it has very surprising NULL behavior: Due to the way it's defined, if there is any NULL in the joined-on columns, then _all_ rows must pass. If the column is non-nullable, then sure, you can convert it into an antijoin and optimize it together with the rest of the join tree. If not, it usually ends up being something more complicated.

For this reason, NOT EXISTS should usually be preferred. The syntax sucks, but it's much easier to rewrite to antijoin.



It's not just the nullability behavior. My experience with several databases is that IN is always (or almost always) executing the subquery then using its results to match the outer predicate. But EXISTS can work the other direction, matching the predicates from the outer query then passing those keys into the exists, allowing use of an FK index on the inner query's table.


What databases are those? If you convert to a semijoin, both strategies should be doable.

Note that these caveats do _not_ apply to IN, only NOT IN.


That's useful, thanks!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: