Sort of related: thinking in relational algebra, or SQL. It appears to be "natur...

snidane · on May 14, 2022

It's easier not to mess up table based filters using explicit semi-join operators (eg. in, not in, exists) instead of using regular joins because joins can introduce duplicates.

Give me 'any join' operation - ie. just select the first value instead of all, and I'll happily use joins more. They are actually more intuitive.

It's not that relational algebra is untintuitive. It's because standard SQL sucks.

magicalhippo · on May 14, 2022

Indeed, I've taught myself to only use JOIN when I actually need some data from the table I join. For everything else I use EXISTS and friends.

I was thinking SQL could do with a keyword for that, like maybe FILTER, that looks like a JOIN but works like EXISTS.

snidane · on May 14, 2022

Clickhouse implements an explicit SEMI join. It can be called semi or any, it doesn't really matter. It's just another join modifier

    [OUTER|SEMI|ANTI|ANY|ASOF]

https://clickhouse.com/docs/en/sql-reference/statements/sele...

nerdponx · on May 14, 2022

My problem with semijoins is that the semantics of "what exactly does a SELECT evaluate to inside an expression" are sometimes murky and might vary across databases.

magicalhippo · on May 14, 2022

Could you expand a little?

nerdponx · on May 16, 2022

If I write

    WHERE x IN (SELECT ...)

what the heck is the result of evaluating the inner query, in the outer expression?

Maybe I am missing something, but the exact meaning to vary a lot across different databases. Some seem to have a standalone "table" data type, while others don't.

magicalhippo · on May 16, 2022

I might be missing something as I'm self-taught, but the inner select specifies a set, and you "just" do a simple set membership test?

How it's implemented is as usual up to the database server implementation. Ones I've used creates a temporary table (like it does in so many other cases), and as such EXISTS is usually faster.

But I wouldn't rely on this when moving to another implementation, and use the query planner to see, just as I'd view the assembly output when moving to a new compiler.

Again, I don't have tons of experience, so concrete (counter) examples are welcome.

fiddlerwoaroof · on May 14, 2022

“First” doesn’t make sense without an order

jodrellblank · on May 14, 2022

It even has its own tag on StackOverflow: https://stackoverflow.com/questions/tagged/greatest-n-per-gr...

People who want it, want it with an order.

Look at

https://stackoverflow.com/questions/121387/fetch-the-row-whi...

and

https://stackoverflow.com/questions/3800551/select-first-row...

and

https://stackoverflow.com/questions/8748986/get-records-with...

and their combined thousands of votes and dozens of answers, all full of awkward workaround or ill-performing or specialised-for-one-database-engine code for this common and desirable thing which would be trivial with a couple of boring loops in Python.

snidane · on May 14, 2022

It does make sense for semi-joins. I care about the key, not the value.

Random order is also a valid order.

TFortunato · on May 14, 2022

Having taught it, do you have any recommendations for folks who are looking to improve their thinking in relations/sets/SQL skills?

geophile · on May 16, 2022

I teach, in order, more or less (there is some interleaving):

- Data modeling - Relational algebra - SQL - DBMS architecture (buffering, btrees, query processing, index selection, ...) - Query optimization - Transactions

A few assignments have students write an in-memory relational algebra implementation, and then use it to write queries. A typical query is a deeply nested set of function calls (as a "one-liner"). And only then to we get to SQL. The hope is that RA is so ingrained, that the connections to SQL are easier to see. And this background is also really useful in understanding query optimization.

All of this material, including assignments, is available online (this is my own webserver): http://geophile.com/115.

co_dh · on May 14, 2022

https://arxiv.org/abs/1803.05316

agumonkey · on May 14, 2022

I think this is related to "wholemeal" programming as some haskellers/FP-ists do, thinking in sets/tree/graphs operations.

webmaven · on May 14, 2022

> SQL reflects this. I often see students writing nested subqueries, because that is more procedural, where joins would be a cleaner choice.

In my experience, in the non-ad-hoc use-case, views can often be substituted for the procedural approach, forming the equivalent of a POSIX pipe.

*> A colleague of mine wrote a paper many years ago, pointing out that thinking procedurally is more "natural" for many: https://dl.acm.org/doi/10.1145/319628.319656. But thinking set-at-a-time instead of one-at-a-time is a valuable skill, not that far off from thinking functionally.

Hmm. Given the proliferation of tabular data tools (especially spreadsheets) over the intervening 40 years, I wonder if those results would remain the same today (and whether there would be any difference among Excel power users that use pivot tables, etc.)