Does this work with the bag/multiset distinction that the author uses? If table1 has two copies of some row and table2 has a single copy of that row, won't this query report that they're the same? But they're not: table1 has two copies of the same row, whereas table2 has only one.
I found that a weird edge case for the original author to fixate on. In mathematics or academia, sure, but in “real” SQL tables that serve any kind of purpose, duplicate rows are not something you need to support, let alone go to twice the engineering effort to support. Duplicates are more likely to be something you deliberately eradicate (before the comparison) than preserve and respect.
Exactly. If you have two absolutely duplicate rows in a table, you're going to have problems with a lot of your queries. It's usually an indication that you are lacking a unique constraint or have a bug somewhere.
It's exactly something I need to support, because the only reason I am ever diffing two tables is to check that my new and improved SQL query worked as expected against a known correct table. I don't want my new code to accidentally duplicate rows on some silly join and then fail to catch it in automated testing.
I have had it happen in real data from real systems, but it's not good: “true duplicates” are a sign that something is missing from your schema or that something has gone wrong earlier in the pipeline.
(I just noticed that I emitted the word “iniquitie” above, where I thought I’d written “uniquifier” — and have just confirmed that autocorrect was responsible. Strangely, in context, the word “iniquitie” sort of “appears” to be an appropriate word, when its meaning is certainly not. I found that interesting.)
Not necessarily - the clock source for logging is often at millisecond resolution, but at the speed of modern systems you could pile up quite a few log entries in a millisecond.
I handle this by having a GUID field as the primary key on such tables where there isn't a naturally unique index in the shape of the data. That way something is unique, and you can delete or ignore other rows relative to it. (Just don't make your GUID PK clustered; I use the create-date or log-date for that.)
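Roughly like this, sketching in SQL Server syntax (the table and column names are just placeholders):

    -- Surrogate GUID key so every row is uniquely addressable,
    -- with the clustered index on the log date instead of the PK.
    CREATE TABLE app_log (
        log_id   UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
        log_date DATETIME2        NOT NULL DEFAULT SYSUTCDATETIME(),
        message  NVARCHAR(4000)   NULL,
        CONSTRAINT pk_app_log PRIMARY KEY NONCLUSTERED (log_id)
    );

    CREATE CLUSTERED INDEX ix_app_log_log_date ON app_log (log_date);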
Your log tables don't have timestamps or line numbers on them?
More generally (and formally) speaking, multisets violate normalization. Either you add information to the primary key to identify the copies, or you roll the copies up into a quantity field. I can't think of any kind of data where neither of these would be a good option.
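For example, with a hypothetical order-line table, the two options look like this:

    -- Option 1: add information to the key so the copies are distinguishable
    CREATE TABLE order_line_v1 (
        order_id INT         NOT NULL,
        sku      VARCHAR(20) NOT NULL,
        line_no  INT         NOT NULL,  -- distinguishes what would otherwise be duplicate rows
        PRIMARY KEY (order_id, sku, line_no)
    );

    -- Option 2: roll the copies up into a quantity field
    CREATE TABLE order_line_v2 (
        order_id INT         NOT NULL,
        sku      VARCHAR(20) NOT NULL,
        quantity INT         NOT NULL,  -- how many "copies" this row represents
        PRIMARY KEY (order_id, sku)
    );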
A primary key plus a quantity field isn't always a perfect solution from a performance point of view, and timestamps don't guarantee uniqueness.
Look, I'm not trying to win the argument. In most cases you're definitely right; my point is that sometimes you have to work with a working legacy code base or system, and sometimes that system has some unique quirks.
The duplicate-row issue is part of why I don't use MINUS for table value comparisons, nor RECURSIVE as the original article suggests (which isn't supported in all databases and is scarier for junior developers)... You can accomplish the same thing, and handle the dupes scenario too, with just GROUP BY/UNION ALL/HAVING, using the following technique:
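Sketching the shape of it here (table_a/table_b and the column list are placeholders for your own tables and columns):

    SELECT col1, col2, col3,
           SUM(cnt_a) AS rows_in_a,
           SUM(cnt_b) AS rows_in_b
    FROM (
        -- rowcount per distinct set of column values in table A
        SELECT col1, col2, col3, COUNT(*) AS cnt_a, 0 AS cnt_b
        FROM table_a
        GROUP BY col1, col2, col3
        UNION ALL
        -- same rowcount per distinct set of column values in table B
        SELECT col1, col2, col3, 0 AS cnt_a, COUNT(*) AS cnt_b
        FROM table_b
        GROUP BY col1, col2, col3
    ) counted
    GROUP BY col1, col2, col3
    -- keep only the value sets whose counts disagree between the two tables
    HAVING SUM(cnt_a) <> SUM(cnt_b)
    ORDER BY col1, col2, col3;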
It will catch a mismatch whether you have 1 row for a set of values in one table and 0 in the other... or vice versa... or 1 row for a set of values in one table and 2+ (dupes) in the other.
I have compared every row + every column value of billion-row tables in under a minute on a columnar database with this technique.
Pseudocode summary explanation: create (via GROUP BY) a rowcount for every distinct set of column values you give it from table A, create that same rowcount for every set of column values from table B, then compare those rowcounts across all rows and report the ones that don't match (sorted, to make the output easier to read when you do have differences). A nice, fast, set-based operation.