I am the author of this feature. The background here is that the SQL standard was ambiguous about which of the two ways an implementation should behave. So in the upcoming SQL:202x, this was addressed by making the behavior implementation-defined and adding this NULLS [NOT] DISTINCT option to pick the other behavior.
Personally, I think that the existing PostgreSQL behavior (NULLS DISTINCT) is the "right" one, and the other option was mainly intended for compatibility with other SQL implementations. But I'm glad that people are also finding other uses for it.
> Personally, I think that the existing PostgreSQL behavior (NULLS DISTINCT) is the "right" one
SQL is old enough and this debate so unsettled still that I think it should be clear there isn't a categorically right behavior here anymore than there is a clear winner between tabs and spaces.
As an example of the irreconcilable weirdness of NULL, consider that "NULL = NULL" is false, and so is "NULL != NULL", while rows with NULL still group together in GROUP BY.
The document you link uses unknown as a synonym for null. If you inserted the result of such a comparison into a table, the value inserted would be NULL.
I'm self-taught, could you show how to insert the result of such a comparison into a table in mssql? I can only think of using CASE or similar, which would mean no direct conversion.
True, I'll blame vacation mode for not catching that one.
However, that doesn't change things, I can't get any of these to parse:
insert into #nulltest(a) values (1 <> 0);
insert into #nulltest(a) values (null <> null);
insert into #nulltest(a) values (true and true);
insert into #nulltest(a) values (null and null);
So my question remains, how do you insert an UNKNOWN?
The point of my asking is that, contrary to what was suggested, my belief was that UNKNOWN is not the same as NULL and, crucially, does not evaluate as NULL.
This is a misunderstanding of Hoare's “billion dollar mistake”. The mistake was not the existence of nulls; the mistake was type systems where every reference could be null.
SQL absolutely does have that problem. You have no way of enforcing or checking that an expression is supposed to be non-NULL; at best you get an error at runtime when you try to insert it into a given column, which is usually far too late.
No, it's a specific problem of SQL that you can apply constraints to stored or quasistored columns but not to expressions. In a real typed language every expression has a type.
This is a good way to remember the SQL behavior. Though I do hope it'll change in the future. So many other languages interpret null as a comparable value, and it will be less surprising to folks coming from those.
I personally don't have issues with languages that have incomparable values, just with languages that have a generic equality and incomparable values.
If every type has its own equality operation (e.g. Number.eq) then I have no issue with it being weird in unexpected types (e.g. Number.eq(NULL, NULL) = false).
For example in JavaScript almost every use of === should be replaced by Object.is.
I agree; that's all fine and well, but it turns the language's syntax into dogshit. A babel plugin (transpiling) would be ideal.
On the same note, the original intent was for NaN to be insidious and propagate into every derived calculation or comparison, thus making it easy to tell if a mathematical function had failed, without having to add conditionals at every step within it. Which makes a lot of sense. However, not being able to know exactly where the NaN was produced is kind of shit. Now, if NaN had a stack trace property like Error (exception) objects...it would make this behavior much more useful. Makes me wonder if the prototype of NaN can be extended, hm.
NaNs cannot be extended in any way, in many interpreters NaN is used to "hide" pointers in 64 bit floats using the 52 free bits in NaN values.
IMO NaN should never have existed, like signed zeros (I am more agnostic about [signed] Infinity). I think they were implemented in IEEE 754 because the encoding had space left for them.
I'm not an expert on this in any way; this is literally a personal opinion.
> This isn't quite true - these comparisons don't evaluate to false, they evaluate to NULL.
Which large parts of SQL treat as equivalent to false.
WITH vars AS (
  SELECT CAST(null AS INT) AS X
)
SELECT 'yes this is a tautology' AS "is this a tautology"
FROM vars
WHERE X = X
The above query returns nothing, not an error because the value of X is unknowable. In a sane language, X=X and so it would yield "yes this is a tautology". SQL standard wants it both ways - it wants to say "boolean null isn't false" except when for its own convenience it decides that yes, null is false.
The truth, imho, is that SQL has an archaic type system. In this day and age it should be offering algebraic types where we can implement our own "empty" values and the like that may not use three-value logic as appropriate, so that I can say that for example End Date is blank instead of either unknown or Date.MaxValue.
I think that this topic gets tripped up with people wanting to assume boolean means two possible states... NULL adds an extra state to every type much like the "bottom" type/value in some functional programming languages or type systems.
It's not the `=` operator that is special or significant here. It is the `WHERE` clause semantics, which are defined as returning rows when the boolean expression is true. This treats both false and null cases as the other possible outcome, when rows are not returned. Not understanding this distinction is also where people write buggy queries using WHERE NOT ..., because like equality the NOT operator will return NULL for a NULL input (so NULL is not like false and does not negate to true).
SELECT 'hidden by null result' WHERE CAST(NULL AS boolean);
SELECT 'also hidden by null result' WHERE NOT CAST(NULL AS boolean);
SELECT 'shown by true result' WHERE CAST(NULL AS boolean) IS NULL;
SELECT 'hidden by false result' WHERE NOT (CAST(NULL AS boolean) IS NULL);
The existing GROUP BY behavior is different than WHERE because it does not have a special rule for handling true values. It just returns a group for each distinct value.
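To make the GROUP BY point concrete, here is a minimal sketch (PostgreSQL syntax, throwaway inline VALUES) showing every NULL landing in a single group even though NULL = NULL is not true:

-- returns two rows: (1, 1) and (NULL, 2)
SELECT x, count(*)
FROM (VALUES (1), (NULL), (NULL)) AS t(x)
GROUP BY x;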
The prior (new default) UNIQUE constraint behavior made sense when you think of NULL as this bottom value and meaning "unknown", while also thinking that a uniqueness constraint is only applied to complete keys. The check is effectively deferred, and the partial or incomplete tuple might become valid and unique once its missing part(s) are filled in with real values. The new optional behavior makes more sense if you think that the uniqueness constraint enforces distinct records even with partial keys, just as the GROUP BY clause will determine distinct groups.
There is also a similar issue with FOREIGN KEY constraints. The default MATCH SIMPLE behavior is consistent with the default UNIQUE constraint behavior. An incomplete foreign key is essentially ignored and not subject to integrity checks. The MATCH FULL behavior is almost a style checker to say that a compound foreign key should be all null or all non-null, while still ignoring incomplete foreign keys. As far as I know, the MATCH PARTIAL case is not implemented, but I think it might complement this new UNIQUE NULLS NOT DISTINCT behavior?
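As a rough illustration of the MATCH behaviors described above (hypothetical parent/child tables; PostgreSQL rejects MATCH PARTIAL at constraint creation):

CREATE TABLE parent (a int, b int, PRIMARY KEY (a, b));

CREATE TABLE child (
  a int,
  b int,
  -- MATCH SIMPLE (the default) skips the check if any referencing column is NULL;
  -- MATCH FULL additionally requires the referencing columns to be all NULL or all non-NULL
  FOREIGN KEY (a, b) REFERENCES parent (a, b) MATCH FULL
);

INSERT INTO child VALUES (NULL, NULL); -- allowed under both SIMPLE and FULL
INSERT INTO child VALUES (1, NULL);    -- allowed (and unchecked) under SIMPLE, rejected under FULL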
One can imagine a further branching point in behaviors for foreign keys and I'm not sure how MATCH PARTIAL will be defined. One mode might be easily implementable with the new index method, if we want partial foreign keys to be able to reference partial keys in the referred table, i.e. the partial key with NULLs is considered an actual row key. But we might also imagine another partial matching mode where the non-null components of a compound foreign key have to all exist in at least one key in the referred table, while ignoring the null fields as incompletely formed references. This would be an improvement on MATCH SIMPLE which allows the non-null columns in a foreign key to hold values which could not possibly match the referring table no matter which values are filled in for the remaining null fields of the foreign key. However, this enforcement would not be easily implemented without a powerset of indexes for all partial combinations allowed in the referencing tables!
> anymore than there is a clear winner between tabs and spaces.
But there is a clear winner here. Spaces won: almost all code files use spaces. Of course everyone uses the tab key on their keyboard, and the IDE makes spaces for you; no one is going to manually press space 4 times.
There is no winner, both tabs and spaces are widely used. However with the proliferation of better tools the question thankfully became mostly irrelevant. I don't care if it is spaces or tabs, as long as it is consistent across the codebase (which CI enforces with autoformatters and linters) and as long as my editor (which obeys the project's editorconfig) uses the same convention.
MSSQL considers nulls to be equal, everyone else considers them to be distinct (aside from a few DBs which just refuse to create UNIQUE indexes on nullable columns), and Oracle is apparently huffing glue in a corner.
> As an example of the irreconciliable weirdness of NULL, consider that "NULL = NULL" is false, and so is "NULL != NULL", while rows with NULL still group together in GROUP BY.
Group by is the exception here. And arguably it is so because the alternative is never useful, or close enough as to not be worth it.
So of the three major SQL implementations, you have one of them following the results of the settled debate, one of them being an exception and one of them huffing glue (whatever that means). In addition, there's a language exception even in the settled group, which you have to handwave away.
And Oracle considers null to be equal to the empty string. My understanding is Oracle was created before the SQL standard existed and they don't want to change because of backwards compatibility. I don't understand why they can't add a db parameter to allow you to set how nulls are handled.
IDK if you ever saw the HN thread where they talked about oracle and change tests from a few years back, but if I had to guess, the pain just isn't worth the overall cost.
Overall I remember running into this but finding it fairly simple to add a couple specific checks around the condition where it came up in our code.
They would need to change their storage format to be able to distinguish between NULL and empty string, and probably a lot of in-memory structures and even more code logic as well. It would be hugely expensive to support both.
Oracle introduced VARCHAR2 in the 90s(?) to mean “Oracle semantics” as opposed to VARCHAR which is documented as “may change to standard SQL semantics in a future version”. However I don’t think they’ll ever materialize that distinction.
A slight tangent, but do you have any pointers on how one might get involved in either:
- The SQL standardisation process (or whether this is even feasible as someone who isn't involved in the development of a major database engine)
- Postgres development
The feature I am particularly keen to get accepted is trailing commas in SELECT column lists (potentially other places too, but SELECT lists would be a very good start). And there are potentially a bunch of other improvements in the syntax sugar department that I might be interested in spearheading if I could successfully get this accepted (e.g. being able to refer to select list column aliases in GROUP BY clauses).
I don't have much experience with C, but I would potentially be up for actually implementing the change if I could get buy in from the maintainers.
> The feature I am particularly keen to get accepted is trailing commas in SELECT column lists
Not "GROUP BY THE OBVIOUS LIST OF EXPRESSIONS, YOU KNOW, THE ONES FROM THE SELECT CLAUSE THAT DON'T CONTAIN AGGREGATE FUNCTIONS"? Aka "GROUP BY ALL" from DuckDB.
Postgres project is always looking for contributors. The project has a very nice, detailed, well-defined documentation on how to contribute [1]. The community is very welcoming, and has a process (CommitFest[2]) in place to ensure all submissions get their due attention.
FWIW, I am starting to make an effort (e.g. [3]) towards helping the newcomers get their work in shape to be acceptable by the committers.
Note: I'm biased towards Postgres community, since I've worked with them for many years. So others' opinion _may_ differ, but it's highly unlikely.
> Personally, I think that the existing PostgreSQL behavior (NULLS DISTINCT) is the "right" one, and the other option was mainly intended for compatibility with other SQL implementations.
Care to explain why you think NULLS DISTINCT is the "right" default behavior? What problems does it solve to warrant additional complexity by default?
I think it's worth pointing out what the opposite means - to say that "this field might have a value, or it might be null - but only one tuple/row can set this field to be null" implies that you have given the null option some kind of real-world value.
That is to say, you are using null to encode some kind of meaning, when really this is not what null is supposed to be used for.
That's not to say I think we should be morally disapproving of people who do that - I use things for their unintended purpose all the time, and it bugs me when people get on a high horse. Use what makes sense to you - and I love this change for exactly that reason.
But the general theoretical approach is that if you want to care about the value in a field, you need to give it a value. Null is for the valueless, and if that isn't an allowed state, you should simply set the field as not-nullable. Theory is fine in theory. For a practical database that exists in the real world, this option is a good addition.
Fauna has an interesting approach to null that I've grown to like:
Null is a real value that can be compared intuitively like any other value (`Equals(null, null)` returns true). [1]
However, in indexes, any term that evaluates to null is simply not stored in the index. So we can create a unique index with multiple null values because they simply won't exist in the index, so won't violate the unique constraint.
If we do care about null values in a particular index, we can handle that by mapping the null value into another value that will get stored in indexes using a "binding", and then use that index to query for or constrain uniqueness by that mapped value [2].
This is not the most convenient thing in the world, but at the end of the day, it feels like an edge case, so I'm happy with having to jump through some hoops to handle it in exchange for making the much more common cases (null comparisons and unique indexes with nullable values) more intuitive and less error prone.
I think a lot of the trouble stems from the fact that databases do not have sum types and there is no way to encode a "None" type without hacks. NULL is the only reasonable option in a lot of cases.
Yeah, in hindsight the world would be a slightly better place if SQL didn't include NULL, but instead "UNKNOWN" and "NONE", with "NONE = NONE" being true, and "UNKNOWN = UNKNOWN" being unknown.
Or have an Option (or Maybe) type. This is how modern programming languages solve the nil issue, I don't see a reason why a database couldn't take the same approach.
I'm not the person you're responding to, but also think NULLS DISTINCT makes sense in many cases. NULLs often represent missing data. Imagine storing a customer's address and the street name is NULL. If several customers have a street=null, it doesn't mean that they have the same street. So from a data perspective, it makes sense to treat these unknown values as distinct.
For filtering and aggregation I still welcome the change, as it makes sense in many cases to treat them as not distinct.
Thanks. It seems that the issue arises from value equality vs optional value state equality.
To me, a more natural way to treat NULLs is to think of NULL not as a value, but as a state. Several customers with a street=null all have street property in an equal state. However, an equal state doesn't mean the value is also equal, as there is no value in that state. Option type in functional languages models this perfectly:
type 'a option =
    | None        // no value here
    | Some of 'a  // value of type 'a
So, when checking whether customers have the same street, we need to check the value. So comparison is only valid if street has a value defined. Adding a null check in addition to the equality check to ensure that the actual values are equal feels most natural to me (as this is how it's done in most languages).
Unique constraints should not be a problem. Unique constraints are about values. So if state is NULL there is no value and so constraint does not treat NULL states as equal values. Adding a keyword to change constraint behavior when needed, would be best IMO, as it would need to be rarely used.
> So, when checking whether customers have the same street, we need to check the value. So comparison is only valid if street has a value defined. Adding a null check in addition to the equality check to ensure that the actual values are equal feels most natural to me (as this is how it's done in most languages).
Is it tho?
Languages with ubiquitous nullability will accept both being null or both being non-null and the same value. Languages with option types do the same (as long as the value is equatable obviously), this includes OCaml: https://v2.ocaml.org/api/Option.html#preds.
The only languages I can think of which seriously diverge would be C and C++ where this is UB.
Maybe if you way stretch it Java because it doesn't have operator overloading, but even there there's Object.equals which is null-safe... and returns `true` if both parameters are `null`.
I completely agree. I tried to explain basically the same thing as you, but it seems there is some ambiguity in my example I didn't intend.
Comparing two nullable/option values when both are in a "null state" will return true in most languages. Which, to me, is the most logical result as they have the same state. So, if you want to check if two people actually live on the same street, it's not enough to check p1.street == p2.street, but you need to also check that at least one of the streets is not null. Two nulls would return true, but if equality returns true and we know that one of the streets isn't null, we know we are comparing actual street values (not null states). Sadly, in SQL, NULL=NULL result is not true, but NULL.
Not OP, but this constraint means "the data must either be empty or unique", which is an extremely common constraint. In contrast, I've never encountered "only 1 entry is allowed not to have the data, all others must be unique".
"only 1 entry is allowed not to have the data, all others must be unique" makes sense in larger schemas as the database grows. It replaces is_default_address, is_primary_child etc. fields and constraints on relational tables. In the perfect world all the data would be normalized not to have such columns and cases, but in real life it just grows that way. So for some cases NULLS NOT DISTINCT will be a welcome addition.
Not parent commenter, but to me it seems "obviously correct".
As explained in the article NULL means "unknown". Let's say I have two people, and a "tax identifier" column. Let's say two people can't have the same tax id. Obviously "don't have one" or "unknown" fit very well in NULL, and two people can both be missing a tax id, even if those who have it need it to be unique.
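In DDL terms the default expresses that directly; a small sketch with a hypothetical person table:

CREATE TABLE person (
  id     bigint PRIMARY KEY,
  tax_id text UNIQUE  -- default NULLS DISTINCT: any number of people may have no tax id
);

INSERT INTO person VALUES (1, NULL), (2, NULL);    -- fine
INSERT INTO person VALUES (3, '123'), (4, '123');  -- violates the unique constraint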
Except that's the ambiguity of NULL in the standard. It doesn't just mean "unknown". It can also mean "no value". And you can clearly compare two elements with no value.
Yeah that's true. All the database can say is "I don't have a value here". Not "there exists no value for this column".
But this doesn't mean that two NULLs are the same. Two people with NULL tax IDs are not the same (one turns out to be unknown but existing, the other is a toddler without a tax ID). Ten people with NULL address don't live in the same house. Half declined to give address, the other half are homeless. Either way we can't send mail to them.
If you use the database to have a UNIQUE constraint of "only one purchase per household" then it makes no sense to allow NULL addresses, but only allow the first NULL address purchaser to buy the thing.
Or "sorry, in this hotel we only allow one guest at a time without a car, and someone else already doesn't have a car".
Does that guest without a car actually have a car? I don't think that's something that the database can solve.
Should databases have two separate values for unknown or no value? That sounds like a world of hurt, with two types of null.
> Should databases have two separate values for unknown or no value? That sounds like a world of hurt, with two types of null.
It’s also a non-problem in practice, because if an application needs to distinguish between multiple types of nulls, it can very easily just use an extra column holding the information needed to disambiguate.
Similarly, the need for non-boolean columns is a non-problem in practice, because an application can just make an extra column for every bit it needs to store. If it needs strings or arbitrary-width numbers it can just join on other tables that store those bits.
(Existence of a workaround doesn't turn a problem into a non-problem)
I stand by it being a non-problem in practice because the workaround is trivial, and the supposed solution adds complexity everywhere.
There isn’t even a universally agreed methodology for how to handle nulls in the relational model in theory.
It’s a semantic problem that can only be solved with reference to the requirements of a particular use case, and when you do that using explicit extra columns, the problem evaporates.
Which happens in the world of JavaScript. It has null and undefined. A null variable is a variable / property that's defined and set to null, while undefined is a variable / property that's not defined yet (or maybe not initialized, I forget).
I hope SQL will at least give another comparison operation, such that the current equal sign (=) means a.prop is not null && b.prop is not null && a.prop equals b.prop (for the "not defined yet" case).
Let's say another sign like a.prop *= b.prop, meaning (a.prop is null && b.prop is null) || a.prop = b.prop (for when our intention is that both values are empty).
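For what it's worth, standard SQL (and Postgres) can already spell that second, null-equal comparison as IS NOT DISTINCT FROM; a rough sketch with hypothetical tables a and b:

-- true when both sides are NULL, or when both are non-NULL and equal
SELECT *
FROM a
JOIN b ON a.prop IS NOT DISTINCT FROM b.prop;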
> It doesn't just mean "unknown". It can also mean "no value".
Isn't the canonical representation of known-no-value an absence of a tuple? Like as opposed to saying "There exists an employee X who works in department NULL", you simply don't make any claim about employee X working in a department? After all, when enumerating the members of a set, you're also omitting the enumeration of non-members of a set, and the law of excluded middle applies.
It's probably a good default because it's more consistent with how NULL equality is handled in SQL generally. From the article:
> In Postgres 14 and prior, unique constraints treated NULL values as not equal to other NULL values.
...
> This is consistent with the SQL Standard handling of NULL in general, where NULL is unknown. It is impossible to determine if one unknown is equal to another unknown. Because NULL values are of unknown equality to one another, they do not violate UNIQUE constraints.
Your last statement would be better written as:
Because NULL values are of unknown equality, they may or may not violate UNIQUE constraints.
The safest implementation would assume that they do violate the constraint, rather than the current behavior assuming they don’t.
In practice I don’t think I’ve ever added a unique index on a nullable column where null might imply unknown. I have used it in cases where null meant “none”.
> Care to explain why you think NULLS DISTINCT is the "right" default behavior? What problems does it solve to warrant additional complexity by default?
It's the most consistent with the equality behavior of null values elsewhere.
I generally agree that, in most cases, you want the NULLS DISTINCT behavior. But thank you for providing such a developer-friendly feature that allows flexibility here!
NULL in SQL is not terribly consistent overall. Sometimes NULL is treated like "unknown" and sometimes more like "n/a". And the interaction with non-scalar types (like records) is pretty strange. Also, it's common for ORMs to map the app language's NULL (or None/Nil/whatever) to SQL NULL, which adds its own nuance. So I can see this being a useful feature.
Well, NULL is most often meant to signify an unknown value. You can't compare one unknown value with another and say they are the same.
The standard is, I believe, ambiguous about NULL because it can also mean an absence of data. In other words, it is "this is not yet determined" (e.g. a CAR table has a COLOR column, but the paint is applied as the final operation, so NULL could be used because the car is not at this stage yet and the color will be chosen later). In this case, you can compare NULLs against other NULLs, because you can have two cars that are in an indeterminate state. In that case, it is useful to compare NULL values as they mean the same thing.
An example where NULLS NOT DISTINCT might make sense in a constraint is when a table is used to store an analytical cube roll-up (e.g. the output of GROUP BY ROLLUP), where NULL in a dimension column has special meaning (it indicates subtotals). Having multiple rows with the same NULL key could, in some of those cases, be an error leading to double-counting.
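A rough sketch of that situation (hypothetical sales tables, PG 15 syntax), where the constraint rejects a second subtotal row for the same key:

CREATE TABLE sales_rollup (
  region  text,     -- NULL here means "subtotal over all regions"
  product text,     -- NULL here means "subtotal over all products"
  total   numeric NOT NULL,
  UNIQUE NULLS NOT DISTINCT (region, product)
);

INSERT INTO sales_rollup
SELECT region, product, sum(amount)
FROM sales
GROUP BY ROLLUP (region, product);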
Once you've accepted that X=X doesn't return TRUE when X=NULL, NULLS DISTINCT is consistent with that. ANSI NULL three value boolean algebra is insane, but it's insane in a pretty consistent way.
So, we have a "standard" which says that both behaviors should be supported, and there should be two different ways to express them, but explicitly declines to say which does what?
I'm genuinely not sure what purpose this particular standard serves, but it looks like it's not allowing people to write portable code...
I can't imagine the standard does not specify that NULLS DISTINCT should treat nulls as distinct and NULLS NOT DISTINCT should treat nulls as not distinct.
Ah. So there are three ways to write it, and explicit ways to call for either behavior. That's sane -- just wasn't what I got from the original writeup.
Playing around in a docker image of the beta build, it looks like this allows you to add a `unique nulls not distinct` constraint to a composite set of fields, but still does not allow you to specify those same fields as a primary key. For example
-- This works
alter table announcement_destinations
add constraint announcement_destinations_pk unique nulls not distinct (sender, receiving_group, optional_filter);
but
-- Still does not work
alter table announcement_destinations
add constraint announcement_destinations_pk2 primary key (sender, receiving_group, optional_filter);
fails with a message like "ERROR: column "optional_filter" of relation "announcement_destinations" contains null values". Is there a motivation for this distinction?
Currently nulls are considered not distinct in postgres, no? Doesn't this mean anyone upgrading will have to fix up all their table definitions? Or am I wrong about this?
I just care about the backwards compat. To be clear, the second I saw this I thought "finally!!!"
This is great! I have multiple places in our code where I've had to define multiple unique indexes (with conditionals) to get around this. It's definitely confusing behaviour the first time you encounter it.
Honestly I think SQL already breaks its own standard of "NULL = unknown". One example: if you do a LEFT OUTER JOIN and the joined data is missing, you get NULL in the result set. But in that case, it's not "unknown" whether there are records that can be joined to; the answer is that there are definitively no records there.
NULL is generally considered by most programming languages to mean "not present" rather than "unknown", and in many cases it is treated that way by SQL in practice. The ship has probably sailed on actually changing the default behaviour, but optional flags like these are a good step.
NULL means "I don't have this data". It doesn't guarantee that the data doesn't exist anywhere.
It's semantically ambiguous, yes.
But it's not ambiguous that the database doesn't have the data. It's very clearly NULL.
If "middle name" is NULL, then the database itself can't tell you if a person doesn't have a middle name, or if it's merely not yet been told the middle name. You could differentiate the two by saying empty string for "has no middle name"… except on Oracle VARCHAR2, where NULL is equal to empty string.
> You could differentiate the two by saying empty string for "has no middle name"… except on Oracle VARCHAR2, where NULL is equal to empty string.
This is what kills me.
This seems like an obvious case where multiple flavors of NULL would make perfect sense. For example Ticket.CancellationDate isn't "unknown", it's this ticket is not cancelled. But SQL standard doesn't have a way to express "empty value" as distinct from "unknown value", so we're stuck working around the strange implications of using this square peg to fill a round hole.
Oracle's bizarre "empty VARCHAR2s are NULL" is strange but consistent with every other datatype, none of which offers a way to say "empty" rather than "NULL". 0 is not the same as NULL int and is not the same as empty int. Mindate is not the same as NULL date and is not the same as empty date.
Well, for the string case the empty string is… the empty string.
When storing the exit code and stdout of a program run, Oracle lets the state "it's not finished yet" be represented by a NULL exit code, but you cannot say the same for the string.
"Unknown" is not a perfect map to NULL. "I have no answer for you" is better.
Maybe NULL values could use an annotation.
But I think that would make things worse. Like let's say you have some query that joins addresses with coordinates. You know that semantically any address without a coordinate has an "unknown" coordinate, not that it authoritatively doesn't have coordinates.
But not only do you need to change the data model then, but the query language too, to be able to express the difference where absence means one or the other.
Now absence is NULL, and it's left up to the application to interpret.
SELECT a.address, b.coord
FROM a
LEFT OUTER JOIN (SELECT address, coord FROM coords WHERE is_test = FALSE) AS b
  ON a.address = b.address;
Or maybe some coordinates are missing because their locations no longer exist. The house has been demolished, so it's not "unknown" where the house was, it's just that it no longer has a location. And how would you join that data.
I don't think I've encountered a problem in practice with SQL in differentiating empty from unknown. It's out of scope of what SQL tries to solve, and that's fine.
That's fair - as I said in another comment, it breaks down with keys. I can't think of a great way to model "this row definitively does not have whatever is referenced in another table". Using anything other than NULL for that means you can't use proper foreign keys.
NULL means that the value is unknown, not that the data is unknown. You indeed have no data but its value may exist. You just don’t know it yet. NULL is very powerful when you understand that.
That seems like a fairly big assumption on how I'm modelling the data though. Particularly in the LEFT JOIN example, I'm not sure if I agree with the idea of "a row not being present means that it's unknown whether it exists in the real world" depending on the underlying content being modelled.
For a string column, you do have blanks vs. NULLs, so it's fairly easy to distinguish between "John definitely doesn't have a middle name" vs "It's unknown whether John has a middle name", but that doesn't hold true with most other datatypes.
There are two "blank" integers: 0 is the identity element under addition, 1 is the identity element under multiplication. Or worded differently 0 is a blank addend, 1 is a blank multiplier.
‘’ would be the identity element for string concatenation, or any other string manipulation method I can think of, so it has a stronger claim on being blank than a 0 has.
Unless your string is later parsed as something else. A blank JSON value would look differently from a blank phone number.
It depends on your column. If you have a products table and a certain handmade product has no UPC, it's correct to use '' as its value rather than NULL because you know for a fact that it's non-existent. Some people respect NULL too much.
It is mostly true but not always, eg in the output of GROUP BY ROLLUP, a NULL value is used to indicate rows representing subtotals / partial aggregates
This is amazing. We moved from Oracle to Postgres after being on Oracle for 15+ years. That was a huge move that took 2 years. One of the biggest hurdles was rewriting thousands of sql queries. One of the biggest things we had to change was accounting for how oracle handled nulls versus how Postgres handled nulls. However, we missed this nuance about how Postgres treated nulls and unique. It wasn’t until about 2 years into using Postgres, and running into so many issues, that we finally stumbled upon this nuance in the documentation. Crap. That required making a bunch of more changes to account for this. At this point I’m firmly convinced that null handling is the hardest part of database work ;-)
Since we have some PG devs here: Can we please have a way to reorder columns?
Coming from MySQL, this is one of the first missing features one hits. It's like moving to a new house where you're told you can never clean or move the furniture without moving to a new house.
It leaves the unfair impression that this is a "toy" db. "What other basic features is it missing if it doesn't have this?" Please don't think of it as trivial; first impressions are very important.
Better to channel your displeasure into advocating for a future SQL standard to include this feature. Other DBs that support this do so via proprietary syntax extensions.
Technically it doesn't require syntactic extensions; postgres stores the column index as the `attnum` column of the `pg_catalog.pg_attribute` system table.
So this could be made to work by hooking into the storage system and rewriting the table and all pointers to the table when `attnum` is updated.
Without this rewriting this only works if the table was just created and has no data, and no external metadata referencing the columns themselves e.g. views, fks, indexes, defaults, rules, …
An alternative would be to add automatic packing (à la rustc) to postgres, decorrelating the “table position” and the “physical position” of the columns; this would also allow free “table position” reordering.
And while it’s by far the most complex option, one of the nice bits with it would be that the system columns could be packed as well. Currently there’s quite a bit of waste because there are 6 system columns (by default), the first 5 are 4 bytes, but the 6th (ctid) is 6 bytes, meaning 2 bytes of padding if your first column is a SERIAL, and 6 if the first column is a BIGSERIAL (or an other double-aligned column).
> And while it’s by far the most complex option, one of the nice bits with it would be that the system columns could be packed as well. Currently there’s quite a bit of waste because there are 6 system columns (by default), the first 5 are 4 bytes, but the 6th (ctid) is 6 bytes, meaning 2 bytes of padding if your first column is a SERIAL, and 6 if the first column is a BIGSERIAL (or an other double-aligned column).
FWIW, system columns aren't stored as normal columns in tuples. Some of them are implied (e.g. tableoid doesn't need to be stored in each tuple, ctid is inferred from position), others are not stored in the way normal columns are stored (e.g. xmin, xmax).
Couldn't an ordering column be added to pg_attribute that determines in which way the columns are sorted when they are displayed? Standard SQL code can then be used to manipulate the display order of the columns.
There are like 5 different feature wants I could add like this. A columnar database engine comes to mind (slightly more work involved), and poaching a few SQLite QOL features.
Ultimately Postgres is a community driven product, and people want to work on what they want to work on.
I don't get angry that some kind soul hasn't volunteered their Saturday for free for me.
That's not exactly true. You can write your own patch for a feature and there's a good chance they'll decide they don't want it. It's their prerogative, but also slightly upsetting when you know they're blocking features you want. Just observe the bike shedding over a trailing commas patch [1].
There are lots of good reasons in that thread why the patch was declined.
The patch made things more lenient which makes things less compatible (this whole thread is about ANSI SQL standards and they do matter) and changes what was an error behavior to now silently succeed.
I read the thread and have to disagree with "lots of good reasons". I'm pretty sure there are plenty of things in the postgres dialect that are not cross compatible with any other dialect.
I agree with curating patches in general, but it's a double edged sword and inaccurate to say you can just spend your weekend to get something you care about.
Oh, absolutely; I'm just trying to raise the awareness/importance of this missing feature. Most devs that I know IRL that looked at PG had a very similar reaction to mine.
I'm receptive to the feature. But wouldn't it require locking the full table as it works? Seems like it would hurt availability of the biggest databases that would benefit most. Perhaps this is why it is thought somewhat impractical and therefore low-priority.
Depends on how the PG team decides to implement this feature. If they're OK with a virtual order (for "UI" purposes) while keeping the true order hidden then this would not require table locking (or it would be a very quick lock). If the order reflects the real / physical positioning then yes, it would probably require locking for a full table rebuild.
I like to have my `*_id` columns at the front, and `created_at` etc. at the back. And in the middle all the fields by descending importance. Just a personal habit so I can quickly lookup data in a table.
There is some small benefit to playing column tetris if you have columns with different sizes that waste space due to padding. I'm not convinced this would be worth the complexity of this feature, but in some cases reordering columns might have measurable benefits in reducing the size of the data.
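A toy example of that effect, assuming PostgreSQL's usual alignment rules (boolean is 1 byte, int 4, bigint 8 and 8-byte aligned); the exact savings depend on the platform:

-- roughly 24 bytes of data per row: 1 + 7 padding + 8 + 4 (+ trailing padding)
CREATE TABLE padded (flag boolean, big bigint, small int);

-- roughly 16 bytes: 8 + 4 + 1 (+ trailing padding), same columns reordered
CREATE TABLE packed (big bigint, small int, flag boolean);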
I'm not talking about the physical layout of the data, just a thin UI layer that the DB tool could use. Maybe we could have two modes: physical vs UI ordering?
Views are not a good solution for this. The point is to be able to see a specific order in 3rd party apps and when writing quick ad-hoc queries (select *). Writing views for every table would just pollute the db.
Based on my understanding of DB storage (which is decades old and as I write the comment my explanation seems stupid but it’s what I was taught. Please someone correct me).
Imagine a table with the following:
id (int), title (varchar), created_date (date)
That data is stored in a similar order on disk (is this still true?).
So, if the initial insert has a title of 10 characters, then the date will get placed after those 10 characters on disk:
Int, 10 characters, date
Later, the row is updated with a longer title of 500 characters.
Because it’s longer, you now have to get more space for the title on another part of the disk.
Since there are columns after the title, it’ll either leave dead space where the old 10 character title was or it’ll need to move those other columns too.
I can’t remember which I was taught. I just remember that columns after variable columns get impacted when the data in a preceding variable column changes.
If the title were at the end of the row, the db could expand the data without needing to move other columns over (if there happened to be available space right there).
If you were updating with a shorter value, it could shorten the row by moving the terminating character to the new shorter location, freeing up the space the longer title was using.
Bottom line is you want to keep variable columns which get updated most frequently towards the end of your table structure.
Knowing which column that would be isn’t always clear when creating the table.
Again, this understanding was taught to me over 20 years ago. I’m probably remembering parts of it wrong and DB storage has likely (hopefully) advanced since then.
> That data is stored in a similar order on disk (is this still true?).
There should be no requirement for this. Columns in relations are not conceptually ordered, so it shouldn't matter for the things you're doing with the data anyway, and the database should be able to reorder the data in whatever way it likes, since desire to isolate the user from physical data structures was one of the main reasons for the rise of RDBMS.
1. That’s not exactly true, AFAIK all RDBMS return columns in table order when order is unspecified (`*`), and while they could reorder on retrieval
2. postgres definitely does not, and column tetris is absolutely a thing in the same way struct packing is (with the additional complexity of variable-size columns)
But the order of attributes in the presentation of the result relation has nothing to do with physical layout of base relvars -- primarily because any relationship between the two is purely coincidental. The vast majority of useful queries will not reuse the order of attributes in base relvars, so optimizing for the unusual trivial case by prohibiting better rearrangements that could be useful for a much larger number of use cases seems rather pointless.
And what PostgreSQL does is of course an implementation detail of PostgreSQL.
Why would you be trying to prevent the DB from reordering the data? You're not supposed to have better knowledge of what's good for your use case than an RDBMS that can collect usage statistics on queries and such. Ditto for compilers rearranging structures and such. When you start having hundreds of tables and thousands of queries, I don't see how you can do a better job than an automated system at that point.
DB storage is a lot more sophisticated than what you were taught.
Most databases use Slotted Pages to organize storage. Pages are fixed size and numbered by their offset within the database file. The page header contains the number of rows, followed by an array of offsets for individual rows within the page. Rows themselves generally are stored at the end of the page filling downwards. The storage engine can move around rows in arbitrary ways to consolidate free space.
Fundamentally there's no connection between SQL schema order and how table storage is organized on disk. For example in a column store there's often no contiguous row stored anywhere, instead there's just separate indexes per column.
Soothing my OCD is important, but it also helps when working in a team and having a shared ordering style. For example, we have certain types of columns that are always at the bottom. One would look there first. With PG you're forced to scan all the columns every time.
Ideally, postgres would play column Tetris behind the scenes and store the columns on disk in the most appropriate way, while allowing the representation to be changed at will.
With SQL Server the management tool does give you a way to do this. Yes, it does a table rebuild behind the scenes. The point is that it's easy. Don't have experience with the other two, but MySQL is the most popular so it kinda sets the tone whether we like it or not.
Well, writing a procedure that rebuilds the complete table in Postgres or Oracle is easy as well. I never needed this, but I am sure, there are some sample implementations out there.
Rebuilding the entire table doesn't seem feasible for large tables to begin with. Especially with a lot of incoming and outgoing foreign keys.
I disagree that MySQL is the most "popular".
It might be the "most used" one because so many web hosting services have included it by default for ages.
Tangential, but there are almost certainly more SQLite databases in existence than every other RDBMS put together, probably by 3 or 4 orders of magnitude.
I've worked with DBs for 20+ years as well. This is a quality of life type of improvement. If you've worked mainly with DBs that don't make this easy it's hard to know what you're missing. Do a search for column reordering for PG and you'll get a ton of hits.
The problem is that we hear a lot of different features touted as "the crucial missing one" -- I would definitely put this into the polish category, but items in that category are also important; especially for those with a MySQL background.
which recently has been picked up again -- that's awesome to hear
I think the main lesson from this is you should use a NOT NULL constraint on just about every column of your database, since the behaviour of NULLs is weird and likely to indicate a problem with your data.
NULL is totally fine, and happens naturally if you make an outer join. But if you are making a UNIQUE constraint on a nullable column, then you may have a problem.
The idea with NULL is that it is not a value, it is the absence of value, like a field you didn't fill in a form. For example, if you ask two people their personal information, and neither specified their email address (email=NULL), you can't assume they have the same email address. And if you put a UNIQUE constraint on that email field, you probably don't mean that only one person is allowed to leave it blank: either you make it mandatory (NOT NULL), or you let everyone leave it blank.
The reason nullable and unique are rarely seen together is that unique is typically for keys, and generally, you don't want rows that have no key value. Also, putting a unique constraint on something that is not a key may not be the best idea. For example, if you don't intend to use email to uniquely identify people, what's the problem with two people having the same email?
NOT NULL really should be the default. There are of course valid reasons for allowing NULLs, but NOT NULL is much easier to handle for most cases. I don't think nullable columns indicate a problem with the data, but nullable columns that don't have a good justifications are certainly trouble.
The fun bit is this mistake gets replicated in every definition language, sometimes made even worse (e.g. openapi where fields can be omitted by default, and you can additionally make them nullable).
Here, "None" means "no value specified", and "Some(None)" means "a value is specified, and that value is null." And "Some(Some(my_string))" means the value is "my_string".
There are other ways to represent this in Rust, some of which might be clearer. This representation seems to be used most often if you have a type "MyType", and you want to automatically derive a type "MyTypePatch" using a macro.
Larger enums to represent and discriminate. Null is just syntax sugar for the one-kind of absence case, and multiple-null-likes eliminates all benefits of that syntax sugar, so you might as well revert to a real discriminator.
Ideally you can force the use of the discriminator… but that depends on your type system
The problem with flattening the two absence cases into an enum is composability and generality, because it forces the "outer null" case to intrude into concrete types. Say you have a key value mapping Map<String, T> and you represent changes to this map as Map<String, Option<T>>. One day T itself is Option<Int>, so you end up with Map<String, Option<Option<Int>>>. If you want to use e.g. Option2 for that latter case you lose generality.
Where the "double absence" issue comes up in practice, it's usually in a context where it does make sense to represent and handle the first type of absence separately from the second type.
In cases where the field is required I would agree. But in the context of an update request, {"foo":null} and {} might both be valid with different semantics. "Update this field to null" vs "don't update this field"
That's just plain wrong. You need some value to populate optional fields, and NULL is a pretty decent choice to indicate that a string, date or number has not been filled in. If you try to fix that by having special string or numeric constants, you're just making life harder, especially when you export the data. The poor souls that import it will think "hey, this timestamp says 1-1-1970, so this probably happened at Jan. 1, 1970."
The problem is: NULL is a perfectly valid value in many situations. There are non-boolean fields that can just be empty; my go-to example:
-- Table definition for employee
name, surname, date_entry, date_exit
Everyone has names, and if he's an employee, he probably has an entry date...but until he leaves, what's the exit date?
Other than soothsaying, my choices here are: NULL, some magic value, or an additional field telling me whether the application should evaluate the field.
The latter just makes the code more complicated.
Magic values may not be portable, and lead to a whole range of other problems if applications forget to check for them; I count myself lucky if the result is something as harmless as an auto-email congratulating an employee on his 7000-something birthday.
That leaves me with NULL. Yes, it causes problems. Many problems. But at least it causes predictable problems like apps crashing when some python code tries to shove a NULL into datetime.datetime.
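A minimal sketch of that table (hypothetical types), with NULL carrying the "hasn't left yet" meaning:

CREATE TABLE employee (
  name       text NOT NULL,
  surname    text NOT NULL,
  date_entry date NOT NULL,
  date_exit  date  -- NULL until the employee actually leaves
);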
Be careful with this in Postgres: using a zero, blank space, empty array or similar as a default value instead of NULL can lead to enormous bloat in terms of disk space. There are distinct advantages of NULL that have to be considered.
He most probably means creating a data structure that does not allow NULLs (usually increasing your normal form works) not replacing them by a naive synonym.
Their statement is not “in the general case” though, it’s specifically for postgres.
Postgres uses a fixed-size null bitmap and variable-size rows, so a NULL value takes one bit in the bitmap (and additional nullable columns may require a wider bitmap), but they are skipped in the row itself.
Postgres uses 1 bit per NULL value, though the reality is more complex as everything is padded for alignment. So fewer than 8 nullable columns are free, and above that you pay probably 8 bytes at once (I'm not totally sure on the exact numbers here).
So for Postgres it is generally true that storing NULLs is very cheap in most cases.
I have struggled to implement a tree structure in PG with nullable unique values.
Consider a "docs" table where each doc has a unique name, under a given parent doc. A null parent would be a top-level doc, and top-level docs just have a unique name. This didn't work before, and would hopefully be addressed by PG15.
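With the PG 15 option this looks directly expressible; a sketch of what I mean, assuming a hypothetical docs table:

CREATE TABLE docs (
  id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  parent_id bigint REFERENCES docs (id),       -- NULL = top-level doc
  name      text NOT NULL,
  UNIQUE NULLS NOT DISTINCT (parent_id, name)  -- now top-level names collide too
);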
I'm not sure if null parents really represent a "problem with my data", or if the tree structure was too exotic for PG to support properly.
How I got around it: hardcode a certain root doc ID, then the parent column can be NOT NULL. But this felt janky because the app has to ensure the root node exists before it can interact with top-level docs. Plus there were edge cases introduced with the root doc.
When I store tree-structured data in a relational database I generally add a 'path' column which is a denormalized string containing the names of all the parents with a path separator between.
The biggest reason is it makes finding all descendants very fast with a prefix search (foo/bar/%) on the path column when it's indexed. It's not unusual to want to find or update all descendants because descendants are usually related. If you don't have a path column, then you need to write a recursive CTE, which is Slow, or recurse in your application, which is normally even slower. The reason they're slow is they require a number of seeks which is exponential in the depth of the tree.
It also makes lookup of a node from the path fast, and producing a qualified path to a node fast, but these costs are linear in path length.
Anyway, this path column is also a good place to put your unique constraint.
If you don't want to restrict names from containing the path separator, you can escape application-side. For example, if using '/' as a path separator, consider '::' to escape ':' and ':s' to escape '/' - don't use your path separator in the escape or it'll muck up prefix searches.
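A rough sketch of the shape I mean (hypothetical names; text_pattern_ops makes the extra index usable for LIKE prefix searches when the collation isn't C):

CREATE TABLE docs (
  id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name text NOT NULL,
  path text NOT NULL UNIQUE  -- e.g. 'foo/bar/baz'
);

CREATE INDEX docs_path_prefix_idx ON docs (path text_pattern_ops);

-- all descendants of foo/bar
SELECT * FROM docs WHERE path LIKE 'foo/bar/%';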
The way I'd do this, is by separating concerns into separate tables. If you have a table with (id, name) and a table with (id, parent_id), any doc with a parent will have a corresponding record in the second table.
Interesting! But where can you implement the UNIQUE constraint for doc names under a given parent? I guess your application code would need to handle that
Thanks for this example! As I was reading the post, I was thinking "cool, and feels like a thing people would expect, but does it have a real world usecase?".
This feels like an actual real-world thing people might want to do where indeed you'd want to have a single NULL value.
What do you use instead of null then? For example, let's take a purchase table that keeps track of when the purchase has been submitted to the supplier; let's call the column sent_to_supplier. What do you use for sent_to_supplier in case the purchase has not been sent to the supplier yet?
It's not a hard and fast rule, but in your example you can solve this in the model without needing NULLs. A table called ORDERS records the orders, SUPPLIERS lists your suppliers, and a table SENT_TO_SUPPLIER links ORDERS to SUPPLIERS, and you don't add a row into that table until the order has been sent.
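Sketching that shape with hypothetical column names:

CREATE TABLE orders    (order_id    bigint PRIMARY KEY);
CREATE TABLE suppliers (supplier_id bigint PRIMARY KEY);

-- a row only exists once the order has actually been sent
CREATE TABLE sent_to_supplier (
  order_id    bigint PRIMARY KEY REFERENCES orders (order_id),
  supplier_id bigint NOT NULL REFERENCES suppliers (supplier_id),
  sent_at     timestamptz NOT NULL
);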
What about this, let's say I have two electricity meters. They have bunch of fields (active power, reactive power, voltage, current, and all that times 3 (one such measurement for each phase)), but one kind of meter can measure distortions, the other can not.
I would make the distortion columns nullable for the practical reason that otherwise you have to duplicate all the work for querying the same properties from the meters (for example a dashboard that shows voltages only). If you want to support both types of meters in the same dashboard you would have to do a UNION type query (and never forget to do that!) if you store the measurements in a different table.
It's possible to handle these kinds of fields (at most one value) with separate tables and foreign keys. It has advantages around 'proving correctness'. But without extra tooling it has significant usability, and perhaps performance disadvantages. I wonder if you can get around some of this using virtual tables.
I come at this with little experience with SQL, and having worked a bit on ampersand[1], a tool where you declare the structure of your data, and some invariants, and the tool will create a schema for your database, and some automatic checking to ensure your invariants are upheld.
Rather than putting the two types of meters in two different tables and doing an error-prone UNION, you would put the distortion measurements in a different table joined to the first, and the meters that don't support it would simply not include entries in that table.
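Something like this, with made-up table names; meters that can't measure distortion simply have no row to join to:

SELECT m.meter_id, m.voltage, d.distortion
FROM measurements m
LEFT JOIN distortion_measurements d ON d.measurement_id = m.measurement_id;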
Boolean capabilities of meter per every entry? Shouldn't impact size that much, easy to index. As a bonus, you can track failure of a sensor per each meter, and know when it first failed.
Wouldn't a meter value that doesn't have distortions just have a value of 0 (that is, not null but just zero)? The properties of a meter in this case belong somewhere else: either in a "meter properties" table if you have many meters with many properties that can change, or just hardcoded.
Alternatively, a key / value table, e.g. `meter_id, property, value`. But that isn't very optimal, works better for things like a shop product with many different properties per product.
It doesn't mean there are no distortions, it just means they are not measured. So I vastly prefer knowing there's no data than some placeholder that "simulates" no data
The submission process might deserve its own table with more detail than sent/not sent. Otherwise it's a straightforward boolean, I don't see any use for NULL here.
Edit: Since it's a date, there are valid use cases for nullable timestamps. But only if there is no additional information attached to the event you're saving in the timestamp. Another complication with nullable timestamps is sorting by them, which you often do with timestamps. With nullable ones this can get messy.
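In PostgreSQL at least, the sorting mess can be tamed with an explicit NULLS FIRST/LAST; a small sketch against the purchase example above:

SELECT *
FROM purchase
ORDER BY sent_to_supplier DESC NULLS LAST;  -- not-yet-sent purchases sort to the end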
You could create a new table that keeps track of the logistics of the product. If the product is sent to supplier, a new row appears in this table with the date. If it hasn't been sent yet, no row is present.
More careful thought about the processes and data that are to be captured.
In a simplistic view maybe NULL is really the correct choice; but is it? Does NULL represent the desired properties?
Another common pattern is the use of some sort of sentinel value, e.g. the largest or smallest possible value for a date field might be used to indicate an unknown maximum or minimum that propagates.
A related pattern might be some sort of orders_suppliers table with a foreign key column; that might be NULL, or it could use a sentinel value pointing at a dummy supplier to indicate a special condition, along with an arbitrary number of in-band subsets which can be their own distinct matches.
These all seem like fine examples of solutions that are worse than the problem they were meant to solve. A separate table will be harder and slower to query and will enable M2M relationships even when you don't want those. Sentinel values must be dealt with at every query to ensure they are always filtered out.
If you care about ‘when’ you probably also care about resubmissions. Or metadata like the supplier’s response. Or who approved the submission etc. A separate table seems like a good solution.
A person with alopecia's hair color isn't "unknown", it's non-existent. They do not have hair.
Would you suggest that a database of appearance information about a person should have a separate subtable for "hair" to properly model this feature?
Either way, in the end, 99% of the time you're going to be displaying the data through a VIEW that LEFT JOINs all these cute normalized tables back together again; then you're going to want to filter those views dynamically in a nice grid the user is viewing, and the fact that X != X is going to bite you in the ass all over again.
Creating more tables is just moving the problem around, not solving it.
SQL uses [0] three-valued logic, and most people seem to reason using two-valued logic. So most people find the behaviour of NULL, the third logical value, weird.
> It's perfectly valid.
:o You'll get jumped on, saying this to a forum of software people. Lots of things that are valid are also weird.
That SQL NULL is valid doesn’t make its behaviour of not being equal to itself any less weird, because it works that way essentially nowhere else, and so it is highly unintuitive to developers who don’t live and breathe SQL.
The closest thing in most languages is NaN, and developers also find NaN weird.
> The closest thing in most languages is NaN, and developers also find NaN weird.
I'd argue the closest thing in most languages is the null propagation operator (usually using a ? symbol) that at least JavaScript and C# have.
YMMV, but personally the unique index behaviour was the main thing I found weird about NULLs. So now that I can turn that off I'm pretty happy with them.
> I'd argue the closest thing in most languages is the null propagation operator (usually using a ? symbol) that at least JavaScript and C# have.
In what sense? Null-coalescing operators don’t change how nulls behave or relate to one another; they only provide convenience syntax for conditioning operations on them, like SQL’s COALESCE.
It is kinda weird, if you don't know how it works. For instance:
Say you have a unique constraint on two columns (column_a, column_b). column_a is not nullable, column_b IS nullable. Obviously these values are unique:
(1, 2)
(1, 3)
(2, 2)
But these values ARE ALSO UNIQUE:
(1, NULL)
(1, NULL)
It is obvious in hindsight (NULL isn't equal to NULL), but can be quite a surprise.
...
Oh wait, postgres 15 deals with just this situation. Huh.
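Concretely, the PostgreSQL 15 syntax that turns that surprise off looks like this (using the column names from the example above):
create table t (
    column_a int not null,
    column_b int,
    unique nulls not distinct (column_a, column_b)
);
insert into t values (1, NULL);
insert into t values (1, NULL);  -- now fails with a unique violation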
Honestly it's what you want way more often than not. A unique index on a nullable column is usually representing some optional identifier. You wouldn't usually want only one row to be able to omit the identifier.
It makes lots of queries (like find-duplicates queries) much more complex without any benefit. It also introduces completely unnecessary three-valued logic into SQL.
With keys it very often means "definitively not present" in practice, and there's no other value that you can use to represent that if you want referential integrity.
In any thread involving SQL NULL, I see a lot of not-quite-right explanations of what SQL NULL is, conceptually. I challenge anyone who feels like they understand NULL conceptually to explain the following query:
-- find orders with a total less than $10000
select order_id, sum(price)
from orders o left join order_lineitems l using (order_id)
group by order_id having sum(price) < 10000;
This query is actually incorrect. Orders with no line items at all clearly total less than $10000, but they will be excluded, because: first, the left outer join produces a NULL for the price; second, the group aggregation with SUM over that NULL results in NULL; and third, the HAVING clause treats that NULL as false-like and excludes the order from the result.
Of course, we can explain procedurally what's happening here, and each individual step makes some sense. But the end result has no conceptual integrity.
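For what it's worth, the usual way to make the query match the stated intent is to fold the NULL back into a number, e.g.:
-- find orders with a total less than $10000, counting empty orders as $0
select order_id, coalesce(sum(price), 0) as total
from orders o left join order_lineitems l using (order_id)
group by order_id
having coalesce(sum(price), 0) < 10000;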
Extra challenge: explain why using COUNT instead of SUM in the query does correctly return orders with fewer than 4 items:
-- find orders with fewer than 4 line items
select order_id, count(price)
from orders o left join order_lineitems l using (order_id)
group by order_id having count(price) < 4;
PS: thank you to the author for a developer-friendly feature that adds flexibility here!
I’m not sure what explanation you want. NULL is in essence “not a value”, which works more or less the same way “not a number” does. So yes, if you perform operations between a NULL and a value you get a NULL; therefore, when you sum or compare NULLs with or to other things you get NULL, which is then treated as false in a boolean context.
If the price of one item is UNKNOWN (which is what NULL represents in SQL) then it stands to reason that the sum is unknown, and it is unknown whether the sum is or is not smaller than 10000.
> But the end result has no conceptual integrity.
> Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?
> Extra challenge: explain why using COUNT instead of SUM in the query does correctly return orders with fewer than 4 items:
You don’t need to determine the values in a group to count them. And in fact COUNT(price) skips NULLs entirely, so an order with no line items gets a count of 0 rather than NULL, and 0 < 4 is plainly true, so the order is included.
Because it’s very common for attributes to be optional but unique if specified. E.g. your fleet management might have a 0-1:1 relationship between employees and cars: no car can be associated with 2 employees, an employee can’t have two cars, but any number of employees can be car-less, or cars employee-less (a lending pool, or cars which have not been assigned yet).
This also fits perfectly well with the normal interpretation / treatment of sql null as “not a value”.
> This change prohibits nothing, it allows specifying something (which was a bit complicated to enforce before).
It allows specifying a constraint - that you can't have two rows with the same values even if one of the values is a NULL. That's prohibiting duplicate NULLs. The change allows you to prohibit duplicate NULLs.
Say you have a table (EmployeeName, CarID). You could do a UNIQUE constraint on those two attributes, but that would still allow:
EmployeeName | CarID
Jeff | 2
Kim | NULL
Kim | NULL
Here, "Kim" is car-less (NULL in the CarID field) twice, which makes no sense.
> It allows specifying a constraint - that you can't have two rows with the same values even if one of the values is a NULL. That's prohibiting duplicate NULLs. The change allows you to prohibit duplicate NULLs.
Sure.
> Here, "Kim" is car-less (NULL in the CarID field) twice, which makes no sense.
The relation makes no sense to start with, it didn’t need duplicate nulls for that.
That's not quite the same, since the previous solution of unique (EmployeeName, CarID) allowed for one employee to have 2 separate cars. You have to use conditional indexes to solve this.
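A sketch of that (table and column names invented; whether the partial WHERE clause is strictly needed depends on how your database treats NULLs in unique indexes, but it makes the intent explicit):
create table employee_cars (
    employee_name text not null,
    car_id        int
);
-- each employee appears at most once, so at most one car per employee
create unique index on employee_cars (employee_name);
-- each car belongs to at most one employee; car-less rows (NULL) are exempt
create unique index on employee_cars (car_id) where car_id is not null;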
It's a 0-1:1 relationship, so you'd have columns like "name", "email", "car", etc. on the employees table, and "car" might have a unique index (which is probably an ID referencing a "cars" table).
Since employees can only have one car and since a car can only be used by one employee, an unique index is fine, but there may be 20 people that don't have a car so you don't want NULL values to be unique.
Of course there are always ways to work around this, but this is the general idea.
Multiple NULLs in a unique index are useful when an entity has 0 or 1 of a simple scalar value. For example, you might have a column with an external ID, such as a facebook login id in your users table. Not all users have a facebook id, but those who do should have a unique value in that column.
You might want to look at a unique partial index: "CREATE UNIQUE INDEX foo ON users(facebook_id) WHERE facebook_id IS NOT NULL".
I am not sure if this really is a great modelling. If a user happens to have two logins in your table (maybe they signed up with two different emails), and then connects one of his accounts with his FB identity, what does it mean that the other entry cannot be connected to the FB identity?
Let's say you have a Person table with an SSN column. The column has a unique constraint because no two people have the same SSN. But the column is nullable, since not everyone has an SSN.
It would be weird to expect that only one person can be without an SSN.
I can't think of a situation where I'd want "only a single NULL"; that seems much weirder. Another comment described a tree data structure where it could make sense, but I can't think of others. Nullable unique indexes usually describe some kind of optional identifier; if it's present, you want it to be unique, but it's perfectly fine for more than one row to omit it.
You might want an index like this on some aggregate table. Like, shoe_sales_total, which is grouped by the id of the shop assistant that helped you with the selection. But customers that have marched straight to the till and asked for shoe polish result in records with a NULL for agent_id. With a regular unique constraint (month, agent_id), your aggregate table can end up with two records for (May-21, NULL).
Of course, the workaround for such cases is simple: have a sentinel value for unassisted sales and make the column not nullable.
Such "tables" would more typically be constructed as views (materialised if performance requires it) over underlying data. In this case the underlying data would have a row for each sale, with a normal nullable column for agent_id. The query that constructs the view would use GROUP BY, and since NULLs group together in GROUP BY it would work as you expect (grouping "no agent" sales together into a single record in the view). The database system thus handles ensuring data consistency, merely by using its features as they are designed to be used.
One way nullable columns are often used is like an option type. It's easy to see that you might want rows to be unique if they have Some(value) but allow multiple rows where the column is None.
I agree that this is often the use case but depending on how the query (as well as how sparse the nulls are, the total data size and partitioning scheme I guess) it's also possible to achieve this through a join-table without introducing NULL values.
Good point. I did try to say that whether that's a viable technique would depend on your query (but omitted some verb that should have made it better resemble a sentence in English :)).
But yes, even if no nulls are present in any table, outer join could (or maybe even should) introduce them.
Take the rows (1, NULL), (1, NULL), and (1, NULL), where each NULL signifies an absence of a value; those NULLs may later be filled in to become (1, 1), (1, 2), and (1, 3). That's why we may not know whether any two NULL values are equal to one another.
In some cases the above is actually important, whereas in others you may want to treat NULL as a "value" instead.
Coming from an MS SQL Server background, this behavior surprised me.
While SQL Server generally does treat null values as not equal in query predicates, for the purpose of unique constraints/indexes it’ll treat them as the same value. I guess this does go against the spec? But it makes sense as a human when looking at a tuple and deciding if it is “unique” or not.
On that note, now I wonder what SQL Server does with nulls in a DISTINCT or GROUP BY. I suspect that it’ll treat all nulls as the same.
An interesting question. They may be treated differently depending on whether the nullable field is one of the grouping fields, or is one of the selected aggregate fields.
> The background here is that the SQL standard was ambiguous about which of the two ways an implementation should behave. So in the upcoming SQL:202x, this was addressed by making the behavior implementation-defined and adding this NULLS [NOT] DISTINCT option to pick the other behavior.
When something that fundamental changes, what is the cascading effect? Do other features refer to the UNIQUE/NULL behaviour, or do they have their own versions of UNIQUE/NULL-like code?
Cool but this was trivially solved previously by excluding NULL from the index using a partial index.
I'm more interested in problems that have no great solution, like taking the first element of a set of partitioned data. Right now you have to partition & rank, and then select the rank 1 elements in a subsequent query.
I don't think you understood what this does (or else I misunderstood it), but if you do that, how are you going to get an index violation from having two NULL values in the index? Because that is what this allows: normally, indexing NULL values permitted multiple NULL entries, but with this option NULL is comparable to other NULLs, so you can have only one NULL value in the index.
Previously I had worked around this with a functional index on e.g. COALESCE(value_with_null, -1), but you don't always have a good sentinel value readily available. A more complicated index (..CASE WHEN..) would solve that, but it's nice to have this directly expressible, and it seems efficient as well.
> Cool but this was trivially solved previously by excluding NULL from the index using a partial index.
That seems like the opposite solution, treating NULLs as different in database systems which treat them as identical by default (which I think only includes MSSQL).
In postgres, and most DBMS, using a partial index excluding NULL allows any number of rows to have a NULL value… which is the same behaviour as if you’re not excluding NULLs.
The new opt-in behaviour is treating all NULLs as the same value, and thus allowing a UNIQUE index to constrain them.
Before, this required two different partial indexes, or hacks like COALESCE-ing NULLs to a sentinel for the constrained subset of the data.
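For reference, the pre-15 emulations being alluded to look roughly like this, assuming a single nullable integer column col on a table t:
-- option 1: two partial indexes, one for real values and one allowing at most one NULL row
create unique index t_col_uniq on t (col) where col is not null;
create unique index t_col_one_null on t ((col is null)) where col is null;
-- option 2: collapse NULL to a sentinel, if one is safely outside the real value range
create unique index t_col_coalesced on t (coalesce(col, -1));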
Maybe it's what you meant by "subsequent query", but you don't actually need to do it in a separate query; you can do the partition & rank in a CTE and select from the resulting table, all in a single query. It's quite tidy actually.
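Something like this, sketching the "first element per partition" pattern with made-up table and column names:
with ranked as (
    select o.*,
           row_number() over (partition by customer_id order by created_at) as rn
    from orders o
)
select *
from ranked
where rn = 1;
In PostgreSQL specifically, SELECT DISTINCT ON (customer_id) ... ORDER BY customer_id, created_at expresses the same thing even more compactly.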
I think this is a great change. I wish NULLS NOT DISTINCT was the default already - if I wanted the constraint to ignore null values I’d have used a partial index to only index non-null values. Doing the opposite is much more difficult.
This is cool, but wouldn't creating a constraint using a nullable column be considered a poor design decision? In which scenarios would this be a good idea?
When an entry can belong to only 0 or 1 related object. Not that I'd put a constraint in such a scenario, but I imagine a User can optionally have a Subscription, so subscription_id is either nil or present, and said subscription cannot be associated with any other User.
The foreign key only guarantees that the other entity exists.
The unique constraint ensures that only one pair of entities has this relationship, preventing a one-to-many binding.
The distinctness of NULL allows you to have multiple entities with the same NULL value without violating the above UNIQUE constraint.
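Put together, that might look like this (hypothetical names); with the default NULLS DISTINCT behaviour, any number of users can have a NULL subscription_id while the non-NULL values stay unique:
create table subscriptions (subscription_id int primary key);
create table users (
    user_id         int primary key,
    subscription_id int unique references subscriptions  -- nullable: many users may have none
);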
The "NULL is empty" vs "NULL is unknown" is a series of trade-offs of labor-saving. Imho, the wrong trade-offs were made, but once the choice is made it makes sense to continue and be consistent with it. I'd rather be consistently wrong than inconsistently right.