I am the author of this feature. The background here is that the SQL standard was ambiguous about which of the two ways an implementation should behave. So in the upcoming SQL:202x, this was addressed by making the behavior implementation-defined and adding this NULLS [NOT] DISTINCT option to pick the other behavior.
Personally, I think that the existing PostgreSQL behavior (NULLS DISTINCT) is the "right" one, and the other option was mainly intended for compatibility with other SQL implementations. But I'm glad that people are also finding other uses for it.
> Personally, I think that the existing PostgreSQL behavior (NULLS DISTINCT) is the "right" one
SQL is old enough and this debate so unsettled still that I think it should be clear there isn't a categorically right behavior here anymore than there is a clear winner between tabs and spaces.
As an example of the irreconcilable weirdness of NULL, consider that "NULL = NULL" is false, and so is "NULL != NULL", while rows with NULL still group together in GROUP BY.
The document you link uses unknown as a synonym for null. If you inserted the result of such a comparison into a table, the value inserted would be NULL.
I'm self-taught, could you show how to insert the result of such a comparison into a table in mssql? I can only think of using CASE or similar, which would mean no direct conversion.
True, I'll blame vacation mode for not catching that one.
However, that doesn't change things, I can't get any of these to parse:
insert into #nulltest(a) values (1 <> 0);
insert into #nulltest(a) values (null <> null);
insert into #nulltest(a) values (true and true);
insert into #nulltest(a) values (null and null);
So my question remains, how do you insert an UNKNOWN?
The point of my asking is that, contrary to what was suggested, my belief was that UNKNOWN is not the same as NULL and, crucially, does not evaluate as NULL.
This is a misunderstanding of Hoare's “billion dollar mistake”. The mistake was not the existence of nulls; the mistake was type systems where every reference could be null.
SQL absolutely does have that problem. You have no way of enforcing or checking that an expression is supposed to be non-NULL; at best you get an error at runtime when you try to insert it into a given column, which is usually far too late.
No, it's a specific problem of SQL that you can apply constraints to stored or quasistored columns but not to expressions. In a real typed language every expression has a type.
This is a good way to remember the SQL behavior. Though I do hope it'll change in the future. So many other languages interpret null as a comparable value, and it will be less surprising to folks coming from those.
I personally don't have issues with languages that have incomparable values, just with languages that have a generic equality and incomparable values.
If every type has its own equality operation (e.g. Number.eq) then I have no issue with it being weird in unexpected types (e.g. Number.eq(NULL, NULL) = false).
For example in JavaScript almost every use of === should be replaced by Object.is.
I agree; that's all fine and well, but it turns the language's syntax into dogshit. A babel plugin (transpiling) would be ideal.
On the same note, the original intent was for NaN to be insidious and propagate into every derived calculation or comparison, thus making it easy to tell if a mathematical function had failed, without having to add conditionals at every step within it. Which makes a lot of sense. However, not being able to know exactly where the NaN was produced is kind of shit. Now, if NaN had a stack trace property like Error (exception) objects...it would make this behavior much more useful. Makes me wonder if the prototype of NaN can be extended, hm.
NaNs cannot be extended in any way, in many interpreters NaN is used to "hide" pointers in 64 bit floats using the 52 free bits in NaN values.
IMO NaN should never have existed, like signed zeros (I am more agnostic about [signed] Infinity). I think they were implemented in IEEE 754 because the encoding had space left for them.
I'm not an expert on this in any way; this is literally a personal opinion.
> This isn't quite true - these comparisons don't evaluate to false, they evaluate to NULL.
Which large parts of SQL treat as equivalent to false.
WITH vars AS (
  SELECT CAST(null AS INT) AS X
)
SELECT 'yes this is a tautology' AS "is this a tautology"
FROM vars
WHERE X = X
The above query returns nothing, not an error because the value of X is unknowable. In a sane language, X=X and so it would yield "yes this is a tautology". SQL standard wants it both ways - it wants to say "boolean null isn't false" except when for its own convenience it decides that yes, null is false.
The truth, imho, is that SQL has an archaic type system. In this day and age it should be offering algebraic types where we can implement our own "empty" values and the like that may not use three-value logic as appropriate, so that I can say that for example End Date is blank instead of either unknown or Date.MaxValue.
I think that this topic gets tripped up with people wanting to assume boolean means two possible states... NULL adds an extra state to every type much like the "bottom" type/value in some functional programming languages or type systems.
It's not the `=` operator that is special or significant here. It is the `WHERE` clause semantics, which are defined as returning rows when the boolean expression is true. This treats both false and null cases as the other possible outcome, when rows are not returned. Not understanding this distinction is also where people write buggy queries using WHERE NOT ..., because like equality the NOT operator will return NULL for a NULL input (so NULL is not like false and does not negate to true).
SELECT 'hidden by null result' WHERE CAST(NULL AS boolean);
SELECT 'also hidden by null result' WHERE NOT CAST(NULL AS boolean);
SELECT 'shown by true result' WHERE CAST(NULL AS boolean) IS NULL;
SELECT 'hidden by false result' WHERE NOT (CAST(NULL AS boolean) IS NULL);
The existing GROUP BY behavior is different than WHERE because it does not have a special rule for handling true values. It just returns a group for each distinct value.
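To make the GROUP BY point concrete, here is a minimal sketch (PostgreSQL syntax, throwaway inline VALUES) showing every NULL landing in a single group even though NULL = NULL is not true:

-- returns two rows: (1, 1) and (NULL, 2)
SELECT x, count(*)
FROM (VALUES (1), (NULL), (NULL)) AS t(x)
GROUP BY x;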
The prior (new default) UNIQUE constraint behavior made sense when you think of NULL as this bottom value and meaning "unknown", while also thinking that a uniqueness constraint is only applied to complete keys. The check is effectively deferred, and the partial or incomplete tuple might become valid and unique once its missing part(s) are filled in with real values. The new optional behavior makes more sense if you think that the uniqueness constraint enforces distinct records even with partial keys, just as the GROUP BY clause will determine distinct groups.
There is also a similar issue with FOREIGN KEY constraints. The default MATCH SIMPLE behavior is consistent with the default UNIQUE constraint behavior. An incomplete foreign key is essentially ignored and not subject to integrity checks. The MATCH FULL behavior is almost a style checker to say that a compound foreign key should be all null or all non-null, while still ignoring incomplete foreign keys. As far as I know, the MATCH PARTIAL case is not implemented, but I think it might complement this new UNIQUE NULLS NOT DISTINCT behavior?
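As a rough illustration of the MATCH behaviors described above (hypothetical parent/child tables; PostgreSQL rejects MATCH PARTIAL at constraint creation):

CREATE TABLE parent (a int, b int, PRIMARY KEY (a, b));

CREATE TABLE child (
  a int,
  b int,
  -- MATCH SIMPLE (the default) skips the check if any referencing column is NULL;
  -- MATCH FULL additionally requires the referencing columns to be all NULL or all non-NULL
  FOREIGN KEY (a, b) REFERENCES parent (a, b) MATCH FULL
);

INSERT INTO child VALUES (NULL, NULL); -- allowed under both SIMPLE and FULL
INSERT INTO child VALUES (1, NULL);    -- allowed (and unchecked) under SIMPLE, rejected under FULL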
One can imagine a further branching point in behaviors for foreign keys and I'm not sure how MATCH PARTIAL will be defined. One mode might be easily implementable with the new index method, if we want partial foreign keys to be able to reference partial keys in the referred table, i.e. the partial key with NULLs is considered an actual row key. But we might also imagine another partial matching mode where the non-null components of a compound foreign key have to all exist in at least one key in the referred table, while ignoring the null fields as incompletely formed references. This would be an improvement on MATCH SIMPLE which allows the non-null columns in a foreign key to hold values which could not possibly match the referring table no matter which values are filled in for the remaining null fields of the foreign key. However, this enforcement would not be easily implemented without a powerset of indexes for all partial combinations allowed in the referencing tables!
> anymore than there is a clear winner between tabs and spaces.
But there is a clear winner here. Spaces won: almost all code files use spaces. Of course everyone uses the tab key on their keyboard, and the IDE makes spaces for you; no one is going to manually press space 4 times.
There is no winner, both tabs and spaces are widely used. However with the proliferation of better tools the question thankfully became mostly irrelevant. I don't care if it is spaces or tabs, as long as it is consistent across the codebase (which CI enforces with autoformatters and linters) and as long as my editor (which obeys the project's editorconfig) uses the same convention.
MSSQL considers nulls to be equal, everyone else considers them to be distinct (aside from a few DBs which just refuse to create UNIQUE indexes on nullable columns), and Oracle is apparently huffing glue in a corner.
> As an example of the irreconciliable weirdness of NULL, consider that "NULL = NULL" is false, and so is "NULL != NULL", while rows with NULL still group together in GROUP BY.
Group by is the exception here. And arguably it is so because the alternative is never useful, or close enough as to not be worth it.
So of the three major SQL implementations, you have one of them following the results of the settled debate, one of them being an exception and one of them huffing glue (whatever that means). In addition, there's a language exception even in the settled group, which you have to handwave away.
And Oracle considers null to be equal to the empty string. My understanding is Oracle was created before the SQL standard existed and they don't want to change because of backwards compatibility. I don't understand why they can't add a db parameter to allow you to set how nulls are handled.
IDK if you ever saw the HN thread where they talked about oracle and change tests from a few years back, but if I had to guess, the pain just isn't worth the overall cost.
Overall I remember running into this but finding it fairly simple to add a couple specific checks around the condition where it came up in our code.
They would need to change their storage format to be able to distinguish between NULL and empty string, and probably a lot of in-memory structures and even more code logic as well. It would be hugely expensive to support both.
Oracle introduced VARCHAR2 in the 90s(?) to mean “Oracle semantics” as opposed to VARCHAR which is documented as “may change to standard SQL semantics in a future version”. However I don’t think they’ll ever materialize that distinction.
A slight tangent, but do you have any pointers on how one might get involved in either:
- The SQL standardisation process (or whether this is even feasible as someone who isn't involved in the development of a major database engine)
- Postgres development
The feature I am particularly keen to get accepted is trailing commas in SELECT column lists (potentially other places too, but SELECT lists would be a very good start). And there are potentially a bunch of other improvements in the syntax sugar department that I might be interested in spearheading if I could successfully get this accepted (e.g. being able to refer to select list column aliases in GROUP BY clauses).
I don't have much experience with C, but I would potentially be up for actually implementing the change if I could get buy in from the maintainers.
> The feature I am particularly keen to get accepted is trailing commas in SELECT column lists
Not "GROUP BY THE OBVIOUS LIST OF EXPRESSIONS, YOU KNOW, THE ONES FROM THE SELECT CLAUSE THAT DON'T CONTAIN AGGREGATE FUNCTIONS"? Aka "GROUP BY ALL" from DuckDB.
Postgres project is always looking for contributors. The project has a very nice, detailed, well-defined documentation on how to contribute [1]. The community is very welcoming, and has a process (CommitFest[2]) in place to ensure all submissions get their due attention.
FWIW, I am starting to make an effort (e.g. [3]) towards helping the newcomers get their work in shape to be acceptable by the committers.
Note: I'm biased towards Postgres community, since I've worked with them for many years. So others' opinion _may_ differ, but it's highly unlikely.
> Personally, I think that the existing PostgreSQL behavior (NULLS DISTINCT) is the "right" one, and the other option was mainly intended for compatibility with other SQL implementations.
Care to explain why you think NULLS DISTINCT is the "right" default behavior? What problems does it solve to warrant additional complexity by default?
I think it's worth pointing out what the opposite means - to say that "this field might have a value, or it might be null - but only one tuple/row can set this field to be null" implies that you have given the null option some kind of real-world value.
That is to say, you are using null to encode some kind of meaning, when really this is not what null is supposed to be used for.
That's not to say I think we should be morally disapproving of people who do that - I use things for their unintended purpose all the time, and it bugs me when people get on a high horse. Use what makes sense to you - and I love this change for exactly that reason.
But the general theoretical approach is that if you want to care about the value in a field, you need to give it a value. Null is for the valueless, and if that isn't an allowed state, you should simply set the field as not-nullable. Theory is fine in theory. For a practical database that exists in the real world, this option is a good addition.
Fauna has an interesting approach to null that I've grown to like:
Null is a real value that can be compared intuitively like any other value (`Equals(null, null)` returns true). [1]
However, in indexes, any term that evaluates to null is simply not stored in the index. So we can create a unique index with multiple null values because they simply won't exist in the index, so won't violate the unique constraint.
If we do care about null values in a particular index, we can handle that by mapping the null value into another value that will get stored in indexes using a "binding", and then use that index to query for or constrain uniqueness by that mapped value [2].
This is not the most convenient thing in the world, but at the end of the day, it feels like an edge case, so I'm happy with having to jump through some hoops to handle it in exchange for making the much more common cases (null comparisons and unique indexes with nullable values) more intuitive and less error prone.
I think a lot of the trouble stems from the fact that databases do not have sum types and there is no way to encode a "None" type without hacks. NULL is the only reasonable option in a lot of cases.
Yeah, in hindsight the world would be a slightly better place if SQL didn't include NULL, but instead "UNKNOWN" and "NONE", with "NONE = NONE" being true, and "UNKNOWN = UNKNOWN" being unknown.
Or have an Option (or Maybe) type. This is how modern programming languages solve the nil issue, I don't see a reason why a database couldn't take the same approach.
I'm not the person you're responding to, but also think NULLS DISTINCT makes sense in many cases. NULLs often represent missing data. Imagine storing a customer's address and the street name is NULL. If several customers have a street=null, it doesn't mean that they have the same street. So from a data perspective, it makes sense to treat these unknown values as distinct.
For filtering and aggregation I still welcome the change, as it makes sense in many cases to treat them as not distinct.
Thanks. It seems that the issue arises from value equality vs optional value state equality.
To me, a more natural way to treat NULLs is to think of NULL not as a value, but as a state. Several customers with a street=null all have street property in an equal state. However, an equal state doesn't mean the value is also equal, as there is no value in that state. Option type in functional languages models this perfectly:
type 'a option =
    | None        // no value here
    | Some of 'a  // value of type 'a
So, when checking whether customers have the same street, we need to check the value. So comparison is only valid if street has a value defined. Adding a null check in addition to the equality check to ensure that the actual values are equal feels most natural to me (as this is how it's done in most languages).
Unique constraints should not be a problem. Unique constraints are about values. So if state is NULL there is no value and so constraint does not treat NULL states as equal values. Adding a keyword to change constraint behavior when needed, would be best IMO, as it would need to be rarely used.
> So, when checking whether customers have the same street, we need to check the value. So comparison is only valid if street has a value defined. Adding a null check in addition to the equality check to ensure that the actual values are equal feels most natural to me (as this is how it's done in most languages).
Is it tho?
Languages with ubiquitous nullability will accept both being null or both being non-null and the same value. Languages with option types do the same (as long as the value is equatable obviously), this includes OCaml: https://v2.ocaml.org/api/Option.html#preds.
The only languages I can think of which seriously diverge would be C and C++ where this is UB.
Maybe if you way stretch it Java because it doesn't have operator overloading, but even there there's Object.equals which is null-safe... and returns `true` if both parameters are `null`.
I completely agree. I tried to explain basically the same thing as you, but it seems there is some ambiguity in my example I didn't intend.
Comparing two nullable/option values when both are in a "null state" will return true in most languages. Which, to me, is the most logical result as they have the same state. So, if you want to check if two people actually live on the same street, it's not enough to check p1.street == p2.street, but you need to also check that at least one of the streets is not null. Two nulls would return true, but if equality returns true and we know that one of the streets isn't null, we know we are comparing actual street values (not null states). Sadly, in SQL, NULL=NULL result is not true, but NULL.
Not OP, but this constraint means "the data must either be empty or unique", which is an extremely common constraint. In contrast, I've never encountered "only 1 entry is allowed not to have the data, all others must be unique".
"only 1 entry is allowed not to have the data, all others must be unique" makes sense in larger schemas as the database grows. It replaces is_default_address, is_primary_child etc. fields and constraints on relational tables. In the perfect world all the data would be normalized not to have such columns and cases, but in real life it just grows that way. So for some cases NULLS NOT DISTINCT will be a welcome addition.
Not parent commenter, but to me it seems "obviously correct".
As explained in the article NULL means "unknown". Let's say I have two people, and a "tax identifier" column. Let's say two people can't have the same tax id. Obviously "don't have one" or "unknown" fit very well in NULL, and two people can both be missing a tax id, even if those who have it need it to be unique.
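In DDL terms the default expresses that directly; a small sketch with a hypothetical person table:

CREATE TABLE person (
  id     bigint PRIMARY KEY,
  tax_id text UNIQUE  -- default NULLS DISTINCT: any number of people may have no tax id
);

INSERT INTO person VALUES (1, NULL), (2, NULL);    -- fine
INSERT INTO person VALUES (3, '123'), (4, '123');  -- violates the unique constraint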
Except that's the ambiguity of NULL in the standard. It doesn't just mean "unknown". It can also mean "no value". And you can clearly compare two elements with no value.
Yeah that's true. All the database can say is "I don't have a value here". Not "there exists no value for this column".
But this doesn't mean that two NULLs are the same. Two people with NULL tax IDs are not the same (one turns out to be unknown but existing, the other is a toddler without a tax ID). Ten people with NULL address don't live in the same house. Half declined to give address, the other half are homeless. Either way we can't send mail to them.
If you use the database to have a UNIQUE constraint of "only one purchase per household" then it makes no sense to allow NULL addresses, but only allow the first NULL address purchaser to buy the thing.
Or "sorry, in this hotel we only allow one guest at a time without a car, and someone else already doesn't have a car".
Does that guest without a car actually have a car? I don't think that's something that the database can solve.
Should databases have two separate values for unknown or no value? That sounds like a world of hurt, with two types of null.
> Should databases have two separate values for unknown or no value? That sounds like a world of hurt, with two types of null.
It’s also a non-problem in practice, because if an application needs to distinguish between multiple types of nulls, it can very easily just use an extra column holding the information needed to disambiguate.
Similarly, the need for non-boolean columns is a non-problem in practice, because an application can just make an extra column for every bit it needs to store. If it needs strings or arbitrary-width numbers it can just join on other tables that store those bits.
(Existence of a workaround doesn't turn a problem into a non-problem)
I stand by it being a non-problem in practice because the workaround is trivial, and the supposed solution adds complexity everywhere.
There isn’t even a universally agreed methodology for how to handle nulls in the relational model in theory.
It’s a semantic problem that can only be solved with reference to the requirements of a particular use case, and when you do that using explicit extra columns, the problem evaporates.
Which happens in the world of JavaScript. It has null and undefined. A null variable is a variable / property that's defined and set to null, while undefined is a variable / property that's not defined yet (or maybe not initialized, I forget).
I hope SQL will at least give another comparison operation, such that the current equal sign (=) means a.prop is not null && b.prop is not null && a.prop equals b.prop (for the "not defined yet" case).
Let's say another sign like a.prop *= b.prop, meaning (a.prop is null && b.prop is null) || a.prop = b.prop (for when our intention is that both values are empty).
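For what it's worth, standard SQL (and Postgres) can already spell that second, null-equal comparison as IS NOT DISTINCT FROM; a rough sketch with hypothetical tables a and b:

-- true when both sides are NULL, or when both are non-NULL and equal
SELECT *
FROM a
JOIN b ON a.prop IS NOT DISTINCT FROM b.prop;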
> It doesn't just mean "unknown". It can also mean "no value".
Isn't the canonical representation of known-no-value an absence of a tuple? Like as opposed to saying "There exists an employee X who works in department NULL", you simply don't make any claim about employee X working in a department? After all, when enumerating the members of a set, you're also omitting the enumeration of non-members of a set, and the law of excluded middle applies.
It's probably a good default because it's more consistent with how NULL equality is handled in SQL generally. From the article:
> In Postgres 14 and prior, unique constraints treated NULL values as not equal to other NULL values.
...
> This is consistent with the SQL Standard handling of NULL in general, where NULL is unknown. It is impossible to determine if one unknown is equal to another unknown. Because NULL values are of unknown equality to one another, they do not violate UNIQUE constraints.
Your last statement would be better written as:
Because NULL values are of unknown equality, they may or may not violate UNIQUE constraints.
The safest implementation would assume that they do violate the constraint, rather than the current behavior assuming they don’t.
In practice I don’t think I’ve ever added a unique index on a nullable column where null might imply unknown. I have used it in cases where null meant “none”.
> Care to explain why you think NULLS DISTINCT is the "right" default behavior? What problems does it solve to warrant additional complexity by default?
It's the most consistent with the equality behavior of null values elsewhere.
I generally agree that, in most cases, you want the NULLS DISTINCT behavior. But thank you for providing such a developer-friendly feature that allows flexibility here!
NULL in SQL is not terribly consistent overall. Sometimes NULL is treated like "unknown" and sometimes more like "n/a". And the interaction with non-scalar types (like records) is pretty strange. Also, it's common for ORMs to map the app language's NULL (or None/Nil/whatever) to SQL NULL, which adds its own nuance. So I can see this being a useful feature.
Well, NULL is most often meant to signify an unknown value. You can't compare one unknown value with another and say they are the same.
The standard is, I believe, ambiguous about NULL because it can also mean an absence of data. In other words, it is "this is not yet determined" (e.g. a CAR table has a COLOR column, but the paint is applied as the final operation, so NULL could be used because the car is not at this stage yet and the color will be chosen later). In this case, you can compare NULLs against other NULLs, because you can have two cars that are in an indeterminate state. In that case, it is useful to compare NULL values as they mean the same thing.
An example where NULLS NOT DISTINCT might make sense in a constraint is when a table is used to store an analytical cube roll-up (e.g. the output of GROUP BY ROLLUP), where NULL in a dimension column has special meaning (it indicates subtotals). Having multiple rows with the same NULL key could, in some of those cases, be an error leading to double-counting.
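A rough sketch of that situation (hypothetical sales tables, PG 15 syntax), where the constraint rejects a second subtotal row for the same key:

CREATE TABLE sales_rollup (
  region  text,     -- NULL here means "subtotal over all regions"
  product text,     -- NULL here means "subtotal over all products"
  total   numeric NOT NULL,
  UNIQUE NULLS NOT DISTINCT (region, product)
);

INSERT INTO sales_rollup
SELECT region, product, sum(amount)
FROM sales
GROUP BY ROLLUP (region, product);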
Once you've accepted that X=X doesn't return TRUE when X=NULL, NULLS DISTINCT is consistent with that. ANSI NULL three value boolean algebra is insane, but it's insane in a pretty consistent way.
So, we have a "standard" which says that both behaviors should be supported, and there should be two different ways to express them, but explicitly declines to say which does what?
I'm genuinely not sure what purpose this particular standard serves, but it looks like it's not allowing people to write portable code...
I can't imagine the standard does not specify that NULLS DISTINCT should treat nulls as distinct and NULLS NOT DISTINCT should treat nulls as not distinct.
Ah. So there are three ways to write it, and explicit ways to call for either behavior. That's sane -- just wasn't what I got from the original writeup.
Playing around in a docker image of the beta build, it looks like this allows you to add a `unique nulls not distinct` constraint to a composite set of fields, but still does not allow you to specify those same fields as a primary key. For example
-- This works
alter table announcement_destinations
add constraint announcement_destinations_pk unique nulls not distinct (sender, receiving_group, optional_filter);
but
-- Still does not work
alter table announcement_destinations
add constraint announcement_destinations_pk2 primary key (sender, receiving_group, optional_filter);
fails with a message like "ERROR: column "optional_filter" of relation "announcement_destinations" contains null values". Is there a motivation for this distinction?
Currently nulls are considered not distinct in postgres, no? Doesn't this mean anyone upgrading will have to fix up all their table definitions? Or am I wrong about this?
I just care about the backwards compat. To be clear, the second I saw this I thought "finally!!!"
This is great! I have multiple places in our code where I've had to define multiple unique indexes (with conditionals) to get around this. It's definitely confusing behaviour the first time you encounter it.
Honestly I think SQL already breaks its own standard of "NULL = unknown". One example: if you do a LEFT OUTER JOIN and the joined data is missing, you get NULL in the result set. But in that case, it's not "unknown" whether there are records that can be joined to; the answer is that there are definitively no records there.
NULL is generally considered by most programming languages to mean "not present" rather than "unknown", and in many cases it is treated that way by SQL in practice. The ship has probably sailed on actually changing the default behaviour, but optional flags like these are a good step.
NULL means "I don't have this data". It doesn't guarantee that the data doesn't exist anywhere.
It's semantically ambiguous, yes.
But it's not ambiguous that the database doesn't have the data. It's very clearly NULL.
If "middle name" is NULL, then the database itself can't tell you if a person doesn't have a middle name, or if it's merely not yet been told the middle name. You could differentiate the two by saying empty string for "has no middle name"… except on Oracle VARCHAR2, where NULL is equal to empty string.
> You could differentiate the two by saying empty string for "has no middle name"… except on Oracle VARCHAR2, where NULL is equal to empty string.
This is what kills me.
This seems like an obvious case where multiple flavors of NULL would make perfect sense. For example Ticket.CancellationDate isn't "unknown", it's this ticket is not cancelled. But SQL standard doesn't have a way to express "empty value" as distinct from "unknown value", so we're stuck working around the strange implications of using this square peg to fill a round hole.
Oracle's bizarre "empty VARCHAR2s are NULL" is strange but consistent with every other datatype, none of which offers a way to say "empty" rather than "NULL". 0 is not the same as NULL int and is not the same as empty int. Mindate is not the same as NULL date and is not the same as empty date.
Well, for the string case the empty string is… the empty string.
When storing the exit code and stdout of a program run, Oracle lets the state "it's not finished yet" be represented by a NULL exit code, but you cannot say the same for the string.
"Unknown" is not a perfect map to NULL. "I have no answer for you" is better.
Maybe NULL values could use an annotation.
But I think that would make things worse. Like let's say you have some query that joins addresses with coordinates. You know that semantically any address without a coordinate has an "unknown" coordinate, not that it authoritatively doesn't have coordinates.
But not only do you need to change the data model then, but the query language too, to be able to express the difference where absence means one or the other.
Now absence is NULL, and it's left up to the application to interpret.
SELECT a.address, b.coord
FROM a
LEFT OUTER JOIN (SELECT address, coord FROM coords WHERE is_test = FALSE) AS b
  ON a.address = b.address;
Or maybe some coordinates are missing because their locations no longer exist. The house has been demolished, so it's not "unknown" where the house was, it's just that it no longer has a location. And how would you join that data.
I don't think I've encountered a problem in practice with SQL in differentiating empty from unknown. It's out of scope of what SQL tries to solve, and that's fine.
That's fair - as I said in another comment, it breaks down with keys. I can't think of a great way to model "this row definitively does not have whatever is referenced in another table". Using anything other than NULL for that means you can't use proper foreign keys.
NULL means that the value is unknown, not that the data is unknown. You indeed have no data but its value may exist. You just don’t know it yet. NULL is very powerful when you understand that.
That seems like a fairly big assumption on how I'm modelling the data though. Particularly in the LEFT JOIN example, I'm not sure if I agree with the idea of "a row not being present means that it's unknown whether it exists in the real world" depending on the underlying content being modelled.
For a string column, you do have blanks vs. NULLs, so it's fairly easy to distinguish between "John definitely doesn't have a middle name" vs "It's unknown whether John has a middle name", but that doesn't hold true with most other datatypes.
There are two "blank" integers: 0 is the identity element under addition, 1 is the identity element under multiplication. Or worded differently 0 is a blank addend, 1 is a blank multiplier.
‘’ would be the identity element for string concatenation, or any other string manipulation method I can think of, so it has a stronger claim on being blank than a 0 has.
Unless your string is later parsed as something else. A blank JSON value would look differently from a blank phone number.
It depends on your column. If you have a products table and a certain handmade product has no UPC, it's correct to use '' as its value rather than NULL because you know for a fact that it's non-existent. Some people respect NULL too much.
It is mostly true but not always, eg in the output of GROUP BY ROLLUP, a NULL value is used to indicate rows representing subtotals / partial aggregates
This is amazing. We moved from Oracle to Postgres after being on Oracle for 15+ years. That was a huge move that took 2 years. One of the biggest hurdles was rewriting thousands of sql queries. One of the biggest things we had to change was accounting for how oracle handled nulls versus how Postgres handled nulls. However, we missed this nuance about how Postgres treated nulls and unique. It wasn’t until about 2 years into using Postgres, and running into so many issues, that we finally stumbled upon this nuance in the documentation. Crap. That required making a bunch of more changes to account for this. At this point I’m firmly convinced that null handling is the hardest part of database work ;-)
Since we have some PG devs here: Can we please have a way to reorder columns?
Coming from MySQL, this is one of the first missing features one hits. It's like moving to a new house where you're told you can never clean or move the furniture without moving to a new house.
It leaves the unfair impression that this is a "toy" db. "What other basic features is it missing if it doesn't have this?" Please don't think of it as trivial; first impressions are very important.
Better to channel your displeasure into advocating for a future SQL standard to include this feature. Other DBs that support this do so via proprietary syntax extensions.
Technically it doesn't require syntactic extensions; postgres stores the column index as the `attnum` column of the `pg_catalog.pg_attribute` system table.
So this could be made to work by hooking into the storage system and rewriting the table and all pointers to the table when `attnum` is updated.
Without this rewriting this only works if the table was just created and has no data, and no external metadata referencing the columns themselves e.g. views, fks, indexes, defaults, rules, …
An alternative would be to add automatic packing (à la rustc) to postgres, decorrelating the “table position” and the “physical position” of the columns; this would also allow free “table position” reordering.
And while it’s by far the most complex option, one of the nice bits with it would be that the system columns could be packed as well. Currently there’s quite a bit of waste because there are 6 system columns (by default), the first 5 are 4 bytes, but the 6th (ctid) is 6 bytes, meaning 2 bytes of padding if your first column is a SERIAL, and 6 if the first column is a BIGSERIAL (or an other double-aligned column).
> And while it’s by far the most complex option, one of the nice bits with it would be that the system columns could be packed as well. Currently there’s quite a bit of waste because there are 6 system columns (by default), the first 5 are 4 bytes, but the 6th (ctid) is 6 bytes, meaning 2 bytes of padding if your first column is a SERIAL, and 6 if the first column is a BIGSERIAL (or an other double-aligned column).
FWIW, system columns aren't stored as normal columns in tuples. Some of them are implied (e.g. tableoid doesn't need to be stored in each tuple, ctid is inferred from position), others are not stored in the way normal columns are stored (e.g. xmin, xmax).
Couldn't an ordering column be added to pg_attribute that determines in which way the columns are sorted when they are displayed? Standard SQL code can then be used to manipulate the display order of the columns.
There are like 5 different feature wants I could add like this. A columnar database engine comes to mind (slightly more work involved), and poaching a few SQLite QOL features.
Ultimately Postgres is a community driven product, and people want to work on what they want to work on.
I don't get angry that some kind soul hasn't volunteered their Saturday for free for me.
That's not exactly true. You can write your own patch for a feature and there's a good chance they'll decide they don't want it. It's their prerogative, but also slightly upsetting when you know they're blocking features you want. Just observe the bike shedding over a trailing commas patch [1].
There are lots of good reasons in that thread why the patch was declined.
The patch made things more lenient which makes things less compatible (this whole thread is about ANSI SQL standards and they do matter) and changes what was an error behavior to now silently succeed.
I read the thread and have to disagree with "lots of good reasons". I'm pretty sure there are plenty of things in the postgres dialect that are not cross compatible with any other dialect.
I agree with curating patches in general, but it's a double edged sword and inaccurate to say you can just spend your weekend to get something you care about.
Oh, absolutely; I'm just trying to raise the awareness/importance of this missing feature. Most devs that I know IRL that looked at PG had a very similar reaction to mine.
I'm receptive to the feature. But wouldn't it require locking the full table as it works? Seems like it would hurt availability of the biggest databases that would benefit most. Perhaps this is why it is thought somewhat impractical and therefore low-priority.
Depends on how the PG team decides to implement this feature. If they're OK with a virtual order (for "UI" purposes) while keeping the true order hidden then this would not require table locking (or it would be a very quick lock). If the order reflects the real / physical positioning then yes, it would probably require locking for a full table rebuild.
I like to have my `*_id` columns at the front, and `created_at` etc. at the back. And in the middle all the fields by descending importance. Just a personal habit so I can quickly lookup data in a table.
There is some small benefit to playing column tetris if you have columns with different sizes that waste space due to padding. I'm not convinced this would be worth the complexity of this feature, but in some cases reordering columns might have measurable benefits in reducing the size of the data.
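A toy example of that effect, assuming PostgreSQL's usual alignment rules (boolean is 1 byte, int 4, bigint 8 and 8-byte aligned); the exact savings depend on the platform:

-- roughly 24 bytes of data per row: 1 + 7 padding + 8 + 4 (+ trailing padding)
CREATE TABLE padded (flag boolean, big bigint, small int);

-- roughly 16 bytes: 8 + 4 + 1 (+ trailing padding), same columns reordered
CREATE TABLE packed (big bigint, small int, flag boolean);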
I'm not talking about the physical layout of the data, just a thin UI layer that the DB tool could use. Maybe we could have two modes: physical vs UI ordering?
Views are not a good solution for this. The point is to be able to see a specific order in 3rd party apps and when writing quick ad-hoc queries (select *). Writing views for every table would just pollute the db.
Based on my understanding of DB storage (which is decades old and as I write the comment my explanation seems stupid but it’s what I was taught. Please someone correct me).
Imagine a table with the following:
id (int), title (varchar), created_date (date)
That data is stored in a similar order on disk (is this still true?).
So, if the initial insert has a title of 10 characters, then the date will get placed after those 10 characters on disk:
Int, 10 characters, date
Later, the row is updated with a longer title of 500 characters.
Because it’s longer, you now have to get more space for the title on another part of the disk.
Since there are columns after the title, it’ll either leave dead space where the old 10 character title was or it’ll need to move those other columns too.
I can’t remember which I was taught. I just remember that columns after variable columns get impacted when the data in a preceding variable column changes.
If the title were at the end of the row, the db could expand the data without needing to move other columns over (if there happened to be available space right there).
If you were updating with a shorter value, it could shorten the row by moving the terminating character to the new shorter location, freeing up the space the longer title was using.
Bottom line is you want to keep variable columns which get updated most frequently towards the end of your table structure.
Knowing which column that would be isn’t always clear when creating the table.
Again, this understanding was taught to me over 20 years ago. I’m probably remembering parts of it wrong and DB storage has likely (hopefully) advanced since then.
> That data is stored in a similar order on disk (is this still true?).
There should be no requirement for this. Columns in relations are not conceptually ordered, so it shouldn't matter for the things you're doing with the data anyway, and the database should be able to reorder the data in whatever way it likes, since desire to isolate the user from physical data structures was one of the main reasons for the rise of RDBMS.
1. That’s not exactly true, AFAIK all RDBMS return columns in table order when order is unspecified (`*`), and while they could reorder on retrieval
2. postgres definitely does not, and column tetris is absolutely a thing in the same way struct packing is (with the additional complexity of variable-size columns)
But the order of attributes in the presentation of the result relation has nothing to do with physical layout of base relvars -- primarily because any relationship between the two is purely coincidental. The vast majority of useful queries will not reuse the order of attributes in base relvars, so optimizing for the unusual trivial case by prohibiting better rearrangements that could be useful for a much larger number of use cases seems rather pointless.
And what PostgreSQL does is of course an implementation detail of PostgreSQL.
Why would you be trying to prevent the DB from reordering the data? You're not supposed to have better knowledge of what's good for your use case than an RDBMS that can collect usage statistics on queries and such. Ditto for compilers rearranging structures and such. When you start having hundreds of tables and thousands of queries, I don't see how you can do a better job than an automated system at that point.
DB storage is a lot more sophisticated than what you were taught.
Most databases use Slotted Pages to organize storage. Pages are fixed size and numbered by their offset within the database file. The page header contains the number of rows, followed by an array of offsets for individual rows within the page. Rows themselves generally are stored at the end of the page filling downwards. The storage engine can move around rows in arbitrary ways to consolidate free space.
Fundamentally there's no connection between SQL schema order and how table storage is organized on disk. For example in a column store there's often no contiguous row stored anywhere, instead there's just separate indexes per column.
Soothing my OCD is important, but it also helps when working in a team and having a shared ordering style. For example, we have certain types of columns that are always at the bottom. One would look there first. With PG you're forced to scan all the columns every time.
Ideally, postgres would play column Tetris behind the scenes and store the columns on disk in the most appropriate way, while allowing the representation to be changed at will.
With SQL Server the management tool does give you a way to do this. Yes, it does a table rebuild behind the scenes. The point is that it's easy. Don't have experience with the other two, but MySQL is the most popular so it kinda sets the tone whether we like it or not.
Well, writing a procedure that rebuilds the complete table in Postgres or Oracle is easy as well. I never needed this, but I am sure, there are some sample implementations out there.
Rebuilding the entire table doesn't seem feasible for large tables to begin with. Especially with a lot of incoming and outgoing foreign keys.
I disagree that MySQL is the most "popular".
It might be the "most used" one because so many web hosting services have included it by default for ages.
Tangential, but there are almost certainly more SQLite databases in existence than every other RDBMS put together, probably by 3 or 4 orders of magnitude.
I've worked with DBs for 20+ years as well. This is a quality of life type of improvement. If you've worked mainly with DBs that don't make this easy it's hard to know what you're missing. Do a search for column reordering for PG and you'll get a ton of hits.
The problem is that we hear a lot of different features touted as "the crucial missing one" -- I would definitely put this into the polish category, but items in that category are also important; especially for those with a MySQL background.
which recently has been picked up again -- that's awesome to hear
I think the main lesson from this is you should use a NOT NULL constraint on just about every column of your database, since the behaviour of NULLs is weird and likely to indicate a problem with your data.
NULL is totally fine, and happens naturally if you make an outer join. But if you are making a UNIQUE constraint on a nullable column, then you may have a problem.
The idea with NULL is that it is not a value, it is the absence of value, like a field you didn't fill in a form. For example, if you ask two people their personal information, and neither specified their email address (email=NULL), you can't assume they have the same email address. And if you put a UNIQUE constraint on that email field, you probably don't mean that only one person is allowed to leave it blank: either you make it mandatory (NOT NULL), or you let everyone leave it blank.
The reason nullable and unique are rarely seen together is that unique is typically for keys, and generally, you don't want rows that have no key value. Also, putting a unique constraint on something that is not a key may not be the best idea. For example, if you don't intend to use email to uniquely identify people, what's the problem with two people having the same email?
NOT NULL really should be the default. There are of course valid reasons for allowing NULLs, but NOT NULL is much easier to handle for most cases. I don't think nullable columns indicate a problem with the data, but nullable columns that don't have a good justifications are certainly trouble.
The fun bit is this mistake gets replicated in every definition language, sometimes made even worse (e.g. openapi where fields can be omitted by default, and you can additionally make them nullable).
Here, "None" means "no value specified", and "Some(None)" means "a value is specified, and that value is null." And "Some(Some(my_string))" means the value is "my_string".
There are other ways to represent this in Rust, some of which might be clearer. This representation seems to be used most often if you have a type "MyType", and you want to automatically derive a type "MyTypePatch" using a macro.
Larger enums to represent and discriminate. Null is just syntax sugar for the one-kind of absence case, and multiple-null-likes eliminates all benefits of that syntax sugar, so you might as well revert to a real discriminator.
Ideally you can force the use of the discriminator… but that depends on your type system
The problem with flattening the two absence cases into an enum is composability and generality, because it forces the "outer null" case to intrude into concrete types. Say you have a key value mapping Map<String, T> and you represent changes to this map as Map<String, Option<T>>. One day T itself is Option<Int>, so you end up with Map<String, Option<Option<Int>>>. If you want to use e.g. Option2 for that latter case you lose generality.
Where the "double absence" issue comes up in practice, it's usually in a context where it does make sense to represent and handle the first type of absence separately from the second type.
In cases where the field is required I would agree. But in the context of an update request, {"foo":null} and {} might both be valid with different semantics. "Update this field to null" vs "don't update this field"
That's just plain wrong. You need some value to populate optional fields, and NULL is a pretty decent choice to indicate that a string, date or number has not been filled in. If you try to fix that by having special string or numeric constants, you're just making life harder, especially when you export the data. The poor souls that import it will think "hey, this timestamp says 1-1-1970, so this probably happened at Jan. 1, 1970."
The problem is: NULL is a perfectly valid value in many situations. There are non-boolean fields that can just be empty; my go-to example:
-- Table definition for employee
name, surname, date_entry, date_exit
Everyone has names, and if he's an employee, he probably has an entry date...but until he leaves, what's the exit date?
Other than soothsaying, my choices here are: NULL, some magic value, or an additional field telling me whether the application should evaluate the field.
The latter just makes the code more complicated.
Magic values may not be portable, and lead to a whole range of other problems if applications forget to check for them; I count myself lucky if the result is something as harmless as an auto-email congratulating an employee on his 7000-something birthday.
That leaves me with NULL. Yes, it causes problems. Many problems. But at least it causes predictable problems like apps crashing when some python code tries to shove a NULL into datetime.datetime.
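A minimal sketch of that table (hypothetical types), with NULL carrying the "hasn't left yet" meaning:

CREATE TABLE employee (
  name       text NOT NULL,
  surname    text NOT NULL,
  date_entry date NOT NULL,
  date_exit  date  -- NULL until the employee actually leaves
);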
Be careful with this in Postgres: using a zero, blank space, empty array or similar as a default value instead of NULL can lead to enormous bloat in terms of disk space. There are distinct advantages of NULL that have to be considered.
He most probably means creating a data structure that does not allow NULLs (usually increasing your normal form works) not replacing them by a naive synonym.
Their statement is not “in the general case” though, it’s specifically for postgres.
Postgres uses a fixed-size null bitmap and variable-size rows, so a NULL value takes one bit in the bitmap (and additional nullable columns may require a wider bitmap), but they are skipped in the row itself.
Postgres uses 1 bit per NULL value, though the reality is more complex as everything is padded for alignment. So fewer than 8 nullable columns are free, and above that you pay probably 8 bytes at once (I'm not totally sure on the exact numbers here).
So for Postgres it is generally true that storing NULLs is very cheap in most cases.
I have struggled to implement a tree structure in PG with nullable unique values.
Consider a "docs" table where each doc has a unique name, under a given parent doc. A null parent would be a top-level doc, and top-level docs just have a unique name. This didn't work before, and would hopefully be addressed by PG15.
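With the PG 15 option this looks directly expressible; a sketch of what I mean, assuming a hypothetical docs table:

CREATE TABLE docs (
  id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  parent_id bigint REFERENCES docs (id),       -- NULL = top-level doc
  name      text NOT NULL,
  UNIQUE NULLS NOT DISTINCT (parent_id, name)  -- now top-level names collide too
);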
I'm not sure if null parents really represent a "problem with my data", or if the tree structure was too exotic for PG to support properly.
How I got around it: hardcode a certain root doc ID, then the parent column can be NOT NULL. But this felt janky because the app has to ensure the root node exists before it can interact with top-level docs. Plus there were edge cases introduced with the root doc.
When I store tree-structured data in a relational database I generally add a 'path' column which is a denormalized string containing the names of all the parents with a path separator between.
The biggest reason is it makes finding all descendants very fast with a prefix search (foo/bar/%) on the path column when it's indexed. It's not unusual to want to find or update all descendants because descendants are usually related. If you don't have a path column, then you need to write a recursive CTE, which is Slow, or recurse in your application, which is normally even slower. The reason they're slow is they require a number of seeks which is exponential in the depth of the tree.
It also makes lookup of a node from the path fast, and producing a qualified path to a node fast, but these costs are linear in path length.
Anyway, this path column is also a good place to put your unique constraint.
If you don't want to restrict names from containing the path separator, you can escape application-side. For example, if using '/' as a path separator, consider '::' to escape ':' and ':s' to escape '/' - don't use your path separator in the escape or it'll muck up prefix searches.
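A rough sketch of the shape I mean (hypothetical names; text_pattern_ops makes the extra index usable for LIKE prefix searches when the collation isn't C):

CREATE TABLE docs (
  id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  name text NOT NULL,
  path text NOT NULL UNIQUE  -- e.g. 'foo/bar/baz'
);

CREATE INDEX docs_path_prefix_idx ON docs (path text_pattern_ops);

-- all descendants of foo/bar
SELECT * FROM docs WHERE path LIKE 'foo/bar/%';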
The way I'd do this, is by separating concerns into separate tables. If you have a table with (id, name) and a table with (id, parent_id), any doc with a parent will have a corresponding record in the second table.
Interesting! But where can you implement the UNIQUE constraint for doc names under a given parent? I guess your application code would need to handle that
Thanks for this example! As I was reading the post, I was thinking "cool, and feels like a thing people would expect, but does it have a real world usecase?".
This feels like an actual real-world thing people might want to do where indeed you'd want to have a single NULL value.
What do you use instead of null then? For example, let's take a purchase table that keeps track of when the purchase has been submitted to the supplier; let's call the column sent_to_supplier. What do you use for sent_to_supplier in case the purchase has not been sent to the supplier yet?
It's not a hard and fast rule, but in your example you can solve this in the model without needing NULLs. A table called ORDERS records the orders, SUPPLIERS lists your suppliers, and a table SENT_TO_SUPPLIER links ORDERS to SUPPLIERS, and you don't add a row into that table until the order has been sent.
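Sketching that shape with hypothetical column names:

CREATE TABLE orders    (order_id    bigint PRIMARY KEY);
CREATE TABLE suppliers (supplier_id bigint PRIMARY KEY);

-- a row only exists once the order has actually been sent
CREATE TABLE sent_to_supplier (
  order_id    bigint PRIMARY KEY REFERENCES orders (order_id),
  supplier_id bigint NOT NULL REFERENCES suppliers (supplier_id),
  sent_at     timestamptz NOT NULL
);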
What about this, let's say I have two electricity meters. They have bunch of fields (active power, reactive power, voltage, current, and all that times 3 (one such measurement for each phase)), but one kind of meter can measure distortions, the other can not.
I would make the distortion columns nullable for the practical reason that otherwise you have to duplicate all the work for querying the same properties from the meters (for example a dashboard that shows voltages only). If you want to support both types of meters in the same dashboard you would have to do a UNION type query (and never forget to do that!) if you store the measurements in a different table.
It's possible to handle these kinds of fields (at most one value) with separate tables and foreign keys. It has advantages around 'proving correctness'. But without extra tooling it has significant usability, and perhaps performance disadvantages. I wonder if you can get around some of this using virtual tables.
I come at this with little experience with SQL, and having worked a bit on ampersand[1], a tool where you declare the structure of your data, and some invariants, and the tool will create a schema for your database, and some automatic checking to ensure your invariants are upheld.
Rather than putting the two types of meters in two different tables and doing an error-prone UNION, you would put the distortion measurements in a different table joined to the first, and the meters that don't support it would simply not include entries in that table.
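Something like this, with made-up table names; meters that can't measure distortion simply have no row to join to:

SELECT m.meter_id, m.voltage, d.distortion
FROM measurements m
LEFT JOIN distortion_measurements d ON d.measurement_id = m.measurement_id;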
Boolean capabilities of meter per every entry? Shouldn't impact size that much, easy to index. As a bonus, you can track failure of a sensor per each meter, and know when it first failed.
Wouldn't a meter value that doesn't have distortions just have a value of 0 (that is, not null but just zero)? The properties of a meter in this case belong somewhere else: either in a "meter properties" table if you have many meters with many properties that can change, or just hardcoded.
Alternatively, a key / value table, e.g. `meter_id, property, value`. But that isn't very optimal, works better for things like a shop product with many different properties per product.
It doesn't mean there are no distortions, it just means they are not measured. So I vastly prefer knowing there's no data than some placeholder that "simulates" no data
The submission process might deserve its own table with more detail than sent/not sent. Otherwise it's a straightforward boolean, I don't see any use for NULL here.
Edit: Since it's a date, there are valid use cases for nullable timestamps. But only if there is no additional information attached to the event you're saving in the timestamp. Another complication with nullable timestamps is sorting by them, which you often do with timestamps. With nullable ones this can get messy.
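In PostgreSQL at least, the sorting mess can be tamed with an explicit NULLS FIRST/LAST; a small sketch against the purchase example above:

SELECT *
FROM purchase
ORDER BY sent_to_supplier DESC NULLS LAST;  -- not-yet-sent purchases sort to the end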
You could create a new table that keeps track of the logistics of the product. If the product is sent to supplier, a new row appears in this table with the date. If it hasn't been sent yet, no row is present.
More careful thought about the processes and data that are to be captured.
In a simplistic view maybe NULL is really the correct choice; but is it? Does NULL represent the desired properties?
Another common pattern is the use of some sort of sentinel value, e.g. the largest or smallest possible value for a date field might be used to indicate an unknown maximum or minimum that propagates.
A related pattern might be some sort of orders_suppliers table with a foreign key column; that might be NULL, or it could use a sentinel value pointing at a dummy supplier to indicate a special condition, along with an arbitrary number of in-band subsets which can be their own distinct matches.
These all seem like fine examples of solutions that are worse than the problem they were meant to solve. A separate table will be harder and slower to query and will enable M2M relationships even when you don't want those. Sentinel values must be dealt with at every query to ensure they are always filtered out.
If you care about ‘when’ you probably also care about resubmissions. Or metadata like the supplier’s response. Or who approved the submission etc. A separate table seems like a good solution.
A person with alopecia's hair color isn't "unknown", it's non-existent. They do not have hair.
Would you suggest that a database of appearance information about a person should have a separate subtable for "hair" to properly model this feature?
Either way, in the end, 99% of the time you're going to be displaying the data through a VIEW that LEFT JOINs all these cute normalized tables back together again; then you're going to want to filter those views dynamically in a nice grid the user is viewing, and the fact that X != X is going to bite you in the ass all over again.
Creating more tables is just moving the problem around, not solving it.
SQL uses [0] three-valued logic, and most people seem to reason using two-valued logic. So most people find the behaviour of NULL, the third logical value, weird.
> It's perfectly valid.
:o You'll get jumped on, saying this to a forum of software people. Lots of things that are valid are also weird.
That SQL NULL is valid doesn’t make its behaviour of not being equal to itself any less weird, because it works that way essentially nowhere else, and so it is highly unintuitive to developers who don’t live and breathe SQL.
The closest thing in most languages is NaN, and developers also find NaN weird.
> The closest thing in most languages is NaN, and developers also find NaN weird.
I'd argue the closest thing in most languages is the null propagation operator (usually using a ? symbol) that at least JavaScript and C# have.
YMMV, but personally the unique index behaviour was the main thing I found weird about NULLs. So now that I can turn that off I'm pretty happy with them.
> I'd argue the closest thing in most languages is the null propagation operator (usually using a ? symbol) that at least JavaScript and C# have.
In what sense? Null-coalescing operators don’t change how nulls behave or relate to one another; they only provide convenience syntax for conditioning operations on them, like SQL’s COALESCE.
It is kinda weird, if you don't know how it works. For instance:
Say you have a unique constraint on two columns (column_a, column_b). column_a is not nullable, column_b IS nullable. Obviously these values are unique:
(1, 2)
(1, 3)
(2, 2)
But these values ARE ALSO UNIQUE:
(1, NULL)
(1, NULL)
It is obvious in hindsight (NULL isn't equal to NULL), but can be quite a surprise.
...
Oh wait, postgres 15 deals with just this situation. Huh.
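Concretely, the PostgreSQL 15 syntax that turns that surprise off looks like this (using the column names from the example above):
create table t (
    column_a int not null,
    column_b int,
    unique nulls not distinct (column_a, column_b)
);
insert into t values (1, NULL);
insert into t values (1, NULL);  -- now fails with a unique violation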
Honestly it's what you want way more often than not. A unique index on a nullable column is usually representing some optional identifier. You wouldn't usually want only one row to be able to omit the identifier.
It makes lots of queries (like find-duplicates queries) much more complex without any benefit. It also introduces completely unnecessary three-valued logic into SQL.
With keys it very often means "definitively not present" in practice, and there's no other value that you can use to represent that if you want referential integrity.
In any thread involving SQL NULL, I see a lot of not-quite-right explanations of what SQL NULL is, conceptually. I challenge anyone who feels like they understand NULL conceptually to explain the following query:
-- find orders with a total less than $10000
select order_id, sum(price)
from orders o left join order_lineitems l using (order_id)
group by order_id having sum(price) < 10000;
This query is actually incorrect. Orders with no line items at all clearly total less than $10000, but they will be excluded, because: first, the left outer join produces a NULL for the price; second, the group aggregation with SUM over that NULL results in NULL; and third, the HAVING clause treats that NULL as false-like and excludes the order from the result.
Of course, we can explain procedurally what's happening here, and each individual step makes some sense. But the end result has no conceptual integrity.
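For what it's worth, the usual way to make the query match the stated intent is to fold the NULL back into a number, e.g.:
-- find orders with a total less than $10000, counting empty orders as $0
select order_id, coalesce(sum(price), 0) as total
from orders o left join order_lineitems l using (order_id)
group by order_id
having coalesce(sum(price), 0) < 10000;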
Extra challenge: explain why using COUNT instead of SUM in the query does correctly return orders with fewer than 4 items:
-- find orders with fewer than 4 line items
select order_id, count(price)
from orders o left join order_lineitems l using (order_id)
group by order_id having count(price) < 4;
PS: thank you to the author for a developer-friendly feature that adds flexibility here!
I’m not sure what explanation you want. NULL is in essence “not a value”, which works more or less the same way “not a number” does. So yes, if you perform operations between a NULL and a value you get a NULL; therefore, when you sum or compare NULLs with or to other things you get NULL, which is then treated as false in a boolean context.
If the price of one item is UNKNOWN (which is what NULL represents in SQL) then it stands to reason that the sum is unknown, and it is unknown whether the sum is or is not smaller than 10000.
> But the end result has no conceptual integrity.
> Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?
> Extra challenge: explain why using COUNT instead of SUM in the query does correctly return orders with fewer than 4 items:
You don’t need to determine the values in a group to count them. And in fact COUNT(price) skips NULLs entirely, so an order with no line items gets a count of 0 rather than NULL, and 0 < 4 is plainly true, so the order is included.
Because it’s very common for attributes to be optional but unique if specified. E.g. your fleet management might have a 0-1:1 relationship between employees and cars: no car can be associated with 2 employees, an employee can’t have two cars, but any number of employees can be car-less, or cars employee-less (a lending pool, or cars which have not been assigned yet).
This also fits perfectly well with the normal interpretation / treatment of sql null as “not a value”.
> This change prohibits nothing, it allows specifying something (which was a bit complicated to enforce before).
It allows specifying a constraint - that you can't have two rows with the same values even if one of the values is a NULL. That's prohibiting duplicate NULLs. The change allows you to prohibit duplicate NULLs.
Say you have a table (EmployeeName, CarID). You could do a UNIQUE constraint on those two attributes, but that would still allow:
EmployeeName | CarID
Jeff | 2
Kim | NULL
Kim | NULL
Here, "Kim" is car-less (NULL in the CarID field) twice, which makes no sense.
> It allows specifying a constraint - that you can't have two rows with the same values even if one of the values is a NULL. That's prohibiting duplicate NULLs. The change allows you to prohibit duplicate NULLs.
Sure.
> Here, "Kim" is car-less (NULL in the CarID field) twice, which makes no sense.
The relation makes no sense to start with, it didn’t need duplicate nulls for that.
That's not quite the same, since the previous solution of unique (EmployeeName, CarID) allowed for one employee to have 2 separate cars. You have to use conditional indexes to solve this.
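A sketch of that (table and column names invented; whether the partial WHERE clause is strictly needed depends on how your database treats NULLs in unique indexes, but it makes the intent explicit):
create table employee_cars (
    employee_name text not null,
    car_id        int
);
-- each employee appears at most once, so at most one car per employee
create unique index on employee_cars (employee_name);
-- each car belongs to at most one employee; car-less rows (NULL) are exempt
create unique index on employee_cars (car_id) where car_id is not null;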
It's a 0-1:1 relationship, so you'd have columns like "name", "email", "car", etc. on the employees table, and "car" might have a unique index (which is probably an ID referencing a "cars" table).
Since employees can only have one car and since a car can only be used by one employee, an unique index is fine, but there may be 20 people that don't have a car so you don't want NULL values to be unique.
Of course there are always ways to work around this, but this is the general idea.
Multiple NULLs in a unique index are useful when an entity has 0 or 1 of a simple scalar value. For example, you might have a column with an external ID, such as a facebook login id in your users table. Not all users have a facebook id, but those who do should have a unique value in that column.
You might want to look at a unique partial index: "CREATE UNIQUE INDEX foo ON users(facebook_id) WHERE facebook_id IS NOT NULL".
I am not sure if this really is a great modelling. If a user happens to have two logins in your table (maybe they signed up with two different emails), and then connects one of his accounts with his FB identity, what does it mean that the other entry cannot be connected to the FB identity?
Let's say you have a Person table with an SSN column. The column has a unique constraint because no two people have the same SSN. But the column is nullable, since not everyone has an SSN.
It would be weird to expect that only one person can be without an SSN.
I can't think of a situation where I'd want "only a single NULL"; that seems much weirder. Another comment described a tree data structure where it could make sense, but I can't think of others. Nullable unique indexes usually describe some kind of optional identifier; if it's present, you want it to be unique, but it's perfectly fine for more than one row to omit it.
You might want an index like this on some aggregate table. Like, shoe_sales_total, which is grouped by the id of the shop assistant that helped you with the selection. But customers that have marched straight to the till and asked for shoe polish result in records with a NULL for agent_id. With a regular unique constraint (month, agent_id), your aggregate table can end up with two records for (May-21, NULL).
Of course, the workaround for such cases is simple: have a sentinel value for unassisted sales and make the column not nullable.
Such "tables" would more typically be constructed as views (materialised if performance requires it) over underlying data. In this case the underlying data would have a row for each sale, with a normal nullable column for agent_id. The query that constructs the view would use GROUP BY, and since NULLs group together in GROUP BY it would work as you expect (grouping "no agent" sales together into a single record in the view). The database system thus handles ensuring data consistency, merely by using its features as they are designed to be used.
One way nullable columns are often used is like an option type. It's easy to see that you might want rows to be unique if they have Some(value) but allow multiple rows where the column is None.
I agree that this is often the use case but depending on how the query (as well as how sparse the nulls are, the total data size and partitioning scheme I guess) it's also possible to achieve this through a join-table without introducing NULL values.
Good point. I did try to say that whether that's a viable technique would depend on your query (but omitted some verb that should have made it better resemble a sentence in English :)).
But yes, even if no nulls are present in any table, outer join could (or maybe even should) introduce them.
Take the rows (1, NULL), (1, NULL), and (1, NULL), where each NULL signifies an absence of a value; those NULLs may later be filled in to become (1, 1), (1, 2), and (1, 3). That's why we may not know whether any two NULL values are equal to one another.
In some cases the above is actually important, whereas in others you may want to treat NULL as a "value" instead.
Coming from an MS SQL Server background, this behavior surprised me.
While SQL Server generally does treat null values as not equal in query predicates, for the purpose of unique constraints/indexes it’ll treat them as the same value. I guess this does go against the spec? But it makes sense as a human when looking at a tuple and deciding if it is “unique” or not.
On that note, now I wonder what SQL Server does with nulls in a DISTINCT or GROUP BY. I suspect that it’ll treat all nulls as the same.
An interesting question. They may be treated differently depending on whether the nullable field is one of the grouping fields, or is one of the selected aggregate fields.
> The background here is that the SQL standard was ambiguous about which of the two ways an implementation should behave. So in the upcoming SQL:202x, this was addressed by making the behavior implementation-defined and adding this NULLS [NOT] DISTINCT option to pick the other behavior.
When something that fundamental changes, what is the cascading effect? Do other features refer to the UNIQUE/NULL behaviour, or do they have their own versions of UNIQUE/NULL-like code?
Cool but this was trivially solved previously by excluding NULL from the index using a partial index.
I'm more interested in problems that have no great solution, like taking the first element of a set of partitioned data. Right now you have to partition & rank, and then select the rank 1 elements in a subsequent query.
I don't think you understood what this does (or else I misunderstood it), but if you do that, how are you going to get an index violation from having two NULL values in the index? Because that is what this allows: normally, indexing NULL values permitted multiple NULL entries, but with this option NULL is comparable to other NULLs, so you can have only one NULL value in the index.
Previously I had worked around this with a functional index on e.g. COALESCE(value_with_null, -1), but you don't always have a good sentinel value readily available. A more complicated index (..CASE WHEN..) would solve that, but it's nice to have this directly expressible, and it seems efficient as well.
> Cool but this was trivially solved previously by excluding NULL from the index using a partial index.
That seems like the opposite solution, treating NULLs as different in database systems which treat them as identical by default (which I think only includes MSSQL).
In postgres, and most DBMS, using a partial index excluding NULL allows any number of rows to have a NULL value… which is the same behaviour as if you’re not excluding NULLs.
The new opt-in behaviour is treating all NULLs as the same value, and thus allowing a UNIQUE index to constrain them.
Before, this required two different partial indexes, or hacks like COALESCE-ing NULLs to a sentinel for the constrained subset of the data.
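For reference, the pre-15 emulations being alluded to look roughly like this, assuming a single nullable integer column col on a table t:
-- option 1: two partial indexes, one for real values and one allowing at most one NULL row
create unique index t_col_uniq on t (col) where col is not null;
create unique index t_col_one_null on t ((col is null)) where col is null;
-- option 2: collapse NULL to a sentinel, if one is safely outside the real value range
create unique index t_col_coalesced on t (coalesce(col, -1));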
Maybe it's what you meant by "subsequent query", but you don't actually need to do it in a separate query; you can do the partition & rank in a CTE and select from the resulting table, all in a single query. It's quite tidy actually.
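Something like this, sketching the "first element per partition" pattern with made-up table and column names:
with ranked as (
    select o.*,
           row_number() over (partition by customer_id order by created_at) as rn
    from orders o
)
select *
from ranked
where rn = 1;
In PostgreSQL specifically, SELECT DISTINCT ON (customer_id) ... ORDER BY customer_id, created_at expresses the same thing even more compactly.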
I think this is a great change. I wish NULLS NOT DISTINCT was the default already - if I wanted the constraint to ignore null values I’d have used a partial index to only index non-null values. Doing the opposite is much more difficult.
This is cool, but wouldn't creating a constraint using a nullable column be considered a poor design decision? In which scenarios would this be a good idea?
When an entry can belong to only 0 or 1 related object. Not that I'd put a constraint in such a scenario, but I imagine a User can optionally have a Subscription, so subscription_id is either nil or present, and said subscription cannot be associated with any other User.
The foreign key only guarantees that the other entity exists.
The unique constraint ensures that only one pair of entities has this relationship, preventing a one-to-many binding.
The distinctness of NULL allows you to have multiple entities with the same NULL value without violating the above UNIQUE constraint.
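Put together, that might look like this (hypothetical names); with the default NULLS DISTINCT behaviour, any number of users can have a NULL subscription_id while the non-NULL values stay unique:
create table subscriptions (subscription_id int primary key);
create table users (
    user_id         int primary key,
    subscription_id int unique references subscriptions  -- nullable: many users may have none
);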
The "NULL is empty" vs "NULL is unknown" is a series of trade-offs of labor-saving. Imho, the wrong trade-offs were made, but once the choice is made it makes sense to continue and be consistent with it. I'd rather be consistently wrong than inconsistently right.