Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

But the world just works like that. There are unions everywhere. No matter how it's implemented under the hood, it should really be a first class concept in any language, including SQL.


SQL already have first class support for union types in the form of relations. Adding union-type columns would actually be second class compare to this. You would have to add special-case operators and it wouldn't give you more power compared to just using relations.

That said, I'm not averse to the idea if someone can provide a realistic use case. The JSON example in the article is misguided though - you should not save the structure of the syntax of a serialization format, you should save the data model which is represented by the syntax.


Relations don’t provide sum types in a first-class way.

If you want a list of employees each of which have different roles, and depending on the role you have different fields guaranteed to be available (not null), this is not possible to express directly in SQL.

You can express this with sum types in ML, Haskell or Rust.

In SQL you would have to split it out into separate relations, like:

create table employee(id it)

And then employee_boss, employee_dev, employee_sales which each have different sets of non-null fields.

You can express a foreign key constraint that these tables must link to an employee, but not the other way around. You can’t guarantee in the limited type system of SQL that every employee row has a corresponding employee_X table, and exactly one, not more.

This is trivial in languages with sum types but not in SQL. In this sense they are strictly more powerful.

Various triggers can be added to try to enforce this constraint, but that’s a runtime check: better languages tell you when you type check the code.


Yes! This! Exactly! There’s so much that _can be knowable_ with SQL based on the constraints of the _query_ (not just the relational model) that it should be possible to know the characteristics of the data (and in some cases even the performance characteristics) just from type characteristics.


> You can express a foreign key constraint that these tables must link to an employee, but not the other way around. You can’t guarantee in the limited type system of SQL that every employee row has a corresponding employee_X table, and exactly one, not more.

True. (You can in some databases using check constraints, but it is not necessarily simple) The solution is to improve the constraint model to make it much simpler to define such constraints. For example it is trivial (in most databases) to add primary key and foreign key constraints, but more complex constraints are difficult or impossible.

I think the desire for tagged unions are really an XY problem, just like the desire for array types and other composite types. The relation type itself is powerful enough. But the desire for such features show that certain patterns are too complex and un-intuitive to apply in SQL.


>You can express a foreign key constraint that these tables must link to an employee, but not the other way around. You can’t guarantee in the limited type system of SQL that every employee row has a corresponding employee_X table, and exactly one, not more.

in this case instead of an employee table you would have an employee view that’s the union of the separate tables - engines vary dramatically in what sort of constraints they’ll enforce on the view now but the relational model is fine


In most languages where unions are a first class concept, are they not generally warned against?


The author uses the term "union", but unfortunately they picked the one wrong term against a long list of correct alternatives: sum type, tagged union, discriminated union, coproduct, disjoint union, variant, algebraic data type, ...

"Union" does not imply disjointness (which is the source of many problems), whereas all those other terms do.


Correct, I just continued to use the term of OP. Sum types r coproducts are what I mean.


By the "author", I meant the author of the article. I figured you were just being consistent with them.


You’re probably thinking of C/C++ unions, and they do indeed have quite tricky semantics and are difficult to use correctly. What the article and the other commentators are talking about is more properly called a “sum type”, like Rust’s enums, for instance. Those are different things from C unions. In languages with first class support for sum types, they are used everywhere, it’s an incredibly useful concept.


Gotcha, ta. That becomes a tricky problem when we're talking about something that's primarily a storage engine though, right?


Well, sum types are not magic - they must be stored in the memory somehow. Rust's enums tend to use a simple approach: there's a large enough integer tag (usually an 8-bit integer), followed by enough space to store any of the variants. The same exact encoding could be used with a database storage engine too.


Not sure - because in the end, a programming language also has to store this concept in memory. I don't think there is a fundamental difference.


Not at all. Untagged unions, sure, but no decent modern language has first-class support for untagged unions. First-class tagged unions are a real treat to use.


I was talking specifically about the JSON example in the article. Needing to store objects in that manner is just a silly problem to have. Any solution will be slow or ugly or both.

You're right, unions are everywhere. Right now a human has to think about each union and how to represent it in a database. It would be really cool if I could store capnp objects like the one below and still get optimal query performance and aesthetics without thinking about it:

struct Shape {

  area @0 :Float64;   
  union {
    circle :group {
      radius @1 :Float64;
    }
    rectangle :group {
      width @2 :Float64;
      height @3 :Float64;
    }
  }
}


Maybe I didn't get your point correctly but this seem actually doable in PostgreSQL.

You could create a custom type based on jsonb type with a trigger to enforce the types inside the jsonb. You could as well write some functions to ease manipulation of such custom type.

There is no out of the box light syntax to do that right now but this could be implemented in a future update/extension. Just look a the complex and successfully implemented geodata types brought in by the PostGis extension.


You can already do that by having circle and rectangle relations. Union types dos not give you any additional power compared to relations.

Your proposal might be more convenient though than creating multiple relations, so we should look into making it just as convenient to create the necessary relations to express this. So lets say your syntax proposal creates multiple relations under the hood - then I'm all aboard!

The "everything is a relation" is a really powerful concept in the relational model. Adding composite column types like arrays, unions etc. are still less powerful than just using relations. But people like it because the syntax is more convenient to use.


Agreed, that would solve my aesthetic concerns and kinda solve OP's XY problem with JSON.

But the performance of auto-created tables for every union member would kinda suck if you never take advantage of the relation and still pay the cost. You'd also have to auto-create indexes so that queries that read full objects is as fast (IOPS) as storing BLOBs.

I was imagining a future database that takes this a step further and lets you define named parameterized queries next to your tables and chooses the best implementation of unions based on your queries. That's what I meant by "without thinking about it". We already do this for algorithms (query optimizer), and I'd love to see it work with data structures


The basic premise of the relational model is that the physical storage structure and optimizations (including indexes) are independent of the logical model. A table could be stored row-by-row or column-by-column or any other clever way without it affecting the schema and queries - although it will affect the performance of the queries.


But if I have a circle relation and a triangle relation, how do I create a foreign key for the favourite_shape column in my user relation?


Thinking a bit more about it I think this would be a very elegant solution for Tagged-Unions in SQL:

- simply allow tagged unions for foreign key constraints: CONSTRAINT my_fk FOREIGN KEY (own_column) REFERENCES EITHER table_a (a_id) OR table_b (b_id) OR table_c (c_id)

- add join syntax to join via defined foreign key: FROM own_table JOIN VIA my_fk

The union FK would store an additional flag determining the target table and the JOIN VIA would switch on that flag and it would be allowed to use any columns from all target tables in the query but only the matching ones would be not null.


If you put the foreign keys on the table_a/b/c tables, then you can express the same without having to introduce tagged unions.


As I said in response to a sibling: But can you can not (in a simple way) prevent that both a circle and a triangle point to the same user and you can not enforce a NOT NULL for a user to must have a favourite shape.


You would have the foreign key in the circle/triangle relations pointing to favourite_shape rather than the other way around.


But can you can not (in a simple way) prevent that both a circle and a triangle point to the same user and you can not enforce a NOT NULL for a user to must have a favourite shape.


You can do that with a check constraint, although I don't know if you consider it simple. It would be cool with better support for common constraint patterns.


Could you give a concrete example for such a check constraint? I imagine each shape-table would need to have it's own check to test the existence of the same id in all other shape tables, which scales qudratically.

Or if you inverse the FK to make the user-table reference the shape tables you would need multiple potential nullable columns and a check constrain that exactly one is non-null. And it would still not solve the query side of things.


Check constraints are just SQL expressions, so in pesudocode it would be something like:

    (SELECT COUNT(*) shape JOIN square JOIN circle) = 1
How conveniently this can be expressed depend on the database system in question.

But if we are talking about adding features to SQL, it could have built-in support for such a constraint, just as there is built-in support for say foreign key constraints.


Are you by any chance generating GPT-3 comments as part of an experiment?

If not please take time to actually dive into modern SQL concepts there is likely a way to achieve your goal but I personally consider that it's pretty rude to ask random strangers a proof for a complex and loosely defined problem in the comments section. Stack Exchange should be a better place for instance.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: