Having is less understood? After 20+ years of SQL usage (as an ordinary dev, not...

CodeIsTheEnd · on Aug 6, 2022

I had never heard of GROUP BY CUBE either! It looks like it's part of a family of special GROUP BY operators—GROUPING SETS, CUBE, and ROLLUP—that basically issue the same query multiple times with different GROUP BY expressions and UNION the results together.

Using GROUP BY CUBE(a, b, c, ...) creates GROUP BY expressions for every element in the power set of {a, b, c, ...}, so GROUP BY CUBE(a, b) does separate GROUP BYs for (a, b), (a), (b) and ().
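To make the power-set description concrete, here is a minimal Python sketch that only enumerates the grouping sets CUBE would expand to. The function name `cube_grouping_sets` is made up for illustration and is not part of any database API; the ordering (largest sets first) is likewise just a presentation choice.

```python
from itertools import combinations

def cube_grouping_sets(columns):
    """Return every subset of `columns` (the power set), largest first.

    Hypothetical helper: it mirrors which GROUP BY combinations
    CUBE(columns) stands for, nothing more.
    """
    cols = list(columns)
    sets = []
    for size in range(len(cols), -1, -1):
        # combinations() keeps the original column order inside each subset
        sets.extend(combinations(cols, size))
    return sets

# CUBE(a, b) corresponds to grouping by (a, b), (a), (b) and ()
print(cube_grouping_sets(["a", "b"]))
```

With n columns this yields 2^n grouping sets, which is worth keeping in mind before cubing a long column list.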

It's like SQL's version of a pivot table, returning aggregations of data filtered along multiple dimensions, and then also the aggregations of those aggregations.

It seems like it's well supported by Postgres [1], SQL Server [2] and Oracle [3], but MySQL only has partial support for ROLLUP with a different syntax [4].

[1]: https://www.postgresql.org/docs/current/queries-table-expres...

[2]: https://docs.microsoft.com/en-us/sql/t-sql/queries/select-gr...

[3]: https://oracle-base.com/articles/misc/rollup-cube-grouping-f...

[4]: https://dev.mysql.com/doc/refman/8.0/en/group-by-modifiers.h...

nerdponx · on Aug 6, 2022

I would gladly buy a book of "SQL Recipes" ranging from beginner-level to advanced stuff that uses features like this, ideally with coverage of at least a few popular database systems, but at minimum Postgres.

Is there such a book?

hobs · on Aug 6, 2022

In fact, Yugabyte is giving it away for free - https://downloads.yugabyte.com/marketing-assets/O-Reilly-SQL...

chrisweekly · on Aug 6, 2022

This is great! Thanks for the link. :)

eatonphil · on Aug 6, 2022

Joe Celko has a number of SQL for Smarties books I've been meaning to look through.

mwexler · on Aug 6, 2022

These are phenomenal books. Your understanding of SQL as both a language and as a tool will be transformed.

A bit dated in places, but that also is helpful as you learn how a technique works instead of just relying on a vendor-specific extension.

zweifuss · on Aug 6, 2022

Yes, PostgreSQL (9 and later), SQL Server (2008 and later), Oracle, SQL Anywhere and DB2 support grouping sets. All had it for longer than a decade.

Some also support the MySQL/MariaDB with rollup syntax introduced in 2018.

ttfkam · on Aug 7, 2022

Not a pivot table equivalent. Most useful for calculating multiple related aggregates at once for reporting purposes, but ROLLUP doesn't substitute values for columns, i.e., it doesn't pivot results on an axis.

MS SQL Server has native pivot support, and Postgres has a limited emulation of it through the crosstab(...) function. https://stackoverflow.com/a/11751905 https://www.postgresql.org/docs/current/tablefunc.html

For folks just learning about ROLLUP et al, I highly recommend this comparison chart for an overview of major features offered by modern relational databases. https://www.sql-workbench.eu/dbms_comparison.html

There's a whole constellation of advanced features out there that arguably most application developers are largely unaware of. (Which explains why most app devs still treat relational databases like dumb bit buckets at the far end of their ORMs.)

ttfkam · on Aug 7, 2022

I should amend for clarity: GROUPING SETS are really useful for speeding up complex report generation.

mastax · on Aug 6, 2022

I had a situation recently where I had a huge amount of data stored in a MariaDB database and I wanted to create a dashboard where users could interactively filter subsets and view the data. The naive solution of computing the aggregate statistics directly based on the users' filter parameters was too slow; most of the aggregation needed to be done ahead of time and cached. The website's backend code was a spaghetti house of horrors, so I wanted to do as much as possible in the DB. (The first time in my career I chose to write more SQL rather than code.)

If I had a fancy DB I could use CUBE or GROUPING SETS and MATERIALIZED VIEWs to easily pre-calculate statistics for every combination of filter parameters that automatically get updated when the source data changes. But I had MariaDB so I made do. I ended up with something like this:

    SELECT ... SUM(ABS(r.ilength)) AS distance, COUNT(*) AS intervals FROM r
    GROUP BY average_retro_bucket, customer, `year`, lane_type, material_type, state, county, district WITH ROLLUP
    HAVING average_retro_bucket IS NOT NULL AND customer IS NOT NULL;

"The WITH ROLLUP modifier adds extra rows to the resultset that represent super-aggregate summaries. The super-aggregated column is represented by a NULL value. Multiple aggregates over different columns will be added if there are multiple GROUP BY columns."

So you can query like this to get stats for all districts in CA->Mendocino county:

    SELECT * FROM stats_table WHERE state = 'CA' AND county = 'Mendocino' AND district IS NULL

or like this to get a single aggregate of all the counties in CA put together:

    SELECT * FROM stats_table WHERE state = 'CA' AND county IS NULL AND district IS NULL

However, unlike CUBE, WITH ROLLUP doesn't create aggregate result sets for each combination of grouping columns. If one grouping column is a NULL aggregate, all the following ones are too. So if you want to query all the years put together but only in CA, you can't do:

    SELECT * FROM stats_table WHERE year IS NULL AND state = 'CA'

If `year` is null, all the following columns are as well. The solution was to manually implement wildcards before the last filtered group column by combining the rows together in the backend.
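The limitation described above can be reproduced in a small sketch. This is not the setup from the comment: it uses Python's `sqlite3` with an invented three-column table, and because SQLite has no ROLLUP it emulates the rollup with one GROUP BY per column prefix, glued together with UNION ALL. The helpers `rollup_grouping_sets` and `rollup_sql` and the sample rows are assumptions made for illustration.

```python
import sqlite3

def rollup_grouping_sets(columns):
    """ROLLUP only produces prefixes of the column list, longest first."""
    cols = list(columns)
    return [tuple(cols[:n]) for n in range(len(cols), -1, -1)]

def rollup_sql(table, columns, aggregate):
    """Build a UNION ALL query that emulates GROUP BY ... WITH ROLLUP.

    Hypothetical helper; rolled-up columns are reported as NULL.
    """
    parts = []
    for group in rollup_grouping_sets(columns):
        select_cols = [c if c in group else f"NULL AS {c}" for c in columns]
        sql = f"SELECT {', '.join(select_cols)}, {aggregate} FROM {table}"
        if group:
            sql += f" GROUP BY {', '.join(group)}"
        parts.append(sql)
    return " UNION ALL ".join(parts)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE r (year INTEGER, state TEXT, county TEXT)")
conn.executemany(
    "INSERT INTO r VALUES (?, ?, ?)",
    [(2021, "CA", "Mendocino"), (2021, "CA", "Lake"),
     (2022, "CA", "Mendocino"), (2022, "NV", "Washoe")],
)
columns = ["year", "state", "county"]
conn.execute(
    "CREATE TABLE stats_table AS "
    + rollup_sql("r", columns, "COUNT(*) AS intervals")
)

# A subtotal along the rollup order exists: all CA counties in 2021.
subtotal = conn.execute(
    "SELECT intervals FROM stats_table "
    "WHERE year = 2021 AND state = 'CA' AND county IS NULL"
).fetchall()

# 'All years, CA only' is not a prefix of (year, state, county),
# so no such row was ever generated.
missing = conn.execute(
    "SELECT * FROM stats_table WHERE year IS NULL AND state = 'CA'"
).fetchall()

print(rollup_grouping_sets(columns))
print(subtotal, missing)
```

The second query comes back empty because (state) without year is not among the prefixes, which is the gap that had to be filled by combining rows in the backend.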

I worked around not having materialized views by creating an EVENT that would re-create the stats tables every night. The stats don't really need to be real-time. Re-writing the multiple-GB statistics tables every night will wear out the SSDs in 20 years or so, oh well.

kmac_ · on Aug 6, 2022

SQL opens up when used with OLAP schemas. Most devs are experienced in querying "object mapped" schemas where cube, roll up, etc. are not useful. Nothing bad per se, but it can give an impression that SQL is a bad language, when actually it clicks well with a proper data schema.

rjbwork · on Aug 6, 2022

Indeed. I think your mind can really be opened by having to answer complex business questions with an expansive and well designed data warehouse schema. It's a shame it's such a relatively niche and unknown topic, especially in the startup world.

nerdponx · on Aug 6, 2022

This is why the data engineers get paid the big bucks, and also why having a good data engineer is a lot more important than a good data scientist in the early stages of a company.

OJFord · on Aug 6, 2022

I've never used `cube` in any context, but if I may I'd suggest you're parsing this wrongly:

`group by cube`/`group by coalesce` aren't special advanced features, they're just `group by`. You can group on 'anything', e.g. maybe you want to group on a name regardless of case or extraneous whitespace - you can use functions like `lower` and `trim` in the `group by` no problem, it's not something to learn separately for every function.
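The part of this that holds for ordinary scalar functions, grouping on an expression rather than a bare column, can be sketched as follows. The example runs against SQLite through Python's `sqlite3`; the `people` table and its rows are invented for illustration, and whitespace is removed with SQL's `trim`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO people VALUES (?)",
    [("Alice",), (" alice ",), ("ALICE",), ("Bob",)],
)

# Group on a normalised form of the name: case-folded and whitespace-trimmed.
rows = conn.execute(
    """
    SELECT lower(trim(name)) AS clean_name, COUNT(*) AS n
    FROM people
    GROUP BY lower(trim(name))
    ORDER BY clean_name
    """
).fetchall()

print(rows)
```

The three spellings of "alice" collapse into one group because the GROUP BY key is the computed expression, not the stored value.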

zweifuss · on Aug 6, 2022

Cube gets all possible combinations of grouping sets. It's like showing all subtotals in a pivot table. That's different from just grouping on the lowest level without totals.

remram · on Aug 6, 2022

Your suggestion is incorrect. CUBE is not part of the expression.

OJFord · on Aug 6, 2022

Well, I did say I wasn't familiar with it, but it's correct for `coalesce` and I don't think `cube` is different: https://www.postgresql.org/docs/current/cube.html

In the context of `group by` it's treated as grouping sets, but that's not its only use. (Though that does seem to be special cased in terms of parsing, since afaict - I can't find the full query BNF on mobile - `grouping sets` is not optional.)

remram · on Aug 6, 2022

https://www.postgresql.org/docs/14/sql-select.html

    GROUP BY [ ALL | DISTINCT ] grouping_element [, ...]

    grouping_element can be one of:

        ( )
        expression
        ( expression [, ...] )
        ROLLUP ( { expression | ( expression [, ...] ) } [, ...] )
        CUBE ( { expression | ( expression [, ...] ) } [, ...] )
        GROUPING SETS ( grouping_element [, ...] )

You're right about COALESCE of course.

cm2187 · on Aug 6, 2022

Exactly the same for me. It makes me think I need to start reading the TSQL documentation cover to cover.

hobs · on Aug 6, 2022

Not a huge fan of Microsoft but the TSQL documentation is solid. If you're not using CROSS APPLY to tear apart things and put them back together you've not lived.

Simon_O_Rourke · on Aug 6, 2022

Heard of group by cube before, even tried it in production, but it was a total dog of a query to run, so I filed it away to try again in a few years.

RexM · on Aug 6, 2022

Same!

I’ve been a developer for 15 years, consider myself pretty good with SQL, but I’m only now learning about group by cube.

ramraj07 · on Aug 6, 2022

The only “standard” feature I'd rather try not to understand is recursive CTEs lol

conceptme · on Aug 6, 2022

It's pretty useful when working with hierarchical data, but you do need to put in some check for cyclical relations; I have seen those take an application down :D.
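One common way to add such a check is to carry the visited path along in the recursive CTE and refuse to step onto a node that is already in it. The sketch below does this in SQLite via Python's `sqlite3`; the `edges` table, the node names, and the `/`-delimited path encoding are all assumptions for illustration, and the path trick presumes node names never contain `/`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (parent TEXT, child TEXT)")  # hypothetical
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "a")],  # c -> a closes a cycle
)

query = """
WITH RECURSIVE walk(node, path) AS (
    SELECT 'a', '/a/'                      -- start at node 'a'
    UNION ALL
    SELECT e.child, w.path || e.child || '/'
    FROM walk AS w
    JOIN edges AS e ON e.parent = w.node
    -- cycle guard: skip children that already appear in the path
    WHERE instr(w.path, '/' || e.child || '/') = 0
)
SELECT node, path FROM walk ORDER BY length(path)
"""

rows = conn.execute(query).fetchall()
print(rows)
```

Without the WHERE clause this query would never terminate on the sample data, since UNION ALL keeps re-adding the same nodes with ever longer paths. A depth limit is a cruder alternative when paths are expensive to carry.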

dspillett · on Aug 6, 2022

If you aren't careful you can cause that without infinite recursion. If the query optimiser can't see how to push relevant predicates down to the root level where needed for best performance, or they are not sargable anyway, or the query optimiser simply can't do that (until v12, CTEs were an optimisation barrier in Postgres), then you end up scanning whole tables (or at least whole indexes) multiple times, where a few seeks might be all that is really needed. In fact, you don't even need recursion for this to have a bad effect.

CTEs are a great feature for readability, but when using them be careful to test your work on data at least as large as you expect to see in production over the life of the statements you are working on, in all DBMSs you support if your project is multi-platform.

eyelidlessness · on Aug 6, 2022

they’re just self joins on views of the query.