Semantic Diff for SQL (github.com/tobymao)
228 points by s0ck_r4w on July 29, 2022 | hide | past | favorite | 31 comments


Very nice/interesting.

Somewhat related question and apologies if this is already stated in the documentation (it's rather dense and I haven't had the time to read through it completely)...

Can you use sqlglot to create custom DDL dialects that have custom first-class objects? For instance, if I wanted to build a custom SQL/DDL/DML dialect that had a new kind of object such as a "pipe", "kitchensink", etc., would sqlglot be a good tool to use?

I've tried playing around with Apache Calcite, but it lost me pretty quickly since the examples to customize/extend DDLs were quite lacking in my opinion.


Yes. SQLGlot is very customizable and you can pretty much override everything. It kind of needs to be flexible, because even common SQL dialects vary greatly.

Here's an example of how we use SQLGlot to output raw Python code directly from SQL.

https://github.com/tobymao/sqlglot/blob/main/sqlglot/executo...


Very nice, and exactly along the lines of what I was thinking... I want to be able to create a custom SQL dialect that can output some code.

Thanks very much, looking forward to spending some time reading through all of this!


I've been very impressed with sqlglot, and am looking forward to trying this feature. The only issue I've had with sqlglot is transpiling for use with a specific Spark version: in my experience Spark is not great about surfacing obvious 'not registered' errors when a function isn't supported (especially in >=2.4). I ran into this with width_bucket, which is only in the most recent release. I'm curious whether there's a straightforward way to write against a specific release and catch the error in transpilation rather than execution.

Side note: Iaroslav (post author) and Toby (sqlglot creator) are both amazing, and I'm so glad that they're working on open source projects like this.


It's easy to add errors for dialects by calling #unsupported when a function is used.

In terms of versioning of engines, I haven't implemented that yet, but presumably it could be done by adding a dialect subclass and having versioning route to it, so we could do something like parse(sql, dialect="spark", version=...) which could then route to a 2.3 version of spark.
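A minimal sketch of that routing idea (all of these names are hypothetical; nothing like this exists in sqlglot today):

```python
# Hypothetical sketch of version-aware dialect routing; none of these
# names exist in sqlglot, they only illustrate the idea discussed above.

class Spark:
    """Stand-in for a sqlglot Dialect subclass (latest release)."""
    SUPPORTED_FUNCTIONS = {"WIDTH_BUCKET", "DATE_ADD"}

class Spark23(Spark):
    """Older engine release: shrink the supported surface."""
    SUPPORTED_FUNCTIONS = Spark.SUPPORTED_FUNCTIONS - {"WIDTH_BUCKET"}

# parse(sql, dialect="spark", version="2.3") would route through a
# table like this one, falling back to the latest release.
REGISTRY = {("spark", None): Spark, ("spark", "2.3"): Spark23}

def resolve(dialect, version=None):
    return REGISTRY.get((dialect, version)) or REGISTRY[(dialect, None)]

def check_function(name, dialect, version=None):
    if name.upper() not in resolve(dialect, version).SUPPORTED_FUNCTIONS:
        raise ValueError(f"{name} is unsupported in {dialect} {version}")

check_function("width_bucket", "spark")  # fine on the latest release
# check_function("width_bucket", "spark", "2.3")  # would raise ValueError
```

This would let transpilation fail fast instead of surfacing the error at execution time, which is the use case described above.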

Happy to chat more about this and we can see about adding it (or feel free to make a pr). You can DM me on twitter or some other avenue as well if you want to dive in deep.



Really awesome work :-)

I implemented the Fast Match / Simple Edit Script algorithm almost 10 years ago for my Master's thesis[1] and my database project[2][3], in order to import revisions of files with a hopefully minimal number of edit operations between the stored revision and a new one (back then it was for XML databases).

The diffing was only one aspect of the visual analytics approach to comparing the revisions (tree structures) visually[4]. Internally, the nodes are addressed through dense, ascending 64-bit ints stored in a special trie index. Furthermore, during the import, changes can optionally be tracked, and a rolling hash can optionally be stored for each node. After the import you can query the changes or execute time-travel queries easily.

Technically, a tree of tries is mapped to an append-only data file using a persistent data structure (in the functional sense): COW with path copying, plus a novel sliding-snapshot algorithm for the leaf data pages themselves.
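For readers unfamiliar with path copying, here's a toy sketch of the copy-on-write idea (this is not SirixDB's actual implementation, just the general principle): an update copies only the nodes on the root-to-leaf path, and every untouched subtree is physically shared between revisions.

```python
# Toy sketch of copy-on-write path copying in a persistent binary tree.
# Not SirixDB's implementation; it only illustrates the general idea.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Node:
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def set_leftmost(root: Node, value: int) -> Node:
    """Return a NEW revision with the leftmost leaf updated."""
    if root.left is None:
        return Node(value, None, root.right)
    # Copy this node, share the untouched right subtree.
    return Node(root.value, set_leftmost(root.left, value), root.right)

rev1 = Node(1, Node(2), Node(3))
rev2 = set_leftmost(rev1, 42)

# Both revisions stay readable ("time travel"), and the right
# subtree is physically shared between them.
assert rev1.left.value == 2 and rev2.left.value == 42
assert rev1.right is rev2.right
```

Old revisions stay intact for free, which is what makes time-travel queries cheap.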

I've always had the vision of implementing different visualizations to compare the revisions in a web frontend, but I'm currently spending my time on improving the latency of both writes and reads.

Thus, if someone would like to help, that would be awesome :-)

Kind regards

Johannes

[1] https://github.com/JohannesLichtenberger/master-thesis/blob/...

[2] https://github.com/sirixdb/sirix

[3] https://github.com/sirixdb/sirix/tree/master/bundles/sirix-c...

[4] https://youtube.com/watch?v=l9CXXBkl5vI


Comparing tree structures could also be used to diff EXPLAIN plans. During query optimization, that might make a lot of sense.


I thought this was going to be something else like being able to tell that a rewritten query returns the same set of rows, but with potentially a very different query plan. E.g. dependent EXISTS subquery vs IN subquery.


This is what the sqlglot optimizer is used for. The optimizer converts EXISTS and IN into canonicalized SQL (some variant of a left join), which can then be compared to another query.

So if you run the optimizer first and then the diff tool, it could solve this kind of use case.
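The pipeline can be pictured with a toy canonicalizer (a stand-in for sqlglot's real optimizer, which does far more): normalize both trees first, and a plain structural comparison is then enough.

```python
# Toy stand-in for "optimize, then diff": canonicalize expression trees
# by sorting the operands of commutative operators, then compare them
# structurally. sqlglot's real optimizer is much more sophisticated.
COMMUTATIVE = {"ADD", "MUL", "AND", "OR"}

def canonicalize(node):
    """node is either a leaf (str) or a tuple (op, *operands)."""
    if isinstance(node, str):
        return node
    op, *args = node
    args = [canonicalize(a) for a in args]
    if op in COMMUTATIVE:
        args = sorted(args, key=repr)
    return (op, *args)

q1 = ("ADD", "b", "c")  # b + c
q2 = ("ADD", "c", "b")  # c + b
assert canonicalize(q1) == canonicalize(q2)  # semantically the same
```

Run the canonicalization over both queries first and the diff tool only has to report real semantic changes.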


This exists for Postgres, let me try to find the name of the tool

EDIT: I can't find it, I searched for 30 mins, I promise this exists though. If anyone else can remember the name of it, please post.


Pg_query maybe?

https://pganalyze.com/blog/pg-query-2-0-postgres-query-parse...

Scroll down to the section marked "Fingerprints in...".


As far as I can tell, pg_query would treat the equivalent EXISTS and IN (...) forms as different queries with different fingerprints, which seems to fit its intended use. The GitHub README says:

> Usage: Fingerprinting a query

> Fingerprinting allows you to identify similar queries that are different only because of the specific object that is being queried for (i.e. different object ids in the WHERE clause), or because of formatting.

Pretty cool that we can get the parse tree that PostgreSQL itself would produce, from Go or Ruby, via the library.


Interesting, will give sqlglot a look when we get to adding SQL support in DiffLens [https://github.com/marketplace/difflens]. Or perhaps DiffLens can just use sqlglot :) Either way we're very happy to see another semantic diff tool.

P.S.: We work on DiffLens. It currently supports TS, JS, CSS and text diffs. We're currently working on a VS Code extension.


I wonder if this is a topical thread to check if anyone is aware of a Java based solution to parse a CREATE VIEW statement to get a mapping between the view columns and the corresponding source table columns. I checked out jsqlparser[0] and it does produce an AST which can be parsed using the visitor-pattern[1] but was wondering if there is a more "out-of-the-box" solution involving less work. Due to various reasons, querying the database information schema is not an option I can pursue.

[0]: https://github.com/JSQLParser/JSqlParser

[1]: https://en.wikipedia.org/wiki/Visitor_pattern


What do you mean by a mapping to the source table columns? The data columns in the view could come from 0 to n source tables: entirely synthetic, transformed, or combined data from multiple tables, etc.


True - I mean the view columns that _are_ from actual tables. They could be straight-up table columns as is, or may be part of a function like max(a.abcd) where a is an alias to table xyz.


Apache Calcite can do this, though it's not a beginner-friendly task:

https://calcite.apache.org/


> (a + b) => (b + a)

> Semantically the query hasn’t changed

Now hang on a minute. Extend that to 3 terms and it can be (MSSQL, but true in any dialect I'd guess):

   declare @hi int =  2147483647;
   declare @lo int = -2147483648;

   declare @x int = @hi + @lo + @hi;  -- ok

   declare @y int = @hi + @hi + @lo;  -- 'Arithmetic overflow error'

Worse yet with floats. I see what you're saying, and good stuff; I'm thinking about this myself and I appreciate this article and will read it properly, but the edge cases have to be acknowledged.

Edit: this kind of thing is apparently something compiler writers keep rediscovering the hard way.


Associativity and commutativity are different things.

@hi + @lo + @hi means (@hi + @lo) + @hi

@hi + @hi + @lo means (@hi + @hi) + @lo

Just because you can write this without the parentheses does not mean you can ignore them. To get from one to the other you need not just the commutativity rule relied on by the diffing algorithm but also the associativity rule:

(@hi + @lo) + @hi =assoc=> @hi + (@lo + @hi) =comm=> @hi + (@hi + @lo) =assoc=> (@hi + @hi) + @lo

Here it is the third (associativity) step that introduces the overflow, not the commutativity step, which should always be safe for both integers and floating-point numbers.
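The point is easy to check outside SQL Server too. A small sketch in Python, simulating a checked 32-bit signed add (Python's own ints are arbitrary precision, so the bounds check is done by hand):

```python
# Simulate SQL Server's checked 32-bit signed INT addition; Python ints
# never overflow, so the range check is simulated explicitly.
INT_MAX, INT_MIN = 2**31 - 1, -(2**31)

def add_i32(a, b):
    r = a + b
    if not (INT_MIN <= r <= INT_MAX):
        raise OverflowError("Arithmetic overflow error")
    return r

hi, lo = INT_MAX, INT_MIN

# (hi + lo) + hi: fine, the intermediate sum is -1.
assert add_i32(add_i32(hi, lo), hi) == INT_MAX - 1

# (hi + hi) + lo: the reassociated form overflows on the first add.
try:
    add_i32(add_i32(hi, hi), lo)
except OverflowError:
    pass
```

Same three operands, different association, different behavior, which is exactly why the associativity step is the unsafe one.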


You're right but I'm just making the point that machine arithmetic ain't number arithmetic and you can't rearrange freely.

His example included SUM(b + c) and SUM(c + b) as equivalent. As such it probably is[1], but extend it to 3 numbers and you can't rearrange freely, as I think we agree.

[1] assuming b and c are numbers not strings, as + is string concat in some dialects


What about difftastic?


Difftastic seems like a really cool tool. There are a few reasons, however, why it doesn't apply well to use cases I had in mind:

1. It's in JS and not Python. These days a common data (including data tooling) stack revolves around Python, and fitting JS into this ecosystem is not straightforward.

2. Limited dialect support. As far as I can see it only supports a "PostgreSQL flavor" (not sure what exactly is meant by "flavor" here). Support for dialects like Spark SQL, Trino, and Hive was crucial.

Definitely a worthy mention, though, thank you!


Difftastic's wiki also has a breakdown of some structured diff algorithms: https://github.com/Wilfred/difftastic/wiki/Structural-Diffs

(I've been working on a similar problem, effectively diffing an XML tree.)


> It's in JS and not Python

it's in Rust:

https://github.com/Wilfred/difftastic

what am i missing?


Nitpick:

> when a nested query is refactored into a common table expression (CTE), this kind of change doesn’t have any functional impact on either a query or its outcome

This isn’t quite true, at least in Postgres. It won’t affect the outcome, but it can affect the query plan.


I believe that was true but in current PG the CTE no longer acts as an optimisation barrier.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...


For this use case that's intentional, since only the outcome is important to us. That's kind of the draw of a declarative language: you ask for what you want and don't have much control over how it's done (the engine should optimize that away).


this is very cool, but I believe this bit of the README is incorrect:

> Text-based diff tools such as git diff, when applied to a code base, have certain limitations. First, they can only detect insertions and deletions, not movements or updates of individual pieces of code.

git diff can detect movements. looking at my .gitconfig, I think it's the "frag = magenta" line.


I think color-moved might also be what you’re thinking of. https://git-scm.com/docs/git-diff#Documentation/git-diff.txt...
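For reference, moved-code highlighting can also be turned on permanently in `.gitconfig` (`zebra` is one of several documented modes; see git-diff(1) and git-config(1)):

```ini
[diff]
    # Highlight moved blocks of lines in alternating colors.
    colorMoved = zebra
```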


Post it or link it so we can better understand!



