I hate these sort-of-well-known-language-but-not-really DSLs. Since they're only sort of the well-known language, documentation is usually sparse, and you never know how to deal with edge cases without trial and error. For instance, in the given example, how do you escape SQL keywords? How do you escape quotes in string literals? Even SQL dialects don't necessarily agree on what's allowed, never mind this DSL with a half-assed parser.
Not to mention this practically invites dumb string interpolation in SQL preparation, undoing decades of work to rid the world of this terrible habit. (Edit: This problem is solvable.)
Don’t try to be clever, use a structured format (JSON, protobuf, etc.) with a well documented schema.
Note that I’m only talking about APIs. It’s an entirely different matter if your product allows users to type in and send complex queries interactively.
I believe they mean that the language described here is not really SQL. It looks like SQL. But the question is: what are the differences? Is it really like SQL?
Aye, and if you need a query language inside your API it's probably best to use one of the existing languages specifically made for this use case, like GraphQL.
GraphQL does not solve the problem if the query language is user-facing. Additionally, GraphQL is a pretty huge tech stack injection. Something like this is much simpler. So, it depends on the context of what you're making and why.
In my humble opinion, it would benefit everyone if we standardized something like jql (jira query language, i.e. the user layer) first. From experience, many who find SQL too complex have no problem writing complicated jql queries. It's both easy to use and a great first step toward writing actual SQL. I wish it were a standard like JSON.
The way the JSON filters work would be directly coupled to the JSON fix notation and SQL infix notation so it's directly coupled to SQL and doesn't really require a standard, imho.
It's "prefix" notation .. if you search something like "prefix vs infix vs postfix programming languages" it should return a decent explanation. For my case it's personal choice more than anything.. also easier for me to write a VM for.
From an end-user standpoint, I preferred working with such an API over ones that model the filtering logic with nested JSON filter structures.
It also provides a convenient way to select fields in the response, better in my opinion than the "field mask" alternative that you can find in other APIs.
This asks users to generate SQL client-side, parses the SQL server-side to make sure it's safe, and then generates new SQL for some reason. This seems needlessly complex: it requires more logic on the client side and puts a lot of trust in the server-side validator. The examples also use non-parametrised SQL, which means the client needs to know how to correctly escape values, and the database might fail to optimise queries. I see no benefits and many disadvantages compared to a simple JSON structure, even if it's custom: it's trivial to generate and parse, it can be used with an ORM, and the server will probably use parametrised SQL, because that's easier than the alternative.
I don't think JSON solves the problem; especially if you want users to be able to input complex queries, JSON will be a far worse experience. In his example, he does say he is processing the input language and generating SQL via a library. I think that calling the input language SQL, in this case, is probably a red herring. It's a very simple language with a strict set of binary operators that happens to look like SQL. Like I said in another post, I do a similar thing for the product we make; our input language is not SQL, but it is equally as expressive as the subset of SQL this blog post allows.
I don't see an issue in failing to optimize queries. The database does query planning on the fly. You do have to be a bit careful that you don't generate queries that can DoS your DB, but that's all solvable and not drastically different from a developer adding a new query to the product. What we do does create a prepared statement on the fly. We process the input query, generate SQL, and collect each value the user has input into arrays for the types they correspond to, then index into those arrays. So, for example, the query "pr:123 and (user:foo or user:bar)" would turn into SQL like "pull_request = ($1)[1] and (user = ($2)[1] or user = ($2)[2])" (this allows us to statically type our queries).
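To make the shape of that concrete, here's a minimal sketch in Python of compiling such an input language into parametrised SQL with typed value arrays. This is not our actual code; the grammar, names, and type mapping are invented for illustration:

import re

# Sketch only: compile "field:value" terms joined by and/or/parens into
# parametrised SQL, collecting each value into one array per SQL type.
FIELD_COLUMNS = {"pr": ("pull_request", "int"), "user": ("user", "text")}

def compile_query(query):
    arrays = {"int": [], "text": []}  # $1 = int array, $2 = text array
    sql = []
    for tok in re.findall(r"\(|\)|\w+:\w+|and|or", query):
        if tok in ("(", ")", "and", "or"):
            sql.append(tok)
        else:
            field, value = tok.split(":", 1)
            column, typ = FIELD_COLUMNS[field]
            arrays[typ].append(int(value) if typ == "int" else value)
            param = 1 if typ == "int" else 2
            sql.append("%s = ($%d)[%d]" % (column, param, len(arrays[typ])))
    return " ".join(sql), arrays["int"], arrays["text"]

sql, ints, texts = compile_query("pr:123 and (user:foo or user:bar)")
# sql   -> "pull_request = ($1)[1] and ( user = ($2)[1] or user = ($2)[2] )"
# ints  -> [123], texts -> ["foo", "bar"]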
In short, if your requirement is that users should be able to write complex queries, I think this is significantly better than JSON. And you can process an input language into JSON and the problem is effectively the same from that point.
For many applications, such as the bicycle shop in the examples, having users type in SQL or custom-language queries would be too difficult, considering most of the audience is not tech-savvy. Your user-friendly frontend UI would be happier to generate JSON instead of the custom textual format. Manually typing queries works for GitHub, but not much beyond it.
Databases do query planning, but some of them cache query plans based on the SQL text before substituting arguments, and may optimise reoccurring queries more aggressively.
There are lots of applications where a query interface that looks like SQL is not valid and there are lots of applications where it is. There are also lots of applications where "some query language" that has a translation to SQL would be a perfectly pleasant human interface, including product search.
Even if there are only three such applications, this blog post isn't saying every single problem has to be solved this way; if you're one of the people solving one of those three problems, this could be a great solution for you.
> Databases do query planning, but some of them cache query plans based on the SQL text before substituting arguments, and may optimise reoccurring queries more aggressively.
Ok, and? Even if we go with your JSON solution, you still need to query the database at the end of the day, and your JSON query is not going to translate into the exact same SQL every time (unless you're doing a very limited set of operations). I'm not sure what the real advice or information is here. Are you saying we should never build a query language that translates to SQL at all? Should all SQL queries be static, dynamic only in their input parameters? Even if your statement on query plan optimization is true, how do we know it matters for the product? Maybe slightly longer queries are totally fine in this context?
The 'custom' part is the problem. Dealing with a few vendor service APIs, everyone has a different idea and syntax for how to implement a simple AND/OR filter. A few weeks back I spent half a day trying to figure out the correct syntax for a '<value> IN ...' query due to poor API documentation. Something more intuitive would be nice.
I half expected this to promote Postgrest or one of its alternatives and was surprised it didn't do that by the end of the post.
I used it for this exact use case: we had some data in Postgres and the front end had a limited API, and read-only PostgREST was simpler than adding the API I needed to the front end.
In the end, this is really just about using a well-known syntax to express queries: since you end up parsing it, validating it, and building the actual query anyway, the fact that the query is initially written in SQL or anything else doesn't really matter, except that developers know SQL already (which is a good point).
Yes, but it's a little bit more: SQL is a fully fledged language that covers an enormous range of potential tasks, and the syntax works mostly seamlessly to enable that.
This means, extending the API by allowing more and more parts of SQL is easy and still backwards compatible.
It's much harder to come up with a custom language / structure from the get-go that doesn't cause backwards-compatibility issues down the road. I guess I should have emphasized this point in the article a bit more, thank you for the comment.
The trouble with SQL is that it is designed around the idea of being able to execute many queries without any overhead. No problem when you are within the confines of local memory. Manageable when you are in a well connected datacenter. But you will quickly find yourself in a world of hurt when you have clients on flaky, high latency mobile connections.
Okay, so now you find some way to bundle many queries, with allowances for connecting the result of one query to another, into a single request with all the results in a single response. Congratulations, you just recreated GraphQL.
I can't follow. I'm very familiar with both SQL and GraphQL, but SQL is strictly more powerful. If you want, you can extend the API to let the caller use with-statements, which allows all necessary data to be aggregated within the same SQL query and returned as a single response. No disadvantage to GraphQL here.
I'm not saying that I suggest to do that, but for the specific point that you mention, I don't see any theoretical advantage of GraphQL in terms of performance.
I'm not sure I understand. GraphQL doesn't need to be, nor is it meant to be, powerful. Its only purpose in life is to roll up multiple actions into a single response made from a single request. It was originally designed for those actions to be REST/RPC endpoints, but if you have an SQL API endpoint the same applies.
SQL as an API isn't some novel thing. Your DBMS already does exactly that! We've been using SQL as an API for 50-some-odd years and will probably still be using it for APIs in 50 more. But, there is good reason why we have added layers on top in the datacenter. Don't expect SQL to be a suitable API for the entire world to use.
> which allows all necessary data to be aggregated within the same SQL query and returned as a single response.
If you push SQL to really pained lengths it can be done, but with a whole lot of added overhead elsewhere. There is no free lunch. Best to leave SQL to the job it was designed for; the job it is good at. It's okay to let it have help.
> Yes, if you push SQL to really pained lengths it can be done, but with a whole lot of added overhead that isn't necessary.
I still don't get your point. Maybe a concrete example helps: listing the contents of a user's posts. In GraphQL:
users(id: "x") {
postIds # don't really need those
posts {
text
}
}
Here, if this is stored in SQL, there will probably be 3 tables: users, userposts and posts, all connected by the respective postId. GraphQL is nice here because instead of making a query to get the user's postIds and then one or more queries to get each post's content, with GraphQL we only need one network trip.
If the backend is half stupid, it will run 2 SQLs here. If it's smart, it will resolve the whole thing with one join.
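Concretely, the smart version is a single round trip, something like this (column names assumed from the example):

-- one query, one network trip
SELECT p.text
FROM users u
JOIN userposts up ON up.user_id = u.id
JOIN posts p ON p.id = up.post_id
WHERE u.id = 'x';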
So far there is no performance benefit over writing a single SQL with a join and having the backend execute it directly.
Still, I want to emphasize that I'm not advocating the latter - it has various drawbacks. But just in terms of performance, there is no disadvantage for the SQL solution.
Even if we look at running two semantically completely different queries in the same GraphQL request, we can do the same in SQL with a with-clause. With the technique shown in the article, it's really easy to support that in the API.
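A sketch of what I mean, assuming Postgres and its to_json; two semantically unrelated result sets bundled into one response:

WITH user_info AS (
  SELECT id, name FROM users WHERE id = 'x'
), recent_posts AS (
  SELECT id, text FROM posts ORDER BY created_at DESC LIMIT 10
)
SELECT 'user' AS kind, to_json(u) AS payload FROM user_info u
UNION ALL
SELECT 'post', to_json(p) FROM recent_posts p;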
> If it's smart, it will resolve the whole thing with one join.
Yes, as a practical engineer that is what you would do to deal with the overhead you have encountered in the datacenter, but you would realize that you are abusing SQL to hack around overhead issues. That is not what you would do in a less constrained environment. Logically, users and posts are distinct relations as it pertains to that example. They have no business being joined here. There is a place for joins, but that's not it.
So already, for the simplest case imaginable, you are using SQL outside of the way it was designed to be used. Which is fine if it gets the job done. Purity can step to the side for the sake of engineering, of course. We are not purists here. But as the complexity ramps up, there will be a point where it stops being a decent tradeoff and enters a point of being ridiculous – where you've ended up recreating GraphQL... but poorly.
And yet not being a mind reader I won't know what your reason is until you tell me. You're not obliged to tell me what that reason is, of course, but so long as you don't say what that reason is, you fail to rule out the possibility that you don't actually have a reason.
Sure. Consider the obverse: "Humans ALWAYS have a [rational] reason for their actions." The word "always" is logically necessary and yet it sets an extremely high bar to be cleared. Is this the position you want to stake yourself to?
Note: yes, I added the word "rational" here to be explicit but which I only implied earlier. If you want to quibble about that let me save you the trouble. If you want to say that you had a reason, but it just wasn't a rational one, I'll accept that. After all, I'm not unreasonable.
The statement cannot be evaluated as stated. It does not provide enough information to be able to do so. Logically, rationality is context dependent, but the statement is made without defining the context under which it is posed.
The second half of the comment suffers much the same problem. If I wanted to say that the reason was rational (or not), I would need to know under what context it is being said. The reason is undoubtedly both rational and irrational, depending on context. There is not enough information in this thread to establish under what context the reader is intending to interpret the assertion should one want to assert it.
Gibberish. Sophistry. Gorilla dust thrown to obscure what's probably happened here. What's probably happened is that you had a reason--or thought you did--but that's now thrown into doubt and so you're trying to obfuscate the matter. What's probably happened is that you were motivated to assume that I hadn't read the thread (I mean, you said as much) as an easy way to deflect scrutiny of your reasoning. Except, I did read the thread, which means your assumption was wrong. Rather than admit that, again the tactic is to obfuscate. I've seen it a million times.
Fun story. If you had put that much effort into actually reading the thread in the first place you would have been able to answer your own question. For what it is worth, it is recognized that you did go back and eventually read the thread – and thus answer your own question – later once you realized how dumb the question was.
Maybe you're right. Maybe I didn't read the thread. Thing is, you haven't demonstrated that yet. You could've, though. You could've really put me in my place by laying to rest the lingering doubts that you know anything about databases, at all. You could've, and yet you can't, because you don't, and so you didn't, and now it's too late.
Oh well! Happy New Year. Try sometime new. Read a book about relational databases for a change. Peace!
Perhaps I could have chosen to put you in your place, but why would I want to? It makes no difference to me what you think. You have clearly not thought this through.
By avoiding going down that pointless road to nowhere, you offered me some wonderful creative writing that I have been quite amused by. Had I put you in your place instead, all I would have gotten in the end is something to the effect of "Oh, yeah." How boring. That would have been the biggest waste of time ever. The outcome here has been much more beneficial.
And with that, I thank you for the amusement you provided. It has been fun.
> Logically, users and posts are distinct relations as it pertains to that example. They have no business being joined here. There is a place for joins, but that's not it.
So what would you do in your world? Run one SQL to get the post's ids and then run another query with that list to get the content for those posts?
If you were to use SQL as designed, with an idealized implementation, you would first query the users and then run a query for each user to retrieve the corresponding posts. That is what the example given describes.
I know that's not exactly practical in the real world. Maybe if your dataset is tiny you can get away with it, but any more than that and you'll soon be killed by the overhead of all those queries in the datacenter, let alone over flaky networks.
So, in the real world you have to find a workaround. Using a join here is a decent workaround with decent tradeoffs. You can condense what might be hundreds of queries into one. It does not come free. There is no free lunch. But the tradeoffs are acceptable in a lot of cases.
But we're also talking about the simplest case imaginable. While it is a reasonable tradeoff for that simple example, those hacks don't scale particularly well as the complexity grows. SQL is not designed for one query to return everything and the kitchen sink.
Okay, I think I understand now. I'm a bit surprised, since for me, joins are a very fundamental feature of SQL - but I understand where you come from if you see that differently. Thank you for the explanation.
Joins are fundamental where the join is a logical operation. But the specific example of packing unrelated datasets into one relation, only to unpack them again on the client, to save on overhead costs is not at all what joins are designed for.
What you are talking about is commonly known as the n+1 problem. It is a problem because the mathematically pure solution does not usually work in the real world due to real world overhead constraints. If computers operated in an idealized world the n+1 problem wouldn't exist, but we live in a harsh reality where not everything works out so perfectly. If joins were meant to be used this way, obviously it wouldn't be a problem in the first place... But it is a problem because that is not what joins are designed for, even if they can help deal with the problem in some cases.
Obviously you can use things beyond what they are designed for, but in this case it only works in the small scale. Give us a complicated example and watch the nightmare unfold. Again, SQL is designed for working with the relational model. The example GraphQL schema is not relational. It describes a graph model. There is, as they say, an impedance mismatch. The closest approximation to a graph in the relational model is multiple relations, and multiple relations in SQL requires multiple queries. In SQL, one query always returns just one relation.
I think you are wrong. I think you are "confidently wrong" as evidently the saying goes, on many points.
- GraphQL isn't meant to bundle many operations together. People mean things and the people who created GraphQL said what they meant and this isn't it, or at least it isn't all of it.
- SQL isn't meant to be used without joins and it isn't being abused by the presence of joins.
- "users" and "posts" in the example are not unrelated. We were explicitly told by the author of the example that they ARE related.
- consequently it's not true that there's no business joining them. On the contrary there are very good reasons for doing exactly that.
- doing that--making those joins--is not a hack
- while it's true that a SQL query returns only one relation, there's nothing inherently wrong with it being a relation over a nested data type like XML or GraphQL.
In my view, you are causing real harm by spreading false information on this subject.
Agreed. There was nothing to suggest otherwise. Did you not bother to read anything before replying?
> "users" and "posts" in the example are not unrelated.
They are related as per the graph, but that does not make them related in a relation. These are not equivalent models. After all, if they were suitably related in a relation you wouldn't have two tables in which to join. They would already exist in one relation. A simple `SELECT * FROM kitchen_sink` would do.
> On the contrary there are very good reasons for doing exactly that.
Yes, we went over them in detail. How did you end up here without reading a single word?
> consequently it's not true that there's no business joining them.
That's right, there is a good reason to join them: To overcome the overhead problem, better known as the n+1 problem. That would still be a hack if using an idealized system, or even a practical system that focuses on minimizing said overhead, like SQLite.
In fact, the SQLite docs even tell you not to resort to such hacks while using SQLite, as it is not necessary.
> there's nothing inherently wrong with it
That's right. Nobody said there was anything wrong with it. I even explicitly stated it was a good solution in many cases. I still don't understand how you managed to get here without reading a single thing.
> the people who created GraphQL said what they meant and this isn't it
I also said what I meant, but that didn't stop you from going off to la-la land. What makes you so sure you understood what they said when you can't even manage this simple conversation?
> In my view, you are causing real harm by spreading false information on this subject.
You haven't even read the discussion... But go on, let's assume I spread some falsehood. What harm has been caused?
> There was nothing to suggest [that SQL isn't meant to be used without joins]. Did you not bother to read anything before replying?
This you? "If you were to use SQL as designed, with an idealized implementation, you would first query the users and then run a query for each user to retrieve the the corresponding posts"
> Yes, we went over [the very good reasons for joining "users" and "posts"] in detail
You literally wrote, "Logically, users and posts are distinct relations as it pertains to that example. They have no business being joined here."
> That's right, there is a good reason to join them: To overcome the overhead problem, better known as the n+1 problem
That's not right. That is not the reason to use joins.
> That would still be a hack if using an idealized system, or even a practical system that focuses on minimizing said overhead, like SQLite.
Joining tables in general and joining these tables in particular is not a hack.
> In fact, the SQLite docs even tell you not to resort to such hacks while using SQLite, as it is not necessary.
Evidence or it didn't happen.
> there's nothing inherently wrong with [a nested data type like XML or JSON]
You also wrote, "Obviously you can use things beyond what they are designed for, but in this case it only works in the small scale. Give us a complicated example and watch the nightmare unfold. Again, SQL is designed for working with the relational model. The example GraphQL schema is not relational. It describes a graph model. There is, as they say, an impedance mismatch. The closest approximation to a graph in the relational model is multiple relations, and multiple relations in SQL requires multiple queries. In SQL, one query always returns just one relation."
Look, I read your comments many times despite your multiple incorrect assertions that I did not. If I haven't understood you perfectly, well, in my defense it's because I'm dealing with gibberish like this. It sure seems like you're insinuating there's some kind of problem with using SQL to generate the response for GraphQL queries, and your reasoning, if we can call it that, seems to rest in part on the fact that a SQL query returns just one relation. Yes, that's true, but no, that is not a problem for generating the responses for GraphQL.
> I also said what I meant [about the purpose of GraphQL]
You said, "GraphQL doesn't need to be, nor is it meant to be, powerful. Its only purpose in life is to roll up multiple actions into a single response made from a single request"
No, rolling up multiple actions into a single response isn't the only reason GraphQL was created.
> You haven't even read the discussion.
Wrong. Do you always try to take the easiest way out when your reasoning is subjected to scrutiny?
> What harm has been caused [by my comments]?
Oh, I don't know. How about by promulgating the idea that joins are somehow a "hack" for starters?
I think that the URL search parameter convention as used by PostgREST is a better alternative than using SQL for this. It is more ergonomic, easier to parse, and allows for some additional functionality you would usually need GraphQL for. It has some smaller downsides, but those are in part due to the translation mismatch between URLs and SQL. Remember that the SQL spec runs to thousands of pages and that the only reliable way to parse Postgres's version of SQL is by linking to the actual Postgres implementation.
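For a flavour of the convention (written from memory, so check the PostgREST docs for the exact operator names), the article's bicycle filters would look roughly like:

GET /bicycles?material=eq.steel&weight=lte.20&select=name,weight
GET /bicycles?or=(and(material.eq.steel,weight.lte.20),and(material.eq.carbon,weight.lte.10))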
I've been playing around with this idea of using this language as a colloquial standard for writing things like a DB proxy/cache, or a "standard" interface for something like an ECS-based, OLAP-esque engine on columnar stores. Nothing to report at this time, but I'd love to hear from people who have worked on a similar thing.
With SQLite in the browser via WASM, and the sync technology that's available for that (ElectricSQL (I work on this), CR-SQLite, Sqlsync), webapps are increasingly using SQL in the UI layer to a local DB. It's only one step further to then have a multi-modal app that uses SQL to query the server database for non-synced data.
I'm expecting raw SQL to replace the API for a whole class of apps.
Throw in an event sourcing architecture for server side processing and you have a single SQL based API for all operations.
You can do what the author is suggesting without the "SQL" part. In reality, they aren't exposing SQL but, at best, a subset, and most likely a "language that looks roughly like SQL". This article is timely, as I am implementing a similar thing for our product. I already knew I didn't want to go into it doing the original solution the author presents. Our product already has a little query language called tag queries[0], so I already have a parser for it. Now the AST has two consumers: the one that does what our docs say, and the various APIs that need query ability. It's pretty simple to write a transformer from the bespoke language to SQL.
In the normal usage we have special tag prefixes with meaning, and now each query entry point has the same query language and you just need to know what the tag prefixes mean for that context.
I haven't done it yet, but the next step is to add a complexity limit so that people can't send us queries that will DoS the database.
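Roughly what I have in mind is a budget check on the parsed AST before it ever reaches the SQL transformer; a sketch in Python, with all names invented:

from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                       # "and" / "or" / "term"
    children: list = field(default_factory=list)

MAX_NODES = 50  # budget; joins or negations could be weighted more heavily

def complexity(node):
    # count every operator and term in the tree
    return 1 + sum(complexity(c) for c in node.children)

def check(ast):
    if complexity(ast) > MAX_NODES:
        raise ValueError("query too complex, please simplify")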
The fundamental problem with SQL is ambient authority:
- it's insecure (update anything, anywhere! just need to get by check constraints)
- it's slow (table scans, when I meant to use index)
- it's bad for horizontal scaling (things like sharding fundamentally relies on shrinking the domain of access patterns).
Solve that by introducing some sort of capability system into the query language (think foreign keys, forward or backward), and we can make major strides on all fronts.
> - it's insecure (update anything, anywhere! just need to get by check constraints)
It's entirely possible to add permissions and access control on any modern dbms.
> - it's slow (table scans, when I meant to use index)
Maybe you are using an OLTP when you meant to use an OLAP dbms?
> - it's bad for horizontal scaling (things like sharding fundamentally relies on shrinking the domain of access patterns).
See above.
SQL is more adequate as a "language" or "standard" because that's what it is. What matters is that the underlying dbms that offers this functionality is appropriate.
> It's entirely possible to add permissions and access control on any modern dbms.
My understanding is this is all mandatory access control which is cumbersome as usual.
> Maybe you are using an OLTP when you meant to use an OLAP dbms?
"It's slow" was too pithy. When it's fast it great, but it's too easy to write things that are slow my mistake. I want things to be rejected if it they don't use an index in many cases, not fall back on a slow table scan.
The key part is that I posit the "when are table scans allowed?" policy is very similar to the security policy.
"cumbersome" is a matter of judgement. As for table scans, again as I've written elsewhere here, statement timeouts will put an end to those [concerns].
I don't disagree with that. I want something better.
The dream I have is that if one carefully writes a schema (in a whitelist, not blacklist manner), all the information is there to make:
1. Exposing the database to the public safe.
2. Automatically know how to scale out if more machines are added.
3. Easier to understand the cost of operations from the syntax alone, without needing to consult the query planner.
That's fair. You do "you", I'll do "me", and we'll both have a very merry [seasonal holiday of togetherness of your choice]. So long as we're saying what we want in the way of gifts, here's what I want. What I want is to push the envelope of what can be done with SQL well past what should be done, not as a practical matter but as a theoretical one. In my view, accumulated received wisdom causes people to err too far in the other direction. They stop well short of what can be done practically with just SQL, out of an outsized and unexamined fear of what shouldn't be done in SQL.
I don't literally want just a SQL user interface to my applications, but I like to imagine if I did, think about the potential problems, then try to be resourceful in solving those problems right within SQL before giving up and resorting to other tools. I believe you can go very far with this approach, maybe not all the way, but farther than people tend to give SQL credit for. At the end of the day, I still need something else in addition to SQL, but the "something else" turns out to be smaller and less involved than it would have been had I been governed by the accumulated received wisdom.
I'm not sure if you're saying one shouldn't use a SQL database at all or you're saying that using SQL as a user facing query language is bad. In this case, I think it's better to consider the user facing language as a simple language that roughly looks like SQL. The author has exposed a very trivial amount of the power of SQL, so a lot of languages would fit the bill.
This is a bit off-topic, but I wonder if anyone knows of a programming language/DSL that supports 'subsentential' logical operators. I was just thinking the other day how convenient it'd be to be able to write SQL queries like this:
... WHERE (either A or B or C) (either beginswith OR endswith) (either X or Y or Z)
i.e. expressing the condition in a way that closely mirrors the English syntax, instead of having to fully write out all 18 (= 3x2x3) clauses. (Natural rules for short-circuiting would apply, so if 'A beginswith X' evaluates to true, no further conditions need be checked.)
This isn't just limited to query languages. In many programming languages, it strikes me as odd that I can't write something like `if A is (either X or Z)`, which mirrors how I would naturally express the condition in English, but need to reformulate it as `if A is X or A is Z`, or (worse IMO from the readability standpoint) something like `if [X,Z].includes(A)`.
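The expansion is at least mechanical. A quick Python sketch of the idea (invented pseudocode; real code would bind the values as parameters instead of interpolating them):

from itertools import product

def expand(fields, ops, values):
    # one clause per combination, OR-ed together (3 x 2 x 3 = 18 here)
    return "(" + " OR ".join(op(f, v) for f, op, v in product(fields, ops, values)) + ")"

beginswith = lambda f, v: f + " LIKE '" + v + "%'"
endswith = lambda f, v: f + " LIKE '%" + v + "'"

print(expand(["a", "b", "c"], [beginswith, endswith], ["x", "y", "z"]))
# -> (a LIKE 'x%' OR a LIKE 'y%' OR a LIKE 'z%' OR a LIKE '%x' OR ...), 18 clauses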
I don't know any language that offers this, I just wanted to say that I have also thought about this feature at times - but when reading your pseudocode I didn't initially get it, because of the OR, which already has its meaning - instead one could use "either", like "either(a, b, c).either(beginswith, endswith)...".
I am using a similar approach to filter network flows, but the language is just SQL-like (notably because users may not be familiar with all the quirks of SQL) and gets translated to ClickHouse SQL. The language is exposed to the end user, so there is completion and syntax highlighting. Since the result is open source, you get a complete example (notably to evaluate how complex this is).
We have been seeing this trend (or pendulum swing) of pushing SQL and simplicity. I am not saying this solution is simple (will leave that for discussion).
Tangentially, if this kind of stuff is interesting, folks might want to look at what Omnigres (disclaimer: my employer) is doing with the Python integration in Postgres.
When I was reading Beautiful Racket [0] I was still working as a data engineer, writing complex SQL queries all day. I actually sort of fell in love with SQL, and decided to write a web development framework in SQL.
INSERT requests would create forms, SELECT would create a table.
The idea was that you could create something simple like a BBS with only a few lines of code:
CREATE TABLE posts (title TEXT, content TEXT)
INSERT INTO posts (title, content) VALUES (request.title, request.content)
SELECT title, content FROM posts
Sadly, I stopped working on it after only 1 or 2 weekends because I switched professions to cybersecurity and had too much to learn - every weekend thereafter was CTFs or bug bounties.
Oh well! Someday I'd like to write a more intuitive, "conventional" style language for Urbit, maybe called HoonScript or something. So my affection for Racket's language oriented programming will (maybe, one day) not be in vain!
If this reinvents half of postgrest, then postgrest is a very small piece of software (I don't think it is). The author has restricted themselves to such a small portion of SQL that it's probably not really worth the comparison; it's really just a small bespoke language-to-SQL transformer, and that is not so hard to implement.
But, even if they did reinvent half of postgrest, there are plenty of reasons this might be the right choice given more contextual information, tech decisions aren't made in a vacuum, it just might not be possible to use postgrest.
I implemented something like this for a log platform I was working on. The user would write queries in Apache Lucene syntax, and they would be converted into JSON, and then into a specific SQL query.
It worked nicely, but apart from querying logs, I don't see a use case where one would do this rather than just creating API endpoints.
Instead of inventing your own DSL for filtering, there are also some well-specified standards.
ODATA [1] is probably the main one, and it’s adopted by e.g. Microsoft’s Azure APIs. JSON:API [2] is another, although I’ve never seen it used anywhere personally.
They’re both pretty heavy specifications so I’ve never bothered implementing anything that 100% follows them. But I have picked bits and pieces from time to time—in particular ODATA’s idea for filtering is nice and comprehensive imo, and there are a number of libraries in various languages which handle parsing its filter strings into structured data.
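For example, the article's bicycle filter in OData's $filter syntax would look roughly like this (from memory, so treat as approximate):

GET /bicycles?$filter=(material eq 'steel' and weight le 20) or (material eq 'carbon' and weight le 10)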
I’m currently playing with PostgreSQL, foreign data wrappers, and pgrx rust extensions. My development experience has been surprisingly smooth and enjoyable.
My main issue is that joins will be processed locally, so all the foreign data will be fetched before the join happens. But otherwise basic CRUD is easy.
I accepted that particular answer because it was balanced and comprehensive; I still think there’s a way to do this right that provides lots of benefits, namely something similar to what TFA is describing, that is: a wrapper layer that speaks SQL and applies ACLs and other limiters before passing the query to the database, potentially even translating between a virtual, exposed schema, and the actual database.
The key advantage here being the ability to package this into a library, instead of everyone and their mother rolling their own.
I think that's funny, because personally I DIDN'T find that answer to be comprehensive or balanced. By my lights, it seems to go out of its way to ignore the affordances modern databases offer to protect against abuse. I found the following paragraph from the answer to be especially immoderate.
"One of the most common vulnerabilities on web applications is SQL Injection, and you are giving a SQL console to your users. You are giving them a wide variety of guns, lots of bullets, and your foot, your hand, your head... And some of them don't like you."
No, you are not giving your users a wide variety of guns and bullets and hands and heads. You DO have some measure of control over the amount of power available to users. I don't consider it comprehensive or balanced not to address that fact in the answer.
This post reminded me of Fly's shipping-the-DB idea [0]. It sounds crazy, but works beautifully for some use cases. Instead of data, the caller gets the replicated SQLite itself and may do whatever it wishes with it.
In this vein, I was surprised to learn that crt.sh (certificate transparency log) lets anyone connect to their Postgres and write their own queries against it.
So what? That role has read-only access probably to all the relations, and it has a statement_timeout of '2 min'. Evidently, that's acceptable to its maintainers and we're in no position to say that they're wrong.
Do you really think this is a good option? Even if you do properly enforce row-level security (PostgreSQL supports this), you're still exposing a lot of power to the user. They could write some pretty rough queries that could make guaranteeing QoS difficult or impossible. Row-level security is not easy to get right, either. Maybe there is a library that can help you set this up.
If you need row-level security for some tables, you can also solve that with views that expose only the appropriate rows.
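Something like this, say (schema and role names invented):

create view moderated_users as
  select u.*
  from users u
  join moderator_areas m
    on m.area_id = u.area_id
   and m.moderator = current_user;
grant select on moderated_users to moderator;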
QoS could be a problem, indeed. If QoS is an issue in the case one is working on, exposing your data as SQL might not be the best choice, and neither is exposing any type of API that causes the server to do hard work.
For small applications with a hundred to a few hundred users, using the database's own user management is quite handy. You can be very sure that you don't leak information, and you can be quite certain that potential bugs in your application code won't lead to many of the common security vulnerabilities that exist.
> QoS could be a problem, indeed. If QoS is an issue in the case one is working on, exposing your data as SQL might not be the best choice, and neither is exposing any type of API that causes the server to do hard work.
It seems to me that there is a large gap between "inject SQL from the user directly into the database" and "provide a limited query language that can be translated to SQL". There are plenty of situations where the latter is totally fine. Is that reinventing the wheel? I don't think so.
That certainly is an option and might make sense in the context of some problems. I think limits like that tend to be problematic once some operations, validly, take longer. But YMMV. I think exposing the entire database to users is a pretty big surface area to hand out.
Valid operations that take longer can be performed with another role with more generous allowances on statement timeout, but more stringent limitations on the allowable operations (e.g. substituting procedures for a general query interface).
I'm not really sure how that solves the problem. They can still do anything in that account with more generous allowances. It feels like we're just adding complexity without much benefit. It also doesn't address my concern of handing out an interface with a very large surface area. But if this works for you, great. I don't think I'd be able to sleep well operating such a service.
They can't still do anything in that account with more generous allowances on statement timeouts. They can only do what permissions and policies allow, which could be "nothing." That's not a very large surface area.
I think we're missing the point. Clearly we're handing out access to the database and giving the user the ability to run arbitrary SQL for a reason, so giving them access where they can do "nothing" isn't going to solve whatever need we're trying to solve. The amount of security configuration going on to make this solution viable in any real system seems significantly more costly than just writing something that translates a small DSL to SQL.
Judging by some of the comments here it wouldn't surprise me if there are some people operating under the mistaken belief that "full privileges" is the only option. Establishing that "no privileges" (or "nothing") technically IS another option establishes the boundaries of the "surface area" that can be exposed. Evidently, there are at least two possible discrete values for that surface area and "none" is one of them (the other being "all"). That's progress--albeit incremental progress--in this discussion. To me, the obvious next step is to ask and answer the question, "OK, can the exposed surface area take on any other values in the interval [none, all]?" to which I would answer, "Yes."
For example, an OLAP system might with some justification be expected to offer more than 100ms statement timeout in order to support analytical workloads, but also continue to offer arbitrary queries to support exploratory analysis. Fine. Make the statement timeout 10000ms. In Redshift, say (you are using an analytical DB for analytical workloads, aren't you?) you can also impose quotas on rows fetched and data size returned. On the other hand, in an OLTP system (you are keeping your OLTP and OLAP workloads completely separate, aren't you?) for DEV you might have more stringent limits but still preserve a general-purpose SQL query interface. On yet another hand, in an OLTP system for PROD you might lock things down further by "white-listing" blessed queries by baking them into procedures, then turning off access to the underlying relations.
The point which I would like us not to miss, is that we can dial the degrees of protection and of access up or down to suit our needs, right within the database. Sure, you can't have everything. You can't have an API that BOTH offers generous quotas AND ALSO a wide-open and flexible general-purpose SQL query API. Life is about trade-offs, after all.
> You can't have an API that BOTH offers generous quotas AND ALSO a wide-open and flexible general-purpose SQL query API.
But that's not what's being proposed here. What's being proposed is a query language that users can write that compiles to SQL. We don't need all of the power of SQL, all of the functions of SQL. We can wrap the query the user provided in some extra checks to limit what they can see. This is orders of magnitude simpler to implement with a simple compiler than all of the security engineering you've proposed.
I'm not evaluating what's being proposed here. In this thread anyway, I'm responding to this comment, which I regard as untrue:
> They could write some pretty rough queries that could make guaranteeing QoS difficult or impossible.
However, were I to evaluate what's being proposed, I would say it's predicated on matters of judgement rather than matters of fact: you may not feel you need "all the power of SQL", you may feel that the proposal is "orders of magnitude simpler", and you may feel that things like row-level security are "tricky to get right", and that's fine, but others are not obliged to share your feelings. I'm sorry, but I don't. I just don't believe that using the security engineering affordances offered by modern databases is difficult to do, no matter how many times you tell me I should believe that.
> you may not feel you need "all the power of SQL"
It's specified in the blog post. It does depend on the specific use case you have. In my case, I am implementing a language that is more user friendly for users to work with than SQL but compiles down to it, so in my case giving them straight SQL access would not fit the feature requirements. But, of course, it depends.
> you may feel that the proposal is "orders of magnitude simpler"
We can debate whether it's literally "orders of magnitude", but I have a hard time seeing how, for any complicated database, correctly implementing, testing, and guaranteeing the future correctness of handing database access out to users is less complicated than compiling to SQL. For my use case, we have a multi-tenant database with sensitive information in it that would be bad if another user were somehow able to access it. The solution of compiling to SQL I have implemented in around 200 lines of code, plus another 200 (and growing) of tests. I am making a concrete statement here: that 200 lines of code is a lot simpler than whatever you will have to do for security engineering. That does not mean what you're proposing is not the right choice given whatever constraints you have, but tell me how it is simpler.
> I'm implementing a language that is more user friendly for users to work with than SQL
"User-friendly" is a matter of judgement not of fact.
> I am making a concrete statement here: that 200 lines of code is a lot simpler than whatever you will have to do for security engineering
Again, that's not a statement of fact, but setting that aside, personally I'm unimpressed. It's easy to imagine multitenant databases with sensitive information providing safety guarantees in under 200 lines of DDL.
I don't even really know what we're disagreeing about at this point. My statement from the beginning has been that handing out direct SQL access to users is not a universal solution. It can work in some cases but there are definitely cases where it is not a valid solution to the problem at hand. I happen to believe that in a realistic production database your solution is going to be significantly more complicated. I have presented a crude measurement for the complexity of my solution, and your response has just been "well, I could imagine it's just as small doing it my way". Ok, I can imagine a lot of things, too.
I feel that your responses have been scoped at just trying to invalidate my most recent response rather than discussing the system as a whole and the pros and cons of various solutions, and that makes me feel frustrated.
I'm sorry you're frustrated. I admit that I am scoping my responses to your most recent response rather than trying to discuss the system as a whole. In my experience, that's common in threaded comment systems. Whatever its shortcomings, it does help combat non-sequiturs. If you'd rather discuss the system as a whole, we can do that.
As for what we're disagreeing about, maybe this will help. In this thread I have tried to say the following:
1. I believe that using a SQL role with the correct access rights, augmented by other features within modern databases, is a good option. It's not the ONLY good option, but it's not the case that it's never a good option.
2. I believe that it's not NECESSARILY true that a user could write some pretty rough queries that threaten QoS against a SQL API. There ARE steps that can be taken, right within the database.
3. I believe that exposing a SQL database to users is not NECESSARILY handing out a big surface area. Modern databases offer tools to limit the size of that surface area. They may not limit it in all possible ways to satisfy every possible person (e.g. I'm not aware of any SQL database that allows you to forbid joins, for example), but it's a mistake to believe that you can't limit the surface area AT ALL.
4. I believe that using these features (RBAC, RLS, data types, constraints, triggers, views, quotas, etc.) in development is not NECESSARILY more costly than writing something that translates a small DSL to SQL.
5. I believe it's difficult to quantify this cost or effort but in any case it's not orders-of-magnitude more difficult to use the database than to compile a DSL.
6. I believe it's presumptuous to declare that other people don't need all the power of SQL. Some people may not need it or want it, and that's fine, but if other people say that they do need it or want it, I'm not going to tell them they're wrong.
7. I believe NEITHER of us is in a position to tell other people that they're doing it wrong, or that what they're doing isn't valid, on matters of judgment. If you like compiling a DSL and it's good for you, I'm not going to say that's not valid. I'd appreciate it if you'd do likewise when I say that I like using the database because it's good for me.
8. I believe BOTH of us, however, are free to say that something is incorrect on matters of fact (e.g. numbers 1 - 4 above).
If you don't disagree with any of these things then perhaps we're not disagreeing about anything after all.
I agree with most of this. The things I don't agree with entirely are small:
3. I do think it is a big surface area. It can be reduced and, in this thread, I have never stated you cannot reduce it at all.
6. I did not state this, at least, but I did state that for my use case, the interface I am providing is more user-friendly. That does not mean all use-cases match mine. In particular, the query DSL we have expands out to some large SQL that would be difficult to write.
I think the only real disagreement we have is how expensive the various solutions are to implement. I still don't buy that for a real-world dataset, the security engineering would not be onerous relative to a DSL, and you haven't provided anything concrete to counter that. But that's fine, it's not your responsibility to convince me. Thank you for the lively discussion.
> 3. I do think it is a big surface area. It can be reduced and, in this thread, I have never stated you cannot reduce it at all.
No, but you did state, "I think exposing the entire database to users is a pretty big surface area to hand out." without qualification. Owing to the ambiguities of human language I never know for sure what people are saying online, but it seemed plausible to me that you thought the surface area couldn't be reduced at all. I welcome your clarification, so thank you for that.
> 6. I did not state [people don't need all the power of SQL].
Well, you did state, "We don't need all of the power of SQL." Sorry, but I took "we" to be "all of us." Perhaps "we" meant just you and your colleagues for your particular use-case. Again, I would be grateful for clarification though you're not obliged to provide it.
> I still don't buy that for a real-world dataset, the security engineering would not be onerous relative to a DSL.
Let's get down to brass tacks. The "security engineering" which I don't think would be onerous comprises:
- basic data types
- custom data types
- custom domains
- primary key constraints
- foreign key constraints
- check constraints
- triggers
- procedures
- views
- Role-Based Access Control (RBAC)
- Row Level Security (RLS)
- resource limits (e.g. statement timeouts)
Do you still consider these to be onerous?
> and you haven't provided anything concrete to counter that
Let me ask you this. Take your use-case of "a multi-tenant database with sensitive information in it that would be bad if somehow another user was able to access it." I don't know the details so for all I know it could be something simple. In that case, it could be handled with something like what's depicted in this demo.
grant all on tenant to tenant; -- use RBAC for the table
create policy tenant on tenant as restrictive for all using (id = current_setting('session.id')::uuid); -- RLS specific policy
create policy global on tenant as permissive for all using (true); -- RLS general policy
alter table tenant enable row level security; -- Turn on RLS for the table
Granted, your use-case probably isn't this simple. Then again, we still have 196 lines to work with before we hit the dreaded 200.
It's been a long day and this thread is getting awfully deep, so I don't expect you to respond. If you'd like to continue it in a different channel (e.g., GitHub discussions) I'd be happy to oblige, however. Peace.
I actually like and use database security features. However, there are generally two problems with them:
1. There is a limit to the customization
2. They are generally harder to test and change
Application code is usually optimized for both points. So for me, I actually prefer a combination: use rough database security features to reduce the blast-radius of a bug in the application code. And use application code for everything else, since it's easy to customize, easy to understand, independent of the database technology and very easy to test.
> Well, you did state, "We don't need all of the power of SQL." Sorry, but I took "we" to be "all of us." Perhaps "we" meant just you and your colleagues for your particular use-case.
The context of that sentence is talking about the specific proposal of the blog post, which states that their usecase did not require all of SQL. I can understand why it would be confusing as I switched to "we" in the next sentence. For my usecase, as well, we do not need all of the power of SQL.
> Let's get down to brass tacks. The "security engineering" which I don't think would be onerous comprises
> ...
> Do you still consider these to be onerous?
What you have listed is not an actual solution, just a list of things you'd need to solve it. The security engineering is how you actually use them. I could say the only thing you need for the DSL to SQL solution is a programming language, my list would be one item, but would that be reflective of the actual complexity in solving the problem? No. So I cannot say whether or not it is onerous. Additionally, the solution I mentioned is around 200 SLOC, but it has another 200 SLOC of tests, so how to test this is a valid question as well. My tests also don't require a database with data in it; we just validate that the query looks like what we expect it to.
> Granted, your use-case probably isn't this simple. Then again, we still have 196 lines to work with before we hit the dreaded 200.
Thank you for the example, it clarifies in my head more how this would be implemented. Do you have an example of how this works if you are performing joins between tables? Does every table need to have some sort of user id in it directly for that to work?
> What you have listed is not an actual solution, just a list of things you'd need to solve it
Sorry. HN comments is a narrow channel and so I elected not to try to squeeze a full-blown actual solution through it.
> I could say the only thing you need for the DSL to SQL solution is a programming language, my list would be one item
Well...I enumerated the features of the language I'm using just as you could enumerate the features of your programming language. My list could have one item just as easily as yours can: "PostgreSQL DDL"
> how to test this is a valid question as well
My answer to that question has been to use pgTAP for testing and postgresql_faker to generate synthetic data
> My tests also don't require a database with data in it
No, but they do require a runtime, be it in Scala or whatever. That's no different from my case where my test runtime is an ephemeral PostgreSQL database.
> Do you have an example of how this works if you are performing joins between tables?
> Does every table need to have some sort of user id in it directly for that to work?
It's common for single-database multi-tenant data models to add something like a "tenant_id" to every table. It's simple, more efficient, and more foolproof. You can, however, just join to other tables in the policy condition, as I have done. Extra care should be taken, as discussed in the PostgreSQL docs.
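For example (a sketch in Postgres syntax, names invented), a policy on a child table can reach over to the tenant table:

create policy posts_tenant on posts for all
  using (exists (select 1 from tenant t
                 where t.id = posts.tenant_id
                   and t.id = current_setting('session.id')::uuid));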
Thank you for the information. I've even started to use the statement timeout in my DSL implementation to ensure that queries are bounded. I think we view the relative costs of the implementations quite differently, probably due to our backgrounds, but this has been very enlightening for me, thank you.
But faster than you can imagine, the next requirement comes in: OR filters!
bicycles that are (made of steel AND weigh from 10 to 20kg) OR
(made of carbon AND weigh from 5 to 10kg)
Hmm... I don't know if this is a good example. I have been in the product search space for almost 10 years now, and that feature request never came up. And I have cataloged thousands of feature requests users had about product search interfaces.
In the rare case users do something like this, they usually use two tabs. One, where they compare steel bikes and one where they compare carbon bikes.
It makes sense, when you think about the frontend implications. Users are very used to checkboxes and range sliders. The user would check "steel" and set the range slider to "10-20". Now how do they expect to duplicate the sliders and the checkboxes to construct the alternative configuration? They open the page in a new tab and use the alternative settings in that tab.
Yeah, maybe the example isn't that great, I admit so much. :-)
However, it's not only about the users as in the visitors of the shop. It can also be a feature such as "show similar bicycles" that will require more complicated queries. And while it's of course possible to run multiple queries and combine the results in code, it can be nice to just do it in one query that is easy to understand and debug/optimize, because it's pretty much just SQL.
Maybe it's how you approach products or the space you've been in, but the ability to do "OR" has come up quite frequently for me, and on the opposite side of things, I often am very frustrated when products with search do not let me do OR.
Pretty much any log search interface; I use Papertrail. Google search. For the feature I am currently working on, an audit trail for the operations we run, OR is required. I wish the various searches on GitHub supported "or".
Additionally, tons of products do support "OR". Take the Amazon search. It has sections that are "OR" inside of them and "AND" across the sections.
Outside of product search, there certainly are user customizable OR functions in search. That's why I said that the product search example the article gave is probably not a good example. Because in product search it is very, very unusual and not something users expect.
I'm not sure exactly what you mean with the Amazon search, but I have the feeling that what you refer to is not the type of user customizable OR that the article used as an example.
Can you link to such an Amazon search with "OR" and "AND"?
Literally any Amazon search is an OR and an AND. For example, I want products below 200 euros and brands speedo or quicksilver.
If you mean an Amazon search that allows you to explicitly type this in, they don't have that, and, sorry, I was probably not explicit enough in what I was saying.
Now my point is, OR does show up very often in product search, but it's usually very limited in that it's OR within a category and AND across categories, and there is a pretty standard interface for expressing this.
products below 200 euros and brands speedo or quicksilver
That is what I expected you meant. Yes, that is a popular pattern. But it is not customizable like the "bicycles that are (made of steel AND weigh from 10 to 20kg) OR (made of carbon AND weigh from 5 to 10kg)" given in the article. When you select two brands on Amazon, they are always connected via OR, and when you select the 200 euros, that is always connected to the rest via AND. So the typical
search?brands=speedo,quicksilver&priceMax=200
works.
The point of the article is that this does not work for customizable AND OR constructs, and therefore one should use SQL. And then it gives the bicycles example, stating that this feature request comes up "faster than you can imagine". But no, it almost never comes up in product search. I have not seen it even once in the last 9 years among the thousands of feature requests I looked at.
this is why we go full circle all the time.
1) devs expose plain sql query to the users
2) devs realize it's a bad idea
3) 10 years later, a dev thinks maybe it's not that bad
4) full circle
---
1) devs use the server to generate html response to the users
2) devs think maybe it's better to do everything on the frontend side instead
3) 10 years and 250k frameworks later, devs think maybe it should be done on the server side instead
You could get queries that don't use indexes and have to perform table scans. That's fine when your data is small, but it's a performance bottleneck waiting to happen.
Author here, unfortunately not (to my knowledge). I've used a lot of Python and Typescript and have never found a solution that made me happy when it comes to pattern matching / destructuring. Sorry :'(
> Because we are essentially inventing our own query language here. And inventing a custom language should be pretty much the last resort, I hope we agree on that.
No. I don't agree at all. Creating a custom query language based on JSON or Lisp or whatever is easy enough, flexible enough, open to evolution, and helps build an abstraction between the front end and the back end. With this in mind, the day you decide to query an ElasticSearch instead of the RDBMS, the frontal API doesn't need to change; there's just the translation layer to update so it generates an ES query instead of SQL.
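For instance, the article's bicycle query might serialize to something like this (one of many possible shapes for such a JSON filter language):

{"or": [
  {"and": [{"eq": ["material", "steel"]}, {"between": ["weight", 10, 20]}]},
  {"and": [{"eq": ["material", "carbon"]}, {"between": ["weight", 5, 10]}]}
]}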
For example, you might have a table "users" but while "admins" can see all users, "moderators" might only be allowed to see the users of their moderated areas.
As soon as this kind of flexibility is needed, solutions like Postgrest stop working - they are just not made for that level of customization.