I wrote a SQL engine in Python (github.com/marsupialtail)
282 points by marsupialtail_2 on Dec 30, 2022 | hide | past | favorite | 73 comments


I think the most interesting part of this project is the fault tolerance. I can’t say I’ve seen any other projects do this, but it seems reasonable to want checkpointing during a long computation.

Another thing I like is that conceptually it seems like it would be simple to switch the underlying query engine (right now it’s Polars) in the future. Seems like a pretty general distributed system.


Checkpointing is actually a bit difficult for fault tolerance in long-running batch computations: state tends to grow linearly as the computation progresses, unlike streaming applications, where you typically use windows to keep state bounded. This means periodic checkpointing leads to a quadratic amount of writes to durable storage, which is pretty bad.

Quokka supports checkpointing but does not enable it by default, to prevent this common problem from killing normal operation performance.
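A back-of-the-envelope sketch of the quadratic blowup (hypothetical numbers, nothing to do with Quokka's actual checkpoint format): if state grows linearly with progress and every checkpoint snapshots the full state, doubling the job length roughly quadruples the total bytes written.

```python
# Hypothetical illustration: full-state checkpoints of linearly growing
# state lead to a quadratic total volume of writes to durable storage.

def total_checkpoint_bytes(steps, state_growth_per_step, checkpoint_every):
    """Sum the snapshot sizes over a run of `steps` steps."""
    total = 0
    for step in range(checkpoint_every, steps + 1, checkpoint_every):
        state_size = step * state_growth_per_step  # state grows linearly
        total += state_size                        # each snapshot writes it all
    return total

short = total_checkpoint_bytes(1_000, 1, 10)
long_run = total_checkpoint_bytes(2_000, 1, 10)
print(short, long_run, long_run / short)  # 2x the steps, ~4x the writes
```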


Fault tolerance is an important feature but, given Python’s popularity in data science, the most interesting part of this project to me is support for Python UDFs.

In principle, the programming language should not be the greatest consideration because developers can learn and use different languages for different applications. In practice, being able to draw on familiar syntax and libraries can make a real difference in usability.


Perhaps you can write the underlying query engine in something that's not Python :-) lol


Even the self-driving in Tesla cars uses Python. If your Python is slow, you are using Python wrong. So it's kind of a good thing that it's natively slow, because you are not supposed to do your computations in native Python. If it were faster (with lots of type hints and build processes added), you would probably not be nudged in the right direction.

https://cerfacs.fr/coop/fortran-vs-python


Self driving for TSLA cars is not really the benchmark I'd use for "something that works reliably"


I don’t think there’s any evidence that FSD’s reliability is affected by being written in Python.


Polars is already not python


I don't have very much background in ML or distributed systems, so forgive my naive questions...

> After all, most ML in industry today seems to be lightweight models applied to heavily engineered features

I assume "lightweight models" are those that don't have too many parameters, and "heavily engineered features" mean that the data fed into the model has undergone significant pre-processing via potentially complicated UDFs -- hence the motivation for the project. Is that right?

> Quokka is an open-source push-based vectorized query engine ... it is meant to be much more performant than blocking-shuffle based alternatives like SparkSQL

Does anyone have pointers to what push-based vs blocking-shuffle engines are? Any good papers?

> It should work on local machine no problem (and should be a lot faster than Pandas!)

So I understand why Quokka is faster than Spark, but I'm a bit uncertain as to why the author is also making a comparison with Pandas on a single machine. Is it because the streaming pipeline design means that Quokka can better take advantage of multiple cores?


That's right. My background is mostly in quantitative finance, where we would use models like linear regression on expert-engineered features based on market data, instead of throwing a deep neural network at raw price data as some people might imagine.
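As a toy illustration of that pattern (made-up prices, not a real strategy): engineer a simple feature from raw prices, then fit a one-coefficient linear regression to it.

```python
# "Lightweight model on an engineered feature": predict the next price
# move from a moving-average gap using one-variable OLS through the origin.

def moving_average(xs, window):
    return [sum(xs[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(xs))]

def fit_slope(features, targets):
    """One-variable OLS through the origin: slope = sum(x*y) / sum(x*x)."""
    sxy = sum(x * y for x, y in zip(features, targets))
    sxx = sum(x * x for x in features)
    return sxy / sxx

prices = [100, 101, 103, 102, 105, 107, 106, 109, 111, 110]
ma = moving_average(prices, 3)
# Engineered feature: gap between the price and its 3-period moving average.
gap = [p - m for p, m in zip(prices[2:], ma)]
# Target: the next-step price change.
moves = [prices[i + 1] - prices[i] for i in range(2, len(prices) - 1)]
slope = fit_slope(gap[:-1], moves)
print(round(slope, 4))
```

In a Quokka-style pipeline, the feature engineering would be the heavy UDF work and the model itself stays tiny.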

For push vs. pull, I'd recommend: https://news.ycombinator.com/item?id=27006476.

On single machine, you really should just use Polars. Quokka is faster than Pandas because it can take advantage of multiple cores, but so can Polars -- and it is likely to be faster.
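For intuition on push vs. pull (a toy sketch in plain Python, not how Quokka, Spark, or any real engine is implemented): in a pull-based engine the consumer polls each operator for its next batch, while in a push-based engine producers forward batches downstream as soon as they are ready.

```python
# Toy contrast of the two control-flow styles.

# Pull-based: the consumer drives; each operator is a generator
# that is polled for its next batch.
def scan_pull(batches):
    yield from batches

def filter_pull(upstream, predicate):
    for batch in upstream:
        kept = [row for row in batch if predicate(row)]
        if kept:
            yield kept

# Push-based: the producer drives; each operator eagerly forwards
# results to a downstream callback as soon as a batch is ready.
def scan_push(batches, downstream):
    for batch in batches:
        downstream(batch)

def filter_push(predicate, downstream):
    def on_batch(batch):
        kept = [row for row in batch if predicate(row)]
        if kept:
            downstream(kept)
    return on_batch

batches = [[1, 2, 3], [4, 5, 6]]
pulled = list(filter_pull(scan_pull(batches), lambda x: x % 2 == 0))
pushed = []
scan_push(batches, filter_push(lambda x: x % 2 == 0, pushed.append))
print(pulled, pushed)  # both: [[2], [4, 6]]
```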


Thanks for the answers! And the push vs pull link is a great explanation indeed :)


Sorry if I missed it -- are there plans to offer a way to query this in actual SQL? I believe SingleStore is MySQL-compatible, for example, which I think is a nice feature. Basically I want to be able to interact with this much like I'd interact with another database I'm using, or perhaps with a SQLAlchemy Core integration (which both SingleStore and Snowflake have).


Yes -- we are building a SQL compiler that targets the dataframe API. It currently passes half of TPC-H but is not really ready. It will be open-sourced soon.

The SQL optimizations like predicate pushdown and early projection are all there already in the dataframe API, similar to Polars.


> It will be open-sourced soon.

Just reminds many of such projects like Lightworks:

> Editor's note: The intent for Lightworks to go open source seems to have been abandoned.

[0] https://opensource.com/business/12/10/lightworks-linux-devel...


Can you explain how this might differ from something like https://github.com/apache/arrow-ballista

I've seen several variants of "next-gen" spark, but nowhere have I really seen the different tradeoffs/advantages/disadvantages between them.


Firstly, Quokka is pure Python, which I believe is good for interoperability with Python-based UDFs.

Secondly, Quokka tries to be fault tolerant, i.e. it can handle worker failures intra-query without having to start over. This is quite important in real-world Spark deployments with thousands of nodes running for many hours. AFAIK this is not well supported by most Spark alternatives.

Finally, Quokka has a much stronger focus on time-series data analytics. It is meant to excel at workloads like range joins and asof/PIT joins used for feature engineering. (This part isn't too stable yet, so it is not open source.)

This means Quokka optimizes for a different point in the UDF/performance/fault-tolerance tradeoff space than something like Arrow Ballista or StarRocks, which I think are pure performance plays.


Interesting, I was wondering if you considered building on top of https://github.com/apache/arrow-datafusion-python

I really do think a distributed db with compute/storage separation and optimized for feature engineering/dataloading (for training NNs) is underserved.

I'd be very interested in the time series aspects of what you're building.


hmm I wasn't aware of https://github.com/apache/arrow-datafusion-python... thanks for the pointer.

Time series: target release by April this year. The main challenge is supporting them in the SQL API -- execution engine support is already done.


Thanks for sharing.

I have a SQL engine in Python too (https://github.com/mabel-dev/opteryx). I focused my initial effort on supporting SQL statements and making the usage feel like a database -- that probably reflects the problem I had in front of me when I set out: handling handfuls of gigabytes in a batch environment for ETLs, with a group of new-to-data-engineering engineers. I have recently started looking more at real-time performance, such as distributing work. I'm interested in how you've approached it.


Wait this is really cool. How do you parse and optimize SQL?


It uses https://github.com/sqlparser-rs/sqlparser-rs as the parser and lexer. The binder, planner, optimizer and executor are in Python. The optimizer stage only works on the logical plan and the rules are heuristic only.


Trino can be fault tolerant but you have to explicitly enable fault tolerant execution.

It might be worth running your benchmarks against Trino with fault tolerant execution mode enabled. Check the documentation here: https://trino.io/docs/current/admin/fault-tolerant-execution...

Adding fault-tolerant execution to Trino was a big and complicated project; for anyone interested in more details, check here: https://trino.io/blog/2022/05/05/tardigrade-launch.html


I have. Quokka is about 2x faster than Trino with fault tolerance enabled. But I didn't put the numbers on that plot because this Trino feature is still really new and I might not be benchmarking it in the best way.


That’s awesome. For anything Trino - project Tardigrade related, reach out to any of the maintainers, they’ll be happy to help.


One significant disadvantage of PySpark is its reliance on py4j to serialize and deserialize objects between Java and Python when using Python UDFs. This constant overhead can become burdensome as data volume increases in such an exchange. However, I am glad to see efforts to create a data pipeline framework using Python and Ray.

~One suggestion, a Scala/Java Spark run of those benchmarks should be a valid baseline to compare against as well instead of PySpark.~ Ah it's SparkSQL so the execution probably wouldn't have much of py4j involvement, except for the collect.


There are also pandas UDFs, which use Arrow as the exchange format. I assume it still has to copy the data (?), but it makes the (de)serialization fast and allows for vectorized operations.

https://spark.apache.org/docs/3.0.0/sql-pyspark-pandas-with-...


None of the benchmarks involved any UDFs.


Funnily enough I'm currently writing a NoSQL database in Python.


Name?


Why?


Most likely an autodidactic exercise. I'd love to see the source, even if, or especially if, it's half-finished. Seeing someone work their way through the snarls in such a project would be useful.


Partly that.


> A library to parse and optimize SQL,

That's like saying "a library to parse and optimize computer programs", except probably even harder, since a compiler and runtime library can't make many assumptions about the programs they need to run, so they're limited in how much of that context information they can exploit.

Countless person-years have been spent on this and it's still a very active field of research and engineering.

> 2x SparkSQL performance

Ah, ok, so it can be slow. Never mind then, carry on :-P


I'm the author of SQLGlot. Optimizing SQL is quite common and means things like projection / predicate push downs and logical simplification. SQL can be optimized relatively easily compared to a programming language because it is declarative. For example

SELECT * FROM (SELECT * FROM x) WHERE z = 1

can be optimized into

SELECT * FROM x WHERE z = 1
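That rewrite can be mimicked in a few lines (a toy over a dict-based query representation, nothing like SQLGlot's real internals): a trivial derived table -- a bare SELECT * with no predicate -- is merged into its parent.

```python
# Toy subquery flattening over a dict-based logical plan.

def flatten(query):
    """Merge a trivial derived table (SELECT * with no WHERE) into its parent."""
    sub = query.get("from")
    if (isinstance(sub, dict) and sub.get("select") == "*"
            and "where" not in sub):
        query = {**query, "from": sub["from"]}
    return query

def to_sql(query):
    frm = query["from"]
    frm = f"({to_sql(frm)})" if isinstance(frm, dict) else frm
    sql = f"SELECT {query['select']} FROM {frm}"
    if "where" in query:
        sql += f" WHERE {query['where']}"
    return sql

nested = {"select": "*",
          "from": {"select": "*", "from": "x"},
          "where": "z = 1"}
print(to_sql(nested))           # SELECT * FROM (SELECT * FROM x) WHERE z = 1
print(to_sql(flatten(nested)))  # SELECT * FROM x WHERE z = 1
```

Because SQL is declarative, the optimizer can prove the two forms equivalent from the structure alone, without reasoning about execution order.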


Oh, you mean optimizing the SQL as SQL, not optimizing the query execution?

Well, some of these transformations are useful, like the one you presented. But any non-trivial transformation may be either beneficial or detrimental, depending on a myriad of factors including memory layout, compression, distribution of data, computing hardware etc.


sql can be arbitrarily nested/deep - so how does the code 'know' what to do



Does this only optimize things like `select * from table where false`-type stuff, or can it optimize access paths as well? Such as whether an index can be used, and over what ranges, or whether a sequential scan should be used instead?


it can do everything that can be done with the query and schema. it doesn't yet do indices or physical statistics


SQLGlot is really great, check it out!


The core of it is Rust:

> Very fast kernels for SQL primitives like joins, filtering and aggregations. Quokka uses Polars to implement these. (I sponsor Polars on Github and you should too.) I am also exploring DuckDB, but I have found Polars to be faster so far.


PyArrow is also a CPython extension wrapping a C++ library.

I agree that it’s a little disingenuous to call it “pure Python” when the two libraries doing the heavy lifting are non-Python; but it’s not a lie that the entirety of the Quokka-specific codebase is Python.

Personally, what I would be more interested in (and what I thought this would be from the title) is a full SQL engine wholesale coded in Python, a la SQLite. Even if it wasn’t super performant or functional.


> Personally, what I would be more interested in (and what I thought this would be from the title) is a full SQL engine wholesale coded in Python, a la SQLite. Even if it wasn’t super performant or functional.

Here you go!

https://news.ycombinator.com/item?id=34192138


that's certainly one way to look at it


Am I incorrect? Elsewhere in the thread you say this project is pure Python and then in the post you say you use a Rust library for queries. Am I misunderstanding something?


Technically it’s worded that “[they] only wrote python”, which is true, in a literal sense.

I agree that saying “only” python gives an inaccurate implication, but it seems non-malicious and strictly correct. I don’t think they are trying to be misleading.


Can't a pure Python project use any library that's not a pure Python project?

Is python-non-purity something that taints all dependants for you?


When people say "pure X", to me, it normally means they didn't involve an FFI or external compiler. This is an often beneficial thing (though also often at the expense of performance and maybe correctness) since it simplifies your build process.

For example, here [0] is a "pure Python postgres driver" and the implication is that it doesn't use libpg.

Or see also this discussion [1].

I don't mean that the author should or shouldn't do whatever they want. I just wouldn't use the term "pure" here or even say this is an engine written in Python since a large chunk of it is not written in Python. That doesn't mean this is not a cool project. It's a cool project!

[0] https://github.com/tlocke/pg8000

[1] https://www.reddit.com/r/learnpython/comments/nktut1/eli5_th...


"Pure Python" is commonly used to refer to code that has no external dependencies.


I only wrote Python. I am using C++ and Rust libraries


If I write a shell script looking something like:

  #!/bin/sh
  exec rg $@
Then I did not actually write ripgrep in pure shell.


your example clearly has nothing to do with the OP's claims.


"Pure Python" ultimately means that it all runs within the python interpreter, with no dependencies on code that runs outside of the interpreter (standard library notwithstanding).


I'm interested in this just for the use case of 'a sql frontend for polars' - I wonder if just that part could be used independently?


Polars has basic SQL support already, it’s just not that well documented. You’ll need to compile Polars with the ‘sql’ feature flag.

Not sure what the process looks like through the Python API. Maybe @ritchie46 can chime in?


Yes, we have basic support.

Here are some examples of how to use it in python:

https://github.com/pola-rs/polars/blob/91a419acaf024e64410e7...

However, full sql support is on the roadmap. It's just a matter of hours in a day...


Polars is adding that feature itself. You probably just have to wait a couple weeks


I have previously used Dask to handle larger data sets, and that made me wonder - how does Quokka compare to Dask?


I wrote a toy Python database with distributed SQL, Cypher graph, DynamoDB-style, and document storage interfaces, but it's more for experimentation than serious use. It's not ready for use; it's more a demonstration of how little code it takes to write a database.

https://GitHub.com/samsquire/hash-db


Always wanted something like a Crafting Interpreters-style book but for implementing a database from scratch, with bonus points for a distributed system with a leader/replica model, leader election, and things like partitioning, resizing partitions, and handling node failure.

Something like building a toy dynamodb variant. Would pay good money for this.

It is one thing to read about and guesstimate implementation choices from a white paper. It's totally another if a databases/distributed-systems expert walks you through propagating write-ahead logs to replicas. Hehe

And what it means to have eventually consistent writes vs. strong writes in practice. A way to teach by doing.
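The WAL-propagation idea can be sketched in a few lines of Python (a toy with no networking, durability, or failure handling; `Leader` and `Replica` are hypothetical names): the leader appends every write to a log, a replica applies the log with some lag, and reads from the replica are only eventually consistent.

```python
# Toy write-ahead log replication: a lagging replica serves stale reads
# until it catches up with the leader's log.

class Leader:
    def __init__(self):
        self.wal = []   # ordered log of (key, value) writes
        self.data = {}

    def write(self, key, value):
        self.wal.append((key, value))  # log the write first...
        self.data[key] = value         # ...then apply it to local state

class Replica:
    def __init__(self):
        self.data = {}
        self.applied = 0  # index into the leader's WAL

    def catch_up(self, wal, max_entries=None):
        stop = len(wal) if max_entries is None else self.applied + max_entries
        for key, value in wal[self.applied:stop]:
            self.data[key] = value
        self.applied = min(stop, len(wal))

leader, replica = Leader(), Replica()
leader.write("k", 1)
leader.write("k", 2)
replica.catch_up(leader.wal, max_entries=1)
stale = replica.data["k"]        # 1: the replica lags behind the leader
replica.catch_up(leader.wal)
fresh = replica.data["k"]        # 2: consistent once fully caught up
print(stale, fresh)
```

A "strong" read would instead go to the leader (or wait for the replica's `applied` index to reach the write's log position).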


I ought to spend some time documenting the code.

I prefer reading a description of an algorithm to reading its code. From the description I can work out how to implement it myself. I might look at someone's code for ideas, or to compare how I solved a problem.

I too would enjoy a whitepaper on database design. I document all the whitepapers I read on my GitHub profile (see my HN profile).

I have a simple btree here https://GitHub.com/samsquire/btree

Farley Knight added an AVL tree to the repository; I need to change the unbalanced tree to use it.

I would like to know how SQLite VM works so I can write a database VM. I learned some things from crafting interpreters book and started writing my own programming language here

https://GitHub.com/samsquire/multiversion-concurrency-contro...


I like good sql projects like this. I use duckdb for almost everything.


Python interpreter in SQL - the reverse of what was done - would've been really impressive. Terrible idea of course but impressive nonetheless.



I don't really compile Python functions to SQL -- I just execute SQL. It's just that the execution engine is written in Python.


Have you tested this with Jepsen?


The focus is OLAP and ETL, not OLTP. In fact I assume data doesn't change :-)


I haven't looked into this in detail, and it seems like a fine project at a glance, but this caught my attention from the introduction:

> When I set out, I had several objectives:

> Easy to install and run, especially for distributed deployments.

> [...]

> The first two objectives strongly scream Python as the language of choice for Quokka.

Python is probably one of the last languages I'd consider if ease of deployment is a priority. Packaging has historically been a mess, and deploying standalone binaries across platforms is a pain. State-of-the-art solutions are third-party and involve bundling the interpreter for each platform. It's been a few years since I last used it for anything serious, but I believe this is still the case.

Whereas something like Go actually makes this infinitely easier, for both the developer and the user. One native Go command builds a standalone binary for each platform. It couldn't be simpler.

The other objective of supporting Python UDFs necessarily ties you to Python. And since this is solving a data science problem, it makes sense for it to be written in Python.


Yeah it's just that the UDFs I see people need require Python packages like sklearn and PyTorch that really don't have good alternatives in other libraries.


Shouldn't there be "Show HN" in the title?


> Having lost all the money I made from my startup on shitcoins and the stock market, I returned to my PhD program to build a better distributed query engine, Quokka

Casual README slip of the year...


Let's make this top-rated comment :-P


I will be the first to admit I didn't add much of value, so if it actually happens I will have to downgrade my opinion of HN audience



