Please do run this on your own workloads! It's fairly easy to set up and run. I tried it a few weeks ago against a large test suite and saw huge perf benefits, but also found a memory allocation regression. In order for this v2 to be a polished release in 1.26, it needs a bit more testing.
Having done a bit of data engineering in my day, I'm growing more and more allergic to the DataFrame API (which I used 24/7 for years). From what I've seen over the past ~10 years, 90+% of use cases would be better served by SQL, both from a development perspective and for debugging, onboarding, sharing, migrating, etc.
Give an analyst AWS Athena, DuckDB, Snowflake, whatever, and they won't have to worry about looking up what m6.xlarge is and how it's different from c6g.large.
I agree with this 100%. The creator of DuckDB argues that people using pandas are missing out on the 50 years of progress in database research, in the first 5 minutes of his talk here [1].
I've been using Malloy [2], which compiles to SQL (like TypeScript compiles to JavaScript), so instead of editing a 1000-line SQL script, it's only 18 lines of Malloy.
I'd love to see a blog post comparing a pandas approach to cleaning to an SQL/Malloy approach.
> The creator of DuckDB argues that people using pandas are missing out on the 50 years of progress in database research, in the first 5 minutes of his talk here.
That's pandas. Polars builds on much of the same 50 years of progress in database research by offering a lazy DataFrame API which does query optimization, morsel-based columnar execution, predicate pushdown into file I/O, etc, etc.
Disclaimer: I work for Polars on said query execution.
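For anyone curious, here's a minimal sketch of what that lazy API looks like (the file name and column names are made up for illustration):

    import polars as pl

    # Nothing is read yet; this just builds a query plan.
    lf = (
        pl.scan_parquet("events.parquet")      # hypothetical file
          .filter(pl.col("country") == "NL")   # hypothetical columns
          .group_by("user_id")
          .agg(pl.col("amount").sum())
    )

    print(lf.explain())  # the optimized plan, incl. the pushed-down filter
    df = lf.collect()    # only now does the query actually run

Because the whole plan is known up front, the filter and the column selection can be pushed into the parquet scan instead of reading everything into memory first.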
The DataFrame interface itself is the problem. It's incredibly hard to read, write, debug, and test. Too much work has gone into reducing keystrokes rather than developing a better tool.
Not sure what you mean by this. The table concept is as old as computers themselves. Here is a table, do something with it: that's the high-level dataframe API. All the functions make sense; what is hard to read, write, or debug here?
I have used Polars to process 600M of XML files (with a bit of a hack), and the Polars part of the code is readable with minimal comments.
Polars has a better API than pandas; at least the intent is easier to understand. (Laziness, yay.)
The problem with the dataframe API is that whenever you want to change a small part of your logic, you usually have to rethink and rewrite the whole solution. It is too difficult to write reusable code. Too many functions that try to do too many things with a million kwargs that each have their own nuances. This is because these libraries tend to favor fewer keystrokes over composable design. So the easy stuff is easy and makes for pretty docs, but the hard stuff is obnoxious to reason through.
With all due respect, have you actually used the Polars expression API? We actually strive for composability of simple functions over dedicated methods with tons of options, where possible.
The original comment I responded to was confusing Pandas with Polars, and now your blog post refers to Numpy, but Polars takes a completely different approach to dataframes/data processing than either of these tools.
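To make the composability point concrete, here's a small sketch of expressions as ordinary reusable functions (the column and function names are invented):

    import polars as pl

    # Expressions are plain values, so small pieces compose and can be reused.
    def normalized(col: str) -> pl.Expr:
        return (pl.col(col) - pl.col(col).mean()) / pl.col(col).std()

    def share_of_total(col: str) -> pl.Expr:
        return pl.col(col) / pl.col(col).sum()

    df = pl.DataFrame({"region": ["a", "a", "b"], "sales": [10.0, 20.0, 30.0]})
    out = df.select(
        normalized("sales").alias("sales_z"),
        share_of_total("sales").alias("sales_share"),
    )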
I have used NumPy, but I don't understand what it has to do with dataframe APIs.
Take two examples of dataframe APIs, dplyr and Ibis. Both can run on a range of SQL backends, because dataframe APIs are very similar to SQL DML.
Moreover, the SQL translations of tools like pivot_longer in R are a good illustration of the dynamic behavior dataframe APIs can support, behavior you'd otherwise implement with something like dbt in your SQL models. DuckDB allows dynamic column selection in UNPIVOT, but in some SQL dialects this is impossible; dataframe-API-to-SQL tools (or dbt) enable it in those dialects.
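For example, DuckDB's dynamic UNPIVOT looks roughly like this from Python (table and column names are invented; syntax as in recent DuckDB versions):

    import duckdb

    # A tiny wide table to unpivot.
    duckdb.sql("""
        CREATE TABLE monthly_sales AS
        SELECT 1 AS empid, 10 AS jan, 20 AS feb, 30 AS mar
    """)

    # Dynamic column selection: every column except empid becomes (month, sales) rows.
    print(duckdb.sql("""
        UNPIVOT monthly_sales
        ON COLUMNS(* EXCLUDE (empid))
        INTO NAME month VALUE sales
    """).fetchall())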
In the same talk, Mark acknowledges that "for data science workflows, database systems are frustrating and slow." Granted, DuckDB is an attempt to fix that, but most data scientists don't get to choose what database the data is stored in.
Why is the dataframe approach getting hate when you’re talking about runtime details?
I get that folks appreciate the almost conversational nature of SQL vs. that of the dataframe API, but the other points make no difference.
If you're a competent dev/data person and are productive with dataframes, then yay. Also, setup and creating test data and such is all objects and functions after all; if anything, it's better than the horribad experience of ORMs.
As a user? No, I don't have to choose. What I'm saying is that analysts (who this Polars Cloud targets, just like Coiled or Databricks) shouldn't worry about instance types, shuffling performance, join strategies, JVM versions, cross-AZ pricing etc. In most cases, they should just get a connection string and/or a web UI to run their queries, everything abstracted from them.
Sure, Python code is more testable and composable (and I do love that). Have I seen _any_ analysts write tests or compose their queries? I'm not saying these people don't exist, but I have yet to bump into any.
You were talking about data engineering. If you do not write tests as a data engineer, what are you doing then? Just hoping that you don't fuck up editing a 1000-line SQL script?
If you use Athena you still have to worry about shuffling and joining; it is just hidden. It's Trino/Presto under the hood, and if you click Explain you can see the execution plan, which is essentially the same as looking at the Spark UI.
Who cares about JVM versions nowadays? No one is hosting Spark themselves.
Literally every tool now supports DataFrame AND SQL APIs, and to me there is no reason to pick up SQL if you are familiar with a little bit of Python.
I was talking about data engineering, because that was my job and all analysts were downstream of me. And I could see them struggle with handling infrastructure and way too many toggles that our platform provided them (Databricks at the time).
Yes, I did write tests and no, I did not write 1000-line SQL (or any SQL for that matter). But I could see analysts struggle, and I could see people in other orgs just firing off simple SQL queries that did the same as the non-portable Python mess that we had to keep alive. (Not to mention the far superior performance of database queries.)
But I knew how this all came to be: a manager wanted to pad their resume with some big data acronyms, and as a result we spent way too much time and money migrating to an architecture that made everyone worse off.
I really doubt that Polars Cloud targets analysts doing ad-hoc analyses. It is much more likely aimed at people who build data pipelines for downstream tasks (ML etc.).
> analysts (who this Polars Cloud targets, just like Coiled or Databricks) shouldn't worry about instance types, shuffling performance, join strategies,
I think this part (query optimization) is in general not solved/solvable, and it is sometimes/often (depending on the domain) necessary to dig into the details to get a data transformation working.
Analysts don’t because it’s not part of the training & culture. If you’re writing tests you’re doing engineering.
That said the last Python code I wrote as a data engineer was to run tests on an SQL database, because the equivalent in SQL would have been tens of thousands of lines of wallpaper code.
I find it much more beneficial to lower the barrier for entry (oftentimes without any sacrifices) instead of spending time and money on upskilling everyone, just because I like engineering.
Right, but nobody is saying Polars or dataframes are meant to replace SQL, or that they're even for the masses. It's a tool for skilled folks. I personally think the API makes sense, but SQL is easier to pick up. Use whatever tools work best.
But coming into such a discussion dunking on a tool cuz it’s not for the masses makes no sense.
Read my posts again, I'm not complaining it's not for the masses, I know it isn't. I'm complaining that it's being forced upon people when there are simpler alternatives that help people focus on business problems rather than setting up virtual environments.
So I'm very much advocating for people to "[u]se whatever tools work best".
(That is, now I'm doing this. In the past I taught a course on pandas data analytics and spoke at a few PyData conferences and meetups, partly about dataframes and how useful they are. So I'm very much guilty of all of the above.)
Who is doing the forcing? In my decade as a data engineer, I've not found a place that forced dataframes on would-be or capable SQL analysts.
Fun aside: I actually used Polars for a bit. The first time I tried it, I thought it was broken, because it finished processing so quickly I assumed it had silently exited or something.
So I'm definitely a fan, IF you need the DataFrame API. My point was that most people don't need it and it's oftentimes standing in the way. That's all.
Polars is very nice. I've used it off and on. The option to write Rust UDFs for performance, and the easy integration of Rust with Python via PyO3, will make it a real contender.
Yes, I know Spark and Scala exist. I use them. But the underlying Java engines and the tacky Python gateway impact performance and capacity usage. Having your primary processing engine compiled natively and running in the same process always helps.
I think your argument focuses a lot on the scenario where you already have cleaned data (i.e., a data warehouse). I and many other data engineers agree: you're better off hosting that in a SQL RDBMS.
However, before that, you need a lot of code to clean the data, and raw data does not fit well into a structured RDBMS. Here you choose to map your raw data into either a row view or a table view. You're now left with the choice of either inventing your own domain object (row view) or using a dataframe (table view).
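Roughly, the two options look something like this (a sketch with made-up names, using Polars for the table view):

    from dataclasses import dataclass
    import polars as pl

    # Row view: invent a domain object per record.
    @dataclass
    class RawEvent:
        user_id: str
        payload: dict

    events = [RawEvent("u1", {"clicks": 3}), RawEvent("u2", {"clicks": 5})]

    # Table view: the same records flattened into columns of a dataframe.
    df = pl.DataFrame({
        "user_id": [e.user_id for e in events],
        "clicks": [e.payload.get("clicks") for e in events],
    })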
I agree, but there are other possibilities in between those two extremes, like Quivr [1]. Schemas are good, but they can be defined in Python and you get a lot more composability and modularity than you would find in SQL (or pandas, realistically).
100% agree. I've also worked as a data engineer and came to the same conclusion. I wrote up a blog which went into a bit more depth on the topic here: https://www.robinlinacre.com/recommend_sql/
For those not familiar with this kind of computing challenges, I must link this wonderful video about TypeScript types running... DOOM. https://www.youtube.com/watch?v=0mCsluv5FXA
Whoa, I've been battling with an issue where my new M4 Mac Mini, disconnected from all peripherals, just sips power and gets rather warm overnight. Cools itself when I wake it up. I wonder if the changes suggested in the post and the comments here will help me resolve this issue. Thanks for upvoting this.
I remember just adding one letter to the instance type in EMR to switch from Intel to ARM... and saw a 20% speedup as well as an additional 20% hourly cost saving; all in all roughly a 1/3 cost decrease (about 0.8 × 0.8 of the original cost) from a single character.
We saw similar gains going from Intel to Graviton 2 on EMR: 15% to 40% speedups depending on the job. Saw similar gains in EMR serverless switching from x86 to arm as well.
The only issue I've had is actually getting enough on-demand instances in certain regions during peak times.
I can't wait for Graviton 3 to be available on EMR.
I think they are quite different. Postlite expects you to connect over the Postgres wire protocol. Sqld is compiled into your application, so your application behaves as if it's talking to an embedded SQLite; the calls are then made over the network (using one of three available transports) before the results are returned to your application.
Dumb question, with all of this newfangled WASM stuff, why couldn't we also bake the Postgres server (and then client) into the code? I know the WASM runtime would need to expose the low-level ability to do filesystem operations, mmap()ing, network sockets, etc.
I encourage anyone to study these protocols, they are sometimes a lot more prevalent than one might think. Take the Postgres wire protocol, it's supported by a number of databases out there.
Some years ago I built a data-grepping tool but didn't want to build a UI around it, so I started looking into the Postgres protocol so that I could plug in existing tools. Getting the basics right was not that difficult; the docs are pretty clear. https://www.postgresql.org/docs/current/protocol.html
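The handshake really is approachable. A minimal sketch of the StartupMessage in Python (host, port, and user are placeholders; no auth or TLS handling):

    import socket
    import struct

    # StartupMessage: Int32 length (incl. itself), Int32 protocol version 3.0,
    # then null-terminated key/value pairs, ending with an extra null byte.
    params = b"user\x00postgres\x00database\x00postgres\x00\x00"
    body = struct.pack("!i", 196608) + params
    msg = struct.pack("!i", len(body) + 4) + body

    with socket.create_connection(("localhost", 5432)) as s:
        s.sendall(msg)
        print(s.recv(1))  # b"R" = authentication request from the server

From there it's mostly reading length-prefixed, type-tagged messages, which is part of why so many databases find it worthwhile to speak the same protocol.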
Thanks for mentioning Postgres! I am actually pretty curious about the difference between MySQL and Postgres when it comes to their docs about the protocol. My superficial impression is that the Postgres docs are more thorough, leaving less room for ambiguity (some pages of the MySQL docs seem sloppy in comparison, e.g. those documenting the text encoding of values). I would be interested to read an analysis of the development culture of both projects, based on their docs.
For me, the main deterrent to Gitea has been its approach to UI design. The shameless inspiration (if you can call it that) it took from GitHub is pretty startling; the two interfaces look pretty much identical (well, GitHub has recently redesigned its repo page, but many will remember the old design).
Sure, get inspired, make something similar here or there, but this is an outright copy of the whole product design.
I just compared Gitea and Gitlab side-by-side and they're very, very similar. So either there's also copying between GitHub and Gitlab, or the design of git lends itself to a very specific interface.
> So either there's also copying between GitHub and Gitlab
Isn't it likely that there's just a certain kind of interface and functionality that both works and people have also gotten used to it over time?
It might make a lot of sense to copy it to at least some degree, instead of reinventing the wheel: if you look at MS Office and something like LibreOffice, you'll notice that both of the spreadsheet apps are rather similar (and you can even enable a ribbon interface in LibreOffice, if you want).
I think the same more or less applies to every piece of software, from how phone OSes look, to how desktop environments look and work, as well as why the majority of websites out there look a bit samey.
For me this is a benefit, GitHub's UI in my opinion is really good and I'm glad I don't have to learn a new one for Gitea.
Whenever I need to report a bug on a GitLab instance, its UI is a big source of friction (especially since GitLab's UI is still annoyingly slow when loading things like issue comments).
Can anyone share their experience with the somewhat new STRICT mode? Does it help? I tend to use Postgres when available, primarily for the added strictness, but I'd surely prefer SQLite in more scenarios, as it's easier to operate.
I use strict tables. I'm now realising it did help when I migrated tables from one data type to another, because I'd missed updating some of the code.
I didn’t realise because it was just what I’d expected, but without strict tables I’d have had to debug strange errors on retrieval rather than the type error on insertion I got.
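A quick sketch of what that looks like from Python's sqlite3 (needs an SQLite build with STRICT support, i.e. 3.37+):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (id INTEGER, name TEXT) STRICT")
    con.execute("INSERT INTO t VALUES (1, 'ok')")             # fine

    try:
        con.execute("INSERT INTO t VALUES ('one', 'oops')")   # wrong type for id
    except sqlite3.Error as exc:
        print(exc)  # fails at insert time instead of surprising you on read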
We have extremely heavy SQLite usage throughout. Strict mode would only cause trouble for us. Most of our SQLite access is managed through a domain-specific C# library we developed. It handles all of the parsing & conversion internally. We don't touch the databases directly in most cases.
I've run into situations where I need a db, but all I have is Python, so the embedded SQLite is my best option. It would be ideal if the interface was fully compatible with postgres (or a practical subset), even if SQLite was handling storage under the hood.
- It is annoying to have it per table and not database
- No alter table (to convert existing tables)
- No backward support (maybe through connection string or something)
It's a wonderful tool and has simplified workflows for many; just be mindful of its one giant security implication. Should your database credentials ever leak in any way (lost/stolen property, an incorrect git commit, a screencasting mishap, ...), an Adminer/phpMyAdmin instance running on an otherwise unrestricted address opens up (quite literally) your server to a world of trouble.
(I speak from experience. I've seen a lot of credentials leak together with source code. This meant I saw "secret" paths where Adminer was hosted on a given site. Having database credentials meant I didn't have to somehow circumvent network security to get to the database itself. [Yes, I disclosed all of this to the server operator and even got a measly $50 gift card out of it :-)])
Yeah, at previous jobs there was a db.devopsdomain.com (etc.) which acted as a bastion into the network, from which you could connect with DB credentials to the actual databases.
It's convenient, I will say that. I've tried not to replicate this exactly; instead I have Adminer running on a server on the network, but only listening for localhost connections. To get access to it, you can then SSH tunnel the ports locally:
    # Route 8082 on your machine to localhost:8080 on the server:
    ssh -N -L 8082:localhost:8080 db.devopsdomain.com
I've found the biggest issue with this is that people forget they've deployed it. It just gets left there until it's sufficiently out of date that a security issue pops up.