Please do run this on your own workloads! It's fairly easy to set up and run. I tried it a few weeks ago against a large test suite and saw huge perf benefits, but also found a memory allocation regression. In order for this v2 to be a polished release in 1.26, it needs a bit more testing.
Having done a bit of data engineering in my day, I'm growing more and more allergic to the DataFrame API (which I used 24/7 for years). From what I've seen over the past ~10 years, 90+% of use cases would be better served by SQL, both from a development perspective and for debugging, onboarding, sharing, migrating, etc.
Give an analyst AWS Athena, DuckDB, Snowflake, whatever, and they won't have to worry about looking up what m6.xlarge is and how it's different from c6g.large.
I agree with this 100%. The creator of DuckDB argues that people using pandas are missing out on the 50 years of progress in database research, in the first 5 minutes of his talk here [1].
I've been using Malloy [2], which compiles to SQL (like TypeScript compiles to JavaScript), so instead of editing a 1000-line SQL script, it's only 18 lines of Malloy.
I'd love to see a blog post comparing a pandas approach to cleaning to an SQL/Malloy approach.
> The creator of DuckDB argues that people using pandas are missing out on the 50 years of progress in database research, in the first 5 minutes of his talk here.
That's pandas. Polars builds on much of the same 50 years of progress in database research by offering a lazy DataFrame API which does query optimization, morsel-based columnar execution, predicate pushdown into file I/O, etc, etc.
Disclaimer: I work for Polars on said query execution.
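For anyone curious, here's a minimal sketch of what that lazy API looks like (the file name and column names are made up for illustration):

    import polars as pl

    # Nothing is read yet; this just builds a query plan.
    lf = (
        pl.scan_parquet("events.parquet")      # hypothetical file
          .filter(pl.col("country") == "NL")   # hypothetical columns
          .group_by("user_id")
          .agg(pl.col("amount").sum())
    )

    print(lf.explain())  # the optimized plan, incl. the pushed-down filter
    df = lf.collect()    # only now does the query actually run

Because the whole plan is known up front, the filter and the column selection can be pushed into the parquet scan instead of reading everything into memory first.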
The DataFrame interface itself is the problem. It's incredibly hard to read, write, debug, and test. Too much work has gone into reducing keystrokes rather than developing a better tool.
Not sure what you mean by this. The table concept is as old as computers themselves. Here is a table, do something with it: that's the high-level dataframe API. All the functions make sense; what is hard to read, write, or debug here?
I have used Polars to process 600M of XML files (with a bit of a hack), and the Polars part of the code is readable with minimal comments.
Polars has a better API than pandas; at least the intent is easier to understand. (Laziness, yay.)
The problem with the dataframe API is that whenever you want to change a small part of your logic, you usually have to rethink and rewrite the whole solution. It is too difficult to write reusable code. Too many functions that try to do too many things with a million kwargs that each have their own nuances. This is because these libraries tend to favor fewer keystrokes over composable design. So the easy stuff is easy and makes for pretty docs, but the hard stuff is obnoxious to reason through.
With all due respect, have you actually used the Polars expression API? We actually strive for composability of simple functions over dedicated methods with tons of options, where possible.
The original comment I responded to was confusing Pandas with Polars, and now your blog post refers to Numpy, but Polars takes a completely different approach to dataframes/data processing than either of these tools.
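To make the composability point concrete, here's a small sketch of expressions as ordinary reusable functions (the column and function names are invented):

    import polars as pl

    # Expressions are plain values, so small pieces compose and can be reused.
    def normalized(col: str) -> pl.Expr:
        return (pl.col(col) - pl.col(col).mean()) / pl.col(col).std()

    def share_of_total(col: str) -> pl.Expr:
        return pl.col(col) / pl.col(col).sum()

    df = pl.DataFrame({"region": ["a", "a", "b"], "sales": [10.0, 20.0, 30.0]})
    out = df.select(
        normalized("sales").alias("sales_z"),
        share_of_total("sales").alias("sales_share"),
    )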
I have used NumPy, but I don't understand what it has to do with dataframe APIs.
Take two examples of dataframe APIs, dplyr and Ibis. Both can run on a range of SQL backends, because dataframe APIs are very similar to SQL DML.
Moreover, the SQL translations of tools like pivot_longer in R are a good illustration of the dynamic behavior dataframe APIs can support, behavior you'd otherwise implement with something like dbt in your SQL models. DuckDB allows dynamic column selection in UNPIVOT, but in some SQL dialects this is impossible; dataframe-API-to-SQL tools (or dbt) enable it in those dialects.
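For example, DuckDB's dynamic UNPIVOT looks roughly like this from Python (table and column names are invented; syntax as in recent DuckDB versions):

    import duckdb

    # A tiny wide table to unpivot.
    duckdb.sql("""
        CREATE TABLE monthly_sales AS
        SELECT 1 AS empid, 10 AS jan, 20 AS feb, 30 AS mar
    """)

    # Dynamic column selection: every column except empid becomes (month, sales) rows.
    print(duckdb.sql("""
        UNPIVOT monthly_sales
        ON COLUMNS(* EXCLUDE (empid))
        INTO NAME month VALUE sales
    """).fetchall())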
In the same talk, Mark acknowledges that "for data science workflows, database systems are frustrating and slow." Granted, DuckDB is an attempt to fix that, but most data scientists don't get to choose what database the data is stored in.
Why is the dataframe approach getting hate when you’re talking about runtime details?
I get that folks appreciate the almost conversational nature of SQL vs. that of the dataframe API, but the other points make no difference.
If you're a competent dev/data person and are productive with dataframes, then yay. Also, setup and creating test data and such is all objects and functions after all; if anything, it's better than the horribad experience of ORMs.
As a user? No, I don't have to choose. What I'm saying is that analysts (who this Polars Cloud targets, just like Coiled or Databricks) shouldn't worry about instance types, shuffling performance, join strategies, JVM versions, cross-AZ pricing etc. In most cases, they should just get a connection string and/or a web UI to run their queries, everything abstracted from them.
Sure, Python code is more testable and composable (and I do love that). Have I seen _any_ analysts write tests or compose their queries? I'm not saying these people don't exist, but I have yet to bump into any.
You were talking about data engineering. If you do not write tests as a data engineer, what are you doing then? Just hoping that you don't fuck up editing a 1000-line SQL script?
If you use Athena you still have to worry about shuffling and joining; it is just hidden. It's Trino/Presto under the hood, and if you click Explain you can see the execution plan, which is essentially the same as looking at the Spark UI.
Who cares about JVM versions nowadays? No one is hosting Spark themselves.
Literally every tool now supports DataFrame AND SQL APIs, and to me there is no reason to pick up SQL if you are familiar with a little bit of Python.
I was talking about data engineering, because that was my job and all analysts were downstream of me. And I could see them struggle with handling infrastructure and way too many toggles that our platform provided them (Databricks at the time).
Yes, I did write tests and no, I did not write 1000-line SQL (or any SQL for that matter). But I could see analysts struggle, and I could see people in other orgs just firing off simple SQL queries that did the same as the non-portable Python mess that we had to keep alive. (Not to mention the far superior performance of database queries.)
But I knew how this all came to be: a manager wanted to pad their resume with some big data acronyms, and as a result we spent way too much time and money migrating to an architecture that made everyone worse off.
I really doubt that Polars Cloud targets analysts doing ad-hoc analyses. It is much more likely aimed at people who build data pipelines for downstream tasks (ML etc.).
> analysts (who this Polars Cloud targets, just like Coiled or Databricks) shouldn't worry about instance types, shuffling performance, join strategies,
I think this part (query optimization) is in general not solved/solvable, and it is sometimes/often (depending on the domain) necessary to dig into the details to get a data transformation working.
Analysts don’t because it’s not part of the training & culture. If you’re writing tests you’re doing engineering.
That said the last Python code I wrote as a data engineer was to run tests on an SQL database, because the equivalent in SQL would have been tens of thousands of lines of wallpaper code.
I find it much more beneficial to lower the barrier for entry (oftentimes without any sacrifices) instead of spending time and money on upskilling everyone, just because I like engineering.
Right, but nobody is saying Polars or dataframes are meant to replace SQL, or that they're even for the masses. It's a tool for skilled folks. I personally think the API makes sense, but SQL is easier to pick up. Use whatever tools work best.
But coming into such a discussion dunking on a tool cuz it’s not for the masses makes no sense.
Read my posts again, I'm not complaining it's not for the masses, I know it isn't. I'm complaining that it's being forced upon people when there are simpler alternatives that help people focus on business problems rather than setting up virtual environments.
So I'm very much advocating for people to "[u]se whatever tools work best".
(That is, now I'm doing this. In the past I taught a course on pandas data analytics and spoke at a few PyData conferences and meetups, partly about dataframes and how useful they are. So I'm very much guilty of all of the above.)
Who is doing the forcing? In my decade as a data engineer, I've not found a place that forced dataframes on would-be or capable SQL analysts.
Fun aside: I actually used Polars for a bit. The first time I tried it, I thought it was broken, because it finished processing so quickly I assumed it had silently exited or something.
So I'm definitely a fan, IF you need the DataFrame API. My point was that most people don't need it and it's oftentimes standing in the way. That's all.
Polars is very nice. I've used it off and on. The option to write Rust UDFs for performance, and the easy integration of Rust with Python via PyO3, will make it a real contender.
Yes, I know Spark and Scala exist. I use them. But the underlying Java engines and the tacky Python gateway impact performance and capacity usage. Having your primary processing engine compiled natively and running in the same process always helps.
I think your argument focuses a lot on the scenario where you already have cleaned data (i.e., a data warehouse). I and many other data engineers agree: you're better off hosting that in a SQL RDBMS.
However, before that, you need a lot of code to clean the data, and raw data does not fit well into a structured RDBMS. Here you choose to map your raw data into either a row view or a table view. You're now left with the choice of either inventing your own domain object (row view) or using a dataframe (table view).
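Roughly, the two options look something like this (a sketch with made-up names, using Polars for the table view):

    from dataclasses import dataclass
    import polars as pl

    # Row view: invent a domain object per record.
    @dataclass
    class RawEvent:
        user_id: str
        payload: dict

    events = [RawEvent("u1", {"clicks": 3}), RawEvent("u2", {"clicks": 5})]

    # Table view: the same records flattened into columns of a dataframe.
    df = pl.DataFrame({
        "user_id": [e.user_id for e in events],
        "clicks": [e.payload.get("clicks") for e in events],
    })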
I agree, but there are other possibilities in between those two extremes, like Quivr [1]. Schemas are good, but they can be defined in Python and you get a lot more composability and modularity than you would find in SQL (or pandas, realistically).
100% agree. I've also worked as a data engineer and came to the same conclusion. I wrote up a blog which went into a bit more depth on the topic here: https://www.robinlinacre.com/recommend_sql/
For those not familiar with this kind of computing challenges, I must link this wonderful video about TypeScript types running... DOOM. https://www.youtube.com/watch?v=0mCsluv5FXA
Whoa, I've been battling with an issue where my new M4 Mac Mini, disconnected from all peripherals, just sips power and gets rather warm overnight. Cools itself when I wake it up. I wonder if the changes suggested in the post and the comments here will help me resolve this issue. Thanks for upvoting this.
I remember just adding one letter to the instance type in EMR to switch from Intel to ARM... and saw a 20% speedup as well as an additional 20% hourly cost saving; all in all roughly a 1/3 cost decrease (about 0.8 × 0.8 of the original cost) from a single character.
We saw similar gains going from Intel to Graviton 2 on EMR: 15% to 40% speedups depending on the job. Saw similar gains in EMR serverless switching from x86 to arm as well.
The only issue I've had is actually getting enough on-demand instances in certain regions during peak times.
I can't wait for Graviton 3 to be available on EMR.
I think they are quite different. Postlite expects you to connect over the Postgres wire protocol. Sqld is compiled into your application, so your application behaves as if it's talking to an embedded SQLite; the calls are then made over the network (using one of three available transports) before the results are returned to your application.
Dumb question, with all of this newfangled WASM stuff, why couldn't we also bake the Postgres server (and then client) into the code? I know the WASM runtime would need to expose the low-level ability to do filesystem operations, mmap()ing, network sockets, etc.
I encourage anyone to study these protocols, they are sometimes a lot more prevalent than one might think. Take the Postgres wire protocol, it's supported by a number of databases out there.
Some years ago I built a data-grepping tool but didn't want to build a UI around it, so I started looking into the Postgres protocol so that I could plug in existing tools. Getting the basics right was not that difficult; the docs are pretty clear. https://www.postgresql.org/docs/current/protocol.html
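The handshake really is approachable. A minimal sketch of the StartupMessage in Python (host, port, and user are placeholders; no auth or TLS handling):

    import socket
    import struct

    # StartupMessage: Int32 length (incl. itself), Int32 protocol version 3.0,
    # then null-terminated key/value pairs, ending with an extra null byte.
    params = b"user\x00postgres\x00database\x00postgres\x00\x00"
    body = struct.pack("!i", 196608) + params
    msg = struct.pack("!i", len(body) + 4) + body

    with socket.create_connection(("localhost", 5432)) as s:
        s.sendall(msg)
        print(s.recv(1))  # b"R" = authentication request from the server

From there it's mostly reading length-prefixed, type-tagged messages, which is part of why so many databases find it worthwhile to speak the same protocol.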
Thanks for mentioning Postgres! I am actually pretty curious about the difference between MySQL and Postgres when it comes to their docs about the protocol. My superficial impression is that the Postgres docs are more thorough, leaving less room for ambiguity (some pages of the MySQL docs seem sloppy in comparison, e.g. those documenting the text encoding of values). I would be interested to read an analysis of the development culture of both projects, based on their docs.
For me, the main deterrent to Gitea has been its approach to UI design. The shameless inspiration (if you can call it that) it took from GitHub is pretty startling; the two interfaces look pretty much identical (well, GitHub has recently redesigned its repo page, but many will remember the old design).
Sure, get inspired, make something similar here or there, but this is an outright copy of the whole product design.
I just compared Gitea and Gitlab side-by-side and they're very, very similar. So either there's also copying between GitHub and Gitlab, or the design of git lends itself to a very specific interface.
> So either there's also copying between GitHub and Gitlab
Isn't it likely that there's just a certain kind of interface and functionality that both works and people have also gotten used to it over time?
It might make a lot of sense to copy it to at least some degree, instead of reinventing the wheel: if you look at MS Office and something like LibreOffice, you'll notice that both of the spreadsheet apps are rather similar (and you can even enable a ribbon interface in LibreOffice, if you want).
I think the same more or less applies to every piece of software, from how phone OSes look, to how desktop environments look and work, as well as why the majority of websites out there look a bit samey.
For me this is a benefit, GitHub's UI in my opinion is really good and I'm glad I don't have to learn a new one for Gitea.
Whenever I need to report a bug on a GitLab instance, its UI is a big source of friction (especially since GitLab's UI is still annoyingly slow when loading things like issue comments).
Can anyone share their experience with the somewhat new STRICT mode? Does it help? I tend to use Postgres when available, primarily for the added strictness, but I'd surely prefer SQLite in more scenarios, as it's easier to operate.
I use strict tables. I'm now realising it did help when I migrated tables from one data type to another, because I'd missed updating some of the code.
I didn’t realise because it was just what I’d expected, but without strict tables I’d have had to debug strange errors on retrieval rather than the type error on insertion I got.
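A quick sketch of what that looks like from Python's sqlite3 (needs an SQLite build with STRICT support, i.e. 3.37+):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (id INTEGER, name TEXT) STRICT")
    con.execute("INSERT INTO t VALUES (1, 'ok')")             # fine

    try:
        con.execute("INSERT INTO t VALUES ('one', 'oops')")   # wrong type for id
    except sqlite3.Error as exc:
        print(exc)  # fails at insert time instead of surprising you on read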
We have extremely heavy SQLite usage throughout. Strict mode would only cause trouble for us. Most of our SQLite access is managed through a domain-specific C# library we developed. It handles all of the parsing & conversion internally. We don't touch the databases directly in most cases.
I've run into situations where I need a db, but all I have is Python, so the embedded SQLite is my best option. It would be ideal if the interface was fully compatible with postgres (or a practical subset), even if SQLite was handling storage under the hood.
- It is annoying to have it per table and not database
- No alter table (to convert existing tables)
- No backward support (maybe through connection string or something)
It's a wonderful tool and has simplified workflows for many; just be mindful of its one giant security implication. Should your database credentials ever leak in any way (lost/stolen property, an incorrect git commit, a screencasting mishap, ...), an Adminer/phpMyAdmin instance running on an otherwise unrestricted address opens up (quite literally) your server to a world of trouble.
(I speak from experience. I've seen a lot of credentials leak together with source code. This meant I saw "secret" paths where Adminer was hosted on a given site. Having database credentials meant I didn't have to somehow circumvent network security to get to the database itself. [Yes, I disclosed all of this to the server operator and even got a measly $50 gift card out of it :-)])
Yeah, at previous jobs there was a db.devopsdomain.com (etc.) which acted as a bastion into the network, from which you could connect with DB credentials to the actual databases.
It's convenient, I will say that. I've tried not to replicate this exactly; instead I have Adminer running on a server on the network, but only listening for localhost connections. To get access to it, you can then SSH tunnel the ports locally:
    # Route 8082 on your machine to localhost:8080 on the server:
    ssh -N -L 8082:localhost:8080 db.devopsdomain.com
I've found the biggest issue with this is that people forget they've deployed it. It just gets left there until it's sufficiently out of date that a security issue pops up.