One of the interesting components of Polars that I've been watching is its use of the Apache Arrow memory format, a standard layout for data in memory that enables processing (querying, iterating, calculating, etc.) in a language-agnostic way, in particular without having to copy/convert the data into a local object format first. This enables cross-language data access by mmapping or transferring a single buffer, with zero [de]serialization overhead. Genius, and obvious in hindsight.
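To make the zero-copy point concrete, here's a small sketch of a Polars DataFrame handed to pyarrow (and from there to any other Arrow-speaking library) without per-row conversion. to_arrow() and from_arrow() are real Polars calls; the zero-copy property holds for most column types.

    import polars as pl
    import pyarrow as pa

    df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

    # Hand the underlying Arrow buffers to pyarrow; for most dtypes
    # this shares memory rather than converting.
    table: pa.Table = df.to_arrow()

    # And back again, still without materializing a fresh copy.
    df2 = pl.from_arrow(table)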
For some history, there has been a bit of contention between the official arrow-rs implementation and the arrow2 implementation created by the Polars team, which includes some extra features that they find important. I think the current status is that everyone agrees having two crates implementing the same standard is not ideal; they are working to port the necessary features to the arrow-rs crate and plan to eventually switch to it and deprecate arrow2, but it will take some time to get there.
That's not how I read it. I think the arrow-rs team wants a merger, and the arrow2/polars team does not oppose it as long as they can keep using the arrow2 API. The responsibility for porting the changes over would lie with the arrow-rs team, however, as the arrow2/polars team won't spend any effort on it.
Note that there are about a dozen relevant threads across various repos, which I omitted for obvious reasons. Yes, arrow-rs contributors have driven the effort and made code changes in the arrow-rs repo, but it has absolutely been a collaboration; besides creating new useful APIs, arrow2 devs have spent considerable effort communicating and justifying the API differences and the benefits thereof. And switching Polars from arrow2 to arrow-rs in the end will not be zero effort. I used neutral language to describe the situation with the explicit intention of avoiding unnecessarily divisive commentary here. I don't see how your reading helps in understanding the broad context.
Oh no, that doesn't bode well. Congrats to them for getting funding, but personally I've learned to stay away from VC-funded OSS projects. Sooner or later, they need to focus on investor returns, which inevitably results in locking more and more features and maintenance behind "products". Of course, they never start out with that intention, and they promise they'll always focus on OSS. It never happens.
It’s also simultaneously exactly the opposite. If the authors never get money for their work, eventually they will quit or burn out, and then the project is in a worse spot.
If companies making money have to / want to pay for support and some extra metrics, it seems very reasonable to me. You can always fork the open source project and add the enterprise features yourself of course!
>If the authors never get money for their work, eventually they will quit or burn out, and then the project is in a worse spot
I don't think your conclusion is axiomatic. There are plenty of flourishing open source projects, big and small, that got handed over from the original author to new maintainers - sometimes multiple times. All without VC funding.
I am worried about VC money specifically, not money in general. There are other ways to fund OSS projects that don't lead to perverse incentives. Plenty of projects make money through sponsors, partnerships, or additional products without VC incentives.
What is it about the funding environment that makes VCs suddenly willing to fund companies built around open source python libraries? They generally aren't beating the AI drum, but are they benefiting by proximity to AI via python?
It’s not sudden and it’s not limited to python libraries.
The popular open source projects make great targets for low-multiple acqui-hires, derisking the investment. They also have huge established branding and generally obvious opportunities for displacing existing players. In a non-zero interest rate environment, those factors make established open source projects a more appealing bet.
Since they mentioned the company: undercutting and eating the market share of Databricks is enough to appeal to some investors.
Python is growing well, even relative to Java, for $$$ data stack stuff that VCs have a reason to like
Compare:
Funding databases is some of the most appealing for software-infra VCs, because the business fundamentals - monetization (pay for hosting, data, etc.), growth, retention - are some of the strongest and lowest-risk.
Funding the software compute tier is a peg down but appealing for similar reasons. Basically same-but-weaker than databases on the above dimensions, but still worth it because customers struggle with compute at scale (both technically and as a business), so it still works. Think early Databricks vs Snowflake, and how Databricks grew from owning pure compute to the data lakehouse, dashboards, etc.: it started as pure compute and is now closer to Snowflake.
Python is popular for a lot of these compute-tier stacks: orchestrators, AI, ETL, etc. The technical, social, and economic reasons why are all interesting and relevant.
Where's the profit in funding a library, though? The other examples you gave are SaaS-able; I don't see that for a library. Paid support, sure, maybe, but does that get the kind of return a VC needs?
They're not selling the library but a managed cloud runtime for a scaleout compute tier. The clientside library is part of the freemium, and as workloads need to go bigger, they want to be the easy button for that.
That's similar to anyscale (ray), coiled & saturncloud (dask), and early databricks (spark). Managing infra for that kind of thing is annoying. These companies don't OSS their cloud stack.
Wishing them luck! A lot of arrow-core compute tier & db co's emerging, so cool to see the many years paying off.
> They're not selling the library but a managed cloud runtime for a scaleout compute tier. The clientside library is part of the freemium, and as workloads need to go bigger, they want to be the easy button for that.
They already have distribution; the question of whether the product is wanted is answered, derisking the investment. All that's missing is monetization.
Congratulations! I use Polars to process my banking CSVs into a monthly report. Speed isn't a priority, but ergonomics is - I love Polars' consistent and readable Python API!
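For anyone curious, the report boils down to something like the sketch below (the file name and column names are made up for illustration; note that group_by was spelled groupby in older Polars versions):

    import polars as pl

    # Lazily scan the CSV so Polars can optimize the whole pipeline.
    report = (
        pl.scan_csv("transactions.csv", try_parse_dates=True)
        # Bucket each transaction into its calendar month.
        .with_columns(pl.col("date").dt.truncate("1mo").alias("month"))
        .group_by("month")
        .agg(
            pl.col("amount").sum().alias("total_spent"),
            pl.col("merchant").n_unique().alias("unique_merchants"),
        )
        .sort("month")
        .collect()
    )
    print(report)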
If you're looking for an easy way to build an HTML report using Python, you might find Datapane (https://github.com/datapane/datapane) helpful. I'm one of the people building it! We don't support polars (yet, on the roadmap) but we do support pandas so you can convert to a pandas DataFrame and include your data and any plots, etc.
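A sketch of what that conversion looks like today (to_pandas() is a real Polars call; the Datapane lines reflect its classic Report API, which may differ across versions):

    import polars as pl
    import datapane as dp

    df = pl.read_csv("transactions.csv")  # placeholder file name

    # Convert at the boundary; .to_pandas() goes through Arrow,
    # so it's cheap for most dtypes.
    pdf = df.to_pandas()

    # Datapane's classic API; check the docs for current calls.
    dp.Report(dp.Table(pdf)).save(path="report.html")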
Thanks for sharing! I think everyone has their own expectations/needs from a tool like this, so just for the sake of comparison:
For classification and visualization I prefer tags (not necessarily mutually exclusive) over categories (too rigid, and therefore arbitrary and hard to keep consistent), but then the database management becomes more complicated (you need a transactions-tags table or equivalent).
I want to automate the classification as much as possible, so I set up merchant-tags, which are automatically applied to all matching transactions. Again, this means managing a merchants table. It turned out that both the number of repeat merchants is high (so lots of patterns to maintain; these are stored in the DB as part of the merchant record, rather than in the code) and the number of novel merchants is high (so still spend a lot of time on the manual classification).
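Concretely, the data model ended up looking roughly like this (a sketch with illustrative table and column names, using SQLite from Python):

    import sqlite3

    con = sqlite3.connect("finances.db")
    con.executescript("""
    CREATE TABLE IF NOT EXISTS merchants (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        match_pattern TEXT  -- pattern matched against raw statement text
    );
    CREATE TABLE IF NOT EXISTS transactions (
        id INTEGER PRIMARY KEY,
        date TEXT NOT NULL,
        amount REAL NOT NULL,
        merchant_id INTEGER REFERENCES merchants(id)
    );
    CREATE TABLE IF NOT EXISTS tags (
        id INTEGER PRIMARY KEY,
        name TEXT UNIQUE NOT NULL
    );
    -- Tags are many-to-many with both transactions and merchants,
    -- so one transaction can appear under several rollups at once.
    CREATE TABLE IF NOT EXISTS transaction_tags (
        transaction_id INTEGER REFERENCES transactions(id),
        tag_id INTEGER REFERENCES tags(id),
        PRIMARY KEY (transaction_id, tag_id)
    );
    CREATE TABLE IF NOT EXISTS merchant_tags (
        merchant_id INTEGER REFERENCES merchants(id),
        tag_id INTEGER REFERENCES tags(id),
        PRIMARY KEY (merchant_id, tag_id)
    );
    """)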
Part of the motivation for the above complexity was a vision of some kind of dynamic, auto-generated, multi-level breakdown of cost categories. For example, I might want to see a "date night" tag in both food>dining>date night, and entertainment>date night.
Anyway, all the above complexity was fragile, and too much work on top of the manual classification. A few years ago something in the DB broke and I just didn't get around to fixing it. I probably need to restart with something simpler.
> We are aiming to deliver a Rust-based compute platform that will efficiently run Polars at any scale.
> We believe that the Polars API can be used for both local and cloud/distributed environments. Our API is designed to work well on multiple cores, this design also makes it well poised for a distributed environment. We also believe that a Rust based columnar OLAP engine (Polars), is perfectly suited for efficient distributed computing.
I suspect they will sell "cloud-scale distributed computation" systems. Perhaps something like Snowflake?
Every time I think to myself "no way there's a business for X," I need to remember that, quite frankly, there could be a business for most things, as long as the problem being solved saves labor relative to cost, enables new use cases, supports business expansion, etc.
I've sat too long on far too many things thinking "oh, no one would pay for this," when in fact I bet at least one of my projects could be revenue-generating.
Super happy - I think Polars is just an amazing tool! We are in the process of kicking Pandas out of our codebase to replace it with Polars, and I can't tell you how satisfying this has been. The API is so clean and nicely designed compared to Pandas... and with the extra speed/laziness, this is just a miracle ;-)
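To give a feel for the difference, here's a toy before/after (illustrative data; the Pandas half is how our old code tended to look):

    import pandas as pd
    import polars as pl

    # Pandas: boolean indexing plus groupby.
    pdf = pd.DataFrame({"city": ["NY", "NY", "LA"], "sales": [5, 20, 30]})
    out_pd = pdf[pdf["sales"] > 10].groupby("city", as_index=False)["sales"].mean()

    # Polars: the same logic as one lazy expression chain; the
    # .lazy()/.collect() pair lets the optimizer push the filter down.
    pldf = pl.DataFrame({"city": ["NY", "NY", "LA"], "sales": [5, 20, 30]})
    out_pl = (
        pldf.lazy()
        .filter(pl.col("sales") > 10)
        .group_by("city")
        .agg(pl.col("sales").mean())
        .collect()
    )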
I would love to use Polars today (I have a lot of thoughts about good dataframe API design, and Polars gets a lot of the big ones right - including removing indexes).
However, we have a huge (and growing) amount of Pandas code - is there an easy way to convert it to Polars in small pieces?
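The escape hatch I've found so far is converting at function boundaries, so callers keep seeing Pandas while individual steps migrate one at a time (a sketch; the column name is made up):

    import pandas as pd
    import polars as pl

    def cleaned(df: pd.DataFrame) -> pd.DataFrame:
        # from_pandas()/.to_pandas() round-trip via Arrow, so the
        # conversion overhead is modest for most dtypes.
        return (
            pl.from_pandas(df)
            .filter(pl.col("amount") > 0)  # hypothetical column
            .to_pandas()
        )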
I am very happy for you; selfishly, I want Polars to become the best tool it can be. On the other hand, it is not going to be easy to build a company around it - I think you need a different/better business model than Databricks/Spark.
Great stuff. After using datatable in Python, I was excited by the performance but disappointed by the support (hey, everyone has different priorities). I had started my own DataFrame library in Rust, intended as a pandas analogue/drop-in replacement. I wanted out-of-core processing, efficiency, threading, memory-mapping, and SQL too! My long-abandoned implementation was a SQLite wrapper rather than being built on Arrow, but this looks exciting. Will be watching.
We're building a no-code data cleaning platform (https://nonan.io) and we recently switched from Pandas to Polars. Transformation pipelines are ~10x faster, saving us roughly the same factor in compute costs. Polars is amazing.
This is really exciting stuff; it gives Polars a stable base to build on. Given how far it's already come without this kind of backing, the sky's the limit :))
Seems like that isn't directly referenced in the article, but the first sentence is: "The suboptimal state of DataFrame implementations can mostly be attributed to:", so I take that section to provide justification for the need for polars in the first place, in a world where there exist several other popular DataFrame implementations.
Off topic pet peeve: can we stop posting graphs and images with transparency to sites that offer dark mode? This seriously just leads to an unreadable mess. This seems to be a common thing (wikipedia...) and I'm not quite sure why, but then again I'm not sure how people live with light mode.
Relevant threads:
https://github.com/apache/arrow-rs/issues/1176
https://github.com/jorgecarleitao/arrow2/pull/1476