Polars: Company Formation Announcement (pola.rs)
190 points by mmastrac on Aug 3, 2023 | 52 comments


One of the interesting components of Polars that I've been watching is the use of the Apache Arrow memory format, which is a standard layout for data in memory that enables processing (querying, iterating, calculating, etc.) in a language-agnostic way, in particular without having to copy/convert it into the local object format first. This enables cross-language data access by mmapping or transferring a single buffer, with zero [de]serialization overhead. Something genius and obvious in hindsight.
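
To make the zero-copy idea concrete, here is a toy sketch using only Python's standard library. It is not Arrow and calls no Arrow API; `array` and `memoryview` are stand-ins I chose to play the role of a contiguous column buffer and a consumer that reads it in place.

```python
from array import array

# One "column" of 64-bit integers stored in a single contiguous buffer.
column = array("q", [10, 20, 30, 40])

# A consumer takes a view of the same memory: no copy, no deserialization.
view = memoryview(column)

# Slicing a view is also zero-copy; only the offset and length change.
tail = view[2:]

# A write through the producer is visible through every view,
# which shows that they all share one underlying buffer.
column[3] = 99

total = sum(view)
print(total, tail.tolist())
```

Arrow goes further by standardizing the layout itself (validity bitmaps, offsets, nested types), so the "view" can live in a different language or process.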

For some history, there has been a bit of contention between the official arrow-rs implementation and the arrow2 implementation created by the polars team which includes some extra features that they find important. I think the current status is that everyone agrees that having two crates that implement the same standard is not ideal, and they are working to port any necessary features to the arrow-rs crate and plan on eventually switching to it and deprecating arrow2, but it will take some time to get there.

https://github.com/apache/arrow-rs/issues/1176

https://github.com/jorgecarleitao/arrow2/pull/1476


That's not how I read it. I think the arrow-rs team wants a merger, and the arrow2/polars team does not oppose it as long as they can keep using the arrow2 API. The responsibility to port the changes over would lie with the arrow-rs team, however, as the arrow2/polars team won't spend any effort on it.


Note that there are about a dozen relevant threads across various repos, which I omitted for obvious reasons. Yes, arrow-rs contributors have driven the effort and made code changes in the arrow-rs repo, but it has absolutely been a collaboration; besides creating new useful apis, arrow2 devs have spent considerable effort to communicate and justify the api differences and the benefits thereof. And switching polars from arrow2 to arrow-rs in the end will not be zero effort. I used neutral language to describe the situation with the explicit intention to avoid spawning unnecessarily divisive commentary here. I don't see how your reading is helpful to understand the broad context.


I just joined Polars as a fresh hire, I'm excited to help make it easier to use and faster :)


Congrats! I was wondering where you'd go after announcing that you were looking for a job at FOSDEM.


Oh no, that doesn't bode well. Congrats to them for getting funding, but personally I've learned to stay away from VC-funded OSS projects. Sooner or later, they need to focus on investor returns, which inevitably results in locking more and more features and maintenance behind "products". Of course, they never start out with that intention, and they promise that they'll always focus on OSS. Never happens.


It’s also simultaneously exactly the opposite. If the authors never get money for their work, eventually they will quit or burn out, and then the project is in a worse spot.

If companies making money have to / want to pay for support and some extra metrics, it seems very reasonable to me. You can always fork the open source project and add the enterprise features yourself of course!


>If the authors never get money for their work, eventually they will quit or burn out, and then the project is in a worse spot

I don't think your conclusion is axiomatic. There are plenty of flourishing open source projects, big and small, that got handed over from the original author to new maintainers - sometimes multiple times. All without VC funding.


I am worried about VC money specifically, not money in general. There are other ways to fund OSS projects that don't lead to perverse incentives. Plenty of projects make money through sponsors, partnerships, or additional products without VC incentives.


What is it about the funding environment that makes VCs suddenly willing to fund companies built around open source python libraries? They generally aren't beating the AI drum, but are they benefiting by proximity to AI via python?


It’s not sudden and it’s not limited to python libraries.

The popular open source projects make great targets for low-multiple acqui-hires, derisking the investment. They also have huge established branding and generally obvious opportunities for displacing existing players. In a non-zero interest rate environment, those factors make established open source projects a more appealing bet.

Since they mentioned the company, undercutting and eating the market share of DataBricks is enough to appeal to some investors.


Python is growing well, even relative to Java, for $$$ data stack stuff that VCs have a reason to like

Compare:

Funding databases is among the most appealing bets for sw infra VCs, b/c business fundamentals like monetization (pay for hosting, data, etc), growth, & retention are some of the most successful & low-risk

Funding the sw compute tier is a peg down but appealing for similar reasons. Basically same-but-weaker than DBs on the above dimensions, but still worth it as customers struggle w/ compute at scale (technical + business), so still works. Think early databricks vs snowflake, and how databricks grew to owning more than compute to data lakehouse, dashboards, etc: started as pure compute and now closer to snowflake.

Python is popular for a lot of these compute tier stacks. Orchestrators, AI, ETL, etc. The technical, social, & economic reasons are all interesting & relevant for why.


Where’s the profit in funding a library, though? The other examples you gave are SaaSable, I don’t see that for a library. Paid support, sure, maybe, but does that get the kind of return a VC needs?


They're not selling the library but a managed cloud runtime for a scaleout compute tier. The clientside library is part of the freemium, and as workloads need to go bigger, they want to be the easy button for that.

That's similar to anyscale (ray), coiled & saturncloud (dask), and early databricks (spark). Managing infra for that kind of thing is annoying. These companies don't OSS their cloud stack.

Wishing them luck! A lot of arrow-core compute tier & db co's emerging, so cool to see the many years paying off.


> They're not selling the library but a managed cloud runtime for a scaleout compute tier. The clientside library is part of the freemium, and as workloads need to go bigger, they want to be the easy button for that.

That's what I had missed - thanks.


They already have distribution; the question of whether the product is wanted is answered, derisking the investment. All that's missing is monetization.


Databricks was started by the creators of Spark which was/is an open source library. This looks to be aiming for a similar goal.


VCs have been doing this for a while, but not limited to python


This is great news. Makes me proud to be a co-author of the forthcoming Polars book [0]. Congrats to Ritchie, Chiel, and the rest of the team!

[0]: https://jeroenjanssens.com/pp


I am so glad you are focusing on Python Polars! When do you think the book will be available?


Thanks! Our goal is to have it published in Q3 2024.


Congratulations! I use Polars to process my banking CSVs into a monthly report. Speed isn't a priority, but ergonomics is - I love Polars' consistent and readable Python API!


Are you willing to share more information about this? What does the report look like or include? Curious to learn more.


If you're looking for an easy way to build an HTML report using Python, you might find Datapane (https://github.com/datapane/datapane) helpful. I'm one of the people building it! We don't support polars (yet, on the roadmap) but we do support pandas so you can convert to a pandas DataFrame and include your data and any plots, etc.


Seconding this request, I do something similar and it is quite janky


Mine is janky as well, but here's a writeup of what I do - https://www.bbkane.com/blog/monthly-banking-report/

I'm very interested in improvement suggestions or comments you may have (just reply to this comment).


Thanks for sharing! I think everyone has their own expectations/needs from a tool like this, so just for the sake of comparison:

For classification and visualization I prefer tags (not necessarily mutually exclusive) rather than categories (too rigid and therefore arbitrary and therefore hard to keep consistent), but then the database management becomes more complicated (need a transactions-tags table or equivalent).

I want to automate the classification as much as possible, so I set up merchant-tags, which are automatically applied to all matching transactions. Again, this means managing a merchants table. It turned out that both the number of repeat merchants is high (so lots of patterns to maintain; these are stored in the DB as part of the merchant record, rather than in the code) and the number of novel merchants is high (so still spend a lot of time on the manual classification).

Part of the motivation for the above complexity was a vision of some kind of dynamic, auto-generated, multi-level breakdown of cost categories. For example, I might want to see a "date night" tag in both food>dining>date night, and entertainment>date night.
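
For anyone trying to picture the moving parts, the scheme above might be sketched roughly like this with the standard library's `sqlite3`. Every table name, column name, and sample row here is hypothetical, made up for illustration; it is not my actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (id INTEGER PRIMARY KEY, description TEXT, cents INTEGER);
    CREATE TABLE tags (id INTEGER PRIMARY KEY, path TEXT UNIQUE);
    -- Match patterns live in the merchant record, not in the code.
    CREATE TABLE merchants (id INTEGER PRIMARY KEY, name TEXT, pattern TEXT);
    CREATE TABLE merchant_tags (merchant_id INTEGER, tag_id INTEGER);
    -- Many-to-many link: tags are not mutually exclusive.
    CREATE TABLE transaction_tags (
        transaction_id INTEGER, tag_id INTEGER,
        PRIMARY KEY (transaction_id, tag_id)
    );
""")
conn.executemany("INSERT INTO tags (id, path) VALUES (?, ?)", [
    (1, "food>dining>date night"),
    (2, "entertainment>date night"),
])
conn.execute("INSERT INTO merchants VALUES (1, 'Le Bistro', '%BISTRO%')")
conn.executemany("INSERT INTO merchant_tags VALUES (?, ?)", [(1, 1), (1, 2)])
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", [
    (1, "CARD 1234 LE BISTRO", 8000),
    (2, "GROCERY STORE 42", 4500),
])

# Auto-classification: copy each merchant's tags onto matching transactions.
conn.execute("""
    INSERT OR IGNORE INTO transaction_tags (transaction_id, tag_id)
    SELECT t.id, mt.tag_id
    FROM transactions AS t
    JOIN merchants AS m ON t.description LIKE m.pattern
    JOIN merchant_tags AS mt ON mt.merchant_id = m.id
""")

# Breakdown per tag path; one transaction can count under several paths.
totals = dict(conn.execute("""
    SELECT tags.path, SUM(t.cents)
    FROM transaction_tags AS tt
    JOIN tags ON tags.id = tt.tag_id
    JOIN transactions AS t ON t.id = tt.transaction_id
    GROUP BY tags.path
"""))

# Novel merchants fall through and still need manual classification.
untagged = [row[0] for row in conn.execute("""
    SELECT description FROM transactions
    WHERE id NOT IN (SELECT transaction_id FROM transaction_tags)
""")]
print(totals, untagged)
```

Even in this tiny form it takes five tables to tag two transactions, which gives a feel for the maintenance burden.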

Anyway, all the above complexity was fragile, and too much work on top of the manual classification. A few years ago something in the DB broke and I just didn't get around to fixing it. I probably need to restart with something simpler.

Also, I'm considering trying this: https://lunchmoney.app/


Yeah I use tags for browser bookmarks, but I find that it's very hard to remember the names of tags over time.

+1 on restarting with something similar. If you can get away without a db it might be easier


Oh yeah, that's the other thing, tags effectively require an auto complete interface, and I didn't have that.


What's your business model? Contract development of features? Paid support? Paid-only features?


Highlighted for your convenience:

> We are aiming to deliver a Rust-based compute platform that will efficiently run Polars at any scale.

> We believe that the Polars API can be used for both local and cloud/distributed environments. Our API is designed to work well on multiple cores, this design also makes it well poised for a distributed environment. We also believe that a Rust based columnar OLAP engine (Polars), is perfectly suited for efficient distributed computing.

I suspect they will sell "cloud-scale distributed computation" systems. Perhaps something like snowflake?


Congratulations team! Big win for Polars

This has me thinking:

Every time I think to myself no way there is a business for X I need to remember there quite frankly could be a business for most things, as long as the problem domain being solved for saves labor vs cost, enables new use cases, or business expansion etc.

I've sat too long on far too many things that I'm like oh no one would pay for this when in fact, I bet at least one of my projects could be revenue generating.


Super happy, I think Polars is just an amazing tool! We are in the process of kicking Pandas out of our codebase to replace it with Polars, and I can't tell how satisfactory this has been. The API is so clean and nicely designed compared to Pandas... and with the extra speed/laziness, this is just a miracle ;-)


I would love to use Polars today (I have a lot of thoughts about good dataframe API design, and Polars gets a lot of the big ones right - including removing indexes).

However we have a huge amount (and growing) of Pandas code - is there an easy way to convert that in small pieces to Polars code?



I am very happy for you; egoistically, I want Polars to become the best tool it can be. On the other hand, it is not going to be easy to build a company around it. I think you need a different/better business model compared to Databricks/Spark.


Oh boy, a giant seed round with BCV leading. I'll stick with data.table + Clickhouse.


Great stuff. After using datatables in Python, I was excited by the performance but disappointed by the support (hey, everyone has different priorities). I had started my own DataFrame library in Rust that could be used as a pandas corollary/drop in. I wanted out-of-core, efficiency, threading, memmapping, and sql also! My long-abandoned implementation was a SQLite wrapper, as opposed to Arrow, but this looks exciting. Will be watching.


We're building a no-code data cleaning platform (https://nonan.io) and we recently switched from Pandas to Polars. Transformation pipelines are ~10x faster. Saving us roughly the same factor in compute costs. Polars is amazing.


This is really exciting stuff; gives polars a stable base to build on. Given how far it's already come without this kind of backing, the sky's the limit :))


Congrats! I'm confused by the Appendix in the post though. When it says "Ignoring database research", what is that connected to in the article?


Seems like that isn't directly referenced in the article, but the first sentence is: "The suboptimal state of DataFrame implementations can mostly be attributed to:", so I take that section to provide justification for the need for polars in the first place, in a world where there exist several other popular DataFrame implementations.


Can I use rust with polars for explorative data analysis?

I feel like it's the data exploration that locks me into Pandas, and I kinda want out.


> We are hiring ... We are looking for +- 4 CET.

This rules out people in North America, right?


Yes, initially we want to hire a bit closer to our base (the Netherlands). Eventually that might change.


Unless you are a developer in Newfoundland during daylight savings time (UTC-2.5)
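
The arithmetic, as a quick sketch. This assumes "+- 4 CET" means a UTC offset within four hours of CET (UTC+1); that reading and the helper name `within_window` are mine, not from the job posting.

```python
from datetime import timedelta

CET = timedelta(hours=1)      # Central European Time, UTC+1
WINDOW = timedelta(hours=4)   # assumed meaning of "+- 4"

def within_window(utc_offset):
    """True if the offset is at most four hours away from CET."""
    return abs(utc_offset - CET) <= WINDOW

newfoundland_dst = timedelta(hours=-2, minutes=-30)  # UTC-2:30
us_eastern_dst = timedelta(hours=-4)                 # UTC-4

in_range = within_window(newfoundland_dst)    # 3.5 hours from CET
out_of_range = within_window(us_eastern_dst)  # 5 hours from CET
print(in_range, out_of_range)
```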


Good luck! I ran into Polars by coincidence just yesterday and am already a big fan.

Will you also offer paid support?


We are looking into some sort of support system. Once our new website is out there will be more info on that.

You can also email us at info@polars.tech to get more info now.


Off topic pet peeve: can we stop posting graphs and images with transparency to sites that offer dark mode? This seriously just leads to an unreadable mess. This seems to be a common thing (wikipedia...) and I'm not quite sure why, but then again I'm not sure how people live with light mode.


Congratulations! Exciting to see more companies in the Rust data space.


Exciting! What is the argument for Polars on clusters instead of Dask?


Chiel and Ritchie I’m very proud, good luck.



