It provides a central place to store and query data. A big org might have a few hu...

bokenator · 2025-05-06T04:14:40 1746504880

Which open source option did you end up going with? I'm in the same boat and would like to evaluate my options.

rogermavis · 2025-05-06T04:39:40 1746506380

My stack atm is neovim, python/R, an EC2 and postgres (sometimes SQL Server). Some use of arrow and duckdb. For queries on less than a few hundred GB this stack does great. It's fast and familiar, the EC2 is running 24/7 so it's there when I need it, I can easily schedule overnight jobs, and no time is wasted waiting for it to boot.

creeksai · 2025-05-06T04:45:08 1746506708

You mentioned earlier how long it would take to acquire a new cluster in Databricks, but you are comparing it here to something that's always on. In a much larger environment, your setup is not really practical to have a lot of people collaborating.

Note that Databricks SQL Serverless these days can be provisioned in a few seconds.

rogermavis · 2025-05-06T04:57:29 1746507449

> you are comparing it here to something that's always on

That's the point. Our org was told databricks would solve problems we just didn't have. Serverful has some wonderful advantages: simplicity, (ironically) lower cost (than something running just 3-4 hours a day but which costs 10x), familiarity, reliability. Serverless also has advantages, but only if it runs smoothly, doesn't take an eternity to boot, isn't prohibitively expensive, and has little friction before using it. Databricks meets 0/4 of those criteria, with the additional downside of restrictive SQL due to the Spark backend, which adds unnecessary refactoring/complexity to queries.
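To make the cost arithmetic in that parenthetical concrete, here is a minimal sketch. It assumes the "10x" figure is an hourly price multiple relative to the always-on box, which is the commenter's claim rather than a measured price; `BASE_RATE` and `daily_cost` are hypothetical names used only for illustration.

```python
# Rough daily-cost comparison under the comment's stated assumptions.
# BASE_RATE is a hypothetical unit price; only the ratios matter.
BASE_RATE = 1.0          # hourly cost of the always-on server (arbitrary units)
PRICE_MULTIPLE = 10      # assumed: on-demand compute costs 10x per hour


def daily_cost(hours_per_day: float, hourly_rate: float) -> float:
    """Cost of running for the given number of hours at the given rate."""
    return hours_per_day * hourly_rate


always_on = daily_cost(24, BASE_RATE)
on_demand_low = daily_cost(3, PRICE_MULTIPLE * BASE_RATE)
on_demand_high = daily_cost(4, PRICE_MULTIPLE * BASE_RATE)

# Hours per day at which the two options cost the same.
break_even_hours = 24 * BASE_RATE / (PRICE_MULTIPLE * BASE_RATE)

print(f"always-on 24/7:      {always_on:.1f} units/day")
print(f"on-demand 3-4 h/day: {on_demand_low:.1f} to {on_demand_high:.1f} units/day")
print(f"break-even usage:    {break_even_hours:.1f} h/day")
```

Under these assumptions the always-on server is cheaper whenever the on-demand option runs more than about 2.4 hours a day; with a smaller price multiple the conclusion could flip.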

> your setup is not really practical to have a lot of people collaborating

Hard disagree. Our methods are simple and time-tested. We use git to share code (a 100x improvement on databricks' version of git). We share data in a few ways; the most common are creating a table in a database or in S3. It doesn't have to be a whole lot more complicated.

creeksai · 2025-05-06T05:06:59 1746508019

I totally understand if Databricks doesn't fit your use cases.

But you are making a disingenuous comparison here, because one can keep a "serverful" cluster up without shutting it down, and in that case you'd never need to wait for anything to boot up. If you shut down your EC2 instances, they will also take time to boot up. Alternatively, you can use the (relatively new) serverless offering from them that gets you compute resources in seconds.

rogermavis · 2025-05-06T05:24:24 1746509064

To ensure I'm not speaking incorrectly (as I was going from memory), I grep'ed my several years of databricks notes. Oh boy... the memories came flooding back!

We had 8 data engineers onboarding the org to databricks. It was only after 2 solid years that they got to working on serverless (because users complained about the user-unfriendliness of 'nodes', and managers about cost). But then, there were problems. A common pattern through my grep of slack convos is "I'm having this esoteric error where X doesn't work on serverless databricks, can you help", then a bunch of back and forth (sometimes over days) and screenshots, followed by "oh, unfortunately, serverless doesn't support X".

Another interesting note is someone compared serverless databricks to bigquery, and bigquery was 3x faster without the databricks-specific cruft (all bigquery needs is an authenticated user and a sql query).

Databricks isn't useless. It's just a swiss army knife that doesn't do anything well, except sales, and may improve the workflows for the least advanced data analysts/scientists at the expense of everyone else.

datadrivenangel · 2025-05-06T12:57:58 1746536278

This matches my experiences as well. Databricks is great if 1. your data is actually big (processing 10s/100s of terabytes daily), and 2. you don't care about money.

thr0w · 2025-05-06T13:04:23 1746536663

> Fast > ec2

Are you doing this on EBS? Honest question.

walamaking · 2025-05-06T05:59:15 1746511155

Dumb question - how is this different from Snowflake?

pm90 · 2025-05-06T07:18:06 1746515886

They are competitors and are similar. Snowflake popularized the cloud data warehouse concept (after AWS fumbled it big with Redshift). Databricks is the hot new tool.

levanten · 2025-05-06T06:17:34 1746512254

They are very similar, with various similar solutions at differing stages of maturity.