It provides central place to store and query data. A big org might have a few hundred databases for various purposes - databricks lets data engineers set up pipelines to ETL that data into databricks and when the data is there it can be queried (using spark, so there's some downsides - namely a more restrictive SQL variant - but some advantages - better performance across very large datasets).
Personally, I hated databricks, it caused endless pain. Our org has less than 10TB of data and so it's overkill. Good ol' Postgres or SQL Server does just fine on tables of a few hundred GB, and bigquery chomps up 1TB+ without breaking a sweat.
Everything in databricks - everything - is clunky and slow. Booting up clusters can take 15 minutes whereas something like bigquery is essentially on-demand and instant. Data ETL'd into databricks usually differs slightly from its original source in subtle but annoying ways. Your IDE (which looks like jupyter notebook, but is not) absolutely suck (limited/unfamiliar keyboard shortcuts, flakey, can only be edited in browser), and you're out of luck if you want to use your favorite IDE, vim etc.
Almost every databricks feature makes huge concessions on the functionality you'd get if you just used that feature outside of databricks. For example databricks has it's own git-like functionality (which is the 5% of git that gets most used, but no way to do the less common git operations).
My personal take is databricks is fine for users who'd otherwise use their laptop's computer/memory - this gets them an environment where they can access much more, at about 10x the cost of what you'd pay for the underlying infra if you just set it up yourself. Ironically, all the databricks-specific cruft (config files, click ops) that's required to get going will probably be difficult for that kind of user anyway, so it negates its value.
For more advanced users (i.e. those that know how to start an ec2 or anything more advanced), databricks will slow you down and be endlessly frustrating. It will basically 2-10x the time it takes to do anything, and sap the joy out of it. I almost quit my job of 12 years because the org moved to databricks. I got permission to use better, faster, cheaper, less clunky, open-source tooling, so I stayed.
My stack atm is neovim, python/R, an EC2 and postgres (sometimes Sql Server). Some use of arrow and duckdb. For queries on less than few hundred GB this stack does great. Fast, familiar, the ec2 is running 24/7 so it's there when I need it and can easily schedule overnight jobs, and no time wasted waiting for it to boot.
You mentioned earlier about how long it would take to acquire a new cluster in Databricks, but you are comparing it here to something that's always on here. In a much larger environment, your setup is not really practical to have a lot of people collaborating.
Note that Databricks SQL Serverless these days can be provisioned in a few seconds.
> you are comparing it here to something that's always on
That's the point. Our org was told databricks would solve problems we just didn't have. Serverful has some wonderful advantages: simplicity, (ironically) cheaper (than something running just 3-4 hours a day but which costs 10x), familiarity, reliability. Serverless also has advantages, but only if it runs smoothly, doesn't take an eternity to boot, isn't prohibitively expensive, and has little friction before using it - databricks meets 0/4 of those critera, with the additional downside of restrictive SQL due to spark backend, adding unnecessary refactoring/complexity to queries.
> your setup is not really practical to have a lot of people collaborating
Hard disagree. Our methods are simple and time-tested. We use git to share code (100x improvement on databricks' version of git). We share data in a few ways, the most common are by creating a table in a database or in S3. It doesn't have to be a whole lot more complicated.
I totally understand if Databricks doesn't fit your use cases.
But you are doing a disingenuous comparison here because one can keep a "serverful" cluster up without shutting it down, and in that case, you'd never need to wait for anything to boot up. If you shut down your EC2 instances, it will also take time to boot up. Alternatively, you can use the (relatively new) serverless offering from them that gets you compute resources in seconds.
To ensure I'm not speaking incorrectly (as I was going from memory), I grep'ed my several years' of databricks notes. Oh boy.. the memories came flooding back!
We had 8 data engineers onboarding the org to databricks, it was only after 2 solid years before they got to working on serverless (it was because users complained of user unfriendliness of 'nodes', and managers of cost). But then, there were problems. A common pattern through my grep of slack convos is "I'm having this esoteric error where X doesn't work on serverless databricks, can you help".. a bunch of back and forth (sometimes over days) and screenshots followed by "oh, unfortunately, serverless doesn't support X".
Another interesting note is someone compared serverless databricks to bigquery, and bigquery was 3x faster without the databricks-specific cruft (all bigquery needs is an authenticated user and a sql query).
Databricks isn't useless. It's just a swiss army knife that doesn't do anything well, except sales, and may improve the workflows for the least advanced data analysts/scientists at the expense of everyone else.
This matches my experiences as well. Databricks is great if 1. your data is actually big (processing 10s/100s of terabytes daily), and 2. you don't care about money.
they are competitors and are similar. Snowflake popularized the cloud datawarehouse concept (after aws fumbled it big with Redshift). DB is the hot new tool.
Personally, I hated databricks, it caused endless pain. Our org has less than 10TB of data and so it's overkill. Good ol' Postgres or SQL Server does just fine on tables of a few hundred GB, and bigquery chomps up 1TB+ without breaking a sweat.
Everything in databricks - everything - is clunky and slow. Booting up clusters can take 15 minutes whereas something like bigquery is essentially on-demand and instant. Data ETL'd into databricks usually differs slightly from its original source in subtle but annoying ways. Your IDE (which looks like jupyter notebook, but is not) absolutely suck (limited/unfamiliar keyboard shortcuts, flakey, can only be edited in browser), and you're out of luck if you want to use your favorite IDE, vim etc.
Almost every databricks feature makes huge concessions on the functionality you'd get if you just used that feature outside of databricks. For example databricks has it's own git-like functionality (which is the 5% of git that gets most used, but no way to do the less common git operations).
My personal take is databricks is fine for users who'd otherwise use their laptop's computer/memory - this gets them an environment where they can access much more, at about 10x the cost of what you'd pay for the underlying infra if you just set it up yourself. Ironically, all the databricks-specific cruft (config files, click ops) that's required to get going will probably be difficult for that kind of user anyway, so it negates its value.
For more advanced users (i.e. those that know how to start an ec2 or anything more advanced), databricks will slow you down and be endlessly frustrating. It will basically 2-10x the time it takes to do anything, and sap the joy out of it. I almost quit my job of 12 years because the org moved to databricks. I got permission to use better, faster, cheaper, less clunky, open-source tooling, so I stayed.