IMO, it would have been better to donate the repos to a shared org and motivate the community to continue maintaining them.
But pretty awesome this individual is retiring from programming / taking a sabbatical. There is nothing wrong with taking some time off and pursuing other interests when you lose your passion.
> A Data Lakehouse is fine but what benefit does it give you over a much more simple solution of ETL/ELTing the data in batches (weekly, daily, hourly, etc) and letting it sit in some kind of DB.
Lots of engines like Polars, PyTorch, Spark, and Ray can read structured data from databases, but reading directly from a Lakehouse is generally more efficient for those engines.
Databases aren't as good for storing unstructured data.
Databases can also be much more expensive than a Data Lakehouse.
Databases are awesome and have lots of amazing use cases of course. Like you mentioned, data lakehouses are great for high data volume and throughput, but there are other use cases as well IMO.
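For what it's worth, here's a minimal sketch of an engine reading a Lakehouse table directly from storage, with no database connection involved (the path is hypothetical, and it assumes the polars and deltalake packages are installed):

```python
# Minimal sketch: Polars reading a Delta Lake table straight from storage.
# The table path is hypothetical.
import polars as pl

# read_delta uses the Delta transaction log to find the underlying Parquet files
df = pl.read_delta("/data/events")

print(df.head())
```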
Lots of Spark workloads are executed with the C++ Photon engine on the Databricks platform, so we ironically have partially moved back to C++. Disclosure: I work for Databricks.
The continued use of C++ is not exactly something to be proud of, although in this case at least it presumably is for short-running jobs, not for long-running services that accumulate leaks.
There is a ton of reliable load-bearing software out there written in C++. I don't think the fact that a piece of software is written in C++ is enough to presume that it has memory leaks.
Yes, Ballista failed to gain traction. I think that one of the challenges was that it only supported a small subset of Spark, and there was too much work involved to try and get to parity with Spark.
The Comet approach is much more pragmatic because we just add support for more operators and expressions over time and fall back to Spark for anything that is not supported yet.
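To make that concrete, here's a rough conceptual sketch of the fallback idea (this is not actual Comet code, and the supported-operator set is purely hypothetical):

```python
# Conceptual sketch of incremental native support with Spark fallback.
# Not actual Comet code; the supported-operator set is hypothetical.
from dataclasses import dataclass, field

NATIVE_OPERATORS = {"FileScan", "Filter", "Project", "HashAggregate"}  # hypothetical

@dataclass
class PlanNode:
    name: str
    children: list = field(default_factory=list)
    native: bool = False  # True -> run on the native engine, False -> Spark JVM

def translate(node: PlanNode) -> PlanNode:
    node.children = [translate(child) for child in node.children]
    if node.name in NATIVE_OPERATORS:
        node.native = True  # supported: hand this operator to the native engine
    # unsupported operators keep native=False and fall back to regular Spark
    return node

plan = PlanNode("Sort", children=[PlanNode("Filter", children=[PlanNode("FileScan")])])
print(translate(plan))  # Sort falls back to Spark; Filter and FileScan go native
```

As more operators and expressions get added to the supported set, more of each query runs natively, without ever breaking queries that still use something unsupported.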
One of the challenges is that most Spark users don't care if you 2x performance.
We are in the enterprise with large cloud budgets and can simply change instance types. If you're 20x faster then that is a different story, but then (a) you need feature parity and (b) you need support from cloud vendors, which Spark has.
For the longest time, searching for Ballista turned up its old archived repo, which didn't even have a link to the new repo, and the new repo didn't show up in search results at all. This misled people into thinking Ballista was a dead project when it wasn't. It wasted so much opportunity.
I don't think it's a fair criticism of Ballista to say that it failed in any way. It just looks like it needs substantial effort to bring it to parity with Spark. The performance benefits are meaningful. Ballista could then not only take the crown from Spark, but also revalidate Rust as a language.
I do see a new opportunity for Ballista. By leveraging all of the Spark-compatible operators and expressions being built in Comet, it would be able to support a wider range of queries much more quickly.
Ballista already uses protobuf for sending plans to executors and Comet accepts protobuf plans (in a similar, but different format).
There seems to be a history of data technologies requiring a serious corporate sponsor. Arrow gets so much dev and marketing effort from Voltron, Spark from Databricks, etc. Did Ballista have anything similar? I loved the project but it never seemed to move very fast on integrating with other tools and platforms.
I love Medellin and lived there for many years, but the air quality is terrible and getting worse. You can talk with any locals and they say that the climate is noticeably different than it was in the past.
Medellin is surrounded by mountains and the contaminated air cannot escape. There didn't use to be a lot of cars, but now that financing is available the number of cars is growing significantly.
The hills are steep and old busses spew black smoke.
Saying Medellin's temperature decreased by 2 degrees Celsius based on "Mejorar el microclima hasta 2°C" ("improve the microclimate by up to 2°C") is a misinterpretation. I think this article is quite misleading.
> I love Medellin and lived there for many years, but the air quality is terrible and getting worse
The good thing about hill cities such as Medellin (sadly not a format available for big cities in, say, Europe or the US) is that you can choose your altitude. At around 2,000 meters (the city starts at ~1,500m) the air quality is not so bad. It used to be worse years ago (maybe you lived there 2 or 3 years ago), but now it's much better.
> You can talk with any locals and they say that the climate is noticeably different than it was in the past.
Yeah, the city is much warmer compared to, say, 10 years ago. Whether this is due to the city growing into previously forested areas or /global/ warming I don't know... but yeah, locals agree it was MUCH colder 10 years ago...
> Medellin is surrounded by mountains and the contaminated air cannot escape.
See comments above about living at 2000m altitude (up in the mountains, a bit away from the high-rise buildings and such; think of Beverly Hills or something like that).
> The hills are steep and old busses spew black smoke.
As of now, there are almost no old busses spewing black smoke anymore, but some cargo trucks still do.
> Saying Medellin's temp decreased by 2 degrees Celsius based on "Mejorar el microclima hasta 2°C" is a misinterpretation. I think this article is quite misleading.
I wouldn't know, but locals do say that it was a much colder city in the past...
I did some keyword research and wrote this post cause lots of folks are doing searches for Delta Lake vs Parquet. I'm just trying to share a fair summary of the tradeoffs with folks who are doing this search. It's a popular post and that's why I figured I would share it here.
You are right that there will be a flamewar. Others will discount some of what you say because of your bias, you will get criticism and personal remarks (mostly off base), and you will take tremendous heat for it. I have been there in a past life re: the Unix wars.
But, particularly if you acknowledge opposing views in your content and don't hide counterarguments via cherry-picking, you will really add value to the data community by exposing the truth and educating people on both your team and the other team. That ultimately spurs improvements where both sides have gaps and benefits the broader community.
It takes courage and care to put a controversial rigorous viewpoint out there; you do risk your "reputation". But, particularly if you make corrections where appropriate, people will recognize you as genuine.
It is not bad to have a point of view. What is bad is to hide your bias or counterarguments to deceive people.
Be part of the thesis + antithesis -> synthesis Hegelian dialogue that brings progress. Ultimately, as you advocate for your customers (developers/data users) rather than "your team", you will perform a true service to the community, even if only you and a few others recognize it.
Lots of organizations have Parquet data lakes and are considering switching to Delta Lake.
Converting a Parquet table to a Delta table is a cheap, in-place operation. You just add the Delta Lake metadata to an existing Parquet table and can then take advantage of transactions and other features. I don't think it's a meaningless comparison.
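For example, with the delta-spark Python package the conversion is roughly this (the table path is hypothetical):

```python
# In-place conversion of an existing directory of Parquet files to Delta Lake.
# The Parquet data files stay where they are; only the _delta_log metadata
# directory is added next to them. The path is hypothetical.
import pyspark
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable

builder = (
    pyspark.sql.SparkSession.builder.appName("convert")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

DeltaTable.convertToDelta(spark, "parquet.`/data/my-parquet-table`")

# From here on you read it as a Delta table and get transactions, schema in
# the metadata, time travel, etc.
df = spark.read.format("delta").load("/data/my-parquet-table")
```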
There is no such thing as a Parquet table. Parquet is a compressed file format, like a zip. Parquet files can be read into table formats like Hive, Delta, etc. That is why this comparison makes no sense.
> Lots of Parquet files in the same directory are typically referred to as a "Parquet table".
This is my point though? This is an apples to oranges comparison. A directory of Parquet files is not a table format. Comparing Delta to Hive or Iceberg is a more apt comparison. I have worked with all types of companies and I have yet to work with one that is just using a directory of Parquet files and calling it a day without using something like Hive with it.
Yea, comparing Delta Lake to Iceberg is more apt, but I've been shying away from that content cause I don't wanna flamewar. Another poster is asking for this post tho, so maybe I should write it.
I don't really see how Delta vs Hive comparison makes sense. A Delta table can be registered in the Hive metastore or can be unregistered. If you persist a Spark DataFrame in Delta with save it's not registered in the Hive metastore. If you persist it with saveAsTable it is registered. I've been meaning to write a blog post on this, so you're motivating me again.
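Roughly, the difference looks like this (assumes `spark` is a Delta-configured SparkSession, as in the conversion sketch above; the path and table name are hypothetical):

```python
# Hypothetical DataFrame just for illustration.
df = spark.range(5)

# save: writes Delta files to a path only -- nothing is registered in the
# Hive metastore, so there is no table name to query by.
df.write.format("delta").save("/data/unregistered_delta_table")

# saveAsTable: writes Delta files AND registers the table in the metastore /
# catalog, so spark.sql("SELECT * FROM registered_delta_table") works.
df.write.format("delta").saveAsTable("registered_delta_table")
```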
I've seen a bunch of enterprises that are still working with Parquet tables that aren't registered in Hive. I worked at an org like this for many years and didn't even know Hive was a thing, haha.
> I don't really see how Delta vs Hive comparison makes sense. A Delta table can be registered in the Hive metastore or can be unregistered.
You are right about Delta tables in the Hive metastore, but if you are writing from the perspective of "there are companies that don't know what Hive is", then I feel the next step up is "there are companies that just stuff files in S3 and query them with Athena" (which handles all the Hive stuff for you when you create tables). Explaining what Delta gives them over that setup is worth doing.
Let's suppose you have a data lake with 40,000 Parquet files. You need to list the files before you can read the data. This can take a few minutes; I've worked on data lakes where the file listing operations run for hours. Cloud object stores are key/value stores and aren't good at listing files the way Unix filesystems are.
When Spark reads the 40,000 Parquet files it needs to figure out the schema. By default, it'll grab the schema from one of the files and assume that all the others have the same schema. This could be wrong.
You can set an option telling Spark to read the schemas of all 40,000 Parquet files and reconcile them. That's expensive.
Or you can manually specify the schema, but that can be really tedious. What if the table has 200 columns?
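Concretely, those three options look something like this in PySpark (the path is hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
path = "/data/parquet-lake"  # hypothetical

# Default: Spark infers the schema from one (or a sample) of the files and
# assumes the rest match.
df = spark.read.parquet(path)

# Read every file's footer and merge the schemas -- safer, but expensive when
# there are 40,000 files.
df = spark.read.option("mergeSchema", "true").parquet(path)

# Or spell the schema out by hand, which gets tedious for wide tables.
schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
    # ...and so on for every other column
])
df = spark.read.schema(schema).parquet(path)
```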
The schema in the Parquet footer is perfect for a single file. I think storing the schema in the metadata is much better when data is spread across many Parquet files.
Yea, that's exactly what Delta Lake does. All the table metadata is stored in the transaction log: it starts out as JSON files and is eventually compacted into Parquet checkpoint files. These tables are sometimes so huge that the table metadata is big data also.
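You can see it if you peek inside a Delta table's directory (the path here is hypothetical):

```python
# The _delta_log directory holds the table metadata: JSON commit files plus
# the periodic Parquet checkpoint files that compact them. Path is hypothetical.
import os

log_dir = "/data/my-delta-table/_delta_log"
for name in sorted(os.listdir(log_dir)):
    print(name)

# Typically prints something like:
#   00000000000000000000.json
#   00000000000000000001.json
#   ...
#   00000000000000000010.checkpoint.parquet
#   _last_checkpoint
```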
* What does CLI support mean in the context of a Lakehouse storage system? You can open up a Spark shell or Python shell to interface with your Delta table. That's like saying "CSV doesn't have a CLI". I don't get it.