IMO, it would have been better to donate the repos to a shared org and motivate the community to continue maintaining them.
But pretty awesome this individual is retiring from programming / taking a sabbatical. There is nothing wrong with taking some time off and pursuing other interests when you lose your passion.
> A Data Lakehouse is fine but what benefit does it give you over a much more simple solution of ETL/ELTing the data in batches (weekly, daily, hourly, etc) and letting it sit in some kind of DB.
Lots of engines like Polars, PyTorch, Spark, and Ray can read structured data from databases, but reading directly from a Lakehouse is generally more efficient for those engines.
Databases aren't as good for storing unstructured data.
Databases can also be much more expensive than a Data Lakehouse.
Databases are awesome and have lots of amazing use cases of course. Like you mentioned, data lakehouses are great for high data volume and throughput, but there are other use cases as well IMO.
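For what it's worth, here's a minimal sketch of an engine reading a Lakehouse table directly from storage, with no database connection involved (the path is hypothetical, and it assumes the polars and deltalake packages are installed):

```python
# Minimal sketch: Polars reading a Delta Lake table straight from storage.
# The table path is hypothetical.
import polars as pl

# read_delta uses the Delta transaction log to find the underlying Parquet files
df = pl.read_delta("/data/events")

print(df.head())
```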
Lots of Spark workloads are executed with the C++ Photon engine on the Databricks platform, so we ironically have partially moved back to C++. Disclosure: I work for Databricks.
The continued use of C++ is not exactly something to be proud of, although in this case at least it presumably is for short-running jobs, not for long-running services that accumulate leaks.
There is a ton of reliable load-bearing software out there written in C++. I don't think the fact that a piece of software is written in C++ is enough to presume that it has memory leaks.
Yes, Ballista failed to gain traction. I think that one of the challenges was that it only supported a small subset of Spark, and there was too much work involved to try and get to parity with Spark.
The Comet approach is much more pragmatic because we just add support for more operators and expressions over time and fall back to Spark for anything that is not supported yet.
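To make that concrete, here's a rough conceptual sketch of the fallback idea (this is not actual Comet code, and the supported-operator set is purely hypothetical):

```python
# Conceptual sketch of incremental native support with Spark fallback.
# Not actual Comet code; the supported-operator set is hypothetical.
from dataclasses import dataclass, field

NATIVE_OPERATORS = {"FileScan", "Filter", "Project", "HashAggregate"}  # hypothetical

@dataclass
class PlanNode:
    name: str
    children: list = field(default_factory=list)
    native: bool = False  # True -> run on the native engine, False -> Spark JVM

def translate(node: PlanNode) -> PlanNode:
    node.children = [translate(child) for child in node.children]
    if node.name in NATIVE_OPERATORS:
        node.native = True  # supported: hand this operator to the native engine
    # unsupported operators keep native=False and fall back to regular Spark
    return node

plan = PlanNode("Sort", children=[PlanNode("Filter", children=[PlanNode("FileScan")])])
print(translate(plan))  # Sort falls back to Spark; Filter and FileScan go native
```

As more operators and expressions get added to the supported set, more of each query runs natively, without ever breaking queries that still use something unsupported.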
One of the challenges is that most Spark users don't care if you 2x performance.
We are in the enterprise with large cloud budgets and can simply change instance types. If you're 20x faster then that is a different story, but then (a) you need feature parity and (b) you need support from cloud vendors, which Spark has.
For the longest time, searching for Ballista turned up its old archived repo, which didn't even have a link to the new repo, and the new repo didn't show up in search results at all. This misled people into thinking Ballista was a dead project when it wasn't. It wasted so much opportunity.
I don't think it's a fair criticism of Ballista to say that it failed in any way. It just looks like it needs substantial effort to bring it to parity with Spark. The performance benefits are meaningful. Ballista could then not only take the crown from Spark, but also revalidate Rust as a language.
I do see a new opportunity for Ballista. By leveraging all of the Spark-compatible operators and expressions being built in Comet, it would be able to support a wider range of queries much more quickly.
Ballista already uses protobuf for sending plans to executors and Comet accepts protobuf plans (in a similar, but different format).
There seems to be a history of data technologies requiring a serious corporate sponsor. Arrow gets so much dev and marketing effort from Voltron, Spark from Databricks, etc. Did Ballista have anything similar? I loved the project but it never seemed to move very fast on integrating with other tools and platforms.
I love Medellin and lived there for many years, but the air quality is terrible and getting worse. You can talk with any locals and they say that the climate is noticeably different than it was in the past.
Medellin is surrounded by mountains and the contaminated air cannot escape. There didn't use to be a lot of cars, but now that financing is available the number of cars is growing significantly.
The hills are steep and old busses spew black smoke.
Saying Medellin's temperature decreased by 2 degrees Celsius based on "Mejorar el microclima hasta 2°C" ("improve the microclimate by up to 2°C") is a misinterpretation. I think this article is quite misleading.
> I love Medellin and lived there for many years, but the air quality is terrible and getting worse
The good thing about hill cities such as Medellin (sadly not a format available for big cities in, say, Europe or the US) is that you can choose your altitude. At around 2,000 meters (the city starts at ~1,500m) the air quality is not so bad. It used to be worse years ago (maybe you lived there 2 or 3 years ago), but now it's much better.
> You can talk with any locals and they say that the climate is noticeably different than it was in the past.
Yeah, the city is much warmer compared to, say, 10 years ago. Whether this is due to the city growing into previously forested areas or /global/ warming I don't know... but yeah, locals agree it was MUCH colder 10 years ago...
> Medellin is surrounded by mountains and the contaminated air cannot escape.
See comments above about living at 2000m altitude (up in the mountains, a bit away from the high-rise buildings and such; think of Beverly Hills or something like that).
> The hills are steep and old busses spew black smoke.
As of now, there are almost no old busses spewing black smoke anymore, but some cargo trucks still do.
> Saying Medellin's temp decreased by 2 degrees Celsius based on "Mejorar el microclima hasta 2°C" is a misinterpretation. I think this article is quite misleading.
I wouldn't know, but locals do say that it was a much colder city in the past...
I did some keyword research and wrote this post cause lots of folks are doing searches for Delta Lake vs Parquet. I'm just trying to share a fair summary of the tradeoffs with folks who are doing this search. It's a popular post and that's why I figured I would share it here.
You are right that there will be a flamewar. Others will discount some of what you say because of your bias, you will get criticism and personal remarks (mostly off base), and you will take tremendous heat for it. I have been there in a past life re: the Unix wars.
But, particularly if you acknowledge opposing views in your content and don't hide counterarguments via cherry-picking, you will really add value to the data community by exposing the truth and educating people on both your team and the other team. That ultimately spurs improvements where both sides have gaps and benefits the broader community.
It takes courage and care to put a controversial rigorous viewpoint out there; you do risk your "reputation". But, particularly if you make corrections where appropriate, people will recognize you as genuine.
It is not bad to have a point of view. What is bad is to hide your bias or counterarguments to deceive people.
Be part of the thesis + antithesis -> synthesis Hegelian dialogue that brings progress. Ultimately, as you advocate for your customers (developers/data users) rather than "your team", you will perform a true service to the community, even if only you and a few others recognize it.
Lots of organizations have Parquet data lakes and are considering switching to Delta Lake.
Converting a Parquet table to a Delta table is a cheap, in-place operation. You just add the Delta Lake metadata to an existing Parquet table and can then take advantage of transactions and other features. I don't think it's a meaningless comparison.
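For example, with the delta-spark Python package the conversion is roughly this (the table path is hypothetical):

```python
# In-place conversion of an existing directory of Parquet files to Delta Lake.
# The Parquet data files stay where they are; only the _delta_log metadata
# directory is added next to them. The path is hypothetical.
import pyspark
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable

builder = (
    pyspark.sql.SparkSession.builder.appName("convert")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

DeltaTable.convertToDelta(spark, "parquet.`/data/my-parquet-table`")

# From here on you read it as a Delta table and get transactions, schema in
# the metadata, time travel, etc.
df = spark.read.format("delta").load("/data/my-parquet-table")
```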
There is no such thing as a Parquet table. Parquet is a compressed file format, like a zip. Parquet files can be read into table formats like Hive, Delta, etc. That is why this comparison makes no sense.
> Lots of Parquet files in the same directory are typically referred to as a "Parquet table".
This is my point though? This is an apples to oranges comparison. A directory of Parquet files is not a table format. Comparing Delta to Hive or Iceberg is a more apt comparison. I have worked with all types of companies and I have yet to work with one that is just using a directory of Parquet files and calling it a day without using something like Hive with it.
Yea, comparing Delta Lake to Iceberg is more apt, but I've been shying away from that content cause I don't wanna flamewar. Another poster is asking for this post tho, so maybe I should write it.
I don't really see how Delta vs Hive comparison makes sense. A Delta table can be registered in the Hive metastore or can be unregistered. If you persist a Spark DataFrame in Delta with save it's not registered in the Hive metastore. If you persist it with saveAsTable it is registered. I've been meaning to write a blog post on this, so you're motivating me again.
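Roughly, the difference looks like this (assumes `spark` is a Delta-configured SparkSession, as in the conversion sketch above; the path and table name are hypothetical):

```python
# Hypothetical DataFrame just for illustration.
df = spark.range(5)

# save: writes Delta files to a path only -- nothing is registered in the
# Hive metastore, so there is no table name to query by.
df.write.format("delta").save("/data/unregistered_delta_table")

# saveAsTable: writes Delta files AND registers the table in the metastore /
# catalog, so spark.sql("SELECT * FROM registered_delta_table") works.
df.write.format("delta").saveAsTable("registered_delta_table")
```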
I've seen a bunch of enterprises that are still working with Parquet tables that aren't registered in Hive. I worked at an org like this for many years and didn't even know Hive was a thing, haha.
> I don't really see how Delta vs Hive comparison makes sense. A Delta table can be registered in the Hive metastore or can be unregistered.
You are right about Delta tables in the Hive metastore, but if you are writing from the perspective of "there are companies that don't know what Hive is", then I feel the next step up is "there are companies that just stuff files in S3 and query them with Athena" (which handles all the Hive stuff for you when you create tables). Explaining what Delta gives them over that setup is worth doing.
Let's suppose you have a data lake with 40,000 Parquet files. You need to list the files before you can read the data. This can take a few minutes; I've worked on data lakes where the file listing operations run for hours. Cloud object stores are key/value stores and aren't good at listing files the way Unix filesystems are.
When Spark reads the 40,000 Parquet files it needs to figure out the schema. By default, it'll grab the schema from one of the files and assume that all the others have the same schema. This could be wrong.
You can set an option telling Spark to read the schemas of all 40,000 Parquet files and reconcile them. That's expensive.
Or you can manually specify the schema, but that can be really tedious. What if the table has 200 columns?
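Concretely, those three options look something like this in PySpark (the path is hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
path = "/data/parquet-lake"  # hypothetical

# Default: Spark infers the schema from one (or a sample) of the files and
# assumes the rest match.
df = spark.read.parquet(path)

# Read every file's footer and merge the schemas -- safer, but expensive when
# there are 40,000 files.
df = spark.read.option("mergeSchema", "true").parquet(path)

# Or spell the schema out by hand, which gets tedious for wide tables.
schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
    # ...and so on for every other column
])
df = spark.read.schema(schema).parquet(path)
```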
The schema in the Parquet footer is perfect for a single file. I think storing the schema in the metadata is much better when data is spread across many Parquet files.
Yea, that's exactly what Delta Lake does. All the table metadata is stored in the transaction log: it starts out as JSON files and is eventually compacted into Parquet checkpoint files. These tables are sometimes so huge that the table metadata is big data also.
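You can see it if you peek inside a Delta table's directory (the path here is hypothetical):

```python
# The _delta_log directory holds the table metadata: JSON commit files plus
# the periodic Parquet checkpoint files that compact them. Path is hypothetical.
import os

log_dir = "/data/my-delta-table/_delta_log"
for name in sorted(os.listdir(log_dir)):
    print(name)

# Typically prints something like:
#   00000000000000000000.json
#   00000000000000000001.json
#   ...
#   00000000000000000010.checkpoint.parquet
#   _last_checkpoint
```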
* What does CLI support mean in the context of a Lakehouse storage system? You can open up a Spark shell or Python shell to interface with your Delta table. That's like saying "CSV doesn't have a CLI". I don't get it.