Let's suppose you have a data lake with 40,000 Parquet files. You need to list the files before you can read the data, and that alone can take a few minutes. I've worked on data lakes where the file listing runs for hours. Key/value object stores just aren't good at listing files the way Unix filesystems are.
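To make that concrete, here's roughly what a naive listing against S3 looks like with boto3 (the bucket and prefix are made up). Each LIST call returns at most 1,000 keys, so 40,000 files means around 40 sequential round trips, and a deeply partitioned table multiplies that by the number of prefixes you have to walk:

```python
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

parquet_files = []
# Each page holds at most 1,000 keys, so this loop makes ~40 LIST calls
# for 40,000 files -- all of them sequential.
for page in paginator.paginate(Bucket="my-data-lake", Prefix="events/"):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".parquet"):
            parquet_files.append(obj["Key"])

print(f"{len(parquet_files)} files to read")
```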
When Spark reads the 40,000 Parquet files it needs to figure out the schema. By default, it'll just grab the schema from one of the files and assume all the others have the same schema. That assumption can be wrong.
You can set an option telling Spark to read the schemas of all 40,000 Parquet files and make sure they all have the same schema. That's expensive.
Or you can manually specify the schema, but that gets really tedious. What if the table has 200 columns?
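Roughly what those three options look like in PySpark: the default single-file inference, the mergeSchema option (that's the one I'm referring to above), and a hand-written schema. The path and column names are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.getOrCreate()

# Default: Spark samples one file's footer and assumes the rest match.
df_default = spark.read.parquet("s3://my-data-lake/events/")

# mergeSchema: read and reconcile every footer -- correct but expensive.
df_merged = (
    spark.read.option("mergeSchema", "true").parquet("s3://my-data-lake/events/")
)

# Manual schema: no footer reads at all, but tedious for wide tables.
schema = StructType([
    StructField("event_id", LongType()),
    StructField("event_type", StringType()),
    # ... imagine 198 more of these
])
df_manual = spark.read.schema(schema).parquet("s3://my-data-lake/events/")
```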
The schema in the Parquet footer is perfect for a single file. When the data is spread across many Parquet files, I think storing the schema in table-level metadata is much better.
Yea, that's exactly what Delta Lake does. All the table metadata lives in the transaction log: commits are initially written as JSON files and periodically compacted into Parquet checkpoint files. Some tables are so huge that the table metadata is big data in its own right.
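For example, with the delta-spark package on the classpath (the path below is made up), reading the table pulls the schema and the file list from the transaction log, so Spark never has to list 40,000 objects or open 40,000 footers:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

df = spark.read.format("delta").load("s3://my-data-lake/events_delta/")
df.printSchema()  # schema resolved from the transaction log, not file footers

# On disk the log sits next to the data files, roughly:
#   events_delta/_delta_log/00000000000000000000.json
#   events_delta/_delta_log/00000000000000000010.checkpoint.parquet
```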
If the format is splittable you generally get similar benefits, and Parquet files have footer metadata that points a reader at a specific chunk of the file that can be read independently. With Parquet the writer decides when to close a block (row group), so manually writing smaller files than that can increase read parallelism. But you can only push that so far: I'm pretty sure I've seen Spark combine very small files into a single-threaded read task.
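The knobs behind that behavior are spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes: Spark packs file splits into read tasks up to maxPartitionBytes and charges openCostInBytes per extra file, so a pile of tiny files can still land in one task. A quick sketch (values are just the documented defaults, the path is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# How much data Spark is willing to pack into a single read task.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))  # 128 MB
# The overhead "cost" charged per file when packing small files together.
spark.conf.set("spark.sql.files.openCostInBytes", str(4 * 1024 * 1024))      # 4 MB

df = spark.read.parquet("s3://my-data-lake/events/")
print(df.rdd.getNumPartitions())  # how many read tasks Spark actually planned
```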
Also
> Parquet tables are OK when data is in a single file but are hard to manage and unnecessarily slow when data is in many files
This is not true. Having worked with Spark, I can say it's much better to have multiple "small" files than a single big file.
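A sketch of the usual trade-off at write time (path and file count are made up): one giant file gives you a single writer task and limits read parallelism to whatever row-group splitting the reader can do, while a few hundred reasonably sized files read and write in parallel.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://my-data-lake/events/")

# One giant file: a single writer task, and the reader can only parallelize
# as far as the file's row groups allow.
df.coalesce(1).write.mode("overwrite").parquet("s3://my-data-lake/events_one_file/")

# 200 files of a sane size: parallel writes, and a natural unit of
# parallelism for whatever reads the table later.
df.repartition(200).write.mode("overwrite").parquet("s3://my-data-lake/events_200_files/")
```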