Parquet is a wonderful file format and is a dream to work with compared to CSV. Parquet embeds the schema in the footer metadata, so the query engines don't need to guess what the column names / data types are.
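For example, you can pull the schema and other footer metadata out of a Parquet file without reading any row data; here's a rough sketch with pyarrow (the file name is just a placeholder):

    import pyarrow.parquet as pq

    # Read only the footer: column names and types, no row data is loaded.
    print(pq.read_schema("data.parquet"))

    # The footer also describes row groups, row counts, compression, etc.
    print(pq.read_metadata("data.parquet"))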
Delta Lake (Parquet files + a transaction log) makes it quite pleasant to manage a large lake of data stored in Parquet. There are tons of other features Delta allows for like time travel, schema enforcement, schema evolution, etc.
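If you want to poke at a Delta table without Spark, the delta-rs Python bindings (the deltalake package) work fine. Rough sketch, assuming a table already exists at ./events_delta:

    from deltalake import DeltaTable

    # A Delta table is just Parquet files plus a _delta_log/ directory.
    dt = DeltaTable("./events_delta")
    print(dt.version())        # current version of the table

    # Time travel: load the table as it looked at an earlier version.
    old = DeltaTable("./events_delta", version=0)
    df = old.to_pandas()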
Parquet is still relatively poorly supported in the JVM world, unless that's changed in the last year? Yes, you can use Spark, but that's an absolutely huge dependency just to read a file representing a table of data. The alternative, trying to use poorly documented Hadoop libraries, was only marginally better.
The other problem with Parquet is that it's overly flexible and allows application-specific metadata. That's fine when you use a single tool/library for reading and writing files, but cross-platform it's problematic. Saving a Pandas dataframe to Parquet, for example, will include a bunch of Pandas-specific metadata that other libraries just ignore.
In this case, that meant converting timestamp/datetime columns to an int64 nanosecond representation before writing data from Pandas (something like the sketch below), otherwise you couldn't read those columns with anything that wasn't Pandas. But maybe this has changed since I last used Parquet as a table format.
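Rough idea of the workaround, with made-up column names; the timestamps get stored as plain int64 nanoseconds since the epoch so any Parquet reader can handle them:

    import pandas as pd

    df = pd.DataFrame({"ts": pd.date_range("2021-01-01", periods=3, freq="h")})

    # Store the timestamp as int64 nanoseconds since the Unix epoch,
    # instead of relying on Pandas-specific timestamp handling.
    df["ts_ns"] = df["ts"].astype("int64")
    df.drop(columns=["ts"]).to_parquet("events.parquet", index=False)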
And DuckDB (DuckDB.org) is a lightweight and super fast library/CLI for working with Parquet.
It's SQLite for columnar formats, and for Python users it's only a pip install away. I use it on the command line to inspect and work with Parquet. It's also Pandas compatible and actually more performant than Pandas.
No need to use Spark, which is heavy and has tons of boilerplate.
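Something like this (the file name is made up); DuckDB queries the Parquet file in place and hands back a Pandas DataFrame:

    import duckdb

    con = duckdb.connect()  # in-memory database, nothing to set up

    # Query the Parquet file directly by path; the schema comes from the footer.
    df = con.execute("""
        SELECT passenger_count, avg(fare_amount) AS avg_fare
        FROM 'yellow_tripdata.parquet'
        GROUP BY passenger_count
        ORDER BY passenger_count
    """).fetchdf()

    print(df)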
I've been pretty impressed with Parquet lately. One thing I've missed is a way to group tables. Is there a standard for that? While Parquet is generally column oriented, it has support for metadata about tables of multiple columns. However, I'm not aware of any format that groups the tables, short of just zipping a bunch of files.
For context, this would be for an application that passes SQLite files around, so naturally it has good support at the database level of storage. But Parquet is so fast for some applications, and so well compressed.
Another commenter mentioned Spark; Pandas is another popular one. I haven't used it, but I think it's lighter weight, whereas Spark is aimed more at large distributed computation even though it can run locally.
There's a bunch of these tools that let you treat Parquet files as tables, doing joins, aggregations, etc.
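E.g. with Pandas alone (file and column names are made up):

    import pandas as pd

    orders = pd.read_parquet("orders.parquet")
    customers = pd.read_parquet("customers.parquet")

    # Treat the files like tables: join, then aggregate.
    joined = orders.merge(customers, on="customer_id", how="inner")
    summary = joined.groupby("country")["amount"].sum().reset_index()
    print(summary)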
Isn't Apache Arrow an in-memory format that the various DataFrame libraries can standardise on to interact with each other, e.g. for inter-process communication (IPC)?
My understanding is your raw data on disk is still in a format such as Parquet, but when you load that Parquet into your application it's stored as Arrow in memory for processing?
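Something like this, I think (file name made up): Parquet on disk, Arrow once it's loaded, then hand it to whatever DataFrame library you like:

    import pyarrow.parquet as pq

    # Parquet file on disk -> Arrow table in memory.
    table = pq.read_table("events.parquet")
    print(table.schema)

    # The same in-memory Arrow data can be handed to Pandas for processing.
    df = table.to_pandas()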
Parquet used to be poorly supported, but now it's well supported by almost all languages. You can even view Parquet files in text editors now, but that's not something I've ever needed (https://blog.jetbrains.com/blog/2020/02/25/update-on-big-dat...).
Parquet column pruning & predicate pushdown filtering allow for great query performance improvements, see this blog post I wrote for benchmarks: https://coiled.io/blog/parquet-file-column-pruning-predicate...
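Quick illustration of both with pyarrow (file and column names made up): column pruning reads only the listed columns, and the filter lets the reader skip row groups whose statistics can't match:

    import pyarrow.parquet as pq

    table = pq.read_table(
        "trips.parquet",
        columns=["passenger_count", "fare_amount"],  # column pruning
        filters=[("passenger_count", ">", 2)],       # predicate pushdown
    )
    df = table.to_pandas()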