Parquet is a wonderful file format and is a dream to work with compared to CSV. Parquet embeds the schema in the footer metadata, so the query engines don't need to guess what the column names / data types are.
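For example, you can pull the schema and other footer metadata out of a Parquet file without reading any row data; here's a rough sketch with pyarrow (the file name is just a placeholder):

    import pyarrow.parquet as pq

    # Read only the footer: column names and types, no row data is loaded.
    print(pq.read_schema("data.parquet"))

    # The footer also describes row groups, row counts, compression, etc.
    print(pq.read_metadata("data.parquet"))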
Delta Lake (Parquet files + a transaction log) makes it quite pleasant to manage a large lake of data stored in Parquet. There are tons of other features Delta allows for like time travel, schema enforcement, schema evolution, etc.
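If you want to poke at a Delta table without Spark, the delta-rs Python bindings (the deltalake package) work fine. Rough sketch, assuming a table already exists at ./events_delta:

    from deltalake import DeltaTable

    # A Delta table is just Parquet files plus a _delta_log/ directory.
    dt = DeltaTable("./events_delta")
    print(dt.version())        # current version of the table

    # Time travel: load the table as it looked at an earlier version.
    old = DeltaTable("./events_delta", version=0)
    df = old.to_pandas()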
Parquet is still relatively poorly supported in the JVM world, unless that's changed in the last year? Yes, you can use Spark, but that's an absolutely huge dependency just to read a file representing a table of data. The alternative, trying to use poorly documented Hadoop libraries, was only marginally better.
The other problem with Parquet is that it's overly flexible and allows application-specific metadata. That's fine when you use a single tool/library for reading and writing files, but cross-platform it's problematic. Saving a Pandas dataframe to Parquet, for example, will include a bunch of Pandas-specific metadata that other libraries just ignore.
In this case, that meant converting timestamp/datetime columns to an int64 nanosecond representation before writing data from Pandas (something like the sketch below), otherwise you couldn't read those columns with anything that wasn't Pandas. But maybe this has changed since I last used Parquet as a table format.
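Rough idea of the workaround, with made-up column names; the timestamps get stored as plain int64 nanoseconds since the epoch so any Parquet reader can handle them:

    import pandas as pd

    df = pd.DataFrame({"ts": pd.date_range("2021-01-01", periods=3, freq="h")})

    # Store the timestamp as int64 nanoseconds since the Unix epoch,
    # instead of relying on Pandas-specific timestamp handling.
    df["ts_ns"] = df["ts"].astype("int64")
    df.drop(columns=["ts"]).to_parquet("events.parquet", index=False)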
And DuckDB (DuckDB.org) is a lightweight and super fast library/CLI for working with Parquet.
It's SQLite for columnar formats, and for Python users it's only a pip install away. I use it on the command line to inspect and work with Parquet. It's also Pandas compatible and actually more performant than Pandas.
No need to use Spark, which is heavy and has tons of boilerplate.
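Something like this (the file name is made up); DuckDB queries the Parquet file in place and hands back a Pandas DataFrame:

    import duckdb

    con = duckdb.connect()  # in-memory database, nothing to set up

    # Query the Parquet file directly by path; the schema comes from the footer.
    df = con.execute("""
        SELECT passenger_count, avg(fare_amount) AS avg_fare
        FROM 'yellow_tripdata.parquet'
        GROUP BY passenger_count
        ORDER BY passenger_count
    """).fetchdf()

    print(df)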
I've been pretty impressed with Parquet lately. One thing I've missed is a way to group tables. Is there a standard for that? While Parquet is generally column oriented, it has support for metadata about tables of multiple columns. However, I'm not aware of any format that groups the tables, short of just zipping a bunch of files.
For context, this would be for an application that passes SQLite files around, so naturally it has good support at the database level of storage. But Parquet is so fast for some applications, and so well compressed.
Another commenter mentioned Spark; Pandas is another popular one. I haven't used it, but I think it's lighter weight, whereas Spark is aimed more at large distributed computation even though it can run locally.
There's a bunch of these tools that let you treat Parquet files as tables, doing joins, aggregations, etc.
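E.g. with Pandas alone (file and column names are made up):

    import pandas as pd

    orders = pd.read_parquet("orders.parquet")
    customers = pd.read_parquet("customers.parquet")

    # Treat the files like tables: join, then aggregate.
    joined = orders.merge(customers, on="customer_id", how="inner")
    summary = joined.groupby("country")["amount"].sum().reset_index()
    print(summary)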
Isn't Apache Arrow an in-memory format that the various DataFrame libraries can standardise on to interact with each other, e.g. for inter-process communication (IPC)?
My understanding is your raw data on disk is still in a format such as Parquet, but when you load that Parquet into your application it's stored as Arrow in memory for processing?
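Something like this, I think (file name made up): Parquet on disk, Arrow once it's loaded, then hand it to whatever DataFrame library you like:

    import pyarrow.parquet as pq

    # Parquet file on disk -> Arrow table in memory.
    table = pq.read_table("events.parquet")
    print(table.schema)

    # The same in-memory Arrow data can be handed to Pandas for processing.
    df = table.to_pandas()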
Parquet used to be poorly supported, but now it's well supported by almost all languages. You can even view Parquet files in text editors now, but that's not something I've ever needed (https://blog.jetbrains.com/blog/2020/02/25/update-on-big-dat...).
Parquet column pruning & predicate pushdown filtering allow for great query performance improvements, see this blog post I wrote for benchmarks: https://coiled.io/blog/parquet-file-column-pruning-predicate...
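Quick illustration of both with pyarrow (file and column names made up): column pruning reads only the listed columns, and the filter lets the reader skip row groups whose statistics can't match:

    import pyarrow.parquet as pq

    table = pq.read_table(
        "trips.parquet",
        columns=["passenger_count", "fare_amount"],  # column pruning
        filters=[("passenger_count", ">", 2)],       # predicate pushdown
    )
    df = table.to_pandas()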