ianmcook's comments

ianmcook · on Jan 29, 2025

Arrow developer here, we've invested a lot in seamless DuckDB interop, great to see it getting traction.

Recent blog post here that breaks down why the Arrow format (which underlies Arrow Flight) is so fast in applications like this: https://arrow.apache.org/blog/2025/01/10/arrow-result-transf...

chrisjc · on Jan 29, 2025

Thank you for all the work you guys do. The Arrow ecosystem is just absolutely incredible.

My few gripes related to interop with duckdb are related to Arrow scanning/pushdowns. And this extends to interop with other projects like pyiceberg too.

Registering an Arrow Dataset (or pyiceberg scan) as a "duckdb relation" (virtual view) is still a little problematic. Querying these "relations" does not always result in an optimal outcome.

For Arrow datasets, you can intercept the duckdb pushdown, but duckdb will have already "optimized" the plan to its liking, and any scanning restrictions that may have been more advantageous based on the nuances of the dataset might have been lost. Eg:

    WHERE A IN (3, 5, 7)

is presented to the Arrow scanner (pushdown) as "A is between 3 and 7 inclusive" (https://duckdb.org/docs/guides/performance/indexing.html#zon...).

Perhaps in a similar way, turning an pyiceberg scan into a relation for duckdb effectively takes the entire scan and creates an Arrow Table rather than some kind of pushdown/"scan plan" for duckdb to potentially make more efficient with its READ_PARQUET() functionality.

Most of this is probably dependent on duckdb development, but all of the incredible interop work done across communities/ecosystems so far gives me a lot of confidence that these will soon be matters of the past.

1egg0myegg0 · on Jan 30, 2025

IN list filter predicate pushdown is much improved in DuckDB 1.2, coming in about a week! I am not sure if it applies to Arrow yet or not. Disclaimer: I work at MotherDuck and DuckDB Labs

ianmcook · on Jan 30, 2025

@1egg0myegg0 that's great to hear. I'll check to see if it applies to Arrow.

Another performance issue with DuckDB/Arrow integration that we've been working to solve is that Arrow lacked a canonical way to pass statistics along with a stream of data. So for example if you're reading Parquet files and passing them to DuckDB, you would lose the ability to pass the Parquet column statistics to DuckDB for things like join order optimization. We recently added an API to Arrow to enable passing statistics, and the DuckDB devs are working to implement this. Discussion at https://github.com/apache/arrow/issues/38837.

praptak · on Jan 29, 2025

Congratulations! Now you can say you have your ducks in Arrow.

ianmcook · on Aug 22, 2023

Anyone know what format they are serializing the data in to move it between Excel and Python? Are they using Apache Arrow?

ianmcook · on Feb 4, 2021

Thanks for the heads up. The post is intended to be up but there's an intermittent error happening. It's been reported to the Apache infrastructure team.

ianmcook · on Feb 3, 2021

Parquet is not based on Arrow. The Parquet libraries are built into Arrow, but the two projects are separate and Arrow is not a dependency of Parquet.

mr_toad · on Feb 4, 2021

Arrow has definitely influenced the design of Parquet, they’re meant to compliment each other.

ianmcook · on Feb 3, 2021

From https://arrow.apache.org/faq/: "Parquet files cannot be directly operated on but must be decoded in large chunks... Arrow is an in-memory format meant for direct and efficient use for computational purposes. Arrow data is... laid out in natural format for the CPU, so that data can be accessed at arbitrary places at full speed."

ianmcook · on Feb 3, 2021

The Arrow Feather format is an on-disk representation of Arrow memory. To read a Feather file, Arrow just copies it byte for byte from disk into memory. Or Arrow can memory-map a Feather file so you can operate on it without reading the whole file into memory.

waynesonfire · on Feb 3, 2021

That's exactly how I read every data format.

The advantage you describe is in the operations that can performed against the data. It would be nice to see what this API looks like and how it compares to flatbuffers / pq.

To help me understand this benefit, can you talk through what it's like to add 1 to each record and write it back to disk?

ianmcook · on Feb 3, 2021

Re this second point: Arrow opens up a great deal of language and framework flexibility for data engineering-type tasks. Pre-Arrow, common kinds of data warehouse ETL tasks like writing Parquet files with explicit control over column types, compression, etc. often meant you needed to use Python, probably with PySpark, or maybe one of the other Spark API languages. With Arrow now there are a bunch more languages where you can code up tasks like this, with consistent results. Less code switching, lower complexity, less cognitive overhead.