I just don't believe you. My CPU doesn't understand Apache Arrow 3.0.

kristjansson · on Feb 3, 2021

Not GP post, but it might have been better stated as 'eliminating serde overhead'. Arrow's RPC serialization [1] is basically Protobuf, with a whole lot of hacks to eliminate copies on both ends of the wire. So it's still 'serde', but markedly more efficient for large blocks of tabular-ish data.

[1]: https://arrow.apache.org/docs/format/Flight.html

wesm · on Feb 3, 2021

> Arrow's serialization is Protobuf

Incorrect. Only Arrow Flight embeds the Arrow wire format in a Protocol Buffer, but the Arrow protocol itself does not use Protobuf.

kristjansson · on Feb 3, 2021

Apologies, off base there. Edited with a pointer to Flight :)

mumblemumble · on Feb 3, 2021

So, there are several components to Arrow. One of them transfers data using IPC, and naturally needs to serialize. The other uses shared memory, which eliminates the need for serde.

Sadly, the latter isn't (yet) well supported anywhere but Python and C++. If you can/do use it, though, data are just kept as arrays in memory. Which is exactly what the CPU wants to see.
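To make the shared-memory point concrete, here is a minimal sketch using only Python's standard library. It is not Arrow's API: the helper names (`publish_column`, `sum_column`, `demo`) are hypothetical, and the "column" is just a contiguous run of native-endian int64 values. The reader attaches to the block by name and reads the array in place, with no encode/decode step.

```python
from multiprocessing import shared_memory

ITEM = 8  # bytes per int64 value (assumption: one fixed-width column)


def publish_column(values):
    """Writer side: place the values in a named shared-memory block."""
    nbytes = ITEM * len(values)
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    # Slice to nbytes because some platforms round the block up to a page.
    with shm.buf[:nbytes].cast("q") as col:
        for i, v in enumerate(values):
            col[i] = v
    return shm


def sum_column(name, length):
    """Reader side: attach by name and read the int64 array in place."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        with shm.buf[: ITEM * length].cast("q") as col:
            return sum(col)  # reads straight from the shared block
    finally:
        shm.close()


def demo():
    owner = publish_column([1, 2, 3, 4])
    try:
        # In real use this call would run in a different process.
        return sum_column(owner.name, 4)
    finally:
        owner.close()
        owner.unlink()
```

Both sides only agree on a name, a length, and a layout, which is roughly the role a common in-memory format plays. The views are released before `close()` because the block cannot be unmapped while views on it are still alive.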

superdimwit · on Feb 4, 2021

Shared memory format is supported in Julia too!

mumblemumble · on Feb 4, 2021

Oh, that's fantastic to hear. Right now I'm living in Python because that's the galactic center, but I've also been anxious to find a low-cost escape hatch that doesn't just lead to C++.

oscardssmith · on Feb 4, 2021

You should definitely check out Julia then. There are a few parts of the language that use C/C++ libraries (BLAS and MPFR are the main ones), but 95% of the time, your stack will be Julia all the way down.

macksd · on Feb 4, 2021

In the best case, your CPU needn't really be involved beyond a call to set up mapped memory. It is (among other things) a common memory format, so you can be working with it in one framework, and then start working with it in another, and its location and format in memory haven't changed at all.
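A rough illustration of the "set up mapped memory and go" idea, again with only the standard library rather than Arrow itself. The file layout (a bare run of native-endian int64 values) and the function names are assumptions made for the sketch. The reader maps the file and indexes into it directly, so nothing is parsed or copied into a separate structure first.

```python
import mmap
import os
import tempfile
from array import array


def write_column(path, values):
    """Dump the values to disk as raw native-endian int64."""
    with open(path, "wb") as f:
        array("q", values).tofile(f)


def read_first_last(path):
    """Map the file and view it as an int64 array without decoding it."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # cast() reinterprets the mapped bytes; no copy is made.
            with memoryview(mm) as raw, raw.cast("q") as col:
                return col[0], col[-1], len(col)


def demo_mmap():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "column.bin")
        write_column(path, range(1000))
        return read_first_last(path)
```

Only the pages actually touched are brought into memory by the OS, so reading the first and last element of a very large file stays cheap. Any other tool that knows the same layout could map the same file and see identical bytes.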