I just don't believe you. My CPU doesn't understand Apache Arrow 3.0.



Not GP post, but it might have been better stated as 'eliminating serde overhead'. Arrow's RPC serialization [1] is basically Protobuf, with a whole lot of hacks to eliminate copies on both ends of the wire. So it's still 'serde', but markedly more efficient for large blocks of tabular-ish data.

[1]: https://arrow.apache.org/docs/format/Flight.html


> Arrow's serialization is Protobuf

Incorrect. Only Arrow Flight embeds the Arrow wire format in a Protocol Buffer, but the Arrow protocol itself does not use Protobuf.


Apologies, off base there. Edited with a pointer to Flight :)


So, there are several components to Arrow. One of them transfers data using IPC, and naturally needs to serialize. The other uses shared memory, which eliminates the need for serde.

Sadly, the latter isn't (yet) well supported anywhere but Python and C++. If you can/do use it, though, the data are just kept as arrays in memory, which is exactly what the CPU wants to see.
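
To make the serde vs. no-serde distinction concrete, here is a minimal sketch using only the Python standard library. It is not Arrow's API: the length-prefixed encoding and both function names are made up for illustration, and a `memoryview` within one process stands in for real shared memory.

```python
import struct
from array import array


def roundtrip_with_serde(column):
    """Encode to a length-prefixed byte string, then decode: two full copies."""
    # Hypothetical wire format: uint32 count, then little-endian int64 values.
    payload = struct.pack(f"<I{len(column)}q", len(column), *column)
    (n,) = struct.unpack_from("<I", payload)
    return array("q", struct.unpack_from(f"<{n}q", payload, 4))


def share_without_serde(column):
    """Hand out a view over the same buffer: no encode, no decode, no copy."""
    return memoryview(column)


column = array("q", [10, 20, 30])
copied = roundtrip_with_serde(column)
shared = share_without_serde(column)

column[0] = 99  # the producer updates the data in place

print(copied[0])  # 10: the decoded copy is detached from the original
print(shared[0])  # 99: the view reads the same memory the producer wrote
```

The serde path touches every value twice (pack, then unpack), while the view costs the same no matter how large the column is.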


Shared memory format is supported in Julia too!


Oh, that's fantastic to hear. Right now I'm living in Python because that's the galactic center, but I've also been anxious to find a low-cost escape hatch that doesn't just lead to C++.


You should definitely check out Julia then. There are a few parts of the language that use C/C++ libraries (BLAS and MPFR are the main ones), but 95% of the time, your stack will be Julia all the way down.


In the best case, your CPU needn't really be involved beyond a call to set up mapped memory. It is (among other things) a common memory format, so you can be working with it in one framework, and then start working with it in another, and its location and format in memory haven't changed at all.
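
A rough sketch of the "set up mapped memory and you're done" idea, again with only the Python standard library rather than Arrow itself. The raw native-endian int64 file layout, the file name, and the function names are assumptions for the example; real Arrow files carry schema metadata that this skips.

```python
import mmap
import os
import tempfile
from array import array


def write_int64_column(path, values):
    """Dump values to disk as raw native-endian int64 with no framing."""
    with open(path, "wb") as f:
        array("q", values).tofile(f)


def read_mapped_column(path):
    """Map the file and read it in place; there is no parse or decode step."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Reinterpret the mapped bytes as int64 without copying them.
        with memoryview(mm) as raw, raw.cast("q") as col:
            return len(col), sum(col)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "column.bin")
    write_int64_column(path, [1, 2, 3, 4])
    count, total = read_mapped_column(path)

print(count, total)  # 4 10
```

Any other process or runtime that agrees on the layout could map the same file and see the same bytes, which is the point of having a common in-memory format.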



