There's no serde by design (aside from inspecting a tiny piece of metadata indic...

waynesonfire · on Feb 3, 2021

Of course there is. There is always deserialization. The data format is most definitely not native to the CPU.

wesm · on Feb 3, 2021

I challenge you to have a closer look at the project.

Deserialization by definition requires bytes or bits to be relocated from their position in the wire protocol to other data structures which are used for processing. Arrow does not require any bytes or bits to be relocated. So if a "C array of doubles" is not native to the CPU, then I don't know what is.

throwaway894345 · on Feb 3, 2021

Perhaps "zero-copy" is a more precise or well-defined term?

waynesonfire · on Feb 3, 2021

CPUs come in many flavors. One area where they differ is in the way that bytes of a word are represented in memory. Two common formats are Big Endian and Little Endian. This is an example where a "C array of doubles" would be incompatible and some form of deserilaziation would be needed.

My understanding is that an apache arrow library provides an API to manipulate the format in a platform agnostic way. But to claim that it eliminates deserialization is false.

dlnovell · on Feb 3, 2021

I wish I had your confidence, to argue with Wes McKinney about the details of how Arrow works

mumblemumble · on Feb 4, 2021

Iunno. That looks like "where angels fear to tread" territory to me.

rscho · on Feb 4, 2021

Hiya, a bit of OT (again, last one promise!): I saw your comment about type systems in data science the other day (https://news.ycombinator.com/item?id=25923839). From what I understood, it seems you want a contract system, wouldn't you think? The reason I'm asking is that I'm fishing for opinions on building data science infra in Racket (and saw your deleted comment in https://news.ycombinator.com/item?id=26008869 so thought you'd perhaps be interested), and Racket (and R) dataframes happen to support contracts on their columns.

ska · on Feb 3, 2021

You are right that if you want to do this in a heterogeneous computing environment, one of the layouts is going to be "wrong" and require an extra step regardless of how you do this.

But ... (a) this is way less common than it was decades ago (rare use cases we are talking about here ) and (b) it seems to be addressed in a sensible way (i.e. Arrow defaults to little-endian, but you could swap it on a big-endian network). I think it includes utility functions for conversion also.

So the usual case incurs no overhead, and the corner cases are covered. I'm not sure exactly what you are complaining about, unless it's the lack of liberally sprinkling ("no deserialization in most use cases") or whatever around the comments?

johncolanduoni · on Feb 3, 2021

Big endian is pretty rare among anything you’d be doing in-memory analytics on. Looks like you can choose the endianess of the format if you need to, but it’s little endian by default: https://arrow.apache.org/docs/format/Columnar.html. I’d suggest reading up on the format, it covers the properties it provides to be friendly to direct random access.