
Exactly. Here's a specific example.

Read a Parquet file into a Pandas DataFrame, then convert the Pandas DataFrame to a Spark DataFrame. Spark and Pandas use the same Arrow memory format, so no serde is needed.

See the "Standardization Saves" diagram here: https://arrow.apache.org/overview/
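
A minimal sketch of that flow (assuming PySpark 3.x with the Arrow conversion path turned on; "data.parquet" is just a stand-in file name):

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.sql.execution.arrow.pyspark.enabled", "true")
             .getOrCreate())

    # pandas reads Parquet through pyarrow by default
    pdf = pd.read_parquet("data.parquet")

    # With the Arrow path enabled, Spark ingests the pandas data as
    # Arrow record batches instead of pickling it row by row
    sdf = spark.createDataFrame(pdf)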




Is this merely planned, or does pandas now use Arrow’s format? I was under the impression that pandas was mostly numpy under the hood, with some tweaks to handle newer functionality like nullable arrays. But you’re saying that Arrow data can be used by pandas without conversion or copying into new memory?


Pandas is still numpy under the hood, but you can create a numpy array that points to memory allocated elsewhere, so conversion to pandas can be done without copies in the nice cases where the data models match (a simple data type, no nulls, etc.): https://arrow.apache.org/docs/python/pandas.html#zero-copy-s...
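
A minimal sketch of that zero-copy path with pyarrow (the nice case: a numeric type, no nulls):

    import pandas as pd
    import pyarrow as pa

    arr = pa.array([1.0, 2.0, 3.0])  # float64, no nulls

    # Raises ArrowInvalid rather than silently copying if zero-copy
    # isn't possible for this data
    np_view = arr.to_numpy(zero_copy_only=True)

    series = pd.Series(np_view)  # pandas wraps the same underlying buffer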


I hope you don’t mind me asking a dumb question, but how does this differ from the role that, say, Protocol Buffers fills? To my ears they both facilitate data exchange. Are they comparable in that sense?


Better to compare it to Cap'n Proto. Arrow data is already laid out in a usable way: for example, an Arrow column of int64s is an 8-byte-aligned memory region of size 8*N bytes (plus a validity bitmap for nulls), ready for random access or vectorized operations.
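
You can poke at that layout directly with pyarrow (a sketch; exact buffer sizes can include padding):

    import pyarrow as pa

    col = pa.array([10, 20, 30, 40], type=pa.int64())
    validity, data = col.buffers()  # validity bitmap is None when there are no nulls
    print(data.size)         # 32 here: 8 bytes * 4 values, contiguous int64s
    print(data.address % 8)  # 0: suitably aligned for vectorized access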

Protobuf, on the other hand, would encode those values as variable-width integers. This saves a lot of space, which might be better for transfer over a network, but means that writers have to take a usable in-memory array and serialize it, and readers have to do the reverse on their end.

Think of Arrow as standardized shared memory using struct-of-arrays layout, Cap'n Proto as standardized shared memory using array-of-structs layout, and Protobuf as a lightweight purpose-built compression algorithm for structs.
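
In plain Python terms (purely illustrative, not either library's actual API):

    # The same three records in each layout
    aos = [(1, "a"), (2, "b"), (3, "c")]             # array-of-structs: rows kept together
    soa = {"id": [1, 2, 3], "tag": ["a", "b", "c"]}  # struct-of-arrays: columns kept together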


> Think of Arrow as standardized shared memory using struct-of-arrays layout, Cap'n Proto as standardized shared memory using array-of-structs layout

I just want to say thank you for this part of the sentence. I understand struct-of-arrays vs array-of-structs, and now I finally understand what the heck Arrow is.



Protobuf provides the fixed64 type, and combined with `packed` encoding (the default in proto3, optional in proto2) it gives you a linear layout of fixed-size values. You still would not get natural alignment from protobuf's wire format when reading from an arbitrary disk or network buffer; to get alignment you'd have to move or copy the vector. Protobuf's C++ generated code provides RepeatedField, which behaves in most respects like std::vector, but since protobuf is partly a wire format and partly a library, users are free to ignore the library and use whatever code is most convenient for their application.

TL;DR: variable-width numbers in protobuf are optional.
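
A rough illustration of the tradeoff (a hand-rolled sketch of protobuf's unsigned-varint wire encoding, not the protobuf library itself):

    import struct

    def varint(n):
        # Protobuf-style unsigned varint: 7 data bits per byte,
        # high bit set on every byte except the last
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            out.append(byte | (0x80 if n else 0))
            if not n:
                return bytes(out)

    values = [1, 300, 2**40]
    print(sum(len(varint(v)) for v in values))  # 9 bytes: size tracks magnitude
    print(len(struct.pack("<3Q", *values)))     # 24 bytes: fixed64 is always 8 per value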


Protobufs still get encoded and decoded by each client when loaded into memory. Arrow is a little more like "FlatBuffers, but designed for common data-intensive columnar access patterns."


Arrow does actually use FlatBuffers for metadata storage.



Is that at the cost of the ability to do schema evolution?



