
Can someone ELI5 what problems are best solved by apache arrow?



The premise behind Arrow is that when you want to share data with another system, or even between processes on the same machine, most of the compute time is spent serializing and deserializing data. Arrow removes that step by defining a common columnar format that can be used from many different programming languages. There's more to Arrow than just the format, too; other pieces make working with data even easier, like better over-the-wire transfers (Arrow Flight). How would this manifest for your customers using your applications? They'd likely see speeds increase. Arrow makes a lot of sense when working with lots of data in analytical or data science use cases.


Not only between processes, but also between languages in a single process. In this POC I spun up a Python interpreter inside a Go process and passed the Arrow data buffer between them in constant time. https://github.com/nickpoorman/go-py-arrow-bridge


How is this any better than something like flatbuffers though?


For one, Arrow is columnar.

The idea of zero-copy serialization is shared between Arrow and FlatBuffers.


This is explained in the Arrow FAQ, which says they use FlatBuffers internally. Arrow also supports more complex data types.


> saying they use flatbuffers internally

Only for IPC support - Arrow data format does not use flatbuffers.


Exactly. Some specific examples.

Read a Parquet file into a Pandas DataFrame. Then read the Pandas DataFrame into a Spark DataFrame. Spark & Pandas are using the same Arrow memory format, so no serde is needed.

See the "Standardization Saves" diagram here: https://arrow.apache.org/overview/


Is this merely planned, or does pandas now use Arrow’s format? I was under the impression that pandas was mostly numpy under the hood with some tweaks to handle some of the newer functionality like nullable arrays. But you’re saying that arrow data can be used by pandas without conversion or copying into new memory?


Pandas is still numpy under the hood, but you can create a numpy array that points to memory that was allocated elsewhere, so conversion to pandas can be done without copies in nice cases where the data model is the same (simple data type, no nulls, etc.): https://arrow.apache.org/docs/python/pandas.html#zero-copy-s...
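A minimal sketch of that zero-copy case: a flat int64 Arrow array with no nulls can be handed to numpy (and hence pandas) without copying, and the zero_copy_only flag makes pyarrow raise if a copy would be required.

    import pyarrow as pa

    arr = pa.array([1, 2, 3, 4], type=pa.int64())
    np_view = arr.to_numpy(zero_copy_only=True)  # fails rather than silently copying
    print(np_view.dtype)                         # int64, backed by the Arrow buffer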


I hope you don’t mind me asking dumb questions, but how does this differ from the role that say Protocol Buffers fills? To my ears they both facilitate data exchange. Are they comparable in that sense?


Better to compare it to Cap'n Proto instead. Arrow data is already laid out in a usable way. For example, an Arrow column of int64s is an 8-byte aligned memory region of size 8*N bytes (plus a bit vector for nullity), ready for random access or vectorized operations.

Protobuf, on the other hand, would encode those values as variable-width integers. This saves a lot of space, which might be better for transfer over a network, but means that writers have to take a usable in-memory array and serialize it, and readers have to do the reverse on their end.

Think of Arrow as standardized shared memory using struct-of-arrays layout, Cap'n Proto as standardized shared memory using array-of-structs layout, and Protobuf as a lightweight purpose-built compression algorithm for structs.
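A quick way to see that layout from Python (the values are made up): an Arrow int64 column carries a validity bitmap plus one contiguous data buffer with 8 bytes per value.

    import pyarrow as pa

    col = pa.array([10, 20, None, 40], type=pa.int64())
    validity, data = col.buffers()  # [null bitmap, values]
    print(col.type.bit_width)       # 64 bits per value
    print(data.size)                # at least 8 * 4 bytes (allocations may be padded)
    print(col.null_count)           # 1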


> Think of Arrow as standardized shared memory using struct-of-arrays layout, Cap'n Proto as standardized shared memory using array-of-structs layout

I just want to say thank you for this part of the sentence. I understand struct-of-arrays vs array-of-structs, and now I finally understand what the heck Arrow is.



Protobuf provides the fixed64 type, and when combined with `packed` (the default in proto3, optional in proto2) it gives you a linear layout of fixed-size values. You would not get natural alignment from protobuf's wire format if you read it from an arbitrary disk or net buffer; to get alignment you'd need to move or copy the vector. Protobuf's C++ generated code provides RepeatedField, which behaves in most respects like std::vector, but inasmuch as protobuf is partly a wire format and partly a library, users are free to ignore the library and use whatever code is most convenient to their application.

TL;DR variable-width numbers in protobuf are optional.


Protobufs still get encoded and decoded by each client when loaded into memory. Arrow is a little bit more like "flatbuffers, but designed for common data-intensive columnar access patterns".


Arrow does actually use flatbuffers for metadata storage.



Is that at the cost of the ability to do schema evolution?


Yes.

A second important point is the recognition that data tooling often re-implements the same algorithms again and again, often in ways which are not particularly optimised, because the in-memory representation of data is different between tools. Arrow offers the potential to do this once, and do it well. That way, future data analysis libraries (e.g. a hypothetical pandas 2) can concentrate on good API design without having to re-invent the wheel.

And a third is that Arrow allows data to be chunked and batched (within a particular tool), meaning that computations can be streamed through memory rather than the whole dataframe needing to be stored in memory. A little bit like how Spark partitions data and sends it to different nodes for computation, except all on the same machine. This also enables parallelisation by default. With the core counts of modern CPUs, this means Arrow is likely to be extremely fast.
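A hedged sketch of that streaming style with pyarrow's dataset API (the path and column name are hypothetical): each record batch is reduced as it arrives, so the whole table never has to be resident in memory.

    import pyarrow.dataset as ds
    import pyarrow.compute as pc

    dataset = ds.dataset("data/", format="parquet")
    total = 0
    for batch in dataset.to_batches(columns=["amount"]):  # one chunk at a time
        total += pc.sum(batch.column("amount")).as_py() or 0
    print(total)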


Re this second point: Arrow opens up a great deal of language and framework flexibility for data engineering-type tasks. Pre-Arrow, common kinds of data warehouse ETL tasks like writing Parquet files with explicit control over column types, compression, etc. often meant you needed to use Python, probably with PySpark, or maybe one of the other Spark API languages. With Arrow now there are a bunch more languages where you can code up tasks like this, with consistent results. Less code switching, lower complexity, less cognitive overhead.


> most of the compute time spent is in serializing and deserializing data.

This is to be viewed in light of how hardware is evolving now. CPU compute power is no longer growing as much (at least for individual cores).

But one thing that's still doubling on a regular basis is memory capacity of all kinds (RAM, SSD, etc) and bandwidth of all kinds (PCIe lanes, networking, etc). This divide is getting large and will only continue to increase.

Which brings me to my main point:

You can't be serializing/deserializing data on the CPU. What you want is to have the CPU coordinate the SSD to copy chunks directly, as is, to the NIC/app/etc.

Short of having your RAM doing compute work*, you would be leaving performance on the table.

----

* Which is starting to appear (https://www.upmem.com/technology/), but that's not quite there yet.


> What you want is to have the CPU coordinate the SSD to copy chunks directly -and as is- to the NIC/app/etc.

Isn't that what DMA is supposed to be?

Also, there's work in getting GPUs to load data straight from NVME drives, bypassing both the CPU and system memory. So you could certainly do similar things with the PCIE bus.

https://developer.nvidia.com/blog/gpudirect-storage/

A big problem is that a lot of data isn't laid out in a way that's ready to be stuffed in memory. When you see a game spending a long time loading data, that's usually why. The CPU will do a bunch of processing to map on disk data structures to a more efficient memory representation.

If you can improve the on-disk representation to more closely match what's in memory, then CPUs are generally more than fast enough to copy bytes around. They are definitely faster than system RAM.


This is backward -- this sort of serialization is overwhelmingly bottlenecked on bandwidth (not CPU). (Multi-core) compute improvements have been outpacing bandwidth improvements for decades and have not stopped. Serialization is a bottleneck because compute is fast/cheap and bandwidth is precious. This is also reflected in the relative energy to move bytes being increasingly larger than the energy to do some arithmetic on those bytes.


An interesting perspective on the future of computer architecture but it doesn't align well with my experience. CPUs are easier to build and although a lot of ink has been spilled about the end of Moore's Law, it remains the case that we are still on Moore's curve for number of transistors, and since about 15 years ago we are now also on the same slope for # of cores per CPU. We also still enjoy increasing single-thread performance, even if not at the rates of past innovation.

DRAM, by contrast, is currently stuck. We need materials science breakthroughs to get beyond the capacitor aspect ratio challenge. RAM is still cheap but as a systems architect you should get used to the idea that the amount of DRAM per core will fall in the future, by amounts that might surprise you.


I'm curious too. Does this mean data is first normalized into this "columnar format" as the primary source and all applications are purely working off this format?

I don't yet see clearly how the data is being transferred if no serializing/deserializing is taking place; maybe someone here can help fill in further. It almost sounds like there is some specialized bridge for the data transfer and I don't have the right words for it.


I think you've got it. Data is shared by passing a pointer to it, so the data doesn't need to be copied to different spots in memory (or, if it is, it's one efficient block copy rather than millions of tiny copies).


Is there a strong use case around passing data from a backend to a frontend, e.g. from a pandas DataFrame on the server into a JS implementation on the client side, to use in a UI? As opposed to data pipelining among processing servers.


Yes definitely. For example, the perspective pivoting engine (https://github.com/finos/perspective) supports arrow natively, so you can stream arrow buffers from a big data system directly to be manipulated in your browser (treating the browser as a thick client application). It's a single copy from network buffer into the webassembly heap to get it into the library.


Not as much, because you are still paying the overhead of shipping data across the internet and through the browser.


Question: doesn't Parquet already do that?


From https://arrow.apache.org/faq/: "Parquet files cannot be directly operated on but must be decoded in large chunks... Arrow is an in-memory format meant for direct and efficient use for computational purposes. Arrow data is... laid out in natural format for the CPU, so that data can be accessed at arbitrary places at full speed."


Yes. But parquet is now based on Apache Arrow.


Parquet is not based on Arrow. The Parquet libraries are built into Arrow, but the two projects are separate and Arrow is not a dependency of Parquet.


Arrow has definitely influenced the design of Parquet; they're meant to complement each other.


In Perspective (https://github.com/finos/perspective), we use Apache Arrow as a fast, cross-language/cross-network data encoding that is extremely useful for in-browser data visualization and analytics. Some benefits:

- super fast read/write compared to CSV & JSON (Perspective and Arrow share an extremely similar column encoding scheme, so we can memcpy Arrow columns into Perspective wholesale instead of reading a dataset iteratively).

- the ability to send Arrow binaries as an ArrayBuffer between a Python server and a WASM client, which guarantees compatibility and removes the overhead of JSON serialization/deserialization.

- because Arrow columns are strictly typed, there's no need to infer data types - this helps with speed and correctness.

- Compared to JSON/CSV, Arrow binaries have a super compact encoding that reduces network transport time.

For us, building on top of Apache Arrow (and using it wherever we can) reduces the friction of passing around data between clients, servers, and runtimes in different languages, and allows larger datasets to be efficiently visualized and analyzed in the browser context.
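Not Perspective's actual server code, just a minimal sketch of the Python side of that pattern (column names and values are made up): write a table into an Arrow IPC stream and ship the resulting bytes to the browser, where they arrive as an ArrayBuffer.

    import pyarrow as pa

    table = pa.table({"px": [101.2, 101.5], "qty": [3, 7]})

    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)

    payload = sink.getvalue().to_pybytes()  # send over HTTP or a WebSocket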


To really grok why this is useful, put yourself in the shoes of a data warehouse user or administrator. This is a farm of machines with disks that hold data too big to have in a typical RDBMS. Because the data is so big and stored spread out across machines, a whole parallel universe of data processing tools and practices exists that run various computations over shards of the data locally on each server, and merge the results etc. (Map-reduce originally from Google was the first famous example). Then there is a cottage industry of wrappers around this kind of system to let you use SQL to query the data, make it faster, let you build cron jobs and pipelines with dependencies, etc.

Now, so far all these tools did not really have a common interchange format for data, so there was a lot of wheel reinvention and incompatibility. Got file system layer X on your 10PB cluster? Can't use it with SQL engine Y. And I guess this is where Arrow comes in, where if everyone uses it then interop will get a lot better and each individual tool that much more useful.

Just my naive take.


I recently found it useful for the dumbest reason. A dataset was about 3GB as a CSV and 20MB as a parquet file created and consumed by arrow. The file also worked flawlessly across different environments and languages.

So it’s a good transport tool. It also happens to be fast to load and query, but I only used it because of the compact way it stores data without any hoops to jump through.

Of course one might say that it’s stupid to try to pass around a 2GB or 20MB file, but in my case I needed to do that.


Parquet is not Arrow. Parquet has optimizations for storage size at the expense of compute readiness. Arrow maximizes computation efficiency, at the expense of storage size.

Read more here: https://stackoverflow.com/questions/56472727/difference-betw...


I used Arrow to read and write Parquet files across environments. That's my dumb reason for finding Arrow so useful.


Did you have to reformat the data to get that size saving?


Not at all. It was just swapping out readcsv with readparquet in R and Python. It was painless. Granted my dataset was mostly categorical so that’s why it compressed down so much, but it was a real lifesaver.

It was also nice to be able to read whole bundles of Parquet files into a single dataframe easily. So it's nice for "sharding" really big Parquet datasets over multiple files. Or for fitting under file size limits on git repos.
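A small sketch of that sharded read (the directory name is hypothetical): pyarrow will read a whole directory of Parquet part files as a single table.

    import pyarrow.parquet as pq

    table = pq.read_table("shards/")  # picks up every Parquet file in the directory
    df = table.to_pandas()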


gzip probably would have been fine for that case even without parquet?


Gzip did help, but not as much as parquet. I don't remember the exact sizes but I think parquet was 3GB->20MB and gzip was like 3GB->180MB or something.

Also gzip was an extra step of unzip then read.


The big thing is that it is one of the first standardized, cross-language binary data formats. CSV is an OK text format, but parsing it is really slow because of string escaping. The files it produces are also pretty big since it's text.

Arrow is really fast to parse (up to 1000x faster than CSV), supports data compression, enough data-types to be useful, and deals with metadata well. The closest competitor is probably protobuf, but protobuf is a total pain to parse.
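For what it's worth, a rough sketch of that kind of conversion with pyarrow (file names are hypothetical): parse the CSV once with Arrow's multithreaded reader, then keep a compressed binary columnar copy that reloads without any text parsing.

    import pyarrow.csv as pv
    import pyarrow.feather as feather

    table = pv.read_csv("data.csv")
    feather.write_feather(table, "data.feather", compression="zstd")
    table_again = feather.read_table("data.feather")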


CSV is a file format and Arrow is an in-memory data format.

The CSV vs Parquet comparison makes more sense. Conflating Arrow / Parquet is a pet peeve of Wes: https://news.ycombinator.com/item?id=23970586


To be fair, arrow absorbed parquet-cpp, so a little confusion is to be expected.


The worst part about CSV is that it appears to be so simple that many people don't even try to understand how complicated it is and just roll their own broken dialect instead of just using a library. There is also no standardized encoding. Delimiters have to be negotiated out of band and people love getting them wrong.


Mostly agree, but RFC-4180 does specify delimiters. I've used pipe-delimited files for years to ensure explicitly defined conformity, but recently, due to the plethora of libraries, I've been thinking that CSV may be simpler.


The closest competitor would be HDF5, not Protobuf.


Seems nice. How does it compare to hdf5?


HDF5 is pretty terrible as a wire format, so it's not a 1-1 comparison to Arrow. Generally people are not going to be saving Arrow data to disk either (though you can with the IPC format), but serializing to a more compact representation like Parquet.


As I understand it, Arrow is particularly interesting since its wire format can be immediately queried/operated on without deserialization. Would saving an Arrow structure as Parquet not defeat that purpose, since you would need the costly deserialization step again on read? Honest question.


The FAQ [1] and this SO answer [2] explain it better than I can, but basically yes. However, the (de)serialization overhead is probably lower than with most alternative formats you could save to.

[1] https://arrow.apache.org/faq/ [2] https://stackoverflow.com/questions/56472727/difference-betw...
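A hedged sketch of the round trip being discussed (file name hypothetical): Parquet is the compact at-rest encoding, and reading it back decodes into Arrow memory, which is the step you pay for relative to keeping raw Arrow IPC files around.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    pq.write_table(table, "t.parquet", compression="zstd")  # encode + compress
    table_again = pq.read_table("t.parquet")                # decode back into Arrow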


I wrote a bit about this from the perspective of a data scientist here: https://www.robinlinacre.com/demystifying_arrow/

I cover some of the use cases, but more importantly try to explain how it all fits together, justifying why - as another commenter has said - it's the most important thing happening in the data ecosystem right now.

I wrote it because I'd heard a lot about Arrow, and even used it quite a lot, but realised I hadn't really understood what it was!


The press release on the first version of Apache Arrow discussed here 5 years ago is actually a good introduction IMO:

https://news.ycombinator.com/item?id=11118274


Rather curious myself.

https://en.wikipedia.org/wiki/Apache_Arrow was interesting, but I think many of us would benefit from a broader, problem focused description of Arrow from someone in the know.


When you want to process large amounts of in-memory tabular data from different languages.

You can save it to disk too using Apache Parquet, but I evaluated Parquet and found it very immature: extremely incomplete documentation, and lots of Arrow features are just not supported in Parquet, unfortunately.


Do you mean the Parquet format? I don't think Parquet is immature; it is used in so many enterprise environments, and it's one of the few columnar file formats for batch analysis and processing. It performs so well... But I'm curious to know your opinion on this, so feel free to add some context to your position!


Yeah, I do. For example, Apache Arrow supports in-memory compression, but Parquet does not support that. I had to look through the code to find that out, and I found many instances of basically `throw "Not supported"`. And yeah, as I said, the documentation is just non-existent.

If you are already using Arrow, or you absolutely must use a columnar file format then it's probably a good option.


Is that a problem in the Parquet format or in PyArrow? My understanding is that Parquet is primarily meant for on-disk storage (hence the default on-disk compression), so you'd read into Arrow for in-memory compression or IPC.


I don't know whether it is a limitation of the format or of the implementation. I used the C++ library (or maybe it was C, I can't remember), which I assume PyArrow uses too.

> hence the default on-disk compression

No, Parquet doesn't support some compression formats that Arrow does.


Python Pandas can output to lots of things. R can read lots of things. Arrow lets you push data between them (mostly*) seamlessly. (* I hit one edge case with an older version with JSON as a datatype.)

Not having to maintain data type definitions when sending around the data, nor caring whether my colleagues were using R or Python, worked great.


This is not the greatest answer, but one anecdote: I am looking at it as an alternative to MS SSIS for moving batches of data around between databases.



