
I was just thinking how lousy a data format JSON is for tabular data. Would sending it down in CSV improve things at all?


Yes, we constantly think about that too! In the document collections of UKV, for example, we have interoperability between JSON, BSON, and MessagePack objects [1]. CSV is another potential option, but text-based formats aren't ideal for large-scale transmission.

One thing people do is use two protocols. That is the case with Apache Arrow Flight RPC: gRPC for tasks, Arrow for data. It is a viable path, but compiling gRPC is a nightmare, and we don't want to integrate it into our other libraries, as we generally compile everything from source. Seemingly, UJRPC can replace gRPC, and for the payload we can continue using Arrow. We will see :)
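
A minimal sketch of that split, assuming pyarrow is available (the rpc_client.call below is a hypothetical stand-in for whatever task protocol carries the request): the table is serialized to the Arrow IPC stream format and shipped as an opaque binary payload.

    import pyarrow as pa

    # Build a small table; in practice this would be the query result.
    table = pa.table({"id": [1, 2, 3], "score": [0.1, 0.2, 0.3]})

    # Serialize to the Arrow IPC stream format. The payload stays binary
    # and typed, independent of the RPC layer that carries it.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    payload = sink.getvalue().to_pybytes()

    # Hypothetical task-protocol call; only the bytes cross the wire.
    # rpc_client.call("ingest_rows", payload)

    # The receiver rebuilds the table without re-parsing any text.
    received = pa.ipc.open_stream(payload).read_all()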

[1]: https://github.com/unum-cloud/ukv/blob/main/src/modality_doc...


Are there any synergies with capnproto [1] or is the focus here purely on huge payloads?

I'm just an interested hobbyist when it comes to performant RPC frameworks but had some fun benchmarking capnproto for a small gamedev-related project and it was pretty awesome.

[1] https://capnproto.org/


If anything, you'd probably want to send it in Arrow[1] format. CSVs don't even preserve data types.

[1]: https://arrow.apache.org/


Arrow/Feather is really the best format these days for tabular data transmission.

If anyone disagrees, I'd be very interested to hear your thoughts on alternatives.


What about compression? Is that part of Arrow itself?


It's not part of Arrow, but Arrow is columnar so just a basic LZ4/ZSTD will work pretty well.
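
If you're using pyarrow, the IPC writer can apply that ZSTD/LZ4 to each column buffer for you; a minimal sketch, assuming a recent pyarrow:

    import pyarrow as pa

    table = pa.table({"ints": list(range(1000)),
                      "floats": [float(i) / 3 for i in range(1000)]})

    # Ask the IPC writer to ZSTD-compress each column buffer.
    sink = pa.BufferOutputStream()
    options = pa.ipc.IpcWriteOptions(compression="zstd")
    with pa.ipc.new_file(sink, table.schema, options=options) as writer:
        writer.write_table(table)

    # Reading transparently decompresses; the columnar layout keeps ratios decent.
    round_tripped = pa.ipc.open_file(sink.getvalue()).read_all()
    assert round_tripped.equals(table)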


Arrow looks super complicated.

Are data types useful for data going to/from web/mobile clients? Encode the type into the column header?


Data types are absolutely helpful. When you know a column stores Float64 data, you don't have to write the float out in base 10 and parse it back. You just dump the bytes.
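
A small sketch of the difference in plain Python (names are illustrative):

    import struct

    values = [3.141592653589793, 2.718281828459045]

    # Text route: format in base 10, then parse it back.
    as_text = ",".join(str(v) for v in values)
    parsed_back = [float(s) for s in as_text.split(",")]

    # Typed route: the column is known to be Float64, so just dump the
    # 8-byte IEEE-754 representation and read it back verbatim.
    as_bytes = struct.pack(f"<{len(values)}d", *values)
    unpacked = list(struct.unpack(f"<{len(values)}d", as_bytes))

    assert unpacked == values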


Or Parquet, for compression?


Arrow is meant as the "in-memory" dual to Parquet, which is meant as the "on-disk" serialisation format.

Many Parquet-supporting libs will load Parquet files into an Arrow structure in memory, for example.
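
A minimal sketch of that round trip with pyarrow (the file name is just an example):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"city": ["Oslo", "Lima"], "pop_m": [0.7, 10.9]})

    # Parquet as the on-disk, compressed serialisation...
    pq.write_table(table, "cities.parquet", compression="zstd")

    # ...and Arrow as the in-memory structure it is loaded back into.
    in_memory = pq.read_table("cities.parquet")
    assert in_memory.schema == table.schema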


No, parsing CSV is also pretty slow. You want some sort of length-prefixed and ideally fixed-width column format.


Parsing CSV doesn't have to be slow if you use something like xsv or zsv (https://github.com/liquidaty/zsv) (disclaimer: I'm an author). The speed of CSV parsers is fast enough that unless you are doing something ultra-trivial such as "count rows", your bottleneck will be elsewhere.

The benefits of CSV are:

- human readable

- does not need to be typed (sometimes raw data, such as date-formatted data, is not amenable to typing without introducing a pre-processing layer that gets you further from the original data)

- accessible to anyone: you don't need to be a data person to dbl-click and open in Excel or similar

The main drawback is that if your data is already typed, CSV does not communicate what the type is. You can alleviate this through various approaches such as the one described at https://github.com/liquidaty/zsv/blob/main/docs/csv_json_sql..., though I wouldn't disagree that if you can be assured that your starting data conforms to non-text data types, there are probably better formats than CSV.
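
As one example of supplying the types out of band, you can declare them at read time instead of in the file; a minimal sketch, assuming pyarrow and illustrative column names:

    import io
    import pyarrow as pa
    import pyarrow.csv as pacsv

    raw = io.BytesIO(b"id,signed_up,score\n1,2023-03-06,9.5\n2,2023-03-07,7.25\n")

    # The CSV itself carries no type information, so we provide it here.
    convert = pacsv.ConvertOptions(column_types={
        "id": pa.int64(),
        "signed_up": pa.timestamp("s"),
        "score": pa.float64(),
    })
    typed_table = pacsv.read_csv(raw, convert_options=convert)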

The main benefit of Arrow, IMHO, is less as a format for transmitting/communicating and more as a format for data at rest that benefits from higher-performance column-based reads and compression.


If a CSV has quoting (e.g. because the data contains comma or quote chars) aren't you effectively forced to parse it in a single thread?

See also: 'Why isn’t there a decent file format for tabular data?' https://news.ycombinator.com/item?id=31220841


Good point. Though, if we are talking about something coming down a network pipe, that connection will be serialized anyway, and during parsing the data can be sharded, converted to another format, indexed, or whatnot. I would still say the likelihood that anything non-trivial gets bottlenecked by the CSV parsing is exceptionally low. If you are reading the entire file, then the difference between starting, say, 4 threads directly at positions 0/25/50/75 versus a single CSV reader that dispatches chunks of rows to 4 threads (or whatever N instead of 4) is probably nil.

It is true there will be exceptions, such as if you know you only want to read the second half of the file. In that case CSV with quoting does not give you a direct way to find that halfway point without parsing the first half.

I suppose whether this is worth the other pros/cons will be situation-dependent. For my use cases, which are daily, CSV parsing speed, when using something like xsv or zsv, has by itself never been a material drag on performance.

Where I think the CSV parsing downside is much greater than the serial-parsing requirement (which, as described above, does not prevent parallelized processing) is in type conversion, not just of numbers but in particular of dates: it can be expensive to convert the text "March 6, 2023" to a date variable. However, if you have control over the format, you could just as easily have printed that as an integer such as 44991, reducing the problem to one of integer conversion. That is still always going to be slower than a binary format, but it isn't so bad performance-wise.
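
To make that concrete: 44991 is a spreadsheet-style day count, and converting it only needs integer parsing plus one addition. A small sketch, with the usual Excel-style epoch assumed:

    from datetime import date, timedelta

    EPOCH = date(1899, 12, 30)  # assumed Excel-style serial-date epoch

    def serial_to_date(serial: int) -> date:
        # Integer conversion instead of parsing "March 6, 2023".
        return EPOCH + timedelta(days=serial)

    def date_to_serial(d: date) -> int:
        return (d - EPOCH).days

    assert serial_to_date(44991) == date(2023, 3, 6)
    assert date_to_serial(date(2023, 3, 6)) == 44991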


If you start threads at positions 0/25/50/75 inside a CSV, how do you know if the characters at 25, 50 & 75 are inside or outside quoted data values? You could start at a carriage return, but that could also be inside quoting.


Yes, that is exactly my point. You cannot start threads at 0/25/50/75 if your data is in CSV format. But what I am saying is that, if you could do that, then your performance difference will be negligible, compared to using a single thread that parses the CSV into rows and passes chunks of rows to 4 separate threads.

In fact, the single-thread parser approach (with multi-thread processing) might even be better, because it is not trying to access your hard disk in 4 places at the same time. Then again, if your threads are doing some non-trivial task with each row, then IO will not be your bottleneck either way.
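
A rough sketch of that pattern (chunk size and names are illustrative): one thread walks the CSV, resolving quoting, and hands chunks of rows to a pool of workers.

    import csv
    from concurrent.futures import ThreadPoolExecutor
    from itertools import islice

    def process_chunk(rows):
        # Placeholder for the non-trivial per-row work.
        return sum(len(r) for r in rows)

    def parse_and_dispatch(path, chunk_size=10_000, workers=4):
        with open(path, newline="") as f, ThreadPoolExecutor(workers) as pool:
            reader = csv.reader(f)  # a single thread handles quoting/escaping
            futures = []
            while True:
                chunk = list(islice(reader, chunk_size))
                if not chunk:
                    break
                futures.append(pool.submit(process_chunk, chunk))
            return [fut.result() for fut in futures]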

Obviously this starts to break down if you aren't reading the whole file and you want to start some meaningful portion of the way in without ever processing what comes before it. The point is that the benefit of being able to implicitly shard a file without saving it as separate files might not be as impactful in practice as in theory.


>Yes, that is exactly my point. You cannot start threads at 0/25/50/75 if your data is in CSV format.

My mistake, I misread your answer!


Slower than JSON?



