
What happens when you need to encode the newline character in your data? That makes splitting _either_ CSV or LDJSON files difficult.



The newline character in a JSON string would always be \n. A literal newline inside the record itself, used as whitespace, would not be acceptable, as that breaks the one-record-per-line contract.

Remember that this does not allow every serialized representation of JSON data (pretty-printed output is out). But it allows for any and all JSON data, as you can always round-trip valid JSON to a compact one-line representation without extra whitespace.
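To make the round-trip concrete, here is a minimal sketch using Python's standard json module. The helper name to_jsonl_line and the sample record are made up for illustration; the only thing relied on is that json.dumps without indent emits no raw newlines and escapes a newline inside a string as \n.

```python
import json

def to_jsonl_line(value):
    # No indent means no newlines as whitespace; compact separators
    # additionally drop the spaces after "," and ":".
    return json.dumps(value, separators=(",", ":"))

# Hypothetical record whose string value contains a real newline character.
record = {"text": "first line\nsecond line", "tags": ["a", "b"]}

# Pretty-printed JSON spans several lines, so it breaks the one-line contract...
pretty = json.dumps(record, indent=2)

# ...but parsing it and re-serializing compactly gives a single line again.
line = to_jsonl_line(json.loads(pretty))

print(line)
```

The newline inside the string survives as the two-character escape \n, so nothing is lost by going to one line per record.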


Actually, even whitespace-separated JSON would be a valid format, and if you forbid JSON documents that are a single integer or float, then even just concatenating JSON gives a valid format, as JSON is then a prefix-free language.

That is[0], if a string s of length n is valid JSON, then there is no proper prefix s[0..i], for i < n, that is also valid JSON.

So you could just consume as many bytes as you need to produce one JSON value and then start a new one when that one is complete. To handle malformed data you just need to throw out the partial data on a syntax error and start again from the following byte (and likely throw away data a few more times if the error was in the middle of a document).

That is, [][]""[][]""[] is unambiguous to parse[1]

[0] again assuming that we restrict ourselves to string, null, boolean, array and objects at the root

[1] still this is not a good format as a single missing " can destroy the entire document.
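The consume-and-restart loop described above can be sketched with json.JSONDecoder.raw_decode from Python's standard library, which parses one value starting at an index and returns the index where it stopped. The function name split_concatenated is made up, and note that Python's decoder does accept bare numbers at the root, so the restriction from [0] is an assumption about the input here, not something the code enforces.

```python
import json

def split_concatenated(text):
    """Yield each root-level JSON value from back-to-back documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        try:
            # Parse exactly one value starting at pos; get back where it ended.
            value, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # Syntax error: throw away one character and retry from the next.
            # After damage in the middle of a document this may fire several
            # times, and may also yield fragments of the damaged document.
            pos += 1
            continue
        yield value

print(list(split_concatenated('[][]""[][]""[]')))
# prints [[], [], '', [], [], '', []]
```

Because every failed position is simply skipped, this also happens to tolerate whitespace between documents.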


« a single missing " can destroy the entire document » This is basically true for any data format, so really the worst argument ever...


In jsonl a modified chunk will lose you at most the removed lines and the two adjacent ones (unless the noise is randomly valid json); in particular, a single byte edit can destroy at most 2 lines.

utf-8 is also similarly self-correcting and so is html and many media formats.
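A small sketch of that self-correcting behaviour for jsonl, using only the standard json module. The reader name read_jsonl_lenient and the sample data are invented for illustration; the idea is just that each line is parsed on its own, so a bad line is skipped and parsing resynchronizes at the next newline.

```python
import json

def read_jsonl_lenient(text):
    """Parse each line independently; return (records, bad line numbers)."""
    records, bad_lines = [], []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # ignore blank lines, e.g. after the final newline
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad_lines.append(number)  # damage stays confined to this line
    return records, bad_lines

good = '{"id":1}\n{"id":2}\n{"id":3}\n'

# Delete one byte inside a record: only that record is lost.
damaged_record = '{"id":1}\n{"id":2\n{"id":3}\n'
print(read_jsonl_lenient(damaged_record))

# Delete one newline byte: the two lines it separated merge and are both lost.
damaged_newline = '{"id":1}{"id":2}\n{"id":3}\n'
print(read_jsonl_lenient(damaged_newline))
```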

My point was that in my made-up concatenated json format

[]"""[][][][][][][][][][][]"""[]

and

[]""[][][][][][][][][][][]""[]

are both valid and differ by only 2 bytes, but have entirely different structures.

Also it is a made-up format nobody uses (if somebody were to want this they would likely disallow strings at the root level).


When you need to encode the newline character in your data, you say \n in the JSON. Unlike (the RFC dialect of) CSV, JSON has an escape sequence denoting a newline and in fact requires its use. The only reason to introduce newlines into JSON data is prettyprinting.


It's tricky, but simple enough: the RFC states that " must be used around a field that contains a newline, and inserting a " inside such a field is done with "". This makes knowing what a record is difficult, since a record can span several lines and you must keep a variable that accumulates the entire string.

How do you do this simply? You read each line, and if there's an odd number of ", then you have an incomplete record, and you keep appending lines until the accumulated text has an even number of ". Once you have the whole string, parsing the fields correctly is harder, but you can do it with a regex, a PEG, or a disgusting state machine.
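A minimal sketch of that quote-counting trick, assuming well-formed RFC-style quoting (a stray " in an unquoted field would throw the count off). The function name split_csv_records is made up, and it only reassembles records; splitting each record into fields is the separate, harder step.

```python
def split_csv_records(lines):
    """Group physical lines into logical CSV records by quote parity."""
    records = []
    pending = []      # lines of the record currently being assembled
    quote_count = 0   # number of " seen so far in the pending record
    for line in lines:
        pending.append(line)
        quote_count += line.count('"')
        # An escaped quote "" adds 2, so it never changes the parity.
        if quote_count % 2 == 0:
            records.append("\n".join(pending))
            pending = []
            quote_count = 0
    if pending:
        # Input ended inside a quoted field; keep the partial record as-is.
        records.append("\n".join(pending))
    return records

lines = ['a,b', '1,"hello', 'world"', '2,"say ""hi"""']
print(split_csv_records(lines))
```

Here the second and third physical lines are joined back into one record, while the last line, with its doubled quotes, stays a record on its own.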



