"They’re parsing JSON. A whopping 10 megabytes worth of JSON with some 63k item entries."

Ahh. Modern software rocks.



Parsing 63k items in a 10 MB JSON string is pretty much a breeze on any modern system, including a Raspberry Pi. I wouldn't even consider JSON an anti-pattern for storing that much data if it's going over the wire (compressed with gzip).

A little further down the article you'll see one of the real issues:

> But before it’s stored? It checks the entire array, one by one, comparing the hash of the item to see if it’s in the list or not. With ~63k entries that’s (n^2+n)/2 = (63000^2+63000)/2 = 1984531500 checks if my math is right. Most of them useless.
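
For the curious, here's a minimal Python sketch of that difference. The item shape and hash field are invented for illustration; the point is the per-item list scan vs. the set lookup:

    import time

    def make_items(n):
        # Fabricated stand-in for the item entries: each carries a unique hash.
        return [{"hash": f"{i:08x}"} for i in range(n)]

    def dedup_quadratic(items):
        # The pattern described in the article: scan the whole stored list per item.
        seen = []
        for it in items:
            if not any(s["hash"] == it["hash"] for s in seen):
                seen.append(it)
        return seen

    def dedup_hashed(items):
        # Linear time: membership test against a set of hashes.
        seen, out = set(), []
        for it in items:
            if it["hash"] not in seen:
                seen.add(it["hash"])
                out.append(it)
        return out

    for n, fn in [(63_000, dedup_hashed), (10_000, dedup_quadratic)]:
        # The quadratic version is only run at 10k items; at 63k it takes minutes.
        t0 = time.perf_counter()
        fn(make_items(n))
        print(f"{fn.__name__}(n={n}): {time.perf_counter() - t0:.2f}s")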


Check out https://github.com/simdjson/simdjson

More than 3 GB/s are possible. Like you said 10 MB of JSON is a breeze.
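
If you want to poke at that from Python, there are bindings; a minimal sketch, assuming pysimdjson is installed and that gta.json is the ~10 MB document (both the package choice and the path are assumptions here):

    # pip install pysimdjson  -- exposes a `simdjson` module with a reusable Parser
    import time
    import simdjson

    parser = simdjson.Parser()

    with open("gta.json", "rb") as f:   # illustrative path, not from the article
        raw = f.read()

    t0 = time.perf_counter()
    doc = parser.parse(raw)             # lazily materialised document backed by simdjson
    print(f"parsed {len(raw) / 1e6:.1f} MB in {(time.perf_counter() - t0) * 1000:.1f} ms")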


The JSON patch took out more of the elapsed time. Granted, it was a terrible parser, but I still think JSON is a poor choice here: 63k x X checks for colons, balanced quotes/braces and so on just aren't needed.

  Time with only duplication check patch: 4m 30s
  Time with only JSON parser patch:       2m 50s


> But I still think JSON is a poor choice here.

It’s an irrelevant one. The JSON parser from the Python stdlib parses a 10 MB document patterned after the sample in a few dozen milliseconds, and it’s hardly a fast parser.
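
A quick way to sanity-check that, using a synthetic document (field names and sizes are made up here; real numbers will vary by machine):

    import json, time

    # Build a roughly 10 MB document with 63k entries, loosely patterned
    # after the kind of sample discussed in the article.
    items = [{"key": f"{i:08x}", "name": "x" * 120, "price": i} for i in range(63_000)]
    raw = json.dumps({"content": items})
    print(f"document size: {len(raw) / 1e6:.1f} MB")

    t0 = time.perf_counter()
    json.loads(raw)
    print(f"json.loads took {(time.perf_counter() - t0) * 1000:.0f} ms")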


At least parse it into SQLite. Once.


They probably add more entries over time (and maybe update/delete old ones), so you’d have to be careful about keeping the local DB in sync.


So just have the client download the entire DB each time. Can’t be that many megabytes.


I did a very, very ugly quick hack in Python: took the example JSON, made the one list entry a string (lazy hack), and repeated it 56,000 times. That resulted in a JSON doc that weighed in at 10 MB. My initial guess of 60,000 times was a pure fluke!

Dumped it into a very simple SQLite DB:

    $ du -hs gta.db
    5.2M    gta.db

Even 10 MB is peanuts for most of their target audience. Stick it in an SQLite DB punted across the wire and they'd cut out all of the parsing time too.
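
A rough reconstruction of that kind of hack, with a made-up entry shape since the original sample isn't reproduced here:

    import json, sqlite3

    # Fabricate one entry and repeat it 56k times, mimicking the
    # "repeat the sample until the document hits ~10 MB" approach above.
    entry = {"key": "deadbeef", "name": "x" * 120, "price": 1234, "flags": [0, 1]}
    content = [dict(entry, key=f"{i:08x}") for i in range(56_000)]
    raw = json.dumps({"content": content})
    print(f"JSON size: {len(raw) / 1e6:.1f} MB")

    # Dump the same data into a minimal SQLite table.
    con = sqlite3.connect("gta.db")
    con.execute("CREATE TABLE IF NOT EXISTS items (key TEXT PRIMARY KEY, body TEXT)")
    con.executemany(
        "INSERT OR REPLACE INTO items VALUES (?, ?)",
        ((e["key"], json.dumps(e)) for e in content),
    )
    con.commit()
    con.close()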


I think just using a length-encoded serialization format would have made this work reasonably fast.
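
The appeal is that the reader never scans for delimiters or quotes; it reads a size and then exactly that many bytes. A toy length-prefixed framing in Python (format invented for illustration, not any particular standard):

    import io
    import struct

    def write_records(records, fp):
        # Each record: 4-byte little-endian length prefix followed by the raw payload.
        for payload in records:
            fp.write(struct.pack("<I", len(payload)))
            fp.write(payload)

    def read_records(fp):
        # No scanning for colons or braces: read the length, then read that many bytes.
        while header := fp.read(4):
            (size,) = struct.unpack("<I", header)
            yield fp.read(size)

    buf = io.BytesIO()
    write_records([b"first item", b"second item"], buf)
    buf.seek(0)
    print([r.decode() for r in read_records(buf)])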


Or just any properly implemented JSON parser. That's a laughably small amount of JSON, which should easily be parsed in milliseconds.


why not embed node.js to do this efficiently :D



