Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Just to mention a modification on the "simple and intuitive cardinality estimator" that's far more accurate and actually used for Google's SZL:

Given N objects, hash each object and store the lowest M (s.t. M << N) hashes, then calculate what percentage of hash space is covered and compare to how much of the hash space would be expected to be covered.

This removes the issue of having to assume the N objects are distributed evenly as hash(object) should exhibit that property.

[1] My naive Python implementation for tests on the NLP WSJ corpus -- https://github.com/Smerity/Snippets/blob/master/analytics/ap...

[2]: Szl's implementation -- http://code.google.com/p/szl/source/browse/trunk/src/emitter...



I actually cover that in the post (well, more or less - I talk about hashing to remove bias, after talking about the 'min element' algorithm. According to the papers cited, though, taking the count of leading zeroes is more space efficient, allowing you to have more buckets in the same amount of space.


>This removes the issue of having to assume the N objects are distributed evenly as hash(object) should exhibit that property.

I'm a bit confused -- isn't that exactly how the article proceeds?


The hashing isn't the point. The leading zeroes method is more likely to produce unreliable results when you can't guarantee the dataset will be large in advance whilst the "keep min M elements" method works fine for unexpectedly small sets. That's why it's used for SZL where the user can request approximate unique counts on any data regardless of size.


Have you read the papers I linked in detail? Some of them, such as HyperLogLog, provide corrections to give better estimates for small sets, and although I can't follow the proof in its entirety, they claim to be more efficient than the alternatives, including the one you propose.


Do you know off the top of your head if this is more or less efficient/accurate than the algorithm in the post?


It's the same algorithm.


Does this method have a name/are there any reasonable papers out there describing it?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: