Just to mention a modification on the "simple and intuitive cardinality estimato...

nicksdjohnson · on Sept 7, 2012

I actually cover that in the post (well, more or less: I talk about hashing to remove bias, after talking about the 'min element' algorithm). According to the papers cited, though, taking the count of leading zeroes is more space efficient, allowing you to have more buckets in the same amount of space.

shardling · on Sept 7, 2012

>This removes the issue of having to assume the N objects are distributed evenly as hash(object) should exhibit that property.

I'm a bit confused -- isn't that exactly how the article proceeds?

Smerity · on Sept 8, 2012

The hashing isn't the point. The leading zeroes method is more likely to produce unreliable results when you can't guarantee in advance that the dataset will be large, whilst the "keep min M elements" method works fine for unexpectedly small sets. That's why it's used for SZL, where the user can request approximate unique counts on any data regardless of size.
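
For anyone following along, here is a rough Python sketch of the "keep min M elements" idea. It is an illustration only, not the SZL implementation: the names `hash64` and `kmv_estimate`, the use of SHA-1 truncated to 64 bits, and the `(k - 1) / kth_smallest` estimator are all assumptions made for this example. It shows why small sets are unproblematic: with fewer than `k` distinct hashes, the sketch is simply an exact count.

```python
import hashlib
import heapq

HASH_BITS = 64

def hash64(item):
    # Hypothetical helper: first 8 bytes of SHA-1 as a 64-bit integer.
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def kmv_estimate(items, k=256):
    """Estimate the number of distinct items from the k smallest hashes."""
    heap = []      # max-heap (negated values) of the k smallest hashes
    kept = set()   # same values, for duplicate detection
    for item in items:
        h = hash64(item)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # New hash beats the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        # Fewer than k distinct hashes seen: the count is exact
        # (barring 64-bit hash collisions).
        return len(heap)
    # Scale the k-th smallest hash into (0, 1) and invert.
    kth_smallest = -heap[0] / 2**HASH_BITS
    return (k - 1) / kth_smallest
```

With `k=256`, a stream of 100 distinct values returns exactly 100, while larger streams come back with a relative error on the order of `1/sqrt(k)`.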

nicksdjohnson · on Sept 8, 2012

Have you read the papers I linked in detail? Some of them, such as HyperLogLog, provide corrections to give better estimates for small sets, and although I can't follow the proof in its entirety, they claim to be more efficient than the alternatives, including the one you propose.
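
As a rough illustration of the small-set correction mentioned here, the sketch below follows the structure described in the HyperLogLog paper: when the raw estimate is at most `2.5 * m` and some registers are still empty, it falls back to linear counting, `m * ln(m / V)`. The function names, the SHA-1 based 64-bit hash, and the default `b=10` are assumptions for this example, and the paper's large-range correction for 32-bit hashes is deliberately left out.

```python
import hashlib
import math

def hash64(item):
    # Hypothetical helper: first 8 bytes of SHA-1 as a 64-bit integer.
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def hll_estimate(items, b=10):
    """HyperLogLog-style estimate with the small-range correction."""
    m = 1 << b                 # number of registers (use b >= 7)
    rest_bits = 64 - b
    registers = [0] * m
    for item in items:
        x = hash64(item)
        j = x >> rest_bits                 # top b bits pick a register
        w = x & ((1 << rest_bits) - 1)     # remaining bits
        # Position of the leftmost 1-bit in w, counting from 1
        # (i.e. leading zeroes + 1).
        rank = rest_bits - w.bit_length() + 1
        registers[j] = max(registers[j], rank)

    alpha = 0.7213 / (1 + 1.079 / m)       # bias constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    empty = registers.count(0)
    if raw <= 2.5 * m and empty > 0:
        # Small-range correction: linear counting on empty registers.
        return m * math.log(m / empty)
    return raw
```

With 1024 registers, 100 distinct items land in the linear-counting branch and come out close to 100, whereas the uncorrected raw estimate would be several times too high.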

sophiebits · on Sept 7, 2012

Do you know off the top of your head if this is more or less efficient/accurate than the algorithm in the post?

robotresearcher · on Sept 7, 2012

It's the same algorithm.

asharp · on Sept 8, 2012

Does this method have a name/are there any reasonable papers out there describing it?