I'm a bit confused - why not distribute a serialized Bloom filter representing t...

Ajedi32 · on Feb 21, 2018

There are half a billion passwords in the list. A Bloom filter with even a 1 in 10 false positive rate would still be 286.59 MB.
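
For anyone wanting to check that figure, here is a rough sketch using the standard sizing formula for an optimally tuned Bloom filter, m = -n ln(p) / (ln 2)^2. The exact entry count below is an assumption (the comment only says "half a billion"), and the result is reported in MiB (2^20 bytes), which appears to be the unit that reproduces the quoted number.

```python
import math

def bloom_size_bits(n_items: int, fp_rate: float) -> int:
    """Bits needed by an optimally tuned Bloom filter: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))

# Assumed entry count; the comment above only says "half a billion".
n = 501_636_842
bits = bloom_size_bits(n, 0.1)   # 1 in 10 false positive rate
size_mib = bits / 8 / 2**20
print(f"{size_mib:.2f} MiB")
```

With a round 500,000,000 entries the result comes out slightly smaller, so the conclusion does not depend on the exact count.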

richdougherty · on Feb 22, 2018

You could do a Bloom filter on each bucket, each of which has about 500 items. This would reduce the size of the response from about 16k to < 1k. But it would be a lot harder to use since all clients would have to use the Bloom filter code correctly.
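
A minimal sketch of what a per-bucket filter might look like, assuming a 1,000-byte filter (8,000 bits, so 16 bits per item for 500 items) and 11 hash functions. The `BloomFilter` class, the double-hashing scheme built on SHA-256, and the made-up bucket contents are all illustrative choices, not anything the service actually provides.

```python
import hashlib

class BloomFilter:
    """Hypothetical minimal Bloom filter backed by a bytearray."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive all k bit positions from two 64-bit values.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step odd
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# 500 made-up items standing in for the contents of one bucket.
bucket = [f"item-{i}".encode() for i in range(500)]

bf = BloomFilter(num_bits=8000, num_hashes=11)
for item in bucket:
    bf.add(item)

print(len(bf.bits), "bytes on the wire")        # size of the serialized filter
print(all(item in bf for item in bucket))       # members are never missed
print(b"not-in-bucket" in bf)                   # almost always False
```

The catch mentioned above still applies: every client would need to reproduce the same hash derivation and bit layout exactly, or lookups silently break.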

fhenneke · on Feb 21, 2018

A Bloom filter with >500M items, even when allowing for a comparatively high rate of false positives such as 1 in 100, is still in the hundreds of MBs, which would not be that much more accessible than the actual dump files.

KMag · on Feb 22, 2018

The compressed archive here is over 8 GB. An uncompressed 2 GB Bloom filter with 24 hash functions and half a billion entries has a false positive rate of less than 1 in 14 million.
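
Those numbers can be checked with the usual approximation p ≈ (1 - e^(-kn/m))^k. The sketch below assumes "2 GB" means 2 GiB (2 × 2^30 bytes) and exactly 500,000,000 entries; both are my reading of the comment rather than stated figures.

```python
import math

def false_positive_rate(m_bits: int, n_items: int, k_hashes: int) -> float:
    """Standard Bloom filter estimate: p = (1 - exp(-k * n / m)) ** k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

m = 2 * 2**30 * 8        # assumed: 2 GiB expressed in bits
n = 500_000_000          # assumed: "half a billion" taken literally
p = false_positive_rate(m, n, 24)
print(f"about 1 in {1 / p:,.0f}")
```

Under these assumptions the estimate lands a little better than 1 in 14 million, which matches the claim.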

75% space savings, with no decompression necessary for use, and a 1 in 14 million false positive rate is nothing to sneeze at.

rrobukef · on Feb 22, 2018

But no count of how often the hash is used. Counting Bloom filters are still a bit harder to implement.

KMag · on Feb 22, 2018

Counting bloom filters are only marginally more difficult to implement. To increment a key, find the minimum value stored in all of the slots for the key, and then increment all of the stored values for that key that are equal to the minimum value. To read, return the minimum value for all of the values stored in slots for the key.
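
A short sketch of the increment and read rules just described (this variant is often called "conservative update" or "minimal increase"). The class name, slot count, and SHA-256 based slot derivation are hypothetical choices for illustration.

```python
import hashlib

class CountingBloomFilter:
    """Hypothetical counting Bloom filter using the min-only increment rule."""

    def __init__(self, num_slots: int, num_hashes: int):
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        self.counts = [0] * num_slots

    def _slots(self, key: bytes):
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        # A set, so a slot hit twice by the same key is only bumped once.
        return sorted({(h1 + i * h2) % self.num_slots
                       for i in range(self.num_hashes)})

    def increment(self, key: bytes) -> None:
        slots = self._slots(key)
        low = min(self.counts[s] for s in slots)
        # Only the slots currently holding the minimum are incremented.
        for s in slots:
            if self.counts[s] == low:
                self.counts[s] += 1

    def count(self, key: bytes) -> int:
        # The estimate is the minimum over the key's slots; it never undercounts.
        return min(self.counts[s] for s in self._slots(key))

cbf = CountingBloomFilter(num_slots=1024, num_hashes=4)
for _ in range(3):
    cbf.increment(b"123456")
cbf.increment(b"hunter2")

print(cbf.count(b"123456"), cbf.count(b"hunter2"), cbf.count(b"never-added"))
```

Collisions can only push an estimate up, never down, so a reported count is an upper bound on the true one.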

For these purposes, however, you probably instead want to store just separate Bloom filters for counts above different thresholds, since the common use case would be accept/reject decisions based upon a single threshold.
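
As a rough illustration of the threshold idea, the sketch below builds one plain Bloom filter per threshold and answers "has this password been seen at least t times?" with a single lookup. The thresholds, sample counts, filter sizes, and function names are all made up, and "at least t" is my reading of "above a threshold".

```python
import hashlib

def positions(item: bytes, num_bits: int, num_hashes: int):
    """Derive bit positions via double hashing over a SHA-256 digest."""
    digest = hashlib.sha256(item).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % num_bits for i in range(num_hashes)]

def build_threshold_filters(counts, thresholds, num_bits=4096, num_hashes=7):
    """Return {threshold: bitset}; each bitset is a Python int used as a bit array."""
    filters = {t: 0 for t in thresholds}
    for password, seen in counts.items():
        for t in thresholds:
            if seen >= t:
                for pos in positions(password, num_bits, num_hashes):
                    filters[t] |= 1 << pos
    return filters

def probably_at_least(filters, t, password, num_bits=4096, num_hashes=7):
    """True if the password is (probably) in the filter for threshold t."""
    return all((filters[t] >> pos) & 1
               for pos in positions(password, num_bits, num_hashes))

# Made-up breach counts for illustration only.
sample_counts = {b"123456": 20000, b"hunter2": 150, b"correct horse": 2}
filters = build_threshold_filters(sample_counts, thresholds=(1, 100, 10000))

print(probably_at_least(filters, 10000, b"123456"))   # in the strictest filter
print(probably_at_least(filters, 100, b"hunter2"))    # seen at least 100 times
print(probably_at_least(filters, 100, b"correct horse"))
```

A site that rejects anything seen at least 100 times would only need to ship or query the one filter for that threshold, and the higher-threshold filters are much smaller because far fewer passwords qualify.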

_wldu · on Feb 22, 2018

I agree. Just need to set some bits and test them. This is too big really for a tree or a hash table.