Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I'm a bit confused - why not distribute a serialized Bloom filter representing these passwords? That would seem to enable a compact representation (low Azure bill) and client-side querying (maximally preserving privacy).



There are half a billion passwords in the list. A bloom filter with even a 1 in 10 false positive rate would still be 286.59 MB.


You could do a Bloom filter on each bucket, each of which has about 500 items. This would reduce the size of the response from about 16k to < 1k. But it would be a lot harder to use since all clients would have to use the Bloom filter code correctly.


A Bloom filter with >500M items, even when allowing for a comparatively high rate of false positives such as 1 in 100, is still in the hundreds of MBs, which would not be that much more accessible than the actual dump files.


The compressed archive here is over 8 GB. An uncompressed 2 GB Bloom filter with 24 hash functions and half a billion entries has a false positive rate of less than 1 in 14 million.

75% space savings, with no decompression necessary for use, and a 1 in 14 million false positive rate is nothing to sneeze at.


But no count of how often the hash is used. Counting bloom filters are till a bit harder to implement.


Counting bloom filters are only marginally more difficult to implement. To increment a key, find the minimum value stored in all of the slots for the key, and then increment all of the stored values for that key that are equal to the minimum value. To read, return the minimum value for all of the values stored in slots for the key.

For these purposes, however, you probably instead want to store just separate Bloom filters for counts above different thresholds, since the common use case would be accept/reject decisions based upon a single threshold.


I agree. Just need to set some bits and test them. This is too big really for a tree or a hash table.




Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: