Assuming your keys are short strings, 1M elements is 32MB plus btree overhead. If your keys are longer strings (URLs are typically too long to benefit from the short string optimization), 1M elements is still 32MB, but now each element is also a pointer to somewhere else in memory with more data at the other end. So there's that; probably more than just 1-2 cache misses.
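For the curious: the 32 bytes/key figure matches sizeof(std::string) on a typical 64-bit libstdc++ build, which keeps roughly 15 characters inline via the short string optimization (libc++ is 24 bytes with ~22 inline). A quick, purely illustrative back-of-the-envelope check:

```cpp
#include <iostream>
#include <string>

int main() {
    // Typical 64-bit libstdc++: sizeof(std::string) == 32, with ~15 chars of
    // inline (SSO) capacity. Longer keys, e.g. most URLs, spill to a separate
    // heap allocation that the btree lookup then has to chase.
    std::cout << "sizeof(std::string) = " << sizeof(std::string) << " bytes\n";
    std::cout << "1M keys             = "
              << (1'000'000 * sizeof(std::string)) / (1024 * 1024)
              << " MiB of string headers, before node overhead and spilled data\n";
}
```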
That being said, in the case of just 1M elements, a Bloom filter won't buy you much. For a Bloom filter to be truly effective, you need all of the following:
1. Many queries come back negative; that is, you spend a lot of time searching for keys that are not in the DB. This is not the common case: when a typical web app searches for a _key_, the key is usually in the DB (as opposed to "find all sales on November 7th" or whatever, which Bloom filters can't help with anyway).
2. The actual query is disproportionately expensive. Unless a Bloom filter can save you a disk/network call, it's generally not worth the extra overhead.
3. You rarely/never remove items, and your data grows slowly.
Many problems that can be solved with a Bloom filter can also be solved by simply buying more RAM. Many problems that can be solved with a Bloom filter can also be solved by using a faster lookup. (everybody overlooks hash tables. I don't know why.) Bloom filters are _not_ a general purpose algorithm. You really need to grok your data before pulling the trigger on a Bloom filter. (don't forget to benchmark)
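For concreteness, this is roughly the data structure being discussed (a minimal sketch; the class name, double-hashing scheme, and parameter choices are mine, not from any particular library):

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Minimal Bloom filter sketch: an m-bit array, k hash probes per operation.
class BloomFilter {
public:
    BloomFilter(std::size_t bits, unsigned k) : bits_(bits, false), k_(k) {}

    void insert(const std::string& key) {
        for (unsigned i = 0; i < k_; ++i) bits_[index(key, i)] = true;
    }

    // "false" means definitely absent; "true" means probably present.
    bool maybe_contains(const std::string& key) const {
        for (unsigned i = 0; i < k_; ++i)
            if (!bits_[index(key, i)]) return false;
        return true;
    }

private:
    // Derive k indices from two hash values (double hashing); fine for a sketch.
    std::size_t index(const std::string& key, unsigned i) const {
        std::size_t h1 = std::hash<std::string>{}(key);
        std::size_t h2 = std::hash<std::string>{}(key + '\x01');
        return (h1 + i * h2) % bits_.size();
    }

    std::vector<bool> bits_;
    unsigned k_;
};
```

The only win is that a false from maybe_contains() lets you skip the expensive lookup entirely; also note that each of the k probes can be its own cache miss once the bit array is tens of MB.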
> (everybody overlooks hash tables. I don't know why.)
It is probably worse than you think, since a hash table can also replace the Bloom filter in its "false positives allowed" filtration role. Not sure if that's what you meant.
If all you save is the "fingerprint" (i.e., a B-bit truncated hash code of the key), then you get a very similar probabilistic data structure. Further, if you put these small integer fingerprints in an open-addressed hash table with linear probing, you get a (very likely) single-cache-miss test for membership. This alternate structure was, in fact, also analyzed by Burton Bloom, in the very same paper in which he introduced his famous filter. He found fingerprint filters to be slower only on a very unrealistic memory system with neither cache nor "word fetch" structure, where access to every bit had unit cost no matter its location. If you can grab a big batch of bits at unit cost, the analysis changes a lot. Modern systems grab 512..1024 bits at the same cost as 1...
This fingerprint filter alternative does take 2..4x more space, depending on scale and the allowed false positive rate, but that space is organized better in terms of caches. Most would choose to spend that space to save k-1 cache misses of time in today's era of 60-120 ns memory and multi-GHz CPUs. It is slightly easier to implement a large bitvector than an array of B-bit numbers, depending on B. { It's pretty trivial for B=8*m, though. ;-) } It's kind of crazy that hash tables for this purpose fell into "also-ran/obscure" status...
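Something like the following is what I have in mind (a rough sketch, not Bloom's original formulation; the 16-bit fingerprint width, names, and the assumption that the load factor stays well below 1 are all mine):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Fingerprint filter sketch: store only a B-bit hash of each key in an
// open-addressed table with linear probing. Assumes an empty slot always
// exists so probe loops terminate.
class FingerprintFilter {
public:
    explicit FingerprintFilter(std::size_t slots) : table_(slots, kEmpty) {}

    void insert(const std::string& key) {
        std::uint16_t fp = fingerprint(key);
        std::size_t i = home(key);
        while (table_[i] != kEmpty && table_[i] != fp)   // linear probing
            i = (i + 1) % table_.size();
        table_[i] = fp;
    }

    // Bloom-filter-like semantics: false positives possible, false negatives not.
    // A query scans one contiguous run of 2-byte slots, so it usually costs a
    // single cache miss.
    bool maybe_contains(const std::string& key) const {
        std::uint16_t fp = fingerprint(key);
        for (std::size_t i = home(key); table_[i] != kEmpty; i = (i + 1) % table_.size())
            if (table_[i] == fp) return true;
        return false;
    }

private:
    static constexpr std::uint16_t kEmpty = 0;

    // B = 16 here; home() uses the low bits of the hash, the fingerprint the
    // high bits, so the two are (mostly) independent.
    static std::uint16_t fingerprint(const std::string& key) {
        auto fp = static_cast<std::uint16_t>(std::hash<std::string>{}(key) >> 48);
        return fp == kEmpty ? std::uint16_t{1} : fp;   // reserve 0 for "empty"
    }

    std::size_t home(const std::string& key) const {
        return std::hash<std::string>{}(key) % table_.size();
    }

    std::vector<std::uint16_t> table_;
};
```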
If you switch from linear probing to cuckoo hashing, you end up with a cuckoo filter, which is more compact than a Bloom filter for a given false-positive rate. The compactness comes from the fact that cuckoo hashing allows load factors of over 90%, which are impractical with linear probing.
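The skeleton looks roughly like this (a sketch with arbitrary illustrative parameters: 4-slot buckets, 16-bit fingerprints, a fixed kick limit; not the reference implementation). The partial-key trick is that the alternate bucket is computed from the current bucket and the fingerprint alone, so residents can be kicked around without their original keys:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <random>
#include <string>
#include <utility>
#include <vector>

class CuckooFilter {
public:
    // num_buckets must be a power of two for the XOR trick below to work.
    explicit CuckooFilter(std::size_t num_buckets) : buckets_(num_buckets) {}

    bool insert(const std::string& key) {
        auto [i1, fp] = index_and_fp(key);
        std::size_t i2 = alt_bucket(i1, fp);
        if (place(i1, fp) || place(i2, fp)) return true;

        // Both buckets full: evict a random resident to its alternate bucket.
        std::size_t i = (rng_() & 1) ? i1 : i2;
        for (int kicks = 0; kicks < kMaxKicks; ++kicks) {
            std::swap(fp, buckets_[i].slots[rng_() % kSlotsPerBucket]);
            i = alt_bucket(i, fp);
            if (place(i, fp)) return true;
        }
        return false;  // full; a real implementation would stash the evicted fp as a "victim"
    }

    bool maybe_contains(const std::string& key) const {
        auto [i1, fp] = index_and_fp(key);
        return has(i1, fp) || has(alt_bucket(i1, fp), fp);
    }

private:
    static constexpr int kSlotsPerBucket = 4;
    static constexpr int kMaxKicks = 500;
    static constexpr std::uint16_t kEmpty = 0;

    struct Bucket { std::uint16_t slots[kSlotsPerBucket] = {}; };

    std::pair<std::size_t, std::uint16_t> index_and_fp(const std::string& key) const {
        std::size_t h = std::hash<std::string>{}(key);
        auto fp = static_cast<std::uint16_t>(h >> 48);
        if (fp == kEmpty) fp = 1;                      // reserve 0 for empty slots
        return {h & (buckets_.size() - 1), fp};
    }

    // Depends only on the current bucket and the fingerprint (an involution when
    // the bucket count is a power of two), which is what makes eviction possible.
    std::size_t alt_bucket(std::size_t i, std::uint16_t fp) const {
        return (i ^ std::hash<std::uint16_t>{}(fp)) & (buckets_.size() - 1);
    }

    bool place(std::size_t i, std::uint16_t fp) {
        for (auto& s : buckets_[i].slots)
            if (s == kEmpty) { s = fp; return true; }
        return false;
    }

    bool has(std::size_t i, std::uint16_t fp) const {
        for (auto s : buckets_[i].slots)
            if (s == fp) return true;
        return false;
    }

    std::vector<Bucket> buckets_;
    std::minstd_rand rng_;
};
```

A lookup still touches at most two buckets (two potential cache misses), and previously inserted keys can be deleted by removing a matching fingerprint, which a plain Bloom filter can't do.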
Well, sure. You can also switch to Robin Hood linear probing to work well at high load factors and retain the "almost never more than 1 DRAM hit" property. These are all just different hash table options for clawing back some of the space overhead via higher load factors.