The author says that with a 1.2 GB bloom filter and 7 hash functions a lookup takes only 30ns. I have to assume this is because the cache has already loaded all the benchmark values; my guess is that the benchmark is only testing a few elements. In real-world scenarios with a bloom filter this big, most of those 7 hash lookups will hit cold cache, since 1.2 GB is far too large to fit in cache. That means lookups take much longer than 30ns. Probably still faster than going to the network or a database, though.
Updated the result to 80ns - thanks for flagging this. The number grows with the size of the data (because of more cache misses), and running the benchmark on the full billion elements takes a while.
[That said, on a hot production bloom filter much of the array can be loaded into the caches anyway, so it's not an entirely unrealistic scenario that some of these are cache hits]
A single lookup is going to take more than 30ns; the reason they only see 30ns is that the out-of-order (OoO) machinery of their CPU is good enough to run those loads in parallel.
Yes, 30ns means it's in cache. But bloom filters are surprisingly fast for the number of memory accesses they do, because the accesses all happen in parallel and there is a lot of parallelism in modern memory subsystems, so that you essentially only pay the cost of a single random read for the entire lookup. If you use 1GB pages, you can still realistically talk about <100ns lookups.
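To make the 1GB-pages remark concrete, here's a minimal sketch of allocating the filter's backing memory with 1GB huge pages on Linux. MAP_HUGE_1GB is real (Linux >= 3.8, with pages reserved at boot); the function name and sizing are my own illustration:

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdint.h>
#include <stddef.h>

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << 26)  /* log2(1GB) << MAP_HUGE_SHIFT */
#endif

/* Back the filter's bit array with 1GB huge pages, so a ~1.2GB filter
 * is covered by two TLB entries instead of ~300k 4KB ones.
 * Requires 1GB pages reserved at boot, e.g. hugepagesz=1G hugepages=2.
 * `bytes` must be a multiple of 1GB. */
static uint64_t *alloc_filter_1g(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                   -1, 0);
    return p == MAP_FAILED ? NULL : (uint64_t *)p;
}
```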
> so that you essentially only pay the cost of a single random read for the entire lookup
Why would you ever pay more than that for a bloom filter lookup? I mean, I don’t see how that has anything to do with parallelism in memory subsystems. But I may be missing something.
A bloom filter needs to do multiple loads from different memory locations for each single lookup. (7, in the example 1.2GB filter.) But unlike, say, a tree, it knows all the addresses as soon as it has computed the hashes, without having to wait for the results of earlier loads. So it can start all of them in parallel.
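Concretely, a lookup can look something like this; a minimal sketch, assuming a power-of-two bit array and Kirsch-Mitzenmacher double hashing (the names and layout are mine, not the article's):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

typedef struct {
    const uint64_t *bits;  /* bit array, m bits total */
    size_t          m;     /* number of bits (power of two here) */
    int             k;     /* number of hash functions, e.g. 7 */
} bloom_t;

static bool bloom_contains(const bloom_t *bf, uint64_t h1, uint64_t h2)
{
    /* All k bit positions are known up front from h1 and h2, so the
     * k loads have no data dependency on each other: the CPU's
     * out-of-order machinery can issue them in parallel. */
    size_t idx[16];
    for (int i = 0; i < bf->k; i++)
        idx[i] = (h1 + (uint64_t)i * h2) & (bf->m - 1);

    /* Optionally make the parallelism explicit with prefetches. */
    for (int i = 0; i < bf->k; i++)
        __builtin_prefetch(&bf->bits[idx[i] >> 6]);

    for (int i = 0; i < bf->k; i++)
        if (!(bf->bits[idx[i] >> 6] & (1ULL << (idx[i] & 63))))
            return false;  /* definitely not in the set */
    return true;           /* probably in the set */
}
```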
A simple extension of the Bloom filter called “block Bloom filters” fixes this. The idea is that the first hash is the index of a small constant-size block of your array, and the rest of the indices are within that block.
So a single query to the filter should incur only one or two cache misses, depending on the size of your block. And even if your block is larger than a cache line, you can probably issue all the loads at once and pay the latency of only one memory access.
The downside of doing this is slightly more space usage for the same false-positive rate, relative to simple Bloom filters. I'd almost always reach for block Bloom filters, though, once the filter becomes a significant fraction of cache size.
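For illustration, a minimal sketch of a cache-line-blocked variant; the layout, names, and power-of-two block count are assumptions, not anyone's production code:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define BLOCK_WORDS 8  /* 8 x 64-bit words = one 64-byte cache line */

typedef struct {
    const uint64_t *words;     /* n_blocks * BLOCK_WORDS words */
    size_t          n_blocks;  /* power of two here */
    int             k;
} blocked_bloom_t;

static bool blocked_bloom_contains(const blocked_bloom_t *bf,
                                   uint64_t h1, uint64_t h2)
{
    /* The first hash selects one block; every probe after that stays
     * inside it, so the whole query touches a single cache line. */
    const uint64_t *block =
        &bf->words[(h1 & (bf->n_blocks - 1)) * BLOCK_WORDS];

    for (int i = 0; i < bf->k; i++) {
        uint64_t bit = (h1 + (uint64_t)(i + 1) * h2) % (BLOCK_WORDS * 64);
        if (!(block[bit >> 6] & (1ULL << (bit & 63))))
            return false;
    }
    return true;
}
```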
I implemented block bloom filters for fairly large (~GB) arrays and saw about 35ns per lookup. They're excellent data structures, pretty much as fast as you can get for approximate membership tests (though other filters have better space-time tradeoffs).