The author says that with a 1.2 GB bloom filter and 7 hash functions a lookup takes only 30ns. I have to assume this is because the cache has already loaded all the benchmark values; my guess is that the benchmark is only testing a few elements. In real-world scenarios with a bloom filter this big, most of those 7 hash lookups will hit cold cache, since 1.2 GB is far too large to fit in cache. That means lookups take much longer than 30ns. Probably still faster than going to the network or a database, though.
Updated the result to 80ns - thanks for flagging this. The number grows with the size of the data (because of more cache misses), and running the benchmark on the full billion elements takes a while.
[That said, on a hot production bloom filter much of the array can be loaded into the caches anyway, so it's not an entirely unrealistic scenario that some of these are cache hits]
A single lookup is going to take more than 30ns; the reason they only see 30ns is that the out-of-order (OoO) machinery of their CPU is good enough to run those loads in parallel.
Yes, 30ns means it's in cache. But bloom filters are surprisingly fast for the number of memory accesses they do, because the accesses all happen in parallel and there is a lot of parallelism in modern memory subsystems, so that you essentially only pay the cost of a single random read for the entire lookup. If you use 1GB pages, you can still realistically talk about <100ns lookups.
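To make the 1GB-pages remark concrete, here's a minimal sketch of allocating the filter's backing memory with 1GB huge pages on Linux. MAP_HUGE_1GB is real (Linux >= 3.8, with pages reserved at boot); the function name and sizing are my own illustration:

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdint.h>
#include <stddef.h>

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << 26)  /* log2(1GB) << MAP_HUGE_SHIFT */
#endif

/* Back the filter's bit array with 1GB huge pages, so a ~1.2GB filter
 * is covered by two TLB entries instead of ~300k 4KB ones.
 * Requires 1GB pages reserved at boot, e.g. hugepagesz=1G hugepages=2.
 * `bytes` must be a multiple of 1GB. */
static uint64_t *alloc_filter_1g(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                   -1, 0);
    return p == MAP_FAILED ? NULL : (uint64_t *)p;
}
```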
> so that you essentially only pay the cost of a single random read for the entire lookup
Why would you ever pay more than that for a bloom filter lookup? I mean, I don’t see how that has anything to do with parallelism in memory subsystems. But I may be missing something.
A bloom filter needs to do multiple loads from different memory locations for each single lookup. (7, in the example 1.2GB filter.) But unlike, say, a tree, it knows all the addresses as soon as it has computed the hashes, without having to wait for the results of earlier loads. So it can start all of them in parallel.
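Concretely, a lookup can look something like this; a minimal sketch, assuming a power-of-two bit array and Kirsch-Mitzenmacher double hashing (the names and layout are mine, not the article's):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

typedef struct {
    const uint64_t *bits;  /* bit array, m bits total */
    size_t          m;     /* number of bits (power of two here) */
    int             k;     /* number of hash functions, e.g. 7 */
} bloom_t;

static bool bloom_contains(const bloom_t *bf, uint64_t h1, uint64_t h2)
{
    /* All k bit positions are known up front from h1 and h2, so the
     * k loads have no data dependency on each other: the CPU's
     * out-of-order machinery can issue them in parallel. */
    size_t idx[16];
    for (int i = 0; i < bf->k; i++)
        idx[i] = (h1 + (uint64_t)i * h2) & (bf->m - 1);

    /* Optionally make the parallelism explicit with prefetches. */
    for (int i = 0; i < bf->k; i++)
        __builtin_prefetch(&bf->bits[idx[i] >> 6]);

    for (int i = 0; i < bf->k; i++)
        if (!(bf->bits[idx[i] >> 6] & (1ULL << (idx[i] & 63))))
            return false;  /* definitely not in the set */
    return true;           /* probably in the set */
}
```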
A simple extension of the Bloom filter called “block Bloom filters” fixes this. The idea is that the first hash is the index of a small constant-size block of your array, and the rest of the indices are within that block.
So a single query to the filter should incur only one or two cache misses, depending on the size of your block. And even if your block is larger than a cache line, you can probably issue all the loads at once and pay the latency of only one memory access.
The downside of doing this is slightly more space usage for the same false-positive rate, relative to simple Bloom filters. I'd almost always reach for block Bloom filters, though, once the filter becomes a significant fraction of cache size.
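For illustration, a minimal sketch of a cache-line-blocked variant; the layout, names, and power-of-two block count are assumptions, not anyone's production code:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define BLOCK_WORDS 8  /* 8 x 64-bit words = one 64-byte cache line */

typedef struct {
    const uint64_t *words;     /* n_blocks * BLOCK_WORDS words */
    size_t          n_blocks;  /* power of two here */
    int             k;
} blocked_bloom_t;

static bool blocked_bloom_contains(const blocked_bloom_t *bf,
                                   uint64_t h1, uint64_t h2)
{
    /* The first hash selects one block; every probe after that stays
     * inside it, so the whole query touches a single cache line. */
    const uint64_t *block =
        &bf->words[(h1 & (bf->n_blocks - 1)) * BLOCK_WORDS];

    for (int i = 0; i < bf->k; i++) {
        uint64_t bit = (h1 + (uint64_t)(i + 1) * h2) % (BLOCK_WORDS * 64);
        if (!(block[bit >> 6] & (1ULL << (bit & 63))))
            return false;
    }
    return true;
}
```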
I implemented block bloom filters for fairly large (~GB) arrays and saw about 35ns per lookup. They're excellent data structures, pretty much as fast as you can get for approximate membership tests (though other filters have better space-time tradeoffs).