Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Thanks for this interesting writeup; it's definitely thought provoking.

Because you can't index that bloom column, it seem's you'd always be doing full table scans.

In fact it doesn't appear any indexes were used throughout this whole exercise, is that right?



This is what I was wondering. What's the performance difference compared to using a junction table with indexes and what are the other trade-offs of indexes vs bloom filters. I'm guessing there are cases when you wouldn't want to maintain the indexes. I'm not sure what they would be though.


I believe that you can index the bloom column, to some extent, with some hacks.

A major problem is a lack of indexes in array elements, which then forces you to build an otherwise unneeded table.

That being said, it's definitely doable, to some extent.


FYI - the intarray contrib module for Postgres does support indexing of arrays of integers.

http://www.postgresql.org/docs/9.1/static/intarray.html#AEN1...

And it happens to use bloom filters under-the-hood to do it.

From the docs:

"Two GiST index operator classes are provided: gist__int_ops (used by default) is suitable for small- to medium-size data sets, while gist__intbig_ops uses a larger signature and is more suitable for indexing large data sets (i.e., columns containing a large number of distinct array values). The implementation uses an RD-tree data structure with built-in lossy compression."

That bit about the lossy compression in the implementation if referencing the use of a bloom filter. Also mentioned here, http://archives.postgresql.org/pgsql-hackers/2005-05/msg0147...


I saw that, but it doesn't seem to implement any of the indexes you'd need in order to run a bloom query efficiently.

That being said, if you stored your bitvector as an array of powers of two, then it would work. But that would be horribly inefficient in terms of space usage.


Which is why PostgreSQL is scriptable: the various contrib modules are often better looked at as examples of how to build your own indexes using GIN/GiST than "this is what we provide".

In your case, though, a strict immutable function mapping the bitvector to an int array as part of a functional index should be sufficient to use the existing contrib module: you don't need to store the things you index in the table.


I don't see how you could theoretically index this in a way that supports the bloom filter query. Indexes rely on data being ordered, and b-tree (similar to binary tree) lookups. A where clause that's like "where col & val = col" can never be supported by a b-tree style index... right? Am I missing something?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: