A pinsketch is a set that takes a specified amount of memory and into which you can insert and remove set members or even add whole sets in time O(memory size). You can insert an unbounded number of entries, and at any time that it has equal or fewer entries than the size you can decode the list of members.
For an example usage, say I have a list of ten million IP addresses of people who have DOS attacked my systems recently. I want to send my list to you over an expensive pay as you go iridium connection, so I don't want to just send the 40MiB list since that would cost about $700. (Or 12.7 MiB if you use an optimal set encoding, or somewhat in between if you use the Golomb coded set mentioned in the OP).
Fortunately you've been making your own observations (and maybe have stale data from me), and we don't expect our lists to differ by more than 1000 entries. So I make and maintain a pinsketch with size 1000 which takes 4000 bytes (1000 * 4bytes because IP addresses are 32-bits). Then to send you an update I just send it over. You maintain your own pinsketch of addresses, you subtract yours from the one I sent and then you decode it. If the number of entries in the resulting set (the number different between us) is 1000 or fewer you're guaranteed to learn the difference (otherwise the decode will fail, or give a false decode with odds ~= 1/2^(1000)).
Bonus: We don't need to know in advance how different our sets are-- I can send the sketch in units as small as one word at a time (32-bits in this case) and stop sending once you've got enough to decode. Implemented that way I could send you exactly the amount of data you're missing without even knowing which entries you're missing.
The sketch itself is simple: To insert an element you raise it to a sequence of odd powers (1,3,5,7...) in a finite field and add it to an accumulator for each power. The tricky part is decoding it. :P
There is a application related data-structure called an inverted bloom lookup table (IBLT) that accomplishes the same task. Its encoding and especially decoding is much faster, and it has asymptotically the same communications efficiency. However, the constant factors on the communications efficiency are poor so for 'small' in set difference (like the 1000 above) it has a rather high overhead factor, and it can't guarantee decoding. I think that makes it much less magical, though it may be the right tool for some applications.
IBLT also has the benefit that it the decoder is a fun bit of code golf to implement.
The encoding for IBLT is simple: take your entries, append a checkvalue (can't be a plain CRC). Then hash them with three hash-functions to obtain three locations in a hash table and xor them into those locations (removal works the same way, due to xor's self-inverse property). Decoding an IBLT works by finding entries whos check value passes (which will be true when they are the only entry in their position) and subtracting them from the table (hash them to get their other two positions, and xor them in all three). Decoding is successful when the data structure is all zeros. When the tables are big enough and have a modest amount of slack space the decoding algorithm will be successful (for it to fail there has to be an overlapping cycle, which becomes rare in sufficiently large random graphs). (this description is probably enough for a reader to make a working IBLT implementation though it omits some improvements)
A pinsketch is a set that takes a specified amount of memory and into which you can insert and remove set members or even add whole sets in time O(memory size). You can insert an unbounded number of entries, and at any time that it has equal or fewer entries than the size you can decode the list of members.
For an example usage, say I have a list of ten million IP addresses of people who have DOS attacked my systems recently. I want to send my list to you over an expensive pay as you go iridium connection, so I don't want to just send the 40MiB list since that would cost about $700. (Or 12.7 MiB if you use an optimal set encoding, or somewhat in between if you use the Golomb coded set mentioned in the OP).
Fortunately you've been making your own observations (and maybe have stale data from me), and we don't expect our lists to differ by more than 1000 entries. So I make and maintain a pinsketch with size 1000 which takes 4000 bytes (1000 * 4bytes because IP addresses are 32-bits). Then to send you an update I just send it over. You maintain your own pinsketch of addresses, you subtract yours from the one I sent and then you decode it. If the number of entries in the resulting set (the number different between us) is 1000 or fewer you're guaranteed to learn the difference (otherwise the decode will fail, or give a false decode with odds ~= 1/2^(1000)).
Bonus: We don't need to know in advance how different our sets are-- I can send the sketch in units as small as one word at a time (32-bits in this case) and stop sending once you've got enough to decode. Implemented that way I could send you exactly the amount of data you're missing without even knowing which entries you're missing.
The sketch itself is simple: To insert an element you raise it to a sequence of odd powers (1,3,5,7...) in a finite field and add it to an accumulator for each power. The tricky part is decoding it. :P
Here is an implementation I contributed to: https://github.com/sipa/minisketch/
There is a application related data-structure called an inverted bloom lookup table (IBLT) that accomplishes the same task. Its encoding and especially decoding is much faster, and it has asymptotically the same communications efficiency. However, the constant factors on the communications efficiency are poor so for 'small' in set difference (like the 1000 above) it has a rather high overhead factor, and it can't guarantee decoding. I think that makes it much less magical, though it may be the right tool for some applications.
IBLT also has the benefit that it the decoder is a fun bit of code golf to implement.
The encoding for IBLT is simple: take your entries, append a checkvalue (can't be a plain CRC). Then hash them with three hash-functions to obtain three locations in a hash table and xor them into those locations (removal works the same way, due to xor's self-inverse property). Decoding an IBLT works by finding entries whos check value passes (which will be true when they are the only entry in their position) and subtracting them from the table (hash them to get their other two positions, and xor them in all three). Decoding is successful when the data structure is all zeros. When the tables are big enough and have a modest amount of slack space the decoding algorithm will be successful (for it to fail there has to be an overlapping cycle, which becomes rare in sufficiently large random graphs). (this description is probably enough for a reader to make a working IBLT implementation though it omits some improvements)