
You should be careful when using asymptotic bounds with more precision than about O(sqrt(n)). The bounds ignore constant factors, and the constant factors could be more significant than slowly growing non-constant factors for reasonable values of n.
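To make that concrete, here is a quick sanity check (a sketch, not a benchmark): for any realistic input size, log log n stays tiny, so even a modest constant factor outweighs it.

```python
import math

# Illustration: log2(log2(n)) for realistic input sizes. An O(n log log n)
# algorithm with a small constant can easily beat an O(n) algorithm with a
# larger one, because the log log n factor never gets big in practice.
for bits in (16, 32, 64):
    n = 2 ** bits
    print(f"n = 2^{bits}: log2(log2(n)) = {math.log2(bits):.2f}")
# Up to n = 2^64, log2(log2(n)) never exceeds 6, so a constant factor of 6
# already dominates it.
```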

It's also very common in algorithm design that improving the asymptotic bounds and making the algorithm faster (or the data structure smaller) are orthogonal (or even opposite) goals. Real computers have complex performance characteristics and fixed word lengths, and it's rarely a good idea to implement a theoretical algorithm exactly as described.

Succinct data structures often have a number of internal parameters. In a theoretical parameterization, those parameters may be described as being O(log n), O(log^2 n), or O(log log n). In a concrete implementation, it is often a good idea to replace them with constant approximations for nontrivial (such as 2x) performance gains. O(log n) could become 32 or 64, O(log^2 n) could become 1024 (or a power-of-2 multiple of it), and O(log log n) could become 4 or 8.

And then, if you parameterize the succinct data structure with these constants, the space overhead becomes a constant fraction.
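A back-of-envelope version of that claim (the counter widths below are illustrative assumptions, not from any particular implementation): with constant block sizes, the extra bits per bitvector bit no longer depend on n.

```python
# Hypothetical overhead estimate for a two-level rank index with constant
# parameters: one superblock counter per b1 bits and one in-superblock
# counter per b2 bits. Counter widths (64 and 16 bits) are assumptions.
def rank_overhead(b1, b2, super_counter_bits=64, block_counter_bits=16):
    """Extra index bits per bitvector bit; constant, independent of n."""
    return super_counter_bits / b1 + block_counter_bits / b2

print(rank_overhead(512, 64))     # 0.375, i.e. 37.5% overhead
print(rank_overhead(65536, 512))  # ~0.032, i.e. ~3.2% overhead
```

Both results are constant fractions of n, which is exactly the point: the parameterization, not the data structure, decides whether the overhead is sublinear.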




Succinct data structures require the extra space (above the information-theoretical minimum) to be an additive term of o(n) bits (little O, not big O). That just means that the extra space grows more slowly than n, as n approaches infinity, so their ratio (extra space)/n approaches 0 in the limit.


That's a simplification. Succinct data structures are usually parameterized data structures. They have a number of parameters, such as block sizes and sampling rates, that govern the space overhead and query performance. The published version may use a parameterization that makes the space overhead sublinear while guaranteeing attractive asymptotic bounds for worst-case performance. But if you actually implement that, the performance is likely terrible.

Consider the rank data structure for bitvectors that is supposed to guarantee O(1) time queries with o(n) bits of space overhead. The bitvector is divided into superblocks of b1 bits and blocks of b2 bits. The top-level index, which stores the rank up to each superblock, uses O(n log n / b1) bits of space. The second-level index, which stores the rank within the superblock up to each block, uses O(n log b1 / b2) bits of space. And then you need to do linear search within the block, which is typically assumed to take O(b2 / log n) or O(b2 / w) time, where w is word size. If you choose b1 = log^2 n and b2 = log n, you get O(1) time with O(n log log n / log n) bits of space overhead. Which is technically sublinear but effectively indistinguishable from linear with realistic values of n.
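You can see just how slowly that o(n) term shrinks by plugging in realistic sizes (ignoring the constants hidden by the O-notation):

```python
import math

# The O(n log log n / log n) overhead as a fraction of n, for realistic
# input sizes. Constants hidden by the O(.) are ignored.
def overhead_fraction(n):
    return math.log2(math.log2(n)) / math.log2(n)

for bits in (20, 32, 40, 64):
    n = 2 ** bits
    print(f"n = 2^{bits}: overhead ~ {overhead_fraction(n):.3f} * n bits")
```

Even at n = 2^64 the fraction is still around 9%, and at n = 2^32 it is about 16%: sublinear in the limit, linear for all practical purposes.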

Real-world implementations use constant values for the parameters. Typically the goal is to make the blocks and indexes align with cache lines and to replace arbitrary divisions with divisions by compile-time power-of-2 constants. Some values I've seen are (b1 = 512, b2 = 64) and (b1 = 65536, b2 = 512). In both cases, the overhead is linear.
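A minimal sketch of the (b1 = 512, b2 = 64) variant, assuming the bitvector is stored as 64-bit words with LSB-first bit positions (so 8 words per superblock, 1 word per block); a tuned implementation would pack the second-level counters into narrow fields and use hardware popcount:

```python
WORDS_PER_SUPERBLOCK = 8  # b1 / b2 = 512 / 64

class RankBitvector:
    """Two-level rank index with constant parameters b1=512, b2=64."""
    def __init__(self, words):
        self.words = words
        self.super_ranks = []  # rank of 1-bits before each superblock
        self.block_ranks = []  # rank within the superblock before each word
        total = 0
        for i, w in enumerate(words):
            if i % WORDS_PER_SUPERBLOCK == 0:
                self.super_ranks.append(total)
            self.block_ranks.append(total - self.super_ranks[-1])
            total += bin(w).count("1")

    def rank1(self, pos):
        """Number of 1-bits in positions [0, pos)."""
        word = pos // 64
        r = (self.super_ranks[word // WORDS_PER_SUPERBLOCK]
             + self.block_ranks[word])
        mask = (1 << (pos % 64)) - 1  # bits below pos within the word
        return r + bin(self.words[word] & mask).count("1")
```

Both lookups and the final masked popcount are O(1), and because the second-level counters here would fit in 16 bits (ranks within a 512-bit superblock), the index costs a constant fraction of n, matching the linear overhead described above.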

And sometimes the implemented data structure is a simplification, because the nominally succinct data structure is too large and slow. For example, it's rare to see actual O(1)-time implementations of select on bitvectors. That would require three levels of indexes with many special cases. It's more common to use two levels of indexes, with queries that are almost always constant-time but have (poly)logarithmic worst cases with adversarial inputs.
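The common simplification looks roughly like this (a sketch over a plain 0/1 list for clarity; real implementations scan word-by-word with popcount): sample the position of every k-th 1-bit, then scan forward from the nearest sample.

```python
SAMPLE_RATE = 512  # store the position of every 512th 1-bit

class SelectBitvector:
    """Sampled select: usually fast, but the scan between samples can be
    long on adversarial inputs (e.g. very sparse 1-bits)."""
    def __init__(self, bits):
        self.bits = bits      # list of 0/1 for clarity
        self.samples = []     # position of every SAMPLE_RATE-th 1-bit
        ones = 0
        for i, b in enumerate(bits):
            if b:
                if ones % SAMPLE_RATE == 0:
                    self.samples.append(i)
                ones += 1

    def select1(self, k):
        """Position of the k-th 1-bit (0-indexed)."""
        pos = self.samples[k // SAMPLE_RATE]  # jump to the nearest sample
        remaining = k % SAMPLE_RATE
        while True:                           # linear scan from the sample
            if self.bits[pos]:
                if remaining == 0:
                    return pos
                remaining -= 1
            pos += 1
```

The jump to the sample is O(1); the scan is what produces the non-constant worst case the comment mentions, since an adversary can put up to SAMPLE_RATE ones arbitrarily far apart.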


This is really good information! Thanks for writing it up.

Honestly, I never actually "trust" the complexity analysis. Whenever I find a paper, I immediately look for specific experimental results, and if I can't find any, I assume the paper is only of theoretical interest. There's of course still a lot to learn from a purely theoretical paper, but I've always ended up disappointed when I've implemented something that only had "good" asymptotic bounds.


You're absolutely right and my response completely missed your point, thanks for clarifying further.



