Here are a few fun ones: the Burrows–Wheeler transform, suffix arrays in general, Kanerva's fully distributed representation, reduced-affine-arithmetic numbers, LSM trees, binary array sets, sparse array sets, and gap buffers. (I'm not sure how nobody has mentioned gap buffers on the thread yet, but if they did, I didn't see it.)
— ⁂ —
The Burrows–Wheeler transform is the character immediately preceding each suffix in a (cyclic) suffix array (equivalently, the last column of the sorted rotations), and there turns out to be a simple algorithm to recover the original text from this transform; live demo: http://canonical.org/~kragen/sw/bwt. But the transformed text is very easily compressible, which is the basis for bzip2. As mentioned by Blahah and dahart, this is only one of many interesting things you can do with suffix arrays; they can also do efficient regular expression search, for example. Suffix arrays became much more practical once three linear-time, linear-space construction algorithms were discovered in the early 02000s.
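Here is a tiny Python sketch of the textbook way to compute and invert the transform, sorting whole rotations and using a NUL byte as an end-of-text sentinel; it is quadratic-ish and not necessarily how the linked demo does it, but it shows the mechanics:

    def bwt(s):
        # Sort all cyclic rotations of s; the transform is the last column
        # of the sorted rotation table.
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return ''.join(r[-1] for r in rotations)

    def ibwt(last):
        # Naive inversion: repeatedly prepend the transform column and
        # re-sort; after len(last) rounds the table holds every sorted
        # rotation, and the row ending in the sentinel is the original text.
        table = [''] * len(last)
        for _ in range(len(last)):
            table = sorted(last[i] + table[i] for i in range(len(last)))
        return next(row for row in table if row.endswith('\0'))

    text = 'banana\0'
    assert bwt(text) == 'annb\0aa'    # equal characters tend to cluster
    assert ibwt(bwt(text)) == text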
— ⁂ —
Pentti Kanerva devised a "fully distributed representation" in the 01990s, which has some features in common with bloom filters — every association stored in an FDR record is stored to an equal extent (probabilistically) in every bit of the record, so erasing some of the bits erases none of the associations but increases the error probability for all of them. You assign a random or pseudorandom bitvector (of, say, 16384 bits) to every atomic value to be stored, represent an association of two (or more) atoms as their XOR, and then take the bitwise majority rule of all the associations to be stored as the record. To retrieve an association, you XOR the record with an atom, then compute the correlation of the XOR result with all possible atoms; the correlation with the correct atoms will be many standard deviations away from the mean, unless the number of associations stored in the record is high enough that conventional data structures wouldn't have been able to store it either.
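Here's a toy numpy sketch of the scheme as I understand it, with made-up atom names and a crude tie-breaking rule; a real vector-symbolic system would be more careful, but it shows the everything-stored-everywhere behavior:

    import numpy as np

    N_BITS = 16384
    rng = np.random.default_rng(0)

    def atom():
        # Each atomic symbol gets a random dense bitvector.
        return rng.integers(0, 2, N_BITS, dtype=np.uint8)

    def record(associations):
        # Each association is the XOR of its atoms; the record is the bitwise
        # majority vote over all the associations (ties broken here by
        # throwing in one extra random vector).
        pairs = [a ^ b for a, b in associations]
        if len(pairs) % 2 == 0:
            pairs.append(atom())
        counts = np.sum(np.asarray(pairs, dtype=np.int32), axis=0)
        return (2 * counts > len(pairs)).astype(np.uint8)

    def lookup(rec, probe, atoms):
        # XOR the record with one atom of an association, then pick the stored
        # atom that agrees with the result on the most bits; the right partner
        # agrees on far more bits than the chance level of ~50%.
        noisy = rec ^ probe
        return max(atoms, key=lambda name: np.sum(noisy == atoms[name]))

    atoms = {name: atom() for name in ['red', 'blue', 'cat', 'dog', 'car']}
    rec = record([(atoms['red'], atoms['cat']), (atoms['blue'], atoms['dog'])])
    assert lookup(rec, atoms['red'], atoms) == 'cat'
    assert lookup(rec, atoms['dog'], atoms) == 'blue'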
This and some related work have blossomed into a field called "hyperdimensional computing" or "vector symbolic architectures". But the problem of finding the nearest atom or atoms seems, on its face, to make this impractical: if you have 16384 possible atoms and 16384 bits in your bitvector, you would seem to need at least 16384 × 16384 = 268435456 bit operations to compute the correlations with all the atoms.
I realized the other night that if you use the rows of a Walsh–Hadamard matrix as the "pseudorandom" atom representations, you can compute all the correlations in linearithmic time with the fast Walsh–Hadamard transform. Moreover, Cohn and Lempel's 01977 result "On Fast M-Sequence Transforms" is that after a simple transposition you can also use the FWHT to compute all the correlations with the complete output of a maximal LFSR (a so-called "M-sequence") in linearithmic time, so you can also use the cyclic shifts of an LFSR output for your atoms without losing efficiency. Probably someone else has figured this out before, but I haven't found it in the literature.
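For example, here's a small numpy sketch (just the standard in-place FWHT and the Sylvester construction of the Hadamard matrix, nothing from Cohn and Lempel) showing that one O(n log n) transform recovers the correlations of a noisy query with all n atoms at once:

    import numpy as np

    def fwht(v):
        # Iterative fast Walsh-Hadamard transform in natural (Sylvester)
        # order: O(n log n) instead of the O(n^2) matrix-vector product.
        v = v.astype(np.int64)
        h = 1
        while h < len(v):
            for i in range(0, len(v), 2 * h):
                a, b = v[i:i+h].copy(), v[i+h:i+2*h].copy()
                v[i:i+h], v[i+h:i+2*h] = a + b, a - b
            h *= 2
        return v

    n = 1024                                   # must be a power of two
    H = np.array([[1]])
    while H.shape[0] < n:                      # Sylvester construction;
        H = np.block([[H, H], [H, -H]])        # rows are the ±1 "atoms"

    rng = np.random.default_rng(1)
    k = 321
    query = H[k] * np.where(rng.random(n) < 0.9, 1, -1)  # H[k], ~10% of bits flipped
    corr = fwht(query)                         # all n correlations at once
    assert np.argmax(corr) == k                # the right atom stands out
    assert np.array_equal(corr, H @ query)     # matches the slow product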
— ⁂ —
I'm not sure if the quantities used in reduced affine arithmetic count as a "data structure"? They're a data type, at least, but they're not a container. The idea is that you execute some arbitrary numerical algorithm, not merely on floating-point numbers, or even on dual numbers as in forward-mode automatic differentiation, nor even yet on (min, max) intervals as in ordinary interval arithmetic, but on a vector of coefficients [a₀, a₁, a₂, ..., aₙ], which represents a linear function a₀x₀ + a₁x₁ + a₂x₂ + ... + aₙxₙ, where x₀ = 1 and each of the other xᵢ ∈ [-1, 1], so this function is really a representation of an interval. Then you can assign each of the xᵢ (except x₀ and xₙ) to be the unknown error in one of the input parameters to your algorithm, whose magnitude is given by that parameter's aᵢ coefficient; and when you introduce rounding error, you increase aₙ enough to account for it. (The non-reduced kind of affine arithmetic just increases n without limit to account for new roundoff errors.)
Reduced affine arithmetic has two advantages over the usual kind of interval arithmetic. First, correlated errors cancel the way they should: if you calculate (y+1)(y+2)/(y+3) with interval arithmetic, with some uncertainty in the value of y, you usually get an unnecessarily loose bound on the result, because interval arithmetic assumes that the uncertainties in the numerator and the denominator are uncorrelated, when in fact for many values of y they partly cancel out. (You can't apply this cancellation to aₙ, the term that accumulates all the roundoff error in your algorithm, just to the other aᵢ.) But the more interesting feature is that RAA hands you a linear function of the inputs that approximates the result of the calculation, which is pretty similar to calculating the gradient — but with error bounds on the approximation!
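Here's a minimal Python sketch of the reduced form, with a single lumped error term and no attempt to account for floating-point rounding in the bookkeeping itself (a real implementation would fold that into aₙ too), and with division left out:

    class Affine:
        # a0 + a1*x1 + ... + ak*xk + e*x_err, each x in [-1, 1]; e is the
        # lumped error term (the a_n above), absorbing linearization error.
        def __init__(self, center, partials, err=0.0):
            self.c, self.p, self.e = center, list(partials), err

        @staticmethod
        def _coerce(v, k):
            return v if isinstance(v, Affine) else Affine(float(v), [0.0] * k)

        def __add__(self, other):
            o = Affine._coerce(other, len(self.p))
            return Affine(self.c + o.c,
                          [a + b for a, b in zip(self.p, o.p)],
                          self.e + o.e)
        __radd__ = __add__

        def __neg__(self):
            return Affine(-self.c, [-a for a in self.p], self.e)

        def __sub__(self, other):
            return self + (-Affine._coerce(other, len(self.p)))

        def __rsub__(self, other):
            return (-self) + other

        def __mul__(self, other):
            # Linear terms multiply through the centers; the quadratic
            # cross-term is bounded by the product of the two radii and
            # dumped into the lumped error, along with the scaled errors.
            o = Affine._coerce(other, len(self.p))
            rad_s = sum(map(abs, self.p)) + self.e
            rad_o = sum(map(abs, o.p)) + o.e
            return Affine(self.c * o.c,
                          [self.c * b + o.c * a for a, b in zip(self.p, o.p)],
                          abs(self.c) * o.e + abs(o.c) * self.e + rad_s * rad_o)
        __rmul__ = __mul__

        def interval(self):
            r = sum(map(abs, self.p)) + self.e
            return (self.c - r, self.c + r)

    y = Affine(2.0, [0.5])            # y = 2 ± 0.5, i.e. y in [1.5, 2.5]
    print(((y + 1) - y).interval())   # (1.0, 1.0): the uncertainty cancels exactly
    print((y * (3 - y)).interval())   # (1.25, 2.75); plain interval arithmetic on
                                      # [1.5,2.5]*[0.5,1.5] gives (0.75, 3.75), and
                                      # the true range is (1.25, 2.25)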
— ⁂ —
LSM trees are only about as "obscure" as bloom filters, since they underlie several prominent filesystems, LevelDB, and most of the world's search engines, but if you don't know about bloom filters, you might not know about LSM trees either. I wrote a tutorial explanation of LSM trees in https://github.com/kragen/500lines/tree/master/search-engine, though the name "LSM tree" hadn't yet gone mainstream.
Binary array sets, as described in https://www.nayuki.io/page/binary-array-set, are closely related to LSM trees, to the point that you could describe them as an application of LSM trees. They are a simple data structure for the exact version of the problem bloom filters solve approximately: maintaining a set S of items supporting the operations S.add(item) and S.contains?(item). Insertion (.add) is amortized O(log N), since each element takes part in at most log N merges, and membership testing (.contains?) is O(log² N).
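A Python sketch of the structure, following Nayuki's description as I remember it (sorted() stands in for the linear merge a careful implementation would use):

    import bisect

    class BinaryArraySet:
        # levels[k] is either [] or a sorted list of exactly 2**k items.
        # Insertion propagates "carries" like binary addition, which is
        # where the kinship with LSM trees comes from.
        def __init__(self):
            self.levels = []

        def add(self, item):
            carry = [item]
            for k in range(len(self.levels)):
                if not self.levels[k]:
                    self.levels[k] = carry
                    return
                # merge two runs of equal size and carry the result upward
                carry = sorted(self.levels[k] + carry)
                self.levels[k] = []
            self.levels.append(carry)

        def __contains__(self, item):
            # one binary search per nonempty level: O(log^2 N) overall
            for level in self.levels:
                i = bisect.bisect_left(level, item)
                if i < len(level) and level[i] == item:
                    return True
            return False

    s = BinaryArraySet()
    for x in [5, 3, 9, 1, 7]:
        s.add(x)
    assert 9 in s and 4 not in s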
— ⁂ —
A different and more efficient data structure for the exact set-membership problem, limited to the case where the items to be stored are up to M integers in the range [0, N), supports constant-time creation, clearing, insertion, size measurement, and membership testing, plus linear-time enumeration, but I think it cannot be implemented in modern standard C because of the rules about reading uninitialized data. (By allowing creation to be linear time, you can implement it in standard C.) It uses two arrays, A[M] and B[N], and a counter n; the value i is a member of the set iff B[i] < n ∧ A[B[i]] == i, from which it is straightforward to work out the other operations.
I don't remember what this data structure is called, but I definitely didn't invent it. I described it and some uses for it in https://dercuano.github.io/notes/constant-time-sets-for-pixe.... You can associate additional values with this structure to use it as a sparse array.
Oh, I see kcbanner posted this one too, with a link to Russ Cox's writeup, which is better than mine and explains that it was published by Briggs and Torczon in 01993, but was probably not original to them, since it had already appeared as an exercise in Aho, Hopcroft, and Ullman's 01974 textbook and in Programming Pearls: https://research.swtch.com/sparse
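Here's a Python sketch of the invariant; in Python the arrays have to be initialized anyway, so creation here is linear-time, like the standard-C variant mentioned above, and the class and method names are just mine:

    class SparseSet:
        # B (the "sparse" array) maps a value to its slot in A (the "dense"
        # array); membership holds only when the two arrays point at each
        # other, so stale garbage left behind in B is harmless.
        def __init__(self, n_universe, m_capacity):
            self.A = [0] * m_capacity   # the members, in insertion order
            self.B = [0] * n_universe   # may contain stale junk
            self.n = 0                  # current number of members

        def __contains__(self, i):
            j = self.B[i]
            return j < self.n and self.A[j] == i

        def add(self, i):
            if i not in self:
                self.A[self.n] = i
                self.B[i] = self.n
                self.n += 1

        def clear(self):
            self.n = 0                  # constant time: B stays dirty on purpose

        def __len__(self):
            return self.n

        def __iter__(self):             # linear in the number of members, not in N
            return iter(self.A[:self.n])

    s = SparseSet(1000, 100)
    s.add(42); s.add(7)
    assert 42 in s and 8 not in s
    s.clear()
    assert 42 not in s and len(s) == 0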
— ⁂ —
A gap buffer is a data structure sort of similar to Huet's zipper, but without the dynamic allocation; Emacs famously uses them for its editing buffers. It consists of an array containing a leading chunk of text, followed by a gap, followed by a trailing chunk of text. Edits are always made in the gap; to start editing at a different position, you have to move some text from the end of the leading chunk to the beginning of the trailing chunk or vice versa.
Suppose you want to edit a 531-byte file; you allocate a larger buffer, say 1024 bytes, and read those 531 bytes into the beginning of the buffer. If the user starts inserting text after byte 200, you begin by moving bytes 201–530 (330 bytes) to the end of the 1024-byte buffer (positions 694–1023); then you can insert the user's text at positions 201, 202, 203, etc., without having to do any more work to move text. If the user moves up to byte 190 and inserts more text, you only have to move bytes 191–203 or whatever to the end of the buffer.
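Here's a minimal Python sketch of a gap buffer over bytes (names are just for the sketch); it leaves out growing the buffer when the gap fills up, which a real implementation obviously needs:

    class GapBuffer:
        # text == buf[:gap_start] + buf[gap_end:]; all edits happen at gap_start.
        def __init__(self, text, capacity=1024):
            capacity = max(capacity, 2 * len(text))
            self.buf = bytearray(capacity)
            self.buf[:len(text)] = text
            self.gap_start, self.gap_end = len(text), capacity

        def move_gap(self, pos):
            # Slide text across the gap so the gap starts at pos.
            if pos < self.gap_start:      # cursor moved left
                n = self.gap_start - pos
                self.buf[self.gap_end - n:self.gap_end] = self.buf[pos:self.gap_start]
                self.gap_start, self.gap_end = pos, self.gap_end - n
            elif pos > self.gap_start:    # cursor moved right
                n = pos - self.gap_start
                self.buf[self.gap_start:pos] = self.buf[self.gap_end:self.gap_end + n]
                self.gap_start, self.gap_end = pos, self.gap_end + n

        def insert(self, pos, text):      # assumes the gap is big enough
            self.move_gap(pos)
            self.buf[self.gap_start:self.gap_start + len(text)] = text
            self.gap_start += len(text)

        def delete(self, pos, n):
            self.move_gap(pos)
            self.gap_end += n             # deleted bytes just join the gap

        def text(self):
            return bytes(self.buf[:self.gap_start] + self.buf[self.gap_end:])

    b = GapBuffer(b"hello world")
    b.insert(5, b", cruel")
    b.delete(0, 5)
    b.insert(0, b"goodbye")
    assert b.text() == b"goodbye, cruel world"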
A gap buffer is very fast in the usual case of making a sequence of small edits at the same location or at nearby locations, and because the constant factor of copying a large block of bytes from one place to another is so small, even moving the gap from one end of the buffer to the other is generally plenty fast for interactive use. Moreover, operations like string search are very fast in a gap buffer, while they tend to be very slow in fancier data structures with better asymptotic complexity, such as piece tables and ropes; and a gap buffer uses a predictable and relatively small amount of extra memory.
— ⁂ —
However, like FORTH, fancy data structures are a bit of an attractive nuisance: they tempt you to get into trouble. When I first came across Sedgewick's Algorithms in C it was a revelation to me: here were so many clever ways of breaking down problems that made apparently impossible problems easy. (I had come across sorting algorithms before but I didn't really understand them.) I immediately and erroneously concluded that this was the knowledge I had been missing to program all the things that I didn't know how to program, and I began my counterproductive data-structure-fetishism phase.
Much better advice is found in The Practice of Programming: almost all programs can be written without any data structures but arrays, hash tables, linked lists, and, for things like parsing, symbolic algebra, or filesystems, trees. So just use those if you can.
Occasionally, a program is only possible, or is dramatically better, using fancy data structures like LSM-trees, bloom filters, suffix arrays, skip lists, HAMTs, union find, zippers, treaps, BDDs, ropes, gap buffers, Fenwick trees, splay trees, and so on. And that can be super fun! But this situation is pretty rare, and it's important not to get yourself into the position where you're implementing a red-black tree or HAMTs when it would have been a lot easier to write your program in a brute-force way. Or where your successors are fixing bugs in your Fenwick tree implementation.
Also, it's important to be aware that, quite aside from the difficulty of writing and understanding the code, fancy data structures often have extra costs at runtime which can outweigh their benefits. It's easy to discover that your splay tree runs slower than sequential search over an unsorted array, for example, and of course it uses much more RAM.
> Much better advice is found in The Practice of Programming: almost all programs can be written without any data structures but arrays, hash tables, linked lists, and, for things like parsing, symbolic algebra, or filesystems, trees. So just use those if you can.
In general, as much as possible use stuff you already have good libraries for.
Often, you can get away with much simpler data structures with a bit of clever pre-sorting (or sometimes: random shuffling). Relatedly, batching your operations can also often make your data structure needs simpler.
I'm not that enthusiastic about libraries. Yes, using a good LSM-tree library will keep you and your successors from spending holiday weekends debugging your LSM-tree implementation — probably, because "good" doesn't mean "perfect". But it won't be a tenth as fast as an in-RAM hash table, if an in-RAM hash table can do the job. And CPython's standard hash table usually won't be a tenth as fast as a domain-specific hash table optimized for your needs.
That doesn't always matter very much, but it does always matter. Especially now with the advent of good property-based testing libraries like Hypothesis, if you can run a function call in 0.1 ms instead of 1.0 ms, you can do ten times as much testing with a given amount of resources.
I like Hypothesis just as much as the next guy. It's awesome.
You have some valid points: when you have specialised needs, you might need to write specialised software.
Though even for your example, I would suggest you write your domain-specific hash table as if it were a library, if possible (instead of embedding it deep in your application code). Trying to make your code testable with Hypothesis strongly encourages such an approach anyway.
(To show that my advice ain't trivial, I give a counterexample where you can't do this as easily: C folks sometimes write things like intrusive linked lists, or intrusive data structures in general. Almost no language gives you good tools to write these as a library, so testing properties in isolation is hard to do, too.)