Beating hash tables with trees? The ART-ful radix trie

jamii · on Nov 10, 2018

My experience has been that the vast majority of papers on data-structures are at best misleading, and at worst deliberately biased.

For example:

> The hash table used by the authors of ART in their study was a chained hash table, but this kind of hash tables can be suboptimal in terms of space and performance due to their potentially high use of pointers.

> Our experiments strongly indicate that neither ART nor Judy are competitive to the aforementioned hashing schemes in terms of performance, and, in the case of ART, sometimes not even in terms of space.

https://www.victoralvarez.net/papers/A%20Comparison%20of%20A...

ltta_code · on Nov 10, 2018

Funny enough, even that paper seems biased as well to me.

This paper probably applies best when you use the HT or tree as a DB index data-structure without constant updates for huuuge datasets, probably not as well when used as an updated associative map in your average program. I'd guess hash-tables (HT) are still faster in day-to-day programmer use but not as much as pointed out here.

Caveats:

- HTs pre-allocated: In the two experiments that show those detailed bar-charts, the hash-tables are completely pre-allocated, so no re-allocation and re-hashing on inserts.

- Charts: The workloads IV-C and IV-E get by far the most attention (3/4 page, 9 bar charts - each), these best support the paper's statements that HTs are much faster and consume less mem than ART. The mixed-use case IV-D is reported with only a single small chart, the whole write-up is just about 1/4 page combined and ART does better (comparable, but still slower than fastest HT) - arguably IV-D is probably most realistic use-case for how most people use hashmaps day-to-day.

- The keys are 64-bit ints so aren't all that long, again maybe common in a DB index but not really what you use hashmaps for mostly in the real world. So ART can't play out some of its strengths.

- The data set skews large, larger than the original paper (16M to 1,000,000,000 (1Billion) keys), and most people probably foremost look at the charts on the right with the huge numbers that make HT look best. Again, makes the results not super applicable to day-to-day hashmap use imo.

- They only look up existing keys, probably not great for either ART nor HT but could hurt their CH*Bucket performance.

Most of these caveats are explained in the test, but some are kind of buried. You can't just look at the charts and conclude hash-tables are 5x as fast as ART, always.

leiroigh · on Nov 10, 2018

It is unsurprising that a structure designed for range queries loses out against hashes for random point look-ups.

Thanks for posting that addition and very nice link. It is completely unclear why authors in this field even attempt to compete with hash-tables for random point queries: Trees (whether comparison or radix) make only sense when using either range queries or when faced with a query distribution that prefers locality.

Something I'd like to see is benchmarks of batch-query performance: If we can queue a couple of queries, then trees should gain a lot from processing them in-order (then, the last query has already warmed the cache for the next; this probably can pay for approximately sorting the batch).

HenryR · on Nov 11, 2018

Authors try to compare their work against hash tables because, usually, HT represent an upper bound on the performance of point-lookups; we don't know how to do much better in general.

So if your data structure supports range queries _and_ point lookups, you should measure against hash tables to understand how far off the ideal you are. If it's not far, and your data structure is strictly more general, that's compelling.

leiroigh · on Nov 11, 2018

Good point!

frankmcsherry · on Nov 11, 2018

> Something I'd like to see is benchmarks of batch-query performance: If we can queue a couple of queries, then trees should gain a lot from processing them in-order (then, the last query has already warmed the cache for the next; this probably can pay for approximately sorting the batch).

I'm not sure if this is your point, but when you actually look at the uses of trees like the above (often: query processing) then the point look-up times are less relevant for implementations that do not serialize the look-ups.

Another place this shows up is when folks switch from closed-loop testing to open-loop testing (i.e. testing under a fixed rate of look-ups, rather than one or some few at a time). The latency-throughput trade-off curve makes trees look a lot less bad, but they still have a large minimal latency. A hash map with a sorted key (e.g. RobinHood hashing) still works better except at peak loads.

You can take this just a bit further and you find that replacing trees with a sorted list also works great (or, LSM of sorted lists, say). "Great" here just means "has some tolerable minimum latency, and peak throughput".

bradleyjg · on Nov 10, 2018

> or when faced with a query distribution that prefers locality

Isn't this a fairly common pattern? How would we go about quantifying the preferring locality-ness of a query distribution?

whatshisface · on Nov 10, 2018

>How would we go about quantifying the preferring locality-ness of a query distribution?

For each level of the tree, treat the incoming stream of queries as a markov process where each state is a query that involves a certain node. So, if I have 7 nodes on level 2, I can build up a table of transition probabilities between vertices like "query involved node 3 on level 2" and "query involved node 7 on level 2." When the transitions between these vertices and themselves have high probability, the queries prefer locality. You can see which scale the locality is preferred on by doing this at each level of the tree.

leiroigh · on Nov 10, 2018

>Isn't this a fairly common pattern?

Depends on the application. It is certainly not uncommon that queries fired at close times share a prefix.

>How would we go about quantifying the preferring locality-ness of a query distribution?

I know it when I see it?

fulafel · on Nov 10, 2018

Some other good reasons to use trees/tries, besides the ones you mention:

- reasonable worst-case behaviour (think real-time apps, or adverserial scenarios)

- persistent data structures (eg HAMT), as used in Clojure, Scala & Immutable.js

And the speed boost from locality gets more dramatic every passing year as the memory gap keeps widening.

leiroigh · on Nov 10, 2018

>reasonable worst-case behaviour (think real-time apps, or adverserial scenarios)

I'm not sure I agree. When faced with potentially adversarial input, a hash should be keyed. And, depending on the specific tree, you can probably cause a lot of havoc by causing repeated tree reorganizations. Even worse, you have a timing side-channel, because node-split and node-merge are expensive, likely measurable over the network. This can leak info about existing adjacent keys in the database.

This kind of attack is afaik not as researched or well-known as hash collision based attacks, but in my book that is a point in favor of hash tables with keyed hashes.

On the other hand, both approaches shouldn't play in the same league: If you need to support range queries, then you need to support range queries.

>And the speed boost from locality gets more dramatic every passing year as the memory gap keeps widening.

This is true for scenarios where trees save cache misses (e.g. if there is a correlation between order of queries and location of keys). The linked comparison explicitly compared L3 misses between various trees and various hash tables for various scenarios, and the hash tables come out better in that metric.

All that being said, ART is pretty cool and node16's branch-free SIMD search is amazing.

jstimpfle · on Nov 10, 2018

Related: https://nothings.org/computer/judy/

acidx · on Nov 10, 2018

Nice article and analysis! I'm actually considering scrapping the trie used in my project to something based off of this one, with some modifications:

For instance, find_node(c, Node48) could avoid the branch if a non-existing index points to an additional entry in child_ptrs that's always NULL. Lookup would be comparable to the Node256 version.

Another thing that could be done, is to scrap the Node48 entirely, and implement two new structs to replace it: Node32 and Node64, and use respectively AVX2 and AVX512. These can be based off of the Node16 version. It remains to be seen if these will yield better performance than the branchless Node48 above, especially if power management kicks in when mixing AVX512 with older SIMD generations.

The trie in Lwan (https://lwan.ws) does an interesting trick to reduce the amount of memory used in the equivalent of a Node256: instead of 256 pointers to a node, it has only 8 pointers. Characters are hashed (MOD 8). The leaf node contains a linked list of key/value pair, and an actual string comparison is performed at the end. (Lwan cheats here by avoiding a string comparison if the linked list contains only 1 element.) Works pretty well, as it's part of the URL routing mechanism.

One other experiment I've been making with tries, is to use the idea of key compression and use it in a different way: slice it every 4 or 8 bytes, consider those bytes as an arbitrary integer, and add every chunk of it to a hashmap<int, some_struct>, building a chain for the next lookup in some_struct. The prototype I wrote works pretty well.

jules · on Nov 10, 2018

Another standard trick to reduce memory is to store a bitfield telling you which pointers are not null, and then store an array of only the pointers that are non null. For example for a Node64 you store a 64 bits bitfield plus only the pointers that are not null, so if that node has only 3 children you store 64 bits plus 3 pointers instead of 64 pointers. You can index that structure by doing a shift + popcount: if you want to find the pointer at index n you first check the n-th bit in the bitfield to see if the pointer is null, and if it isn't you count the number of set bits in bitfield[0..n] to find the index into the pointer array.

amluto · on Nov 10, 2018

The AVX512 power issues and startup latency likely mean that using it it will likely be a loss except in microbenchmarks or unusual workloads.

faragon · on Nov 10, 2018

A good point for both RB trees and linear-addressing hash tables is that they can be implemented with vectors( [1], [2]), allowing the case of initial reservation for N elements, so with a tricky implementation you could even have the data structure with one or zero allocations (e. g. allocate the tree or the hash table in the stack). For tries you could use many memory pools for the different node sizes, apply path compression, and even a LUT accelerator for reaching the Nth level, but hardly could be implemented using a vector.

[1] https://github.com/faragon/libsrt/blob/master/src/saux/stree...

[2] I'm implementing a key-value hash table that will be added to the same library as [1] with "srt_hmap" type, in one continuous allocation. Being able to use hash tables allocated both in the heap and in the stack (e.g. you could use a int32-int32 hash table allocated in the stack for computing the color frequency of a bitmap image). Being the HT performance 4 to 5x the performance of the RB-trees, including cost of rehashing - rehash implementation using techniques for avoiding moving all the data- (rehashing only available for the heap case).

jstimpfle · on Nov 10, 2018

You can implement most things with dynamically allocated vectors. Just use indices instead of pointers to link the elements. This can also bring advantage in space efficiency if you're able to do with 4 byte indices instead of 8 byte pointers.

kccqzy · on Nov 10, 2018

I tend to use C++'s std::deque as a poor man's arena allocator. I find it works even better than vectors with indices: indeed C++ says pushing at the end will validate iterators but not pointers to the elements themselves. This gets rid of the infrequent resizing in vectors.

jstimpfle · on Nov 10, 2018

I think you have a typo in there, "validate iterators" -> "invalidate iterators".

It should be easy to build a non-invalidating version of the iterator that has a pointer to the chunk and an offset within that chunk. That's more like 12 or 16 bytes, though.

What I like about vectors is that an index is not really about the storage that the vector holds at that position. The index is a mathematical identity of an object. It's the object itself. The vector is only a materialized function that holds some attribute for each object. The nice thing is we can have multiple independent vectors for each object type. This is true modularity.

We can get an address from a simple index with deques, too. But only with significant indirection. So I think if (1) no movement cost on reallocation is allowed, and/or (2) stable pointers are needed, deques are a fine choice. But vectors are my first choice.

Given that we have huge address spaces nowadays we could also consider preallocating large global vectors at stable addresses, and expanding them with mmap(). This way the copying when growing the vector is not needed.

faragon · on Nov 10, 2018

Yes, that's how works the RB-tree example I linked. However, it is much harder doing that for implementing tries, except if you work only with the full node case, which would imply wasting a huge amount of space. At least, from my experience.

vbezhenar · on Nov 10, 2018

You can also use vector of vectors. Reallocating would be cheaper. But access would incur extra indirection.

saagarjha · on Nov 10, 2018

You can consider the entire virtual memory space to be a big vector, where a pointer is just an index into it!

jstimpfle · on Nov 10, 2018

However that's a pretty poor vector. It's not homogeneous (you put things of different shapes inside). That also implies you need sophisticated memory management, leading to further overheads. You cannot meaningfully iterate it.

saagarjha · on Nov 10, 2018

I was mostly joking…

jstimpfle · on Nov 10, 2018

Yes and I'm serious :-)

photon-torpedo · on Nov 10, 2018

Yes, that's exactly how we do it in Fortran 77.

namibj · on Nov 10, 2018

Just a friendly reminder that B-trees are often faster on modern microprocessors than RB-trees. See kbtree.h for a simple, yet fast example. I didn't test it, but I'd assume B-tries would be rather efficient.

jules · on Nov 10, 2018

Red-Black trees are isomorphic to B-trees of node size 4. If you take a Red-Black tree and put each black node together with all its red children into a single node then you get a B-tree of node size 4. An insertion/deletion algorithm for Red-Black trees gives a corresponding insertion/deletion algorithm for B-trees of size 4 and vice versa. Putting those nodes together in a single node probably improves performance already because you have fewer pointers, and you can improve performance further by increasing the size of the B-tree nodes to, say, 32 instead of 4.

saagarjha · on Nov 10, 2018

They are isomorphic, but B-Trees are more cache friendly. B-Trees store more in each node, while red-black trees require pointer chasing for each element.

jules · on Nov 10, 2018

That's what I said.

jstimpfle · on Nov 10, 2018

I believe RB-trees are preferred e.g. in the linux kernel because of pointer stability.

marknadal · on Nov 10, 2018

Radix trees are one of the most under utilized data structures.

They are great and have fantastic performance!

I implemented a custom on disk storage engine with a Radix format, and am getting on a low end MacBook Air 2015 about 3K/acked writes to disk/second! It is now the default at https://github.com/amark/gun the code is pretty short too.

repsilat · on Nov 10, 2018

> This is superior to binary-search: no branches (except for the test when bitfield is 0), and all the comparisons are done in parallel

Branchless binary search isn't hard to implement if you know (or can bound) the number of elements statically. You just use the comparison result arithmetically instead of branching on it, and you unroll the loop.

Obviously a binary search can't do comparisons in parallel, though.

BeeOnRope · on Nov 10, 2018

> Obviously a binary search can't do comparisons in parallel, though.

Sure you can, although at reduced efficiency the "deeper" you go.

For example, while searching a range of size N, you know your next probe point will be at N/2. You also know the probe point after that will be at N/4 or 3N/4, and so on. You know up-front all the possible probe points. Of course, only one out of the two probe points N/4 or 3N/4 or will actually be useful, depending on the N/2 result - but that doesn't stop you from comparing all three in parallel.

You can get a reasonable speedup this way: the extra comparisons happen in parallel and for a moderate depth the unnecessary probes are more than compensated by doing comparisons in parallel.

laxk · on Nov 11, 2018

If somebody need a good/fast/well optimized Go implementation of the ART, check this out: https://github.com/plar/go-adaptive-radix-tree (disclosure: I'm the author)

saagarjha · on Nov 10, 2018

I had to implement a trie for an Aho-Corasick implementation a while back, and I just used a std::unordered_set<unichar, std::unique_ptr<trie_node>> to store the children (this was Objective-C++, so I was using UTF-16 characters taken from an NSString). Worked well enough for the effort I put into it.

rcfox · on Nov 10, 2018

So you ended up using a tree to hold your tree nodes. I guess that was good enough for your purpose, but the article is discussing an optimized implementation.

saagarjha · on Nov 10, 2018

std::unordered_map is general a hash table, is it not?

rcfox · on Nov 10, 2018

Sorry, I switched set and unordered set in my mind. Still, a generic hash table isn't going to match a tailored data structure.

saagarjha · on Nov 10, 2018

I mean, it might. I only had to construct one trie, and I used it hundreds of millions of times, so O(1) child (hash) lookup ended up being faster than the O(log(n)) binary search lookup, at best, from the data structure described in the article. And since the data structure in the article wastes some space as well, I may have even used a similar amount of storage.