Nice article and analysis! I'm actually considering scrapping the trie used in my project to something based off of this one, with some modifications:
For instance, find_node(c, Node48) could avoid the branch if a non-existing index points to an additional entry in child_ptrs that's always NULL. Lookup would be comparable to the Node256 version.
Another thing that could be done, is to scrap the Node48 entirely, and implement two new structs to replace it: Node32 and Node64, and use respectively AVX2 and AVX512. These can be based off of the Node16 version. It remains to be seen if these will yield better performance than the branchless Node48 above, especially if power management kicks in when mixing AVX512 with older SIMD generations.
The trie in Lwan (https://lwan.ws) does an interesting trick to reduce the amount of memory used in the equivalent of a Node256: instead of 256 pointers to a node, it has only 8 pointers. Characters are hashed (MOD 8). The leaf node contains a linked list of key/value pair, and an actual string comparison is performed at the end. (Lwan cheats here by avoiding a string comparison if the linked list contains only 1 element.) Works pretty well, as it's part of the URL routing mechanism.
One other experiment I've been making with tries, is to use the idea of key compression and use it in a different way: slice it every 4 or 8 bytes, consider those bytes as an arbitrary integer, and add every chunk of it to a hashmap<int, some_struct>, building a chain for the next lookup in some_struct. The prototype I wrote works pretty well.
Another standard trick to reduce memory is to store a bitfield telling you which pointers are not null, and then store an array of only the pointers that are non null. For example for a Node64 you store a 64 bits bitfield plus only the pointers that are not null, so if that node has only 3 children you store 64 bits plus 3 pointers instead of 64 pointers. You can index that structure by doing a shift + popcount: if you want to find the pointer at index n you first check the n-th bit in the bitfield to see if the pointer is null, and if it isn't you count the number of set bits in bitfield[0..n] to find the index into the pointer array.
For instance, find_node(c, Node48) could avoid the branch if a non-existing index points to an additional entry in child_ptrs that's always NULL. Lookup would be comparable to the Node256 version.
Another thing that could be done, is to scrap the Node48 entirely, and implement two new structs to replace it: Node32 and Node64, and use respectively AVX2 and AVX512. These can be based off of the Node16 version. It remains to be seen if these will yield better performance than the branchless Node48 above, especially if power management kicks in when mixing AVX512 with older SIMD generations.
The trie in Lwan (https://lwan.ws) does an interesting trick to reduce the amount of memory used in the equivalent of a Node256: instead of 256 pointers to a node, it has only 8 pointers. Characters are hashed (MOD 8). The leaf node contains a linked list of key/value pair, and an actual string comparison is performed at the end. (Lwan cheats here by avoiding a string comparison if the linked list contains only 1 element.) Works pretty well, as it's part of the URL routing mechanism.
One other experiment I've been making with tries, is to use the idea of key compression and use it in a different way: slice it every 4 or 8 bytes, consider those bytes as an arbitrary integer, and add every chunk of it to a hashmap<int, some_struct>, building a chain for the next lookup in some_struct. The prototype I wrote works pretty well.