In my last post I set out some peculiar requirements for a search data-structure and investigated how well a treap would fit those requirements. To re-cap, here were the requirements:

- Fast:
`O(log N)`

or better insertions and lookups. - No dynamic resizes: plays well with linear allocators.
- Simple: should be implementable in 12~20 lines of non-finicky C.

I alluded that there are data-structures that fulfill my needs while being even simpler and even faster. In this post I'll describe two of these:

- Hash-table of Binary Hash Trees
- Hash Tries

One of the key observations on the last post was the treap works due to a well
distributed heap `priority`

.
Another observations was that instead of a PRNG we can use a hash function since
"good hash functions" are supposed to have well distributed hashes.

Now in our case, we don't *really* need the tree to be sorted by the key since
we only care about insertions and lookups.
Which leads to the following observation: there's no need to maintain a heap
property.
Instead we can just use the hash of the key to order the BST directly.

Because the hashes are supposed to be fairly well distributed already, a BST built from sorting on the hash will thus end up fairly balanced too.

```
typedef struct BHTNode {
Str key;
u64 hash;
struct BHTNode *child[2];
} BHTNode;
static BHTNode *
bht_find(Arena *a, BHTNode **t, Str key)
{
u64 hash = str_hash(key);
BHTNode **insert = t;
while (*insert != NULL) {
u64 phash = insert[0]->hash;
switch ((hash > phash) - (hash < phash)) {
case 0: if (str_eq(key, insert[0]->key)) return insert[0]; // fallthrough
case -1: insert = insert[0]->child + 0; break;
case +1: insert = insert[0]->child + 1; break;
}
}
if (a != NULL) {
*insert = arena_alloc(a, sizeof **insert);
**insert = (BHTNode){ .key = key, .hash = hash };
}
return insert[0];
}
```

And this is it, the entirety of the code required to implement a "binary hash tree".

How does it perform compared to the treap from last post? Using the same benchmarking format as last time, here's some results:

Items-Name | Insert | SLookup | RLookup | Mem (KiB) |
---|---|---|---|---|

512-Treap | 53 | 63 | 39 | 16 |

512-BHTree | 34 | 51 | 58 | 20 |

2048-Treap | 250 | 288 | 226 | 64 |

2048-BHTree | 220 | 230 | 230 | 80 |

16384-Treap | 3008 | 3134 | 3168 | 512 |

16384-BHTree | 2267 | 2303 | 2543 | 640 |

131072-Treap | 40804 | 45579 | 66662 | 4096 |

131072-BHTree | 25849 | 30606 | 29721 | 5120 |

1048576-Treap | 650847 | 735479 | 1275253 | 32768 |

1048576-BHTree | 362686 | 418937 | 521011 | 40960 |

In pretty much all cases, the binary hash tree outperforms the treap.
The memory usage is a bit higher due to having to keep the `hash`

member in each
node since recomputing the hash can be expensive.

To get a bit of intuition into how this works, here's a picture of a binary hash tree with 64 elements.

You can see that the tree is far from being perfectly balanced. But it's not too far off either.

Here's another image, this time with 256 elements.

The tree isn't too off balance, although there's a bit of "cluster" forming at the middle. With hash-tables, I've noticed that a "decent" but fast hash function often makes the hash-table faster as a whole compared to a slower but higher quality hash function. But in case of a binary hash tree, getting good distribution out of your hash function seems to be much more important.

But assuming you do have a "good hash function", are you really guaranteed not to degrade into a bad O(n) search? It's not a strong guarantee, but the probability of degrading into massively unbalanced O(n) tree is fairly low.

One intuitive way I like to think about this is that if you have a largely
unbalanced tree, that's another way of saying that there were lots of more ideal
nodes in the "upper levels" of the tree which weren't taken.
Which is another way of saying that those nodes are *available* for future
insertions.
And so if you are to insert a random number into the tree, there's a higher
chance that it will take a more ideal branch on the upper level instead of
descending down to the worst case.

And conversely, if your tree is already fairly balanced, then there's less spots available on the upper levels increasing the chances of a random insert descending down to a bad branch.

The above is an animation of inserting 128 elements into a binary hash tree. You can see how it goes through "cycles" where it becomes more unbalanced leaving lots of holes and then those holes slowly gets filled up by later insertions.

Going along the theme of randomization and even distribution, how about we use
*multiple* trees instead of just a single one.
And we'll just pick a tree depending on the hash we've already computed.
In other words, a "hash-table of hash-trees" (or HTHTree in short).
Supporting it is trivial, simply change the `insert`

initialization to the
following:

```
BHTNode **insert = t + (keyhash & (NSLOTS - 1));
```

And change `t`

from a single tree pointer to an array of pointers.
`NSLOTS`

is the number of slots we have in our array, which must be a power of 2.
16 seems to be a good choice, anything above 32 starts providing diminishing
returns.

Also notice that I'm using the low bits of the hash to index into the hash-table. This is because the high bits will affect the binary tree comparison the most and the low bits are mostly going to be ignored. And so I'm making use of the low bits rather than the high.

Items-Name | Insert | SLookup | RLookup | Mem (KiB) |
---|---|---|---|---|

64-BHTree | 6 | 9 | 6 | 2 |

64-HTHTree | 3 | 5 | 4 | 2 |

128-BHTree | 10 | 18 | 12 | 5 |

128-HTHTree | 6 | 11 | 9 | 5 |

256-BHTree | 17 | 25 | 26 | 10 |

256-HTHTree | 14 | 23 | 21 | 10 |

512-BHTree | 36 | 52 | 47 | 20 |

512-HTHTree | 29 | 47 | 40 | 20 |

1024-BHTree | 76 | 111 | 100 | 40 |

1024-HTHTree | 64 | 100 | 82 | 40 |

This provides some decent speed up at small Ns (common case).
At large Ns, the "acceleration" provided by the hash-table becomes negligible
because the `O(log N)`

tree dominates.

Given how easy it is to support a "root level hash-table" and the fact that it's strictly a win rather than being a trade-off, it probably doesn't make much sense not to do it.

The idea of making a BST out of hash seemed like a simple but effective idea.
And so I started to search for any existing data-structure that does this but I
couldn't quite find any.
This *does* make sense because:

- If you're using a BST, that's likely because you
*require*the keys to be sorted and sorting based on hash obviously destroys that. - If you don't care about the keys being sorted then you likely don't want the
tree to be
*binary*. Instead you'd want to reduce pointers hops as much as possible.

Hash array mapped trie (or HAMT) is a data-structure that takes the 2nd approach. Instead of a sorted tree, it's a trie of hashes. Describing HAMT in details is a bit out of scope for this article, but here's a short description (of what I gathered out of the wikipedia article and this talk by Phil Nash):

- Each node is an array of some power of 2. You take some bits out of the hash to index into that array.

For example, if you picked 4 as the size of the array, then you'd take 2 bits
out of your hash each time you traverse a node as your index.
So if you're looking up or inserting "alpha" which has a 4-bit hash of `0b0111`

then you'll take the highest 2 bits (`01`

) as your index.

- In your indexed position, the slot may be empty, in which case you'd insert your key there (or signal a failure in case of lookup).
- Or the slot may be a key, in which case you'd compare against it.
- Or the slot may be a pointer to another array. In which case, you'd take the
*next*two bits from your hash (`11`

or 3 in our example) to index into that "next array". - Or the slot can be a linked list. Which means you've ran out of hash bits (i.e
there was a hash collision)
^{}.

Following the above idea, I went ahead and implemented a HAMT with some fairly finicky code.

Items-Name | Insert | SLookup | RLookup | Mem (KiB) |
---|---|---|---|---|

2048-HTable | 132 | 155 | 134 | 64 |

2048-HTHTree | 198 | 217 | 217 | 80 |

2048-HAMT | 142 | 177 | 138 | 185 |

16384-HTable | 1183 | 1433 | 1241 | 512 |

16384-HTHTree | 2180 | 2284 | 2497 | 640 |

16384-HAMT | 1330 | 1654 | 1318 | 1491 |

131072-HTable | 11864 | 19586 | 14878 | 4096 |

131072-HTHTree | 25756 | 31023 | 29675 | 5120 |

131072-HAMT | 14716 | 24269 | 18400 | 11730 |

The above was using an array size of 8. The performance here is impressive, it's competing head to head with a pre-allocated hash-table. Lookup performance is really good too.

But the problem is the insanely high memory usage.
The original HAMT uses 32 sized array - but it uses a *sparse array* with a
bitmap trick to save space.
But that requires resizes, which I have explicitly forbidden.
Reducing the array size to 4 makes the memory usage a bit less (at the cost of
some performance) but *even then* the memory usage was crazy high compared to
the other options.

Skimming through the original paper it doesn't mention linked-lists. Instead it seems to recommend rehashing the key if a collision occurs.

The performance of HAMT was really good, but the large and finicky implementation combined with the huge memory overhead made me not investigate the ideas behind it any further.

But in a weird turn of events Chris Wellons ended up misunderstanding what I
had meant by "hash table of hash trees" and thought that the "hash table" was
supposed to be embedded into the *leaf* rather that at *root level* as I had
intended.
And thus he ended up recreating a hash-trie (an idea which I had prematurely
abandoned) except it's vastly simpler than my over-engineered and finicky HAMT
implementation.
Here's the entirety of the code:

```
typedef struct HTrie {
Str key;
struct HTrie *slots[4];
} HTrie;
static HTrie *
htrie_traverse(Arena *a, HTrie **t, Str key)
{
for (u64 keyhash = str_hash(key); *t != NULL; keyhash <<= 2) {
if (str_eq(key, t[0]->key)) {
return t[0];
}
t = t[0]->slots + (keyhash >> 62);
}
if (a != NULL) {
*t = arena_alloc(a, sizeof **t);
**t = (HTrie){ .key = key };
}
return *t;
}
```

The hash of the key is used to traverse the trie and after each iteration the used 2 bits of the hash are shifted out. What happens when the hash runs out? It implicitly degrades into a linked list that always takes the 0th slot.

Keep in mind that unlike a chained hash-table, a full hash collisions doesn't necessarily trigger the linear linked list probe. In order for it to degrade, you'd need to exhaust all the nodes in the path of that hash first. That's 32 nodes for a 64-bit hash. In my synthetic tests with 64-bit hashes my system would run out of memory before I could trigger the linked-list condition.

How does it perform?

Items-Name | Insert | SLookup | RLookup | Mem (KiB) |
---|---|---|---|---|

2048-HTable | 132 | 156 | 139 | 64 |

2048-HTHTree | 205 | 214 | 217 | 80 |

2048-HTrie | 138 | 168 | 141 | 96 |

16384-HTable | 1163 | 1410 | 1229 | 512 |

16384-HTHTree | 2152 | 2225 | 2428 | 640 |

16384-HTrie | 1373 | 1615 | 1416 | 768 |

131072-HTable | 11816 | 19393 | 14526 | 4096 |

131072-HTHTree | 24673 | 30075 | 28696 | 5120 |

131072-HTrie | 16263 | 24299 | 22545 | 6144 |

Quite a lot better than HTHTree! But without consuming crazy amount of memory compared with HAMT. And the implementation is also really simple and straight forward, unlike my HAMT implementation. This makes Hash-Tries my new favorite search data-structure when array resizes are to be avoided.

Here's a 4-way Hash-Trie with 32 elements:

And here's one with 256 elements (click on the image to see full size):

Notice how wide (and short in height) the trie is compared to the binary hash tree? The significantly better performance shouldn't be a surprise.

I've tried tuning the number of leafs but anything above 4 results in negligible amount of performance win while consuming significantly more memory. And so 4 is going to be a good default.

Here are a couple more ideas and "extensions" which might be worth exploring.

On 64-bit systems, pointers are 64 bits (8 bytes), which is quite a lot. Instead you could use 32 bits (4 bytes) self relative pointers for the Trie node pointers. This should work well with linear allocators.

I haven't benchmarked this, but the following benchmark done by Chris Wellons shows compressed version not only saving a huge amount of memory, but actually outperforming uncompressed variant at larger Ns (perhaps due to better cache utilization?). However, keep in mind that Chris' benchmark is done with integer keys. Things might be a lot different with string keys.

Due to alignment requirements, the lowest couple bits of the node pointer is going to be 0-ed. Those bits can be used to smuggle extra bits of information. For example smuggling some of the hash bits for the hash-trie in order to avoid the string comparison in case of a hash mismatch may yield better performance.

Assuming your platform supports atomic operations on pointer sized objects, a fairly safe assumption, then it's pretty easy to turn a hash-trie into a lock-free concurrent hashmap using only an atomic load for loading the node and atomic compare-and-exchange for insertion.

How would you deal with potentially malicious input? The exact same way hash-tables deal with it, by using a secure hash-function (e.g siphash) with a secure seed.

Instead of shifting the hash, we could detect hash collisions and go for a rehash. In theory, this should improve cases where you're inserting large amount of items and hash collisions are likely. In my benchmarks with 64-bit hashes however, it made no difference.

Instead of shifting the hash, you could use the hash as the initial seed for a PRNG and use the PRNG's output sequence to traverse the trie. In my synthetic benchmarks with string keys, this didn't really make any difference.

Also beware not to do this when you have a hash function which doesn't produce collisions (e.g an integer hash). Since if you have collision-free hash, then the raw hash will provide a small worst case guarantee but a PRNG will not.

Here's a nice and pretty bar chart summarizing the benchmark data (also here's the benchmark code):

Here are my takeaways for the time being:

- If a reasonable upper-bound is known upfront, use "MSI" hash-table.
- Otherwise a 4-way Hash-Trie is a good default choice.
- If aiming for space-efficiency, use pointer compression on the Hash-Trie.
- HTHTree and Treap is usually not worth it over a Hash-Trie. But they're still worth keeping in mind.