Implementing and simplifying Treap in C

13 Sep 2023

Let's say that you're looking for a search data-structure with fast insertion and lookups. Hash-tables immediately come to mind. However, unless you know the (maximum) number of elements up-front (or can compute it cheaply), eventually you'll need to resize your hash-table. Resizing inherently comes with potential memory fragmentation issues and doesn't play nicely with simple linear allocators. And so I'm looking for a data-structure with the following requirements:

- Fast insertion and lookup.
- No need to know the number of elements up-front, with allocation that plays nicely with simple linear (arena) allocators.
- Easy to implement.

The last point might seem odd, but I prefer data-structures that are easy to implement, because it usually implies ease of understanding. And if you understand something deeply, then you can easily customize it to (better) fit your specific context.

Basic Treap

A tree-like structure is an obvious choice for this. But most self-balancing trees with strong guarantees (such as red-black trees) tend to violate the ease-of-implementation requirement. Comparatively, a treap is simpler to implement but doesn't have as strong worst-case guarantees.

Here's a simple and basic treap implementation in C as a starting point. We'll work our way through simplifying and improving it step by step.

typedef struct Treap {
    Str key;
    struct Treap *parent;
    struct Treap *child[2];
    uint64_t priority;
} Treap;

static Treap *
treap_traverse(Arena *a, Treap **t, Str key, uint64_t *rng)
{
    Treap *parent = NULL, **insert = t;
    while (*insert != NULL) {
        parent = *insert;
        switch (str_cmp(parent->key, key)) {
        case  0: return parent; break;
        case -1: insert = parent->child + 0; break;
        case  1: insert = parent->child + 1; break;
        default: ASSERT(!"unreachable"); break;
        }
    }
    if (a == NULL) { // lookup
        return NULL;
    }
    Treap *new = arena_alloc(a, sizeof *new);
    // advance the LCG state
    *rng = *rng * 0x1337ACDCBABE5AAD + 7;
    uint64_t new_priority = (*rng >> (*rng >> 59)) ^ *rng; // permute the state, i.e PCG
    *new = (Treap){ .key = key, .parent = parent, .priority = new_priority };
    *insert = new;

    while (new->parent && new->parent->priority < new_priority) {
        // update grandpa
        Treap *grandpa = new->parent->parent;
        if (grandpa != NULL) {
            int d = grandpa->child[1] == new->parent;
            grandpa->child[d] = new;
        } else { // new is now the root
            *t = new;
        }

        // update parent
        Treap *oldparent = new->parent;
        int d = oldparent->child[1] == new;
        oldparent->parent = new;
        // reparent child
        oldparent->child[d] = new->child[!d];
        if (new->child[!d] != NULL) {
            new->child[!d]->parent = oldparent;
        }

        // update node
        new->child[!d] = oldparent;
        new->parent = grandpa;
    }

    return new;
}

Depending on whether an arena was provided or not, the routine acts either as an insertion routine or as a lookup routine. Random priorities are generated through a simple PCG generator (with ad-hoc constants).
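To make the dual behaviour concrete, here's a minimal usage sketch. The str_from_cstr() helper and the already-initialized arena are assumptions for illustration; they're not part of the article's code:

void
example(Arena *arena)
{
    uint64_t rng = 0x123456789ABCDEF0; // seed for the priority generator
    Treap *root = NULL;

    // Insert: passing an arena lets the routine allocate a new node when the key
    // is missing (the existing node is returned if the key is already present).
    Treap *node = treap_traverse(arena, &root, str_from_cstr("hello"), &rng);

    // Lookup: passing NULL for the arena makes it return NULL when the key is absent.
    Treap *found = treap_traverse(NULL, &root, str_from_cstr("hello"), &rng);

    (void)node; (void)found;
}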

While I'm calling it a "basic" treap, this is already an improvement over typical implementations, which usually make the mistake of naming the child nodes left and right instead of using an indexable child array. I call it a mistake because it forces you to handle left vs right tree rotations separately, whereas with an array both can be done via the same piece of code by just inverting the index.


It's possible to leverage C's anonymous struct + union trick to access the child nodes both as named members as well as array indexes. But for my purposes it wasn't worth the effort.
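For illustration, the trick would look roughly like this (a sketch; anonymous structs and unions are standard only since C11):

// Sketch of the anonymous struct + union trick: the same two pointers are
// reachable both by name (left/right) and by index (child[0]/child[1]).
typedef struct Treap {
    Str key;
    struct Treap *parent;
    union {
        struct { struct Treap *left, *right; };
        struct Treap *child[2];
    };
    uint64_t priority;
} Treap;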


Understanding treaps

Before trying to improve something, we need to understand how and, more importantly, why it works in the first place. And to a lot of people who treat pseudo-random number generators (PRNGs) as a black box that's readily available in the standard library, the inner workings of a treap may not be immediately obvious.

The magic here lies in one of the properties of a PRNG: a well distributed output sequence. For example, if an 8-bit PRNG (i.e. output in the range [0, 255]) outputs most of its numbers above 60, or outputs only even numbers, then it's not well distributed. A good PRNG will output sequences that are well distributed within the output range.

So if the output of a PRNG is well distributed, and we're rebalancing the tree based on that output, then it follows that the tree will also end up fairly well balanced with really high probability. And that's the key idea behind treaps.

With that out of the way, here are some of the problems I have with the above "basic treap":

- A PRNG state has to be seeded and threaded through by the caller.
- Each node carries an extra 8-byte priority on top of 3 pointers of overhead.
- The parent pointer, and the rotation code that comes with it, is long and error-prone.

Hash based priority

If the key idea behind a treap is well distributed priorities, then the PRNG is only a means to that end. The priorities may come from elsewhere too, as long as they're well distributed. What else provides well distributed numbers? A hash function! So instead of passing a PRNG state around, we can simply hash the key itself to get the priority.

This gets rid of having to pass around the PRNG state. And since the priority is now derived from the key, we can even drop the priority member from our Treap to save some space and recompute it when needed. Rehashing isn't that bad if your keys are small (4~12 bytes), but if your keys can be large then this starts to hurt performance quite noticeably.
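For reference, deriving the priority from the key could look something like this 64-bit FNV-1a sketch (the .data/.len member names of Str are assumptions, since the article doesn't show its Str definition):

// Sketch: priority derived by hashing the key bytes with 64-bit FNV-1a.
// The .data/.len members of Str are assumed; adjust to your Str definition.
static uint64_t
hashkey(Str key)
{
    uint64_t h = 0xcbf29ce484222325u; // FNV offset basis
    for (ptrdiff_t i = 0; i < key.len; i++) {
        h ^= (uint8_t)key.data[i];
        h *= 0x00000100000001b3u; // FNV prime
    }
    return h;
}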

Maybe instead of the key, we can hash something else that's unique to the key. Or something unique to the node, e.g. the node pointer itself! The node pointer has to be unique since no other object can occupy that memory address, and so a hash of the node pointer is a valid source of randomness. And since a pointer is a fixed-length integer, hashing it is really fast.

@@ -17,11 +18,20 @@ typedef struct Treap {
     Str key;
     struct Treap *parent;
     struct Treap *child[2];
-    uint64_t priority;
 } Treap;
 
+static uintptr_t
+hashptr(void *p)
+{
+    uintptr_t m = 0x5555555555555555;
+    uintptr_t h = (uintptr_t)p;
+    h ^= h >> (sizeof h * 4 - 1);
+    h *= m;
+    h ^= h >> (sizeof h * 4 + 1);
+    return h;
+}
+
 static Treap *
-treap_traverse(Arena *a, Treap **t, Str key, uint64_t *rng)
+treap_traverse(Arena *a, Treap **t, Str key)
 {
@@ -38,13 +48,11 @@ treap_traverse(Arena *a, Treap **t, Str key, uint64_t *rng)
         return NULL;
     }
     Treap *new = arena_alloc(a, sizeof *new);
-    // advance the LCG state
-    *rng = *rng * 0x1337ACDCBABE5AAD + 7;
-    uint64_t new_priority = (*rng >> (*rng >> 59)) ^ *rng; // permute the state, i.e PCG
-    *new = (Treap){ .key = key, .parent = parent, .priority = new_priority };
+    *new = (Treap){ .key = key, .parent = parent };
     *insert = new;
 
-    while (new->parent && new->parent->priority < new_priority) {
+    uint64_t new_priority = hashptr(new);
+    while (new->parent && hashptr(new->parent) < new_priority) {
         // update grandpa
         Treap *grandpa = new->parent->parent;
         if (grandpa != NULL) {

The hashptr() function is just a simple xor-shift, multiply, xor-shift (xmx) permutation with ad-hoc constants. On modern machines it's really fast while providing really high quality scrambling.

Aside from the performance benefits, one other upside of using a node-pointer hash compared to a key hash is that a malicious party cannot control your priorities by feeding you pathological keys. The downside of hashing the pointer is that you lose some of the determinism offered by hashing the keys.


One other cool trick you can do when hashing the node pointer is to nudge the hash function a bit to ensure that hashptr(NULL) maps to the maximum possible number, which would allow you to drop the new->parent NULL check from the while loop.

    uintptr_t h = (uintptr_t)p - (uintptr_t)(void *)0; // ensure NULL maps to 0
    // do the xmx permutation which will leave 0 at 0
    return h - 1; // 0 - 1 == UINTPTR_MAX

The first line ensures that h starts out as 0 if the argument was NULL. After that, xor-shifting 0 just results in 0, and multiplying 0 by anything also results in 0. Lastly, subtracting 1 from 0 in an unsigned type gives you the maximum value of that unsigned type.
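Put together, the nudged variant would look roughly like this (a sketch; the hashptr_nullmax name is just for illustration, and the constants are the same ad-hoc ones as before):

// hashptr() variant where NULL maps to UINTPTR_MAX, so the NULL check in the
// sift-up loop can be dropped.
static uintptr_t
hashptr_nullmax(void *p)
{
    uintptr_t m = 0x5555555555555555;
    uintptr_t h = (uintptr_t)p - (uintptr_t)(void *)0; // NULL -> 0
    h ^= h >> (sizeof h * 4 - 1); // 0 stays 0 through the xmx permutation
    h *= m;
    h ^= h >> (sizeof h * 4 + 1);
    return h - 1; // 0 wraps around to UINTPTR_MAX, everything else shifts by one
}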

In practice, this doesn't really make much of a difference compared to just having the NULL check. But it's a cool party trick that's sure(!) to impress the ladies.

Removing the parent

With the priority member removed from the node, we're now at 3 pointers of overhead per element (usually about 24 bytes). Maybe we can remove the parent pointer too? And indeed we can. Since we're only interested in insertions and lookups, the only time the parent pointer is used is when doing the "sift-up" to restore the heap property. But instead of storing the parent pointer in the node itself, we can simply remember all the parents while descending and pull them back out later when correcting the heap. A stack is a natural fit for this.

Let's remove the parent member and then create the "parent-stack" and push the parents onto it (avoiding pushes when we're only doing lookups):

 typedef struct Treap {
     Str key;
-    struct Treap *parent;
     struct Treap *child[2];
 } Treap;
 
@@ -34,9 +35,15 @@ hashptr(void *p)
 static Treap *
 treap_traverse(Arena *a, Treap **t, Str key)
 {
+    Treap *pstack[128];
+    int phead = 0;
     Treap *parent = NULL, **insert = t;
     while (*insert != NULL) {
         parent = *insert;
+        if (a != NULL) {
+            if (phead == ARRAY_LEN(pstack)) abort();
+            pstack[phead++] = parent;
+        }
         switch (str_cmp(parent->key, key)) {
         case  0: return parent; break;
         case -1: insert = parent->child + 0; break;
@@ -48,33 +55,20 @@ treap_traverse(Arena *a, Treap **t, Str key)
         return NULL;
     }
     Treap *new = arena_alloc(a, sizeof *new);
-    *new = (Treap){ .key = key, .parent = parent };
+    *new = (Treap){ .key = key };
     *insert = new;

The hard-coded 128 stack limit might stick out. But after some empirical testing, the treap depth seems to usually hover around log2(N) * 2. Even when inserting 2²³ (~8 million) items, the depth usually sticks around 55~57, never even going past 60. So 128 is probably already overkill if you're not planning to insert millions of items (the common case). If you are inserting many millions of items, you can bump the limit to 192 (log2(2⁶⁴) * 3). A dynamically growing stack is also an option, but if your treap is reaching a depth of 192, then something has likely gone wrong and you should probably bail (in my case I'm using abort, but you might want to keep your process alive depending on your context).
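As an aside, ARRAY_LEN isn't shown in the snippets; it's assumed to be the usual element-count macro, something like:

    #define ARRAY_LEN(a) ((int)(sizeof(a) / sizeof((a)[0]))) // element count of a fixed-size array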

Now it's time to correct the heap property after an insertion. Remember that the old sift-up loop was ~22 lines of error-prone code. Here's the new one:

    uint64_t new_priority = hashptr(new);
    while (phead > 0 && hashptr((parent = pstack[--phead])) < new_priority) {
        if (phead > 0) { // update grandpa
            Treap *gp = pstack[phead - 1];
            gp->child[gp->child[1] == parent] = new;
        } else { // otherwise new is now the root
            *t = new;
        }
        int d = parent->child[1] == new;
        parent->child[d] = new->child[!d]; // update parent
        new->child[!d] = parent; // update child
    }

That's it. It's half the size of the previous one while also being significantly easier to implement from scratch. So aside from reducing memory usage, we've also improved the ease of implementation, killing two stones with one bird!
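For reference, here's roughly what the whole routine looks like once all the diffs above are applied (a sketch assembled from the snippets; it assumes the same Arena, Str, str_cmp, arena_alloc, ASSERT and ARRAY_LEN helpers as before):

typedef struct Treap {
    Str key;
    struct Treap *child[2];
} Treap;

static uintptr_t
hashptr(void *p)
{
    uintptr_t m = 0x5555555555555555;
    uintptr_t h = (uintptr_t)p;
    h ^= h >> (sizeof h * 4 - 1);
    h *= m;
    h ^= h >> (sizeof h * 4 + 1);
    return h;
}

static Treap *
treap_traverse(Arena *a, Treap **t, Str key)
{
    Treap *pstack[128];
    int phead = 0;
    Treap *parent = NULL, **insert = t;
    while (*insert != NULL) {
        parent = *insert;
        if (a != NULL) { // only track parents when inserting
            if (phead == ARRAY_LEN(pstack)) abort();
            pstack[phead++] = parent;
        }
        switch (str_cmp(parent->key, key)) {
        case  0: return parent; break;
        case -1: insert = parent->child + 0; break;
        case  1: insert = parent->child + 1; break;
        default: ASSERT(!"unreachable"); break;
        }
    }
    if (a == NULL) { // lookup
        return NULL;
    }
    Treap *new = arena_alloc(a, sizeof *new);
    *new = (Treap){ .key = key };
    *insert = new;

    // sift-up to restore the heap property
    uint64_t new_priority = hashptr(new);
    while (phead > 0 && hashptr((parent = pstack[--phead])) < new_priority) {
        if (phead > 0) { // update grandpa
            Treap *gp = pstack[phead - 1];
            gp->child[gp->child[1] == parent] = new;
        } else { // otherwise new is now the root
            *t = new;
        }
        int d = parent->child[1] == new;
        parent->child[d] = new->child[!d]; // update parent
        new->child[!d] = parent; // update child
    }

    return new;
}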

UPDATE: u/dongxy has also pointed out a recursive, join-based algorithm for treaps that doesn't require tree rotations.

Benchmarks

Time for some benchmarks. The code used for benchmarks can be found here.

"Insert" is the total time to insert all the items. "SLookup" is the total time to successfully lookup all the items. "RLookup" is random (and mostly failed) lookups. "Mem" is the memory usage (excluding the string/key itself). Time is measured in micro-seconds and memory usage is measured in KiB.

The length of all the strings is kept at 64 bytes since it's close to what you'll typically see in practice (8~32 bytes) while still being somewhat larger. Half the input is generated randomly. The other half is a duplicate of the first half but with one byte flipped. This makes the input more similar to real-world cases where there's some common prefix and suffix.

"BasicTreap" is the initial treap implementation, while "Treap" is the final version with node-pointer hash and no parent pointer in node. As a base-line for comparison, I've also added a hash-table that's pre-allocated to N * 2 size. This of course is utterly unfair for insertion since the tree needs to allocate it's node one at a time while inserting. But it's still interesting to see. The lookup numbers also slightly favor the hash-table since the input is (mostly) randomly generated and thus evenly distributed already.

First let's look at some small (but common) cases of 512 and 2048 items:

Name (items)        Insert  SLookup  RLookup  Mem (KiB)
HashTable (512)         24       29       36         16
BasicTreap (512)        47       56       40         24
Treap (512)             47       56       40         16
HashTable (2048)       125      112      130         64
BasicTreap (2048)      242      256      227         96
Treap (2048)           245      256      214         64

There's barely any difference between the "basic treap" and our simplified one, aside from the memory usage being ~33% lower. That shouldn't be surprising since the simplified treap only has 2 pointers of overhead per element whereas the basic treap has 3 pointers plus a priority integer worth of overhead.

The hash-table's and treap's memory usage coincidentally ended up being the same, since I'm using N * 2 slots for the hash-table and 2 pointers are 16 bytes on my system, the same as my Str type (which is a sized string consisting of a pointer + length pair). In practice, you'll likely need a value member as well, and so the hash-table at N * 2 slots will be larger.

Now let's look at some larger Ns:

Name (items)           Insert  SLookup  RLookup  Mem (KiB)
HashTable (16384)        1155     1063     1221        512
BasicTreap (16384)       2843     2821     3248        768
Treap (16384)            3042     2673     3020        512
HashTable (131072)      11738    16917    14517       4096
BasicTreap (131072)     38759    44715    69502       6144
Treap (131072)          44449    39799    61664       4096
HashTable (1048576)    152390   224713   188419      32768
BasicTreap (1048576)   677757   785431  1384892      49152
Treap (1048576)        789008   688381  1198060      32768

The simpler treap routinely performs better than the basic treap at lookups, but about 10~15% worse at insertion. From some casual testing, using a higher quality hash function for the pointer hash instead of the ad-hoc constants tends to speed things up a tiny bit.
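For instance, one such higher quality option is splitmix64's well-known finalizer; a sketch (the hashptr64 name is just for illustration):

// Sketch: a higher quality pointer hash using splitmix64's finalizer constants.
static uint64_t
hashptr64(void *p)
{
    uint64_t h = (uint64_t)(uintptr_t)p;
    h ^= h >> 30; h *= 0xbf58476d1ce4e5b9u;
    h ^= h >> 27; h *= 0x94d049bb133111ebu;
    h ^= h >> 31;
    return h;
}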

And surprising perhaps no one, the hash-table blows both variants of the treap out of the water for large Ns. But for smaller Ns, a difference of a couple of microseconds is usually not going to be a show stopper.

Conclusions

I'm fairly happy with how much simpler the implementation is, though I'd like it to be even simpler. And the performance, at least under ~8K items, is pretty okay, though I'd like it to be even faster. As it turns out, a data-structure that meets my criteria while being even simpler and even faster does exist.

But this post has gotten quite long already, so I'll save that for my next article: "Hash based trees and tries".
