Let's say that you're looking for a search data-structure with fast insertion and lookups. Hash-tables immediately come to mind. However, unless you know the (maximum) number of elements up-front (or can compute it cheaply), eventually you'll need to resize your hash-table. Resizing inherently comes with potential memory fragmentation issues and doesn't play nicely with simple linear allocators. And so I'm looking for a data-structure with the following requirements:
- O(log N) lookups and insertion.
- Low memory overhead that plays nicely with simple linear (arena) allocators.
- Ease of implementation.

The last point might seem odd, but I prefer data-structures that are easy to implement, because it usually implies ease of understanding. And if you understand something deeply, then you can easily customize it to (better) fit your specific context.
A tree-like structure is an obvious choice for this. But most self-balancing trees with strong guarantees (such as red-black trees) tend to violate the ease of implementation requirement. Comparatively, a treap is simpler to implement but doesn't have as strong worst-case guarantees.
Here's a simple and basic treap implementation in C as a starting point. We'll work our way through simplifying and improving it step by step.
typedef struct Treap {
	Str key;
	struct Treap *parent;
	struct Treap *child[2];
	uint64_t priority;
} Treap;

static Treap *
treap_traverse(Arena *a, Treap **t, Str key, uint64_t *rng)
{
	Treap *parent = NULL, **insert = t;
	while (*insert != NULL) {
		parent = *insert;
		switch (str_cmp(parent->key, key)) {
		case 0: return parent; break;
		case -1: insert = parent->child + 0; break;
		case 1: insert = parent->child + 1; break;
		default: ASSERT(!"unreachable"); break;
		}
	}

	if (a == NULL) { // lookup
		return NULL;
	}

	Treap *new = arena_alloc(a, sizeof *new);
	// advance the LCG state
	*rng = *rng * 0x1337ACDCBABE5AAD + 7;
	uint64_t new_priority = (*rng >> (*rng >> 59)) ^ *rng; // permute the state, i.e. PCG
	*new = (Treap){ .key = key, .parent = parent, .priority = new_priority };
	*insert = new;

	while (new->parent && new->parent->priority < new_priority) {
		// update grandpa
		Treap *grandpa = new->parent->parent;
		if (grandpa != NULL) {
			int d = grandpa->child[1] == new->parent;
			grandpa->child[d] = new;
		} else { // new is now the root
			*t = new;
		}
		// update parent
		Treap *oldparent = new->parent;
		int d = oldparent->child[1] == new;
		oldparent->parent = new;
		// reparent child
		oldparent->child[d] = new->child[!d];
		if (new->child[!d] != NULL) {
			new->child[!d]->parent = oldparent;
		}
		// update node
		new->child[!d] = oldparent;
		new->parent = grandpa;
	}

	return new;
}
Depending on whether an arena was provided or not, the routine acts either as an insertion routine or as a lookup routine. Random priorities are generated through a simple PCG generator (with ad-hoc constants).
While I'm calling it a "basic" treap, this is already an improvement over the usual implementations, which make the mistake of naming the child nodes left and right instead of using an index-able child array. I call it a mistake because it forces you to handle left vs right tree rotations separately, whereas with an array they can be done via the same piece of code by just inverting the index.
It's possible to leverage C's anonymous struct + union trick to access the child nodes both as named members as well as array indexes. But for my purposes it wasn't worth the effort.
Before trying to improve something, we need to understand how and, more importantly, why it works in the first place. And to a lot of people who treat pseudo random number generators (PRNGs) as a black-box that's easily available in the standard library, the workings of a treap may not be immediately obvious.
The magic here lies in one of the properties of a PRNG: a well distributed output sequence. For example, if an 8-bit PRNG (i.e. output is between [0, 255]) outputs most of its numbers above 60, or outputs only even numbers, then it's not well distributed. A good PRNG will output sequences that are well distributed within the output range.
So if the output of a PRNG is well distributed, and we're rebalancing the tree based on that, then it follows that the tree will also end up fairly well distributed with a really high probability. And that's the key idea behind treaps.
With that out of the way, here are some of the problems I have with the above "basic treap":
If the key idea behind a treap is well distributed priorities, then the PRNG is only a means to that end. The priorities may come from elsewhere too, as long as they're well distributed. What else provides well distributed numbers? A hash function! So instead of passing a PRNG state around, we can simply hash the key itself to get the priority.
This gets rid of having to pass around the PRNG state.
And since the priority is now derived from the key, we can even drop the priority member from our Treap to save some space and recompute it later when needed.
Rehashing isn't that bad if the length of your keys is small (4~12 bytes), but if your keys can be large then this starts to hurt performance quite noticeably.
Maybe instead of the key, we can hash something else that's unique to the key. Or something unique to the node, e.g. the node pointer itself! The node pointer has to be unique since no other object can occupy that memory address, and so a hash of the node pointer is a valid source of randomness. And since a pointer is a fixed-length integer, hashing it is really fast.
@@ -17,11 +18,20 @@ typedef struct Treap {
 	Str key;
 	struct Treap *parent;
 	struct Treap *child[2];
-	uint64_t priority;
 } Treap;
 
+static uintptr_t
+hashptr(void *p)
+{
+	uintptr_t m = 0x5555555555555555;
+	uintptr_t h = (uintptr_t)p;
+	h ^= h >> (sizeof h * 4 - 1);
+	h *= m;
+	h ^= h >> (sizeof h * 4 + 1);
+	return h;
+}
+
 static Treap *
-treap_traverse(Arena *a, Treap **t, Str key, uint64_t *rng)
+treap_traverse(Arena *a, Treap **t, Str key)
 {
@@ -38,13 +48,11 @@ treap_traverse(Arena *a, Treap **t, Str key, uint64_t *rng)
 		return NULL;
 	}
 	Treap *new = arena_alloc(a, sizeof *new);
-	// advance the LCG state
-	*rng = *rng * 0x1337ACDCBABE5AAD + 7;
-	uint64_t new_priority = (*rng >> (*rng >> 59)) ^ *rng; // permute the state, i.e. PCG
-	*new = (Treap){ .key = key, .parent = parent, .priority = new_priority };
+	*new = (Treap){ .key = key, .parent = parent };
 	*insert = new;
-	while (new->parent && new->parent->priority < new_priority) {
+	uint64_t new_priority = hashptr(new);
+	while (new->parent && hashptr(new->parent) < new_priority) {
 		// update grandpa
 		Treap *grandpa = new->parent->parent;
 		if (grandpa != NULL) {
The hashptr() function is just a simple xor-shift, multiply, xor-shift (xmx) permutation with ad-hoc constants. On modern machines, it's really fast while providing some really high quality scrambling.
Aside from the performance benefits, one other upside of a node-pointer hash compared to a key-hash is that a malicious party cannot control your priorities by feeding you pathological keys. But the downside of hashing the pointer is that you lose out on some of the determinism offered by hashing the keys.
One other cool trick you can do by hashing the node pointer is that you can nudge the hash function a bit to ensure that hashptr(NULL) maps to the maximum possible number, which would allow you to drop the new->parent NULL check from the while loop.
uintptr_t h = (uintptr_t)p - (uintptr_t)(void *)0; // ensure NULL maps to 0
// do the xmx permutation which will leave 0 at 0
return h - 1; // 0 - 1 == UINTPTR_MAX
The first line ensures that h is initially set to 0 if the argument was NULL. After that, xor-shifting 0 will just result in 0, and multiplying anything with 0 also results in 0. Lastly, subtracting 1 from 0 in an unsigned type will give you the maximum value of that unsigned type.
In practice, this doesn't really make much of a difference compared to just having the NULL check. But it's a cool party trick that's sure(!) to impress the ladies.
With the priority member removed from the node, we're now at 3 pointers of overhead per element (about 24 bytes usually). Maybe we can remove the parent member too? And indeed we can. Since we're only interested in insertion and lookups, the only time the parent node is used is when doing "sift-up" to correct the heap property. But instead of storing the parent pointer in the node itself, we can simply remember all the parents when descending and pull them out later when correcting the heap. A stack is a natural fit for this.
Let's remove the parent member and then create the "parent-stack" and push the parents onto it (avoiding pushes when we're only doing lookups):
 typedef struct Treap {
 	Str key;
-	struct Treap *parent;
 	struct Treap *child[2];
 } Treap;
@@ -34,9 +35,15 @@ hashptr(void *p)
 static Treap *
 treap_traverse(Arena *a, Treap **t, Str key)
 {
+	Treap *pstack[128];
+	int phead = 0;
 	Treap *parent = NULL, **insert = t;
 	while (*insert != NULL) {
 		parent = *insert;
+		if (a != NULL) {
+			if (phead == ARRAY_LEN(pstack)) abort();
+			pstack[phead++] = parent;
+		}
 		switch (str_cmp(parent->key, key)) {
 		case 0: return parent; break;
 		case -1: insert = parent->child + 0; break;
@@ -48,33 +55,20 @@ treap_traverse(Arena *a, Treap **t, Str key)
 		return NULL;
 	}
 	Treap *new = arena_alloc(a, sizeof *new);
-	*new = (Treap){ .key = key, .parent = parent };
+	*new = (Treap){ .key = key };
 	*insert = new;
The hard-coded 128 stack limit might stick out. But after some empirical testing, the treap depth seems to usually be around log2(N) * 2. So even when inserting 2²³ (~8 million) items, the treap depth usually sticks around at 55~57, never even going past 60. And so 128 is probably too much if you're not planning to insert millions of items (the common case). If you are inserting millions of items, you can probably increase the limit to 192 (log2(2⁶⁴) * 3). A dynamically growing stack is also an option, but if your treap is reaching a depth of 192, then something has likely gone wrong and you should probably bail (in my case, I'm using abort, but you might want to keep your process alive depending on your context).
Now it's time to correct the heap property after an insertion. Remember that the old sift-up loop was around ~22 lines of error-prone code. Here's the new one:
uint64_t new_priority = hashptr(new);
while (phead > 0 && hashptr((parent = pstack[--phead])) < new_priority) {
	if (phead > 0) { // update grandpa
		Treap *gp = pstack[phead - 1];
		gp->child[gp->child[1] == parent] = new;
	} else { // otherwise new is now the root
		*t = new;
	}
	int d = parent->child[1] == new;
	parent->child[d] = new->child[!d]; // update parent
	new->child[!d] = parent; // update child
}
That's it. It's half the size of the previous version while also being significantly easier to implement from scratch. So aside from reducing memory usage, we've also improved the ease of implementation, killing two stones with one bird!
UPDATE: u/dongxy has also pointed out a recursive join based algorithm for treaps that doesn't require tree-rotations.
Time for some benchmarks. The code used for benchmarks can be found here.
"Insert" is the total time to insert all the items. "SLookup" is the total time to successfully lookup all the items. "RLookup" is random (and mostly failed) lookups. "Mem" is the memory usage (excluding the string/key itself). Time is measured in micro-seconds and memory usage is measured in KiB.
The length of all the strings is kept at 64 bytes since it's closer to what you'll typically see in practice (8~32 bytes) while still being somewhat larger. Half the input is generated randomly. The other half is a duplicate of the first half but with one of their bytes flipped. This is to make the input more similar to real-world cases where there's some common prefix and suffix.
"BasicTreap" is the initial treap implementation, while "Treap" is the final
version with node-pointer hash and no parent pointer in node.
As a base-line for comparison, I've also added a hash-table that's pre-allocated
to N * 2
size.
This of course is utterly unfair for insertion since the tree needs to
allocate it's node one at a time while inserting.
But it's still interesting to see.
The lookup numbers also slightly favor the hash-table since the input is
(mostly) randomly generated and thus evenly distributed already.
First let's look at some small (but common) cases of 512 and 2048 items:
Name (items) | Insert | SLookup | RLookup | Mem (KiB) |
---|---|---|---|---|
HashTable (512) | 24 | 29 | 36 | 16 |
BasicTreap (512) | 47 | 56 | 40 | 24 |
Treap (512) | 47 | 56 | 40 | 16 |
HashTable (2048) | 125 | 112 | 130 | 64 |
BasicTreap (2048) | 242 | 256 | 227 | 96 |
Treap (2048) | 245 | 256 | 214 | 64 |
There's barely any difference between the "basic treap" and our simplified one, aside from the memory usage being ~33% lower. That shouldn't be surprising since the simplified treap only has 2 pointers of overhead per element, whereas the basic treap has 3 pointers plus a priority integer worth of overhead.
The hash-table's and treap's memory usage ended up being the same coincidentally: I'm using an N * 2 size for the hash-table, and 2 pointers are 16 bytes on my system, the same as my Str type (which is a sized string consisting of a pointer + length pair). In practice, you'll likely need a value member as well, and so the hash-table at N * 2 size will be larger.
Now let's look at some larger Ns:
Name (items) | Insert | SLookup | RLookup | Mem (KiB) |
---|---|---|---|---|
HashTable (16384) | 1155 | 1063 | 1221 | 512 |
BasicTreap (16384) | 2843 | 2821 | 3248 | 768 |
Treap (16384) | 3042 | 2673 | 3020 | 512 |
HashTable (131072) | 11738 | 16917 | 14517 | 4096 |
BasicTreap (131072) | 38759 | 44715 | 69502 | 6144 |
Treap (131072) | 44449 | 39799 | 61664 | 4096 |
HashTable (1048576) | 152390 | 224713 | 188419 | 32768 |
BasicTreap (1048576) | 677757 | 785431 | 1384892 | 49152 |
Treap (1048576) | 789008 | 688381 | 1198060 | 32768 |
The simpler treap routinely performs better than the basic treap at lookups, but about 10~15% worse at insertion. From some casual testing, using a higher quality hash-function for the pointer hash instead of the ad-hoc constants tends to speed things up a tiny bit.
And surprising perhaps no one, the hash-table blows both variants of the treap out of the water for large Ns. But for smaller Ns, a difference of a couple of micro-seconds is usually not going to be a show stopper.
I'm fairly happy with how much simpler the implementation is, though I'd like it to be even simpler. And the performance, at least under ~8K items, is pretty okay, though I'd like it to be even faster. As it turns out, a data-structure that meets my criteria while being even simpler and even faster does exist.
But this post has gotten quite long already, so I'll save that for my next article: "Hash based trees and tries".
Tags: [ c, dsa, performance ]