ChibiHash: Small, Fast 64 bit hash function

16 Nov 2024

Update Nov 30: I've added a version 2, which makes the benchmarks and design overview section in this article outdated. See the github repo for updated benchmarks.


If you need a small and fast 64 bit hash function that can be copy-pasted easily, then here's one that I cooked up in an hour or so: chibihash64.h.

Some key features:

- Small: a single short header you can copy-paste into your project.
- Fast: see the benchmark table below.
- Portable: no hardware-specific instructions; works on any alignment and any endianness.

Here are some benchmarks against other similar hash functions:

Name          Large input (GiB/sec)   Small input (Cycles/Hash)
chibihash64   18.08                   49
xxhash64      12.59                   50
city64        14.95                   35
spooky64      13.83                   59

It's the fastest of the bunch for large-string throughput. For small strings (< 32 bytes), cityhash beats it; worth noting that cityhash has hardcoded special cases for inputs of 32 bytes or fewer.

Design overview

The function starts by declaring some constants and initializing a 32 byte state.

const uint64_t P1 = UINT64_C(0x2B7E151628AED2A5);
const uint64_t P2 = UINT64_C(0x9E3793492EEDC3F7);
const uint64_t P3 = UINT64_C(0x3243F6A8885A308D);

uint64_t h[4] = { P1, P2, P3, seed };

The constants are nothing special: they're just e (Euler's number), pi (π) and the golden ratio written out in hex, with the last digit adjusted to be odd (because multiplying by an even number isn't invertible).
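
To see why an even multiplier would be a problem, here's a quick standalone demo (mine, not part of chibihash64.h): multiplication modulo 2^64 by an even constant is not injective, so two different inputs can produce the same product, whereas an odd constant has a modular inverse and can never collide on its own.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* P1 with its last hex digit made even */
    const uint64_t even = UINT64_C(0x2B7E151628AED2A4);
    uint64_t a = 123;
    uint64_t b = a + (UINT64_C(1) << 63);  /* differs from a only in the top bit */

    /* both products are identical: the even factor pushes that top bit out of the 64-bit result */
    printf("%016llx\n%016llx\n",
           (unsigned long long)(a * even),
           (unsigned long long)(b * even));
    return 0;
}

With that out of the way, here's the main loop: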

for (; l >= 32; l -= 32) {
    for (int i = 0; i < 4; ++i, k += 8) {
        uint64_t lane = chibihash64__load64le(k);
        h[i] ^= lane;
        h[i] *= P1;
        h[(i+1)&3] ^= ((lane << 40) | (lane >> 24));
    }
}

This is the meat of the algorithm. It processes 8 bytes at a time, mixing them in via xor and multiply. However, multiplication is an "upwards" operation where the low bits affect the high bits but the high bits don't get to affect the low ones. This creates some biases, so you'd want to mix the high bits back into the lower ones to reduce the bias. This is commonly achieved via an xor-shift: x ^= x >> 32;. A bitwise rotation bringing the high bits down also works, which is what I opted for.
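
Side by side, the two options look like this (the helper names are mine, not part of chibihash64.h; the rotation matches the (lane << 40) | (lane >> 24) used above):

static inline uint64_t
mix_xorshift(uint64_t x)
{
    return x ^ (x >> 32);          /* fold the high half into the low half */
}

static inline uint64_t
mix_rotl40(uint64_t x)
{
    return (x << 40) | (x >> 24);  /* rotate left by 40, i.e. rotate right by 24 */
}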

Initially I was mixing it into h[i], but I noticed that mixing it into the next index sped things up noticeably. Probably due to breaking up a dependency chain and allowing for better instruction-level parallelism (ILP).

Speaking of ILP, with compiler optimization the inner loop should get unrolled, which opens up more avenues for exploiting ILP. This is a similar principle to how xxhash achieves its high speed.

The way it loads the 8 bytes is also important. The correct way is to load via shift+or:

static inline uint64_t
chibihash64__load64le(const uint8_t *p)
{
    return (uint64_t)p[0] <<  0 | (uint64_t)p[1] <<  8 |
           (uint64_t)p[2] << 16 | (uint64_t)p[3] << 24 |
           (uint64_t)p[4] << 32 | (uint64_t)p[5] << 40 |
           (uint64_t)p[6] << 48 | (uint64_t)p[7] << 56;
}

This is free of any UB, works on any alignment, and works on any machine regardless of its endianness. It's also fast: gcc and clang recognize this pattern and optimize it into a single mov instruction on x86 targets.
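
For contrast, here's a sketch (mine, not from chibihash64.h) of the two approaches this avoids: the pointer cast risks unaligned-access and strict-aliasing UB, and memcpy is well-defined but produces a host-endian value, so it only behaves as a little-endian load on little-endian machines.

#include <stdint.h>
#include <string.h>

static inline uint64_t
load64_cast(const uint8_t *p)
{
    return *(const uint64_t *)p;   /* UB: alignment and strict aliasing */
}

static inline uint64_t
load64_memcpy(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);       /* no UB, but endianness-dependent result */
    return v;
}

Back to the hash. After the main loop: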

h[0] += ((uint64_t)len << 32) | ((uint64_t)len >> 32);
if (l & 1) {
    h[0] ^= k[0];
    --l, ++k;
}
h[0] *= P2; h[0] ^= h[0] >> 31;

Now we mix in the length along with a byte if the remaining length is odd (you'll see why in a minute).

for (int i = 1; l >= 8; l -= 8, k += 8, ++i) {
    h[i] ^= chibihash64__load64le(k);
    h[i] *= P2; h[i] ^= h[i] >> 31;
}

Keep chopping off 8 bytes at a time until fewer than 8 remain.

for (int i = 0; l > 0; l -= 2, k += 2, ++i) {
    h[i] ^= (k[0] | ((uint64_t)k[1] << 8));
    h[i] *= P3; h[i] ^= h[i] >> 31;
}

Now we chop off the last remaining bytes, 2 bytes at a time. Since we already mixed in one byte when the length was odd, the remaining length must now be even, so it's safe to read 2 bytes at a time. This gives a decent bump in small-string performance compared to doing it one byte at a time.
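
For example, if 13 bytes remain after the 32-byte loop: the odd-length branch consumes 1 byte (13 → 12), the 8-byte loop consumes one chunk (12 → 4), and the 2-byte loop finishes in two iterations (4 → 2 → 0), so a one-byte tail case is never needed.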

uint64_t x = seed;
x ^= h[0] * ((h[2] >> 32)|1);
x ^= h[1] * ((h[3] >> 32)|1);
x ^= h[2] * ((h[0] >> 32)|1);
x ^= h[3] * ((h[1] >> 32)|1);

Now we combine all the state into a single 64 bit value. I'm not sure if this is a good way to do it, but it seems to work fine.

x ^= x >> 27; x *= UINT64_C(0x3C79AC492BA7B653);
x ^= x >> 33; x *= UINT64_C(0x1C69B3F74AC4AE35);
x ^= x >> 27;

And finally mix it in well for some good avalanching behavior. The finisher used here is a direct copy of moremur, which should be a good finisher with decent speed.
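
If you want to eyeball the avalanche behavior yourself, here's a quick informal harness (mine, not part of chibihash64.h): run two inputs that differ in a single bit through the finisher and count how many output bits differ; a count close to 32 out of 64 is what good avalanching looks like.

#include <stdint.h>
#include <stdio.h>

static uint64_t
moremur_finish(uint64_t x)
{
    x ^= x >> 27; x *= UINT64_C(0x3C79AC492BA7B653);
    x ^= x >> 33; x *= UINT64_C(0x1C69B3F74AC4AE35);
    x ^= x >> 27;
    return x;
}

int main(void)
{
    uint64_t a = moremur_finish(12345);
    uint64_t b = moremur_finish(12345 ^ 1);  /* same input with the lowest bit flipped */

    /* __builtin_popcountll is a gcc/clang builtin */
    printf("%d of 64 output bits differ\n", __builtin_popcountll(a ^ b));
    return 0;
}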

Mathematical foundation

There are none. Everything here is "empirically derived" (I kept making changes until the tests passed).

When NOT to use

The introduction should make it clear why you'd want to use this. Here are some reasons to avoid using it:

Conclusion

I started writing this because all the 64-bit hash functions I came across were either too slow (FNV-1a, one-byte-at-a-time processing), too large (spanning hundreds of lines of code), or non-portable due to using hardware-specific instructions. Being small and portable, ChibiHash aims to be a good "default" for non-cryptographic 64-bit hashing needs.
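
As a rough sketch of what usage looks like (assuming the header exposes something along the lines of uint64_t chibihash64(const void *key, ptrdiff_t len, uint64_t seed); check chibihash64.h for the exact signature):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include "chibihash64.h"

int main(void)
{
    const char *msg = "hello world";
    uint64_t h = chibihash64(msg, strlen(msg), 42);  /* 42 is an arbitrary seed */
    printf("%016llx\n", (unsigned long long)h);
    return 0;
}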

Tags: [ c, hashing ]


