As a side-quest I recently decided to write a branchless utf8 decoder utilizing
the pext
or "parallel extract" instruction.
It's compliant with rfc-3629, meaning that it doesn't just naively decode
the code-point but also checks for overlong encoding, surrogate pairs and such.
Compiled with gcc -O3 -march=x86-64-v3
the entire decoder results in just
about 29 instructions.
That's pretty sweet, if you ask me.
The full source code can be found here: utf8-pext.c.
A small refresher on how utf8 works:
length | bit pattern |
---|---|
1 byte | 0xxxxxxx |
2 byte | 110xxxxx 10xxxxxx |
3 byte | 1110xxxx 10xxxxxx 10xxxxxx |
4 byte | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
For example, a 2 byte sequence has 110.. as the first byte, and every continuation byte starts with the 10.. marker.
The bits marked with x
are the important bits that we need to extract out.
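As a concrete illustration (my own example, not from the table above): the codepoint U+00E9 (é) doesn't fit into 7 bits, so it gets the 2 byte pattern.
U+00E9 = 00011 101001 (11 payload bits, split 5 + 6)
11000011 10101001 => 0xC3 0xA9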
We also need to validate the input to make sure the continuation markers exist,
check against overlong encoding and surrogate pairs.
The decoder interface is the following.
You give it a buffer, a length and an int
pointer to accumulate errors into.
The function returns a struct which contains the decoded codepoint and its
utf8 encoded length.
If an error occurs, err
will be non-zero and the return value will be
ill-formed.
#include <stdint.h>
#include <stddef.h>

typedef uint32_t u32;

typedef struct {
u32 codepoint;
int len;
} U8Result;

static U8Result
utf8_dec(uint8_t *str, ptrdiff_t slen, int *err)
{
// ...
}
One interesting choice I've made is to make the err field sticky across function calls. This means that the decoder doesn't clear it to zero when called, the caller is responsible for doing it. This allows you to check for error - once - at the end of your loop, if that's appropriate.
int err = 0;
for ... {
U8Result r = utf8_dec(..., &err);
}
if (err) {
// appropriate error handling
}
But you could also just clear it to zero before each call and check for error on each iteration as well if that's what your use-case demands. Making the error field sticky across function calls allows for both use-cases to be satisfied.
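For example, the per-call variant could look like this (just a sketch, mirroring the loop above):
for ... {
int err = 0;
U8Result r = utf8_dec(..., &err);
if (err) {
// handle the malformed sequence right away
}
}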
There are two variations of the decoder. One is fully branchless, the other trades away branchlessness in favor of a more practical interface.
If you want to be fully branchless, the decoder needs to load 4 bytes from the
buffer unconditionally.
The first byte is shifted up to the most significant position followed by the
2nd, 3rd and 4th.
Effectively, it's a big-endian load (without any byte swapping confusion).
gcc
compiles it down to a single movbe
instruction.
u32 v = ((u32)str[3] << 0) | ((u32)str[2] << 8) |
((u32)str[1] << 16) | ((u32)str[0] << 24);
This means that the caller is responsible for making sure the buffer has 4 bytes of NUL padding at the end. For performance critical code, over-aligning or padding a buffer is not uncommon. But for regular code, this makes the interface impractical. The following version makes use of the buffer's length and avoids reading out of bounds:
u32 v = 0;
switch (MIN(slen, 4)) { // branchy
case 4: v |= ((u32)str[3] << 0); // fallthrough
case 3: v |= ((u32)str[2] << 8); // fallthrough
case 2: v |= ((u32)str[1] << 16); // fallthrough
case 1: v |= ((u32)str[0] << 24); break;
default: ASSERT(!"unreachable"); break; // we don't accept empty strings
}
This is no longer branchless, but it makes the interface more practical to use. A fair trade-off, usually.
If you look back to the table above, you'll see that the largest leading marker
is when you have a 4 byte sequence: 11110xxx
.
So in order to determine and validate the sequence length, we need to look at
the top 5 bits.
But if we simply assume that the input is well-formed, then we'll only need the top 4 bits to determine the length. This cuts down the lookup table size by half.
static const uint8_t len_table[] = {
/* 0XXX */ 1, 1, 1, 1, 1, 1, 1, 1,
/* 10XX */ 1, 1, 1, 1, /* invalid */
/* 110X */ 2, 2,
/* 1110 */ 3,
/* 1111 */ 4, /* maybe, but also could be invalid */
};
int len = len_table[str[0] >> 4];
For well-formed input, this will give us the length of the sequence.
But obviously, a robust decoder cannot simply assume the input to be valid.
Since we're not checking the 5th top bit for 0, bytes such as 0xFF
will
incorrectly get detected as a 4 byte sequence.
Also notice that bytes with a 10..
bit prefix are continuation bytes and thus
invalid as a leading byte.
But in our table, we treat them as a single byte sequence.
Keep these two cases in mind; I'll come back to them later.
Now that we've loaded the bytes and determined the length of the sequence, it's
time to extract the code-point.
Let's look at this 2 byte sequence to get an idea of how this works.
The first byte starts with 110
indicating it's a 2 byte sequence and the 2nd
byte starts with 10
indicating it's a continuation byte.
We're interested in filtering out the unwanted bits (the markers) and placing
the wanted bits (a
and b
below) into the proper place.
110aaaaa 10bbbbbb => aaaaabbbbbb
Using an AND mask, we can accomplish the former.
110aaaaa 10bbbbbb
& 00011111 00111111
--------------------
000aaaaa 00bbbbbb
This filtered out the unwanted bits turning them to 0.
But it also left behind a gap between the a
s and b
s which makes the result
incorrect.
If only there was some instruction that extracts the wanted bits and fills up
the gaps at the same time... And that's precisely what pext
does.
110aaaaa 10bbbbbb
pext 00011111 00111111
-----------------------
aaaaabbbbbb
If we know the length of the sequence, then we can pre-compute the pext mask
that's required to extract the codepoint out.
Keep in mind that the length can be [1, 4]
, but arrays are 0-based, so I'm
subtracting one from len
to get the index.
_pext_u32
is an intel intrinsic that compiles down to a
pext
instruction.
static const u32 pextmask[] = {
0x7F000000, 0x1F3F0000, 0x0F3F3F00, 0x073F3F3F,
};
int idx = len - 1;
u32 cp = _pext_u32(v, pextmask[idx]);
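As a quick sanity check (my own example, not from the original code): feeding the 2 byte sequence 0xC3 0xA9 (é) through the same steps gives back U+00E9.
u32 v = 0xC3A90000; // 0xC3 0xA9 loaded big-endian, the rest is padding
u32 cp = _pext_u32(v, pextmask[1]); // mask 0x1F3F0000 packs 5 + 6 bits => 0xE9 (U+00E9)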
And that's all there is to it. We've now extracted the code-point in a single instruction! But our job isn't done just yet, we'll need to do some validation to make sure the result is actually well formed.
First things first, let's make sure that the code-point is within range:
*err |= cp > 0x10FFFF; // out of range
If the condition is true, the lowest bit of *err
will get set to 1.
If the condition is false, nothing happens since OR-ing something with 0 is a
no-op.
Now time to check for overlong encoding. Again, we'll use a table for this.
static const u32 overlong[] = { 0x0, 0x80, 0x0800, 0x10000 };
*err |= cp < overlong[idx];
This will trigger an error if - for example - the byte sequence had a length of
2 but the decoded codepoint is below 0x80
.
Now check for surrogate pairs, i.e. codepoints in the range [0xD800, 0xDFFF]
:
*err |= (cp >> 11) == 0x1B; // surrogate
Or you could also check for cp >= 0xD800 && cp <= 0xDFFF
which is a more
direct way to accomplish the same check.
And despite looking like it does 2 comparisons, gcc is actually smart enough to
transform it into a single comparison utilizing unsigned wraparound: (cp - 0xD800) <= 0x7FF
.
The magic number 0x7FF
here is the result of 0xDFFF - 0xD800
.
If the codepoint was above 0xDFFF
, the subtraction will leave the result above
0x7FF
and the check returns false.
If the codepoint was under 0xD800
, it will wrap around to some
large number and also return false.
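If you'd rather write the wraparound form by hand (a sketch, equivalent to the shift based check above):
*err |= (cp - 0xD800) <= 0x7FF; // surrogate range [0xD800, 0xDFFF]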
Now only 3 more things remain: checking that the continuation markers are actually present, rejecting continuation bytes (10..) used as a leading byte, and catching bytes like 0xFF that were wrongly classified as a 4 byte sequence earlier:
static const u32 emask[] = {
0x80000000, 0xE0C00000, 0xF0C0C000, 0xF8C0C0C0,
};
static const u32 eexpect[] = {
0x00000000, 0xC0800000, 0xE0808000, 0xF0808080,
};
*err |= (v & emask[idx]) ^ (eexpect[idx]);
The above code checks all 3 of these conditions at once.
The emask
table contains a bitmask to select the bits that we're interested in
inspecting and the eexpect
table contains what we're expecting to see.
For a single byte sequence, emask[0] (0x80000000)
will select the highest bit,
and we're expecting it to be 0.
If it isn't zero, then the leading byte was actually an invalid continuation byte.
For double and triple byte sequences, we only need to validate that the continuation marker exists on the continuation bytes. The leading byte doesn't require validation since our length lookup table did so already, but I redundantly check for it again anyways since it's practically free to do so.
And finally in the quad byte sequence case, we need to validate the continuation markers and also make sure the 5th top bit on the leading byte was zero - since our length lookup didn't validate it.
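To make the tables concrete, here's how the 2 byte case (idx == 1) lines up, using the same notation as the diagrams above (my own annotation):
v          = 110aaaaa 10bbbbbb ........ ........
emask[1]   = 11100000 11000000 00000000 00000000 (0xE0C00000)
eexpect[1] = 11000000 10000000 00000000 00000000 (0xC0800000)
(v & emask[1]) ^ eexpect[1] comes out as zero exactly when the leading byte starts with 110 and the continuation byte starts with 10.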
Although I've described the algorithm in terms of
(v & emask[idx]) != eexpect[idx]
, the code snippet above contains an xor in
place of a not-equals. What's going on with that?
It's a neat trick to save an instruction.
With !=
it would result in an and
, cmp
and setne
instruction since !=
forces the result to be either 0 or 1.
But with xor, if the value is equal to what we expect, then the result will be 0
indicating no error.
But if the value was something else, then the result will be non-zero since the
only way xor would ever return 0 is when both operands are the same.
This results in only an and
and an xor
instruction while still giving us the
semantics that we want.
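Side by side, with the same masks and tables as above:
*err |= (v & emask[idx]) != eexpect[idx]; // and + cmp + setne: forced to 0 or 1
*err |= (v & emask[idx]) ^ eexpect[idx]; // and + xor: any non-zero value flags the error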
And with that, all the error cases have been handled and our decoder is complete.
Time for some benchmarks.
The benchmarking code can be found here.
The contenders are the fully branchless pext version described here,
Wellons's branchless decoder, Hoehrmann's DFA and a simple
straightforward implementation with if
statements.
Here are the results on an intel 14900KS:
name | speed (MB/s) |
---|---|
Wellons branchless | 795 |
Hoehrmann DFA | 803 |
Simple | 527 |
Pext (branchless) | 873 |
Hoehrmann has pointed out that the compiler was skipping the code-point calculation for the DFA decoder in the benchmark due to the code-point being unused.
Adding a simple checksum to force the code-point to be used slows down the DFA decoder by a bit more than 2x. The other decoders seem mostly unaffected.
The pext version managed to be roughly 8~10% faster than the DFA and branchless version. But hold your horses, on my personal desktop using a 3700x the situation is quite different:
name | speed (MB/s) |
---|---|
Wellons branchless | 709 |
Hoehrmann DFA | 669 |
Simple | 441 |
Pext (branchless) | 121 |
What happened here? Not only is it underperforming compared to the DFA and the branchless decoder, it's just straight up ~4x slower than even the simple version (!!)
Turns out that on older ryzen (zen 2 and below) pext
is not implemented in hardware but rather emulated at the microcode level,
giving it really horrible performance: a latency of 18 cycles (reportedly up to ~300
depending on the operands) and a throughput of 19, compared with intel where the
latency is 3 and the throughput is 1.
I rarely do low-level/assembly optimization, so this was news to me.
There seems to be some effort at making software versions of pext that are faster than the microcode emulation. I haven't looked too deeply into it since the entire premise was making a decoder utilizing a single (and fast) instruction.
This was a fun little side-quest. Though unfortunately I don't see this being worthwhile to use. Depending on a hardware-specific instruction was already sketchy, but the older zen issue puts the final nail in the coffin for me. Portable implementations are already nearly as fast, and if you're looking for maximum speed with no regard for portability then SIMD based approaches are likely better. The pext approach sits in a space where it doesn't quite make the right trade-off for any use-case I can think of.
Tags: [ performance ]