As a side-quest I recently decided to write a branchless utf8 decoder utilizing
the pext
or "parallel extract" instruction.
It's compliant with rfc-3629, meaning that it doesn't just naively decode
the code-point but also checks for overlong encoding, surrogate pairs and such.
Compiled with gcc -O3 -march=x86-64-v3
the entire decoder results in just
about 29 instructions.
That's pretty sweet, if you ask me.
The full source code can be found here: utf8-pext.c.
A small refresher on how utf8 works:
length | bit pattern |
---|---|
1 byte | 0xxxxxxx |
2 byte | 110xxxxx 10xxxxxx |
3 byte | 1110xxxx 10xxxxxx 10xxxxxx |
4 byte | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
For example, a 2 byte sequence has 110.. as the first byte, and every continuation byte starts with the 10.. marker.
The bits marked with x
are the important bits that we need to extract out.
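As a concrete illustration (my own example, not from the table above): the codepoint U+00E9 (é) doesn't fit into 7 bits, so it gets the 2 byte pattern.
U+00E9 = 00011 101001 (11 payload bits, split 5 + 6)
11000011 10101001 => 0xC3 0xA9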
We also need to validate the input to make sure the continuation markers exist,
check against overlong encoding and surrogate pairs.
The decoder interface is the following.
You give it a buffer, a length and an int
pointer to accumulate errors into.
The function returns a struct which contains the decoded codepoint and its
utf8 encoded length.
If an error occurs, err
will be non-zero and the return value will be
ill-formed.
#include <stdint.h>
#include <stddef.h>

typedef uint32_t u32;

typedef struct {
u32 codepoint;
int len;
} U8Result;

static U8Result
utf8_dec(uint8_t *str, ptrdiff_t slen, int *err)
{
// ...
}
One interesting choice I've made is to make the err field sticky across function calls. This means that the decoder doesn't clear it to zero when called, the caller is responsible for doing it. This allows you to check for error - once - at the end of your loop, if that's appropriate.
int err = 0;
for ... {
U8Result r = utf8_dec(..., &err);
}
if (err) {
// appropriate error handling
}
But you could also just clear it to zero before each call and check for error on each iteration as well if that's what your use-case demands. Making the error field sticky across function calls allows for both use-cases to be satisfied.
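For example, the per-call variant could look like this (just a sketch, mirroring the loop above):
for ... {
int err = 0;
U8Result r = utf8_dec(..., &err);
if (err) {
// handle the malformed sequence right away
}
}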
There are two variations of the decoder. One is fully branchless, the other trades away branchlessness in favor of a more practical interface.
If you want to be fully branchless, the decoder needs to load 4 bytes from the
buffer unconditionally.
The first byte is shifted up to the most significant position followed by the
2nd, 3rd and 4th.
Effectively, it's a big-endian load (without any byte swapping confusion).
gcc
compiles it down to a single movbe
instruction.
u32 v = ((u32)str[3] << 0) | ((u32)str[2] << 8) |
((u32)str[1] << 16) | ((u32)str[0] << 24);
This means that the caller is responsible for making sure the buffer has 4 bytes of NUL padding at the end. For performance critical code, over-aligning or padding a buffer is not uncommon. But for regular code, this makes the interface impractical. The following version makes use of the buffer's length and avoids reading out of bounds:
u32 v = 0;
switch (MIN(slen, 4)) { // branchy
case 4: v |= ((u32)str[3] << 0); // fallthrough
case 3: v |= ((u32)str[2] << 8); // fallthrough
case 2: v |= ((u32)str[1] << 16); // fallthrough
case 1: v |= ((u32)str[0] << 24); break;
default: ASSERT(!"unreachable"); break; // we don't accept empty strings
}
This is no longer branchless, but it makes the interface more practical to use. A fair trade-off, usually.
If you look back to the table above, you'll see that the largest leading marker
is when you have a 4 byte sequence: 11110xxx
.
So in order to determine and validate the sequence length, we need to look at
the top 5 bits.
But if we simply assume that the input is well-formed, then we'll only need the top 4 bits to determine the length. This cuts down the lookup table size by half.
static const uint8_t len_table[] = {
/* 0XXX */ 1, 1, 1, 1, 1, 1, 1, 1,
/* 10XX */ 1, 1, 1, 1, /* invalid */
/* 110X */ 2, 2,
/* 1110 */ 3,
/* 1111 */ 4, /* maybe, but also could be invalid */
};
int len = len_table[str[0] >> 4];
For well-formed input, this will give us the length of the sequence.
But obviously, a robust decoder cannot simply assume the input to be valid.
Since we're not checking the 5th top bit for 0, bytes such as 0xFF
will
incorrectly get detected as a 4 byte sequence.
Also notice that bytes with a 10..
bit prefix are continuation bytes and thus
invalid as a leading byte.
But in our table, we treat them as a single byte sequence.
Keep these two cases in mind; I'll come back to them later.
Now that we've loaded the bytes and determined the length of the sequence, it's
time to extract the code-point.
Let's look at this 2 byte sequence to get an idea of how this works.
The first byte starts with 110
indicating it's a 2 byte sequence and the 2nd
byte starts with 10
indicating it's a continuation byte.
We're interested in filtering out the unwanted bits (the markers) and placing
the wanted bits (a
and b
below) into the proper place.
110aaaaa 10bbbbbb => aaaaabbbbbb
Using an AND mask, we can accomplish the former.
110aaaaa 10bbbbbb
& 00011111 00111111
--------------------
000aaaaa 00bbbbbb
This filtered out the unwanted bits turning them to 0.
But it also left behind a gap between the a
s and b
s which makes the result
incorrect.
If only there was some instruction that extracts the wanted bits and fills up
the gaps at the same time... And that's precisely what pext
does.
110aaaaa 10bbbbbb
pext 00011111 00111111
-----------------------
aaaaabbbbbb
If we know the length of the sequence, then we can pre-compute the pext mask
that's required to extract the codepoint out.
Keep in mind that the length can be [1, 4]
, but arrays are 0-based, so I'm
subtracting one from len
to get the index.
_pext_u32
is an intel intrinsic that compiles down to a
pext
instruction.
static const u32 pextmask[] = {
0x7F000000, 0x1F3F0000, 0x0F3F3F00, 0x073F3F3F,
};
int idx = len - 1;
u32 cp = _pext_u32(v, pextmask[idx]);
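As a quick sanity check (my own example, not from the original code): feeding the 2 byte sequence 0xC3 0xA9 (é) through the same steps gives back U+00E9.
u32 v = 0xC3A90000; // 0xC3 0xA9 loaded big-endian, the rest is padding
u32 cp = _pext_u32(v, pextmask[1]); // mask 0x1F3F0000 packs 5 + 6 bits => 0xE9 (U+00E9)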
And that's all there is to it. We've now extracted the code-point in a single instruction! But our job isn't done just yet, we'll need to do some validation to make sure the result is actually well formed.
First things first, let's make sure that the code-point is within range:
*err |= cp > 0x10FFFF; // out of range
If the condition is true, the lowest bit of *err
will get set to 1.
If the condition is false, nothing happens since OR-ing something with 0 is a
no-op.
Now time to check for overlong encoding. Again, we'll use a table for this.
static const u32 overlong[] = { 0x0, 0x80, 0x0800, 0x10000 };
*err |= cp < overlong[idx];
This will trigger an error if - for example - the byte sequence had a length of
2 but the decoded codepoint is below 0x80
.
Now check for surrogate pairs, i.e. codepoints in the range [0xD800, 0xDFFF]
:
*err |= (cp >> 11) == 0x1B; // surrogate
Or you could also check for cp >= 0xD800 && cp <= 0xDFFF
which is a more
direct way to accomplish the same check.
And despite looking like it does 2 comparisons, gcc is actually smart enough to
transform it into a single comparison utilizing unsigned wraparound: (cp - 0xD800) <= 0x7FF
.
The magic number 0x7FF
here is the result of 0xDFFF - 0xD800
.
If the codepoint was above 0xDFFF
, the subtraction will leave the result above
0x7FF
and the check returns false.
If the codepoint was under 0xD800
, it will wrap around to some
large number and also return false.
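If you'd rather write the wraparound form by hand (a sketch, equivalent to the shift based check above):
*err |= (cp - 0xD800) <= 0x7FF; // surrogate range [0xD800, 0xDFFF]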
Now only 3 more things remain: checking that the continuation markers are actually present, rejecting continuation bytes (10..) used as a leading byte, and catching bytes like 0xFF that were wrongly classified as a 4 byte sequence earlier:
static const u32 emask[] = {
0x80000000, 0xE0C00000, 0xF0C0C000, 0xF8C0C0C0,
};
static const u32 eexpect[] = {
0x00000000, 0xC0800000, 0xE0808000, 0xF0808080,
};
*err |= (v & emask[idx]) ^ (eexpect[idx]);
The above code checks all 3 of these conditions at once.
The emask
table contains a bitmask to select the bits that we're interested in
inspecting and the eexpect
table contains what we're expecting to see.
For a single byte sequence, emask[0] (0x80000000)
will select the highest bit,
and we're expecting it to be 0.
If it isn't zero, then the leading byte was actually an invalid continuation byte.
For double and triple byte sequences, we only need to validate that the continuation marker exists on the continuation bytes. The leading byte doesn't require validation since our length lookup table did so already, but I redundantly check for it again anyways since it's practically free to do so.
And finally in the quad byte sequence case, we need to validate the continuation markers and also make sure the 5th top bit on the leading byte was zero - since our length lookup didn't validate it.
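To make the tables concrete, here's how the 2 byte case (idx == 1) lines up, using the same notation as the diagrams above (my own annotation):
v          = 110aaaaa 10bbbbbb ........ ........
emask[1]   = 11100000 11000000 00000000 00000000 (0xE0C00000)
eexpect[1] = 11000000 10000000 00000000 00000000 (0xC0800000)
(v & emask[1]) ^ eexpect[1] comes out as zero exactly when the leading byte starts with 110 and the continuation byte starts with 10.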
Although I've described the algorithm in terms of
(v & emask[idx]) != eexpect[idx]
, the code snippet above contains an xor in
place of a not-equals. What's going on with that?
It's a neat trick to save an instruction.
With !=
it would result in an and
, cmp
and setne
instruction since !=
forces the result to be either 0 or 1.
But with xor, if the value is equal to what we expect, then the result will be 0
indicating no error.
But if the value was something else, then the result will be non-zero since the
only way xor would ever return 0 is when both operands are the same.
This results in only an and
and an xor
instruction while still giving us the
semantics that we want.
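Side by side, with the same masks and tables as above:
*err |= (v & emask[idx]) != eexpect[idx]; // and + cmp + setne: forced to 0 or 1
*err |= (v & emask[idx]) ^ eexpect[idx]; // and + xor: any non-zero value flags the error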
And with that, all the error cases have been handled and our decoder is complete.
Time for some benchmarks.
The benchmarking code can be found here.
The contenders are the fully branchless pext version described here,
Wellons's branchless decoder, Hoehrmann's DFA and a simple
straightforward implementation with if
statements.
Here are the results on an intel 14900KS:
name | speed (MB/s) |
---|---|
Wellons branchless | 795 |
Hoehrmann DFA | 803 |
Simple | 527 |
Pext (branchless) | 873 |
Hoehrmann has pointed out that the compiler was skipping the code-point calculation for the DFA decoder in the benchmark due to the code-point being unused.
Adding a simple checksum to force the code-point to be used slows down the DFA decoder by a bit more than 2x. The other decoders seem mostly unaffected.
The pext version managed to be roughly 8~10% faster than the DFA and branchless version. But hold your horses, on my personal desktop using a 3700x the situation is quite different:
name | speed (MB/s) |
---|---|
Wellons branchless | 709 |
Hoehrmann DFA | 669 |
Simple | 441 |
Pext (branchless) | 121 |
What happened here? Not only is it underperforming compared to the DFA and the branchless decoder, it's just straight up ~4x slower than even the simple version (!!)
Turns out that on older ryzen (zen 2 and below) pext
is not implemented in hardware but rather emulated at the microcode level,
giving it really horrible performance: a latency of 18 cycles (reportedly up to ~300
depending on the operands) and a throughput of 19, compared with intel where the
latency is 3 and the throughput is 1.
I rarely do low-level/assembly optimization, so this was news to me.
There seems to be some effort at making software versions of pext that are faster than the microcode emulation. I haven't looked too deeply into it since the entire premise was making a decoder utilizing a single (and fast) instruction.
This was a fun little side-quest. Though unfortunately I don't see this being worthwhile to use. Depending on a hardware-specific instruction was already sketchy, but the older zen issue puts the final nail in the coffin for me. Portable implementations are already nearly as fast, and if you're looking for maximum speed with no regard for portability then SIMD based approaches are likely better. The pext approach sits in a space where it doesn't quite make the right trade-off for any use-case I can think of.
Tags: [ performance ]