isdigit
is a very simple function which tells you weather a character is a
digit or not. But the simplicity is deceiving, at least on the C standard
library.
You see, pretty much all the functions in <ctype.h>
is either useless or
rather difficult to use correctly due to the following reasons:
is*
variants don't actually return a boolean (i.e 0
or 1
), instead
they return 0 on failure but any non-zero value on success. if (isdigit(ch))
, however code like this: int v[2]; v[isdigit(ch)] = 0;
is broken and has to be converted to v[!!isdigit(ch)]
.unsigned char
(or EOF
) then the behavior is undefined. char ch = getchar(); if (isdigit(ch))
is actually UB since user might input some negative character and you must
cast the argument if (isdigit((unsigned char)ch))
to avoid triggering UB.Due to these braindeaths, it's very common for people to roll their own ctype
macros, such as #define ISDIGIT(X) ((X) >= '0' && (X) <= '9')
. While this is
the obvious implementation, is this actually the best one?
Today I want to take a look at a couple different ways to implement the
function isdigit
and analyze some of the performance characteristic of them.
Such nano-optimizations is unlikely to matter on most software, but this was
done mostly out of scratching my own curiosity.
Before diving in, I'd like to assert a couple different things about the functions we're going to implement which differ from the standard C ones:
unsigned char
as our argument.EOF
is not necessary. With these out of the way, let's begin.
The simple version simply checks weather ch
falls between '0' to '9' or not:
int
isdigit_simple(unsigned char ch)
{
return ch >= '0' && ch <= '9';
}
This has a couple nice properties:
Pretty much the only downside of this approach is that the code is "branch-y", which can cause performance penalties if the processor mispredicts the branch.
Assuming we change the function to take a uint32_t
argument instead of an
unsigned char
. And of course, when I say it would "work" on Unicode
codepoints, I mean that it's going to pick out the ASCII digits out of there,
not digits of other languages.
To overcome the branching, a rather popular approach is to have a "Look-up-table" (aka LUT) so that no branching needs to occur:
int
isdigit_lut256(unsigned char ch)
{
static const unsigned char lut[256] = {
['0'] = 1, ['1'] = 1, ['2'] = 1, ['3'] = 1, ['4'] = 1,
['5'] = 1, ['6'] = 1, ['7'] = 1, ['8'] = 1, ['9'] = 1,
};
return lut[ch];
}
This solves the branching problem, however it introduces a whole bunch of other problems:
That's a lot of downsides just to avoid a branch.
This version compresses the LUT by assigning each character it's own bit rather than a whole byte. Unlike the above two, this version is not very commonly found in practice (and for good reasons).
int
isdigit_lut32(unsigned char ch)
{
enum { UBITS = sizeof(unsigned int) * 8 }; /* assumes CHAR_BIT == 8 */
static const unsigned int lut[256 / UBITS] = {
['0' / UBITS] = 1023U << ('0' % UBITS),
};
return lut[ch / UBITS] & (1U << (ch % UBITS));
}
The code seem a bit scary if you're unfamiliar with bit-twiddling, but it's basically converting the byte table above to a bit table instead.
This version basically inherits all of the problem with original LUT approach:
Overall I suspect this version to be the worst out of all.
This version uses some basic bithacking to achieve branchlessness (is this a word?) and thus avoids all the memory access related problems found on the LUT variants. Lo and behold:
int
isdigit_bithacks(unsigned char ch)
{
unsigned int ge0 = (unsigned int)ch + ((0xFFU - '0') + 1);
unsigned int gt9 = (unsigned int)ch + (0xFFU - '9');
return (ge0 ^ gt9) >> 8;
}
Are you beholding? Have you beheld? In case you're done behelding, I'd like to explain how this variants works. While the trick used here is rather basic, I have never seen this variant be used anywhere and thus think it's worth an explanation.
The very first thing to realize is that the logic used here is kinda retarded.
Instead of checking weather ch
is between '0' and '9' it instead:
ch
is greater than or equal to '0'.ch
is greater than '9'.I say this is retarded because I don't think anyone naturally thinks like this. But it works out because:
ch
is less than '0', both the condition would be false.
0 ^ 0
gives us 0
.ch
is greater-or-eq to '0' but no greater than '9', it means ch
falls
between '0' and '9'. Thus ge0
check would be true and gt9
check would be
false. And XOR-ing 1 ^ 0
will give us 1
.ch
is greater than '9' then it must be also greater-or-eq to '0'. Meaning
both the flags will be true. And XOR-ing 1 ^ 1
gives us 0
.Okay, so now that the logic is out of the way, it's time to explain the magic numbers.
Since we're talking in an unsigned char
as argument, it means that when ch
gets promoted to an unsigned int
, the 9th bit will be clear. We use this bit
as a "trap bit".
Now to prepare our "magic number". We need an integer with all the least
significant 8 bits set to 1, 0xFF
. Now if we want to check weather ch
is
greater than '9' (which is 57 in decimal) we simply need to subtract it from
0xFF
.
0xFF: 00000000 11111111
'9': 00000000 00111001
0xFF - '9': 00000000 11000110 /* magic */
This gives us our "magic number", which is setup in such a way that if you add anything greater than '9' to it, the carry will propagate onto the 9th bit.
Let's see this in action, first let's add '4' (52 in decimal) to our magic number.
magic: 00000000 11000110
+ '4': 00000000 00110100
----------------------------
00000000 11111010
As expected, the 9th bit remains 0 since '4' is not greater than '9'. Let's try it with a value which is greater than '9'. I've picked 'A' for this purpose, which is 65 in decimal:
magic: 00000000 11000110
+ 'A': 00000000 01000001
---------------------------
00000001 00000111
Tada! The 9th bit is set! This should explain how the gt9
check works. The
ge0
check works the same way, but since we want to check greater or equal we
add 1 to our magic number instead.
And finally the >> 8
simply clears out the garbage bits, and as a consequence
gives us a nice boolean return value. And now with that out of the way, we can
return back to analyzing the performance implications.
This pretty much has all the good quality we're looking for:
So what's the catch? Well, it's branchless...
While being branchless is typically a good thing since we avoid paying for misprediction penalties this also means that we cannot get the benefit of good predictions either. So the branch predictor cannot hurt us, but neither can it help us. This may end up costing us some throughput on predicable inputs.
Enough hypothesis, time for some benchmarks!
Now, keep in mind that in order to benchmark properly you need to benchmark from multiple different angle (both internal and external), do enough iterations and then do proper statistical analysis. But because I'm not writing a research paper, I've opted not to do that, so take these results with a grain of salt.
System info: 3700x processor, 3200mhz 16-18-18-32 8x2 memory, GNU/Linux OS.
First I want to see how these functions perform when called just once. This is to try and see weather the hypothesis of unreliable performance of the LUT variants due to cache-miss was correct or not. The results:
Variant | Avg | 2% lows |
---|---|---|
simple | 0.000042 | 0.000054 |
lut256 | 0.000379 | 0.000491 |
lut32 | 0.000393 | 0.000749 |
bithacks | 0.000041 | 0.000051 |
The results for the "simple" and "bithacks" version are quite noisy and vary a good amount between runs even with 1024 iterations. However a couple things are consistent:
Now let's see how they perform in a loop on a 64KiB buffer filled with random
bytes. Note that this is highly impractical since very rarely, if ever you will
need to run a loop with only isdigit
in there.
Variant | Avg |
---|---|
simple | 0.038810 |
lut256 | 0.017070 |
lut32 | 0.037540 |
bithacks | 0.036480 |
These results are much more consistent. Surprising, perhaps no one, the "lut256" variant crushes it and the "simple" and "lut32" variant performs the worst.
The "bithacks" variant seems to beat the "simple" and "lut32" variant, but keep in mind that the random input makes things harder to predict and thus makes this benchmark biased against the "simple" variant.
All the source code as well as the benchmarking code in this article can be found here.
Tags: [ c, performance ]