Recently one of my older posts about strlcpy sparked some discussion on various forums. Presumably the recently released POSIX edition had something to do with it. One particular counter-argument was raised by multiple posters - and it's an argument that I've heard before as well: strlcpy would only traverse the string once, whereas strlen + memcpy would always traverse it twice.
Hidden in this argument is the assumption that traversing the string once is faster. Which - to be clear - is not at all an unreasonable assumption. But is it actually true? That's the focus of today's article.
Computers do not have common sense. Computers are surprising.
- Tony Hoare to Lomuto
The following is from openbsd, where strlcpy originated - modified a bit for brevity.
size_t strlcpy(char *dst, const char *src, size_t dsize)
{
    const char *osrc = src;
    size_t nleft = dsize;

    if (nleft != 0) {
        while (--nleft != 0) {  /* Copy as many bytes as will fit. */
            if ((*dst++ = *src++) == '\0')
                break;
        }
    }
    if (nleft == 0) {  /* Not enough room in dst, add NUL and traverse rest of src. */
        if (dsize != 0)
            *dst = '\0';  /* NUL-terminate dst */
        while (*src++)
            ;
    }
    return (src - osrc - 1);  /* count does not include NUL */
}
It starts by copying from src to dst as much as it can, and if it has to truncate due to insufficient dst size, it then traverses the rest of src in order to get the strlen(src) value to return. And so if the source string fits, it will be traversed only once.
Now if you take a look at the glibc implementation of strlcpy, you'll immediately notice that the first line is this...
size_t src_length = strlen (src);
... followed by the rest of the code using memcpy to do the copying. This already shatters the illusion that strlcpy will traverse the string once: there's no requirement for that to happen, and as you can see, in practice one of the major libcs will always traverse the string twice, once in strlen and once in memcpy.
But before you open a bug report against glibc for being inefficient, here are some benchmark numbers from copying a 512 byte string repeatedly in a loop:
512 byte
openbsd: 242us
glibc: 12us
Perhaps the string is so small that the double traversal doesn't matter? How about a string of 1MiB?
1MiB
openbsd: 501646us
glibc: 31793us
The situation only gets worse for the openbsd version here, not better.
To be fair, this huge speed up comes from the fact that glibc punts all the work over to strlen and memcpy, which on glibc are SIMD optimized. But regardless, we can already see that doing something fast, twice - is faster than doing it once but slowly.
In order to do an apples to apples comparison, I've written the following strlcpy implementation, which is pretty close to the glibc implementation except with the strlen and memcpy calls written out as for loops.
size_t bespoke_strlcpy(char *dst, const char *src, size_t size)
{
    size_t len = 0;
    for (; src[len] != '\0'; ++len) {} // strlen() loop

    if (size > 0) {
        size_t to_copy = len < size ? len : size - 1;
        for (size_t i = 0; i < to_copy; ++i) // memcpy() loop
            dst[i] = src[i];
        dst[to_copy] = '\0';
    }
    return len;
}
It's important to note that in order to do a truly apples to apples comparison, you'd also need to use -fno-builtin when compiling. Otherwise gcc will realize that the "strlen loop" can be "optimized" down to a strlen call and emit that. -fno-builtin prevents that from happening and keeps the comparison fair.
So how does this version, which traverses src twice, perform against openbsd's variant, which traverses src only once?
512 byte
openbsd: 237us
bespoke: 139us
It's almost twice as fast. How about on bigger strings?
1MiB
openbsd: 488469us
bespoke: 277183us
Still roughly twice as fast. How come?
The importance of cache misses (rightfully) gets plenty of spotlight; dependencies, on the other hand, are not talked about as much. Your CPU has multiple cores, and each core has multiple ports (or logic units) capable of executing instructions. Which means that if you have some instructions like this (in pseudo assembly, where an upper case letter denotes a register):
A <- add B, C
X <- add Y, Z
E <- add A, X
The computations of A and X are independent, and thus can be executed in parallel. But the computation of E requires the results of A and X, and thus cannot be parallelized. This ability to execute independent instructions simultaneously is called instruction-level parallelism (or ILP). And dependencies are its kryptonite.
If you profile the "bespoke" strlcpy version, you'll notice that nearly 100% of the cpu time is spent on the "strlen loop" while the copy loop is basically free. Indeed, if you replace the "strlen loop" with an actual strlen call (reminder: it's SIMD optimized on glibc) then the bespoke version starts competing with the glibc version quite well, even though we aren't using an optimized memcpy.
In order to understand why this is happening, let's look at the "strlen loop", written in a verbose manner below:
len = 0;
while (true) {
    if (src[len] == '\0')
        break; // <- this affects the next iteration
    else
        ++len;
}
In the above loop, whether or not the next iteration will execute depends on the result of the previous iteration (whether src[len] was nul or not). We pay for this in our strlen loop. But our memcpy loop is free of such loop-carried dependencies: the current iteration happens regardless of what happened in the last iteration.
for (size_t i = 0; i < to_copy; ++i) // memcpy() loop
    dst[i] = src[i]; // <- does NOT depend on previous iteration
In the openbsd version, because the length and copy loops are fused together, whether or not the next byte will be copied depends on the byte value from the previous iteration.
while (--nleft != 0) { // openbsd copy loop
    if ((*dst++ = *src++) == '\0') // <- the branch taken here affects the next iteration
        break;
}
Effectively the cost of this dependency is now not just imposed on the length computation but also on the copy operation. And to add insult to injury, dependencies are not just difficult for the CPU, they are also difficult for the compiler to optimize/auto-vectorize resulting in worse code generation - a compounding effect.
The key to making programs fast is to make them do practically nothing.
- Mike Haertel, why GNU grep is fast
Two years ago, when I wrote the strlcpy article, I was still of the opinion that nul-terminated strings were "fine" and the problem was the standard library being poor. But even with better nul-string routines, I noticed that a disproportionate amount of mental effort was spent, and bugs written, trying to program with them.
Two very important observations since then:
1. Without knowing the length, strings become closer to a linked list - forcing a serial access pattern - than to an array that can be randomly accessed. Many common string functions are better expressed (read: less error-prone) when the length can be cheaply known. Nul-terminated strings, on the other hand, encourage you to continuously throw this very valuable information away - leading to having to spuriously recompute it again and again and again (the GTA loading screen incident always comes to mind).
2. Sized strings get rid of a lot of spurious copies (i.e. more efficiency) as well as allocations (i.e. avoiding unnecessary memory management). And as a result, a great deal of logic and code that was necessary when managing nul-terminated strings simply disappears.
With these two in mind, nowadays I just use sized strings (something akin to C++'s std::string_view) and only convert to a nul-string when an external API demands it. This topic is worth an article of its own, but since it's not the focus of this article, I'll leave it at that.
But the good news is that aside from a segment of the C crowd, where the "default fallacy" (if something is the default, it must be the right thing to use) runs high, the majority of the world has more or less realized nul-strings to be a mistake. This is evident when you look at most other programming languages, including a lot of the newer systems programming ones, where nul-strings are not used by default (if at all). Even languages with a C heritage are moving away from nul-strings - recall C++'s string_view.
When talking about performance, it's important to make clear whether we are talking about an academic setting or a practical one, because CPUs do not care about common sense or big-O notation. A modern CPU is incredibly complex and full of surprises. And so the performance of an algorithm doesn't just depend on high-level algorithmic factors - lower-level factors such as cache misses, ILP, branch mispredictions etc. also need to be taken into account. Many things which seem faster from a common sense perspective might in practice end up being slower, and vice versa.
Tags: [ c, performance, toolchain ]