Since I wanted to play around a bit with C11 atomics and lock-free algorithms, here's a fun little exercise: a worker thread that does some work and a snooper thread that periodically snoops on that work (perhaps to print some progress bar?).
The easiest and most practical way to accomplish this is probably just to use locks; since we're assuming the snooper thread will be slower than the worker, there won't be much contention either.
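Just for reference, here's a minimal sketch of what that lock-based version could look like (assuming POSIX threads; the Work payload and the function names are made up for illustration):

#include <pthread.h>

typedef struct { long progress; /* ... */ } Work;

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static Work latest; // last completed piece of work

// worker: publish a finished piece of work
void publish(const Work *w)
{
    pthread_mutex_lock(&lock);
    latest = *w;
    pthread_mutex_unlock(&lock);
}

// snooper: copy out the most recent snapshot
void snoop(Work *out)
{
    pthread_mutex_lock(&lock);
    *out = latest;
    pthread_mutex_unlock(&lock);
}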
But as I said, I wanted to mess around with lock-free code, and the first thing that came to mind was a queue, but with a couple of twists: it only ever holds the latest piece of work (the snooper doesn't care about history), and it should stay generic, without any application-specific sentinel values.
First let's take a look at a simple (although incorrect) version of this just to gain a better understanding of the problem.
u/Vannaka420 has pointed out that the common name for this is multiple buffering.
The naive version uses a 2-slot array and an atomic int which represents which slot the writer is currently working on.
typedef struct { /* ... */ } Work;

struct {
    Work buf[2];
    _Atomic int widx;
} ctx;
From the writer's end, he'll need to load the widx and write his work into that index:
// writer
int w = atomic_load(&ctx.widx);
ctx.buf[w] = ...; // write the work into the buffer
And from the reader's end, he'll need to toggle the widx (with an atomic fetch_xor) so that the writer switches where it's writing, and read back the old index.
// reader
int r = atomic_fetch_xor(&ctx.widx, 0x1); // make the writer switch its index
snapshot = ctx.buf[r]; // read the latest work
This, of course, doesn't really work. There are three main defects here:

1. Nothing stops the writer and the reader from ending up on the same slot at the same time, so the reader can read a Work that's still being actively written to.
2. If the reader runs before the writer has published anything, it ends up reading uninitialized data.
3. If the reader reads twice in a row while the writer makes no progress, it re-reads a stale entry as if it were fresh.
What I am about to show was actually my 4th or 5th design, but out of all the ones I came up with, this was also the simplest (while still being correct).
Let's start by fixing issue #1, ignoring issues #2 and #3 completely for now. Essentially, what we need to fix #1 is two guarantees: the writer must never write into a slot the reader currently owns, and the reader must never grab the slot the writer is currently writing into.
To make these guarantees, we'll need an array with 3 slots instead of two. One of the slots will be owned by the writer, one by the reader and the remaining one will be the "next" slot.
From the writer's POV, this "next" slot will be where he needs to write into next. Once the writer is done writing into his current slot, he needs to make that slot available for the reader to grab, while at the same time grabbing the "next" slot for himself.
[ W | R | N ] // writer writes workN into index 0
[ N | R | W ] // he now gives up index 0 as the "next" slot while grabbing
// index 2 where he will write workN+1 into
And from the reader's POV, he needs to grab the "next" slot, as it holds the latest work published by the writer, while at the same time giving up his current read slot, making it available for the writer to grab next.
[ N | R | W ] // reader wakes up and wants to read a snapshot
[ R | N | W ] // reader grabs index 0, where workN was published
// writer is still writing workN+1 into index 2
[ R | W | N ] // writer finished writing workN+1 into index 2 and now grabbed
// index 1 to write workN+2 into
This seems solid so far (once again, ignoring the other two issues). The writer can no longer interfere with the reader, and the reader won't ever read anything that might be getting actively written to.
How can we implement this? With the atomic_exchange
function! The atomic
exchange function sets a variable to a new value while returning the old value
atomically. Let's put it to use.
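Before we do, here's a tiny standalone illustration of those semantics (not part of the queue yet):

#include <stdatomic.h>

_Atomic int x = 1;

int main(void)
{
    int old = atomic_exchange(&x, 2); // atomically: store 2 into x, return what was there
    return old; // old == 1, x is now 2
}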
I'll assume that at startup, the local w
will be initialized to 0, local r
initialized to 1 and the atomic next
will be initialized to the remaining
slot, 2. The specifics don't matter much, as long as they don't overlap.
Remember, we don't want the reader or writer to ever end up "owning" the same
slot at the same time.
struct {
    Work buf[3];
    _Atomic int next;
} ctx = { .next = 2 }; // the "next" slot starts out as index 2

// writer
ctx.buf[w] = ...;                  // uses his local `w` index to write into his slot
w = atomic_exchange(&ctx.next, w); // makes his `w` slot available as "next"
                                   // while grabbing the current "next" slot

// reader
r = atomic_exchange(&ctx.next, r); // makes his current read slot available as "next"
                                   // while grabbing the current "next" slot for reading
read = ctx.buf[r];
Now that we've fixed issue #1, it's time to take a look at issue #2. If the reader
"wakes up" before the writer has written anything, he'll end up grabbing index 2
(the next
index) thinking that it's the writer's latest work.
We could, of course, initialize all the work slots to something "invalid" such as 0 or UINT_MAX depending on the application, so that the reader can tell whether it's reading actual work or the initial state. However, I promised a generic version, which means not making such application-specific assumptions.
So let's use one extra bit in the next
variable to indicate whether it was
written to or not. Since our indexes can only be 0, 1 and 2 (which fit within 2
bits), we'll use the 3rd bit for this purpose, 0b100
or 0x4
in hex. I'll give
this a name, UNINIT_BIT, to make reading the code easier.
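In code, that's just a one-liner (the exact spelling is up to you):

#define UNINIT_BIT 0x4 // 0b100: the 3rd bit, since valid indexes (0, 1, 2) fit in the low two bits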
So now the next
variable needs to be initialized to 2 but with the
UNINIT_BIT
set as well.
struct {
    Work buf[3];
    _Atomic int next;
} ctx = { .next = 2 | UNINIT_BIT }; /* next is initialized at startup to have
                                       the UNINIT_BIT set */

// reader
r = atomic_exchange(&ctx.next, r);
if (!(r & UNINIT_BIT)) { // if the UNINIT_BIT is set, we were too fast and the writer
                         // hasn't written anything yet, so only read when it's _not_ set
    read = ctx.buf[r];
}
This takes care of the reader; now what about the writer? The writer doesn't
really care about the UNINIT_BIT
at all. It doesn't matter to the writer
whether the slot he got was "uninitialized" or not. So the writer simply needs
to discard that bit.
// writer
ctx.buf[w] = ...;                  // write as usual
w = atomic_exchange(&ctx.next, w); // exchange as usual
w &= ~UNINIT_BIT;                  // strip UNINIT_BIT, we don't care about it
This takes care of the 2nd issue.
The 3rd issue was what happens when the reader "pops" twice: he'll just end up acquiring what he just gave up.
[ R | N | W ] // reader wakes up
[ N | R | W ] // takes the "next" slot
[ R | N | W ] // reader wakes up again, but the writer made no progress
// and thus the reader ended up back on a stale entry
We solved the uninitialized issue by using one extra bit; perhaps we can now use a 4th bit to solve this one? Turns out, 3 bits are all that's needed. All we really need to do is extend the idea of an uninitialized slot: when the reader gives up a slot, we can count that as "uninitialized" as well!
Effectively, what this means is that the only change that needs to happen is on
the reader's side: when doing the exchange, the reader needs to set the
UNINIT_BIT
as well.
// reader
r = atomic_exchange(&ctx.next, r | UNINIT_BIT); // mark the slot we're giving up as "uninitialized"
if (r & UNINIT_BIT) {
    // either uninitialized or a stale slot, don't read anything
} else {
    read = ctx.buf[r]; // otherwise safe to read
}
Since the writer always clears the UNINIT_BIT, we can be sure that if the
UNINIT_BIT
is not set, we're reading something that was left by the writer.
No other change needs to happen in the writer's code.
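Putting the pieces together, here's roughly what the whole thing could look like wrapped up as a pair of helper functions. This is just a sketch: the Snoop type, the function names, and the placeholder Work payload are made up for this post, and it uses plain atomic_exchange, i.e. the default sequentially consistent ordering (more on that below).

#include <stdatomic.h>

#define UNINIT_BIT 0x4

typedef struct { long progress; /* ... */ } Work; // placeholder payload

typedef struct {
    Work buf[3];
    _Atomic int next; // initialize to 2 | UNINIT_BIT at startup
} Snoop;

// writer side: publish `work`; `w` is the writer's local slot index (starts at 0)
static void snoop_publish(Snoop *s, int *w, const Work *work)
{
    s->buf[*w] = *work;                 // write into the slot we own
    *w = atomic_exchange(&s->next, *w); // hand our slot over as "next", grab the old "next"
    *w &= ~UNINIT_BIT;                  // the writer never cares about the uninit/stale flag
}

// reader side: `r` is the reader's local slot index (starts at 1);
// returns 1 and fills `out` with fresh work, or 0 if there was nothing new
static int snoop_read(Snoop *s, int *r, Work *out)
{
    *r = atomic_exchange(&s->next, *r | UNINIT_BIT); // give our slot back, marked as stale
    if (*r & UNINIT_BIT)
        return 0;                       // uninitialized or stale, nothing to read
    *out = s->buf[*r];
    return 1;
}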
So far, I've carefully avoided mentioning any specific memory order. Everything above assumes a sequentially consistent memory ordering, which is what you get by default using the C11 atomics.
However, sequentially consistent memory ordering might be expensive, especially on relaxed hardware such as ARM. Can we use a less strict ordering? This is something I initially got subtly wrong.
Initially I was using release
on the writer's end and acquire
on the
reader's end, but running the code with clang and TSan showed a data
race. Funnily enough, this data race goes away when I enable -O2
(which
inlines the snooper function into main()
). For a full day, I ping-ponged back
and forth between "this is a TSan bug" vs "there's actually a bug in the code".
In the end, the conclusion I came to was that TSan was correct and
the release
+ acquire
memory ordering wasn't strict enough. This becomes
apparent when you unroll the writer loop:
buf[w] = ...;                              // write1
w = xchg(&next, w, release) & ~UNINIT_BIT; // xchg1
buf[w] = ...;                              // write2
w = xchg(&next, w, release) & ~UNINIT_BIT; // xchg2
The release
memory order says that anything that happens before the
exchange really does happen before it. However, it makes no guarantees
about anything that happens after it. In other words, it should be fair game to
make write2
visible before xchg1.
buf[w] = ...;                              // write1
buf[w] = ...;                              // write2 .. OH NO!
w = xchg(&next, w, release) & ~UNINIT_BIT; // xchg1
w = xchg(&next, w, release) & ~UNINIT_BIT; // xchg2
This still preserves the guarantee that write1
happens before xchg1
and
write2
happens before xchg2, but this is clearly not what we want to have
happen.
Using memory_order_acq_rel
(on both the writer and reader's end) guarantees
that write2 cannot become visible before xchg1. It also shuts up TSan,
so that's a good indication that we're on the right path.
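Concretely, the only change is spelling out the ordering with the _explicit variant of the exchange (a sketch; same fragments as before, just with the memory order made explicit):

// writer
ctx.buf[w] = ...;
w = atomic_exchange_explicit(&ctx.next, w, memory_order_acq_rel);
w &= ~UNINIT_BIT;

// reader
r = atomic_exchange_explicit(&ctx.next, r | UNINIT_BIT, memory_order_acq_rel);
if (!(r & UNINIT_BIT))
    read = ctx.buf[r];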
If all this sounds like complete voodoo to you, don't worry, you're not alone. Even after spending a significant amount of time thinking about this, I'm still not 100% certain whether my explanation is actually what was going on or not. But it's better to be safe than sorry, so I've gone with acq_rel.
Keep in mind that the guarantees made by memory_order_release only apply if the other thread uses memory_order_acquire on its end too!
And there you have it, a lock-free, generic, single-element LIFO queue with just 3 bits of atomics and about 4 lines of code. Here's a simple demonstration that you can try out at home. Is this practical? Probably not, and I doubt it's anything novel either. But regardless, it's been a really fun exercise.
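In case you want to experiment yourself, here's a minimal sketch of what such a demo could look like; the Work payload (a single counter), the thread setup, and the timing are all made up for illustration, using the acq_rel ordering from above:

// build: cc -O2 -pthread demo.c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define UNINIT_BIT 0x4

typedef struct { long progress; } Work; // stand-in payload

static struct {
    Work buf[3];
    _Atomic int next;
} ctx = { .next = 2 | UNINIT_BIT };

static void *worker(void *arg)
{
    (void)arg;
    int w = 0; // the writer starts out owning slot 0
    for (long i = 1; i <= 50000000; i++) {
        ctx.buf[w].progress = i; // "do some work"
        w = atomic_exchange_explicit(&ctx.next, w, memory_order_acq_rel);
        w &= ~UNINIT_BIT;
    }
    return NULL;
}

static void *snooper(void *arg)
{
    (void)arg;
    int r = 1; // the reader starts out owning slot 1
    for (int i = 0; i < 10; i++) {
        usleep(100 * 1000); // snoop roughly every 100ms
        r = atomic_exchange_explicit(&ctx.next, r | UNINIT_BIT, memory_order_acq_rel);
        if (!(r & UNINIT_BIT))
            printf("progress: %ld\n", ctx.buf[r].progress);
    }
    return NULL;
}

int main(void)
{
    pthread_t wt, st;
    pthread_create(&wt, NULL, worker, NULL);
    pthread_create(&st, NULL, snooper, NULL);
    pthread_join(st, NULL);
    pthread_join(wt, NULL);
    return 0;
}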
Here are a couple of takeaways for me, for the time being:
Tags: [ c, concurrency ]