A lock-free single element generic queue

22 Mar 2023

Since I wanted to play around a bit with C11 atomics and lock-free algorithms, here's a fun little exercise: a worker thread that does some work and a snooper thread that periodically snoops on that work (perhaps to print some progress bar?).

The easiest and most practical way to accomplish this is probably just to use locks, since we're assuming the snooper thread will be slower than the worker, there won't be much contention either.

But as said, I wanted to mess around with lock-free and the first thing that came to mind is a queue, but with the following twists :

It will be a last in first out (LIFO) queue.
There will only be a single element on the queue (since the reader only wants to see latest work).
When the writer pushes onto a "full" queue, he will also pop off whatever was there (once again, reader only wants latest work).

First let's take a look at a simple (although incorrect) version of this just to gain a better understanding of the problem.

u/Vannaka420 has pointed out that the common name for this is multiple buffering.

Naive version

The naive version uses a 2 slot array and an atomic int which represents where the writer is working on.

typedef struct { /* ... */ } Work;
struct {
    Work buf[2];
    _Atomic int widx;
} ctx;

From the writer's end, he'll need to acquire the widx and write his work into that index:

// writer
int w = atomic_load(&ctx->widx);
ctx->buf[w] = ...; // write the work into the buffer

And from the reader's end, he'll need to toggle the widx (with an atomic fetch_xor) so that the writer switches where it's writing to and read the old index.

// reader
int r = atomic_fetch_xor(&ctx->widx, 0x1); // make the writer switch it's index
snapshot = ctx->buf[r]; // read the latest work

This of course, doesn't really work. There's 3 main defects here.

The atomic fetch xor only means that the writer will write into the new index on his next writes. But the old index is not safe to read because the writer might be "in-flight" writing to the old index.
If the reader is the first one to "pop" the queue, he'll see uninitialized values.
If the reader pops twice without the writer pushing anything in between, the reader will just see stale values.

What I am about to show was actually my 4th or 5th design, but out of all the ones I came up with, this was also the simplest (while still being correct).

Fixing the data race

Let's start first by fixing issue #1. Let's ignore issue #2 and #3 completely for now. Essentially what we need to fix #1 is two guarantees:

The writer can never write into a slot that might be being read.
The reader can never acquire a slot which might be being written to.

To make these guarantees, we'll need an array with 3 slots instead of two. One of the slots will be owned by the writer, one by the reader and the remaining one will be the "next" slot.

From the writer's POV, this "next" slot will be where he needs to write into next. Once the writer is done writing into his current slot, he needs to make his current slot available for the reader to grab at the same time grabbing the "next" slot for himself.

[ W | R | N ] // writer writes workN into index 0
[ N | R | W ] // he now gives up index 0 as the "next" slot while grabbing
              // index 2 where he will write workN+1 into

And from the reader's POV, he needs to grab the "next" slot, as it will be the latest work published by the writer while at the same time giving up his current read slot - making it available to the writer to grab next.

[ N | R | W ] // reader wakes up and wants to read a snapshot
[ R | N | W ] // reader grabs index 0, where workN was published
              // writer is still writing workN+1 into index 2
[ R | W | N ] // writer finished writing workN+1 into index 2 and now grabbed
              // index 1 to write workN+2 into

This seems solid so far (once again, ignoring the other two issues). The writer can no longer interfere with the reader, and the reader won't ever read anything that might be getting actively written to.

How can we implement this? With the atomic_exchange function! The atomic exchange function sets a variable to a new value while returning the old value atomically. Let's put it to use.

I'll assume that at startup, the local w will be initialized to 0, local r initialized to 1 and the atomic next will be initialized to the remaining slot, 2. The specifics don't matter much, as long as they don't overlap. Remember, we don't want the reader or writer to ever end up "owning" the same slot at the same time.

struct {
    Work buf[3];
    _Atomic int next;
} ctx;

// writer
buf[w] = ...; // uses his local `w` index to write into his slot
w = atomic_exchange(&ctx->next, w); // makes his `w` slot available as "next"
                                    // while grabbing the current "next slot"

// reader
r = atomic_exchange(&ctx->next, r); // makes his current read slot available as "next"
                                    // while grabbing the current "next slot" for reading
read = buf[r];

Fixing uninitialized reads

Now that we fixed issue #1, time to take a look at issue #2. If the reader "wakes up" before the writer has written anything, he'll end up grabbing index 2 (the next index) thinking that it's the writer's latest work.

We could, of course, initialize all the work slots to something "invalid" such as 0 or UINT_MAX depending on the application. So that the reader can be aware of when it's reading actual work vs initial state. However, I promised a generic version, which means not making such application specific assumptions.

So let's use one extra bit in the next variable to indicate whether it was written to or not. Since our indexes can only be 0, 1 and 2 (which fit within 2 bits) we'll use the 3rd bit for this purpose, 0b100 or 0x4 in hex. I'll give this a name, UNINIT_BIT, to make reading the code easier.

So now the next variable needs to be initialized to 2 but with the UNINIT_BIT set as well.

struct {
    Work buf[3];
    _Atomic int next;
} ctx = { .next = 2 | UNINIT_BIT }; /* next is initialized at startup to have
                                       the UNINIT_BIT set */

// reader
r = atomic_exchange(&ctx->next, r);
if (!(r & UNINIT_BIT)) { // if the UNINIT_BIT is set, we were too fast, the writer didn't write anything yet
    // so before reading anything, ensure that UNINIT_BIT is _not_ set
    read = buf[r];
}

This takes care of the reader, now what about the writer? The writer doesn't really care about the UNINIT_BIT at all. It doesn't matter to the writer whether the slot he got was "uninitialized" or not. So the writer simply needs to discard that bit.

// writer
buf[w] = ...; // write as usual
w = atomic_exchange(&ctx->next, w); // exchange as usual
w &= ~UNINIT_BIT; // remove UNINIT_BIT, we don't care about it.

This takes care of the 2nd issue.

Fixing reading stale entry

The 3rd issue was what happens when the reader "pops" twice, he'll just end up acquiring what he just gave up.

[ R | N | W ] // reader wakes up
[ N | R | W ] // takes the "next" slot
[ R | N | W ] // reader wakes up again, but the writer made no progress
              // and thus the reader ended up back on a stale entry

We solved the uninitialized issue by using one extra bit, perhaps we can now use the 4th bit to solve this? Turns out, 3 bits are all that's needed. All we really need to do is extend the idea of an uninitialized slot. When a reader gives up a slot, we can count that as "uninitialized" as well!

Effectively, what this means is that the only change that needs to happen is on the reader's side, when doing the exchange, the reader needs to set the UNINIT_BIT as well.

// reader
r = atomic_exchange(&ctx->next, r | UNINIT_BIT); // mark the slot we're giving up as "uninitialized"
if (r & UNINIT_BIT) // either uninitialized, or stale slot
    // don't read anything
else
   buf[r]; // otherwise safe to read

Since the writer always clears the UNINIT_BIT, we can be sure that if the UNINIT_BIT is not set, we're reading something that was left by the writer. No other change needs to happen on the writer's code.

Memory ordering

So far, I've carefully avoided mentioning any specific memory order. Everything above assumes a sequentially consistent memory ordering, which is what you get by default using the C11 atomics.

However, sequentially consistent memory ordering might be expensive, especially on relaxed hardware such as ARM. Can we use a less strict ordering? This is something that I've initially gotten subtly wrong.

Initially I was using release on the writer's end and acquire on the reader's end, but running the code with clang and TSan showed a data race. Funnily enough, this data race goes away when I enable -O2 (which inlines the snooper function into main()). For a full day, I ping-ponged back and forth between "this is a TSan bug" vs "there's actually a bug in the code".

At the end the conclusion that I finally came to was that TSan was correct and the release + acquire memory ordering wasn't strict enough. This becomes apparent when you unroll the writer loop:

buf[w] = ; // write1
w = xchg(&next, w, release) & ~UNINIT_BIT; // xchg1
buf[w] = ; // write2
w = xchg(&next, w, release) & ~UNINIT_BIT; // xchg2

The release memory order says that anything that happens before the exchange, really does happen before it . However, it makes no guarantees about anything that happens after it. In other words, it should be fair game to make write2 visible before xchg1.

buf[w] = ; // write1
buf[w] = ; // write2 .. OH NO!
w = xchg(&next, w, release) & ~UNINIT_BIT; // xchg1
w = xchg(&next, w, release) & ~UNINIT_BIT; // xchg2

This still preserves the guarantee that write1 happens before xchg1 and write2 happens before xchg2, but this is clearly not what we want to have happen.

Using memory_order_acq_rel (on both the writer and reader's end) guarantees that the write2 cannot become visible before xchg1. It also shuts up TSan, so that's a good indication that we're on the right path.

If all these sounds complete voodoo to you, don't worry you're not alone. Even after spending a significant time thinking about this, I'm still not 100% certain whether my explanation was actually what was going on or not. But it's better to be safe than sorry, so I've gone with acq_rel.

Keep in mind that the guarantees made by memory_order_release only applies if the other thread uses memory_order_acquire on it's end too!

Conclusion

And there you have it, a lock-free generic single element LIFO queue with just 3bits of atomics and about 4 lines of code. Here's a simple demonstration that you can try out at home. Is this practical? Probably not, and I doubt it's anything novel either. But regardless it's been a really fun exercise.

Here's a couple takeaways for me, for the time being:

C11 atomics are neat, but way too verbose for my taste.
C11 thread support is rather poor as of now. TSan doesn't even support it.
Lock-free algorithms not easy, but might be worthwhile because it really forces you to think about synchronization between threads and discourages building up a massive web of brittle data-sharing with the "Oh, I'll just wrap it around a lock" mindset.
Memory ordering is really tricky.

RSS Feed