nix-compress: fast LZW decoding via unrolled linked list

31 Jul 2024

A couple of months ago, I took an interest in the LZW compression scheme. It's one of the few compression schemes that had seen practical use while still being relatively simple. I say "had" because nowadays I cannot think of a single use-case where LZW would be the best pick. From speed, to compression ratio, to ease of implementation, algorithms making much better trade-offs exist.

The only places LZW is still used are legacy formats such as gif or .Z compressed files. The focus of this article is on the latter - giving an overview of some of the high level choices made while writing nix-compress, my clone of the ancient compress(1) tool.

One interesting part about nix-compress is its decompressor: it's very fast while keeping memory usage bounded to a modest amount, using an unrolled linked list for the decompressor dictionary. The rest of it is fairly standard stuff. The code can be found in the codeberg repo if you'd like to follow along.

A brief rundown

One intuitive way to compress data is to use a dictionary for common substrings. E.g. instead of sending over "there", if we can send an index into a dictionary then we might save some bits. The problem is that hardcoded dictionaries are not general: a dictionary of common English words would not be useful when trying to compress binary data. But making a dictionary tailored to the specific data requires somehow sending the dictionary over to the decompressor as well, which takes up bits.

LZ-based schemes - of which LZW is a derivative - solve this problem elegantly by sending the dictionary over implicitly within the data itself. In the case of LZW, our dictionary is first initialized with all the single byte values, e.g. the string "a" gets a code of 97, corresponding to the ascii value of 'a'. The compressor will now try to match the longest possible substring within the dictionary, and when it encounters a new substring it will append it to the dictionary. Describing LZW in full detail is out of scope for this article, and there already exist good resources on it.

The key observation here is that a "new substring" only occurs when you have a string that already exists in the dictionary followed by a new byte. This allows us to compactly store our dictionary strings as a fixed-size <dict_index_for_prefix, new_byte> pair instead of variable sized strings. For example, initially when you only have single byte strings in the dictionary, any two byte string will be a new substring and can be described by an index into the dictionary for the prefix (i.e. the first byte) followed by the 2nd byte.
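
As a concrete sketch (the type and field names here are mine, just for illustration), such an entry could be as small as:

typedef struct {
    uint16_t prefix;   // dictionary index of the prefix string
    uint8_t  suffix;   // the new byte appended to that prefix
} DictEntry;           // entries 0-255 are the implicit single byte strings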

Encoding

For encoding, we need a string -> dict_index map. More specifically we need the following two operations: looking up a <prefix_index, new_byte> pair to see whether that string is already in the dictionary (and if so, getting its index), and inserting a new <prefix_index, new_byte> pair under the next free index.

Notice that the compressor does not need to care about the original string itself, the prefix-byte pair is sufficient for identification, avoiding the need to store variable length strings inside the dictionary. The .Z format limits dictionary size to 16 bits (2^16 - 1 entries) - which means dictionary indexes will take 16 bits maximum and the prefix-byte pair 24 bits (16 + 8). When packed tightly, we can get away with using just 5 bytes for each dictionary entry.

My choice for the dictionary was to go with an open-addressed hashtable using double hashing for collision resolution (see "MSI" hashtable). I gave the table a size of 2^17, which gives it a load factor of 50% at worst. Tightly packed, this costs only 640 KiB. Doubling the table size results in minimal gain. The ncompress implementation uses a similar strategy but with a different double hashing scheme for collision resolution. In my non-scientific benchmarks, they both compare more or less equal, with one edging out the other depending on the input.
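
To make the idea concrete, here's a sketch of what such a lookup might look like (illustrative only, not the actual nix-compress code, and not tightly packed like the real table - the key is stored in a full uint32_t for simplicity):

#include <stdint.h>

#define TABLE_BITS 17
#define TABLE_SIZE (1 << TABLE_BITS)
#define EMPTY      0xFFFFFFFFu

static uint32_t keys[TABLE_SIZE];   // packed key: (prefix << 8) | byte, initialized to EMPTY
static uint16_t codes[TABLE_SIZE];  // value: the code assigned to that string

// Returns the code for the string <prefix, byte>, or -1 if not present.
static int32_t lookup(uint16_t prefix, uint8_t byte)
{
    uint32_t key  = ((uint32_t)prefix << 8) | byte;
    uint32_t hash = key * 2654435761u;                  // multiplicative hash
    uint32_t step = (hash >> (32 - TABLE_BITS)) | 1;    // odd step -> visits every slot
    for (uint32_t i = hash & (TABLE_SIZE - 1);; i = (i + step) & (TABLE_SIZE - 1)) {
        if (keys[i] == EMPTY) return -1;                // free slot: string not in dictionary
        if (keys[i] == key)   return codes[i];          // found it
    }
}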

Another option was to use a large 2^24-sized array and index directly into it, making it a single lookup as opposed to a hashtable which may require multiple lookups to resolve collisions. It'd take about 32 MiB, not cheap but not out of the question either on most desktop environments - many modern LZ77 based schemes, for example, use even more memory for their lookback window. In my tests though, this resulted in pretty much the same performance as the hashtable. perf indicates backend stalls, which usually implies cache misses causing the cpu to idle away waiting for memory to arrive. I'd be interested to know if there are any other schemes that achieve noticeably better performance - but for now the open-addressed hashtable seems to be the front runner.
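
For comparison, the direct-index variant would be something along these lines (again just a sketch):

#include <stdint.h>

// One slot per possible <prefix, byte> key: 2^24 entries * 2 bytes == 32 MiB.
// (Some sentinel value is still needed to mark slots that have no code yet.)
static uint16_t direct[1 << 24];

static uint16_t lookup_direct(uint16_t prefix, uint8_t byte)
{
    return direct[((uint32_t)prefix << 8) | byte];      // one load, no probing
}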

Decoding: naive table

Decoding is the inverse of encoding. We'll read dictionary indexes and will need to output the string at that index. In other words, we need a dict_index -> string map. The naive approach would be to use an array of explicitly sized strings (pointer + length) and index directly into it.

typedef struct { char *string; ptrdiff_t len; } String;
String table[1 << 16];

Note that the string cannot be nul-terminated since we need to be able to decode embedded nul-bytes too (besides, nul-terminated strings are a bad idea in general anyway). Each string in the table will be freshly allocated, which should raise some eyebrows: what's the worst case memory usage here? Short answer: 2GiB.

The long answer is that in order to trigger the worst case, each dictionary entry needs to be the largest possible. And in order to do that we need each new entry to use the previous entry as its prefix, followed by a new byte. Which means each string is +1 the size of the previous; this gives us a nice arithmetic progression where each term increases by 1. Using the sum formula we get: (2^16 * (1 + 2^16)) / 2 == ~2GiB. (Note that this is slightly larger than the actual worst case due to not accounting for the first 256 pre-filled codes.)

2GiB is not an impossible ask on modern desktop computers, but it's also unnecessarily large considering there are ways to do it with a significantly smaller memory footprint (as low as 192KiB), outlined below.

Index into output buffer

One other observation is that since the dictionary is built from the input data itself, our substrings must also exist in the output data. This means we can represent the dictionary as an index + length pair into the output buffer.

typedef struct { uint16_t index, length; } String;
String table[1 << 16];

// extracting substring of index `i`
char *substring = &output_buffer[table[i].index];
int len = table[i].length;

This takes only 16 + 16 == 32 bits per entry and thus 256KiB in total. This IMO is the ideal choice when your output data will be held entirely in memory, e.g when decoding a gif frame.

Unfortunately, in our case the decompressed file size may very well exceed the amount of total RAM available. And so we need to periodically flush the output buffer in order to not run out of memory. This makes the approach unsuitable for nix-compress.

Linked list

Going back to our prefix-byte representation, we can represent the string inside the dictionary in the same way - effectively creating a linked list where the "prefix index" acts as the "next pointer". For example the substring "cab" at index 512 may be decoded in the following way:
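
(The intermediate index 300 below is arbitrary, picked just for illustration.)

index 512: <prefix: 300, byte: 'b'>
index 300: <prefix:  99, byte: 'a'>
index  99: 'c' (below 256, so it has no prefix)

Walking from 512 back to 99 yields 'b', 'a', 'c', which reversed is "cab".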

The entire dictionary can be represented in 3 bytes per entry, 192KiB in total. And unlike the index-into-output-buffer technique, we can flush the output buffer whenever we want since the dictionary is now self contained. This seems to be the standard choice for implementing a streaming LZW decoder in almost all the implementations I've looked at.
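
As a sketch (illustrative, not the exact nix-compress code), the node and the backwards walk look like this:

#include <stdint.h>

typedef struct { uint16_t prefix; uint8_t byte; } Node;  // 3 bytes of payload (packed in practice)
static Node table[1 << 16];                              // 192KiB total

// Decode the string at `code` by walking prefix links back to a literal byte.
static int decode(uint16_t code, uint8_t *reversed)
{
    int n = 0;
    while (code >= 256) {            // codes below 256 are the literal bytes
        reversed[n++] = table[code].byte;
        code = table[code].prefix;   // serially dependent load
    }
    reversed[n++] = (uint8_t)code;
    return n;                        // caller outputs reversed[n-1] .. reversed[0]
}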

Unrolled linked list

One big problem with the linked list method is that in the worst case, decoding a string will require traversing around 2^16 nodes. Worse, each of these iterations is serially dependent on the previous one - in order to fetch the next node, you need to load the current one. Cache misses are often cited as the primary reason for linked lists being slow, but 192KiB should comfortably fit into the L2 of most modern machines.

Since these hops are expensive, one simple idea is to make them more "worth it" by putting more data into each node, i.e. an unrolled linked list. So in our case, instead of <prefix, new_byte> we can use <prefix, suffix_str[N]>, where suffix_str will hold not a single new byte but multiple of them.

If N were 2, the same "cab" substring would be decoded in the following manner (notice it takes 2 hops as opposed to 3):
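
(Indexes again arbitrary, for illustration.)

index 512: <prefix: 300, suffix: "b">
index 300: <prefix: nil, suffix: "ca">

Walking from 512 yields "b" then "ca"; reversing the node order gives "cab".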

With N being 32, we've taken the worst case of ~65k hops and reduced it down to ~4k. In practice, this can end up being 8x faster compared to the traditional linked list approach. The cost is that the dictionary now takes more memory, roughly 2MiB with N being 32. A worthwhile trade-off in my books.

Implementation details

While the above gives a high level overview of the algorithm, there are a few implementation details I've glossed over. Below are some of the interesting bits.

Node representation

If we have a linked list, we need some way to represent a "nil" prefix. In the traditional linked list approach, we didn't need an explicit nil index: if our current index is below 256, we know that it doesn't have a prefix. And we know that every substring's last node will be below 256 since each node only holds a single new suffix byte.

In the unrolled list, it's perfectly valid to have a node such as <prefix: nil, suffix: "ca">, so we need some explicit way to encode a nil prefix. 0 cannot be used since a 0 index is a valid prefix index. However, UINT16_MAX can never be a prefix to anything else - it's the very last index, so the dictionary is already full by the time it gets assigned - which means we can use it as the nil marker.

We will also need to track the length of the string explicitly (reminder: nul-terminated strings cannot be used here). In nix-compress, the nodes in the decoding dictionary are structured like this:

#define LZW_DEC_NODE_SIZE 32   // can be anything between [4, 127]
struct {
    uint8_t str[LZW_DEC_NODE_SIZE - 1];
    uint8_t tag;
};

This may seem a bit odd. Where is the string length or the prefix index? The string length is in the lowest 7 bits of tag. The highest bit of tag tells us what kind of node we have: if it's 0 then it's a leaf node with no prefix, and if it's 1 then the prefix index is stored in the last 2 bytes of str. The rest of str is the suffix string.
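
Unpacking a node might look roughly like this (a sketch based on the layout just described; node points at a dictionary entry, and the byte order of the packed prefix index is my assumption):

#include <stdint.h>

#define LZW_DEC_NODE_SIZE 32
typedef struct {
    uint8_t str[LZW_DEC_NODE_SIZE - 1];
    uint8_t tag;
} Node;

int len      = node->tag & 0x7F;                 // low 7 bits: suffix length
int has_prev = node->tag & 0x80;                 // high bit: is there a prefix?
uint16_t prefix = 0;
if (has_prev)                                    // prefix index lives in the last 2 bytes of str
    prefix = (uint16_t)(node->str[LZW_DEC_NODE_SIZE - 3]
           | (node->str[LZW_DEC_NODE_SIZE - 2] << 8));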

In hindsight, this might have been a bit overcomplicated, but it does allow calculating the size of the dictionary more easily since the LZW_DEC_NODE_SIZE macro configures the total node size, not just the embedded string. It also allows a cool optimization that makes decoding 2x faster, more on that later.

Inserting a new substring

Inserting a new substring is also a bit tricky compared to the traditional list approach. Previously we'd just insert the new byte, set a prefix and be done. But doing that with an unrolled list doesn't make sense: if we're trying to insert "a" with a prefix "c" then it doesn't make sense to do it like this:

nil <- "c" <- "a"

We'd want to condense it down to a single node:

nil <- "ca"

The way I've done it is to check if the prefix node still has space for one more byte in its string. If yes, then we copy over the prefix string into our node (e.g. "c"), append the new byte (e.g. "c" + "a" => "ca") and also copy over the prefix index of the prefix node (e.g. the "c" node had a prefix of nil, so our new node inherits it). If no, then we just use the prefix node as our prefix and insert a 1 byte suffix string.
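
In code, roughly (a sketch reusing the Node typedef and field layout from the earlier sketch; the real nix-compress code differs in details):

#include <string.h>

static void dict_insert(Node *table, uint16_t new_code, uint16_t prefix_code, uint8_t byte)
{
    Node *prev = &table[prefix_code];
    Node *node = &table[new_code];
    int   plen = prev->tag & 0x7F;
    // a node that stores a prefix index sacrifices the last 2 bytes of str for it
    int   cap  = (prev->tag & 0x80) ? LZW_DEC_NODE_SIZE - 3 : LZW_DEC_NODE_SIZE - 1;

    if (plen < cap) {
        // prefix node has room: condense into one node by copying its suffix
        // string, appending the new byte and inheriting its prefix (possibly nil)
        memcpy(node->str, prev->str, plen);
        node->str[plen] = byte;
        node->tag = (prev->tag & 0x80) | (uint8_t)(plen + 1);
        if (prev->tag & 0x80) {
            node->str[LZW_DEC_NODE_SIZE - 3] = prev->str[LZW_DEC_NODE_SIZE - 3];
            node->str[LZW_DEC_NODE_SIZE - 2] = prev->str[LZW_DEC_NODE_SIZE - 2];
        }
    } else {
        // prefix node is full: start a new node with a 1 byte suffix string
        node->str[0] = byte;
        node->str[LZW_DEC_NODE_SIZE - 3] = (uint8_t)(prefix_code & 0xFF);
        node->str[LZW_DEC_NODE_SIZE - 2] = (uint8_t)(prefix_code >> 8);
        node->tag = 0x80 | 1;
    }
}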

Reversing the output

Because we're walking the nodes backwards, from suffix to prefix, we'll end up with a "reversed" string. In nix-compress I deal with this by not outputting the suffix right away; instead, while walking, I push the nodes' indices onto a stack. Once done, the nodes in the stack are in the proper order, and I can just keep popping from the top and copying each node's string like normal. No explicit "reversal" step is needed.
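
Sketched out (again reusing the node layout from earlier; start_code, table, out and out_len are assumed from context):

#include <string.h>

// Walk from start_code down to the first node, remembering each index.
uint16_t stack[1 << 16];
int top = 0;
uint16_t code = start_code;
for (;;) {
    stack[top++] = code;
    Node *node = &table[code];
    if (!(node->tag & 0x80))        // leaf node: no prefix, we're done walking
        break;
    code = (uint16_t)(node->str[LZW_DEC_NODE_SIZE - 3]
         | (node->str[LZW_DEC_NODE_SIZE - 2] << 8));
}

// Popping from the top yields the nodes prefix-first, so the suffix strings
// can be copied to the output in their natural order.
while (top > 0) {
    Node *node = &table[stack[--top]];
    int len = node->tag & 0x7F;
    memcpy(&out[out_len], node->str, len);
    out_len += len;
}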

Faster string copy

Once we've got a stack of nodes in the correct order, it's time to start copying the suffix strings over to the output buffer. A straightforward implementation might look like this:

int length = node->tag & 0x7F; // mask out the low 7 bits
for (int i = 0; i < length; ++i)
    out[out_len++] = node->str[i];

We assume out has enough space and out_len is the current length of the output buffer. (If there wasn't enough space, we can flush the buffer to make space). Nothing interesting is happening here, except that in nix-compress the copy loop looks like this:

int length = node->tag & 0x7F; // mask out the low 7 bits
for (int i = 0; i < LZW_DEC_NODE_SIZE; ++i)  // copy the whole node over
    out[out_len + i] = ((uint8_t *)node)[i];
out_len += length;

Instead of copying just the length of the substring, it copies over the entire node (along with the tag and prefix index at the end). However, we don't want to output whatever trailing garbage might be at the end, so we advance out_len only by length; whatever garbage got copied past that is effectively discarded since it lies outside of out_len. This speeds up the decoding by slightly more than 2x.

This may seem counterproductive: how can copying more bytes than necessary be faster? But modern CPUs are counter-intuitive; if the task is cheap to compute - and copying a mere 32 bytes is cheap - then it's faster to do it unconditionally than to introduce conditionals to avoid that work. Moreover, code that is easier for the CPU is generally also easier for the compiler to optimize. And in our case, the unconditional 32 byte copy can be optimized down to just a few AVX instructions.

But why doesn't the compiler do this for the straightforward code? Because it doesn't know that it's safe to overwrite the trailing parts of out. Compilers must be conservative: a compiler cannot perform an optimization if there's even a 1% chance of it producing an incorrect result; it needs to prove that the optimization cannot change the behavior of a valid program - otherwise it's a bug in the compiler.

And bugs...

These are not part of the LZW scheme but rather issues in the .Z format - which, mind you, is entirely undocumented and needs to be reverse engineered from the original compress implementation.

The first bug is when shifting gears (incrementing the code width, emitting a clear code): instead of writing out the n remaining bits, it dumps n bytes of the bit buffer entirely onto the output. The comment says something about "writing whole buffer so that the other side can discover size increase", but that's complete nonsense. There's no need to dump the whole buffer; the decoder can just read n bits off the stream the same way the encoder writes n bits onto it.

Unfortunately, pretty much every .Z implementation has copied over this buggy behavior. And so in order to be compatible, in nix-compress I also follow this buggy behavior by default. There's a branch which adds an --unaligned flag to avoid this bug, but files created with this flag won't be compatible with other .Z decompressors, so use it at your own peril.

The second bug is when using -b9 to limit the code width to 9 bits: instead of "check then increment", ncompress does "increment then check", causing it to start outputting 10 bit codes even though it's supposed to be limited to 9.
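
Schematically, the difference between the two orderings looks like this (an illustration of the logic only, not the actual ncompress source):

// check then increment: the code width can never exceed max_bits
if (code_width < max_bits)
    code_width++;

// increment then check: a code wider than max_bits may already have
// been emitted by the time the clamp happens
code_width++;
if (code_width > max_bits)
    code_width = max_bits;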

This has been buggy since almost forever. In fact, the produced compressed file cannot even be decoded using the same version of ncompress itself. This makes me think that no one really used (let alone tested) -b9, and thus in nix-compress I don't follow this buggy behavior by default. If you use -b9 in nix-compress it will correctly limit the code width to 9 bits. A --compat9 compatibility flag is provided if you need to emulate the buggy behavior or decompress files created with the bug.

Closing thoughts

nix-compress was a fun little toy project to tinker around with. The CLI interface mostly conforms to the POSIX specification minus the interactive parts. There's also a working (but not polished) Windows branch using the native win32 api.

The encoder and decoder are isolated into a minimalist single file library: lzw.c. Documentation is included in the source via triple slash comments (e.g. /// This is api doc). Though, as I've said in the introduction, I don't think anyone should be using it given that better alternatives exist.


It will sometimes randomly fail to copy metadata (mtime etc.) from the input to the output file for some reason I haven't figured out yet.


Tags: [ c, compression, performance ]


