[Toybox] Performance on compress.c

Thu Jul 9 16:49:42 PDT 2015

On 07/09/2015 05:41 AM, Yeongdeok Suh wrote:
> I discovered slow decompression while testing gunzip.
> I want to discuss about this improvement.

Ok...

> I gunzipped an 15MB *.tar.gz file on the ARM board, and it takes around
> 17 sec.

Not knowing which arm board you're using, or how other gunzip
implementations perform on it, this tells me very little.

How much faster were the other inflate implementations you tried on that
hardware?

> It's too slow, so I want to improve this for ARM based environment.

Patches welcome.

> (x86 was okay)

I don't know what okay means here. Did the problem with the algorithm
not manifest on x86, or was the x86 system just so much faster you
didn't care?

> As I debugged the src,
> below while loop takes too much time when decoded huffman symbol is over
> 256.

The while loop is block copying a previous chunk of data. Symbols over
256 are control symbols rather than literals, one of the things they do
is repeat earlier sequences.

> I'm looking for the better methods but it's not easy.
> Could you share your opinion?
> 
> (toys/pending/compress.c)
> --Line:
> 381-------------------------------------------------------------------------------------
>         // Use huffman tables to decode block of compressed symbols
>       for (;;) {
>         int sym = huff_and_puff(bb, lithuff);
> 
>         // Literal?
>         if (sym < 256) data_crc(sym);
> 
>         // Copy range?
>         else if (sym > 256) {
>           int len, dist;
> 
>           sym -= 257;
>           len = TT.lenbase[sym] + bitbuf_get(bb, TT.lenbits[sym]);
>           sym = huff_and_puff(bb, disthuff);
>           dist = TT.distbase[sym] + bitbuf_get(bb, TT.distbits[sym]);
>           sym = TT.pos & 32767;
> 
>     //---------Slow! O(n^2)--------------------------
>           while (len--) data_crc(TT.data[(TT.pos-dist) & 32767]);

How is it O(n^2)? What's n here? We consume a "repeat earlier output"
sequence, and then do that...

The data_crc() function is misnamed, it's basically output_byte(). It
copies the byte to the output buffer, advances the output buffer pointer
(increments TT.pos), does the crc32 update (in 32k chunks), and
determines when the buffer is full and thus the need to write() the
also determines when the buffer is full and thus you need to write() the
next 32k of data.

Since data_crc() increments TT.pos internally, the rounded
TT.data[TT.pos-dist] that we're using for input advances once with each
output byte. So it's basically doing a bytewise memcpy(), which isn't
ideal but uses the same code for both codepaths. The output a literal
codepath needs to deal with single bytes, using memcpy() for single
bytes would be just as bad the other way.

Inlined (which I assume the compiler is doing), the codepath looks like:

  while (len--) {
    int sym = TT.data[(TT.pos-dist) & 32767];

    TT.data[TT.pos++ & 32767] = sym;

    if (!(TT.pos & 32767)) {
      xwrite(TT.outfd, TT.data, 32768);
      if (TT.crcfunc) TT.crcfunc(TT.data, 32768);
    }
  }

So inside the loop you you have an array read, an array write, and an
if() that only triggers every 32k bytes.

It's not as good as memcpy (because it's copying bytes and we
recalculate both addresses each time to avoid extra if() statements we'd
need to wrap both the from and the to pointer otherwise), but the math
is simple and parallelizeable for multicore systems, and it's all L1
cache local.

(I actually wondered if the crc update should be in the same loop to
avoid re-faulting those L1 cache lines for a second pass, but the
function pointer lets us substitute different crc functions and that
turns out to matter because gzip and zlib and such don't quite agree on
what they should be doing; one function pointer traversal per 32k is a
rounding error, one per loop could be significant, can't say I've
benched it though.)

The masking constant (32767) should stick around in a register as a
literal if the compiler is doing _any_ optimization, and the TT.pos &
mask value can be reused for the assignment and the test if the compiler
is doing even basic optimizations. (TT.pos is global, but not volatile.
If the compiler is that dumb I can have "int pos = TT.pos++ & 32767;" at
the start of data_crc)

If I make a "block" specific output function, it would have two memcpy
calls (one before the flush and the one after if there's more to write;
the maximum repeat size is small enough it can't fill the buffer up
twice in the same write). But I'm reluctant to have code doing the same
thing in two places. (Ok, I already _do_, the flush at the end of
inflate is the same flush as in data_crc() but that has a variable
length and the other is hardwired to full buffer size, it's small enough
and different enough that merging them isn't a clear win, plus adding a
test to a tight inner loop is something to avoid...)

Anyway, that's what _I_ think this code is doing, could you explain to
me what you think the code is doing?

Thanks,

Rob

 1436485782.0