[Toybox] md5sum cleanup

Fri May 16 02:34:47 PDT 2014

On 05/14/14 23:56, Daniel Verkamp wrote:
> Here's a quick cleanup of md5sum. Executive summary: smaller and faster.
> 
> On my machine, for a 2.2 GB file of random bytes, the timings with
> warm cache are:
> toybox before:     11.4 seconds
> toybox after:      8.3 seconds
> GNU md5sum:        3.9 seconds
> openssl dgst -md5: 3.5 seconds
> 
> This is clearly better than before (3x openssl), but still slow (2x openssl).
> 
> I suspect there is more low-hanging fruit to be had by eliminating the
> memcpy in hash_update (maybe not too much - hash_update accounts for
> about 4% of total runtime versus 92% for md5_transform according to
> perf - but this would also help sha1sum).
> 
> make bloatcheck on x86_64 gcc 4.8.2 -Os:
> name                                           old       new      delta
> -----------------------------------------------------------------------
>  md5rot                                          0        64         64
>  md5_transform                                 365       223       -142
> -----------------------------------------------------------------------
>                                                                     -78 total

Indeed.

> Rationale for the changes:
> 
> Move definition of 'rol' up so it can be used in md5_transform. This
> is purely cosmetic; it expands to exactly the same code.
> 
> Put rotation counts in a lookup table instead of calculating them on
> the fly. This is mostly a wash size-wise, +5 bytes total, but
> worthwhile for readability and speed.

I tried removing the repetition from the table and indexing it by
(i&3)+((i>>2)&~3) and an md5sum of a dvd image I had lying around when
from 120 seconds to 132 seconds. :P

Sigh. I guess it's a form of loop unrolling. I wasn't trying for the
fastest implementation out there (I've seen ones with inline assembly),
I was going for "readable". I guess if your idea of readable includes
that much repetition...

> Instead of accessing the state array using a rotating index (the
> variable formerly known as 'a'), access the state with constant
> offsets and rotate the contents of the array instead.  This is the big
> win - it eliminates all the crazy memory addressing math inside the
> loop.

Ah yes, crazy that I didn't hide the math in a macro to generate the index.

I guess x86-64 can keep it all in registers so the extra assignments are
cheaper than the extra math operations that avoid the need to move the data.

*shrug* Isn't always obvious to me what "simple" is for something like
this. Not gonna argue with actual measurements...

Rob

 1400232887.0