[Toybox] md5sum cleanup
Rob Landley
rob at landley.net
Fri May 16 02:34:47 PDT 2014
On 05/14/14 23:56, Daniel Verkamp wrote:
> Here's a quick cleanup of md5sum. Executive summary: smaller and faster.
>
> On my machine, for a 2.2 GB file of random bytes, the timings with
> warm cache are:
> toybox before: 11.4 seconds
> toybox after: 8.3 seconds
> GNU md5sum: 3.9 seconds
> openssl dgst -md5: 3.5 seconds
>
> This is clearly better than before (3x openssl), but still slow (2x openssl).
>
> I suspect there is more low-hanging fruit to be had by eliminating the
> memcpy in hash_update (maybe not too much - hash_update accounts for
> about 4% of total runtime versus 92% for md5_transform according to
> perf - but this would also help sha1sum).
>
> make bloatcheck on x86_64 gcc 4.8.2 -Os:
> name old new delta
> -----------------------------------------------------------------------
> md5rot 0 64 64
> md5_transform 365 223 -142
> -----------------------------------------------------------------------
> -78 total
Indeed.
> Rationale for the changes:
>
> Move definition of 'rol' up so it can be used in md5_transform. This
> is purely cosmetic; it expands to exactly the same code.
>
> Put rotation counts in a lookup table instead of calculating them on
> the fly. This is mostly a wash size-wise, +5 bytes total, but
> worthwhile for readability and speed.
I tried removing the repetition from the table and indexing it by
(i&3)+((i>>2)&~3) and an md5sum of a dvd image I had lying around when
from 120 seconds to 132 seconds. :P
Sigh. I guess it's a form of loop unrolling. I wasn't trying for the
fastest implementation out there (I've seen ones with inline assembly),
I was going for "readable". I guess if your idea of readable includes
that much repetition...
> Instead of accessing the state array using a rotating index (the
> variable formerly known as 'a'), access the state with constant
> offsets and rotate the contents of the array instead. This is the big
> win - it eliminates all the crazy memory addressing math inside the
> loop.
Ah yes, crazy that I didn't hide the math in a macro to generate the index.
I guess x86-64 can keep it all in registers so the extra assignments are
cheaper than the extra math operations that avoid the need to move the data.
*shrug* Isn't always obvious to me what "simple" is for something like
this. Not gonna argue with actual measurements...
Rob
1400232887.0
More information about the Toybox
mailing list