[Toybox] md5sum cleanup

Sun Jun 1 10:00:08 PDT 2014

On 05/15/14 11:50, Ivo van Poorten wrote:
> On Wed, 14 May 2014 21:56:13 -0700 Daniel Verkamp <daniel at drv.nu> wrote:
>> Here's a quick cleanup of md5sum. Executive summary: smaller and
>> faster.
> 
> Nice! It got me excited to see if I could get it faster.

Bigger and faster.

Sorry for the delay evaluating this, but A) it rolls up the previous
patch I already applied so I have to separate it out to evaluate it, B)
#define COMMON isn't going in.

I _really_ dislike having a macro like that repeat bits of code. I see
why you did it, but ew. I'm also confused why the do { XXX } while (0);
wrapper is needed when the only users aren't in if/else blocks? The for
has their own curly brackets (for code that's all on the same line...)

So backing up to a higher level of evaluation: is indeed faster, my
problem is balancing the goals of the project. Toybox should be simple,
correct, fast, and small. More or less in that order. We can sacrifice
one for enough of another, but they're weighted.

What simple _means_ can be hard to work out in a given context, but one
of the things I think of is somebody just learning C, using this code as
an example to figure out how common things are implemented. The md5sum
code I checked in was designed with that in mind: there's as little as
possible of it to read through, and you can see what it's _doing_.
There's a complicated "wheels within wheels" nature to the thing, but
that's the task it's trying to achieve (crypto's always been like that,
going back to the Enigma machines).

When I first saw the tables in RFC 1321 I had no idea what they _meant_
and had to poke at them for a while to tease out meaning. You at least
have a comment about what each row is for, which is good...

That said, simple isn't the only criteria. If it's not correct you can't
use it, and if it's not fast enough it's useless for the big machines
that care about performance (possibly as transactions per second in a
cluster), and it's useless for the small battery powered devices that
"race to quiescence" so they can power down and stop wasting battery.

I see what you're doing here, and now that it's been pointed out how
much slower it is than other implementations I agree speeding it up is
good. (That's why I applied the first patch, even though that table
repeating each stanza 4 times is just painful. But adding math to the
dereference to repeat table sections added enough cycles to the inner
loop to slow it back down to about where it was before...)

> Times in seconds on my machine (IA32 Sempron 1.66MHz):

The sempron is AMD, and it's 64 bit. Saying "IA32" for a sempron is
wrong in two distinct ways.

Intel Architecture 32 bit is Intel's way of saying "No, a _Bud_ lite",
it was a new acronym they came up with in the 90's when Athlons started
kicking their ass to say "mine! mine! mine!" about x86. The hilarious
part is that IA64 is the Itanic instruction set, when Intel was finally
forced more or less at gunpoint to implement x86-64 (they were about to
lose _Dell_ as a customer) they ripped out the iommu to make _sure_ it
was slightly incompatible, and called it "IA-32e" (which is an insane
name for a 64 bit instruction set that already had a name).

The shenanigans were enough to piss off Linux:

http://lkml.iu.edu//hypermail/linux/kernel/0402.2/1847.html

*shrug* Fun little historical footnote...

> cur-hg      8.6
> daniel      6.7
> ivo         4.9
> md5sum      2.8
> openssl     2.7

The "md5sum" there is your host's existing command line version?

I strongly suspect openssl has inline assembly. I'm not doing that here,
I want it to work on all the different types of processors Linux supports.

(I remember busybox had a config knob for various speeds in their
config. Never looked to see what they were doing with it, but I remember
thinking that having a simple version buried under #ifdefs that triple
the size of the code isn't really helping matters...)

> I removed the conditionals out of the inner loop, introduced another
> LUT, made the LUTs align.

One of the reasons I care about speeding it up is I'd vaguely like to
implement rsync at some point, and the _current_ rsync implementation is
using md5sum instead of md4sum. (Presumably just because everybody's
gone "md4sum isn't viable for crypto anymore, therefore you can't use it
for any _other_ purpose, alas alack woe betide etc, etc...")

That's a performance critical use of md5sum, and being 1/3 the speed of
the implementation rsync is currently using would make our performance
_suck_. But...

> bloatcheck:
> 
> -----------------------------------------------------------------------
>  intab                                           0       256        256
>  md5rot                                          0       256        256
>  md5_transform                                 366       365         -1
> -----------------------------------------------------------------------
>                                                                     511
> 
> The tables can be downsized to 64 bytes, but it'll make it a bit
> slower, especially on architectures where non-aligned reads are slower.

Um, example?

> The patch is relative to current hg and includes Daniel's changes.

That's the problem, I evaluated daniel's changes separately and applied
them. (After fiddling with shrinking the table and finding most of the
speed gains went away again.)

Could you get me a patch on top of what's currently there?

Thanks,

Rob