[Toybox] xmemcmp()

Fri Jan 6 18:09:44 PST 2023

Note to self: remember to hit "send".

On 1/5/23 19:10, enh wrote:
>     > even though one _could_ write a byte-by-byte memcmp(), the standard does not
>     > require that, and i'm aware of no non-C implementation that works that way.
> 
>     A non-C implementation of a C library function?
> 
> well, "assembler" if you must. my distinction being "regardless of architecture"
> (so not specifically arm64 or whatever).

Sounds like selection bias to me: no reason to implement an assembly version
that does the same thing the C version does.

(All the 1950's cars still on the road today are MUCH more durable than modern
cars, most of which wouldn't still run after 70 years.)

>     > (musl may have misled you here?

And uClibc, libc5, klibc, the kernel's nolibc, Keith Packard's picolibc...

> strictly BSD also has a memcmp.c that's
>     > byte-by-byte, but all the real architectures have assembler versions they use
>     > instead.)

I just checked and current-ish glibc is a horrific nest of #ifdefs in C with .S
alternative versions for a half-dozen architectures on top of that, yes. But gnu
was already like that when I first tried to read their code 30 years ago...

gnu/newlib has an #ifdef SMALL with the simple one, and an #else with the loop
over the long as a prefix to the loop over char... and I think that
implementation wouldn't break? It will only do the long loop on two aligned
pointers, and only while there are >=sizeof(long) bytes left, meaning it can run
off the end of the shorter string constant but can't run off the end of the
page (an aligned long never straddles a page boundary). So while it can fetch
garbage bytes past the end of the string within the same page, that info won't
affect the result, and it won't page fault.
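For concreteness, here is a minimal sketch of that shape of memcmp(): a long loop gated on alignment and remaining length, followed by the plain char loop. It is an illustration of the approach described above, not newlib's actual source; the name sketch_memcmp() is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical name: a sketch of the "long loop, then char loop" approach,
// NOT a copy of newlib's implementation.
int sketch_memcmp(const void *a, const void *b, size_t len)
{
  const unsigned char *s1 = a, *s2 = b;

  // Long loop: only taken when BOTH pointers are long-aligned, and only
  // while at least sizeof(long) bytes remain.
  if (!(((uintptr_t)s1 | (uintptr_t)s2) & (sizeof(long)-1))) {
    while (len >= sizeof(long)) {
      unsigned long w1, w2;

      // memcpy() sidesteps aliasing problems; compilers generally turn
      // each call into a single aligned load.
      memcpy(&w1, s1, sizeof(long));
      memcpy(&w2, s2, sizeof(long));

      // A mismatch is somewhere in this word: let the char loop find it.
      if (w1 != w2) break;

      s1 += sizeof(long);
      s2 += sizeof(long);
      len -= sizeof(long);
    }
  }

  // Char loop: handles the tail, unaligned input, and a mismatching word.
  while (len--) {
    if (*s1 != *s2) return *s1 - *s2;
    s1++;
    s2++;
  }

  return 0;
}
```

The result always comes from the char loop, which stops at the first differing byte, so whatever else was sitting in the word that got fetched never influences the return value.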

It'd still false positive hwasan, of course. Like my ls code did way back when...

This was the rough mental model I had of the "optimization" all along, by the
way. It CAN be done without breaking the semantics. The question is whether the
constant time check up front and the extra cache line pollution for code you
jump over are a net negative in real world use. It's PROBABLY a wash? I suspect
your real limiting factor on all this performance is cache line fetches anyway:
what the CPU does is mostly "wait for DRAM fetch" when handling nontrivial string
anything. Hence the aggressive prefetching and caching leaking security state
until "do not run security critical code on the same physical CPU as sandboxed
anything" gives us the nightmare that is TPM. Putting "trusted" on a chip is
like putting "unsinkable" on a ship.

>     I agree that xmemcmp() is not the ideal name. The x prefix means "exits", and
>     this doesn't.
> 
>     memscmp() maybe? (memstrcmp?)
> 
> safememcmp()?

Nope. Not calling it unsinkablememcmp() either. (I went with smemcmp(), you can
decide for yourself what the s means.)

>     > for arm64, the SVE memcmp() will load as many bytes as your vector size :-)
> 
>     Which is not optimizing for the common case, but ok...
> 
> as a libc maintainer, "don't get me started". the number of times i've had
> optimized memory/string routines that are improvements for the very large cases
> that mostly only happen in microbenchmarks while regressing the more common
> short copies/compares... (though given the arm64 SVE context, i should say that
> i think "arm ltd" themselves might be the sole exception that's never wasted my
> time with such a thing.)

Perceived improvement vs actual improvement.

If you include "string.h" to get memcmp() but can't give it a string as a known
not-matching argument, I personally think somebody missed part of their mission
briefing. The "optimization" has very obvious side effects.
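To make the "string as a known not-matching argument" case concrete, here is a sketch of the simple byte-by-byte version. The name bytewise_memcmp() is hypothetical and this is not claimed to be the code that went into toybox. Because it stops at the first differing byte, comparing a short NUL-terminated string against longer data never reads past the short string's terminator, as long as the two differ at or before that terminator.

```c
#include <stddef.h>

// Hypothetical name: the plain byte-by-byte compare, shown as a sketch.
int bytewise_memcmp(const void *a, const void *b, size_t len)
{
  const unsigned char *s1 = a, *s2 = b;

  // Stop at the first difference: nothing after it is ever read.
  for (; len; len--, s1++, s2++)
    if (*s1 != *s2) return *s1 - *s2;

  return 0;
}
```

With this version, something like bytewise_memcmp("ab", "abcdef", 6) reads three bytes from "ab" (two letters and the NUL) and returns at the mismatch, even though len claims six.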

>     Further increasing complexity to mitigate the fallout from a previous
>     unnecessary optimization is not my preferred approach, I tend to rip OUT stuff
>     with sharp edges and little to no benefit. But to each their own...
> 
> this kind of thing is what lets you do things like adding fake cat ears to your
> head live when you're recording stupid videos to clog the intertubes with.

I very vaguely recall meeting the people making reactive cat ears at the first
hot chips I attended in tokyo back in... 2015? (There was a pandemic, who
knows.) For a definition of "met" that was "saw model wearing cool thing, read
the english side of a glossy brochure, everybody at the booth only spoke
japanese", but still. If that's the one, it was a tiny microcontroller. Battery
powered. Not a bandaid-on-bandaid-on-bandaid situation.

The bolt-more-on approach piles up Pentium 4 and Itanium and eventually gets its
legs cut out from under it by a not-doing-that. It can go quite a ways first, of
course, but Cortex-M is not "more than armv8", it's a subset.

"Perfection is achieved not when there is nothing left to add, but nothing left
to take away." - Antoine de Saint-Exupéry. Except he said it in french.

> oddly to you and me, that's an in-demand use case for "real people"...

My grumbling about perceived improvement vs actual improvement is because I
question and requestion my approach a lot, and a common fallback is "small and
simple examples that work are seldom actually useless". But that's not the world
you're in. :)

Rob