[Toybox] xmemcmp()

Fri Jan 6 18:09:56 PST 2023

On Fri, Jan 6, 2023 at 5:57 PM Rob Landley <rob at landley.net> wrote:

> Note to self: remember to hit "send".
>
> On 1/5/23 19:10, enh wrote:
> >     > even though one _could_ write a byte-by-byte memcmp(), the standard
> >     > does not require that, and i'm aware of no non-C implementation that
> >     > works that way.
> >
> >     A non-C implementation of a C library function?
> >
> > well, "assembler" if you must. my distinction being "regardless of
> > architecture" (so not specifically arm64 or whatever).
>
> Sounds like selection bias to me: no reason to implement an assembly version
> that does the same thing the C version does.
>
> (All the 1950's cars still on the road today are MUCH more durable than modern
> cars, most of which wouldn't still run after 70 years.)
>
> >     > (musl may have misled you here?
>
> And uClibc, libc5, klibc, the kernel's nolibc, Keith Packard's picolibc...
>
> >     > strictly BSD also has a memcmp.c that's
> >     > byte-by-byte, but all the real architectures have assembler versions
> >     > they use instead.)
>
> I just checked and current-ish glibc is a horrific nest of #ifdefs in C with
> .S alternative versions for a half-dozen architectures on top of that, yes.
> But gnu was already like that when I first tried to read their code 30 years
> ago...
>
> gnu/newlib has an #ifdef SMALL with the simple one, and an #else with the loop
> over the long as a prefix to the loop over char... and I think that
> implementation wouldn't break? It will only do the long loop on two aligned
> pointers, and only while there are >=sizeof(long) bytes left, meaning it can
> run off the end of the shorter string constant but can't run off the end of
> the page. So while it can fetch garbage bytes past the end of the string
> within the same page, that info won't affect the result nor will it page
> fault.
>
> It'd still false positive hwasan, of course. Like my ls code did way back
> when...
>
> This was the rough mental model I had of the "optimization" all along, by the
> way. It CAN be done without breaking the semantics. The question is whether
> the constant time check up front and the extra cache line pollution for code
> you jump over are a net negative in real world use. It's PROBABLY a wash? I
> suspect your real limiting factor on all this performance is cache line
> fetches anyway: what the CPU does is mostly "wait for DRAM fetch" when
> handling nontrivial string anything. Hence the aggressive prefetching and
> caching leaking security state until "do not run security critical code on
> the same physical CPU as sandboxed anything" gives us the nightmare that is
> TPM. Putting "trusted" on a chip is like putting "unsinkable" on a ship.
>
> >     I agree that xmemcmp() is not the ideal name. The x prefix means "exits",
> >     and this doesn't.
> >
> >     memscmp() maybe? (memstrcmp?)
> >
> > safememcmp()?
>
> Nope. Not calling it unsinkablememcmp() either. (I went with smemcmp(), you
> can decide for yourself what the s means.)
>
> >     > for arm64, the SVE memcmp() will load as many bytes as your vector
> >     > size :-)
> >
> >     Which is not optimizing for the common case, but ok...
> >
> > as a libc maintainer, "don't get me started". the number of times i've had
> > optimized memory/string routines that are improvements for the very large
> > cases that mostly only happen in microbenchmarks while regressing the more
> > common short copies/compares... (though given the arm64 SVE context, i
> > should say that i think "arm ltd" themselves might be the sole exception
> > that's never wasted my time with such a thing.)
>
> Perceived improvement vs actual improvement.
>
> If you include "string.h" to get memcmp() but can't give it a string as a
> known not-matching argument, I personally think somebody missed part of their
> mission briefing. The "optimization" has very obvious side effects.
>

at the risk of sounding like rich felker ... no, you're relying on
undefined behavior. the library function says it compares n bytes of both
regions. you lied to the library by claiming that the first n bytes of
_both_ regions are valid, when they're not. ironically, _you're_ assuming
an "optimization" that it won't look past the first non-matching byte, and
you're annoyed that implementations aimed at chips that work well with
larger quanta have chosen a different equally valid optimization instead.
neither is _wrong_, but they're incompatible, and your mental model is
assuming something that the specification doesn't guarantee you.

"The memcmp() function shall compare the first n bytes (each interpreted as
unsigned char) of the object pointed to by s1 to the first n bytes of the
object pointed to by s2."
https://pubs.opengroup.org/onlinepubs/9699919799/functions/memcmp.html

(and it is enough to matter to performance in practice, not just for
stupidly-large regions. bionic wouldn't do this otherwise, because i'd have
rejected the patches :-) )
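
for anyone following along, here's a minimal sketch of the byte-by-byte
behavior being assumed above: a compare that never reads past the first
mismatching byte, so a short string constant is safe as the known
not-matching argument. bytewise_memcmp() is a hypothetical name for
illustration only; it is not the toybox smemcmp() source, and it makes no
attempt to be fast.

```c
#include <stddef.h>

/* Hypothetical helper (illustration only, not toybox's smemcmp()):
 * compare up to n bytes as unsigned char, stopping at the first
 * difference so no byte after a mismatch is ever read. */
int bytewise_memcmp(const void *s1, const void *s2, size_t n)
{
  const unsigned char *a = s1, *b = s2;

  for (; n; n--, a++, b++)
    if (*a != *b) return *a < *b ? -1 : 1;

  return 0;
}
```

with that, something like bytewise_memcmp(buf, "short", len) only depends
on the bytes up to the first difference. handing the same arguments to the
standard memcmp() with len larger than the string constant is the case the
POSIX text above leaves undefined.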

> >     Further increasing complexity to mitigate the fallout from a previous
> >     unnecessary optimization is not my preferred approach, I tend to rip OUT
> >     stuff with sharp edges and little to no benefit. But to each their own...
> >
> > this kind of thing is what lets you do things like adding fake cat ears to
> > your head live when you're recording stupid videos to clog the intertubes
> > with.
>
> I very vaguely recall meeting the people making reactive cat ears at the first
> hot chips I attended in tokyo back in... 2015? (There was a pandemic, who
> knows.) For a definition of "met" that was "saw model wearing cool thing, read
> the english side of a glossy brochure, everybody at the booth only spoke
> japanese", but still. If that's the one, it was a tiny microcontroller.
> Battery powered. Not a bandaid-on-bandaid-on-bandaid situation.
>
> The bolt-more-on approach piles up Pentium 4 and Itanium and eventually gets
> its legs cut out from under it by a not-doing-that. It can go quite a ways
> first, of course, but Cortex-M is not "more than armv8", it's a subset.
>
> "Perfection is achieved not when there is nothing left to add, but nothing
> left to take away." - Antoine De Saint-Exupery. Except he said it in french.
>
> > oddly to you and me, that's an in-demand use case for "real people"...
>
> My grumbling about perceived improvement vs actual improvement is because I
> question and requestion my approach a lot, and a common fallback is "small and
> simple examples that work are seldom actually useless". But that's not the
> world you're in. :)
>
> Rob
>