<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jan 6, 2023 at 5:57 PM Rob Landley <<a href="mailto:rob@landley.net">rob@landley.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Note to self: remember to hit "send".<br>

<br>

On 1/5/23 19:10, enh wrote:<br>

>     > even though one _could_ write a byte-by-byte memcmp(), the standard does not<br>

>     > require that, and i'm aware of no non-C implementation that works that way.<br>

> <br>

>     A non-C implementation of a C library function?<br>

> <br>

> well, "assembler" if you must. my distinction being "regardless of architecture"<br>

> (so not specifically arm64 or whatever).<br>

<br>

Sounds like selection bias to me: no reason to implement an assembly version<br>

that does the same thing the C version does.<br>

<br>

(All the 1950's cars still on the road today are MUCH more durable than modern<br>

cars, most of which wouldn't still run after 70 years.)<br>

 <br>

>     > (musl may have misled you here?<br>

<br>

And uClibc, libc5, klibc, the kernel's nolibc, Keith Packard's picolibc...<br>

<br>

> strictly BSD also has a memcmp.c that's<br>

>     > byte-by-byte, but all the real architectures have assembler versions they use<br>

>     > instead.)<br>

<br>

I just checked and current-ish glibc is a horrific nest of #ifdefs in C with .S<br>

alternative versions for a half-dozen architectures on top of that, yes. But gnu<br>

was already like that when I first tried to read their code 30 years ago...<br>

<br>

gnu/newlib has an #ifdef SMALL with the simple one, and an #else with the loop<br>

over the long as a prefix to the loop over char... and I think that<br>

implementation wouldn't break? It will only do the long loop on two aligned<br>

pointers, and only while there are >=sizeof(long) bytes left, meaning it can run<br>

off the end of the shorter string constant but can't run off the end of the<br>

page. So while it can fetch garbage bytes past the end of the string within the<br>

same page, that info won't affect the result nor will it page fault.<br>
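<br>
For reference, a minimal sketch of that "loop over long, then loop over char" shape.<br>
This is not newlib's actual source: the name wordwise_memcmp() is hypothetical, and<br>
it assumes the caller really does own n readable bytes on both sides.<br>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; illustrates the shape described above, not newlib's code. */
int wordwise_memcmp(const void *s1, const void *s2, size_t n)
{
  const unsigned char *a = s1, *b = s2;

  /* Take the word loop only when both pointers are long-aligned. */
  if (!(((uintptr_t)a | (uintptr_t)b) % sizeof(long))) {
    /* ...and only while at least sizeof(long) bytes remain. */
    while (n >= sizeof(long)) {
      long x, y;

      /* memcpy() keeps the word loads free of aliasing problems. */
      memcpy(&x, a, sizeof(long));
      memcpy(&y, b, sizeof(long));
      if (x != y) break;  /* the byte loop below finds the differing byte */
      a += sizeof(long);
      b += sizeof(long);
      n -= sizeof(long);
    }
  }

  /* Byte loop: handles unaligned input, the tail, and the mismatching word. */
  while (n--) {
    if (*a != *b) return *a - *b;
    a++, b++;
  }

  return 0;
}
```

The result always comes from the byte loop, so any bytes the word loop loaded past<br>
the first difference never influence the return value.<br>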

<br>

It'd still false positive hwsan, of course. Like my ls code did way back when...<br>

<br>

This was the rough mental model I had of the "optimization" all along, by the<br>

way. It CAN be done without breaking the semantics. The question is whether the<br>

constant time check up front and the extra cache line pollution for code you<br>

jump over are a net negative in real world use. It's PROBABLY a wash? I suspect your<br>

real limiting factor on all this performance is cache line fetches anyway, what<br>

the CPU does is mostly "wait for DRAM fetch" when handling nontrivial string<br>

anything. Hence the aggressive prefetching and caching leaking security state<br>

until "do not run security critical code on the same physical CPU as sandboxed<br>

anything" gives us the nightmare that is TPM. Putting "trusted" on a chip is<br>

like putting "unsinkable" on a ship.<br>

<br>

>     I agree that xmemcmp() is not the ideal name. The x prefix means "exits", and<br>

>     this doesn't.<br>

> <br>

>     memscmp() maybe? (memstrcmp?)<br>

> <br>

> safememcmp()?<br>

<br>

Nope. Not calling it unsinkablememcmp() either. (I went with smemcmp(), you can<br>

decide for yourself what the s means.)<br>
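<br>
A minimal sketch of the byte-at-a-time comparison being discussed, assuming nothing<br>
beyond standard C. The name bytewise_memcmp() is hypothetical and this is not the<br>
actual smemcmp() source, just the general idea of stopping at the first difference.<br>

```c
#include <stddef.h>

/* Hypothetical name; a sketch of the idea, not the real smemcmp(). */
int bytewise_memcmp(const void *s1, const void *s2, size_t n)
{
  const unsigned char *a = s1, *b = s2;

  /* Compare one byte at a time and return at the first difference,
   * so no byte after the first mismatch is ever read. */
  while (n--) {
    if (*a != *b) return *a - *b;
    a++, b++;
  }

  return 0;
}
```

With a string constant as one argument, this reads no further than the first<br>
differing byte. It can still read past the terminating NUL if the other side<br>
matches the constant all the way through its NUL and n is larger than that.<br>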

<br>

>     > for arm64, the SVE memcmp() will load as many bytes as your vector size :-)<br>

> <br>

>     Which is not optimizing for the common case, but ok...<br>

> <br>

> as a libc maintainer, "don't get me started". the number of times i've had<br>

> optimized memory/string routines that are improvements for the very large cases<br>

> that mostly only happen in microbenchmarks while regressing the more common<br>

> short copies/compares... (though given the arm64 SVE context, i should say that<br>

> i think "arm ltd" themselves might be the sole exception that's never wasted my<br>

> time with such a thing.)<br>

<br>

Perceived improvement vs actual improvement.<br>

<br>

If you include "string.h" to get memcmp() but can't give it a string as a known<br>

not-matching argument, I personally think somebody missed part of their mission<br>

briefing. The "optimization" has very obvious side effects.<br></blockquote><div><br></div><div>at the risk of sounding like rich felker ... no, you're relying on undefined behavior. the library function says it compares n bytes of both regions. you lied to the library by claiming that the first n bytes of _both_ regions are valid, when they're not. ironically, _you're_ assuming an "optimization" that it won't look past the first non-matching byte, and you're annoyed that implementations aimed at chips that work well with larger quanta have chosen a different equally valid optimization instead. neither is _wrong_, but they're incompatible, and your mental model is assuming something that the specification doesn't guarantee you.</div><div><br></div><div>"The memcmp() function shall compare the first n bytes (each interpreted as unsigned char) of the object pointed to by s1 to the first n bytes of the object pointed to by s2."</div><div><a href="https://pubs.opengroup.org/onlinepubs/9699919799/functions/memcmp.html">https://pubs.opengroup.org/onlinepubs/9699919799/functions/memcmp.html</a></div><div><br></div><div>(and it is enough to matter to performance in practice, not just for stupidly-large regions. bionic wouldn't do this otherwise, because i'd have rejected the patches :-) )</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

>     Further increasing complexity to mitigate the fallout from a previous<br>

>     unnecessary optimization is not my preferred approach, I tend to rip OUT stuff<br>

>     with sharp edges and little to no benefit. But to each their own...<br>

> <br>

> this kind of thing is what lets you do things like adding fake cat ears to your<br>

> head live when you're recording stupid videos to clog the intertubes with.<br>

<br>

I very vaguely recall meeting the people making reactive cat ears at the first<br>

hot chips I attended in tokyo back in... 2015? (There was a pandemic, who<br>

knows.) For a definition of "met" that was "saw model wearing cool thing, read<br>

the english side of a glossy brochure, everybody at the booth only spoke<br>

japanese", but still. If that's the one, it was a tiny microcontroller. Battery<br>

powered. Not a bandaid-on-bandaid-on-bandaid situation.<br>

<br>

The bolt-more-on approach piles up Pentium 4 and Itanium and eventually gets its<br>

legs cut out from under it by a not-doing-that. It can go quite a ways first, of<br>

course, but Cortex-M is not "more than armv8", it's a subset.<br>

<br>

"Perfection is achieved not when there is nothing left to add, but nothing left<br>

to take away." - Antoine de Saint-Exupéry. Except he said it in French.<br>

<br>

> oddly to you and me, that's an in-demand use case for "real people"...<br>

<br>

My grumbling about perceived improvement vs actual improvement is because I<br>

question and requestion my approach a lot, and a common fallback is "small and<br>

simple examples that work are seldom actually useless". But that's not the world<br>

you're in. :)<br>

<br>

Rob<br>

</blockquote></div></div>