[Toybox] FYI musl's support horizon.

Rob Landley rob at landley.net
Fri Aug 27 06:38:46 PDT 2021


On 8/26/21 5:56 PM, enh wrote:
>     I keep telling people I could spend a focused year on JUST the test suite and
>     they don't believe me. When people talk about function testing vs regression
>     testing vs coverage testing I get confused because it's all the same thing? 
> 
> 
> i'll include the main failure modes of each, to preempt any "yes, but"s by
> admitting that _of course_ you can write fortran in any language, but the idea
> is something like:
> 
> integration testing - answers "does my product work for real use cases?". you
> definitely want this, for obvious reasons, and since your existing testing is
> integration tests, i'll say no more. other than that the failure mode here is
> relying only on integration tests and spending a lot more time/effort debugging
> failures than you would if you could have caught the same issue with a unit test.

I'm relying on the fact I wrote almost all the code myself, and thoroughly
reviewed the rest, to be able to mentally model what everything is doing.

That said, I'm trying to get the bus number up so you don't NEED me to do this
sort of thing...

> unit testing - reduces the amount of digging you have to do _when_ your
> integration tests fail. (also makes it easier to asan/tsan or whatever, though
> this is much more of a problem on large systems than it is for something like
> toybox, where everything's small and fast anyway, versus "30mins in to
> transcoding this video, we crash" kinds of problem.) for something like toybox
> you'd probably be more interested in the ability to mock stuff out --- your "one
> day i'll have qemu with a known set of processes" idea,

It's kinda hard to test things like ps/ifconfig/insmod outside of a carefully
controlled known environment.

> but done by swapping
> function pointers. one nice thing about unit tests is that they're very easily
> parallelized. on a Xeon desktop i can run all several thousand bionic unit tests
> in less than 2s... whereas obviously "boot a device" (more on the integration
> test side) takes a lot longer. the main failure mode here (after "writing good
> tests is at least as hard as writing good code", which i'm pretty sure you
> already agree with, and might even be one of your _objections_ to unit tests),

Eh, the toys/example/demo_$THINGY commands are sort of intended to do this kind
of thing for chunks of shared infrastructure (library code, etc).

My objection here is really granularity: if you test at TOO detailed a level
you're just saying "this code can't change". I recently changed xabspath() to
have a flag-based interface, changed its existing users in the commands, and
have the start of a toys/example/demo_abspath.c (which I mentioned in my blog I
was too exhausted to properly finish at the time). Granular tests directly
calling the functions would have been invalidated by the change, meaning I'd
either have deleted them or rewritten them.
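
For the record, the demo command approach looks something like this (a
hand-waved sketch, NOT the actual toys/example/demo_abspath.c, and I'm passing
0 for the new flag argument rather than committing to specific flag names):

  /* demo_abspath.c - poke lib/lib.c's xabspath() through a real command.
   *
   * Hand-waved sketch, not the real file.

  USE_DEMO_ABSPATH(NEWTOY(demo_abspath, "<1>1", TOYFLAG_USR|TOYFLAG_BIN))

  config DEMO_ABSPATH
    bool "demo_abspath"
    default n
    help
      usage: demo_abspath PATH

      Resolve PATH with xabspath() and print the result.
  */

  #define FOR_demo_abspath
  #include "toys.h"

  void demo_abspath_main(void)
  {
    // This runs through the same shared entry path (setup, option parsing,
    // error handling) as every other command, so the test suite exercises
    // the plumbing production uses instead of a one-off main().
    char *s = xabspath(*toys.optargs, 0);

    if (!s) perror_exit("%s", *toys.optargs);
    xputs(s);
    free(s);
  }

The point being the test path and the production path are the same path: the
test suite feeds it command lines, same as a user would.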

With other people's test suites I often encounter test failures that don't MEAN
anything. Some test is failing because the semantics of something somewhere
changed, and none of the users care, and the test suite accumulates "known
failures" like code emitting known warnings.

A libc has an API with a lot of stable documented entry points. Toybox's entry
points are almost entirely command line utilities with a shared entry codepath
(including option parsing) and a shared library of common functions. I don't
want to test the lib/*.c code directly from something that ISN'T a command (some
other main() in its own .c file, possibly accessing it via dlopen() or
something) because the top level main.c initializes toy_list[] and has
toy_init() and toy_find() and toy_exec() and so on. If I factored that out I'd
_only_ be doing so for the test suite, not because it made design sense. I don't
want to duplicate plumbing and test in a different environment than I'm running in.
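
(Grossly oversimplified, the shape of that shared plumbing is something like
the sketch below. Illustrative only: the real structures and functions live in
main.c and toys.h and do a lot more.)

  // Grossly simplified sketch of the shared dispatch; the real toy_list[],
  // toy_find(), toy_init() etc. live in main.c and toys.h.
  #include <stdio.h>
  #include <string.h>

  struct toy_list {
    char *name;              // command name
    void (*toy_main)(void);  // command entry point
    char *options;           // option string for the shared parser
  };

  static void demo_main(void) { puts("hello from demo"); }

  static struct toy_list toy_list[] = {
    {"demo", demo_main, ""},
    // ...one entry per enabled command, generated at build time...
  };

  // One shared path: look the command up, set up its context, run it.
  // Testing lib/*.c through a command means going through THIS plumbing,
  // the same plumbing every real invocation goes through.
  static void toy_exec(char *name)
  {
    unsigned i;

    for (i = 0; i < sizeof(toy_list)/sizeof(*toy_list); i++)
      if (!strcmp(toy_list[i].name, name)) {
        // the real code does per-command setup and option parsing here
        toy_list[i].toy_main();

        return;
      }
    fprintf(stderr, "unknown command %s\n", name);
  }

  int main(int argc, char *argv[])
  {
    toy_exec(argc>1 ? argv[1] : "demo");

    return 0;
  }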

The mkroot images are "tiny but valid": each is a theoretically real system you
could build up from, and together they tell me "how does this behave under musl
on a bunch of targets", using a real Linux kernel and so on.

> is writing over-specific unit tests. rather than writing tests to cover "what
> _must_ this do to be correct?" people cover "what does this specific
> implementation happen to do right now, including accidental implementation
> details?".

Yup. Seen a lot of that. :(

> (i've personally removed thousands of lines of misguided tests that
> checked things like "if i pass _two_ invalid parameters to this function, which
> one does it report the error about?", where the correct answer is either "both"
> or "who cares?", but never "one specific one".)

I've bumped into some of that in toysh because I want to match bash's behavior
and alas bash is one of those "the implementation is currently the standard"
things where every implementation detail hiccup IS the current spec.

That said, I've blogged about deliberately diverging in a few places anyway
just because my plumbing doesn't work like bash's does (they're gratuitously
making multiple passes over the data and I'm doing it all in one pass, and
there are some places where "all x happens before all y" bubbles visibly to
the surface and I just went no). For example, the "order of operations" issue in
https://landley.net/notes-2021.html#18-03-2021

> coverage - tells you where your arse is hanging out the window _before_ your
> users notice. (i've had personal experiences of tests i've written and that two
> other googlers have code reviewed that -- when i finally got the coverage data
> -- turned out to be missing important stuff that [i thought] i'd explicitly
> written tests for.

This is what I mean by testing the error paths. If I have a statement the code
flow doesn't ever go through in testing, I'd like to know why. There are
presumably tools for this (I think valgrind has something), but that's waaaaaay
down the road.

> Android's still working on "real time" coverage data showing
> up in code reviews, but "real Google" has been there for years, and you'd be
> surprised how many times your tests don't test what you thought they did.)

Sadly, I would not be surprised. :(

> the main failure mode i've seen here is that you have to coach people that "90% is
> great", and that very often chasing the last few percent is not a good use of
> time,

https://en.wikipedia.org/wiki/Pareto_principle

And here is an excellent walkthrough of the math behind it:

  https://www.youtube.com/watch?v=sPQViNNOAkw#t=6m43s

Which is why the old saying "the first 90% of the work takes 90% of the time,
the remaining 10% of the work takes the other 90% of the time" is ALMOST right:
really the next 9% takes another 90% in a Zeno's paradox manner (addressing 90%
of what's left takes a constant amount of time), until you shoot the engineers
and go into production.
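
Back of the envelope (assuming each "constant amount of time" chunk knocks out
90% of whatever work remains):

  #include <stdio.h>

  // Each chunk of time finishes 90% of whatever work remains, so completion
  // creeps toward 100% while total time keeps growing by a full chunk per line.
  int main(void)
  {
    double remaining = 1.0;
    int chunk;

    for (chunk = 1; chunk <= 5; chunk++) {
      remaining *= 0.1;
      printf("chunk %d: %.3f%% of the work done\n", chunk, 100*(1-remaining));
    }

    return 0;
  }

Which prints 90%, 99%, 99.9%... while every additional line costs another full
chunk of schedule.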

> and in the extreme can make code worse. ("design for testability" is good,
> but -- like all things -- you can take it too far.)

My grumble is I'm trying to write a lot of tests that toybox and the debian host
utilities can BOTH pass. I want to test the same code linked against glibc,
musl, and bionic. I want to test it on big endian and little endian, 32 bit and
64 bit, systems that throw unaligned access faults, nommu...

>     You
>     have to test every decision point (including the error paths), you have to
>     exercise every codepath (or why have that codepath?) and you have to KEEP doing
>     it because every distro upgrade is going to break something.
> 
> yeah, which is why you want all this stuff running in CI, on all the platforms
> you care about.

People use "continuous integration" as an excuse not to have releases. No two
people should ever run quite the same version and see quite the same behavior;
we're sure the random git snapshot du jour is fine...

I object on principle.

>     In my private emails somebody is trying to make the last aboriginal linux
>     release work and the old busybox isn't building anymore because makedev() used
>     to be in #include <sys/types.h> and now it's moved to <sys/sysmacros.h>. (Why? I
>     dunno. Third base.) 
> 
> the pain of dealing with that pointless deckchair crap with every glibc update
> is one reason why (a) i've vowed never to do that kind of thing again in bionic
> [we were guilty of the same crime in the past, even me personally; the most
> common example being transitive includes] and (b) i'm hoping musl will care a
> bit more about not breaking source compatibility ... but realize he's a bit
> screwed because code expecting glibc might come to rely on the assumption that
> <sys/types.h> *doesn't* contain makedev(), say --- i've had to deal with that
> kind of mess myself too. sometimes you can't win.

He has a very active mailing list and IRC channel (now on libera.chat like
everybody else) where they argue about that sort of thing ALL THE TIME. (That
said, I poked him to see if he wants to make a policy statement about this. Or
has one somewhere already.)

My complaint way back when was objecting to the need to #define
GNU_GNU_ALL_HAIL_STALLMAN in order to get the definitions for Linux syscall
wrappers (which have NOTHING to do with the gnu project). I made puppy eyes at
Rich until he added the _ALL_SOURCE define so musl headers could just give me
everything they knew how to do without micromanaging feature macros. (I'm
already #including some headers and not including others, that's the granularity
that makes SENSE...)

And then I wound up doing:

  #define unshare(flags) syscall(SYS_unshare, flags)
  #define setns(fd, nstype) syscall(SYS_setns, fd, nstype)
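  // (syscall() comes from <unistd.h>, the SYS_* numbers from <sys/syscall.h>)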

anyway. :)

Rob
