[Toybox] Toybox test image / fuzzing

Samuel Holland samuel at sholland.org
Mon Mar 14 18:58:55 PDT 2016


Your previous email definitely clarified how you want the test suite to
work, thank you.

I tried to answer your questions while avoiding duplication. I realize
this thread is getting towards bikeshedding territory, so I've attempted
to focus on the more factual/neutral/useful parts.

On 03/14/2016 12:52 AM, Rob Landley wrote:
> On 03/13/2016 05:04 PM, Samuel Holland wrote:
>> On 03/13/2016 03:32 PM, Rob Landley wrote:
>>> Because science is about reducing variables and isolating to test
>>> specific things?
>>
>> If you want to reduce variables, see the suggestion about unit
>> testing.
>
> That said, what specifically was the suggestion about unit testing.
> "We should have some?" We should export a second C interface to
> something that isn't a shell command for the purpose of
> telling us... what, exactly?

I was referencing this:

On 03/13/2016 01:06 PM, enh wrote:
> only having integration tests is why it's so hard to test toybox ps
> and why it's going to be hard to fuzz the code: we're missing the
> boundaries that let us test individual pieces. it's one of the major
> problems with the toybox design/coding style. sure, it's something
> all the existing competition in this space gets wrong too, but it's
> the most obvious argument for the creation of the _next_ generation
> tool...

There is only so much variable-reduction you can do if you test the
whole program at once. If you want to, as you suggested, "test specific
things" like the command infrastructure thoroughly, those pieces have
to be tested apart from the limits of the commands they are used in.

On 03/13/2016 04:56 PM, Rob Landley wrote:
> If we need to test C functions in ways that aren't easily allowed by
> the users of those C functions, we can write a toys/example command
> that calls those C functions in the way we want to check.

I think we actually agree with each other here.




>> Considering how many times this pattern is already used, I don't
>> see it adding much complexity. It's trading an ad hoc pattern used
>> in ~17% of the tests for something more consistent and
>> well-defined.
>
> Because 17% of the tests use it, 100% of the tests should get an
> extra argument?

It's not adding any more features, just refactoring the existing
behavior behind a common function instead of repeating it throughout the
test suite. As for how to avoid adding complexity where it's not used,
I'll have to defer to Andy's idea:

On 03/13/2016 02:54 PM, Andy Chu wrote:
> Yes, that is exactly what I was getting at.  Instead of "testing",
> there could be another function "testing-errors" or something.  But
> it's not super important right now.
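
To make that concrete, here's a rough sketch of what such a wrapper
could look like (untested, with the wrapper name made up and the
testing() argument order written from memory, so it may not match
scripts/runtest.sh exactly). It just folds the exit code into the
compared output so the existing machinery does all the work:

  # hypothetical wrapper: append the exit status to the captured output
  # so it gets diffed along with everything else
  testing_err()
  {
    # $1=name $2=command $3=expected stdout $4=expected exit code
    # $5=infile $6=stdin (trailing arguments passed straight through)
    testing "$1" "$2; echo exit=\$?" "${3}exit=$4\n" "$5" "$6"
  }

  # usage: "mv" with a missing source prints nothing on stdout, exits 1
  testing_err "mv missing source" "mv nonexistent dest 2>/dev/null" \
    "" 1 "" ""

Tests that don't care about the exit code keep calling testing() and
pay nothing extra.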




>> I have seen a couple of tests that pass because they expect
>> failure, but the command is failing for the wrong reason.
>
> Point them out please?

I don't remember specifics at this point. I haven't looked at the test
suite in much detail (other than reading the mailing list) since the end
of 2014 or so when I was working on using it in a toy distro.

http://thread.gmane.org/gmane.linux.toybox/1709
https://github.com/smaeul/escapist/commits/master

If I remember correctly, one of them failed because it got a SIGSEGV,
but to a shell that's just false. The other one was not crashing, but
failing for a different reason than expected. If I had to guess, one of them
was cp, but that's because it's the one I spent the most time on. I'm
positive they are both fixed now.




> you can go
>
> VERBOSE=fail make test_ls
>
> And have it not only stop at the first failure, but show you the diff
> between actual and expected, plus show you the command line it ran.
>
> <snip>
>
>> (As a side note, the test harness I've written recently even gives
>> you a diff of the expected and actual outputs when the test
>> fails.)
>
> So does this one, VERBOSE=1 shows the diff for all of them,
> VERBOSE=fail stops after the first failure. It's not the DEFAULT
> output because it's chatty.
>
> Type "make help" and look at the "test" target. I think it's some of
> the web documentation too, and it's also in the big comment block at
> the start of scripts/runtest.sh.

Okay, to some extent, I actually like that approach better than mine. It
gives you an overview of how close you are to conformance (you can count
the failing tests, instead of quitting at the first failure), yet lets
you drill down when desired. Like I said, I haven't studied the test
infrastructure recently; I should go do that.




>>> Also, "the return code" implies none of the tests are pipelines,
>>> or multi-stage "do thing && examine thing" (which _already_ fails
>>> if do thing returned failure, and with the error_msg() stuff
>>> would have said why to stderr already). Yesterday I was poking at
>>> mv tests which have a lot of "mv one two && [ -e two ] && [ ! -e
>>> one ] && echo yes" sort of constructs. What is "the exit code"
>>> from that?
>>
>> Well, if we are testing mv, then the exit code is the exit code of
>> mv.
>
> Not in the above test it isn't. "mv" isn't necessarily the first
> thing we run, or the last thing we run, in a given pipeline.

Right. The whole point was that (in my ideal test suite) mv (or any
other program being tested) should never _be_ in a pipeline. That way
you don't have to even consider how the pipeline works in xyz shell.

> We have a test for "xargs", which is difficult to run _not_ in a
> pipeline.

Redirecting stdin from a file (which could be temporary) doesn't do
weird things with return values like a shell pipeline does.
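
For example (just a sketch, not an actual test from the suite):

  # feed xargs from a temporary file instead of a pipeline, so $? is
  # unambiguously the exit status of xargs itself
  printf 'one two three\n' > input.txt
  xargs echo < input.txt > actual.txt
  echo "xargs exited with $?"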

> When you test "nice" or "chroot" or "time", the command has an exit
> code and its child could have an exit code. It's NOT THAT SIMPLE.

"nice true", "xargs echo", "chroot . true", etc. I'm not sure how "true"
or "echo" would have any other exit code than 0. (If it does, your
shell/echo/true is majorly broken, and you might as well give up.)
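
Spelled out, that's the shape any such test would take:

  # with a child that can't plausibly fail, any nonzero exit status
  # belongs to the wrapper being tested
  nice true;                        echo "nice:  $?"
  xargs echo </dev/null >/dev/null; echo "xargs: $?"
  # chroot needs root, which makes it less convenient, but the idea is
  # the same:  chroot . true; echo "chroot: $?"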




>>> Keep in mind that error_msg() and friends produce output, and the
>>> tests don't catch stderr by default but pass it through. If we
>>> catch stderr by default and a test DOESN'T check it, then it's
> ignored instead of visible to the caller.
>>
>> I'm not sure how you could _not_ check stderr. The test case has a
>> string, the command generates a string, you compare the strings.
>
> By default it intercepts stdout and stderr goes to the terminal. The
> shell won't care what gets produced on stderr if the resulting exit
> code is then 0 either.
>
>> If you want to pass it through, nothing prevents that.
>
> I don't understand what you're saying here. I already pointed out you
> can redirect and intercept it and make it part of your test.

I should have been clearer: I was confused by why you were considering
"If we catch stderr by default and a test DOESN'T check it..." If stderr
is caught by the test infrastructure and the test doesn't specify
anything for it, it would be compared against the empty string. The test
would have to actively throw it away (2>/dev/null or something) for it
not to be checked.

I am aware you can pass stderr through to the terminal without checking
it, and that that's what the toybox test suite currently does. "If you
want to pass it through, nothing prevents that." was meant to point out
that, even if stderr was caught (for checking) by default, it could
_also_ be sent to the terminal if you wanted to.
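
Shape-wise, something like this (a sketch of the idea, not of what
runtest.sh currently does; "somecmd" is a stand-in for whatever is being
tested):

  # catch stderr so it can be compared like stdout; an empty expectation
  # means "the command should print nothing on stderr"...
  expected_err=""
  somecmd > stdout.txt 2> stderr.txt
  # ...while still echoing it back to the terminal if that's wanted
  cat stderr.txt >&2
  [ "$(cat stderr.txt)" = "$expected_err" ] || echo "FAIL: stderr"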

(I think a lot of my writing suffers from "it makes sense to me...".)

> That said, perror_msg appends a translated error string so exact
> matches on english will fail in other locales.

Set LC_MESSAGES=C in the test infrastructure? By this time, I've
realized that checking stderr for an expected value is often going to be
impossible...
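
For the record, the locale pinning I mean is just something like this
near the top of the test harness (assuming nothing else already forces
the locale):

  # keep perror_msg() strings in English regardless of the host locale
  export LANG=C LC_MESSAGES=C LC_ALL=C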

> Plus kernel version changes have been known to change what errno a
> given syscall failure returns. Heck, different filesystem types
> sometimes do that too. (Reiserfs was notorious for that.)

...and apparently errno isn't reliable either. I thought the kernel
didn't break userspace? I guess that contract doesn't include "why you
can't do that."

Okay, point taken. I wasn't aware that return codes were so loosely
specified. I was under the impression that programs would generally just
exit with the last errno (or 1 for some other error), and that errno
values were well-specified at the libc/kernel level.




>> and therefore the goal of the test suite is to compare the actual
>> output to the correct output and ensure they match. If you don't
>> check the exit code, you are missing part of the output.
>
> Remember the difference between android and toybox uptime output? Or
> how about pmap, what should its output show? The only nonzero return
> code base64 can do is failure to write to stdout, but I recently
> added tests to check that === was being wrapped by -w properly
> (because previously it wasn't). Is error return code the defining
> characteristic of an nbd-client test? (How _do_ you test that?)
>
> Here is a cut and paste of the _entire_ man page of setsid:
>
> SETSID(1)              User Commands                       SETSID(1)
>
> NAME setsid - run a program in a new session
>
> SYNOPSIS setsid program [arg...]
>
> DESCRIPTION setsid runs a program in a new session.
>
> SEE ALSO setsid(2)
>
> AUTHOR Rick Sladkey <jrs at world.std.com>
>
> AVAILABILITY The  setsid  command is part of the util-linux package
> and is available from
> ftp://ftp.kernel.org/pub/linux/utils/util-linux/.
>
> util-linux             November 1993                       SETSID(1)
>
> Now tell me: what error return codes should it produce, and under
> what circumstances? Are the error codes the man page doesn't bother
> to mention an important part of testing this command, or is figuring
> out how to distinguish a session leader (possibly with some sort of
> pty wrapper plumbing to signal it through) more important to testing
> this command?

Of course I don't claim that return codes are the most important thing,
by any means. I just think^Wthought they were a relatively low-overhead
thing to test _in_addition_ to the important stuff, one that might catch
some additional corner cases. As for "setsid", in my opinion, it should
return the errno from setsid() or exec*() if either fails. After it
execs, it doesn't really have a say.
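
Poking at that from a script is easy enough; it's the "right" numbers
that are unclear (hypothetical commands, and the session check assumes
a ps that understands -o sid=):

  # exec failure vs. success; what exact value the first one should
  # return is precisely the open question above
  setsid /no/such/program 2>/dev/null; echo "exec failure: $?"
  setsid true;                         echo "success: $?"

  # the property that actually matters: the child ends up a session
  # leader
  setsid sh -c \
    '[ "$(ps -o sid= -p $$ | tr -d " ")" = "$$" ] && echo "new session"'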

Amusingly, my setsid (from util-linux 2.26.2, which has two real
options!) manages to fail rather spectacularly:

setsid: child 8695 did not exit normally: Success
setsid: failed to execute htop: Invalid or incomplete multibyte or wide
character




> I'd like to figure out how to test the commands we've got so that if
> they break in a way we care about, the test suite tells us rather
> than us having to find it out. I don't care if false returns 3 and
> nothing will ever notice.

Hmmm, difference of viewpoint. I see the command line interface of
these programs as an API, just like any other. You mention their use in
shell scripts. It would be a regression to gratuitously change the
output of a command, even if it is still within the relevant standard.
The argument against that is that shell scripts should follow the
standard, not a specific implementation; but as you often bring up, the
standards are too underspecified to rely on by themselves. You mentioned
earlier:

> One of the failure cases I've seen in contributed tests is they're
> testing what toybox does, not what the command is expected to do.

and (as I see it) such is the difference between regression testing and
testing for conformance. And both are useful. I'm all for continuous
refactoring of internal logic, but externally-visible behavior makes
more promises. Of course, toybox isn't 1.0 yet, so users should expect
changes in behavior... I give up.

What would be nice is a POSIX test suite for commands and utilities...
Apparently there was one at some point (some website mentioned it), but
it's not listed on the downloads page:

http://www.opengroup.org/testing/downloads.html

Doing some URL fiddling got me here:

http://www.opengroup.org/testing/downloads/vsclite.html

It's gone. And of course you can't just use the new one: you have to be
trying to get certified to even download it.

> If you would like to write a completely different test suite from the
> one I've done, feel free. I'm not stopping you.

I will probably end up trying that, at least for POSIX, because a freely
available test suite is generally useful (and I'm young enough to enjoy
writing something for the educational value even if it gets all thrown
away). How? grep, regexec(), I'll figure something out.
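
To give a flavor of what I mean by that last bit (purely illustrative):
where the exact bytes of a command's output depend on the system, the
check can be a pattern match against the specified shape instead of a
byte-exact diff:

  # "ls -l" bytes vary (owners, sizes, dates), so check that the listing
  # has entries of the specified layout: a type character plus nine
  # permission characters at the start of each entry
  ls -l / | tail -n +2 | grep -qE '^[-bcdlps][-rwxsStT]{9}' \
    && echo "ls -l format looks sane" || echo "unexpected ls -l format"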

Again, thank you for setting me straight.

--
Regards,
Samuel Holland
