[Toybox] Toybox test image / fuzzing

Rob Landley rob at landley.net
Sun Mar 13 22:52:39 PDT 2016


On 03/13/2016 05:04 PM, Samuel Holland wrote:
> On 03/13/2016 03:32 PM, Rob Landley wrote:
>>> but having the return code in the output string can be confusing
>>> to the human looking at the test case. Plus, why would you not want
>>> to verify the exit code for every test?
>>
>> Because science is about reducing variables and isolating to test
>> specific things?
> 
> If you want to reduce variables, see the suggestion about unit testing.

I want any complexity to justify itself. Complexity is a cost and I
want to get a reasonable return for it.

That said, what specifically was the suggestion about unit testing? "We
should have some"? We should export a second C interface to something
that isn't a shell command, for the purpose of telling us... what,
exactly?

>> Because "we can so clearly we should" is the gnu approach to things?
>> Because you're arguing for adding complexity to the test suite to do
>> things it can already do, and in many cases is already doing?
> 
> find tests -name *.test -print0 | xargs -0 grep 'echo.*yes' | wc -l
> 182
> 
> Considering how many times this pattern is already used, I don't see it
> adding much complexity. It's trading an ad hoc pattern used in ~17% of
> the tests for something more consistent and well-defined.

Because 17% of the tests use it, 100% of the tests should get an extra
argument?

> Never mind the
> fact that using &&/|| doesn't tell you _what_ the return code was, only
> a binary success or failure.

In most cases I don't CARE what the return code was.

The specification of
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/false.html
says it returns "a non-zero error code". It never specifies what that
error code should _be_. If I return 3 I'm conformant. If the test suite
FAILS because it was expecting 1 and got 2, it's a bad test.
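When the return code is what I'm testing, the existing idiom covers it
without pinning down a specific number, something like this (the label
here is made up, but the shape matches what's already in tests/*.test):

  # any nonzero exit from false passes, we don't care which one
  testing "false returns nonzero" "false || echo yes" "yes\n" "" ""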

> I have seen a couple of tests that pass because they expect failure, but
> the command is failing for the wrong reason.

Point them out please?

>> Because we've already got 5 required arguments for each test and
>> you're proposing adding a couple more, and clearly that's going to
>> make the test suite easier to maintain and encourage additional
>> contributions?
> 
> Personally, yes, I think so.

I don't.
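For the record, working from memory of the comment block at the top of
scripts/runtest.sh, the current five arguments are roughly:

  testing "test name" "command" "expected stdout" "input file contents" "stdin"

and the proposal bolts an expected exit code (and possibly expected
stderr) onto that.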

> If everything is explicit, there is less
> "hmmm, how do I test that? I guess I could throw something together
> using the shell. Now I have to sift through the existing tests to see if
> there is precedent, and what that is."

The test is a shell script with some convenience functions. It's always
been a shell script with some convenience functions, because it's
testing commands most commonly called from the shell.

> Instead, you put the inputs here,
> the outputs there, and you're done.

You want the test to not obviously be a shell script.

One of the failure cases I've seen in contributed tests is that they're
testing what toybox does, not what the command is expected to do. The
command could produce different but still conformant behavior, and the
test would break.

> I admit that filesystem operations are a whole new can of worms, and I
> do not have a good answer to those.

I could list testing cans of worms here for a long time, but I've been
listing cans of worms all day and am tired. The more rigid your
infrastructure is, the narrower the range of situations it copes with.

>> I'm not saying it's _not_ a good extension, but you did ask.
>> Complexity is a cost, spend it wisely. You're saying each test
>> should test more things, which means potential false positives and
>> more investigation about why a test failed (and/or more complex
>> reporting format).
> 
> On the other hand, splitting up the outputs you check gives you _more_
> information to help you investigate the problem. Instead of
> "FAIL: foobar" you get something like "FAIL (stdout mismatch): foobar".

Right now every failure is a stdout mismatch, since that's the only
thing it's checking, and you can go

VERBOSE=fail make test_ls

And have it not only stop at the first failure, but show you the diff
between actual and expected, plus show you the command line it ran.

Your solution is to load more information into each test and write more
infrastructure to report which part failed. I.E. make everything bigger
and more complicated without actually adding new capabilities.

> (As a side note, the test harness I've written recently even gives you a
> diff of the expected and actual outputs when the test fails.)

So does this one: VERBOSE=1 shows the diff for all of them, VERBOSE=fail
stops after the first failure. It's not the DEFAULT output because it's
chatty.
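
Against the example above that's:

  VERBOSE=1 make test_ls      # show the diff for every failing test
  VERBOSE=fail make test_ls   # stop at the first failure and show its diff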

Type "make help" and look at the "test" target. I think it's in some of
the web documentation too, and it's also in the big comment block at the
start of scripts/runtest.sh.

>> Also, "the return code" implies none of the tests are pipelines, or
>> multi-stage "do thing && examine thing" (which _already_ fails if do
>>  thing returned failure, and with the error_msg() stuff would have
>> said why to stderr already). Yesterday I was poking at mv tests
>> which have a lot of "mv one two && [ -e two ] && [ ! -e one ] && echo
>> yes" sort of constructs. What is "the exit code" from that?
> 
> Well, if we are testing mv, then the exit code is the exit code of mv.


Not in the above test it isn't. "mv" isn't necessarily the first thing
we run, or the last thing we run, in a given pipeline.
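
In the current format that whole construct goes in the "command"
argument, something like (label made up, setup simplified):

  testing "mv file" "touch one && mv one two && [ -e two ] && [ ! -e one ] && echo yes" "yes\n" "" ""

so the only exit code the harness could mechanically collect is whatever
the last stage of that chain returned.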

We have a test for "xargs", which is difficult to run _not_ in a
pipeline. When you test "nice" or "chroot" or "time", the command has an
exit code and its child could have an exit code. It's NOT THAT SIMPLE.

> The rest is just checking filesystem state after the fact. I claimed to
> not know what to do about it, but in the interest of avoiding punting
> the question, here's the answer off the top of my head (even though I
> know you aren't going to like it much):
> 
> Add an argument to the testing command that contains a predicate to eval
> after running the test. If the predicate returns true, the test
> succeeded; if the predicate returns false, the test failed. That way,
> the only command that is ever in the "command" argument is the toy to test.

If you would like to write a completely different test suite from the
one I've done, feel free. I'm not stopping you.

>>> It's a lot of duplication to write "echo $?" in all of the test
>>> cases.
>>
>> I don't. Sometimes I go "blah && yes" or "blah || yes", when the
>> return code is specifically what I'm testing, and sometimes checking
>> the return code and checking the output are two separate tests.
> 
> Okay, so it's a lot of duplication to write "&& yes" all over the place. :)

"&& echo yes", and no it isn't really. Compared to having an extra
argument in the other 3/4 of the tests that don't currently care about
the exit code, plus an exception mechanism for "we don't actually care
what this exit code is, just that it's nonzero"...

>> Keep in mind that error_msg() and friends produce output, and the
>> tests don't catch stderr by default but pass it through. If we catch
>> stderr by default and a test DOESN'T check it, then it's ignored
>> instead of visible to the caller.
> 
> I'm not sure how you could _not_ check stderr. The test case has a
> string, the command generates a string, you compare the strings.

By default the test intercepts stdout while stderr goes to the
terminal. The shell doesn't care what gets produced on stderr either, as
long as the resulting exit code is 0.

> If you want to pass it through, nothing prevents that.

I don't understand what you're saying here. I already pointed out you
can redirect and intercept it and make it part of your test. That said,
perror_msg appends a translated error string so exact matches on English
will fail in other locales. Plus kernel version changes have been known
to change what errno a given syscall failure returns. Heck, different
filesystem types sometimes do that too. (Reiserfs was notorious for that.)
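
A test that does care can fold stderr in or throw it away itself,
something like (made-up label, not a paste from an existing test):

  # discard the locale-dependent message text, just check that it failed
  testing "rm missing file" "rm nosuchfile 2>/dev/null || echo yes" "yes\n" "" ""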

>> Also, keep in mind I want the host version to pass most of these
>> tests too, and if there are gratuitous differences in behavior I
>> don't WANT the test to fail based on something I don't care about
>> and wasn't trying to test. You're arguing for a tighter sieve with
>> smaller holes when I've already received a bunch of failing tests
>> that were written against gnu and never _tried_ against the toybox
>> version, and failed for reasons that aren't real failures.
> 
> If you want to do that, then yes, you definitely need a looser sieve. I
> get 60 more failures here with TEST_HOST than I do with allyesconfig. I
> agree that checking stderr on toybox vs. busybox or GNU is going to be
> impossible because of differing error messages.

I've annotated a few tests with 'expected to fail with non-toybox';
grep for SKIP_HOST=1 in tests/*.test.
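
From memory the annotation is just a variable in front of the testing
call, along the lines of (made-up example, scripts/runtest.sh has the
actual mechanics):

  # only expected to pass against the toybox version
  SKIP_HOST=1 testing "toybox-only behavior" "true && echo yes" "yes\n" "" ""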

> Possible solutions
> include not checking stderr by default (only formalizing the exit code
> check), or simply not checking stderr when TEST_HOST=1.

I.E. add another special case context with different behavior.

>>> As for stdout/stderr, it helps make sure diagnostic messages are
>>> going to the right stream when not using the helper functions.
>>
>> Right now diagnostic messages are visible in the output when running
>> the test. There shouldn't be any by default, when there is it's
>> pretty obvious because those lines aren't colored.
>>
>> I'm all for improving the test suite, but "what I think the test
>> suite should be trying to do differs from what you think the test
>> suite should be trying to do, therefore I am right" is missing some
>> steps.
> 
> Then I guess it's not exactly clear to me what you are trying to do with
> the test suite.

I had a list of 3 reasons in a previous email.

I run a bunch of tests manually when developing a command to determine
whether or not I'm happy with the command's behavior. My rule of thumb
is if I run a test command line during development, I should have an
equivalent test in the test suite for regression testing purposes. (I
had to run this test to check the behavior of the command, therefore it
is a necessary test and should be in the regression test suite. Usually
I cut and paste the command lines I ran to a file, along with the
output, and throw it on the todo list. Running "find" or "sed" tests
against the toybox source isn't stable for reasons listed earlier, and
neither is grep -ABC against the README with all the weird possible --
placements, so translating the tests into the format I need isn't
always obvious.)

Then I want to do a second pass reading the specs closely (posix, man
page, whatever the spec I'm implementing from is) and do tests checking
every constraint in the spec. (That's one of my "run up to the 1.0
release" todo items.)

This is why so much of the test suite is still on the todo list. There's
a lot of work in doing it _right_...

> My interpretation of the purpose was to verify
> correctness (mathematically, string transformations, etc.) and standards
> compliance.

Alas, as I know from implementing a lot of this stuff, determining what
"correctness" _means_ is often non-obvious and completely undocumented.

> In that sense, for each set of inputs to a command
> (arguments, stdin, filesystem state), there is exactly one set of
> correct outputs (exit code, stdout/stderr, filesystem state)

This isn't how reality works.

See "false" above. 3 is an acceptable return code, and that's a TRIVIAL
case. The most interesting things to test are the error paths, and the
error output is almost never rigidly specified, and in the toybox case
perror_exit output is partly translated. And then you get into "toybox,
busybox, and ubuntu produce different output but I want at least _most_
of the tests to pass in common"...

> , and
> therefore the goal of the test suite is to compare the actual output to
> the correct output and ensure they match. If you don't check the exit
> code, you are missing part of the output.

Remember the difference between android and toybox uptime output? Or
how about pmap: what should its output show? The only nonzero return
code base64 can produce is failure to write to stdout, but I recently
added tests to check that === was being wrapped by -w properly (because
previously it wasn't). Is error return code the defining characteristic
of an nbd-client test? (How _do_ you test that?)
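
The -w check is conceptually along these lines (not a paste from
tests/base64.test, just the shape of it):

  # "hello" encodes to aGVsbG8= and -w 4 should wrap that at 4 columns
  testing "base64 -w 4" "echo -n hello | base64 -w 4" "aGVs\nbG8=\n" "" ""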

Here is a cut and paste of the _entire_ man page of setsid:

SETSID(1)              User Commands                       SETSID(1)

NAME
       setsid - run a program in a new session

SYNOPSIS
       setsid program [arg...]

DESCRIPTION
       setsid runs a program in a new session.

SEE ALSO
       setsid(2)

AUTHOR
       Rick Sladkey <jrs at world.std.com>

AVAILABILITY
       The  setsid  command  is part of the util-linux package and is
       available from ftp://ftp.kernel.org/pub/linux/utils/util-linux/.

util-linux             November 1993                       SETSID(1)

Now tell me: what error return codes should it produce, and under what
circumstances? Are the error codes the man page doesn't bother to
mention an important part of testing this command, or is figuring out
how to distinguish a session leader (possibly with some sort of pty
wrapper plumbing to signal it through) more important to testing this
command?
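
Off the top of my head, distinguishing a session leader means comparing
the child's session id against its own pid, something like this sketch
(Linux-specific, reads the session field out of /proc/$$/stat, and is
NOT a real test from tests/setsid.test):

  # a session leader's session id equals its pid
  setsid sh -c '[ "$(awk "{print \$6}" /proc/$$/stat)" = "$$" ] && echo yes'

And you'd still want some sort of pty plumbing on top of that to check
the controlling terminal side of it.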

> I won't tell you that you have to do it any one way. It's your project.
> Of course, complexity is to some extent a value judgment. If you think
> it adds too much complexity/strictness for your taste, that's fine. I
> was just trying to explain the reasoning behind the suggestion, and why
> I think it's a reasonable suggestion.

I'd like to figure out how to test the commands we've got so that if
they break in a way we care about, the test suite tells us rather than
us having to find it out. I don't care if false returns 3 and nothing
will ever notice. I _do_ care that the perl build broke because sed
wasn't doing a crazy thing I didn't know it had to do, which is why
commit 32b3587af261 added a test. If somebody has to implement a new sed
in future, that test shows them a thing it needs to do to handle that
crazy situation. (Unless perl gets fixed, which seems unlikely. But if
so, git annotate on the test suite shows why the test was added,
assuming the comment itself before the test isn't enough.)

Part of what the test suite does is make me re-think through what the
correct behavior _is_ in various corner cases, and I'm not sure setsid's
current behavior is remotely correct. I always meant to revisit it when
doing the shell...

>> Rob
> 
> P.S. Your other reply came in just as I had finished typing. Sorry if
> some of this is already addressed.

It's fine. Figuring out the right thing to do is often hard.

Rob
